All of our events may be recorded and shared via the University of Sheffield Kaltura platform so those who cannot attend may still benefit. We will consider your attendance implict consent to this.
Scientific workflows (for example in bioinformatics) often require several different scripts or pieces of software to be run and their outputs fed into each other. Ideally, we’d like to be able to understand the provenance of the output, without having to re-run the entire workflow every time a minor change is made to code or data. This is often difficult to achieve with bash or Python scripts (for example), but workflow managers can help.
In this session we will learn more about the problems of controlling workflows / pipelines using scripts, and about what workflow managers are, and how they can help.
Scripts let us chain together several bits of software, often taking the output of one and feeding it into another and/or running the same software repeatedly with different inputs. This is often far better than entering commands into a prompt and then having to remember those commands and their sequence when doing the same job again.
However, glueing together the parts of a multi-step or multi-input workflow using scripts can be difficult: it’s all too easy to create scripts that aren’t portable between machines/systems, are brittle/unreliable, are difficult to read and test and aren’t modular or reusable.
For which cases are scripts a good way of running a workflow and for which might they be problematic?
Slides for “The trouble with scripts (Will Furnass)”
Nextflow is a relatively new workflow manager which allows for writing scalable computational pipelines. It is based on Java and written in written in ‘Groovy’ (a programming language which compiles to Java byte code), however it provides a multitude of tools and template scripts to allow for a quick and easy access to typical workflows, without the need to learn its programming language. Major advantages of Nextflow are:
Slides for Nextflow (Magda Dabrowska)
Ruffus is one of the oldest of the modern style workflow management systems. Ruffus pipelines consist of a series of python functions that are linked together using a python feature known as decorators. Ruffus is a lighter weight alternative to systems such as Nextflow, while still offering the ability to create rich dependency graphs of tasks and orchestrate their submission to an HPC. It embodies the philosophy that for users to choose to use a reproducible pipeline in the real world, when under pressure, using a pipeline must be as easy, or easier, than the manual alternative. In my talk I will demonstrate the creation of a pipeline for a bioinformatics task consisting of multiple non-trival steps within 15 minutes.
Common Workflow Language (CWL) is an open standard for describing how to run command line tools and connect them to create workflows.
The CWL standard is not specific to any language or technology, so enables interoperability between the many different workflow languages, enabling researchers to share workflows. It’s more abstract than any particular language implementation, aiming to support a super-set of the features of other languages. CWL allows portability across these systems because the logical/computational parts of a workflow are decoupled from the code that runs them.
Slides for Common Workflow Language (Joe Heffer)
We have a Google Jam Board where questions were fielded.
For queries relating to collaborating with the RSE team on projects: rse@sheffield.ac.uk
Information and access to JADE II and Bede.
Join our mailing list so as to be notified when we advertise talks and workshops by subscribing to this Google Group.
Queries regarding free research computing support/guidance should be raised via our Code clinic or directed to the University IT helpdesk.