Abstract: High performance compute facilities were originally envisaged to run single, monolithic programs that could run across multiple compute nodes. However many scientific analyses consist of a “pipeline” of smaller, interdependent tasks, with each step consisting of transforming, combining or summarizing data in some fashion or another. While each job will generally use only a small amount of resources (e.g. a single core, or a small number of cores on a single machine), a complete analysis will consist of running many of these smaller jobs for different input files(or subsets of input files) in an “embarrassingly parallel” manner. Orchestrating such an analysis and tracking the dependencies across multiple analysis steps manually can be a confusing, time consuming and error prone-task. Software tools that track dependencies in a pipeline are not new, with the oldest and most well known example being
make, but a new generation of tools, designed specially with data analysis pipelines in mind has appeared in the last few years. Such systems allow the coordination of the many steps of an analysis into a reproducible/reusable recipe or script, handle submission of jobs that can be run in parallel to compute clusters (or even cloud appliances) and restarting half-finished analyses and log metadata about pipeline runs. They can be used to generate production grade, reusable systems, or as computational lab book that eases record keeping and distribution of prototype analyses across clusters.
I will review the common systems, the advantages of using them, and give a demonstration of developing a simple pipeline using one example tool: the CGAT-core/ruffus system.
Bio: Ian Sudbery is a Lecturer in Bioinformatics. He began his career as a wet-lab scientist working in Genomics and System Biology (in Cambridge and Boston, MA respectively), before laying down his pipettes and taking up the keyboard and code. A graduate of the MRC CGAT program to retrain lab scientists in computation, he joined the University of Sheffield in 2014 where he heads a group of 6 PhD students and postdocs studying Gene Regulation and its break down in disease, mainly through the analysis of high-throughput DNA and RNA sequencing data.
For queries relating to collaborating with the RSE team on projects: email@example.com
Join our mailing list so as to be notified when we advertise talks and workshops by subscribing to this Google Group.