Running Serverless HPC Workloads on Top of Kubernetes and Jupyter Notebooks

Date
27 March 2019 - 13:00-14:00
Location
COM-G12-Main Lewin
Speaker
Dr Christopher Woods, University of Bristol

All of our events may be recorded and shared via the University of Sheffield Kaltura platform so that those who cannot attend may still benefit. We will consider your attendance implicit consent to this.

Abstract: The cloud holds the promise of a new way to perform digital science: interactive, elastically scaling, built on open data and open compute, and sharing reproducible workflows to collaboratively solve global grand challenge problems. The Research Software Engineering group at Bristol works with most of the major public cloud companies (Amazon, Google, IBM, Microsoft and Oracle) on a range of projects, creating everything from elastically scaling Slurm clusters, through workflows for Cryo-EM image refinement, to national platforms for tracking UK greenhouse gas emissions. Through this work, we’ve recognised that comparing the cloud to on-premise HPC is like comparing a fixed-line telephone to a smartphone. Just as an iPhone is more than a mobile telephone, the cloud is more than an “on-demand cluster”.

To this end, we have been gradually building Acquire across all of our projects. Acquire provides multi-cloud identity and access management for cloud-based storage and compute services, and is designed to make it easy to run HPC jobs interactively from within Jupyter notebooks. Jupyter notebooks deployed on top of Kubernetes (k8s) are being rapidly adopted in universities and industry. While k8s can spawn a new pod for each notebook session, launching high-performance computing (HPC) jobs during dynamic workflows and then managing access to the resulting output data is complicated. Acquire builds on top of the Fn serverless framework (https://fnproject.io) to deploy individual simulations as Fn functions that are called dynamically from workflows running within Jupyter notebooks. A notebook running on a lightweight k8s cluster can burst HPC workloads, via Fn serverless calls, to a dynamically provisioned cluster running on a bare-metal or VM-based HPC/GPU cloud. Using Fn, we are constructing a distributed identity, access, and accounting layer around dynamically scaling compute resources and globally distributed object stores. This adds security and accountability, making it easy for end users to manage complex multi-cloud workflows. Researchers can control costs because billing is expressed in units of “simulations” rather than “core hours”, and they will be able to publish and share their results via access-controlled DOIs. Altogether, Acquire will help us realise the potential of the cloud as a truly planetary supercomputer. Put more succinctly, Acquire is helping us build the Netflix of simulation.

Bio: Christopher is an EPSRC Research Software Engineering (RSE) Fellow, managing the RSE Group in the Advanced Computing Research Centre at the University of Bristol. Christopher started his research career as a computational chemist, developing new methods and software for biomolecular simulation (https://protoms.org, https://siremol.org). This software is now sold and used in the pharmaceutical industry (https://www.cresset-group.com/flare/). Christopher’s aim is to improve the quality and sustainability of research software by raising awareness of the importance of software engineering skills, and by advocating the development of sustainable funding pathways and careers for people who develop research software. Christopher is joint chair of the Research Software Engineering Association (https://rse.ac.uk). He provides software engineering training to researchers across the UK (https://chryswoods.com/main/courses), and regularly advises universities on how to set up and manage successful RSE teams.

From a start in computational chemistry, Christopher’s research now covers software engineering in everything from monitoring greenhouse gases, via manufacturing airplane components, to resolving Cryo-EM images and managing complex biomolecular simulation and data analysis workflows. Most of these projects now require developing and adapting software for the cloud. As such, the Bristol RSE group has a growing international reputation for being at the forefront of cloud software engineering research. Christopher contributes to long-term UK cloud strategy via close working relationships with engineers at many of the public cloud companies, and membership of the EPSRC eInfrastructure Strategic Advisory Team and UKRI eInfrastructure expert group.

