A successful 2nd RSE conference

RSE Sheffield at the 2nd RSE conference


The second RSE conference took place on the 7th and 8th of September 2017 at the Museum of Science and Industry (MOSI). There were over 200 attendees, 40 talks, 15 workshops, 3 keynote talks, one of which was given by our very own head honcho Mike Croucher (slides here), and geeky chats galore.

RSE team members Mozhgan and Tania were involved in the organising committee as talks co-chairs and diversity chair (disclosure: they had nothing to do with Mike's keynote). Also, all of the RSE Sheffield team members made it to the conference, which seems to be a first given the diverse commitments and project involvement of all of us.

Once again, the event was a huge success thanks to the efforts of the committee and volunteers as well as the amazing RSE community that made this an engaging and welcoming event.

Conference highlights

With so many parallel sessions, workshops, and chats happening at the same time, it is quite difficult to keep track of everything going on. It seems rather unlikely that this will change over time, as it was evident that the RSE community has outgrown the current conference size. So we decided to highlight our favourites of the event:

  • The talk on 'Imposter syndrome' by Vijay Sharma: Who in the scientific community has not ever experienced this? Exactly! So when given the chance everyone jumped into this talk full of relatable stories and handy tips on how to get over it.

  • Another talk that seemed to gather loads of interest was that of Toby Hodges from EMBL on community building. This came as no surprise (at least to me), as RSEs often act as community builders or as a bridge between collaborating communities, as opposed to just being focused on developing software and pushing it into production.

  • During the first day the RSEs had the chance to have a go at interacting with the Microsoft Hololens. There was a considerable queue to have a go at this, and unfortunately, we were not among the chosen ones to play with this. Maybe in the future.

  • My hands-on workshop on 'Jupyter notebooks for reproducible research'. I was ecstatic to learn that the community found this workshop interesting, so much so that I had to run it twice!

  • Also, I'd like to casually throw in here that I have been elected as a committee member for the UK RSE association, so expect to read more about this in this blog.

For obvious reasons I missed most of the workshops but Kenji Takeda's workshop on 'Learn how to become an AI Super-RSE' was another favourite of the delegates as this was run twice too!

Our workshop on Jupyter notebooks for reproducible research

Being an RSE means that I serve as an advocate of sustainable software development. Also, as I have discussed here before, I am greatly concerned about reproducibility and replicability in science, which, I might add, is not an easy task to embark on. Thankfully, there are loads of tools and practices that we can adopt as part of our workflows to ensure that the code we develop follows the best practices possible and, as a consequence, can support science accordingly.

Naturally, as members of the community come up with more refined and powerful tools in the realm of scientific computing we (the users and other developers) adopt some of those tools meaning that we often end up modifying our workflows.

Such is the case of Jupyter notebooks. They brought to life a whole new era of literate programming, where scientists, students, data scientists, and aficionados can share their scripts in a human-readable format. More importantly, they transform scripts into a scientific narrative where functions and loops are followed by their graphical outputs, or where the user can interact via widgets. This ability to openly share whole analysis pipelines is, for sure, a step in the right direction.

However, the adoption of tools like this brings not only a number of advantages but also presents a number of challenges and integration issues with previously developed tools. For example, the traditional version control tools (including diff and merge tools) do not play nicely with the notebooks. Also, the notebooks have to be tested as any other piece of code.
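To make the version control problem concrete, here is a minimal sketch using only the Python standard library. It builds two copies of a tiny notebook-like structure that differ only in their execution counter, then compares them line by line (as a generic diff tool would) and by code cell source only. The helper names (`make_notebook`, `line_diff`, `source_diff`) are made up for this illustration, and the comparison is deliberately simplistic: it is not how any notebook-aware diff tool is implemented.

```python
import difflib
import json


def make_notebook(execution_count, output_text):
    """Build a minimal notebook-like dict with a single code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "source": ["print(sum(range(10)))"],
                "outputs": [
                    {"output_type": "stream", "name": "stdout", "text": [output_text]}
                ],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 2,
    }


def line_diff(nb_a, nb_b):
    """Changed lines in the serialised JSON, as a line-based diff tool sees them."""
    a = json.dumps(nb_a, indent=1, sort_keys=True).splitlines()
    b = json.dumps(nb_b, indent=1, sort_keys=True).splitlines()
    return [
        line
        for line in difflib.unified_diff(a, b, lineterm="")
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


def source_diff(nb_a, nb_b):
    """Pairs of code cell sources that differ, ignoring counters and outputs."""
    def sources(nb):
        return ["".join(c["source"]) for c in nb["cells"] if c["cell_type"] == "code"]

    return [(a, b) for a, b in zip(sources(nb_a), sources(nb_b)) if a != b]


# Same code and same output, but the cell was re-run later in the session
first_run = make_notebook(execution_count=1, output_text="45\n")
second_run = make_notebook(execution_count=7, output_text="45\n")

noisy = line_diff(first_run, second_run)
real = source_diff(first_run, second_run)
print(len(noisy), "changed lines in the raw JSON")
print(len(real), "code cells whose source actually changed")
```

Simply re-running a cell is enough to produce a textual change, even though nothing a reader cares about has changed. With images and rich outputs embedded in the file, the noise only gets worse.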

During the workshop, I introduced two tools: nbdime and nbval, which were developed as part of the European funded project: OpenDreamKit. Such tools introduce very much needed version control and validation capabilities to the Jupyter notebooks, addressing some of the issues mentioned before.

So in order to cover these tools, as well as how you would integrate them within your workflow, I divided the workshop into three parts: diffing and merging of the notebooks, notebook validation, and a brief 101 on reproducibility practices.

Notebooks diffing and merging

During the first part of the workshop the attendees shared their experiences using traditional version control tools with Jupyter notebooks... unsurprisingly everyone had had terrible experiences.

Then all of them had some hands-on time on how to use nbdime for diffing and merging from the command line as well as from their rich html rendered version (completely offline). As we progressed with the tutorial I could see some happy faces around the room and they all agreed that this was much needed.

Need more convincing? Just earlier this week this tweet showed up on my feed:

Notebooks validation

The second part of the workshop focused on the validation of the notebooks. And here I would like to ask this first: 'How many of you have found an amazing notebook somewhere on the web, only to clone it and find out that it just does not work: dependencies are broken, functions are deprecated, and you can't tell if the results are reproducible?'

I can tell you, we have all been there. And in such cases nbval is your best friend. It is a py.test plugin that determines whether the result of executing the stored inputs matches the stored outputs of the .ipynb file, whilst also ensuring that the notebooks run without errors.

This led to an incredible discussion on its place within conventional testing approaches. Certainly, it does not replace unit testing or integration testing, but it could be seen as a form of regression testing for the notebooks. Want to make sure that your awesome documentation formed by Jupyter notebooks is still working in a few months' time? Why not use CI and nbval?
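The idea of checking stored inputs against stored outputs can be pictured with a small standard-library sketch. It walks through the code cells of a notebook-like dictionary, re-runs each one, and compares what it prints with the output stored in the file. This is a simplified illustration only: the function name `check_notebook` and the toy notebook are made up here, only printed text is compared, and it does not reflect how nbval is actually implemented.

```python
import contextlib
import io


def check_notebook(notebook):
    """Re-run each code cell and compare printed output with the stored output.

    Returns a list of (cell_index, status) tuples, where status is one of
    "ok", "mismatch" or "error".
    """
    namespace = {}  # shared by all cells, like a single kernel session
    report = []
    for index, cell in enumerate(notebook["cells"]):
        if cell["cell_type"] != "code":
            continue
        # Concatenate the stored text of all stream outputs of this cell
        stored = "".join(
            "".join(out["text"])
            for out in cell.get("outputs", [])
            if out.get("output_type") == "stream"
        )
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec("".join(cell["source"]), namespace)
        except Exception:
            report.append((index, "error"))
            continue
        status = "ok" if buffer.getvalue() == stored else "mismatch"
        report.append((index, status))
    return report


def code_cell(source, text):
    """Helper to build a code cell with one stored stdout output."""
    return {
        "cell_type": "code",
        "source": [source],
        "outputs": [{"output_type": "stream", "name": "stdout", "text": [text]}],
    }


toy_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# A toy analysis"]},
        code_cell("x = 21\nprint(x * 2)", "42\n"),      # stored output is still right
        code_cell("print(x + 1)", "23\n"),               # stored output is stale
        code_cell("import no_such_module_for_demo", ""),  # broken dependency
    ]
}

report = check_notebook(toy_notebook)
print(report)
```

Run regularly (for example from a CI job), a check of this kind flags both notebooks that no longer run and notebooks whose results have silently drifted.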

Wrapping up

The closing to the workshop was a 101 on working towards reproducible scientific computing. We shared some of our approaches for reproducible workflows and encouraged the delegates to share theirs. We covered topics such as valuing your digital assets, licensing, automation, version control and continuous integration, among others.

The perfect close to a great RSE conference!

Just a few more things

Let me highlight that all the materials for the workshop can be found at: https://github.com/trallard/JNB_reproducible and that all of it is completely self contained in the form of a Docker container.

If you missed out on the conference and would like to see the videos and slides of the various talks do not forget to visit the RSE conference website.

Iceberg vs ShARC

TL;DR Around 100 of Iceberg's nodes are ancient and weaker than a decent laptop. You may get better performance by switching to ShARC. You'll get even better performance by investing in the RSE project on ShARC.

Benchmarking different nodes on our HPC systems

I have been benchmarking various nodes on Iceberg and ShARC using Matrix-Matrix multiplication. This operation is highly parallel and optimised these days and is also a vital operation in many scientific workflows.

The benchmark units are GigaFlops (billion floating point operations per second) and higher is better. Here are the results for maximum matrix sizes of 10000 by 10000, sorted worst to best.

According to the Iceberg cluster specs, over half of Iceberg is made up of the old 'Westmere' nodes. According to these benchmarks, these are almost 4 times slower than a standard node on ShARC.
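For readers unfamiliar with how a Gigaflops figure comes out of a timed matrix-matrix multiplication, here is a minimal sketch. The benchmark code itself is not shown in this post, so everything below is an assumption made for illustration: the function names are made up, and the multiplication is a naive pure-Python one, which is orders of magnitude slower than the optimised libraries a real benchmark would call. Only the way the operation count and the timing combine into a rate is the point.

```python
import random
import time


def matmul(a, b):
    """Naive pure-Python matrix product of two square lists of lists."""
    b_cols = list(zip(*b))  # transpose once so each column is a tuple
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def matmul_flops(n):
    """Floating point operations in an n x n product: n multiplications and
    n - 1 additions for each of the n * n output entries."""
    return n * n * (2 * n - 1)


def gigaflops(n, seconds):
    """Convert a timing for an n x n product into Gigaflops."""
    return matmul_flops(n) / seconds / 1e9


def benchmark(n, repeats=3):
    """Time an n x n product a few times and report the best rate."""
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return gigaflops(n, best)


print(f"{benchmark(60):.4f} GFLOPS (pure Python, 60 x 60)")
```

The operation count is often rounded to 2n³. For a 10000 by 10000 product that is roughly 2 × 10¹² operations, so a node that finishes in one second is running at about 2000 Gigaflops.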

The RSE project - the fastest nodes available

We in the RSE group have co-invested with our collaborators in additional hardware on ShARC to form a 'Premium queue'. This hardware includes large memory nodes (768 Gigabytes per node, 12 times the amount that's normally available), advanced GPUs (a DGX-1 server) and 'dense-core' nodes with 32 CPUs each.

These 32 core nodes are capable of over 800 Gigaflops and so are 6.7 times faster than the old Iceberg nodes. Furthermore, since they are only available to contributors, the queues will be shorter too!
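As a quick sanity check, the figures quoted in this section can be turned around to see what they imply for the older hardware. The sketch below only rearranges numbers already given above (800 Gigaflops, a factor of 6.7, 768 Gigabytes, a factor of 12). The variable names are made up, and since the post says "over 800", the results are rough lower bounds rather than measurements.

```python
dense_node_gflops = 800        # "over 800 Gigaflops" for the 32 core nodes
speedup_vs_old_iceberg = 6.7   # quoted speed-up over the old Iceberg nodes
large_memory_gb = 768          # memory per large memory node
memory_ratio = 12              # "12 times the amount that's normally available"

# Divide the quoted figure by the quoted ratio to recover the baseline
implied_old_iceberg_gflops = dense_node_gflops / speedup_vs_old_iceberg
implied_standard_memory_gb = large_memory_gb / memory_ratio

print(f"Old Iceberg node: roughly {implied_old_iceberg_gflops:.0f} GFLOPS")
print(f"Standard node memory: {implied_standard_memory_gb:.0f} GB")
```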

Details of how to participate in the RSE-queue experiment on ShARC can be found on our website.

What if ShARC is slower than Iceberg?

These benchmarks give reproducible evidence that ShARC can be significantly faster than Iceberg when well-optimised code is used. We have heard some unconfirmed reports that code run on ShARC can be slower than code run on Iceberg. If this is the case for you, please get in touch with us and give details.

Sheffield R Users group celebrates Hacktoberfest

We'll be honest here and say that our Sheffield R Users group Hacktoberfest celebrations started as a last minute stroke of inspiration. Nearing our standard first Tuesday of the month meetup, our speaker lineup for October was thin. At the same time I'd spent the last month mentoring as part of the Mozilla Open Leadership program again, which was gearing up to have projects participate in Hacktoberfest, a global month long celebration of open source, organised by Digital Ocean and designed to get people engaged with projects hosted openly on GitHub. For those unfamiliar with the platform, GitHub is one of many code hosting platforms where open projects live, allowing anyone to copy, modify and even contribute back to open source projects, many of which depend on such volunteer contributions. As it takes a village to raise a child, so it takes a small village to build, maintain, continue to develop and support the users of a successful open source project, where even small non-technical contributions, for example to documentation, can be a huge help to maintainers (see the blog post on this by Yihui Xie, of knitr fame).

So what better way to entice folks to get involved than the promise of stickers and a free t-shirt on completion of the Hacktoberfest challenge! And the challenge? Simple. Make four contributions (pull requests) to any open source project on GitHub between the 1st and 31st of October. And the contribution can be anything — fixing bugs, creating new features, or updating and writing documentation. Game on!

Many project owners had labelled specific issues up and we noticed there were many rOpenSci projects in need of some #rstats help.

Given that doing is the best way to learn and working on problems outside our daily routines can be a great distraction, we thought it'd be a great idea to skip the standard talk meetup format for October and instead opt for some hands on Hacktoberfest action! It would also give the opportunity to any of our R users who were curious but did not have previous experience with GitHub and open source to learn more through practice and also in a friendly space where they could get help with any questions or uncertainties. Working the details through on Twitter (as you do!), an exciting plan emerged...not only would we extend to holding weekly sessions throughout the whole month, we would end with a special Halloween celebratory session!

Kick off meetup - briefing session

At the kick off meetup, fellow Sheffield R Users Group co-organisers Tamora James (@soaypim) and Mathew Hall (@mathew_hall) introduced participants to the general ideas and principles of open source, discussed contributing to open projects, introduced GitHub and walked through scanning issues (seeing what things need doing in a particular project), forking repositories (making a copy of the materials associated with a project) and making pull requests (sending contributions... yes it was all Greek to me in the beginning too... and I'm Greek!). Given the short time we had to prepare for the session, the materials provided by Digital Ocean on their Hacktoberfest event kit page were an invaluable resource and we can easily recommend them as a great introduction to contributing to open source. Of the 8 folks that made it to the session, 3 would go on to contribute pull requests over the month.

The sessions

Admittedly, when you work at a computer all day, spending another 3 hours at your screen voluntarily is probably not everyone's top choice. But I personally found the opportunity to carve some time out to explore the huge variety of projects and diverse ways in which to get involved engaging and in some ways quite relaxing. The great collaborative spaces available for booking at the University of Sheffield, the informal setting and hanging out with friends made the sessions something I actually looked forward to. And the "no pressure" aspect of voluntary contribution meant I was free to play around, follow my own curiosity and explore things I was interested in but don't necessarily get the time to work with during my normal working day. Indeed some participants came along to make use of the company and learn some new things not necessarily related to the Hacktoberfest challenge. So collaborative, no-pressure spaces can be really useful for sharing knowledge.

Halloween R-stravaganza

Finally it was time for the closing event, our Halloween Hacktoberfest special! Excitement was building from the day before when Tamora and I spent the evening carving our too cute to be scary octocat :heart: spooky R pumpkin!

We also got some candy donations from Sheffield RSE and a special guest, Raniere Silva (@rgaiacs), who came all the way from Manchester to join us (although technically it had been his idea after all). The stage was set for a fun finale!


While we all got our t-shirts, I was really impressed with Tamora and Raniere's contributions, and their approach served as the biggest take-away for me. They both focused on a problem or feature that would improve a tool they were already interested in or used. They got feedback on their suggestion before they even began by opening an issue on GitHub and interacting with the project's owners about their idea. That meant their efforts were well focused and much more likely to be accepted.

My t-shirt in the end was mainly earned by helping with typos. For Hacktoberfest, the size of your contribution doesn't matter as long as you send a contribution. And finding typos is actually non-trivial and time consuming due to our brain's auto-correct feature. Sadly, the coding pieces that I worked on over the sessions did not make the cut as a functional pull request yet (there'll be a personal blog about my experience during Hacktoberfest coming soon instead). Mostly, however, I loved the experience and am already looking forward to organising it next year!

Thinking ahead, 3 things I'd do differently in 2018 would be:

  • Reach out to more organisations: There's a great variety of clubs and meetups at the University and more widely in Sheffield that could be interested in joining forces for a Hacktoberfest event. This would give us R users an opportunity to interact with users of other tools and potentially even tackle issues requiring mixed skills as teams.
  • Start planning earlier! This would give us an opportunity to advertise better leading up to the kick-off session and allow us to co-ordinate with other groups.

  • Run a git & GitHub clinic before the first hack session: This would give folks that have not used GitHub before the opportunity to get some experience and confidence before turning up to a hack session.

So long #Hacktoberfest! See you in 2018!

Pumpkin Carving session and Halloween special powered by:

New Group Member: Phil Tooley

I am thrilled to have joined the Research Software Engineering team at Sheffield. My new role is attached to the INSIGNEO Institute for in silico Medicine here at the University of Sheffield, developing image registration software as part of the CompBioMed project.

About Me

I am a former theoretical and computational physicist with particular interests in mathematical modelling and code optimisation. I am also a stalwart champion of the use of the SciPy stack for scientific computing, and enjoy trying to make Python speed competitive with C and Fortran. (Spoiler alert: it totally can be!)

Although I am a user of both C(++) and Fortran, I find that Python is often a better choice for many tasks, even in high performance scientific computing. As well as the huge array of fast mathematical libraries available to Python, it is also possible to write custom routines in pure (Numba) or nearly-pure (Cython) Python code, and compile them to native machine code. This can give Python equivalent performance to traditional C or Fortran codes, and opens up the possibility of running Python code efficiently on HPC platforms. (More to come on this topic...)
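As a small taste of the "pure Python compiled to machine code" idea, here is a minimal sketch of a plain numerical loop decorated with Numba's `njit`. The function (a Leibniz series approximation of pi) is just an illustrative choice, not something from my own work, and the `try`/`except` fallback is an assumption added so that the sketch still runs, uncompiled, on a machine without Numba installed. No timings are quoted because they depend entirely on the machine.

```python
import time

try:
    from numba import njit  # compiles the decorated function to machine code
except ImportError:
    def njit(func):
        """Fallback so the sketch still runs, uncompiled, without Numba."""
        return func


@njit
def leibniz_pi(terms):
    """Approximate pi with the Leibniz series using a plain Python loop."""
    total = 0.0
    sign = 1.0
    for k in range(terms):
        total += sign / (2.0 * k + 1.0)
        sign = -sign
    return 4.0 * total


leibniz_pi(10)  # first call triggers compilation when Numba is present

start = time.perf_counter()
approx = leibniz_pi(1_000_000)
elapsed = time.perf_counter() - start
print(f"pi ~ {approx:.6f} in {elapsed:.4f} s")
```

The body of the function is ordinary Python. The only change needed to hand it to the compiler is the decorator, which is what makes this approach attractive for tight numerical loops.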

Outside the office you will usually find me rock climbing or hiking somewhere.

My previous work

For the last 4 years I have been working on my PhD in Theoretical and Computational Plasma Physics at the University of Strathclyde, on a novel accelerator technology known as Laser Wakefield Acceleration (LWFA). This involves firing a short \((30\, \mathrm{fs})\) but highly intense \((10^{25}\,\mathrm{W/m^2})\) laser pulse into a jet of helium gas. The laser ionises the gas to a plasma and drives an electrostatic plasma wave with intense electric fields, which can accelerate electrons to gigaelectronvolt energies over just a few mm. (Conventional accelerators such as SLAC have to be kilometres in length to achieve the same energies.)

My research was into methods of controlling and improving the performance of these accelerators, and relied heavily on particle-in-cell (PIC) codes. These are massively parallel codes designed to run on HPC systems, and I spent a lot of my time developing code extensions and analysis tools for the terabyte scale datasets that they produce. Typically I was interested in extracting a very small subset of the electrons (usually \(<1\%\) of the total) based on some selection criteria. To do this efficiently I developed a custom C++-based analysis code which can extract electron trajectories from the data based on arbitrary criteria.
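The "arbitrary criteria" idea can be sketched in a few lines of Python: the selection logic is passed in as a function, so the extraction loop never needs to change. To be clear, this is not the C++ analysis code mentioned above. The data layout, the field name `energy_mev`, the helper names and the toy numbers are all made up for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Trajectory = Dict[str, Any]  # hypothetical layout: an id plus per-step quantities


def select_trajectories(
    trajectories: Iterable[Trajectory],
    criterion: Callable[[Trajectory], bool],
) -> Iterator[Trajectory]:
    """Lazily yield only the trajectories that satisfy the criterion."""
    for trajectory in trajectories:
        if criterion(trajectory):
            yield trajectory


def final_energy_above(threshold_mev: float) -> Callable[[Trajectory], bool]:
    """Build a criterion on the last recorded energy of a trajectory."""
    return lambda trajectory: trajectory["energy_mev"][-1] > threshold_mev


# Made-up toy data, purely for illustration
electrons = [
    {"id": 0, "energy_mev": [0.1, 0.4, 0.9]},
    {"id": 1, "energy_mev": [0.1, 50.0, 210.0]},
    {"id": 2, "energy_mev": [0.2, 0.3, 0.2]},
]

selected = list(select_trajectories(electrons, final_energy_above(100.0)))
print(len(selected), "of", len(electrons), "trajectories kept")
```

Because the selection is a generator, trajectories are examined one at a time rather than all being held in memory, which matters when only a tiny fraction of a very large dataset is wanted.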

A major application of LWFAs is as compact sources of X-ray and UV light for imaging and materials analysis tools in science and industry, and so the second major theme of my work was analysis of the radiation produced by the accelerated electrons. This analysis required numerical integrators to calculate the radiation from the extracted electron trajectories by solving the Liénard-Wiechert potential equations.
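For reference, the Liénard-Wiechert potentials of a point charge \(q\) moving along a trajectory \(\mathbf{r}_s(t)\) take the following standard textbook form in SI units. The notation is an assumption chosen for this note and is not taken from the analysis code: \(\boldsymbol{\beta} = \mathbf{v}/c\) is the normalised velocity, \(\mathbf{n}\) is the unit vector pointing from the charge to the observer, and the bracketed quantities are evaluated at the retarded time \(t_r\).

\[
\varphi(\mathbf{r}, t) = \frac{1}{4\pi\varepsilon_0}
\left[ \frac{q}{(1 - \mathbf{n}\cdot\boldsymbol{\beta})\,|\mathbf{r} - \mathbf{r}_s|} \right]_{t_r},
\qquad
\mathbf{A}(\mathbf{r}, t) = \frac{\boldsymbol{\beta}(t_r)}{c}\,\varphi(\mathbf{r}, t),
\qquad
t_r = t - \frac{|\mathbf{r} - \mathbf{r}_s(t_r)|}{c}
\]

The retarded time is defined implicitly, which is one reason the radiation has to be computed numerically along each extracted trajectory.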

ATI Data study group

I had the opportunity to attend the Alan Turing Institute (ATI) Data Study Group (22nd-26th May 2017). The ATI is the national institute for data science and as such, it has strong ties to both academia and industry.

The event was a week long data hackathon in which multiple groups used their best skills to crack the various projects proposed by the industrial partners. Over 5 days 6 groups worked intensively to deliver feasible solutions to the problems proposed.

The projects

The projects spanned a wide range of topics, each with its own unique challenges and requirements.

  • DSTL presented two projects. The first was aimed at cyber security, trying to identify malicious attacks from IP traffic data. The second was focused on the identification of places and geographical landmarks, so that the team could predict the likelihood of a given event taking place at a given location.
  • HSBC's challenge, unlike the rest of the problems, did not consist of studying a particular data set. Instead it was based on the development of a synthetic dataset that could be used along with some algorithms to evaluate users' behaviour.
  • Siemens' project was centred around the study of vehicle traffic data that would enable efficient traffic lights and traffic volume control, which would eventually lead to the reduction of carbon emissions.
  • Samsung, being one of the leaders in the smartphone industry, decided on using their collected (anonymous) user data to analyse users' gaming behaviour (e.g. which games you would buy or play based on your current gaming habits) as well as developing a gaming recommendation engine.
  • Thomson Reuters's challenge was centred around graph analysis. The primary goal of the project was to identify how positive or negative news about a given company affects other companies and industries within its network, and how far this effect extends.

The Hack

I joined the Thomson Reuters project as this seemed to be one of the projects with the richest data set, both in its extent and type (e.g. news, sentiment analysis, stock market, time series, etc.). The team was formed by 13 people with a huge variety of skills and coming from totally different backgrounds, which is what makes hackathons so exciting. You have to make the most of the skill sets your team has, in a very limited amount of time... pushing you out of your comfort zone.


After a brief team introduction our 3 Thomson Reuters facilitators described the data and the challenge in more detail. We then identified the main goals of the project and subdivided the team into about 4 smaller teams. Once the initial planning was completed, we spent Monday evening through Wednesday morning learning about their various APIs, getting the data, wrangling data... and getting more data.

We soon realised that analysing all the data was incredibly complex and of course, there was not one correct way to do it. So we had to reduce the scope of the data we were in fact going to use and the sort of features/phenomena we were interested in.

The rest of Wednesday and Thursday were used to start doing some prediction and regression on the data, as well as writing up the report and finishing off our pitch presentation.


The findings

Certainly, we were able to obtain loads of insight from our data and the various algorithms we used. Some of the most important expected and unexpected findings were:

  • Negative news has a longer impact on the companies involved and those within their network (20 days, as opposed to a 4-day impact from positive news)
  • The companies are related to each other based on whether they are affiliates, competitors, parents, etc. Not surprisingly, competitors are the companies that have the biggest effect on other companies' stock prices
  • Different types of industries react differently to negative/positive news and the degree of extension of such an impact varies considerably from one industry type to another

Taking home

As with every hackathon I have been to, I ended up feeling absolutely drained, but accomplished at the same time. Hacks are such a good opportunity to meet like-minded people, learn loads of stuff, test yourself, and have fun.

Would I encourage people to go to one of these events? Absolutely! If you are interested in all things data science you should keep an eye on the future events by the Alan Turing Institute. If you only want to have a go at hacking for fun, for a cause, or to meet new people, I would suggest you have a look at Major League Hacking. I am sure you will be able to find something near you and for all sorts of interests. Our guys at Hack Sheffield organise a hackathon once a year so you might want to keep an eye on their activities. For those around the Manchester area, check out Hac100 including the youth and junior hacks (formerly Hack Manchester).

Will the RSE team at the University of Sheffield organise hacks? We have a crazy amount of work on our plates at the moment but we are definitely interested in (co)organising hackathons and many other events throughout the year. So keep your eyes peeled!

Mozsprint 2017 at the University of Sheffield

The 1st-2nd of June 2017 saw the Mozilla Global Sprint circle the globe once again. It's Mozilla's flagship two-day community event, bringing together people from all over the world to celebrate the power of open collaboration by working on a huge diversity of community led projects, from developing open source software and building open tools to writing curriculum, planning events, and more. So here are a few of my own thoughts and reflections on this year's happenings.

Lead up to the sprint

Open Leadership Training mentorship

I joined my first Mozilla Global Sprint last year as the culmination of the Science Lab’s inaugural Working Open Workshop and mentorship program. I worked on my very own open science project, rmacroRDM which I'd spent the lead up to the Sprint preparing for. This year however it was a different experience for a number of reasons.

Firstly, the roles had been reversed, and from mentee, I was now a seasoned open leadership mentor. In fact, I had enjoyed the Open Leadership training program so much that I’d volunteered to mentor on the following two rounds, the first culminating at MozFest 2016 and this latest round at the Global Sprint 2017. Apart from staying connected to the vibrant network of movers and makers that is Mozilla, I also found I got a lot out of mentoring myself, from improving skills in understanding different people’s styles and support requirements to being introduced to new ideas, tools and technologies by interesting people from all over the world! Overall I find mentorship a positive sum activity for all parties involved.

So the lead up this year involved mentoring two projects while they prepared to launch at the global sprint. The Open Leadership Training program involves mentees working through the OLT materials over 13 weeks while developing the resources required to open their projects up, ready to receive contributions. On a practical level, the program teaches approaches to help clearly define and promote the project and the use of GitHub as a tool to openly host work on the web, plan, manage, track, discuss and collaborate. But the program delves deeper into the very essence of building open, supportive and welcoming communities in which people with an interest in a tool/cause/idea can contribute what they can, learn and develop themselves and feel valued and welcome members of a community.

Weekly contacts with the program alternated between whole cohort Vidyo call check-ins and more focused one-on-one Skype calls between mentors and mentees. This round I co-mentored with the wonderful Chris Ritzo from New America’s Open Technology Institute and we took on two extremely exciting projects, Aletheia and Teach-R.

Mentee projects:

Headed up by Kade Morton (@cypath), a super sharp, super visionary, super motivated, self-described crypto nerd from Brisbane, Australia, Aletheia doesn't pull any punches when describing its reason for being:

In response they're building a decentralised and distributed database as a publishing platform for scientific research, leveraging two key pieces of technology, IPFS and blockchain. Many of the technical details are frankly over my head but I nonetheless learned a lot from Kade’s meticulous preparation and drive. Read more about the ideas behind the project here.

What can I say about Marcos Vital, professor of Quantitative Ecology at the Federal University of Alagoas (UFAL), Brazil and fellow #rstats aficionado, apart from that he is also a huge inspiration! An effortless community builder, he runs a very successful local study group and has built a popular and engaged online community through his lab Facebook page promoting science communication.

The topic of his project Teach-R is close to my heart, aiming to collate and develop training materials to TEACH people to TEACH R. Read more about it here.

Hosting a Sheffield site.

Secondly, this year I helped host a site here at the University of Sheffield, and seeing as the sprint coincided with my first day as a Research Software Engineer for Sheffield RSE, we decided to take the event under our wing. With space secured and swag and coffee funds supplied by the Science Lab, the local site was ready for action!

The Sprint!

Sprint at the University of Sheffield.

There was a good buzz of activity throughout the sprint at the site, with a few core participants while others came and went as they could. At the very least, roaming participants managed to soak up some of the atmosphere and pick up some git and GitHub skills... a success in my books!

Stuart Mumford (@StuartMumford) led project SunPy, a Python based open-source solar data analysis environment, and attracted a number of local contributors, including a new PhD student, although, as is often the case, much of the first morning seemed to be spent battling Python installation on his laptop! Worth it for picking up a local contributor that will hopefully remain engaged throughout his studies though, and the team managed to push on with bug fixes and documentation development.

Jez Cope (@jezcope), our University's Research Data Manager, was contributing to Library Carpentry, one of the biggest and most popular projects at this year's Sprint, and also brought super tasty banana bread. He's also blogged about his experiences here.

Myself, while of course tempted by the many R, open science and reproducibility projects on offer, in the end chose to work on something unrelated to what I'm lucky to do for work and focus on a project I'm interested in personally. So I teamed up with Tyler Kolody (@TyTheSciGuy) on his timely project EchoBurst. The project aims to address our growing, social media facilitated, retreat into echo chambers, which is resulting in increasingly polarised public discourse and an unwillingness to engage with views we disagree with. The idea is to attempt to burst through such bubbles, by developing a browser extension with the potential to distinguish toxic content, more likely to shut down discussion, from more constructive content that might be able to bridge different perspectives.

Admittedly the project is very ambitious with a long way to go, many stages and various techniques/technologies to incorporate including natural language processing, building the browser plugin and even considering psychological and behavioural aspects in designing how to present information that might oppose a user's view without triggering the natural shut-down response.

There was plenty of really interesting brainstorming discussion but the biggest initial challenge, and where the project could use the most help, is in collecting training data. The main approach is for contributors to help collect URLs of blogs on polarising topics from which to scrape content. But during the sprint we also added the option for contributors to add relevant YouTube videos to collaborative playlists. We also started working on simple R functions to help scrape and clean the caption content.

Sprint across the globe

What a productive event this year's sprint was! While details of the level of activity have been covered elsewhere and the final project demos can be found here and here, I just wanted to highlight some basic stats:

Global #mozsprint involved:
  • 65 sites (+ virtual participants)
  • 20 countries
  • 108 projects
During the 50 hour #mozsprint, we saw:
  • 302 pull requests closed
  • 320 pull requests opened
  • 2223 comments & issues
  • 824 commits pushed


(access the full data on github activity here)

Mentee progress

I was really happy to see both our mentees get great responses, pick up new contributors and make good progress on their projects.

  • Marcos expertly moderated a very active gitter channel for Teach-R, attracted a number of excellent and very engaged new contributors, adding a number of new lessons, in both English and Portuguese!.

  • Kade also got great engagement for Aletheia, including onboarding science communicator Lisa Mattias (@l_matthia), who's already blogged about their plans to take the project forward by applying to present it at this year's Open Science Fair. Importantly, he also managed to attract the JavaScript developer they've been desperately looking for. Success! You can read more about Kade's experiences of the sprint here.

They both made us very proud indeed!


But the most important feature of the sprint for me every year is the global comradery and atmosphere of celebration. Handing off from one timezone to the other and checking in within our own to hear from leads about their project needs and progress, hanging out with participants from far and wide on vidyo and through streams of constant messaging on gitter, catching up with friends across the network...

...and cake...sooooooooo much cake!!

disclaimer: this cake was sadly not at the Sheffield site. It definitely has inspired me to put a lot more effort into this aspect of the sprint next year though!

Final thoughts

The end of the sprint is always a bit sad but the projects live on, hopefully with a new lease of life. So if, by reading this, you're inspired to contribute, check out the full list of projects for something that might appeal. There's a huge diversity of topics, tasks and skills to choose from, and fun new people to meet!

The network lives on too, so if you’ve got an exciting idea of your own that you think would make a good open source project, make sure to check out @MozOpenLeaders and look out for the next mentorship round.

As for the impact on Sheffield RSE, well there was one point where we managed to get the full team and loose collaborators working in one room (we’re normally spread out across the university). It felt great to work together from the same space so we decided to make a point of routinely booking one of the many excellent co-working spaces the University of Sheffield has on offer and establish regular work-together days!

So thanks for the inspiration and excellent times Mozilla! Till the next time!

(ie Mozfest 2017!)


Apart from the coffee and good vibes, the day was also fuelled by sounds. Here's a couple of the mixes that kept the Sheffield site going!

Grooves no. 1:

Grooves no. 2:

tmux: remote terminal management and multiplexing

Today we have a guide to 'terminal multiplexing' including suggestions on how to use it on computer clusters such as ShARC and Iceberg.

Have you ever?

  • Started a process (such as a compilation or application install) over SSH only to realise that it's taking far longer than you expected and you need to shut down your laptop to go to a meeting, which you know will therefore kill both the SSH connection and your process?
  • Been in a cafe with flakey wifi and had a remote process hang or possibly die due to an unstable SSH connection?
  • Accidentally closed a window with a SSH session running in it and really regretted it?
  • Wanted to be able to switch between multiple terminal sessions on a remote machine without having to establish a SSH connection per session?
  • Wanted to be able to have multiple terminals visible at once so you can, say, edit source code in one terminal whilst keeping compilation errors visible in another?
  • Wanted a nicer way to copy and paste between remote terminal sessions?

If the answer to any of these is "yes" then terminal multiplexing may help!

Making remote Linux/Unix machines easier to administer/use!

First, we need to delve a little deeper into some of the problems we are trying to solve.

Why do my remote processes die when my SSH connection dies/hangs?

(Skip over this section if you want!)

Every process (bar the systemd process or init process with a process ID of 1) has a parent process. If a process is sent a signal telling it to cleanly terminate (or 'hang up') then typically its child processes will be told to do the same.

When you SSH to a remote machine, the SSH service on that machine creates a shell for you within which you can run commands.

To illustrate, here I logged into a server and used the pstree program to view the tree of child-parent relationships between processes. Notice in the excerpt shown below that the SSH service (sshd) has spawned a (bash) shell process for my SSH session, which in turn has spawned my pstree process:

[will@acai ~]$ ssh sharc
[will@sharc-login1 ~]$ pstree -a
systemd --switched-root --system --deserialize 21
  ├─sshd -D
  │   └─sshd
  │       └─sshd
  │           └─bash
  │               └─pstree -a

So if the SSH service decides that your connection has timed out then it will send a signal to the bash process telling it to hang up, and if that bash process were to die then any child processes started by it would also die.

If the remote servers you work with are primarily High-Performance Computing (HPC) clusters running scheduling software such as Grid Engine then you have a simple, robust way of ensuring that the success of your processes doesn't depend on the reliability of your connection to the clusters: submit your work to the scheduler as batch jobs. There are many other benefits to submitting batch jobs over using interactive sessions when using such clusters but we won't go into those here.
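As a rough illustration of what a batch job looks like, here is a minimal sketch of a Grid Engine job script. The job name, the run-time request and the echo line standing in for real work are all placeholders, and the resource options a particular cluster expects may differ, so check the cluster's own documentation before relying on it.

```shell
#!/bin/bash
# Lines starting with '#$' are read by Grid Engine as submission options;
# the shell itself treats them as ordinary comments.

# Job name (placeholder)
#$ -N example_job
# Request one hour of run time
#$ -l h_rt=01:00:00
# Run the job from the directory it was submitted from
#$ -cwd
# Merge standard error into standard output
#$ -j y

# Placeholder for the real work: record which node the job ran on
job_host=$(uname -n)
echo "Job running on ${job_host}"
```

Saved as, say, example_job.sh, this would be submitted with qsub example_job.sh, after which the job runs whether or not you stay connected.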

However, what do you do when there is no HPC-style scheduling software available?

  • You could run batch jobs using much simpler schedulers such as at for one-off tasks or cron or systemd Timers for periodic tasks.
  • You could prefix your command with nohup (no hang up) to ensure it continues running if the parent process tells it to hang up.
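To make the nohup option concrete, here is a minimal sketch. The log file name is made up and the sleep/echo pair simply stands in for a real long-running command.

```shell
#!/bin/sh
# Hypothetical log file; the command below is a stand-in for real work.
log="${TMPDIR:-/tmp}/long_task.log"

# nohup makes the command ignore hang-up signals and '&' runs it in the
# background. Redirecting stdin, stdout and stderr means the command no
# longer depends on the terminal it was started from.
nohup sh -c 'sleep 1; echo "task finished"' > "$log" 2>&1 < /dev/null &
task_pid=$!

# In real use you could log out at this point. We wait here only so that
# the example can show the result.
wait "$task_pid"
cat "$log"
```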

Neither of these allow you to easily return to interactive sessions though. For that we need terminal multiplexers.

A brief guide to the tmux Terminal Multiplexer

Detaching and reattaching to sessions

Terminal Multiplexer programs like GNU Screen and tmux solve this problem by:

  1. Starting up a server process on-demand, which then spawns a shell. The server process is configured not to respond when being told to hang up so will persist if it is started over a SSH connection that subsequently hangs/dies.
  2. Starting up a client process that allows you to connect to that server and interact with the shell session it has started
  3. Using key-bindings to stop the client process and detach from the server process.
  4. Using command-line arguments to allow a client process to (re)connect to an existing server process

Demo 1

Here we look at demonstrating the above using tmux. I recommend tmux over GNU Screen as the documentation is clearer and it makes fewer references to legacy infrastructure. Plus, it is easier to google for it! However, it may use more memory (true for older versions).

Let's create and attach to a new tmux session, start a long-running command in it then detach and reattach to the session:

Used keys:

<prefix> d: detach

where <prefix> is Control and b by default. Here <prefix> d means press Control and b then release that key combination before pressing d.

In this case we started tmux on the local machine. tmux is much more useful though when you start it on a remote machine after connecting via ssh.
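The same create / detach / reattach cycle can also be driven from the command line rather than with key-bindings. The sketch below wraps the relevant tmux commands in two small helper functions; the function names and the session name used afterwards are made up for illustration.

```shell
# Hypothetical helper: start a command inside a new, detached tmux session.
# Usage: run_detached <session-name> <command>
run_detached() {
    # -d: create the session without attaching to it
    # -s: give the session a name so it is easy to find later
    tmux new-session -d -s "$1" "$2"
}

# Hypothetical helper: attach to a named session
# (detach again with <prefix> d).
reattach() {
    tmux attach-session -t "$1"
}
```

On a remote machine you might run run_detached build 'make > build.log 2>&1', log out, and later use tmux list-sessions followed by reattach build to see how things are going. Note that a session started this way ends when its command finishes.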

Windows (like tabs)

What else can we do with terminal multiplexers? Well, as the name implies, they can be used to view and control multiple virtual consoles from one session.

A given tmux session can have multiple windows, each of which can contain multiple panes, each of which is a virtual console!

Demo 2

Here's a demonstration of creating, renaming, switching and deleting tmux windows:

Used keys:

<prefix> ,: rename a window
<prefix> c: create a new window
<prefix> n: switch to next window
<prefix> p: switch to previous window
<prefix> x: delete current window (actually deletes the current pane in the window but will also delete the window if it contains only one pane)
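Each of these key-bindings has a command-line equivalent, which can be handy in scripts. The following is a small sketch, meant to be run from a shell inside an existing tmux session; the function name and window names are made up.

```shell
# Hypothetical helper: open two extra named windows in the current session.
open_work_windows() {
    tmux new-window -n build      # like <prefix> c, but also names the new window
    tmux new-window -n scratch    # the new window becomes the current one
    tmux rename-window logs       # like <prefix> , : renames the current window
    tmux previous-window          # like <prefix> p
}
```

Similarly, tmux next-window matches <prefix> n, and tmux kill-window closes the current window regardless of how many panes it contains.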

Dividing up Windows into Panes

Now let's look at creating, switching and deleting panes within a window:

Used keys:

<prefix> %: split the active window vertically
<prefix> ": split the active window horizontally
<prefix> Up or Down or Left or Right:
  switch to pane in that direction
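The pane key-bindings also have command-line equivalents. One thing to watch is tmux's naming: split-window -h (the command behind <prefix> %) puts the new pane beside the current one, while split-window -v (the command behind <prefix> ") puts it below. A small sketch, meant to be run from a shell inside a tmux session, with a made-up function name:

```shell
# Hypothetical helper: build a three-pane layout in the current window,
# with one pane on the left and two stacked panes on the right.
three_pane_layout() {
    tmux split-window -h     # like <prefix> % : new pane to the right
    tmux split-window -v     # like <prefix> " : new pane below the current one
    tmux select-pane -L      # like <prefix> Left : move to the pane on the left
}
```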

Scrolling backwards

You can scroll back up through the terminal history of the current pane/window using:

<prefix> Page Up:
  scroll back through terminal history

Copying and pasting

If you have multiple panes side-by-side and then attempt to copy text using the mouse, you'll copy lines of characters that span all panes, which is almost certainly not going to be what you want. Instead you can

<prefix> z: toggle the maximisation of the current pane

then copy the text you want.

Alternatively, if you want to copy and paste between tmux panes/windows you can

<prefix> [: enter copy mode

move the cursor using the arrow keys to where you want to start copying then

space: (in copy mode) mark start of section to copy

move the cursor keys to the end of the section you want to copy then

enter: (in copy mode) mark end of section to copy and exit copy mode

You can then move to another pane/window and press

<prefix> ]: paste copied text

I find this mechanism very useful.

And there's more

Things not covered in detail here include:

Using tmux on HPC clusters

Terminal Multiplexers can be useful if doing interactive work on a HPC cluster such as the University of Sheffield clusters ShARC and Iceberg (assuming that you don't need a GUI).

On ShARC and Iceberg you can:

  1. Start a tmux or GNU Screen session on a login node;
  2. Start an interactive job using qrshx or qrsh;
  3. Disconnect and reconnect from the tmux/Screen session (either deliberately or due to an issue with the SSH connection to the cluster);
  4. Create additional windows/panes on the login node for editing files, starting additional interactive jobs, watching log files etc.

Starting tmux on worker nodes is also useful if you want to have multiple windows/panes on a worker node, but it is less useful if you want to disconnect from and reconnect to a session: if you run qrsh a second time you cannot guarantee that you will be given an interactive job on the node you started the tmux session from.

However, note that you can have nested tmux sessions (with <prefix><prefix> <key> used to send tmux commands to the 'inner' tmux session).

Warning: many clusters have multiple login nodes for redundancy, with only one being the default active login node at any given time. If the active login node requires maintenance then logged-in users may be booted off and long-running processes may be terminated (before the system administrator makes a 'standby' login node the currently active one). Under such circumstances your tmux/Screen session may be killed.

Being a good HPC citizen

Your interactive job (on a cluster worker node) will be terminated by the cluster's Grid Engine job scheduler after a fixed amount of time (the default is 8 hours) but your tmux/Screen session was started on a login node so is outside the control of the cluster and will keep running indefinitely unless you kill it.

Each tmux/Screen session requires memory on the login node (which is used by all users) so to be a good HPC citizen you should:

  • Kill your tmux/Screen session when no longer needed (tmux/Screen will exit when you close all windows)
  • Only start as many tmux/Screen sessions on the login node as you need (ideally 1)
  • Exit your interactive Grid Engine job (on a worker node) if no longer needed as then others can make use of the resources you had been using on this node.

Tip: with tmux you can ensure that you either reconnect to an existing session (with a given name) if it already exists or create a new session using:

tmux new-session -A -s mysession

This should help avoid accidentally creating more than one tmux session.
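When you have finished with a session it is just as easy to tidy it up from the command line. The helper below is a sketch with a made-up function name; it only uses the standard has-session and kill-session commands.

```shell
# Hypothetical helper: kill a named tmux session if it exists.
# Usage: end_session <session-name>
end_session() {
    if tmux has-session -t "$1" 2>/dev/null; then
        tmux kill-session -t "$1"
        echo "killed tmux session $1"
    else
        echo "no tmux session called $1"
    fi
}
```

Running tmux list-sessions first shows which sessions you still have on the login node; end_session mysession then removes the one created above.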

NB the recordings of terminal sessions shown were created using ttyrec and ttygif then converted to .webm videos using ffmpeg.

Software Carpentry and Data Carpentry at the University of Sheffield!

The University of Sheffield is now a Software Carpentry Partner Organisation, allowing the Research Software Engineering and Library teams to start organising Software Carpentry and Data Carpentry workshops. These are designed to help researchers develop the programming, automation and data management skills needed to support their research. Workshop dates are to be announced shortly.


Our first Software Carpentry workshop is scheduled for 16th and 17th August!


Addressing the training needs of researchers with regards to programming

As more researchers realise they can produce better quality research more quickly if they have some coding and data management skills under their belts universities will need to ensure that training in these areas is accessible to those that need it.

Academic institutions will most likely already have courses for teaching highly-specialist subjects (such as how to use the local HPC cluster) but for the more generic aspects of research software development and data management there are several obvious choices:

  • Develop and deliver bespoke materials;
  • Buy in to commercial training packages;
  • Point researchers towards free online resources;

However, there is also a fourth option: team up with the Software Carpentry (SC) and Data Carpentry (DC) not-for-profit organisations to deliver on-site, interactive workshops based on open-source materials that have been refined by a large community of SC and DC instructors.

Software whatywhaty?

Software Carpentry has developed discipline-agnostic workshop material on:

Data Carpentry lessons look at data management and processing within the context of a specific domain (such as ecology or genomics), focussing on areas such as:

  • the command line;
  • data cleaning and filtering using OpenRefine;
  • data processing and visualisation with Python or R;
  • cloud computing
  • GIS

What form do the workshops take?

The Software and Data Carpentry organisations ask that accredited instructors delivering 'branded' workshops adopt a fairly progressive teaching style:

  • Workshops typically last two days and include four lessons (e.g. the unix shell, Python, version control and databases).
  • There's lots of live coding: the instructor and students gather together in a room with laptops and a projector and all present go through a number of examples interactively. Students use their own laptops to ensure that they're able to continue where they left off at the end of a workshop. Instructors can and do make mistakes when doing live coding; students can then learn from these mistakes and may grow in confidence on learning that pros make mistakes too.
  • Instructors try to elicit responses from students and use quizzes to gauge comprehension and keep students focussed.
  • Software Carpentry has a code of conduct and tries to ensure that all lessons delivered under its banner are as inclusive as possible.

What's happening at the University?

The University is now a Software Carpentry Partner Organisation so can run many workshops per year using the Software Carpentry and Data Carpentry branding. We could run workshops without the branding but Software and Data Carpentry are now familiar names to researchers (and potentially employers) and by working closely with those two organisations we become part of a global network of instructors with which we can share ideas and materials.

The RSE team and Library collectively now have five accredited Software and Data Carpentry instructors: Mike Croucher received training some time ago and in March Tania Allard and I from the RSE team plus Jez Cope and Beth Hellen from the Library's Research Services Unit participated in instructor training in Oxford.

Software Carpentry Instructor Training session

The four of us spent two days learning about the SC/DC teaching style and what makes for an effective instructor, and got to practise several aspects of workshop development and delivery. I must thank the instructors on the training course (Mateusz Kuzak and Steve Crouch) plus Reproducible Research Oxford for hosting and organising the event.

We are now planning our first Software Carpentry and Data Carpentry workshops. These are to be held later in the summer.

Keep an eye on this blog, the RSE-group@sheffield.ac.uk mailing list and @RSE_Sheffield for dates!

Coffee and Cakes Event

The RSE Sheffield team would like to thank everyone for attending the second Coffee and Cakes Event that was held last Wednesday (31/05/2017). The event provided a great opportunity to hear from researchers all around the University about the software engineering challenges faced within their projects. We hope to use the insights gained from the event to help improve your research workflow in the future.

To get updates on future RSE events, please join our RSE Google Discussion Group.