ATI Data study group

I had the opportunity to attend the Alan Turing Institute (ATI) Data Study Group (22nd-26th May 2017). The ATI is the national institute for data science and as such, it has strong ties to both academia and industry.

The event was a week-long data hackathon in which multiple groups applied their skills to crack the various projects proposed by the industrial partners. Over 5 days, 6 groups worked intensively to deliver feasible solutions to the problems proposed.

The projects

The projects spanned a wide range of topics, each with its own unique challenges and requirements.

  • DSTL presented two projects: the first was aimed at cyber security, identifying malicious attacks from IP traffic data. The second was focused on the identification of places/geographical landmarks so that the team could predict the likelihood of a given event taking place at a given location.
  • HSBC's challenge, unlike the rest of the problems, was not based on the study of a particular data set but on the development of a synthetic dataset that could be used, along with some algorithms, to evaluate users' behaviour.
  • Siemens' project was centred on the study of vehicle traffic data that would enable efficient traffic light and traffic volume control, which would eventually lead to a reduction in carbon emissions.
  • Samsung, being one of the leaders in the smartphone industry, decided to use their collected (anonymised) users' data to analyse gaming behaviour (e.g. which games you might buy/play based on your current gaming habits) as well as to develop a game recommendation engine.
  • Thomson Reuters' challenge was centred on graph analysis. Its primary goal was to identify how positive/negative news about a given company affects other companies/industries within its network, and how far this effect extends.

The Hack

I joined the Thomson Reuters project as it seemed to be one of the projects with the richest data set, both in its extent and type (e.g. news, sentiment analysis, stock market, time series, etc.). The team was made up of 13 people with a huge variety of skills, coming from totally different backgrounds, which is what makes hackathons so exciting. You have to make the most of the skill sets your team has, in a very limited amount of time... pushing you out of your comfort zone.


After a brief team introduction, our 3 Thomson Reuters facilitators described the data and the challenge in more detail. We then identified the main goals of the project and subdivided the team into 4 smaller teams. Once the initial planning was completed, we spent Monday evening through Wednesday morning learning about their various APIs, getting the data, wrangling data... and getting more data.

We soon realised that analysing all the data was incredibly complex and, of course, there was no single correct way to do it. So we had to reduce the scope of the data we were actually going to use and the sort of features/phenomena we were interested in.

The rest of Wednesday and Thursday were spent doing some prediction and regression on the data, as well as writing up the report and finishing off our pitch presentation.


The findings

Certainly, we were able to obtain loads of insight from our data and the various algorithms we used. Some of the most important findings, both expected and unexpected, were:

  • Negative news has a longer-lasting impact on the companies involved and those within their network (20 days, as opposed to the 4-day impact of positive news)
  • Companies are related to each other as affiliates, competitors, parents, etc.; not surprisingly, competitors have the biggest effect on other companies' stock prices
  • Different types of industries react differently to negative/positive news, and the extent of such an impact varies considerably from one industry type to another

Taking home

As with every hackathon I have been to, I ended up feeling absolutely drained, but accomplished at the same time. Hacks are such a good opportunity to meet like-minded people, learn loads of stuff, test yourself, and have fun.

Would I encourage people to go to one of these events? Absolutely! If you are interested in all things data science you should keep an eye on future events from the Alan Turing Institute. If you just want to have a go at hacking for fun, for a cause, or to meet new people, I would suggest you have a look at Major League Hacking. I am sure you will be able to find something near you and for all sorts of interests. Our guys at Hack Sheffield organise a hackathon once a year, so you might want to keep an eye on their activities. For those around the Manchester area, check out Hac100 (formerly Hack Manchester), including the youth and junior hacks.

Will the RSE team at the University of Sheffield organise hacks? We have a crazy amount of work on our plates at the moment, but we are definitely interested in (co)organising hackathons and many other events throughout the year. So keep your eyes peeled!

Mozsprint 2017 at the University of Sheffield


The 1st-2nd of June 2017 saw the Mozilla Global Sprint circle the globe once again. It's Mozilla's flagship two-day community event, bringing together people from all over the world to celebrate the power of open collaboration by working on a huge diversity of community-led projects, from developing open-source software and building open tools to writing curricula, planning events, and more. So here are a few of my own thoughts and reflections on this year's happenings.

Lead up to the sprint

Open Leadership Training mentorship

I joined my first Mozilla Global Sprint last year as the culmination of the Science Lab's inaugural Working Open Workshop and mentorship program. I worked on my very own open science project, rmacroRDM, which I'd spent the lead-up to the Sprint preparing. This year, however, it was a different experience for a number of reasons.

Firstly, the roles were reversed: from mentee, I was now a seasoned open leadership mentor. In fact, I had enjoyed the Open Leadership training program so much that I'd volunteered to mentor on the following two rounds, the first culminating at MozFest 2016 and this latest round at the Global Sprint 2017. Apart from staying connected to the vibrant network of movers and makers that is Mozilla, I also found I got a lot out of mentoring myself: from improving my skills in understanding different people's styles and support requirements, to being introduced to new ideas, tools and technologies by interesting people from all over the world! Overall, I find mentorship a positive-sum activity for all parties involved.

So the lead-up this year involved mentoring two projects while they prepared to launch at the Global Sprint. The Open Leadership Training program involves mentees working through the OLT materials over 13 weeks while developing the resources required to open their projects up, ready to receive contributions. On a practical level, the program teaches approaches to help clearly define and promote the project, and the use of GitHub as a tool to openly host work on the web, plan, manage, track, discuss and collaborate. But the program delves deeper into the very essence of building open, supportive and welcoming communities, in which people with an interest in a tool/cause/idea can contribute what they can, learn and develop themselves, and feel like valued and welcome members of a community.

Weekly contacts with the program alternated between whole-cohort Vidyo call check-ins and more focused one-on-one Skype calls between mentors and mentees. This round I co-mentored with the wonderful Chris Ritzo from New America's Open Technology Institute, and we took on two extremely exciting projects, Aletheia and Teach-R.



Mentee projects:


Headed up by Kade Morton (@cypath), a super sharp, super visionary, super motivated, self-described crypto nerd from Brisbane, Australia, Aletheia doesn't pull any punches when describing its reason for being:


In response, they're building a decentralised and distributed database as a publishing platform for scientific research, leveraging two key pieces of technology: IPFS and blockchain. Many of the technical details are frankly over my head, but I nonetheless learned a lot from Kade's meticulous preparation and drive. Read more about the ideas behind the project here.




What can I say about Marcos Vital, professor of Quantitative Ecology at the Federal University of Alagoas (UFAL), Brazil, and fellow #rstats aficionado, other than that he is a huge inspiration! An effortless community builder, he runs a very successful local study group and has built a popular and engaged online community through his lab's Facebook page, promoting science communication.

The topic of his project Teach-R is close to my heart, aiming to collate and develop training materials to TEACH people to TEACH R. Read more about it here.



Hosting a Sheffield site.

Secondly, this year I helped host a site here at the University of Sheffield, and seeing as the sprint coincided with my first day as a Research Software Engineer for Sheffield RSE, we decided to take the event under our wing. With space secured, and swag and coffee funds supplied by the Science Lab, the local site was ready for action!



The Sprint!

Sprint at the University of Sheffield.

There was a good buzz of activity at the site throughout the sprint, with a few core participants while others came and went as they could. At the very least, roaming participants managed to soak up some of the atmosphere and pick up some git and GitHub skills... a success in my books!

Stuart Mumford (@StuartMumford) led project SunPy, a Python-based open-source solar data analysis environment, and attracted a number of local contributors, including a new PhD student, although, as is often the case, much of the first morning seemed to be spent battling the Python installation on his laptop! Worth it for picking up a local contributor who will hopefully remain engaged throughout his studies though, and the team managed to push on with bug fixes and documentation development.

Jez Cope (@jezcope), our University's Research Data Manager, was contributing to Library Carpentry, one of the biggest and most popular projects at this year's Sprint, and also brought super tasty banana bread. He's also blogged about his experiences here.

As for me, while of course tempted by the many R, open science and reproducibility projects on offer, in the end I chose to work on something unrelated to what I'm lucky enough to do for work and focus on a project I'm interested in personally. So I teamed up with Tyler Kolody (@TyTheSciGuy) on his timely project EchoBurst. The project aims to address our growing, social-media-facilitated retreat into echo chambers, which is resulting in increasingly polarised public discourse and an unwillingness to engage with views we disagree with. The idea is to attempt to burst through such bubbles by developing a browser extension with the potential to distinguish toxic content, more likely to shut down discussion, from more constructive content that might be able to bridge different perspectives.

Admittedly, the project is very ambitious, with a long way to go, many stages, and various techniques/technologies to incorporate, including natural language processing, building the browser plugin, and even considering psychological and behavioural aspects when designing how to present information that might oppose a user's view without triggering the natural shut-down response.

There was plenty of really interesting brainstorming discussion, but the biggest initial challenge, and where the project could use the most help, is in collecting training data. The main approach is for contributors to help collect URLs of blogs on polarising topics from which to scrape content. During the sprint we also added the option for contributors to add relevant YouTube videos to collaborative playlists, and we started working on simple R functions to help scrape and clean the caption content.


Sprint across the globe

What a productive event this year's sprint was! While details of the level of activity have been covered and storified elsewhere, and the final project demos can be found here and here, I just wanted to highlight some basic stats:

Global #mozsprint involved:
  • 65 sites (+ virtual participants)
  • 20 countries
  • 108 projects
During the 50-hour #mozsprint, we saw:
  • 302 pull requests closed
  • 320 pull requests opened
  • 2223 comments & issues
  • 824 commits pushed

BOOM!

(access the full data on github activity here)


Mentee progress

I was really happy to see both our mentees get great responses, pick up new contributors and make good progress on their projects.

  • Marcos expertly moderated a very active gitter channel for Teach-R and attracted a number of excellent and very engaged new contributors, adding a number of new lessons in both English and Portuguese!

  • Kade also got great engagement for Aletheia, including onboarding science communicator Lisa Matthias (@l_matthia), who has already blogged about their plans to take the project forward by applying to present it at this year's Open Science Fair. Importantly, he also managed to attract the JavaScript developer they'd been desperately looking for. Success! You can read more about Kade's experiences of the sprint here.

They both made us very proud indeed!



Highlights

But the most important feature of the sprint for me every year is the global camaraderie and atmosphere of celebration: handing off from one timezone to the next, checking in within our own to hear from leads about their project needs and progress, hanging out with participants from far and wide on Vidyo and through streams of constant messaging on gitter, catching up with friends across the network...



...and cake...sooooooooo much cake!!

Disclaimer: this cake was sadly not at the Sheffield site. It has definitely inspired me to put a lot more effort into this aspect of the sprint next year though!


Final thoughts

The end of the sprint is always a bit sad, but the projects live on, hopefully with a new lease of life. So if, by reading this, you're inspired to contribute, check out the full list of projects for something that might appeal. There's a huge diversity of topics, tasks and skills required to choose from, and fun new people to meet!

The network lives on too, so if you've got an exciting idea of your own that you think would make a good open-source project, make sure to check out @MozOpenLeaders and look out for the next mentorship round.

As for the impact on Sheffield RSE, well, there was one point where we managed to get the full team and loose collaborators working in one room (we're normally spread out across the University). It felt great to work together from the same space, so we decided to make a point of routinely booking one of the many excellent co-working spaces the University of Sheffield has on offer and establishing regular work-together days!

So thanks for the inspiration and excellent times Mozilla! Till the next time!

(i.e. MozFest 2017!)



Sounds:

Apart from the coffee and good vibes, the day was also fuelled by sounds. Here's a couple of the mixes that kept the Sheffield site going!

Grooves no. 1:


Grooves no. 2:




tmux: remote terminal management and multiplexing

Today we have a guide to 'terminal multiplexing', including suggestions on how to use it on computer clusters such as ShARC and Iceberg.


Have you ever?

  • Started a process (such as a compilation or application install) over SSH only to realise that it's taking far longer than you expected and you need to shut down your laptop to go to a meeting, which you know will therefore kill both the SSH connection and your process?
  • Been in a cafe with flaky wifi and had a remote process hang or possibly die due to an unstable SSH connection?
  • Accidentally closed a window with an SSH session running in it and really regretted it?
  • Wanted to be able to switch between multiple terminal sessions on a remote machine without having to establish an SSH connection per session?
  • Wanted to be able to have multiple terminals visible at once so you can, say, edit source code in one terminal whilst keeping compilation errors visible in another?
  • Wanted a nicer way to copy and paste between remote terminal sessions?

If the answer to any of these is "yes" then terminal multiplexing may help!

Making remote Linux/Unix machines easier to administer/use!

First, we need to delve a little deeper into some of the problems we are trying to solve.

Why do my remote processes die when my SSH connection dies/hangs?

(Skip over this section if you want!)

Every process (bar the systemd or init process, which has a process ID of 1) has a parent process. If a process is sent a signal telling it to cleanly terminate (or 'hang up') then typically its child processes will be told to do the same.

When you SSH to a remote machine, the SSH service on that machine creates a shell for you within which you can run commands.

To illustrate, here I logged into a server and used the pstree program to view the tree of child-parent relationships between processes. Notice in the excerpt shown below that the SSH service (sshd) has spawned a (bash) shell process for my SSH session, which in turn has spawned my pstree process:

[will@acai ~]$ ssh sharc
...
[will@sharc-login1 ~]$ pstree -a
systemd --switched-root --system --deserialize 21
...
  ├─sshd -D
  │   └─sshd
  │       └─sshd
  │           └─bash
  │               └─pstree -a
...

So if the SSH service decides that your connection has timed out, it will send a hang-up signal to the bash process; if that bash process dies, any child processes started by it will also die.

If the remote servers you work with are primarily High-Performance Computing (HPC) clusters running scheduling software such as Grid Engine, then you have a simple, robust way of ensuring that the success of your processes doesn't depend on the reliability of your connection to the clusters: submit your work to the scheduler as batch jobs. There are many other benefits to submitting batch jobs over using interactive sessions when using such clusters, but we won't go into those here.

However, what do you do when there is no HPC-style scheduling software available?

  • You could run batch jobs using much simpler schedulers such as at for one-off tasks or cron or systemd Timers for periodic tasks.
  • You could prefix your command with nohup (no hang up) to ensure it continues running if the parent process tells it to hang up.
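For example, a minimal sketch of the nohup approach (the script and log file names here are hypothetical):

nohup ./long_build.sh > build.log 2>&1 &   # keep running even after the shell hangs up
pgrep -f long_build.sh                     # check from any later session that it is still running

The process survives the SSH connection dying, but you can only watch its progress indirectly (e.g. by tailing build.log).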

Neither of these allow you to easily return to interactive sessions though. For that we need terminal multiplexers.

A brief guide to the tmux Terminal Multiplexer

Detaching and reattaching to sessions

Terminal Multiplexer programs like GNU Screen and tmux solve this problem by:

  1. Starting up a server process on-demand, which then spawns a shell. The server process is configured not to respond when told to hang up, so it will persist if it is started over an SSH connection that subsequently hangs/dies.
  2. Starting up a client process that allows you to connect to that server and interact with the shell session it has started
  3. Using key-bindings to stop the client process and detach from the server process.
  4. Using command-line arguments to allow a client process to (re)connect to an existing server process

Demo 1

Here we demonstrate the above using tmux. I recommend tmux over GNU Screen as its documentation is clearer and it makes fewer references to legacy infrastructure. Plus, it is easier to google for! However, it may use more memory than GNU Screen (at least this was true of older versions).

Let's create and attach to a new tmux session, start a long-running command in it then detach and reattach to the session:
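The recording shows roughly the following sequence; here is a minimal sketch (the session name 'demo' and the sleep command are just examples):

tmux new-session -s demo    # create a new session called 'demo' and attach to it
sleep 600                   # inside the session, start a long-running command
                            # press <prefix> d to detach; the command keeps running on the server
tmux ls                     # back in the original shell: list existing sessions
tmux attach -t demo         # reattach to the 'demo' session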

Used keys:

<prefix> d: detach

where <prefix> is Control and b by default. Here <prefix> d means press Control and b then release that key combination before pressing d.

In this case we started tmux on the local machine. tmux is much more useful though when you start it on a remote machine after connecting via ssh.

Windows (like tabs)

What else can we do with terminal multiplexers? Well, as the name implies, they can be used to view and control multiple virtual consoles from one session.

A given tmux session can have multiple windows, each of which can contain multiple panes, each of which is a virtual console!

Demo 2

Here's a demonstration of creating, renaming, switching and deleting tmux windows:

Used keys:

<prefix> ,: rename a window
<prefix> c: create a new window
<prefix> n: switch to next window
<prefix> p: switch to previous window
<prefix> x: delete current window (actually deletes the current pane in the window but will also delete the window if it contains only one pane)
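The same operations can also be driven from a shell prompt inside the session using tmux commands rather than key bindings; a brief sketch (the window names are just examples):

tmux rename-window editing     # rename the current window to 'editing'
tmux new-window -n building    # create a new window named 'building'
tmux next-window               # switch to the next window
tmux previous-window           # switch to the previous window
tmux kill-window               # delete the current window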

Dividing up Windows into Panes

Now let's look at creating, switching and deleting panes within a window:

Used keys:

<prefix> %: split the active window vertically
<prefix> ": split the active window horizontally
<prefix> Up or Down or Left or Right:
  switch to pane in that direction
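Again, there are command-line equivalents you can run from a shell prompt inside the session; a brief sketch:

tmux split-window -h    # split the current pane into two, side by side (the same split as <prefix> %)
tmux split-window -v    # split the current pane into two, stacked (the same split as <prefix> ")
tmux select-pane -L     # move to the pane to the left (-R, -U and -D for the other directions)
tmux kill-pane          # delete the current pane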

Scrolling backwards

You can scroll back up through the terminal history of the current pane/window using:

<prefix> Page Up:
  scroll back through terminal history

Copying and pasting

If you have multiple panes side-by-side and attempt to copy text using the mouse, you'll copy lines of characters that span all panes, which is almost certainly not going to be what you want. Instead you can

<prefix> z: toggle the maximisation of the current pane

then copy the text you want.

Alternatively, if you want to copy and paste between tmux panes/windows you can

<prefix> [: enter copy mode

move the cursor using the arrow keys to where you want to start copying then

space: (in copy mode) mark start of section to copy

move the cursor (using the arrow keys) to the end of the section you want to copy then

enter: (in copy mode) mark end of section to copy and exit copy mode

You can then move to another pane/window and press

<prefix> ]: paste copied text

I find this mechanism very useful.
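If you want to get the copied text out of tmux entirely, the contents of the tmux paste buffer can also be accessed from the shell; a small sketch (the output file name is just an example):

tmux show-buffer                 # print the most recently copied text to the terminal
tmux save-buffer copied.txt      # write it to a file instead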

And there's more

Things not covered in detail here include:

Using tmux on HPC clusters

Terminal Multiplexers can be useful if doing interactive work on a HPC cluster such as the University of Sheffield clusters ShARC and Iceberg (assuming that you don't need a GUI).

On ShARC and Iceberg you can:

  1. Start a tmux or GNU Screen session on a login node;
  2. Start an interactive job using qrshx or qrsh;
  3. Disconnect from and reconnect to the tmux/Screen session (either deliberately or due to an issue with the SSH connection to the cluster);
  4. Create additional windows/panes on the login node for editing files, starting additional interactive jobs, watching log files, etc.
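Put together, a typical session on ShARC might look something like the following sketch (the session name is just an example; qrshx is the interactive job command mentioned above):

ssh sharc                      # connect to a login node
tmux new-session -s hpcwork    # start a tmux session on the login node
qrshx                          # from within tmux, request an interactive job on a worker node
# ... work interactively, then detach with <prefix> d or simply lose your connection ...
ssh sharc                      # later: reconnect (to the same login node)
tmux attach -t hpcwork         # reattach and pick up where you left off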

Starting tmux on worker nodes is also useful if you want to have multiple windows/panes on a worker node, but less useful if you want to disconnect/reconnect from/to a session: if you run qrsh a second time you cannot guarantee that you will be given an interactive job on the node where you started the tmux session.

However, note that you can have nested tmux sessions (with <prefix><prefix> <key> used to send tmux commands to the 'inner' tmux session).

Warning: many clusters have multiple login nodes for redundancy, with only one being the default active login node at any given time. If the active login node requires maintenance then logged-in users may be booted off and long-running processes may be terminated (before the system administrator makes a 'standby' login node the currently active one). Under such circumstances your tmux/Screen session may be killed.

Being a good HPC citizen

Your interactive job (on a cluster worker node) will be terminated by the cluster's Grid Engine job scheduler after a fixed amount of time (the default is 8 hours), but your tmux/Screen session was started on a login node, so it is outside the control of the scheduler and will keep running indefinitely unless you kill it.

Each tmux/Screen session requires memory on the login node (which is used by all users) so to be a good HPC citizen you should:

  • Kill your tmux/Screen session when no longer needed (tmux/Screen will exit when you close all windows)
  • Only start as many tmux/Screen sessions on the login node as you need (ideally 1)
  • Exit your interactive Grid Engine job (on a worker node) if it is no longer needed, so that others can make use of the resources you had been using on that node.

Tip: with tmux you can ensure that you either reconnect to an existing session (with a given name) if it already exists or create a new session using:

tmux new-session -A -s mysession

This should help avoid accidentally creating more than one tmux session.
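If you find yourself typing this a lot, you could wrap it in a shell alias in your ~/.bashrc (the alias and session names below are just examples):

alias work='tmux new-session -A -s mysession'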


NB the recordings of terminal sessions shown were created using ttyrec and ttygif then converted to .webm videos using ffmpeg.

Software Carpentry and Data Carpentry at the University of Sheffield!

The University of Sheffield is now a Software Carpentry Partner Organisation, allowing the Research Software Engineering and Library teams to start organising Software Carpentry and Data Carpentry workshops. These are designed to help researchers develop the programming, automation and data management skills needed to support their research. Workshop dates are to be announced shortly.


Edit:

Our first Software Carpentry workshop is scheduled for 16th and 17th August!


Software Carpentry and Data Carpentry logos

Addressing the training needs of researchers with regards to programming

As more researchers realise they can produce better-quality research more quickly if they have some coding and data management skills under their belts, universities will need to ensure that training in these areas is accessible to those that need it.

Academic institutions will most likely already have courses for teaching highly-specialist subjects (such as how to use the local HPC cluster) but for the more generic aspects of research software development and data management there are several obvious choices:

  • Develop and deliver bespoke materials;
  • Buy in to commercial training packages;
  • Point researchers towards free online resources;

However, there is also a fourth option: team up with the Software Carpentry (SC) and Data Carpentry (DC) not-for-profit organisations to deliver on-site, interactive workshops based on open-source materials that have been refined by a large community of SC and DC instructors.

Software whatywhaty?

Software Carpentry has developed discipline-agnostic workshop material on topics such as the Unix shell, version control, programming (e.g. in Python or R) and databases.

Data Carpentry lessons look at data management and processing within the context of a specific domain (such as ecology or genomics), focussing on areas such as:

  • the command line;
  • data cleaning and filtering using OpenRefine;
  • data processing and visualisation with Python or R;
  • cloud computing;
  • GIS.

What form do the workshops take?

The Software and Data Carpentry organisations ask that accredited instructors delivering 'branded' workshops adopt a fairly progressive teaching style:

  • Workshops typically last two days and include four lessons (e.g. the unix shell, Python, version control and databases).
  • There's lots of live coding: the instructor and students gather together in a room with laptops and a projector and all present go through a number of examples interactively. Students use their own laptops to ensure that they're able to continue where they left off at the end of a workshop. Instructors can and do make mistakes when doing live coding; students can then learn from these mistakes and may grow in confidence on learning that pros make mistakes too.
  • Instructors try to elicit responses from students and use quizzes to gauge comprehension and keep students focussed.
  • Software Carpentry has a code of conduct and tries to ensure that all lessons delivered under its banner are as inclusive as possible.

What's happening at the University?

The University is now a Software Carpentry Partner Organisation so can run many workshops per year using the Software Carpentry and Data Carpentry branding. We could run workshops without the branding but Software and Data Carpentry are now familiar names to researchers (and potentially employers) and by working closely with those two organisations we become part of a global network of instructors with which we can share ideas and materials.

The RSE team and Library collectively now have five accredited Software and Data Carpentry instructors: Mike Croucher received training some time ago, and in March, Tania Allard and I from the RSE team, plus Jez Cope and Beth Hellen from the Library's Research Services Unit, participated in instructor training in Oxford.

Software Carpentry Instructor Training session

The four of us spent two days learning about the SC/DC teaching style and what makes for an effective instructor, and got to practise several aspects of workshop development and delivery. I must thank the instructors on the training course (Mateusz Kuzak and Steve Crouch) plus Reproducible Research Oxford for hosting and organising the event.

We are now planning our first Software Carpentry and Data Carpentry workshops. These are to be held later in the summer.

Keep an eye on this blog, the RSE-group@sheffield.ac.uk mailing list and @RSE_Sheffield for dates!

Coffee and Cakes Event

The RSE Sheffield team would like to thank everyone for attending the second Coffee and Cakes Event that was held last Wednesday (31/05/2017). The event provided a great opportunity to hear from researchers all around the University about the software engineering challenges faced within their projects. We hope to use the insights gained from the event to help improve your research workflow in the future.

To get updates on future RSE events, please join our RSE Google Discussion Group.