SSI Fellowship success for Sheffield

The Software Sustainability Institute (SSI) is a cross-council funded group that supports the research software community in the UK. It has championed the role of the Research Software Engineer and has led national and international initiatives in the field.

One of the most popular activities undertaken by the SSI is their fellowship program. This competitive process provides an annual cohort of fellows with £3,000 to spend over fifteen months on a project of their choice. Competition for these fellowships is fierce! Just like larger fellowships, applicants must get through a peer-reviewed application process that includes written proposals and selection days.

I am extremely happy to report that Sheffield has won not just one, but three SSI Fellowships this year. The only institution to match us was UCL, home of one of the first RSE groups in the country. Here's a brief statement from each Sheffield fellow explaining how they plan to use their funds:

Tania Allard

Nowadays, the majority of research relies on software to some degree. However, in many cases there is little focus on developing scientific software using best development practices, for a number of reasons: the lack of adequate mentoring and training, little understanding of the requirements of scientific code, and software being undervalued or not considered a primary research output. This has changed over time with the emergence of RSEs (Research Software Engineers) just like myself, but certainly not every university or institute has an RSE team, nor is every discipline represented in the current RSE community. I plan to use this fellowship to develop an RSE winter school covering not only technical skills but also some of the craftsmanship and soft skills needed when developing a significant amount of scientific code. This winter school will also help to diversify the RSE pool by focusing on underrepresented groups within the community (e.g. gender, age, scientific disciplines, universities without RSEs) while disseminating best software practices across a number of disciplines.



Becky Arnold

I'm planning to use the fellowship funds to bring external speakers in to talk to the astrophysics group, with the goal of improving the style, efficiency and sustainability of our coding. As physicists, as I imagine in many fields, we are largely taught to code to get the things we need done as quickly as possible, with little regard for the quality of the code itself. We are taught how to code, but not how to code well. I want to give us the opportunity to improve on that. I also hope to change the way we think about coding: from a disposable stepping stone used to further research as quickly as possible, to a fundamental part of the science itself.



Adam Tomkins

I am part of the Fruit Fly Brain Observatory project, which aims to open up neurological data to the community in an accessible way. Part of the issue with open data sharing is the vast number of custom storage and format solutions used by different labs. With this fellowship, I will be holding training events for both biologists and computational modellers on how to use the latest open data standards, demonstrating how open software can add instant value to data through a larger community of tools and platforms.


RSE at Sheffield

When we set up the Sheffield RSE group, one of our aims was to help cultivate an environment at Sheffield where research software was valued. We do this by providing training events, writing grants with academics, consulting with researchers to improve software, improving the HPC environment and anything else we can think of. Of course, correlation does not imply causation but we like to believe that we helped our new SSI Fellows in some way (the SSI agrees) and we are very happy to bask in their reflected glory.

Code in the Academy: Rebecca Senior

In this interview series, the RSE team talk to University of Sheffield students about the role of coding in their research.

In our first interview of the series, we speak to Rebecca Senior.

She's in the final year of her PhD in the Department of Animal and Plant Sciences, studying the interactions between land-use and climate change in tropical rainforests and what this means for biodiversity conservation. She's currently looking into how deforestation affects forest connectivity, so lots of spatial analyses. She mainly codes in R but also dabbles in a bit of Python.


How did you first get into coding?

What motivated you to learn? How and what did you start learning?

When I was an undergrad we did our statistics in R commander, which is a GUI (Graphical User Interface) for R. A supervisor wisely told me that there’d come a point when I couldn’t do what I needed using R commander alone, so I spent the summer before my final year grappling with R and cursing it profusely until I was somewhat competent.

What are your favourite coding tools?

I’m pretty in love with the R + RStudio + tidyverse combo. RStudio is an integrated development environment (IDE) for R, which basically makes coding look far less hideous, and which allows you to write, save and run code in a more efficient way. The tidyverse is “an opinionated collection of R packages designed for data science”. The various packages make data management/analysis/presentation much more intuitive for many people.

How do you think coding has helped you in your work?

For one thing it got me an internship at UNEP-WCMC (UN Environment World Conservation Monitoring Centre) after I finished my undergrad, and it subsequently helped me to get this PhD position! More fundamentally, coding has sped things up, enhanced the reproducibility of my work and my ability to collaborate with others, and has helped me tackle complex problems that I couldn’t have done manually.

Tell us about your favourite coding achievement.

I wrote a teeny R package to estimate the time of sunrise and sunset based on date and location. It’s actually a really simple implementation of solar calculations developed by NOAA for MS Excel, but it was instrumental to one of my thesis chapters and it was my first experience of making an R package. Check it out here: https://github.com/rasenior/SolarCalc!

How do you think these skills can be better supported in academia?

Since all researchers (students and supervisors alike) are judged primarily on their publication output, encouraging students to publish software would be an obvious place to start. In my field, the journal Methods in Ecology and Evolution is a very popular option for people seeking to publish R packages.

That said, not all coding results in something publishable and, in any case, the sharing of software via peer-reviewed publications is not always a good measure of its usefulness. I think students should be encouraged to share their coding achievements with peers, and more broadly via online platforms such as GitHub and Gist. Software has made its way into academic impact reporting, so perhaps coding should also be more valued within progress reports and theses?

It would be great to see the teaching of coding broaden beyond statistics, especially within the life sciences. There is so much more to coding than conducting t-tests! With continuing advances in technology we have to grapple with much bigger and more varied datasets, analyse them in sometimes very complex ways, and present the methods and results in a clear and succinct format, all the while maintaining reproducibility as much as possible. That’s a whole heap of coding skills that are very infrequently taught!

How do you see coding fitting in with your future career?

Whether I stay in academia or not, I will continue coding. I hope that my coding skills will help me secure a research position post-PhD. I’m not sure yet exactly where my research will take me, but I hope it involves developing R packages and making pretty figures in ggplot.

Any coding advice for new PhD students?

Don’t be afraid to set aside time for learning something new. Learning takes time – accept that and incorporate it into your work schedule. You’re still a student and some of the skills you learn may open doors you didn’t even know were there.

CodeFirstGirls meets Hacktoberfest

It is that time of the year again! The autumn-winter courses for CodeFirst: Girls are in full swing at Sheffield, Manchester and many other locations all over the UK.

As the lead instructor of the Python courses, part of my 'job' is to make sure that everything runs smoothly and that the gals make the most of the course. Since the courses run over only 8 weeks and we have loads of ground to cover, I decided to improve the way the instructors communicate and plan the course, as well as how we deliver its content.

Implementation

These are some of the approaches we are currently using in our courses:

  • GitHub: I use Git and GitHub all the time for all my projects and tasks, so it only made sense to make it a central point of contact as well as the main place for all the additional material to be kept. It has worked wonders: all of the organisation stuff is there, the additional materials we develop are peer-reviewed, and it makes all of our lives easier.
  • GitKraken: ok ok, I know many people would prefer teaching Git using the command line, but I have used both command line and GUI approaches and I think you first need to know your audience to understand which approach to use. In this case GitKraken was my weapon of choice... powerful, intuitive, easily integrated with GitHub, BitBucket and GitLab, and did I say beautiful? Yes, which makes it suitable for visual learners.
  • Learning by doing: I am a firm believer in learning by doing. It is sometimes the best approach to get to grips with things. How do you make the git-add-commit-push workflow a natural habit? Exactly: by doing it over and over again. So we make sure that every session includes bits and pieces where the gals have to write some code, push it to their repos, collaborate with others and/or create pull requests.
  • Feedback on the fly: As a Software/Data Carpentry instructor, one of the things I love most is the use of post-its. That way you and the helpers know straight away who is struggling (red post-it) and who is not (green post-it), so the learners get help instantly and the main instructor gets visual cues on how fast or slow to proceed. At the end of the day the learners write on the post-its something they liked or learned and something that could be improved or that they struggled with. So I decided to give this a go at CFG and it has helped us a lot so far.
  • Active engagement: one of the key things that makes initiatives such as CodeFirstGirls work is not the fact that we teach them how to code; online courses do that. The whole thing is an excellent community-building activity: the gals find like-minded people, are exposed to role models, and feel empowered to continue their career in tech. That is the beauty of what we do. So it is only fair that we engage with them. We have our own #hashtag (go now and look for #ShefCodeFirst on Twitter), we have guest speakers, Slack channels, our very own course website (obvs on GitHub Pages) and we try to open their eyes to the wider tech and open source community. Also, we have many Octocat stickers to give away!!!


Hacktoberfest

I mentioned before that we try to keep the gals actively engaged throughout the course, as well as to integrate them into the wider community. And what better way to do this than getting the gals involved in Hacktoberfest!!! We were a bit tight on time, but I thought it was worth trying to get some of the girls involved in something like this.

By doing so the girls would get the following benefits:

  • Learn how to contribute to open source projects
  • Integrate to the open source community
  • Get extra coding practice
  • Get extra git practice (4 Pull Requests were needed to complete this)
  • If they completed it, they would get a special edition t-shirt (whoop whoop!)

That meant extra work for me: finding specific tasks and projects for them to contribute to, a pull-request-merging bonanza, and preparing extra gifs and guides on how to complete the tasks. But it was totally worth it!!! I was more than delighted to see all the PRs coming into our own repo, as well as getting all the notifications from the girls getting involved in Hacktoberfest.

I know not everyone got involved as many have PhDs, Master's, dissertations, and a life to look after. But I am massively proud of them all. So many of our gals had never used Git or GitHub before and now they are collaborating like pros.

Talk about motivation :) And if you want to keep up to date with the end-of-course projects they will be presenting in 5 weeks' time, keep an eye on Twitter!

ReproHacking at Opencon London 2017 Doathon

Building on the success of last year’s #Reprohack2016 for the Berlin OpenCon satellite event, I rejoined the team of organisers (Laura Wheeler, Jon Tennant and Tony Ross-Hellauer) and teamed up with Peter Kraker to develop the hackday for this year’s OpenCon London 2017.

To allow better reflection on this year's theme, “Open for what?”, we expanded the format to two tracks, opening up the scope for both projects and participants. One track retained the ReproHack format from last year; the other, broader track offered leads of any type of open science project the opportunity to submit it for work. Projects were not constrained to coding and you didn't have to code to take part in the session - anyone with an interest in contributing creatively to open science, in whichever capacity, was welcome.

On the day, after a round of introductions and sharing our motivations for attending, we reviewed the submissions and knuckled down.


ReproHacking

The original ReproHack was inspired by Owen Petchey’s Reproducible Research in Ecology, Evolution, Behaviour, and Environmental Studies course, where students attempt to reproduce the analysis and figures of a paper from the raw data, so we wanted to attempt the same. That course runs over a number of sessions spanning a few months though, so, given our time constraints, we focused on reproducing papers that have also published the code behind them. This year we had a whole day, which gave us more time to dig deeper into the materials, moving beyond evaluating them for reproducibility to how understandable, even how reusable, they were. While fewer in number this year, we still had some excellent submissions to choose from.

I'm pleased to report that two of the three papers attempted were successfully reproduced!


***

I was particularly impressed with the paper Andrew Ajube tackled:

The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles, Piwowar et al.

Under very minimal supervision, and never having used R or RStudio before, he managed to reproduce an analysis in an rmarkdown vignette. I think this speaks volumes to the power of the tools and approaches we have freely available to us, to the value of following best practice (well described in this epic Jenny Bryan blog post on project-oriented workflows), to the effort the producers went to, and, of course, to genuine engagement by Andrew. It can work and it is rewarding!


***

I worked with Marios Andreou on a very well curated paper, submitted by Ben Marwick.

The archaeology, chronology and stratigraphy of Madjedbebe (Malakunanja II): a site in northern Australia with early occupation, Clarkson et al.

The paper offered two options to reproduce the work. The first was a completely self-contained archived version in a Docker container, which Marios spun up and reproduced the analysis in, in no time. I opted for the second option: installing the analysis as a package. It did require a bit of manual dependency management, but this was documented in the analysis repository README on GitHub. This meant that all the functionality developed for the analysis was available for me to explore. Presenting the analysis in a vignette also made its inner workings much more penetrable and allowed us to interactively edit it to get a better feel for the data and how the analysis functions worked. Ultimately, not only could we reproduce the science in the paper (open for transparency), we could also satisfy ourselves with what the code was doing (open for robustness) and reuse the code (open for reuse). The only step further would be to make the functionality more generalisable.

At the end of the session we collected some feedback about the experience, reflecting on reproducibility, the tools and approaches used, documentation and reusability. Here's the feedback we had for Ben's work.

As calls for openness mature, it's good to push beyond why open? to, indeed, open how? open when? open for what? Different approaches to how a study is made "reproducible" have implications for what the openness can achieve downstream. It's probably a good time to start clarifying the remit of different approaches.


Do-athoning

On the do-athon side there were a couple of cool projects. Peter and Ali Smith worked on Visual data discovery in Open Knowledge Maps, adapting their knowledge mapping framework, Head Start, to the specific requirements of data discovery.

Tony, Lisa Mattias and Jon continued work on the open-to-anyone, hyper-collaborative drafting of the "Foundations for Open Science Strategy Development" document, started at the OpenCon 2017 Do-athon.


Agile hacking

One thing I love about hacks is that you never know quite what skills you’re gonna get in the room. In our case, we got scrum master Sven Ihnken offering to help us navigate the day. We’ve actually been working agile for the past few months with the nascent shef dataviz team and I find it a productive way to work, so an agile hack seemed a worthy experiment. I personally thought it worked really well. It was nice to split the day into shorter sprints and review progress around the room halfway through. And Sven did a great job “buzzing” around the room, keeping us focused and engaged and, ultimately, getting all our tasks from doing to done!

***


At the end of the day, we shared what we'd worked on, and settled in for the main Opencon London evening event. As the event talks went through more traditional remits of OpenCon, from public engagement to open access to literature and data, it just reiterated to me that each strand of openness is yet another way to invite people in to science. For me, inviting people all the way in, into your code and data, your entire workflow, is the bravest and most rewarding of all!

A successful 2nd RSE conference

RSE Sheffield in the 2nd RSE conference

RSE17

The second RSE conference took place on the 7th and 8th of September 2017 at the Museum of Science and Industry (MOSI). There were over 200 attendees, 40 talks, 15 workshops, 3 keynote talks (one of which was given by our very own head honcho Mike Croucher; slides here), and geeky chats galore.

RSE team members Mozhgan and Tania were involved in the organising committee as talks co-chairs and diversity chair (disclosure: they had nothing to do with Mike's keynote). Also, all of the RSE Sheffield team members made it to the conference, which seems to be a first given the diverse commitments and project involvements we all have.

Once again, the event was a huge success thanks to the efforts of the committee and volunteers, as well as the amazing RSE community that made this an engaging and welcoming event.

Conference highlights

With so many parallel sessions, workshops, and chats happening all at the same time, it is quite complicated to keep track of every single thing going on. And it seems rather unlikely that this will change, as it was evident that the RSE community has outgrown the current conference size. So we decided to highlight our favourites of the event:

  • The talk on 'Imposter syndrome' by Vijay Sharma: Who in the scientific community has never experienced this? Exactly! So when given the chance, everyone jumped into this talk full of relatable stories and handy tips on how to get over it.

  • Another talk that seemed to have gathered loads of interest was that of Toby Hodges from EMBL on community building. This came as no surprise (at least to me), as RSEs often act as community builders or as a bridge between collaborating communities, as opposed to just focusing on developing software and pushing it into production.

  • During the first day the RSEs had the chance to have a go at interacting with the Microsoft Hololens. There was a considerable queue to have a go at this, and unfortunately, we were not among the chosen ones to play with this. Maybe in the future.

  • My hands-on workshop on 'Jupyter notebooks for reproducible research'. I was ecstatic that the community found this workshop interesting: I had to run it twice!!!

  • Also, I'd like to casually throw in here that I have been elected as a committee member for the UK RSE association, so expect to read more about this in this blog.

For obvious reasons I missed most of the workshops but Kenji Takeda's workshop on 'Learn how to become an AI Super-RSE' was another favourite of the delegates as this was run twice too!

Our workshop on Jupyter notebooks for reproducible research

Being an RSE means that I serve as an advocate of sustainable software development. Also, as I have discussed here before, I am greatly concerned about reproducibility and replicability in science. This, I might add, is not an easy task to embark on. Thankfully, there are loads of tools and practices that we can adopt as part of our workflows to ensure that the code we develop follows the best practices possible and, as a consequence, can support science accordingly.

Naturally, as members of the community come up with more refined and powerful tools in the realm of scientific computing, we (the users and other developers) adopt some of those tools, which means we often end up modifying our workflows.

Such is the case with Jupyter notebooks. They brought to life a whole new era of literate programming, where scientists, students, data scientists, and aficionados can share their scripts in a human-readable format. More importantly, they transform scripts into a compelling scientific narrative, where functions and loops are followed by their graphical outputs, or allow the user to interact via widgets. This ability to openly share whole analysis pipelines is, for sure, a step in the right direction.

However, the adoption of tools like this brings not only a number of advantages but also a number of challenges and integration issues with previously developed tools. For example, traditional version control tools (including diff and merge tools) do not play nicely with notebooks. Also, notebooks have to be tested like any other piece of code.

During the workshop, I introduced two tools, nbdime and nbval, which were developed as part of the European-funded project OpenDreamKit. These tools introduce much-needed version control and validation capabilities for Jupyter notebooks, addressing some of the issues mentioned above.

So, in order to cover these tools as well as how you would integrate them into your workflow, I divided the workshop into three parts: diffing and merging of notebooks, notebook validation, and a brief 101 on reproducibility practices.

Notebooks diffing and merging

During the first part of the workshop the attendees shared their experiences using traditional version control tools with Jupyter notebooks... unsurprisingly everyone had had terrible experiences.

Then they all had some hands-on time learning how to use nbdime for diffing and merging, both from the command line and via its rich HTML-rendered views (completely offline). As we progressed with the tutorial I could see some happy faces around the room, and they all agreed that this was much needed.
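
If you want to try this at home, here is a minimal sketch of the kind of commands we practised (it is not the workshop material itself, and the notebook file names are made up for illustration):

```python
# Minimal sketch of driving nbdime's command-line tools (nbdiff, nbdiff-web,
# nbmerge) from Python. The notebook file names are hypothetical.
import subprocess

# Content-aware, cell-by-cell diff of two notebook versions, printed to the terminal
subprocess.run(["nbdiff", "before.ipynb", "after.ipynb"])

# The same diff rendered as rich HTML in the browser (works offline)
subprocess.run(["nbdiff-web", "before.ipynb", "after.ipynb"])

# Three-way merge of two divergent copies of a notebook
subprocess.run(["nbmerge", "base.ipynb", "local.ipynb", "remote.ipynb",
                "--out", "merged.ipynb"])
```

nbdime can also register itself as git's diff and merge driver, so that a plain `git diff` on a notebook goes through it; the nbdime documentation covers the one-off setup.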

Need more convincing? Just earlier this week this tweet showed up in my feed:

Notebooks validation

The second part of the workshop focused on the validation of notebooks. And here I would like to ask this first: how many of you have found an amazing notebook somewhere on the web, only to clone it and find that it just does not work: dependencies are broken, functions are deprecated, and you can't tell if the results are reproducible?

I can tell you, we have all been there. And in such cases nbval is your best friend. It is a py.test plugin that determines whether execution of the stored inputs matches the stored outputs of the .ipynb file, whilst also ensuring that the notebooks run without errors.
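
As a taster, this is roughly how nbval is invoked (a sketch assuming pytest and nbval are installed; the notebook path is hypothetical):

```python
# Sketch of running nbval programmatically; equivalent to the shell commands
# `pytest --nbval docs/tutorial.ipynb` and `pytest --nbval-lax docs/tutorial.ipynb`.
import pytest

# Re-execute every code cell and compare the fresh outputs with the outputs
# stored in the .ipynb file.
pytest.main(["--nbval", "docs/tutorial.ipynb"])

# Laxer mode: only check that the cells run without errors, ignoring whether
# the stored outputs still match.
pytest.main(["--nbval-lax", "docs/tutorial.ipynb"])
```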

This led to an incredible discussion on its place within conventional testing approaches. Certainly, it does not replace unit testing or integration testing, but it could be seen as a form of regression testing for notebooks. Want to make sure that your awesome documentation, built from Jupyter notebooks, is still working in a few months' time? Why not use CI and nbval?

Wrapping up

The closing to the workshop was a 101 on working towards reproducible scientific computing. We shared some of our approaches for reproducible workflows and encouraged the delegates to share theirs. We covered topics such as valuing your digital assets, licensing, automation, version control and continuous integration, among others.

The perfect close to a great RSE conference!


Just a few more things

Let me highlight that all the materials for the workshop can be found at https://github.com/trallard/JNB_reproducible and that all of it is completely self-contained in the form of a Docker container.

If you missed out on the conference and would like to see the videos and slides of the various talks do not forget to visit the RSE conference website.


Iceberg vs ShARC


TL;DR Around 100 of Iceberg's nodes are ancient and weaker than a decent laptop. You may get better performance by switching to ShARC. You'll get even better performance by investing in the RSE project on ShARC.

Benchmarking different nodes on our HPC systems

I have been benchmarking various nodes on Iceberg and ShARC using matrix-matrix multiplication. This operation is highly parallel, heavily optimised these days, and a vital operation in many scientific workflows.

The benchmark units are GigaFlops (billions of floating-point operations per second), and higher is better. Here are the results for a maximum matrix size of 10000 by 10000, sorted worst to best:

According to the Iceberg cluster specs, over half of Iceberg is made up of the old 'Westmere' nodes. According to these benchmarks, these are almost 4 times slower than a standard node on ShARC.
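
If you would like to check the numbers on a node you have access to, a rough sketch of the measurement in Python/NumPy is below (the results above were produced separately, so treat this purely as an illustration of how the GigaFlop figures are derived):

```python
# Rough sketch of a dense matrix-matrix multiplication benchmark.
# An n x n by n x n multiply costs roughly 2*n**3 floating-point operations,
# so GFlops = 2*n**3 / time / 1e9. Note: n = 10000 needs a few GB of RAM.
import time
import numpy as np

n = 10000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b                      # dispatched to the node's BLAS library
elapsed = time.perf_counter() - start

gflops = 2.0 * n**3 / elapsed / 1e9
print(f"{n} x {n} multiply took {elapsed:.2f} s (~{gflops:.1f} GFlops)")
```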

The RSE project - the fastest nodes available

We in the RSE group have co-invested with our collaborators in additional hardware on ShARC to form a 'Premium queue'. This hardware includes large-memory nodes (768 Gigabytes per node - 12 times the amount that's normally available), advanced GPUs (a DGX-1 server) and 'dense-core' nodes with 32 CPUs each.

These 32 core nodes are capable of over 800 Gigaflops and so are 6.7 times faster than the old Iceberg nodes. Furthermore, since they are only available to contributors, the queues will be shorter too!

Details of how to participate in the RSE-queue experiment on ShARC can be found on our website.

What if ShARC is slower than Iceberg?

These benchmarks give reproducible evidence that ShARC can be significantly faster than Iceberg when well-optimised code is used. We have heard some unconfirmed reports that code run on ShARC can be slower than code run on Iceberg. If this is the case for you, please get in touch with us and give details.

Sheffield R Users group celebrates Hacktoberfest


We'll be honest here and say that our Sheffield R Users group Hacktoberfest celebrations started as a last-minute stroke of inspiration. As our standard first-Tuesday-of-the-month meetup approached, our speaker lineup for October was thin. At the same time, I'd spent the last month mentoring as part of the Mozilla Open Leadership program again, which was gearing up to have projects participate in Hacktoberfest, a global month-long celebration of open source, organised by Digital Ocean and designed to get people engaged with projects hosted openly on GitHub. For those unfamiliar with the platform, GitHub is one of many code repositories where open projects live, allowing anyone to copy, modify and even contribute back to open source projects, many of which depend on such volunteer contributions. As it takes a village to raise a child, so it takes a small village to build, maintain, continue to develop and support the users of a successful open source project, where even small non-technical contributions, for example to documentation, can be a huge help to maintainers (see Yihui Xie's (of knitr fame) blog post on this).

So what better way to entice folks to get involved than the promise of stickers and a free t-shirt on completion of the Hacktoberfest challenge! And the challenge? Simple. Make four contributions (pull requests) to any open source project on GitHub between the 1st and 31st of October. And the contribution can be anything — fixing bugs, creating new features, or updating and writing documentation. Game on!


Many project owners had labelled up specific issues, and we noticed there were many rOpenSci projects in need of some #rstats help.

Given that doing is the best way to learn, and that working on problems outside our daily routines can be a great distraction, we thought it'd be a great idea to skip the standard talk meetup format for October and instead opt for some hands-on Hacktoberfest action! It would also give any of our R users who were curious, but had no previous experience with GitHub and open source, the opportunity to learn through practice, in a friendly space where they could get help with any questions or uncertainties. Working the details through on Twitter (as you do!), an exciting plan emerged... not only would we extend to holding weekly sessions throughout the whole month, we would end with a special Halloween celebratory session!


Kick off meetup - briefing session

At the kick off meetup, fellow Sheffield R Users Group co-organisers Tamora James (@soaypim) and Mathew Hall (@mathew_hall) introduced participants to the general ideas and principles of open source, discussed contributing to open projects, introduced GitHub and walked through scanning issues (seeing what things need doing in a particular project), forking repositories (making a copy of the materials associated with a project) and making pull requests (sending contributions... yes, it was all Greek to me in the beginning too... and I'm Greek!). Given the short time we had to prepare for the session, the materials provided by Digital Ocean on their Hacktoberfest event kit page were an invaluable resource and we can easily recommend them as a great introduction to contributing to open source. Of the 8 folks that made it to the session, 3 would go on to contribute pull requests over the month.


The sessions

Admittedly, when you work at a computer all day, spending another 3 hours at your screen voluntarily is probably not everyone's top choice. But I personally found the opportunity to carve out some time to explore the huge variety of projects, and the diverse ways in which to get involved, engaging and in some ways quite relaxing. The great collaborative spaces available for booking at the University of Sheffield, the informal setting and hanging out with friends made the sessions something I actually looked forward to. And the "no pressure" aspect of voluntary contribution meant I was free to play around, follow my own curiosity and explore things I was interested in but don't necessarily get the time to work with during my normal working day. Indeed, some participants came along to make use of the company and learn some new things not necessarily related to the Hacktoberfest challenge. So collaborative, no-pressure spaces can be really useful for sharing knowledge.


Halloween R-stravaganza

Finally it was time for the closing event, our Halloween Hacktoberfest special! Excitement was building from the day before when Tamora and I spent the evening carving our too cute to be scary octocat :heart: spooky R pumpkin!


We also got some candy donations from Sheffield RSE and a special guest, Raniere Silva (@rgaiacs), who came all the way from Manchester to join us (although technically it had been his idea after all). The stage was set for a fun finale!


Success!

While we all got our t-shirts, I was really impressed with Tamora and Raniere's contributions, and their approach served as the biggest take-away for me. They both focused on a problem or feature that would improve a tool they were already interested in or used. They got feedback on their suggestion before they even began, by opening an issue on GitHub and interacting with the project's owners about their idea. That meant their efforts were well focused and much more likely to be accepted.




My t-shirt in the end was mainly earned by helping with typos. For Hacktoberfest, the size of your contribution doesn't matter as long as you send a contribution. And finding typos is actually non-trivial and time-consuming due to our brain's auto-correct feature. Sadly, the coding pieces that I worked on over the sessions did not end up making the cut to submit as a functional pull request yet (there'll be a personal blog about my experience during Hacktoberfest coming soon instead). Mostly, however, I loved the experience and am already looking forward to organising it next year!

Thinking ahead, 3 things I'd do differently in 2018 would be:

  • Reach out to more organisations: There's a great variety of clubs and meetups at the University and more widely in Sheffield that could be interested in joining forces for a Hacktoberfest event. This would give us R users an opportunity to interact with users of other tools and potentially even tackle issues requiring mixed skills as teams.
  • Start planning earlier! This would give us an opportunity to advertise better leading up to the kick-off session and allow us to co-ordinate with other groups.

  • Run a git & GitHub clinic before the first hack session: This would give folks who have not used GitHub before the opportunity to get some experience and confidence before turning up to a hack session.

So long #Hacktoberfest! See you in 2018!



Pumpkin Carving session and Halloween special powered by:

New Group Member: Phil Tooley

I am thrilled to have joined the Research Software Engineering team at Sheffield. My new role is attached to the INSIGNEO Institute for in silico Medicine here at the University of Sheffield, developing image registration software as part of the CompBioMed project.

About Me

I am a former theoretical and computational physicist with particular interests in mathematical modelling and code optimisation. I am also a stalwart champion of the use of the SciPy stack for scientific computing, and enjoy trying to make Python speed-competitive with C and Fortran (spoiler alert: it totally can be!).

Although I am a user of both C(++) and Fortran, I find that Python is often a better choice for many tasks, even in high-performance scientific computing. As well as the huge array of fast mathematical libraries available to Python, it is also possible to write custom routines in pure (Numba) or nearly-pure (Cython) Python code and compile them to native machine code. This can give Python equivalent performance to traditional C or Fortran codes, and opens up the possibility of running Python code efficiently on HPC platforms. (More to come on this topic...)
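
As a small taster of what I mean (a toy sketch rather than code from one of my projects), here is a plain Python loop compiled to machine code with Numba:

```python
# Toy example: a plain Python loop compiled to native machine code with Numba.
import numpy as np
from numba import njit

@njit  # compiled on first call, then runs at native speed
def sum_of_squares(x):
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total

x = np.random.rand(1_000_000)
print(sum_of_squares(x))                      # first call pays the compilation cost
print(np.allclose(sum_of_squares(x), x @ x))  # agrees with the NumPy/BLAS result
```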

Outside the office you will usually find me rock climbing or hiking somewhere.

My previous work

For the last 4 years I have been working on my PhD in Theoretical and Computational Plasma Physics at the University of Strathclyde, studying a novel accelerator technology known as Laser Wakefield Acceleration (LWFA). This involves firing a short \((30\, \mathrm{fs})\) but highly intense \((10^{25}\mathrm{W/m^2})\) laser pulse into a jet of helium gas. The laser ionises the gas to a plasma and drives an electrostatic plasma wave with intense electric fields, which can accelerate electrons to Gigaelectronvolt energies over just a few mm (compared to conventional accelerators such as SLAC, which have to be kilometres in length to achieve the same energies).

My research was into methods of controlling and improving the performance of these accelerators, and relied heavily on particle-in-cell (PIC) codes. These are massively parallel codes designed to run on HPC systems, and I spent a lot of my time developing code extensions and analysis tools for the terabyte-scale datasets that they produce. Typically I was interested in extracting a very small (usually \(<1\%\) of the total) subset of the electrons based on some selection criteria. To do this efficiently I developed a custom C++-based analysis code which can extract electron trajectories from the data based on arbitrary criteria.
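
The analysis code itself is C++, but the core idea of pulling out a small subset of particles with arbitrary boolean criteria is easy to illustrate with a toy NumPy sketch (the field names and thresholds below are made up for the example):

```python
# Toy illustration only (not the C++ analysis code): selecting a small subset
# of particles from one snapshot using arbitrary boolean criteria.
import numpy as np

n = 10_000_000                         # particles in a (toy) snapshot
pz = np.random.normal(0.0, 1.0, n)     # longitudinal momentum, arbitrary units
y = np.random.normal(0.0, 5.0, n)      # transverse position, arbitrary units

mask = (pz > 4.0) & (np.abs(y) < 1.0)  # e.g. high-momentum particles near the axis
selected = np.flatnonzero(mask)        # indices of the selected particles

print(f"kept {selected.size} of {n} particles ({100.0 * selected.size / n:.4f}%)")
```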

A major application of LWFAs is as compact sources of X-ray and UV light for imaging and materials analysis tools in science and industry, and so the second major theme of my work was analysis of the radiation produced by the accelerated electrons. This analysis required numerical integrators to calculate the radiation from the extracted electron trajectories by solving the Liénard-Wiechert potential equations.
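
For the curious, the field that has to be evaluated along each extracted trajectory is the textbook Liénard-Wiechert expression (quoted here from standard electrodynamics texts rather than from my code), where \(\mathbf{n}\) is the unit vector from the particle to the observer, \(\boldsymbol{\beta}\) the particle velocity in units of \(c\), \(\gamma\) the Lorentz factor and \(R\) the particle-observer distance, all evaluated at the retarded time:

\[
\mathbf{E}(\mathbf{r},t) = \frac{e}{4\pi\varepsilon_0}\left[\frac{\mathbf{n}-\boldsymbol{\beta}}{\gamma^{2}\,(1-\mathbf{n}\cdot\boldsymbol{\beta})^{3}\,R^{2}} + \frac{\mathbf{n}\times\big((\mathbf{n}-\boldsymbol{\beta})\times\dot{\boldsymbol{\beta}}\big)}{c\,(1-\mathbf{n}\cdot\boldsymbol{\beta})^{3}\,R}\right]_{\mathrm{ret}},
\qquad
\mathbf{B}(\mathbf{r},t) = \frac{1}{c}\,\mathbf{n}\times\mathbf{E}(\mathbf{r},t),
\]

with the second (far-field) term being the radiation field that gets integrated numerically over each trajectory.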

ATI Data study group

I had the opportunity to attend the Alan Turing Institute (ATI) Data Study Group (22nd-26th May 2017). The ATI is the national institute for data science and as such, it has strong ties to both academia and industry.

The event was a week-long data hackathon in which multiple groups used their best skills to crack the various projects proposed by the industrial partners. Over 5 days, 6 groups worked intensively to deliver feasible solutions to the problems proposed.

The projects

The projects spanned a wide range of topics, each with its own unique challenges and requirements.

  • DSTL presented two projects: the first was aimed at cyber security, trying to identify malicious attacks from IP traffic data. The second project was focused on the identification of places/geographical landmarks so that the team could predict the likelihood of a given event taking place at a given location.
  • HSBC's challenge did not involve the study of a particular dataset, as the rest of the problems did, but was instead based on the development of a synthetic dataset that could be used along with some algorithms to evaluate users' behaviour.
  • Siemens' project was centred around the study of vehicle traffic data that would enable efficient traffic light and traffic volume control, which would eventually lead to a reduction in carbon emissions.
  • Samsung, being one of the leaders in the smartphone industry, decided to use their collected (anonymous) user data to analyse gaming behaviour (e.g. which games you would buy/play based on your current gaming habits) as well as to develop a gaming recommendation engine.
  • Thomson Reuters' challenge was centred around graph analysis. Its primary goal was to identify how positive/negative news about a given company affects other companies/industries within its network, and how far this effect extends.

The Hack

I joined the Thomson Reuters project as this seemed to be one of the projects with the richest dataset, both in its extent and its variety (e.g. news, sentiment analysis, stock market, time series, etc.). The team was made up of 13 people with a huge variety of skills, coming from totally different backgrounds, which is what makes hackathons so exciting. You have to make the most of the skill sets your team has, in a very limited amount of time... pushing you out of your comfort zone.


After a brief team introduction, our 3 Thomson Reuters facilitators described the data and the challenge in more detail. We then identified the main goals of the project and subdivided the team into about 4 smaller teams. Once the initial planning was completed, we spent Monday evening through Wednesday morning learning about the various APIs, getting the data, wrangling data... and getting more data.

We soon realised that analysing all the data was incredibly complex and of course, there was not one correct way to do it. So we had to reduce the scope of the data we were in fact going to use and the sort of features/phenomena we were interested in.

The rest of Wednesday and Thursday were used to start doing some prediction and regression on the data, as well as writing up the report and finishing off our pitch presentation.


The findings

Certainly, we were able to obtain loads of insight from our data and the various algorithms we used. Some of the most important expected and unexpected findings were:

  • Negative news has a longer impact on the companies involved and those within their network (20 days, as opposed to a 4-day impact from positive news)
  • Companies are related to each other based on whether they are affiliates, competitors, parents, etc.; not surprisingly, competitors are the companies that have the biggest effect on other companies' stock prices
  • Different types of industries react differently to negative/positive news, and the extent of the impact varies considerably from one industry type to another

Taking home

As with every hackathon I have been to, I ended up feeling absolutely drained, but accomplished at the same time. Hacks are such a good opportunity to meet like-minded people, learn loads of stuff, test yourself, and have fun.

Would I encourage people to go to one of these events? Absolutely! If you are interested in all things data science you should keep an eye on future events by the Alan Turing Institute. If you only want to have a go at hacking for fun, for a cause, or to meet new people, I would suggest you have a look at Major League Hacking. I am sure you will be able to find something near you and for all sorts of interests. Our guys at Hack Sheffield organise a hackathon once a year, so you might want to keep an eye on their activities. For those around the Manchester area, check out Hac100 (formerly Hack Manchester), including the youth and junior hacks.

Will the RSE team at the University of Sheffield organise hacks? We have a crazy amount of work on our plates at the moment, but we are definitely interested in (co)organising hackathons and many other events throughout the year. So keep your eyes peeled!

Mozsprint 2017 at the University of Sheffield


The 1st-2nd of June 2017 saw the Mozilla Global Sprint circle the globe once again. It's Mozilla's flagship two-day community event, bringing together people from all over the world to celebrate the power of open collaboration by working on a huge diversity of community-led projects: developing open source software, building open tools, writing curricula, planning events, and more. So here are a few of my own thoughts and reflections on this year's happenings.

Lead up to the sprint

Open Leadership Training mentorship

I joined my first Mozilla Global Sprint last year as the culmination of the Science Lab’s inaugural Working Open Workshop and mentorship program. I worked on my very own open science project, rmacroRDM, which I'd spent the lead-up to the Sprint preparing. This year, however, it was a different experience for a number of reasons.

Firstly, the roles had been reversed: from mentee, I was now a seasoned open leadership mentor. In fact, I had enjoyed the Open Leadership training program so much that I’d volunteered to mentor on the following two rounds, the first culminating at MozFest 2016 and this latest round at the Global Sprint 2017. Apart from staying connected to the vibrant network of movers and makers that is Mozilla, I also found I got a lot out of mentoring myself, from improving my skills in understanding different people’s styles and support requirements, to being introduced to new ideas, tools and technologies by interesting people from all over the world! Overall I find mentorship a positive-sum activity for all parties involved.

So the lead-up this year involved mentoring two projects while they prepared to launch at the Global Sprint. The Open Leadership Training program involves mentees working through the OLT materials over 13 weeks while developing the resources required to open their projects up, ready to receive contributions. On a practical level, the program teaches approaches to help clearly define and promote the project, and the use of GitHub as a tool to openly host work on the web, plan, manage, track, discuss and collaborate. But the program delves deeper into the very essence of building open, supportive and welcoming communities in which people with an interest in a tool/cause/idea can contribute what they can, learn and develop themselves, and feel valued and welcome members of a community.

Weekly contacts with the program alternated between whole-cohort Vidyo call check-ins and more focused one-on-one Skype calls between mentors and mentees. This round I co-mentored with the wonderful Chris Ritzo from New America’s Open Technology Institute, and we took on two extremely exciting projects, Aletheia and Teach-R.



Mentee projects:


Headed up by Kade Morton (@cypath), a super sharp, super visionary, super motivated, self-described crypto nerd from Brisbane, Australia, Aletheia doesn't pull any punches when describing its reason for being:


In response, they're building a decentralised and distributed database as a publishing platform for scientific research, leveraging two key pieces of technology: IPFS and blockchain. Many of the technical details are frankly over my head, but I nonetheless learned a lot from Kade’s meticulous preparation and drive. Read more about the ideas behind the project here.




What can I say about Marcos Vital, professor of Quantitative Ecology at the Federal University of Alagoas (UFAL), Brazil, and fellow #rstats aficionado, apart from that he is also a huge inspiration! An effortless community builder, he runs a very successful local study group and has built a popular and engaged online community through his lab's Facebook page, promoting science communication.

The topic of his project Teach-R is close to my heart, aiming to collate and develop training materials to TEACH people to TEACH R. Read more about it here.



Hosting a Sheffield site.

Secondly, this year I helped host a site here at the University of Sheffield, and seeing as the sprint coincided with my first day as a Research Software Engineer for Sheffield RSE, we decided to take the event under our wing. With space secured, and swag and coffee funds supplied by the Science Lab, the local site was ready for action!



The Sprint!

Sprint at the University of Sheffield.

There was a good buzz of activity at the site throughout the sprint, with a few core participants while others came and went as they could. At the very least, roaming participants managed to soak up some of the atmosphere and pick up some git and GitHub skills... a success in my books!

Stuart Mumford (@StuartMumford) led project SunPy, a Python-based open-source solar data analysis environment, and attracted a number of local contributors, including a new PhD student, although, as is often the case, much of the first morning seemed to be spent battling Python installation on his laptop! Worth it for picking up a local contributor who will hopefully remain engaged throughout his studies though, and the team managed to push on with bug fixes and documentation development.

Jez Cope (@jezcope), our University's Research Data Manager, was contributing to Library Carpentry, one of the biggest and most popular projects at this year's Sprint, and also brought super tasty banana bread. He's also blogged about his experiences here.

As for myself, while of course tempted by the many R, open science and reproducibility projects on offer, in the end I chose to work on something unrelated to what I'm lucky enough to do for work and to focus on a project I'm personally interested in. So I teamed up with Tyler Kolody (@TyTheSciGuy) on his timely project EchoBurst. The project aims to address our growing, social-media-facilitated retreat into echo chambers, which is resulting in increasingly polarised public discourse and an unwillingness to engage with views we disagree with. The idea is to attempt to burst through such bubbles by developing a browser extension with the potential to distinguish toxic content, more likely to shut down discussion, from more constructive content that might be able to bridge different perspectives.

Admittedly, the project is very ambitious, with a long way to go, many stages, and various techniques/technologies to incorporate, including natural language processing, building the browser plugin, and even considering the psychological and behavioural aspects of designing how to present information that might oppose a user's view without triggering the natural shut-down response.

There was plenty of really interesting brainstorming discussion, but the biggest initial challenge, and where the project could use the most help, is in collecting training data. The main approach is for contributors to help collect URLs of blogs on polarising topics from which to scrape content. But during the sprint we also added the option for contributors to add relevant YouTube videos to collaborative playlists. We also started working on simple R functions to help scrape and clean the caption content.


Sprint across the globe

What a productive event this year's sprint was! While details of the level of activity have been covered and storified elsewhere, and the final project demos can be found here and here, I just wanted to highlight some basic stats:

Global #mozprint involved:
  • 65 sites (+ virtual participants)
  • 20 countries
  • 108 projects
During the 50 hour #mozsprint, we saw:
  • 302 pull requests closed
  • 320 pull requests opened
  • 2223 comments & issues
  • 824 commits pushed

BOOM!

(access the full data on github activity here)


Mentee progress

I was really happy to see both our mentees get great responses, pick up new contributors and make good progress on their projects.

  • Marcos expertly moderated a very active Gitter channel for Teach-R and attracted a number of excellent and very engaged new contributors, adding a number of new lessons in both English and Portuguese!

  • Kade also got great engagement for Aletheia, including onboarding science communicator Lisa Mattias (@l_matthia), who's already blogged about their plans to take the project forward by applying to present it at this year's Open Science Fair. Importantly, he also managed to attract the JavaScript developer they've been desperately looking for. Success! You can read more about Kade's experiences of the sprint here.

They both made us very proud indeed!



Highlights

But the most important feature of the sprint for me every year is the global camaraderie and atmosphere of celebration. Handing off from one timezone to the other and checking in within our own to hear from leads about their project needs and progress, hanging out with participants from far and wide on Vidyo and through streams of constant messaging on Gitter, catching up with friends across the network...



...and cake...sooooooooo much cake!!

disclaimer: this cake was sadly not at the Sheffield site. It definitely has inspired me to put a lot more effort into this aspect of the sprint next year though!


Final thoughts

The end of the sprint is always a bit sad, but the projects live on, hopefully with a new lease of life. So if, by reading this, you're inspired to contribute, check out the full list of projects for something that might appeal. There's a huge diversity of topics, tasks and skills required to choose from, and fun new people to meet!

The network lives on too, so if you’ve got an exciting idea of your own that you think would make a good open source project, make sure to check out @MozOpenLeaders and look out for the next mentorship round.

As for the impact on Sheffield RSE, well there was one point where we managed to get the full team and loose collaborators working in one room (we’re normally spread out across the university). It felt great to work together from the same space so we decided to make a point of routinely booking one of the many excellent co-working spaces the University of Sheffield has on offer and establish regular work-together days!

So thanks for the inspiration and excellent times Mozilla! Till the next time!

(i.e. MozFest 2017!)



Sounds:

Apart from the coffee and good vibes, the day was also fuelled by sounds. Here's a couple of the mixes that kept the Sheffield site going!

Grooves no. 1:


Grooves no. 2: