A certain amount of technical ability is required as a modern researcher. From running simulations, running large calculations, applying data transformation and analysis or even automating mundane tasks such as renaming and moving data files, learning to program can really help improve your workflow and productivity.
This page is a compilation of resources and learning materials that starts with learning to program from scratch to using the University’s High Performance Computing (HPC) clusters.
Often the software you use to carry out your research will influence your language choice, but if you’re starting from zero, Python is a great choice.
We’ve compiled a list of resources for learning Python from scratch.
Integrated Development Environments (IDEs) are text editors designed for writing code. They generally feature syntax highlighting, offer code suggestions and code debugging, allowing you to pause your program and examine it when checking for errors.
Python has many statistics libraries but the ones you will be using most is pandas, which offers data structures and operations for manipulating numerical tables and time series, and numpy which is a library optimised for performing numerical operations on large matrices. The two aforementioned libraries are used as the base part of many other scientific Python packages.
For data visualisation, the seaborn library provides a high-level interface over the venerable matplotlib library and makes it easier to plot good looking graphs. ggplot is another popular library that follows the principle of Grammar of Graphics. An interesting library is bokeh which is designed to generate integrative graphs that can be embedded in web pages.
Jupyter Notebooks is a platform that allows you to interactively write and run code, show its results and provide text annotations within a single document. The tool is great for exploratory coding or when you want to demonstrate a linear scientific workflow.
Here’s a demonstration of a notebook which is used to illustrate working through a machine learning exercise: https://nbviewer.jupyter.org/github/jdwittenauer/ipython-notebooks/blob/master/notebooks/ml/ML-Exercise1.ipynb
No matter what programming language you use, we always recommend using version control systems such as Git in your project.
With version control, all changes made to your code are progressively tracked. This means you can always change or delete code with the confidence that you can always revert the changes if necessary. It becomes absolutely essential when collaboratively working on the same code. Additionally, any text file works well with version control systems e.g. latex documents.
With Git, a folder/directory is converted into a
repository that can track changes of all of its contained files and subdirectories. The repository can then be uploaded and synchronized with on-line services such as Github, backing up your work and enabling you to access it from anywhere in the world. See our remote working with github guide for more details.
Needing to get help on a particular piece of your code? It’s easy to share your on-line repository’s access with helpers e.g. Code Clinic.
Download and installation instructions for all platforms can be found at:
Familiarity with the command-line interface is required as git is a command-line tool.
The services below offer free hosting for personal repositories and some even offer free pro versions for academia.
GUI tools can really help with routine Git operations and especially when trying to make sense of large repositories.
When running large simulations or doing analysis on large datasets, your tasks can take hours or even days to complete, you may struggle to even load everything into memory or can’t fit the dataset or results onto your hard disk. If this sounds like something you’re facing, it may be time to start having a look at using the HPC (High Performance Computing) clusters provided through the University of Sheffield.
A HPC cluster essentially consists of a number of computers connected through a fast network. Each machine, or node, normally has far more CPU cores and RAM than the average PC. The best way to take advantage of a HPC cluster is to split your computation into small, independent tasks and distribute them to run concurrently across multiple CPU cores or nodes.
Access to these clusters are open to all UoS researchers and academics and they’re free at the point-of-use.
All HPC clusters mentioned above run on Linux operating system. Generally you have to interact with them using the command line interface (although for some you can launch a GUI application such as matlab though them).
As you have limited user permissions on these systems, you have to use pre-installed software (using modules) and custom installation of software is normally done through packages such as conda or spack that allows installation of software and their dependencies to your home directory.
HPC clusters use a job scheduler to make sure everyone has a chance to run their tasks. You write a job script that tells the HPC what tasks to run and what computing resources you need and then send it to the scheduler to be added to the queue. The scheduler will then run your job when it’s your turn in the queue and there’s enough resource available on the cluster.
For queries relating to collaborating with the RSE team on projects: firstname.lastname@example.org
Join our mailing list so as to be notified when we advertise talks and workshops by subscribing to this Google Group.