Manchester Julia Workshop

A few weeks ago (19-20th September 2016) I had the chance to attend the very first Julia workshop in the UK, held at the University of Manchester by the SIAM Student Chapter. The first day of the workshop consisted of a basic Julia tutorial, installation instructions and around five hours of hackathon. The second day provided an introduction to carrying out research with Julia in various fields such as data analysis, materials science, natural language processing and bioinformatics. The attendees were a mixture of PhD students, post-docs and lecturers, mainly from the University of Manchester but also from other universities and institutes (Warwick, Glasgow, Reading, MIT, Imperial College London, Earlham Institute).

Day 1: Tutorial and Hackathon

There are several ways to run Julia on any OS, including the command-line version, the Juno IDE and the Jupyter notebook (IJulia). If you want to avoid any installation process, there is also the browser-based JuliaBox.com. I was surprised that the whole process was smooth, without any software setup issues!

The tutorial consisted of some very basic demonstrations of Julia, mostly on linear algebra and statistics. After a short break we were left to explore Julia, collaborate and exchange ideas. Two exercises were also proposed to us:

  • First Steps With Julia by Kaggle, which teaches some basics of image processing and machine learning to identify characters from pictures.
  • Bio.jl Exercises by Ben J. Ward, which provides simple examples of using Bio.jl to perform basic operations and manipulations on biological sequences.

As I wanted to try as many libraries as possible, from image processing and data visualization to embedded Java, I ended up using a lot of different packages, so I found the (self-explanatory) package-management commands Pkg.add("PackageName"), Pkg.status() and Pkg.update() the most useful for me. Here, of course, I hit some compatibility issues. I was running Julia version 0.4.6, but it appears that most of the attendees were using version 0.4.5. Some commands seemed to have changed between these versions; for example, in the Kaggle exercise the command float32sc(img), which converts an image to float values, was not working for me and I had to use float32(img) instead. A minor issue for a new-born language.
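
For reference, here is a minimal sketch of that package-management workflow (Julia 0.4-era syntax; the package name is only an example):

# Install a registered package and its dependencies (the name is illustrative)
Pkg.add("Images")

# List installed packages and their versions
Pkg.status()

# Update all installed packages
Pkg.update()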

Day 2: Talks

The talks were centred on specific fields that rely heavily on scientific computing (automatic differentiation, molecular modelling, natural language processing, bioinformatics and computational biology) and how Julia influences them. Each speaker presented their field of research and their Julia implementations, which have ended up as packages for the Julia community. More information about the speakers can be found on the Manchester Julia Workshop webpage and a list of the presented packages can be found below:

Final words

Overall I was very satisfied with the Julia experience and I am waiting for its first official release (v1.0), which will probably arrive next year. Here are the main advantages which led me to believe that Julia can be the next in-demand programming language for scientific computing:

  • Combines the productivity of dynamic languages (e.g. Python) with the performance of static languages (C, Fortran). In other words: it is very easy to write optimized code and have your program run fast at the same time. In his talk, Dr Jiahao Chen from MIT said the following about Julia's speed: "You can define many methods for a generic function. If the compiler can figure out exactly which method you need to use when you invoke a function, then it generates optimized code". A tiny illustration of this idea follows the list.
  • Deals with the two-language problem: the base library and core functionality are written in Julia itself.
  • It is free and open source (MIT licensed), which is highly advantageous for the scientific community when sharing code or extending existing code.
  • A great and friendly community, with users from various fields who constantly expand the existing Julia libraries.
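
As promised above, here is a tiny illustration of the quote about generic functions (the function double is my own example, not something from the talk): one generic function with two methods, where Julia picks the method, and generates specialised code, based on the argument types.

# Two methods of the same generic function; Julia selects one from the
# argument types and compiles specialised code for each.
double(x::Int) = 2 * x
double(x::Float64) = 2.0 * x

double(3)     # uses the Int method, returns 6
double(3.0)   # uses the Float64 method, returns 6.0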

Fun fact: variable names accept Unicode characters: typing \delta[tab] = 2 results in δ = 2, and \:smiley:[tab] = 4 results in 😃 = 4. Although, apart from some April Fool's pranks, following Julia's stylistic conventions when defining variable names is advised!

Coffee and Cakes Event

RSE Sheffield is hosting its first coffee and cakes event on 4th October 2016 at 10:00 in the Ada Lovelace room on the 1st floor of the Computer Science Department (Regents Court East). Attendance is free and you don't need to register (or bring coffee and cake with you). Simply call in and take the opportunity to have an informal chat about research software.

The event is a community event for anyone, not just computer scientists or members of the RSE team. If you work on software development, are an RSE, or simply want to talk about some aspect of software or software in teaching, then come along.

Accelerated versions of R for Iceberg

Too Long; Didn't Read -- Summary

I've built a version of R on Iceberg that is faster than the standard version for various operations. Documentation is at http://docs.hpc.shef.ac.uk/en/latest/iceberg/software/apps/r.html.

If it works more quickly for you, or if you have problems, please let us know by emailing rse@sheffield.ac.uk.

Background

I took over building R for Iceberg, Sheffield's High Performance Computing system, around a year ago and have been incrementally improving both the install and the documentation with every release. Something that's been bothering me for a while is the lack of optimisation. The standard Iceberg build uses an ancient version of the gcc compiler and (probably) unoptimised versions of BLAS and LAPACK.

BLAS and LAPACK are extremely important libraries -- they provide the code that programs such as R use for linear algebra: matrix-matrix multiplication, Cholesky decomposition, principal component analysis and so on. It's important to note that there are lots of implementations of BLAS and LAPACK: ATLAS, OpenBLAS and the Intel MKL are three well-known examples. They all share the same (Fortran-defined) interface, which means you can use them interchangeably, but the speed of the implementation can vary considerably.

The BLAS and LAPACK implementations on Iceberg are undocumented (before my time!) which means that we have no idea what we are dealing with. Perhaps they are optimised, perhaps not. I suspected 'not'.

Building R with the Intel Compiler and MKL

The Intel Compiler Suite often produces the fastest executables of all available compilers for any given piece of Fortran or C/C++ code. Additionally, the Intel MKL is probably the fastest implementation of BLAS and LAPACK available for Intel hardware. As such, 'Build R using Intel Compilers and MKL' has been on my to-do list for some time.

Following a recent visit to the University of Lancaster, where they've been doing this for a while, I finally bit the bullet and produced some build-scripts. Thanks to Lancaster's Mike Pacey for help with this! There are two versions (links point to the exact commits that produced the builds used in this article):

The benchmark code is available in the Sheffield HPC examples repo https://github.com/mikecroucher/HPC_Examples/. The exact commit that produced these results is 35de11e.

Testing

It's no good having fast builds of R if they give the wrong results! To make sure that everything is OK, I ran R's installation test suite and everything passed. If you have an account on Iceberg, you can see the output from the test suite at /usr/local/packages6/apps/intel/15/R/sequential-3.3.1/install_logs/make_install_tests-R-3.3.1.log.

It's important to note that although the tests passed, there are differences in output between this build and the reference build that R's test suite is based on. This is due to a number of factors: floating point addition is not associative, the signs of eigenvectors are arbitrary, and so on.
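
As a small illustration of the first point, floating point addition really does depend on grouping:

# In R (or any IEEE-754 arithmetic), changing the grouping changes the last bits
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # FALSE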

A discussion around these differences and how they relate to R can be found on nabble.

How fast is it?

So is it worth it? I ran a benchmark called linear_algebra_bench.r that implemented five tests (a rough sketch of the style of timing involved is shown after the list):

  • MatMul - Multiplies two random 1000 x 5000 matrices together
  • Chol - Cholesky decomposition of a 5000 x 5000 random matrix
  • SVD - Singular Value Decomposition of a 10000 x 2000 random matrix
  • PCA - Principal component analysis of a 10000 x 2000 random matrix
  • LDA - A Linear Discriminant Analysis problem
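
The real benchmark is linear_algebra_bench.r in the repository mentioned above; the snippet below is only a rough sketch of the style of timing it performs, with an illustrative matrix size rather than the sizes listed.

# Rough sketch of timing a few of the operations above (n is illustrative)
n <- 2000
A <- matrix(rnorm(n * n), nrow = n)
B <- matrix(rnorm(n * n), nrow = n)

# Matrix-matrix multiplication
matmul_time <- system.time(A %*% B)[["elapsed"]]

# Cholesky decomposition of a symmetric positive-definite matrix
S <- crossprod(A)   # t(A) %*% A is symmetric positive-definite
chol_time <- system.time(chol(S))[["elapsed"]]

# Principal component analysis
pca_time <- system.time(prcomp(A))[["elapsed"]]

print(c(MatMul = matmul_time, Chol = chol_time, PCA = pca_time))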

Run time of these operations compared to Iceberg's standard install of R is shown in the table below.

Execution time in seconds (Mean of 5 independent runs)

                                       MatMul    Chol     SVD      PCA      LDA
Standard R                             134.70    20.95    46.56    179.60   132.40
Intel R with sequential MKL             12.19     2.24     9.13     24.58    31.32
Intel R with parallel MKL (2 cores)      7.21     1.60     5.43     14.66    23.54
Intel R with parallel MKL (4 cores)      3.24     1.17     3.34      7.87    20.63
Intel R with parallel MKL (8 cores)      1.71     0.38     1.99      5.33    15.82
Intel R with parallel MKL (16 cores)     0.96     0.28     1.60      4.05    13.65

Another way of viewing these results is to see the speed up compared to the standard install of R. Even on a single CPU core, the Intel builds are between 4 and 11 times faster than the standard builds. Making use of 16 cores takes this up to 141 times faster in the case of Matrix-Matrix Multiplication!

Speed up compared to standard R

                                       MatMul    Chol    SVD    PCA    LDA
Standard R                                  1       1      1      1      1
Intel R with sequential MKL                11       9      5      7      4
Intel R with parallel MKL (2 cores)        19      13      9     12      6
Intel R with parallel MKL (4 cores)        42      18     14     23      6
Intel R with parallel MKL (8 cores)        79      55     23     34      8
Intel R with parallel MKL (16 cores)      141      75     29     44     10

Parallel environment

The type of parallelisation in use here is OpenMP. As such, you need to use Iceberg's openmp parallel environment. That is, if you want 8 cores (say), add the following to your submission script:

#$ -pe openmp 8
export OMP_NUM_THREADS=8

Using OpenMP limits the number of cores you can use per job to the number available on a single node. At the time of writing, this is 16.
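
Putting this together, a complete submission script might look like the sketch below. The module name and script name are illustrative only (I'm not quoting the real module here); the documentation page linked later in this post gives the exact module to load.

#!/bin/bash
# Request Iceberg's OpenMP parallel environment with 8 cores (single node)
#$ -pe openmp 8

# Tell the OpenMP/MKL runtime how many threads to use
export OMP_NUM_THREADS=8

# Load the Intel build of R -- illustrative module name, check the documentation
module load apps/R/3.3.1-intel

# Run your R script (illustrative file name)
Rscript my_analysis.R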

How many cores: Finding the sweet spot

Note that everything is fastest when using 16 cores! It may therefore be tempting to always use 16 cores for your jobs. This will almost always be a mistake. It may be that the aspect of your code that's accelerated by this build doesn't account for much of the runtime of your problem, in which case those 16 cores will sit idle most of the time -- wasting resources.

You'll also spend a lot longer waiting in the queue for 16 cores than you will for 2 cores, which may swamp any speed gains.

You should always perform scaling experiments before deciding how many cores to use for your jobs. Consider the Linear Discriminant Analysis problem, for example. Using just one core, the Intel build gives us a 4 times speed-up compared to the standard build. Moving to 8 cores only makes it twice as fast again. To put numbers on that: eight single-core LDA runs executing side by side on 8 cores finish in about 31 seconds each, giving eight results in roughly half a minute, whereas eight 8-core jobs run one after another at ~16 seconds each take around two minutes. As such, if you had lots of these jobs to do, your throughput would be higher running lots of single-core jobs than lots of 8-core jobs.
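
A simple way to run such a scaling experiment is to repeat the benchmark (or, better, a representative chunk of your own workload) over a range of thread counts, for example:

# Illustrative scaling loop; on Iceberg each run would normally be its own
# batch job with a matching '#$ -pe openmp' request
for n in 1 2 4 8 16; do
    echo "OMP_NUM_THREADS=$n"
    OMP_NUM_THREADS=$n Rscript linear_algebra_bench.r
done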

If matrix-matrix multiply dominates your runtime, on the other hand, it may well be worth using 16 cores.

Using this version of R for your own work

There are a few things you need to be aware of as a user of the Intel builds of R, so I've created a separate documentation page for them. It is currently at http://docs.hpc.shef.ac.uk/en/latest/iceberg/software/apps/intel_r.html

My recommendation for using these builds is to work through the following procedure:

  • Ensure that your code runs with Iceberg's standard version of R and produce a test result.
  • In the first instance, switch to the sequential version of the Intel R build. In the best case, this will just require changing the module. You may also need to reinstall some of your packages, since the Intel build has a separate package directory from the standard build.
  • If you see speed-up and the results are consistent with your test result, try the parallel version. Initially start with 2 cores and move upwards to find the sweet spot.

The University of Sheffield named an NVIDIA GPU Education Center

Sheffield NVIDIA Education Centre

This week I am very pleased to announce that the University of Sheffield has been awarded the status of an NVIDIA CUDA Education Centre.

The Faculty of Engineering has featured this in its latest faculty newsletter and the Department of Computer Science has published more details in a news feature.

But what does this mean to the RSE community at Sheffield and beyond?

The recognition of being an NVIDIA Education Centre is a reflection of the teaching on GPU computing provided by The University of Sheffield. In case you are unaware of what teaching there is: I teach a 4th-year and Masters module, COM4521/COM6521, which ran for the first time in the 2015/2016 Spring Semester. This course will run annually and is open to research staff as well as taught students. Last time there was roughly a 50:50 mix, including senior research staff and PhD students. It is much more involved than the one- or two-day courses which typically give only an introduction to GPU programming. If you are a researcher looking to exploit GPU performance in your research, then this course is an opportunity to learn some new skills.

In the future this course will be made freely available so even researchers outside of The University of Sheffield will be able to go through the notes and worked examples (lab sheets).

Some of the other benefits of being an NVIDIA Education (and also an NVIDIA Research) centre are:

  • Access to NVIDIA GPU hardware and software (via Iceberg and in the Diamond labs)
  • Significant discount on Tesla hardware purchases
  • Access to NVIDIA parallel programming experts and resources
  • Access to educational webinars and an array of teaching materials
  • Free cloud-based GPU programming training at nvidia.qwiklab.com
  • Support in the form of letters of support (with contributions in kind) for research proposals with emphasis on GPU computing or deep learning
  • Joint promotion, public relations, and press activities with NVIDIA

Other Training Opportunities

Through RSE Sheffield and GPUComputing@Sheffield, shorter GPU computing courses are also available. I will be announcing dates for 1-2 day CUDA courses shortly and am working with CICS in developing new Python CUDA material.

For those that missed the sign-up, we are also running a two day deep learning with GPUs course in July. The places for this were in high demand and filled up within a day. This course will be repeated in due time and material from the course will be made available off-line.

Other GPU announcements will be made on both this RSE blog and on the GPUComputing@Sheffield mailing list. Expect some exciting new hardware and software once the Iceberg upgrade is complete (shortly).

Paul