Sharing Environments
Overview
Teaching: 30 min
Exercises: 15 min
Questions
Why should I share my Conda environment with others?
How do I share my Conda environment with others?
Objectives
Create an environment from a YAML file that can be read by Windows, Mac OS, or Linux.
Create an environment based on exact package versions.
Create a custom kernel for a Conda environment for use inside JupyterLab and Jupyter notebooks.
Working with environment files
When working on a collaborative research project your operating system will often differ from the operating systems used by your collaborators. Similarly, the operating system used on a remote cluster to which you have access will likely differ from the one you use on your local machine. In these cases it is useful to create an operating-system-agnostic environment file which you can share with collaborators or use to re-create an environment on a remote cluster.
Creating an environment file
In order to make sure that your environment is truly shareable, you need to make sure that the contents of your environment are described in such a way that the resulting environment file can be used to re-create your environment on Linux, Mac OS, and Windows. Conda uses YAML (YAML Ain’t Markup Language) for writing its environment files. YAML is a human-readable data-serialization language that is commonly used for configuration files and that uses Python-style indentation to indicate nesting.
Creating your project’s Conda environment from a single environment file is a Conda “best practice”. Not only do you have a file to share with collaborators, but you also have a file that can be placed under version control, which further enhances the reproducibility of your research project and workflow.
Default environment.yml file
Note that by convention Conda environment files are called environment.yml. As such, if you use the conda env create sub-command without passing the --file option, then conda will expect to find a file called environment.yml in the current working directory and will throw an error if a file with that name cannot be found.
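For example, assuming a file named environment.yml already exists in your project directory, the following sketch would build the environment it describes without naming the file explicitly:
$ cd project-dir
$ conda env create
Here conda silently picks up environment.yml from the current working directory.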
Let’s take a look at a few example environment.yml files to give you an idea of how to write your own environment files.
name: machine-learning-39-env
dependencies:
- ipython
- matplotlib
- pandas
- pip
- python
- scikit-learn
This environment.yml file would create an environment called machine-learning-39-env with the most current and mutually compatible versions of the listed packages (including all required dependencies). The newly created environment would be installed inside the ~/miniconda3/envs/ directory, unless we specified a different path using --prefix.
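As an aside, if you prefer to keep the environment alongside your project files rather than under ~/miniconda3/envs/, a minimal sketch using --prefix might look like this (the ./env path is just an illustrative choice):
$ conda env create --file environment.yml --prefix ./env
$ conda activate ./env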
If you prefer, you can use explicit version numbers for all packages:
name: machine-learning-39-env
dependencies:
- ipython=8.8
- matplotlib=3.6
- pandas=1.5
- pip=22.3
- python=3.10
- scikit-learn=1.2
Note that we are only specifying the major and minor version numbers, not the patch or build numbers. Pinning only the major and minor version numbers, while allowing the patch version to vary, lets us use our environment file to pick up any bug fixes when updating the environment whilst still maintaining significant consistency of our Conda environment across updates.
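To illustrate, Conda’s version specifications let you choose how tightly to pin each package. Each line below shows an alternative way of pinning pandas; you would use only one of them in a real file:
- pandas          # most recent compatible version, no pin
- pandas=1.5      # any 1.5.x release; patch version may vary
- pandas=1.5.3    # exactly version 1.5.3
- pandas>=1.5     # version 1.5 or newer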
Always version control your environment.yml files!
Version control is a system for keeping track of changes that are made to files, in this case the environment.yml file. This is really useful because if, for example, you update a package to a specific version and find that it breaks something in your environment or when running your code, you have a record of what the file previously contained and can revert the changes. There are many systems for version control, but the one you are most likely to encounter, and which we recommend learning, is Git. Unfortunately the topic is too broad to cover in this material, but we include the commands to version control your files at the command line using Git.
By version controlling your environment.yml file along with your project’s source code you can recreate your environment and results at any particular point in time. You do not need to version control the directory under ~/miniconda3/envs/ where the environment’s packages are installed.
Let’s suppose that you want to use the environment.yml file defined above to create a Conda environment in a sub-directory of some project directory. Here is how you would accomplish this task.
$ cd ~/Desktop/conda-environments-for-effective-and-reproducible-research
$ mkdir project-dir
$ cd project-dir
Once your project folder is created, create environment.yml using your favourite editor, for instance nano.
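For example, using nano:
$ nano environment.yml
Paste in the contents shown above, then save and exit.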
Finally create a new Conda environment:
$ conda env create --name project-env --file environment.yml
$ conda activate project-env
Note that the above sequence of commands assumes that the environment.yml file is stored within your project-dir directory.
Automatically generate an environment.yml
To export the packages installed into the previously created machine-learning-39-env you can run the following command:
$ conda env export --name machine-learning-39-env
When you run this command, you will see the resulting YAML-formatted representation of your Conda environment streamed to the terminal. Recall that we only listed six packages when we originally created machine-learning-39-env, yet from the output of the conda env export command we see that these six packages result in an environment with roughly 80 dependencies!
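An abridged sketch of what that output might look like is shown below; the exact package list, version numbers, build strings, and prefix path will all differ on your system:
name: machine-learning-39-env
channels:
- defaults
dependencies:
- ipython=8.8.0=py310haa95532_0
- matplotlib=3.6.2=py310haa95532_0
...
- python=3.10.9=h966fe2a_0
...
prefix: /home/user/miniconda3/envs/machine-learning-39-env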
What’s in the exported environment.yml file
The exported version of the file looks a bit different to the one we wrote. In addition to version numbers, some lines contain an extra code, e.g. vs2015_runtime=14.34.31931=h4c5c07a_10. The h4c5c07a_10 is the build variant hash. This appears when the build of a package differs between operating systems. The implication is that an environment file containing a build variant hash for one or more of its packages cannot be used on a different operating system to the one it was created on.
To export this list into an environment.yml file, you can use the --file option to save the resulting YAML environment directly into a file. If the target for --file exists it will be over-written, so make sure your filename is unique. So that we do not over-write environment.yml, we save the output to machine-learning-39-env.yml instead and add it to the Git repository.
$ conda env export --name machine-learning-39-env --file machine-learning-39-env.yml
$ git init
$ git add machine-learning-39-env.yml
$ git commit -m "Adding machine-learning-39-env.yml config file."
It is important to note that the exported file includes the prefix: entry, which records the location where the environment is installed on your system. If you share this file with colleagues you should remove this line first, as it is highly unlikely that their virtual environments will be in the same location. You should also ideally remove the line before committing the file to Git.
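If you prefer to do this at the command line rather than in a text editor, a quick sketch (assuming GNU sed is available) is:
$ sed -i '/^prefix:/d' machine-learning-39-env.yml
This deletes any line in the file that starts with prefix:.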
This exported environment file may not consistently produce environments that are reproducible across operating systems. The reason is that it may include operating-system-specific low-level packages which cannot be used on other operating systems. If you need an environment file that can produce reproducible environments across Mac OS, Windows, and Linux, then you are better off including only those packages that you have specifically installed.
This is achieved using the --from-history option, which means that only those packages explicitly installed with conda install commands, and not the dependencies that were pulled in when doing so, will be exported.
$ conda env export --name machine-learning-39-env --from-history --file machine-learning-39-env.yml
$ git add machine-learning-39-env.yml
$ git commit -m "Updates machine-learning-39-env.yml based on environment history"
Excluding the build variant hash
In short: to make sure others can reproduce your environment independent of the operating system they use, add the --from-history argument to the conda env export command. This will include only the packages you explicitly installed, at the versions you requested. For example, if you installed numpy-1.24 this will be listed, but if you installed pandas without a version, and therefore got the latest release, then pandas will be listed in your environment file without a version number, so anyone using your environment file will get the latest version, which may not match the version you used. This is one reason to explicitly state the version of a package you wish to install.
Without --from-history the output may on some occasions include the build variant hash (which can alternatively be removed by editing the environment file). These hashes are often specific to the operating system, and including them in your environment file means it will not necessarily work for someone using a different operating system.
Be aware that --from-history will omit any packages you have installed using pip. This may be addressed in future releases. In the meantime, editing your exported environment files by hand is sometimes the best option.
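For comparison, here is a sketch of what a --from-history export of machine-learning-39-env might contain; only the explicitly requested packages and versions appear, with no build variant hashes (the prefix: line is omitted here for brevity):
name: machine-learning-39-env
channels:
- defaults
dependencies:
- ipython=8.8
- matplotlib=3.6
- pandas=1.5
- pip=22.3
- python=3.10
- scikit-learn=1.2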
Create a new environment from a YAML file.
Create a new project directory and then create a new scikit-learn-env.yml file inside your project directory with the following contents.
name: scikit-learn-env
dependencies:
- ipython=8.8
- matplotlib=3.6
- pandas=1.5
- pip=22.3
- python=3.10
- scikit-learn=1.2
Now use this file to create a new Conda environment. Where is this new environment created? If you are using Git, add the YAML file to your repository.
Solution
To create a new environment from a YAML file use the conda env create sub-command as follows.
$ mkdir scikit-learn-project
$ cd scikit-learn-project
$ conda env create --file scikit-learn-env.yml
$ git init
$ git add scikit-learn-env.yml
$ git commit -m "Adding scikit-learn-env.yml config file"
The above sequence of commands will create a new Conda environment inside the ~/miniconda3/envs directory (check with conda env list or conda info). You can now run the conda env list command and see that this environment has been created, or, if you have activated it with conda activate scikit-learn-env, you can use conda info to get detailed information about the environment.
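For example, the output of conda env list might look something like the sketch below; the paths and the set of environments will vary on your machine:
$ conda env list
# conda environments:
#
base                     /home/user/miniconda3
scikit-learn-env      *  /home/user/miniconda3/envs/scikit-learn-env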
Specifying channels in the environment.yml
We learnt in the previous episode that some packages may need to be installed from channels other than the default channel. We can also specify the channels in which conda should look for the packages within the environment.yml file:
name: polars-env
channels:
- conda-forge
- defaults
dependencies:
- polars=0.16
When the above file is used to create an environment, conda would first look in the conda-forge channel for all packages mentioned under dependencies. If they exist in the conda-forge channel, conda would install them from there and not look for them in defaults at all.
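If you want conda to consider only the channels listed in the file, recent versions of Conda also recognise the special nodefaults keyword, which excludes the defaults channel entirely. A sketch of the same file using it (check that your Conda version supports this):
name: polars-env
channels:
- conda-forge
- nodefaults
dependencies:
- polars=0.16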
Updating an environment
You are unlikely to know ahead of time which packages (and version numbers!) you will need to use for your research project. For example, it may be the case that:
- one of your core dependencies just released a new version (dependency version number update).
- you need an additional package for data analysis (add a new dependency).
- you have found a better visualization package and no longer need the old visualization package (add a new dependency and remove the old dependency).
If any of these occurs during the course of your research project, all you need to do is update the contents of your environment.yml file accordingly and then run the following command.
$ conda env update --name project-env --file environment.yml --prune
$ git add environment.yml
$ git commit -m "Updating environment.yml config file"
The --prune option tells Conda to remove any dependencies that are no longer required from the environment.
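As a hypothetical example of the third scenario above, suppose you replace one plotting library with another (plotly is used here purely as an illustration). After editing, environment.yml might look like this, and the update command above would install plotly and prune the old package along with its now-unused dependencies:
name: project-env
dependencies:
- ipython
- pandas
- pip
- plotly
- python
- scikit-learn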
Rebuilding a Conda environment from scratch
When working with environment.yml files it is often just as easy to rebuild the Conda environment from scratch whenever you need to add or remove dependencies. To rebuild a Conda environment from scratch you can pass the --force option to the conda env create command, which will remove any existing environment directory before rebuilding it using the provided environment file.
$ conda env create --name project-env --file environment.yml --force
Add Dask to the environment to scale up your analytics
Add dask to the scikit-learn-env.yml environment file and update the environment. Dask provides advanced parallelism for data science workflows, enabling performance at scale for the core Python data science tools such as NumPy, Pandas, and Scikit-Learn.
Solution
The scikit-learn-env.yml file should now look as follows.
name: scikit-learn-env
dependencies:
- dask=2022.7
- dask-ml=2022.5
- ipython=8.8
- matplotlib=3.6
- pandas=1.5
- pip=22.3
- python=3.10
- scikit-learn=1.2
You could use the following command, which will rebuild the environment from scratch with the new Dask dependencies:
$ conda env create --name scikit-learn-env --file scikit-learn-env.yml --force
Or, if you just want to update the environment in place with the new Dask dependencies, you can use:
$ conda env update --name scikit-learn-env --file scikit-learn-env.yml --prune
You would then add and commit the changes to scikit-learn-env.yml to keep the changes under version control.
$ git add scikit-learn-env.yml
$ git commit -m "Updating scikit-learn-env.yml with dask"
Installing via pip in environment.yml files
Since you write environment.yml files for all of your projects, you might be wondering how to specify that packages should be installed using pip in the environment.yml file. Here is an example environment.yml file that uses pip to install the kaggle and yellowbrick packages.
name: example
dependencies:
- jupyterlab=1.0
- matplotlib=3.1
- pandas=0.24
- scikit-learn=0.21
- pip=22.3
- pip:
  - kaggle
  - yellowbrick==1.5
Note two things:
- pip is installed as a dependency under conda first (under the dependencies section) with an explicit version (not essential).
- Following this there is then an entry for - pip: and under this is another list, indented further, which uses a double ‘==’ instead of a single ‘=’ for the explicit version that pip will install.
In case you are wondering, the Yellowbrick package is a suite of visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. Recent versions of Yellowbrick can also be installed using conda from the conda-forge channel.
$ conda install --channel conda-forge yellowbrick=1.2 --name project-env
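As an aside, the pip: entry can also point at a separate pip requirements file, which is convenient if your project already maintains one. A minimal sketch (requirements.txt here is just the conventional name):
name: example
dependencies:
- python=3.10
- pip=22.3
- pip:
  - -r requirements.txt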
Key Points
Sharing Conda environments with other researchers facilitates the reproducibility of your research.
Create an environment.yml file that describes your project’s software environment.