About Darth Vader of Conda Envs

There are two types of data scientists, those who take the time to master Conda and those who don’t (and cry in the corner because of it).

This is the first article in the WhiteBox toolkit series, where we tell you in great detail about the tools we use in our daily work.

This article is about Conda, which we use to install and manage Python and its libraries on our systems. We use it both in development and production, and we firmly believe that Conda stands out over alternative tools such as virtualenv, Poetry, pyenv, or Pipenv.

We don’t use Conda out of habit. We have tried the alternatives, but we always come back to Conda because it is arguably the only full-featured solution available.

Based on my 6+ years of experience doing data science, Conda (and virtual environments in general) is a tool that is often poorly understood. You’d be surprised how many good professionals, even with 5+ years of experience, still struggle with messy, broken, and almost unusable Python installations because of this.

The purpose of this article is to put an end to this madness once and for all. If you come away with a better understanding of virtual environments, our goal is met.

Roughly, your Python environment (image from XKCD: xkcd.com/1987)

1. Virtual environment

First, let’s take a closer look at what a virtual environment is and isn’t with some examples.

1.1 Use Cases

Imagine that you want to start a new data science project. You want to:

  • Load some data from a .csv file.

  • Do some transformations (cleanup, aggregation, etc.).

  • Train a machine learning model.

To do this, you’ll use Python (or R), as well as some libraries. In Python, you need:

  • Python itself

  • pandas

  • scikit-learn

On most Unix-like operating systems (Linux distributions and macOS), Python already ships with the operating system. Just open a terminal and run: which python3.

Which python3 command

As you can see, the output from this command shows that Python has been installed on our system, specifically at the /usr/bin/python3 location. Python is just an executable binary on our system.
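
The same check can be made from inside Python. The snippet below is a minimal sketch (the function name interpreter_report is made up for illustration) that reports which interpreter is running and which python3 the shell would find on the PATH:

```python
import shutil
import sys


def interpreter_report():
    """Describe the Python interpreter that is running this code."""
    return {
        # Path of the interpreter executing this code.
        "executable": sys.executable,
        # Root folder of the installation (or environment) it belongs to.
        "prefix": sys.prefix,
        # First python3 found on PATH, like `which python3` (None if absent).
        "python3_on_path": shutil.which("python3"),
    }
```

Calling interpreter_report() with the system Python would typically show an executable under /usr/bin, while inside a virtual environment it points into that environment’s folder.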

This installation of Python is called the system Python, which means that this executable is used by the operating system to do a lot of things. For example, if you open a file explorer, it might use the system Python under the hood to list files and folders. That’s why we want to keep this Python installation clean and working flawlessly.

If the system Python breaks while you are installing a library such as pandas or scikit-learn (and trust me, shit happens), the probability of running into serious trouble (complex fixes or even reinstalling the OS) is high. This is one of the reasons virtual environments exist and are so popular.

Now imagine that you want to install the latest version of Apache Airflow (a Python library for orchestrating workflows) on Python 3.8, but your system is running Python 3.7. In this case, a virtual environment is also your best friend.

Typical benefits of a virtual environment are:

  • You can experiment without fear. Install and uninstall libraries, and if something breaks, you can just delete the environment and recreate it without any risk.

  • You can create as many virtual environments as you need. For example, create a virtual environment for each project or for each application you want to run in isolation from the rest of the system (Jupyter, Apache Airflow, etc.).

  • Some libraries may have conflicting requirements (for example, one library requires Python 3.6, while another unmaintained legacy library still requires Python 2.7); they can be installed in different virtual environments.

1.2 Definition

A virtual environment is simply an isolated installation of Python and its libraries. Later in this article, I’ll explain that virtual environments are not limited to Python and can hold any software.

A virtual environment

For example, in the figure above, we have

  • A system Python (version 3.7).

  • A Conda installation with 4 virtual environments: 2 Python, 1 R, and 1 Java.

We’ll look at what this diagram means in more detail, but for now, you should have a basic understanding of what virtual environments are.

When Conda creates and manages our virtual environments, they are completely independent: they have no relationship to the system Python or to each other. Other virtual environment managers do not provide this independence and link to the system Python, which limits your choice of Python versions.

2. Myths

If you’re reading this, you probably already have some knowledge about virtual environments (or at least, you think you do). Let’s start by debunking some myths about virtual environments and Conda that you’re probably already familiar with.

2.1 Conda and Anaconda (or Miniconda) are the same thing: False

Conda is a virtual environment manager that allows you to create, delete, or package virtual environments and install software, while Anaconda (and Miniconda) includes Conda as well as several pre-downloaded libraries. Miniconda includes only the libraries Conda needs to work, whereas Anaconda includes over 500 MB of libraries. Look at the chart below.

Conda, Miniconda and Anaconda

At this point, inexperienced users think: “OK, let’s install Anaconda (instead of Miniconda) or I’ll lose access to some libraries”. Hmm… This is not true at all. See the next section for more information.

2.2 You must install Anaconda instead of Miniconda to access more libraries: False

The only difference between Anaconda and Miniconda is that Anaconda comes with many libraries pre-downloaded, so when you run conda install pandas, the library is already on disk and ready to install.

This is not a huge advantage over Miniconda:

  • Fast Internet connections are now available to most workstations and servers. When you try to install a new library, Conda first checks to see if it is already downloaded/cached on your system, and if not, downloads it from the Internet.

  • In addition, library versions are constantly updated, so Anaconda’s 500 MB library cache may become outdated within hours.

  • Of all the libraries that Anaconda pre-installs, you may only need a small portion to get started on your project.

To be clear: Anaconda and Miniconda have access to the same catalog of libraries. Installing Anaconda has no advantage over Miniconda (unless you want to lose 500 MB of hard disk storage). Just install Miniconda and have Conda (remember, the virtual environment manager) download and install the latest version of each package from the web as needed.

2.3 You can use Conda or pip, but not both: False

There is a widespread belief that Conda and pip are substitutes or competitors, which is not true. Conda and pip have different goals, and they work pretty well together!

Remember from earlier in this article that Conda is a virtual environment manager, a piece of software that does two things:

  1. Manage virtual environments: It creates, deletes, or packages virtual environments.

  2. Install libraries from Conda repositories (called channels): This allows software to be installed in a virtual environment. This includes entire programming languages such as Python, R, or Java, libraries such as pandas or scikit-learn, and even software such as htop, tmux, or a full-fledged PostgreSQL database.

The only point of competition between Conda and pip is a subset of point 2: installing Python libraries. When you create a new virtual environment, you can install Python libraries using Conda or pip. Most of the time, it makes no difference which one you use. The pip catalog (PyPI) is more complete, while Conda’s dependency resolver is more powerful. In a Conda virtual environment, you can install libraries using both Conda and pip.

2.4 Conda only works with Python virtual environments: False

This is a little-known feature of the Conda virtual environment that sets it apart from the competition.

Most people think of Conda as being Python-specific, but in reality a Conda virtual environment is a general-purpose environment where you can install almost anything. Do you need R and RStudio? You got it. Need Java? You can have it. You can even install software that has traditionally been installed with apt or brew, such as htop or tmux.

2.5 venv is like Conda, but lightweight: False

There are two main differences between venv and Conda:

  • Conda is more than just a Python virtual environment manager. It is a general-purpose virtual environment manager that supports much more than Python. The Python installed by Conda is a true Python executable, not a link to your system Python (as is the case with other alternatives), so it can be any version you want, regardless of the system Python.

  • venv is limited to installing packages using pip, whereas with Conda you can install software using both pip and Conda packages.

3. Installing Conda

3.1 Linux and macOS

If you’re using Linux or macOS, the procedure is very similar. Follow these steps carefully (don’t skip anything) to avoid post-installation problems.

Miniconda Linux and MacOS installer

  1. Go to: https://docs.conda.io/en/latest/miniconda.html.

  2. Download the latest version for Linux or macOS. My recommendation is to use the bash installer (.sh file). On macOS, you can also choose the graphical installer (.pkg). To download, just click on the link, or if you don’t have access to a graphical interface, use wget. If you need Python 3.7, don’t worry that the installer ships Python 3.9. This is just the Python version of the (base) environment, the version used internally by Conda, not the Python version of your virtual environments (you can choose which version you want).

  3. Execute the bash installer from the terminal (it’s just a bash script): bash Miniconda3-py39_4.9.2-Linux-x86_64.sh.

  4. Press Enter to scroll through the T&C portion of the installation until the installer asks you to answer yes/no. Answer yes.

  5. Now, the installer asks for an installation location. My advice: keep the default location: /home/<your_user>/miniconda3.

  6. The installer will now ask you if you want to initialize Conda. Answer “yes” (be careful, because the default answer is “no”). After you initialize Conda, it is available in every standard terminal you open. If you do not initialize Conda, you may not be able to access it when you open a terminal (the Conda executable will not be on your PATH).

  7. Close the terminal.

Finally, you can check that Conda is properly installed in this way.

  1. Open a new terminal.

  2. You should see (base) in your prompt. This means that Conda has been properly installed and initialized, and that the default environment, called base, has been activated.

  3. Type conda in your terminal. You should see Conda’s help.

  4. Type conda info in your terminal. You should see details about your current Conda installation.
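
If you prefer to script this check, the sketch below does roughly the same thing from Python. The function names are invented for this example, and it assumes the conda executable is reachable through your PATH, which is what initializing Conda is meant to set up:

```python
import shutil
import subprocess


def find_conda():
    """Return the path of the conda executable on PATH, or None."""
    return shutil.which("conda")


def conda_version():
    """Return the text printed by `conda --version`, or None without conda."""
    conda = find_conda()
    if conda is None:
        return None
    result = subprocess.run(
        [conda, "--version"], capture_output=True, text=True, check=True
    )
    # Fall back to stderr in case the version is not printed on stdout.
    return (result.stdout or result.stderr).strip()
```

If find_conda() returns None in a freshly opened terminal, the most likely cause is that Conda was not initialized during installation.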

Conda prompt modifier

Conda help command

Conda info command

3.2 Windows

Miniconda Windows installer

If you’re using Windows, setup is just a standard graphical setup.

⚠️ Warning: Windows is not the standard ⚠️

I must warn you that Windows is not the standard platform for developing data science projects (or for most software development tasks), so you can expect a lot of things not to work as smoothly as they do on Linux. You’ve been warned! If you are interested in setting up a professional development environment for data science projects, check out this post.

  1. Go to: https://docs.conda.io/en/latest/miniconda.html.

  2. Download the latest version for Windows. Make sure you pick the installer with the latest Python version. If you need Python 3.7 or 3.8, don’t worry that the installer ships Python 3.9. This is just the Python version of the base environment, the version used internally by Conda, not the Python version of your virtual environments (you can choose which version you want).

  3. Execute the setup program (.exe).

  4. Click Next > until you are asked to select the installation scope. Select the Just Me option. In most cases, you don’t need to install Conda for all users (which requires administrator privileges), but just for you.

  5. Now, setup is asking for an installation location. My advice is to keep the default location: C:\Users\<your_user>\miniconda3.

  6. Setup will now ask you to set advanced options. Do not add Conda to the Windows PATH environment variable, because things work differently from other operating systems and you will use a special terminal to access Conda.

Conda installation scope (W10)

Conda installation location (W10)

Conda installation advanced options (W10)

Finally, you can check that Conda is properly installed in this way.

Anaconda Prompt (W10)

  1. Open the Start menu in Windows and look for an application called Anaconda Prompt. Open it.

  2. Type conda in the Anaconda Prompt terminal. You should see Conda’s help.

  3. Type conda info in your terminal. You should see details about your current Conda installation.

4. Virtual environment management

In this section, I will teach you best practices for managing virtual environments.

Warning: Avoid using Anaconda Navigator ☠️

Anaconda Navigator

There is a very popular graphical tool (especially on Windows) that is just a front end for the Conda CLI. It doesn’t work very well, and unless it has improved a lot recently, I would discourage you from using it because it’s responsible for a lot of broken environments. Stick to the CLI to manage your Conda environments.

4.1 Base Environment

During the Conda installation, a default virtual environment called base is created. This environment is used internally by Conda to work. Conda itself is installed in this environment as a library 😊. Although you can use this environment to install libraries, I recommend against it. Don’t mess with the base environment.

4.2 Creating a Virtual Environment

To create a virtual environment, just run:

conda create -n <environment_name>

Then confirm that you want to create the environment by answering y (yes).

This will create an empty virtual environment. The word “empty” deserves to be bold and underlined because I’m serious. It’s empty. Nothing is installed in this environment, and yes, that means Python is not installed in this environment either.

⚠️ Warning: Do not run pip install in an empty environment ⚠️

In an empty virtual environment, neither Python nor pip is installed. If you run pip install there, you are probably installing the library either in the system Python (terrible 😱) or in the base environment (not so terrible, but not good either).
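
One way to catch this mistake before it happens is to check where the pip on your PATH actually lives. The following is a rough sketch with hypothetical helper names; it relies on the CONDA_PREFIX variable, which Conda sets to the folder of the active environment:

```python
import os
import shutil
from pathlib import Path


def belongs_to_env(executable, env_prefix):
    """True if `executable` lives somewhere inside the folder `env_prefix`."""
    if not executable or not env_prefix:
        return False
    exe = Path(executable).resolve()
    prefix = Path(env_prefix).resolve()
    return prefix in exe.parents


def pip_is_safe():
    """Check that the pip found on PATH belongs to the active environment."""
    # CONDA_PREFIX points at the active environment's folder, if there is one.
    return belongs_to_env(shutil.which("pip"), os.environ.get("CONDA_PREFIX"))
```

If pip_is_safe() returns False, pip install would write somewhere other than the active environment, so install Python in the environment first.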

ℹ️ Info: Virtual environments actually live in folders ℹ️

User-created Conda virtual environments (everything except the base environment) are physically located in folders at the following path (on Linux/macOS):

/home/<your_user>/miniconda3/envs/<environment_name>
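
Because an environment is just a folder, you can inspect it like any other directory. As a small sketch (helper names are illustrative, and the default Miniconda layout above is assumed), the code below builds that path and checks for the conda-meta directory that Conda keeps inside each environment:

```python
from pathlib import Path


def env_folder(conda_root, env_name):
    """Default location of a user-created environment under a Conda root."""
    return Path(conda_root) / "envs" / env_name


def looks_like_conda_env(folder):
    """Conda keeps its bookkeeping in a conda-meta directory in each env."""
    return (Path(folder) / "conda-meta").is_dir()
```

For example, looks_like_conda_env(env_folder(Path.home() / "miniconda3", "my_env")) tells you whether an environment named my_env (a made-up name) exists in the default location.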

Now that you’ve created your first virtual environment, let’s move on to the next step… Activate it.

4.3 Activating and Deactivating a Virtual Environment

To activate a virtual environment, use the activate command.

conda activate <environment_name>

Activating an environment means that, from that moment on, every action you perform (installing libraries, running Python, and so on) is performed in the active environment.

You can check your active environment in two ways.

  • Check the prompt: the name of the active environment appears in parentheses.

Conda activate command

  • Check the output of the command: conda env list

Conda env list command
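
A third option, useful inside scripts, is to read the environment variables that Conda exports on activation. This is a minimal sketch (the function name is made up) based on CONDA_DEFAULT_ENV and CONDA_PREFIX:

```python
import os


def active_conda_env(environ=os.environ):
    """Return (name, folder) of the active Conda environment, or None."""
    # Both variables are exported when an environment is activated.
    name = environ.get("CONDA_DEFAULT_ENV")
    prefix = environ.get("CONDA_PREFIX")
    if not name or not prefix:
        return None
    return name, prefix
```

The environ parameter exists only so the function can be tried with a plain dictionary; called without arguments it reads the real process environment.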

To deactivate a virtual environment, use the deactivate command.

conda deactivate

After deactivation, the prompt displays the name of the previously active environment (usually the base environment).

☠️ Warning: Avoid stacking environments ☠️

You have to be very careful when activating environments, and always remember to deactivate an environment once you are done with it, because environments can be stacked. This means you can activate one environment on top of another. This behavior (useful in very special cases) leads to chaos in a very short time: libraries installed in the different environments get mixed up, and you no longer know where they are installed.
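
To see how many activations are currently layered in your shell, you can look at the CONDA_SHLVL variable, which, as far as I know, Conda increments on every activation and decrements on every deactivation. A small sketch with an invented function name:

```python
import os


def activation_depth(environ=os.environ):
    """Number of nested Conda activations in this shell (0 if none)."""
    # A missing or malformed CONDA_SHLVL is treated as "nothing activated".
    try:
        return int(environ.get("CONDA_SHLVL", "0"))
    except ValueError:
        return 0
```

A depth of 1 usually means only base is active, and 2 means one environment was activated on top of it. Higher numbers are a hint that you forgot to deactivate something.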

Now that you know how to activate and deactivate the environment, let’s move on to the next step… Install Python in an empty environment.

4.4 Installing Python in a Virtual Environment

Remember, when you create virtual environments, they are empty. If you need a Python virtual environment, your first task is to install Python in the empty virtual environment.

To install Python in an empty virtual environment, run the command (don’t forget to activate the environment first).

conda install python

This command will install the latest version of Python in the Conda repository (3.9 as of this writing). If you need an earlier version of Python, you can specify it directly in the command.

conda install python=3.8

☠️ Warning: latest Python version ☠️

Be careful when installing the latest version of Python. Most libraries take months to support a new Python version and may not work properly on the latest one. I usually stick to a slightly older version (1-2 versions behind) on all my projects.
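
After installing Python, it is worth confirming that the interpreter in the environment is the version you asked for. Here is a short sketch (python_matches is a hypothetical helper) that compares the running version against an expected (major, minor) pair:

```python
import sys


def python_matches(expected, actual=sys.version_info):
    """True if the running Python matches `expected`, for example (3, 8)."""
    # Only compare as many components as were given in `expected`.
    return tuple(actual[: len(expected)]) == tuple(expected)
```

Running python_matches((3, 8)) inside the activated environment should return True if conda install python=3.8 did what you expected.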

Now that you have a working virtual environment with Python installed, let’s move on to the next step… Install additional libraries.

4.5 Installing R in a Virtual Environment

If you need R in your virtual environment, run the following command instead of the previous one.

conda install r-essentials

This command will install R and some basic R packages.

4.6 Installing Libraries

4.6.1 Installing from Conda or pip

To install a library in a Conda virtual environment, you usually run this command.

conda install <library_name>

If you need a specific version of a library, just specify it.

conda install tensorflow=2.4.1

Keep in mind that you can install not only Python libraries but also many other programs using the native Conda installation command.

  • A process viewer: conda install htop.

  • A terminal multiplexer: conda install tmux.

  • PostgreSQL: conda install -c conda-forge postgresql.

  • And a lot of other software…

For Python libraries only: if you have Python installed in a virtual environment, you can also use pip to install them, for example.

pip install <library_name>

If you need a specific version, you can specify it like this.

pip install tensorflow==2.4.1
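
Whichever installer you use, you can confirm from Python which version actually ended up in the environment. The sketch below uses the standard importlib.metadata module (Python 3.8 or newer); the wrapper name installed_version is my own:

```python
from importlib import metadata


def installed_version(library_name):
    """Return the installed version of a library, or None if it is missing."""
    try:
        return metadata.version(library_name)
    except metadata.PackageNotFoundError:
        return None
```

For instance, installed_version("tensorflow") should report the pinned version after one of the commands above, or None if the library is not visible to the current interpreter.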

4.6.2 Conda Channels

The software repositories Conda installs from are called channels. A channel is like a folder that contains multiple libraries. The most important channels are:

  • defaults (Anaconda): The default Conda channel, maintained by Anaconda, is stable and is where the most trusted software resides. If you do not specify a channel, Conda will look for libraries in this channel.

  • conda-forge: A community-driven channel with the latest versions of libraries. If you need a library that’s not in the default channel, you can probably find it here.

Anyone can have their own Conda channel. Many companies have their own channels where they upload software. For example, Intel has its own Conda channel with libraries optimized to run on Intel hardware: https://anaconda.org/intel/repo

When you install Python in an empty environment, you are unknowingly downloading and installing Python from the default Anaconda channel.

If you need to install a library from a particular channel, use the -c argument followed by the channel name, like this.

conda install -c <channel_name> <library_name>

For example, to install the Plotly plotting library from the official Plotly channel:

conda install -c plotly plotly
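
If you drive installations from a script, it can help to assemble the command in one place. This is only a sketch: conda_install_args is a made-up helper that builds the argument list for the command shown above, adding the --yes flag so Conda does not stop to ask for confirmation:

```python
def conda_install_args(library, version=None, channel=None):
    """Build the argument list for a `conda install` call."""
    args = ["conda", "install", "--yes"]
    if channel is not None:
        args += ["-c", channel]
    # Conda pins a version with a single "=", whereas pip uses "==".
    args.append(f"{library}={version}" if version else library)
    return args
```

The resulting list can be passed to subprocess.run(args, check=True) from a shell where the target environment is active.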

To search for a library in a conda channel, you can use the search command.

conda search -c <channel_name> <library_name>

This command returns all available versions of the library and their corresponding builds.

Conda search command

Now, let’s take a look at some of the trickier common libraries and packages you might want to install in your Conda virtual environment…

4.6.3 Installing Jupyter

JupyterLab

Install the latest version of Jupyter in a virtual environment.

conda install -c conda-forge jupyterhub jupyterlab nodejs nb_conda_kernels

This installation includes the new front end (JupyterLab) and other tools: authentication (JupyterHub), support for installing extensions (Node.js), and a library for automatic kernel discovery (nb_conda_kernels).

4.6.4 Installing Spark

Install Spark in a virtual environment.

conda install -c conda-forge openjdk pyspark

This installation includes Spark (including PySpark) and a Java Virtual Machine (required to run Spark).

4.6.5 Installing TensorFlow with GPU Support

conda install -c conda-forge tensorflow-gpu

If you have the NVIDIA driver installed correctly, this installation will work. It will set up the rest of the NVIDIA stuff for you (CUDA, cuDNN, etc.).

4.7 Listing the installed libraries

To get a complete list of the software installed in the Conda virtual environment, use the list command (remember to activate the required virtual environment beforehand).

conda list

Conda list command
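
The same listing is available in machine-readable form through conda list --json, which is handy for comparing environments in a script. Below is a sketch with hypothetical function names; it assumes each JSON entry carries a name and a version field and that conda can be launched directly from your PATH (as on Linux/macOS):

```python
import json
import subprocess


def parse_conda_list(json_text):
    """Map package name to version from the output of `conda list --json`."""
    return {pkg["name"]: pkg["version"] for pkg in json.loads(json_text)}


def installed_packages(env_name):
    """Run `conda list` for one environment (needs conda on PATH)."""
    result = subprocess.run(
        ["conda", "list", "-n", env_name, "--json"],
        capture_output=True, text=True, check=True,
    )
    return parse_conda_list(result.stdout)
```

Splitting parsing from the subprocess call keeps the parsing step easy to try on saved output without having Conda at hand.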

4.8 Deleting a Virtual Environment

The best thing about virtual environments is that when you don’t need one anymore, or when something goes wrong and it breaks, you can always delete it and recreate it from scratch.

To remove a conda virtual environment, use the remove command.

conda env remove -n <environment_name>

In normal operation, Conda caches the libraries used in your environments so that it does not have to download them again when they are needed in another environment. This also allows Conda to save some disk space.

After an environment is deleted, some libraries are still cached, but are no longer needed. You can clean up your conda cache and free up some precious GB in your brand new SSD with the clean command.

conda clean --all

5. Sharing and packaging of virtual environments

This section details two ways to share and distribute a Conda environment, each with its own advantages and disadvantages.

5.1 Sharing the Conda Environment in YAML File Form

An interesting option for sharing an environment is to encode it as a .yml file. To create a YAML file from an existing environment, run the export command.

conda env export > environment_file.yml

Conda environment YAML file

You can also create a YAML file manually, like this.

name: <environment_name>
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - pip
  - pandas
  - matplotlib
  - scikit-learn
  - pip:
    - lightgbm

Note that the same file can list packages from Conda channels and packages installed with pip.
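
If you maintain several similar environments, you may prefer to generate this file rather than write it by hand. The following is a minimal sketch of that idea; render_environment_yml is an invented helper that only covers the fields used in the manual example (name, channels, Conda packages, and an optional pip section):

```python
def render_environment_yml(name, channels, conda_packages, pip_packages=()):
    """Return the text of a minimal Conda environment file."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {package}" for package in conda_packages]
    if pip_packages:
        # Libraries installed with pip go in a nested list under "pip".
        lines.append("  - pip:")
        lines += [f"    - {package}" for package in pip_packages]
    return "\n".join(lines) + "\n"
```

Writing the returned text to a .yml file gives you something you can pass to conda env create -f. Remember to list pip itself among the Conda packages when you use the pip section.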

To create a conda environment from a YAML file, run the create command, which has the following syntax.

conda env create -f environment.yml

5.2 Packaging a Conda Environment with conda-pack

One nifty feature of Conda, which is not available in competing tools, is conda-pack. At WhiteBox, we use it to deploy our projects in extremely hostile environments, such as Hadoop clusters at large companies, with extreme security measures and no Internet access.

Conda-pack is a library that allows you to package your entire environment into a compressed file (tar.gz) that you just copy and paste somewhere and it will work.

  1. Install conda-pack: The creators of Conda recommend installing this library in the base environment, which you can do by running conda install -c conda-forge conda-pack.

  2. Activate the required environment: conda activate <environment_name>.

  3. Move to the directory where you want to generate the compressed file: cd path/to/desired/directory.

  4. Pack your environment: conda pack. This can take some time, depending on the size of your environment and the number of libraries installed.

  5. Move the compressed file to the destination (for example, over SSH).

  6. Create a folder for the environment: mkdir environment_folder.

  7. Decompress the environment into the folder: tar -xzvf <environment_name>.tar.gz -C environment_folder.

  8. Activate the environment: source environment_folder/bin/activate.

  9. Run conda-unpack to clean up the path prefixes and set everything up.

  10. You’re done!

6. Last word on virtual environments

I hope that with the help of this article, you have gained a deeper understanding of virtual environments and are no longer stuck with broken libraries.

If you missed something, or found some sections confusing, please let us know so we can improve on this post.

If you’re interested in the full data scientist setup we’ve been using for years on many projects, see the article linked below; the setup it describes is battle-tested.

  • Davidadrian. Cc/definitive -…