Machine Learning setup using Docker and Python

03-May-20 · 8 min read
python docker

Photo by Natalie Scott on Unsplash


Background

I have the Anaconda distribution of Python on my MacBook Air, which I installed when I started learning Python. Recently, I bought a desktop PC that dual-boots Windows and Ubuntu. So, I was looking for a development environment that I could seamlessly move across these three platforms. I had been reading about containerisation using Docker and thought of giving it a try. This is what I am trying to achieve on my MacBook Air:

  • Create a Docker image that I can share across my Mac/Windows/Ubuntu platforms
  • Use a light-weight Unix base
  • Create a project-based environment so that I can streamline the installation of Python packages and their dependencies
  • Install Jupyter and maybe other IDEs like PyCharm/VS Code for development
  • Get rid of the Anaconda installation on my Mac

Key Docker concepts

There are many online resources explaining what Docker is. The ones I used for reference are listed in the Summary section of this post. Here are some basic concepts:

  • As per the official Docker docs, Docker is a platform for developers and sysadmins to build, run, and share applications with containers. The use of containers to deploy applications is called containerization.
  • A container runs on the Linux operating system (OS)
  • It is similar to a VM in that it lets you run another OS inside your host machine's OS (which can be MacOS, Windows or another Linux). But instead of a hypervisor, a Docker container shares the kernel of the host OS, making it light-weight
  • An image is a read-only template with instructions for creating a Docker container. Often, an image is based on another image, with some additional customization.
  • A container is a runnable instance of an image. You can create, start, stop, move, or delete a container using the Docker API or CLI. Similar to object-oriented programming, it is like an object that is instantiated from a class (see the sketch after this list).
  • The image and the container both have names
  • A Dockerfile is a simple text file that has instructions on how to build your customised image FROM a base image
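To make the image/container distinction concrete, here is a minimal sketch using the standard Docker CLI (hello-world is just an example image):

# Pull a read-only image (the template)
docker pull hello-world
# Create and run a container (a runnable instance of that image)
docker run --name demo hello-world
# The image is still listed ...
docker images
# ... and so is the now-stopped container instance
docker ps -a
# Removing the container does not remove the image
docker rm demo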

Build attempts

The first step was to choose a base image. As I needed to run Jupyter, below are some options that I tried.

  1. continuumio/miniconda3
  • Provided by continuumio
  • Runs Debian and Python version 3.5
  • Then I used conda to install jupyter (a sketch of this attempt is shown after the list)
  • Image size : 2GB / Build time : 4.5mins
  • Verdict : Not chosen, as miniconda installs several packages that I may not need.
  2. python:3.8.2-slim-buster
  • The official Python image provided by Docker
  • Runs Debian (buster) and Python version 3.8.2
  • Verdict : Chosen; this is the base image used in the Dockerfile below.
  3. jupyter/base-notebook
  • Provided by jupyter
  • Runs Ubuntu and Python version 3.7
  • Image size : 650MB / Build time : 1min
  • Verdict : Not chosen, as there is no control over the OS or the version of Python.
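The miniconda attempt above is only described in words; here is a minimal sketch of what that Dockerfile may have looked like, assuming conda is already on the PATH in the continuumio/miniconda3 image:

FROM continuumio/miniconda3
# conda pulls in many dependencies here, hence the ~2GB image
RUN conda install -y jupyter
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]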

Dockerfile

After trying all of the above, I decided to use the official Debian-based Python image provided by Docker and then used pip3 to install Jupyter. My final Dockerfile looked like this:

FROM python:3.8.2-slim-buster
RUN apt-get update && \
    pip3 --no-cache-dir install jupyter numpy
EXPOSE 8888
WORKDIR /project-home
VOLUME /project-home
CMD ["jupyter", "notebook", "--ip='*'", "--port=8888", "--no-browser", "--allow-root"]
  • FROM python:3.8.2-slim-buster
    • The base image that will be used
    • python is the repository and 3.8.2-slim-buster is the tag, which I copied from Docker Hub
  • RUN
    • Updates the package index and installs the Python packages
    • && chains the commands so that they run in a single image layer, and \ continues the command on a new line
  • EXPOSE 8888
    • Documents the port on which the container will listen for connections; the actual mapping to a host port is done with -p at run time
  • WORKDIR /project-home
    • Creates the folder inside the container
    • The Jupyter notebook will start in this folder instead of the root directory
  • VOLUME /project-home
    • Optional, but it is the mount point on which the host folder will be mounted into the container
  • CMD ["jupyter", "notebook", "--ip='*'", "--port=8888", "--no-browser", "--allow-root"]
    • The command that runs the Jupyter notebook when the container starts (a quick sanity check of the built image is sketched below)
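Before wiring up ports and volumes, the image can be sanity-checked by overriding CMD with a one-off command; a minimal sketch, assuming the image is built with the name used in the next section:

# Print the installed Jupyter version and exit
docker run --rm debian-pip-jupyter jupyter --version
# Confirm that numpy imports correctly
docker run --rm debian-pip-jupyter python3 -c "import numpy; print(numpy.__version__)"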

Build the image

  • Save the Dockerfile in the project folder and then, at the terminal, run:
  • docker build -t debian-pip-jupyter:latest .
    • Builds the image
    • debian-pip-jupyter is the repository name for the image and latest is the tag. If -t is not specified, the image is built without a name and shows up as <none>:<none> in the image list
    • . (current directory) is the location of the Dockerfile
  • docker images or docker image ls
    • Lists the images
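Since one of my goals is to share the image across my Mac/Windows/Ubuntu platforms, it is worth noting that a built image can be moved between machines without a registry; a minimal sketch using docker save and docker load (the tar file name is just an example):

# On the source machine : export the image to a tar archive
docker save -o debian-pip-jupyter.tar debian-pip-jupyter:latest
# Copy the archive to the target machine, then import it there
docker load -i debian-pip-jupyter.tar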

Create the container and run Jupyter

  • An image has a name, and so does a container. We can either specify it using the --name option or Docker will generate one on its own

  • docker run [OPTIONS] image-name [COMMAND]
    • -d Detached (runs in background)
    • -a Attach to STDOUT/STDERR
    • -i Attach STDIN
    • -t Allocates pseudo-TTY
    • --name [NAME] Set the container name
  • docker run -p 8888:8888 -v "/Mac Backup/OneDrive/Docker/images/ml-project":/project-home debian-pip-jupyter
    • -p maps the local host port (before the colon) to the container port that was EXPOSEd in the Dockerfile
    • -v mounts the local directory (before the colon) onto the /project-home directory in the container
      • This is needed so that I can continue to keep my source code/data outside of the container, on my local host machine
      • On my Mac I got this error: "The path /Mac Backup/OneDrive/Docker/images/ml-project is not shared from OS X and is not known to Docker. You can configure shared paths from Docker -> Preferences... -> File Sharing." So I added the Docker folder under File Sharing in the Docker preferences
    • Running this command gives the URL to open Jupyter, e.g. http://127.0.0.1:8888/?token=afee17bc6a9d95e581aa1a269718f993ae3a61a07e308290
  • docker ps -a
    • Gives the list of all containers (including stopped ones)
    • The NAMES shown here can also be used to start a container at a later stage
  • docker start tender_goldwasser
    • Restarts the stopped container instance (tender_goldwasser is a name that Docker generated on its own)
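When a restarted container runs in the background, the Jupyter URL (with its token) can be recovered from the container logs; a minimal sketch using the auto-generated name from above:

docker start tender_goldwasser
# The log repeats the http://127.0.0.1:8888/?token=... URL printed at startup
docker logs tender_goldwasser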

Housekeeping

  • docker container prune
    • Remove all stopped containers
  • docker stop $(docker ps -a -q) && docker rm $(docker ps -a -q)
    • Stop and remove all containers
  • docker stop sleepy_diffie
    • Stop an unresponsive/mistakenly created container
  • docker rmi $(docker images -a -q)
    • Removes all images
    • All the Docker images on a system can be listed by adding -a to the docker images command. Once you’re sure you want to delete them all, you can add the -q flag to pass just the Image IDs to docker rmi
    • Dangling images are not referenced by other images and are safe to delete.
  • docker system prune -a
    • Deletes all dangling data: stopped containers, volumes not used by any container, and dangling images. With the -a option it also removes all unused images, not just dangling ones
  • <none> images
    • docker images -a will list many images with <none> as the repository and tag
    • These are the intermediate images that Docker creates, but they don’t occupy extra space as their layers are shared
    • But if such images are listed by the plain docker images command, then they are called dangling images and will need to be pruned
    • Docker stores the layers in a graph database, available under /var/lib/docker/graph
  • Not tried
    • docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
    • docker images | grep "none"
    • docker rmi $(docker images | grep "none" | awk '/ / { print $3 }')
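Before and after any of the pruning commands above, the actual disk usage can be checked with Docker's built-in report; a minimal sketch:

# Summary of space used by images, containers and local volumes
docker system df
# Per-image / per-container breakdown
docker system df -v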

Random notes

  • --rm : In docker run, removes the container after you exit
  • CTRL + P + Q : Releases the terminal but the container will keep running
  • We can also specify the CMD in the docker run command
  • docker run -it image-name "/bin/sh"
    • Creates an instance, assigns a TTY and keeps STDIN open (interactive mode)
    • Since the /bin/sh command is available in Linux, this will let us enter the shell of the container
  • The docker build time can be measured using time docker build -t debian-pip-jupyter:latest .
  • The --no-cache-dir option in pip install stops pip from caching the downloaded package files, which keeps the image smaller
  • Start a Jupyter Notebook server and interact with Miniconda via your browser:
    • docker run -i -t -p 8888:8888 continuumio/miniconda3 /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8888 --no-browser"
    • View the Jupyter Notebook by opening http://localhost:8888
  • conda vs pip / virtualenv
    • If you have used pip and virtualenv in the past, you can use conda to perform all of the same operations.
    • pip is a package manager, and virtualenv is an environment manager; conda is both.
    • pip installs all Python package dependencies required, whether or not those conflict with other packages previously installed. So a working installation of, for example, Google TensorFlow can suddenly stop working when a user pip-installs a new package that needs a different version of the NumPy library.
    • More insidiously, everything might still appear to work, but the user gets different results, or is unable to reproduce the same results elsewhere because the user did not pip-install the packages in the same order.
    • Conda analyzes the user’s current environment, everything that has been installed, and any version limitations that the user specifies (e.g. if the user only wants tensorflow >= 2.0), and figures out how to install compatible dependencies.
    • Otherwise, it will tell the user that what he or she wants can’t be done. pip, by contrast, will just install the package the user specified and any dependencies, even if that breaks other packages (see the sketch after this list).
    • Source : Wiki
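To make the pip vs conda contrast concrete, here is a minimal sketch; the package names and version pins are just examples:

# pip installs exactly what is asked for, even if it breaks an already-installed package
pip install "numpy==1.16.0"
# conda first solves the whole environment and refuses requests it cannot satisfy
conda install "tensorflow>=2.0"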

