

Assumed audience: Developers with some knowledge of Docker, Python and Linux. Especially those curious about how Docker’s process isolation works under the hood in Linux.
Containerization - using tools like Docker - is a major trend in modern software development. One of its oft-quoted benefits is the ability to run processes in an "isolated environment".
This post is a hands-on introduction to process isolation. Follow along as we create a bash shell using Python, explore some of its behavioural differences as compared to a bash shell created in a Docker container, and attempt to implement these features.
Create a "container"
Let's write a Python script to run a bash shell. Put the following in a main.py:
import os
# read: https://docs.python.org/3/library/os.html#os.execv
os.execv("/bin/bash", ["/bin/bash"])
os.execv executes the file given in the first parameter, passing it the arguments provided as a list in the second parameter. The full documentation for os.execv is linked in the comment above.
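One detail worth noting: os.execv replaces the current process image, so any Python code after the call never runs. A tiny throwaway example to convince yourself:
import os

os.execv("/bin/echo", ["/bin/echo", "hello from the new process image"])
print("never reached - the Python process has been replaced by echo")
Running our actual main.py drops us straight into a bash shell: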
(/home/chay/byo-docker)> python main.py
root@chay-dev-vm:/home/chay/byo-docker#
Hurray, we now have a container! Sorta.
Namespace Isolation
From now onwards, run your commands as the root user.
Typically, containers have a separate hostname from the host. Let's try to implement this behaviour by changing the hostname inside the "container" we created:
(/home/chay/byo-docker)> sudo su
(/home/chay/byo-docker)> python main.py
root@chay-dev-vm:/home/chay/byo-docker# hostname
chay-dev-vm
root@chay-dev-vm:/home/chay/byo-docker# hostname test
root@chay-dev-vm:/home/chay/byo-docker# hostname
test
root@chay-dev-vm:/home/chay/byo-docker# exit
(/home/chay/byo-docker)> hostname
test
Oh crap, it turns out that whatever we did in this container affected the host! Let's quickly revert it.
(/home/chay/byo-docker)> hostname chay-dev-vm
(/home/chay/byo-docker)> hostname
chay-dev-vm
We want to be able to change things inside our container without affecting the host. In Linux, we can do this by ensuring that our container process is run with separate namespaces.
There are currently eight kinds of namespaces (as of kernel version 5.6), and the one responsible for providing hostname isolation is called UTS, short for UNIX Time-Sharing. We can tell Linux to create a process with an isolated UTS namespace by invoking the clone system call with the CLONE_NEWUTS flag.
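If you are curious which namespaces your current shell already belongs to, they are exposed as symlinks under /proc/self/ns. A throwaway snippet to list them (the exact set depends on your kernel version):
import os

# Each entry is a symlink such as "uts:[4026531838]"; the number identifies the namespace.
for name in sorted(os.listdir("/proc/self/ns")):
    print(name, "->", os.readlink(f"/proc/self/ns/{name}"))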
There is no built-in Python function to perform the clone system call in a convenient way, but fortunately we can call into C libraries from Python using ctypes. This is a little out of scope, so let's abstract the complexity behind a clone function. Create a file called clone.py with the following:
# Adapted from https://gist.github.com/adamhooper/fbbd16d207ddd3680a9aa31508debc4e
import ctypes
import signal
# Clone flags
# https://github.com/torvalds/linux/blob/master/tools/include/uapi/linux/sched.h
CLONE_NEWUTS = 0x04000000
# clone(2) needs a stack for the child; the stack grows downwards,
# so we pass a pointer to the *end* of the buffer.
_CHILD_STACK = ctypes.create_string_buffer(2 * 1024 * 1024)
_RUN_CHILD_STACK_POINTER = ctypes.c_void_p(
    ctypes.cast(_CHILD_STACK, ctypes.c_void_p).value + len(_CHILD_STACK)
)

_CLONE_FLAGS = CLONE_NEWUTS


def clone(fn):
    child_pid = ctypes.CDLL("libc.so.6").clone(
        ctypes.PYFUNCTYPE(ctypes.c_int)(fn),
        _RUN_CHILD_STACK_POINTER,
        _CLONE_FLAGS | signal.SIGCHLD,  # send parent SIGCHLD on exit
        0,
    )
    return child_pid
Back in our main.py, we can now use our clone function to spawn our container with an isolated UTS namespace. Since our function takes a callable as an input, we need to move the os.execv call into a function:
import os
import clone

def exec_bash():
    os.execv("/bin/bash", ["/bin/bash"])

child_pid = clone.clone(exec_bash)
# wait for child to exit
os.waitpid(child_pid, 0)
print("child terminated")
Let's see it at work! Try to change the hostname inside our container again:
(/home/chay/byo-docker)> python main.py
root@chay-dev-vm:/home/chay/byo-docker# hostname
chay-dev-vm
root@chay-dev-vm:/home/chay/byo-docker# hostname test
root@chay-dev-vm:/home/chay/byo-docker# hostname
test
root@chay-dev-vm:/home/chay/byo-docker# exit
(/home/chay/byo-docker)> hostname
chay-dev-vm
Now, changing the hostname inside the container no longer affects the host.
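If you want evidence beyond the hostname test, you can compare the UTS namespace identifiers of the parent and the child directly via /proc/self/ns/uts; once CLONE_NEWUTS is in effect they should differ. A rough sketch (a throwaway script, not part of main.py):
import os
import clone

print("parent UTS namespace:", os.readlink("/proc/self/ns/uts"))

def show_uts_ns():
    print("child UTS namespace: ", os.readlink("/proc/self/ns/uts"))
    return 0

child_pid = clone.clone(show_uts_ns)
os.waitpid(child_pid, 0)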
When you run a container using Docker, the hostname is usually a seemingly random string of 12 characters (the short container ID):
(/home/chay/byo-docker)> docker run --rm -it ubuntu bash
root@a3b1c1b9f086:/#
To implement the same behaviour, we can simply run hostname <random_string> in the container, right before we spawn the shell:
...
import random
import string

container_id = "".join(random.choice(string.ascii_lowercase + string.digits) for i in range(12))

def exec_bash():
    os.system(f"hostname {container_id}")
    os.execv("/bin/bash", ["/bin/bash"])
...
(/home/chay/byo-docker)> python main.py
root@6s2qvm7otw4u:/home/chay/byo-docker#
Now this looks more like a container!
Process Isolation
Try running ps within the container. You should still see all the processes running on the host.
(/home/chay/byo-docker)> python main.py
root@6s2qvm7otw4u:/home/chay/byo-docker# ps
PID TTY TIME CMD
1219384 pts/6 00:00:00 sudo
1219385 pts/6 00:00:00 bash
1219418 pts/6 00:00:00 python
1219419 pts/6 00:00:00 bash
1219444 pts/6 00:00:00 ps
That's not how Docker containers work! We should not be able to see host processes from inside our container. It should look something like this:
(/home/chay/byo-docker)> docker run --rm -it ubuntu bash
root@a3b1c1b9f086:/# ps
PID TTY TIME CMD
1 pts/0 00:00:00 bash
9 pts/0 00:00:00 ps
Notice that the main process running in the container (bash) has a PID of 1. The only processes visible to the container are children of the main command.
Let's print out the PIDs of our python program and the child process.
import os
import clone

print(os.getpid())

def exec_bash():
    print(os.getpid())
    os.execv("/bin/bash", ["/bin/bash"])
...
(/home/chay/byo-docker)> python main.py
1219932
1219933
The namespace that provides processes with an independent set of PIDs is called… PID. If you clicked on the link to the Linux source code earlier, you may have noticed a clone flag called CLONE_NEWPID.
Let's try it out! Add it to the clone flags in our clone.py:
...
# Clone flags
# https://github.com/torvalds/linux/blob/master/tools/include/uapi/linux/sched.h
CLONE_NEWUTS = 0x04000000
CLONE_NEWPID = 0x20000000
...
_CLONE_FLAGS = CLONE_NEWUTS | CLONE_NEWPID # flags are combined using a bitwise-OR operator
...
Now, run the container again:
(/home/chay/byo-docker)> python main.py
3971697
1
root@f6h6yuzeqaww:/home/chay/byo-docker#
Voilà, now the child process has a PID of 1!
Once again, run ps to see if the container is truly isolated from the parent.
root@f6h6yuzeqaww:/home/chay/byo-docker# ps
PID TTY TIME CMD
3971650 pts/0 00:00:00 sudo
3971651 pts/0 00:00:00 su
3971652 pts/0 00:00:00 bash
3971697 pts/0 00:00:00 python3
3971698 pts/0 00:00:00 bash
3971707 pts/0 00:00:00 ps
What's going on? And why is the PID 3971698 instead of 1?
As it turns out, the ps command takes its information from the /proc directory. Since we have not yet isolated our container's filesystem from the host's filesystem, the ps command still shows the processes on the host.
PID namespaces are nested, meaning that when a new PID namespace is created, the parent is able to see all processes in all their descendants. Although our bash process appears as PID 1 from inside the container, it has a different PID from the parent’s perspective.
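You can observe this dual identity directly: on reasonably recent kernels, /proc/<pid>/status contains an NSpid line listing the process's PID in each nested namespace, outermost first. A small sketch of what the parent in main.py could do, right after clone() returns and before waitpid:
# child_pid here is the value returned by clone.clone(exec_bash)
with open(f"/proc/{child_pid}/status") as f:
    for line in f:
        if line.startswith("NSpid:"):
            # prints something like "NSpid:  3971698  1"
            print(line.strip())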
Filesystem Isolation
Let's create a filesystem then! A simple way to do it is to copy the contents of the ubuntu Docker image:
(/home/chay/byo-docker)> mkdir rootfs-example
(/home/chay/byo-docker)> docker export $(docker create ubuntu) | tar -C rootfs-example -xvf -
We are going to make the process in the container use this newly created filesystem as its root filesystem, using the chroot system call.
...
def exec_bash():
    os.system(f"hostname {container_id}")
    os.chroot("/home/chay/byo-docker/rootfs-example")
    os.chdir("/")
    os.execv("/bin/bash", ["/bin/bash"])
...
Let's try running it!
(/home/chay/byo-docker)> python main.py
root@evkxhls2rjne:/# ps
bash: ps: command not found
root@evkxhls2rjne:/# /bin/ps
Error, do this: mount -t proc proc /proc
Why do we need to mount proc when the /proc directory already exists as a folder? According to the proc man page, proc is a pseudo-filesystem: the directory itself is just an empty mount point, and mounting proc onto it is what allows the kernel to expose information about processes and the system through its contents.
...
def exec_bash():
    os.system(f"hostname {container_id}")
    os.chroot("/home/chay/byo-docker/rootfs-example")
    os.chdir("/")
    os.system("/bin/mount -t proc proc /proc")
    os.execv("/bin/bash", ["/bin/bash"])
...
Let's try that again:
(/home/chay/byo-docker)> python main.py
root@boyJBkZJYvVl:/# /bin/ps
PID TTY TIME CMD
1 ? 00:00:00 bash
7 ? 00:00:00 ps
Now, our bash process has PID 1.
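As an aside, shelling out to /bin/mount only works because a mount binary happens to exist in the ubuntu rootfs we copied. If you would rather not rely on that, the mount(2) system call can be invoked through ctypes, just like we did for clone. A rough sketch (the helper name mount_proc is our own); you would call it in exec_bash right after os.chdir("/"):
import ctypes
import os

_libc = ctypes.CDLL("libc.so.6", use_errno=True)

def mount_proc():
    # int mount(const char *source, const char *target, const char *fstype,
    #           unsigned long mountflags, const void *data)
    if _libc.mount(b"proc", b"/proc", b"proc", 0, None) != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))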
Resource Limits
Resource limits in Linux are managed by control groups, more commonly shortened to cgroups. Like proc, this kernel feature is exposed to the user through a filesystem, mounted at /sys/fs/cgroup.
Let’s try using cgroups to limit the amount of RAM that our container has access to. We first create a directory under /sys/fs/cgroup/memory:
(/home/chay/byo-docker)> cd /sys/fs/cgroup/memory
(/sys/fs/cgroup/memory)> mkdir byo-docker
(/sys/fs/cgroup/memory)> cd byo-docker
(/sys/fs/cgroup/memory/byo-docker)>
Note that this newly created directory is magically pre-populated with some special files.
(/sys/fs/cgroup/memory/byo-docker)> ls
cgroup.clone_children
cgroup.event_control
cgroup.procs
memory.failcnt
memory.force_empty
memory.kmem.failcnt
memory.kmem.limit_in_bytes
memory.kmem.max_usage_in_bytes
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.limit_in_bytes
memory.kmem.tcp.max_usage_in_bytes
memory.kmem.tcp.usage_in_bytes
memory.kmem.usage_in_bytes
memory.limit_in_bytes
memory.max_usage_in_bytes
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.soft_limit_in_bytes
memory.stat
memory.swappiness
memory.usage_in_bytes
memory.use_hierarchy
notify_on_release
tasks
The memory limit of this cgroup can be set by writing the desired figure to the memory.limit_in_bytes file. Let’s limit our container to 10 megabytes of memory, and disable swap usage.
(/sys/fs/cgroup/memory/byo-docker)> cat memory.limit_in_bytes
9223372036854771712
(/sys/fs/cgroup/memory/byo-docker)> echo 10M > memory.limit_in_bytes
(/sys/fs/cgroup/memory/byo-docker)> cat memory.limit_in_bytes
10485760
(/sys/fs/cgroup/memory/byo-docker)> echo 0 > memory.swappiness
The next step is to associate our container with this cgroup. This is done by writing the container's PID (as seen from the host) into the tasks file. Start our container in a separate terminal session and note the PID of the parent python process:
(/home/chay/byo-docker)> python main.py
3418583
1
root@tvnqzhwuhe5u:/#
The container process should be the next higher PID, 3418584, but to be sure we can use pgrep to list the child processes of 3418583. We can then bind this PID to the cgroup we created:
(/sys/fs/cgroup/memory/byo-docker)> pgrep -lP 3418583
3418584 bash
(/sys/fs/cgroup/memory/byo-docker)> cat tasks
(/sys/fs/cgroup/memory/byo-docker)> echo 3418584 > tasks
(/sys/fs/cgroup/memory/byo-docker)> cat tasks
3418584
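As an aside, since clone() already returns the child's PID as seen from the host, we could avoid pgrep entirely by having main.py print it, for example:
child_pid = clone.clone(exec_bash)
print(f"container PID (host view): {child_pid}")

# wait for child to exit
os.waitpid(child_pid, 0)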
To test if this is working properly, let’s consume some RAM! A common method is to use head to output a fixed number of bytes from /dev/zero into tail, since tail needs to keep the current line in memory.
Our container filesystem does not have /dev/zero, but we can create it using mknod (1 and 5 are the major and minor device numbers the kernel uses for /dev/zero).
root@tvnqzhwuhe5u:/# mknod /dev/zero c 1 5
root@tvnqzhwuhe5u:/# head -c 8M /dev/zero | tail
root@tvnqzhwuhe5u:/# head -c 11M /dev/zero | tail
Killed
Increase the memory limit to 20 Megabytes, and try again:
(/sys/fs/cgroup/memory/byo-docker)> echo 20M > memory.limit_in_bytes
(/sys/fs/cgroup/memory/byo-docker)> cat memory.limit_in_bytes
20971520
...
root@tvnqzhwuhe5u:/# head -c 11M /dev/zero | tail
root@tvnqzhwuhe5u:/# head -c 18M /dev/zero | tail
root@tvnqzhwuhe5u:/# head -c 21M /dev/zero | tail
Killed
This demonstrates that our cgroup memory limit works! When the process consumes RAM beyond the defined limit, the operating system's Out-Of-Memory (OOM) killer terminates it. Check journalctl to see the logs from the OOM killer:
(/sys/fs/cgroup/memory/byo-docker)> journalctl | tail
...
... kernel: Memory cgroup out of memory: Killed process 3864769 (tail) total-vm:21932kB, anon-rss:18728kB, file-rss:1680kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
To clean up the cgroup we created, we can simply remove the folder:
(/sys/fs/cgroup/memory/byo-docker)> cd ..
(/sys/fs/cgroup/memory)> rmdir byo-docker
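If you would rather not repeat this cgroup bookkeeping by hand on every run, the same steps can be scripted from main.py once clone() has returned. Below is a sketch, not a drop-in implementation: it assumes the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory as above, and the helper name setup_memory_cgroup is our own invention. Note that there is a brief window before the PID lands in the cgroup during which the child is not yet constrained.
import os

CGROUP_DIR = "/sys/fs/cgroup/memory/byo-docker"

def setup_memory_cgroup(pid, limit="10M"):
    os.makedirs(CGROUP_DIR, exist_ok=True)
    with open(os.path.join(CGROUP_DIR, "memory.limit_in_bytes"), "w") as f:
        f.write(limit)
    with open(os.path.join(CGROUP_DIR, "memory.swappiness"), "w") as f:
        f.write("0")
    with open(os.path.join(CGROUP_DIR, "tasks"), "w") as f:
        f.write(str(pid))

# in main.py, right after the child has been created:
# child_pid = clone.clone(exec_bash)
# setup_memory_cgroup(child_pid)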
Further Reading
We briefly covered how to create a process with hostname, process and filesystem isolation from the host. We also learnt how cgroups are used to limit the resources that a process can consume.
This merely scratches the surface of how containers work. If you would like to learn more, I recommend this post by Bakare Emmanuel.
Shoutout to Julian Friedman for creating the popular gist that inspired this post!