

Assumed audience: Developers with some knowledge of Docker, Python and Linux. Especially those curious about how Docker’s process isolation works under the hood in Linux.
Containerization - using tools like Docker - is a major trend in modern software development. One of its oft-quoted benefits is the ability to run processes in an "isolated environment".
This post is a hands-on introduction to process isolation. Follow along as we create a bash shell using Python, explore some of its behavioural differences as compared to a bash shell created in a Docker container, and attempt to implement these features.
Create a "container"
Let's write a Python script to run a bash shell. Put the following in a main.py:
import os
# read: https://docs.python.org/3/library/os.html#os.execv
os.execv("/bin/bash", ["/bin/bash"])
os.execv executes the file given in the first parameter, passing it the arguments provided as a list in the second parameter. The full documentation for os.execv is linked in the comment above.
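One detail worth noting: os.execv replaces the current process image, so any Python code after the call never runs. A tiny throwaway example to convince yourself:
import os

os.execv("/bin/echo", ["/bin/echo", "hello from the new process image"])
print("never reached - the Python process has been replaced by echo")
Running our actual main.py drops us straight into a bash shell: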
(/home/chay/byo-docker)> python main.py
root@chay-dev-vm:/home/chay/byo-docker#
Hurray, we now have a container! Sorta.
Namespace Isolation
From now onwards, run your commands as the root user.
Typically, containers have a separate hostname from the host. Let's try to implement this behaviour by changing the hostname inside the "container" we created:
(/home/chay/byo-docker)> sudo su
(/home/chay/byo-docker)> python main.py
root@chay-dev-vm:/home/chay/byo-docker# hostname
chay-dev-vm
root@chay-dev-vm:/home/chay/byo-docker# hostname test
root@chay-dev-vm:/home/chay/byo-docker# hostname
test
root@chay-dev-vm:/home/chay/byo-docker# exit
(/home/chay/byo-docker)> hostname
test
Oh crap, it turns out that whatever we did in this container affected the host! Let's quickly revert it.
(/home/chay/byo-docker)> hostname chay-dev-vm
(/home/chay/byo-docker)> hostname
chay-dev-vm
We want to be able to change things inside our container without affecting the host. In Linux, we can do this by ensuring that our container process is run with separate namespaces.
There are currently eight kinds of namespaces (as of kernel version 5.6), and the one responsible for providing hostname isolation is called UTS, short for UNIX Time-Sharing. We can tell Linux to create a process with an isolated UTS namespace by invoking the clone system call with the CLONE_NEWUTS flag.
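If you are curious which namespaces your current shell already belongs to, they are exposed as symlinks under /proc/self/ns. A throwaway snippet to list them (the exact set depends on your kernel version):
import os

# Each entry is a symlink such as "uts:[4026531838]"; the number identifies the namespace.
for name in sorted(os.listdir("/proc/self/ns")):
    print(name, "->", os.readlink(f"/proc/self/ns/{name}"))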
There is no built-in Python function to perform the clone system call in a convenient way, but fortunately we can call into C libraries from Python using ctypes. This is a little out of scope, so let's abstract the complexity behind a clone function. Create a file called clone.py with the following:
# Adapted from https://gist.github.com/adamhooper/fbbd16d207ddd3680a9aa31508debc4e
import ctypes
import signal
# Clone flags
# https://github.com/torvalds/linux/blob/master/tools/include/uapi/linux/sched.h
CLONE_NEWUTS = 0x04000000
# clone(2) needs a stack for the child; the stack grows downwards,
# so we pass a pointer to the *end* of the buffer.
_CHILD_STACK = ctypes.create_string_buffer(2 * 1024 * 1024)
_RUN_CHILD_STACK_POINTER = ctypes.c_void_p(
    ctypes.cast(_CHILD_STACK, ctypes.c_void_p).value + len(_CHILD_STACK)
)

_CLONE_FLAGS = CLONE_NEWUTS


def clone(fn):
    child_pid = ctypes.CDLL("libc.so.6").clone(
        ctypes.PYFUNCTYPE(ctypes.c_int)(fn),
        _RUN_CHILD_STACK_POINTER,
        _CLONE_FLAGS | signal.SIGCHLD,  # send parent SIGCHLD on exit
        0,
    )
    return child_pid
Back in our main.py, we can now use our clone function to spawn our container with an isolated UTS namespace. Since our function takes a callable as an input, we need to move the os.execv call into a function:
import os
import clone

def exec_bash():
    os.execv("/bin/bash", ["/bin/bash"])

child_pid = clone.clone(exec_bash)
# wait for child to exit
os.waitpid(child_pid, 0)
print("child terminated")
Let's see it at work! Try to change the hostname inside our container again:
(/home/chay/byo-docker)> python main.py
root@chay-dev-vm:/home/chay/byo-docker# hostname
chay-dev-vm
root@chay-dev-vm:/home/chay/byo-docker# hostname test
root@chay-dev-vm:/home/chay/byo-docker# hostname
test
root@chay-dev-vm:/home/chay/byo-docker# exit
(/home/chay/byo-docker)> hostname
chay-dev-vm
Now, changing the hostname inside the container no longer affects the host.
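If you want evidence beyond the hostname test, you can compare the UTS namespace identifiers of the parent and the child directly via /proc/self/ns/uts; once CLONE_NEWUTS is in effect they should differ. A rough sketch (a throwaway script, not part of main.py):
import os
import clone

print("parent UTS namespace:", os.readlink("/proc/self/ns/uts"))

def show_uts_ns():
    print("child UTS namespace: ", os.readlink("/proc/self/ns/uts"))
    return 0

child_pid = clone.clone(show_uts_ns)
os.waitpid(child_pid, 0)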
When you run a container using Docker, the hostname is usually a seemingly random string of 12 characters (the short container ID):
(/home/chay/byo-docker)> docker run --rm -it ubuntu bash
root@a3b1c1b9f086:/#
To implement the same behaviour, we can simply run hostname <random_string> in the container, right before we spawn the shell:
...
import random
import string

container_id = "".join(random.choice(string.ascii_lowercase + string.digits) for i in range(12))

def exec_bash():
    os.system(f"hostname {container_id}")
    os.execv("/bin/bash", ["/bin/bash"])
...
(/home/chay/byo-docker)> python main.py
root@6s2qvm7otw4u:/home/chay/byo-docker#
Now this looks more like a container!
Process Isolation
Try running ps within the container. You should still see all the processes running on the host.
(/home/chay/byo-docker)> python main.py
root@6s2qvm7otw4u:/home/chay/byo-docker# ps
PID TTY TIME CMD
1219384 pts/6 00:00:00 sudo
1219385 pts/6 00:00:00 bash
1219418 pts/6 00:00:00 python
1219419 pts/6 00:00:00 bash
1219444 pts/6 00:00:00 ps
That's not how Docker containers work! We should not be able to see host processes from inside our container. It should look something like this:
(/home/chay/byo-docker)> docker run --rm -it ubuntu bash
root@a3b1c1b9f086:/# ps
PID TTY TIME CMD
1 pts/0 00:00:00 bash
9 pts/0 00:00:00 ps
Notice that the main process running in the container (bash) has a PID of 1. The only processes visible to the container are children of the main command.
Let's print out the PIDs of our python program and the child process.
import os
import clone

print(os.getpid())

def exec_bash():
    print(os.getpid())
    os.execv("/bin/bash", ["/bin/bash"])
...
(/home/chay/byo-docker)> python main.py
1219932
1219933
The namespace that provides processes with an independent set of PIDs is called… PID. If you clicked on the link to the Linux source code earlier, you may have noticed a clone flag called CLONE_NEWPID.
Let's try it out! Add it to the clone flags in our clone.py:
...
# Clone flags
# https://github.com/torvalds/linux/blob/master/tools/include/uapi/linux/sched.h
CLONE_NEWUTS = 0x04000000
CLONE_NEWPID = 0x20000000
...
_CLONE_FLAGS = CLONE_NEWUTS | CLONE_NEWPID # flags are combined using a bitwise-OR operator
...
Now, run the container again:
(/home/chay/byo-docker)> python main.py
3971697
1
root@f6h6yuzeqaww:/home/chay/byo-docker#
Voilà, now the child process has a PID of 1!
Once again, run ps to see if the container is truly isolated from the parent.
root@f6h6yuzeqaww:/home/chay/byo-docker# ps
PID TTY TIME CMD
3971650 pts/0 00:00:00 sudo
3971651 pts/0 00:00:00 su
3971652 pts/0 00:00:00 bash
3971697 pts/0 00:00:00 python3
3971698 pts/0 00:00:00 bash
3971707 pts/0 00:00:00 ps
What's going on? And why is the PID 3971698 instead of 1?
As it turns out, the ps command takes its information from the /proc directory. Since we have not yet isolated our container's filesystem from the host's filesystem, the ps command still shows the processes on the host.
PID namespaces are nested, meaning that when a new PID namespace is created, the parent is able to see all processes in all their descendants. Although our bash process appears as PID 1 from inside the container, it has a different PID from the parent’s perspective.
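You can observe this dual identity directly: on reasonably recent kernels, /proc/<pid>/status contains an NSpid line listing the process's PID in each nested namespace, outermost first. A small sketch of what the parent in main.py could do, right after clone() returns and before waitpid:
# child_pid here is the value returned by clone.clone(exec_bash)
with open(f"/proc/{child_pid}/status") as f:
    for line in f:
        if line.startswith("NSpid:"):
            # prints something like "NSpid:  3971698  1"
            print(line.strip())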
Filesystem Isolation
Let's create a filesystem then! A simple way to do it is to copy the contents of the ubuntu Docker image:
(/home/chay/byo-docker)> mkdir rootfs-example
(/home/chay/byo-docker)> docker export $(docker create ubuntu) | tar -C rootfs-example -xvf -
We are going to make the process in the container use this newly created filesystem as its root filesystem, using the chroot system call.
...
def exec_bash():
    os.system(f"hostname {container_id}")
    os.chroot("/home/chay/byo-docker/rootfs-example")
    os.chdir("/")
    os.execv("/bin/bash", ["/bin/bash"])
...
Let's try running it!
(/home/chay/byo-docker)> python main.py
root@evkxhls2rjne:/# ps
bash: ps: command not found
root@evkxhls2rjne:/# /bin/ps
Error, do this: mount -t proc proc /proc
Why do we need to mount proc when the /proc directory already exists as a folder? According to the proc man page, proc is a pseudo-filesystem: the directory itself is just an empty mount point, and mounting proc onto it is what allows the kernel to expose information about processes and the system through its contents.
...
def exec_bash():
    os.system(f"hostname {container_id}")
    os.chroot("/home/chay/byo-docker/rootfs-example")
    os.chdir("/")
    os.system("/bin/mount -t proc proc /proc")
    os.execv("/bin/bash", ["/bin/bash"])
...
Let's try that again:
(/home/chay/byo-docker)> python main.py
root@boyJBkZJYvVl:/# /bin/ps
PID TTY TIME CMD
1 ? 00:00:00 bash
7 ? 00:00:00 ps
Now, our bash process has PID 1.
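As an aside, shelling out to /bin/mount only works because a mount binary happens to exist in the ubuntu rootfs we copied. If you would rather not rely on that, the mount(2) system call can be invoked through ctypes, just like we did for clone. A rough sketch (the helper name mount_proc is our own); you would call it in exec_bash right after os.chdir("/"):
import ctypes
import os

_libc = ctypes.CDLL("libc.so.6", use_errno=True)

def mount_proc():
    # int mount(const char *source, const char *target, const char *fstype,
    #           unsigned long mountflags, const void *data)
    if _libc.mount(b"proc", b"/proc", b"proc", 0, None) != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))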
Resource Limits
Resource limits in Linux are managed by control groups, more commonly shortened to cgroups. Like proc, this kernel feature is exposed to the user through a filesystem, mounted at /sys/fs/cgroup.
Let’s try using cgroups to limit the amount of RAM that our container has access to. We first create a directory under /sys/fs/cgroup/memory:
(/home/chay/byo-docker)> cd /sys/fs/cgroup/memory
(/sys/fs/cgroup/memory)> mkdir byo-docker
(/sys/fs/cgroup/memory)> cd byo-docker
(/sys/fs/cgroup/memory/byo-docker)>
Note that this newly created directory is magically pre-populated with some special files.
(/sys/fs/cgroup/memory/byo-docker)> ls
cgroup.clone_children
cgroup.event_control
cgroup.procs
memory.failcnt
memory.force_empty
memory.kmem.failcnt
memory.kmem.limit_in_bytes
memory.kmem.max_usage_in_bytes
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.limit_in_bytes
memory.kmem.tcp.max_usage_in_bytes
memory.kmem.tcp.usage_in_bytes
memory.kmem.usage_in_bytes
memory.limit_in_bytes
memory.max_usage_in_bytes
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.soft_limit_in_bytes
memory.stat
memory.swappiness
memory.usage_in_bytes
memory.use_hierarchy
notify_on_release
tasks
The memory limit of this cgroup can be set by writing the desired figure to the memory.limit_in_bytes file. Let’s limit our container to 10 megabytes of memory, and disable swap usage.
(/sys/fs/cgroup/memory/byo-docker)> cat memory.limit_in_bytes
9223372036854771712
(/sys/fs/cgroup/memory/byo-docker)> echo 10M > memory.limit_in_bytes
(/sys/fs/cgroup/memory/byo-docker)> cat memory.limit_in_bytes
10485760
(/sys/fs/cgroup/memory/byo-docker)> echo 0 > memory.swappiness
The next step is to associate our container with this cgroup. This is done by writing the container's PID (as seen from the host) into the tasks file. Start our container in a separate terminal session and note the PID of the parent python process:
(/home/chay/byo-docker)> python main.py
3418583
1
root@tvnqzhwuhe5u:/#
The container process should be the next higher PID, 3418584, but to be sure we can use pgrep to list the child processes of 3418583. We can then bind this PID to the cgroup we created:
(/sys/fs/cgroup/memory/byo-docker)> pgrep -lP 3418583
3418584 bash
(/sys/fs/cgroup/memory/byo-docker)> cat tasks
(/sys/fs/cgroup/memory/byo-docker)> echo 3418584 > tasks
(/sys/fs/cgroup/memory/byo-docker)> cat tasks
3418584
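As an aside, since clone() already returns the child's PID as seen from the host, we could avoid pgrep entirely by having main.py print it, for example:
child_pid = clone.clone(exec_bash)
print(f"container PID (host view): {child_pid}")

# wait for child to exit
os.waitpid(child_pid, 0)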
To test if this is working properly, let’s consume some RAM! A common method is to use head to output a fixed number of bytes from /dev/zero into tail, since tail needs to keep the current line in memory.
Our container filesystem does not have /dev/zero, but we can create it using mknod (1 and 5 are the major and minor device numbers the kernel uses for /dev/zero).
root@tvnqzhwuhe5u:/# mknod /dev/zero c 1 5
root@tvnqzhwuhe5u:/# head -c 8M /dev/zero | tail
root@tvnqzhwuhe5u:/# head -c 11M /dev/zero | tail
Killed
Increase the memory limit to 20 Megabytes, and try again:
(/sys/fs/cgroup/memory/byo-docker)> echo 20M > memory.limit_in_bytes
(/sys/fs/cgroup/memory/byo-docker)> cat memory.limit_in_bytes
20971520
...
root@tvnqzhwuhe5u:/# head -c 11M /dev/zero | tail
root@tvnqzhwuhe5u:/# head -c 18M /dev/zero | tail
root@tvnqzhwuhe5u:/# head -c 21M /dev/zero | tail
Killed
This demonstrates that our cgroup memory limit works! When the process consumes RAM beyond the defined limit, the operating system's Out-Of-Memory (OOM) killer terminates it. Check journalctl to see the logs from the OOM killer:
(/sys/fs/cgroup/memory/byo-docker)> journalctl | tail
...
... kernel: Memory cgroup out of memory: Killed process 3864769 (tail) total-vm:21932kB, anon-rss:18728kB, file-rss:1680kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
To clean up the cgroup we created, we can simply remove the folder:
(/sys/fs/cgroup/memory/byo-docker)> cd ..
(/sys/fs/cgroup/memory)> rmdir byo-docker
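If you would rather not repeat this cgroup bookkeeping by hand on every run, the same steps can be scripted from main.py once clone() has returned. Below is a sketch, not a drop-in implementation: it assumes the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory as above, and the helper name setup_memory_cgroup is our own invention. Note that there is a brief window before the PID lands in the cgroup during which the child is not yet constrained.
import os

CGROUP_DIR = "/sys/fs/cgroup/memory/byo-docker"

def setup_memory_cgroup(pid, limit="10M"):
    os.makedirs(CGROUP_DIR, exist_ok=True)
    with open(os.path.join(CGROUP_DIR, "memory.limit_in_bytes"), "w") as f:
        f.write(limit)
    with open(os.path.join(CGROUP_DIR, "memory.swappiness"), "w") as f:
        f.write("0")
    with open(os.path.join(CGROUP_DIR, "tasks"), "w") as f:
        f.write(str(pid))

# in main.py, right after the child has been created:
# child_pid = clone.clone(exec_bash)
# setup_memory_cgroup(child_pid)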
Further Reading
We briefly covered how to create a process with hostname, process and filesystem isolation from the host. We also learnt how cgroups are used to limit the resources that a process can consume.
This merely scratches the surface of how containers work. If you would like to learn more, I recommend this post by Bakare Emmanuel.
Shoutout to Julian Friedman for creating the popular gist that inspired this post!