A container image is an archive created by the tar command. Let's download and examine a popular image called alpine:latest. I will use the skopeo utility, which works with container images, to download the image directly from the Docker registry to my /tmp directory:
$ cd /tmp
/tmp $ skopeo copy docker://alpine docker-archive:alpine_latest.tar
Getting image source signatures
Copying blob da9db072f522 done |
Copying config 63b790fccc done |
Writing manifest to image destination
/tmp $ tar tvf alpine_latest.tar
-r--r--r-- 0/0 8081920 1970-01-01 02:00 75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558.tar
-r--r--r-- 0/0 600 1970-01-01 02:00 63b790fccc9078ab8bb913d94a5d869e19fca9b77712b315da3fa45bb8f14636.json
l--------- 0/0 0 1970-01-01 02:00 be347e7910e909949a9117603d2b45b45d07fd42e75b90e254fb86f44c4b956a/layer.tar -> ../75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558.tar
-r--r--r-- 0/0 3 1970-01-01 02:00 be347e7910e909949a9117603d2b45b45d07fd42e75b90e254fb86f44c4b956a/VERSION
-r--r--r-- 0/0 346 1970-01-01 02:00 be347e7910e909949a9117603d2b45b45d07fd42e75b90e254fb86f44c4b956a/json
-r--r--r-- 0/0 180 1970-01-01 02:00 manifest.json
-r--r--r-- 0/0 2 1970-01-01 02:00 repositories
Although the image is a plain tar file, its contents follow a layout defined by the tools that produce it. To create such an image manually, you need to know the format of the JSON files it must contain. Fortunately, you don't need to know this to unpack it.
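If you are curious about those JSON files, you can read them straight out of the archive without unpacking it. The sketch below assumes a tar that supports the -O option (GNU tar and bsdtar both do); archive_member is a hypothetical helper name used only for illustration, not part of skopeo.

```shell
# Print a single member of a tar archive to stdout without unpacking the rest.
# Usage: archive_member ARCHIVE MEMBER
# (archive_member is a hypothetical helper name, not an existing tool.)
archive_member() {
    tar -xOf "$1" "$2"
}
```

For example, `archive_member alpine_latest.tar manifest.json` should print the small index file which, in archives of this layout, typically names the config JSON and the layer tar files.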
Let's create a basic folder for our sandbox and unpack the largest layer's tar file into it.
/tmp $ tar xf alpine_latest.tar
/tmp $ mkdir -p c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558
/tmp $ tar -x -C c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558 -f 75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558.tar
/tmp $ ll c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558/
total 0
drwxr-xr-x 2 voleg voleg 1680 Sep 6 14:34 bin
drwxr-xr-x 2 voleg voleg 40 Sep 6 14:34 dev
drwxr-xr-x 17 voleg voleg 740 Sep 6 14:34 etc
drwxr-xr-x 2 voleg voleg 40 Sep 6 14:34 home
drwxr-xr-x 6 voleg voleg 260 Sep 6 14:34 lib
drwxr-xr-x 5 voleg voleg 100 Sep 6 14:34 media
drwxr-xr-x 2 voleg voleg 40 Sep 6 14:34 mnt
drwxr-xr-x 2 voleg voleg 40 Sep 6 14:34 opt
dr-xr-xr-x 2 voleg voleg 40 Sep 6 14:34 proc
drwx------ 2 voleg voleg 40 Sep 6 14:34 root
drwxr-xr-x 2 voleg voleg 40 Sep 6 14:34 run
drwxr-xr-x 2 voleg voleg 1260 Sep 6 14:34 sbin
drwxr-xr-x 2 voleg voleg 40 Sep 6 14:34 srv
drwxr-xr-x 2 voleg voleg 40 Sep 6 14:34 sys
drwxr-xr-x 2 voleg voleg 40 Sep 6 14:34 tmp
drwxr-xr-x 7 voleg voleg 140 Sep 6 14:34 usr
drwxr-xr-x 12 voleg voleg 260 Sep 6 14:34 var
As you can see, this tar file is a backup of a root filesystem. The long number is a hash calculated when this image was created, and it is considered to be a unique identifier for this layer. Other container images built on top of this root filesystem will not contain this tar file, but will refer to it by this hash. That is why it is usually unpacked on the host filesystem into a directory named after the hash, just as we have done here.
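The link between the layer's name and its content can be checked by hand. The sketch below assumes that the file name is the SHA-256 digest of the uncompressed layer tar, which is the usual convention for this archive layout but is not stated in the output above; layer_name_matches is a hypothetical helper name, and sha256sum is assumed to be installed.

```shell
# Succeed if the SHA-256 of FILE equals the hex digest used as its name.
# Assumption: FILE is named "<sha256-hex>.tar".
# Usage: layer_name_matches FILE
layer_name_matches() {
    expected=$(basename "$1" .tar)               # digest taken from the file name
    actual=$(sha256sum "$1" | cut -d' ' -f1)     # digest computed from the content
    [ "$expected" = "$actual" ]
}
```

Running `layer_name_matches 75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558.tar && echo match` in /tmp would print match if the assumption holds for this archive.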
Since this image can be used by multiple containers by design, it should be read-only for our running container. However, running processes usually want to write something to their filesystem, which is where the overlay comes in. Assuming we call our container myalpine, we will create a directory with that name to store the read-write changes.
/tmp $ mkdir c/myalpine
/tmp $ mkdir c/runtime c/work
/tmp $ sudo mount myalpine -t overlay -o lowerdir=c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558/,upperdir=c/myalpine,workdir=c/work c/runtime
/tmp $ mount | grep myalpine
myalpine on /tmp/c/runtime type overlay (rw,relatime,lowerdir=c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558/,upperdir=c/myalpine,workdir=c/work,uuid=on)
Conventional tools perform this mount every time they start the container. Our process will start in the c/runtime directory, which is the result of merging the read-only lowerdir directory and the read-write upperdir directory. The contents of the workdir directory are temporary and are used by the overlay mount itself. Let's modify a file in our runtime directory.
/tmp $ echo "# added in runtime" >> c/runtime/etc/hosts
/tmp $ cat c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558/etc/hosts
127.0.0.1 localhost localhost.localdomain
::1 localhost localhost.localdomain
/tmp $ cat c/myalpine/etc/hosts
127.0.0.1 localhost localhost.localdomain
::1 localhost localhost.localdomain
# added in runtime
This way, the contents of the base image are read-only and can be used by other containers too. Any changes are saved to the upperdir and persist between container restarts.
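Because every change lands in the upperdir, listing that directory is a quick way to see what a container has modified. A minimal sketch, assuming find, sed, and sort are available; upper_changes is a hypothetical helper name, and it only reports regular files (deletions, which overlay records as special whiteout entries, are not shown).

```shell
# List regular files stored in an overlay upper directory, relative to its root.
# Usage: upper_changes UPPERDIR
# (upper_changes is a hypothetical helper name used for illustration.)
upper_changes() {
    ( cd "$1" && find . -type f | sed 's|^\./||' | sort )
}
```

At this point in the walkthrough, `upper_changes c/myalpine` should list etc/hosts, the file we have just appended to.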
The chroot command is a very old jailing technique. Now that we have prepared our root filesystem in the previous step, let's enter our container using the chroot command. This command requires root privileges, so we run it with sudo.
/tmp $ sudo chroot c/runtime /bin/sh
/ # df -h
Filesystem Size Used Available Use% Mounted on
df: /proc/mounts: No such file or directory
/ # whoami
root
/ # ps -ef
PID USER TIME COMMAND
/proc is not mounted, so some commands don't work correctly. Let's mount it since we are root in the container.
/ # mount -t proc proc /proc
/ # df -h
Filesystem Size Used Available Use% Mounted on
myalpine 30.0G 266.3M 29.7G 1% /
/ # ps -ef
PID USER TIME COMMAND
1 root 0:39 /usr/lib/systemd/systemd --switched-root --system --deserialize=36 single
..
/ # ls -l /proc/kcore
-r-------- 1 root root 140737471594496 Nov 28 11:19 /proc/kcore
/ # mount -t devtmpfs devtmpfs /dev
/ # fdisk -l /dev/nvme0n1
Found valid GPT with protective MBR; using GPT
Disk /dev/nvme0n1: 2000409264 sectors, 1914M
Logical sector size: 512
Disk identifier (GUID): 3c1c882b-dbb5-4dcb-b0e1-1eaee936a06b
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 2000409230
Number Start (sector) End (sector) Size Name
1 2048 534527 260M EFI System Partition
2 534528 567295 16.0M Microsoft reserved partition
3 567296 243947674 116G Basic data partition
4 243949568 246323199 1159M
5 246327296 246851583 256M
6 246851584 2000408575 836G
/ # mount -t auto /dev/nvme0n1p1 /mnt
/ # ls /mnt/
$RECYCLE.BIN BOOT EFI System Volume Information
/ # ip addr
..
Even if we can't exit our jailed file system, we can access the external system and even devices. We can mount disks and overwrite them. We can access and read system memory. We can access and manipulate the network.
Let's exit the chrooted environment.
/ # df
Filesystem 1K-blocks Used Available Use% Mounted on
myalpine 31457280 272720 31184560 1% /
devtmpfs 4096 0 4096 0% /dev
/dev/nvme0n1p1 262144 41724 220420 16% /mnt
/ # umount /mnt
/ # umount /dev
/ # umount /proc
/ # exit
Namespaces are intended to solve the security problems we pointed out in the chroot chapter. A process can be isolated in several kinds of namespaces; we will enable the PID, mount, UTS, IPC, network, and user namespaces:
/tmp $ unshare --fork --pid --mount --uts --ipc --net --user --map-root-user /bin/bash
basename: missing operand
Try 'basename --help' for more information.
/tmp # id
uid=0(root) gid=0(root) groups=0(root),65534(nobody)
The result is that we are still in the same host filesystem (unlike chroot), but isolated from the host's networking, processes, user IDs, and mounts. Because of the new user namespace, we got the ID 0, corresponding to root, but in the original namespace we are still running under our own UID. If you omit the --map-root-user option, you get the user nobody instead. We now have an isolated environment, but we are still outside our container, so let's chroot into it.
/tmp # chroot c/runtime /bin/sh
/ # export PATH=/sbin:/usr/sbin:/bin:/usr/bin
/ # hostname myalpine
root@myalpine:/ # mount -t proc proc /proc
root@myalpine:/ # mount -t devtmpfs devtmpfs /dev
mount: permission denied (are you root?)
root@myalpine:/ # strings /proc/kcore
strings: /proc/kcore: Permission denied
root@myalpine:/ # df
Filesystem 1K-blocks Used Available Use% Mounted on
myalpine 31457280 272720 31184560 1% /
root@myalpine:/ # ps -ef
PID USER TIME COMMAND
1 root 0:00 /bin/bash
18 root 0:00 /bin/sh
27 root 0:00 ps -ef
root@myalpine:/ # ip add
1: lo:mtu 65536 qdisc noop state DOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
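All the "permission denied" messages appear because your root access is fake: UID 0 exists only inside the user namespace, while on the host you are still an unprivileged user.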
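The kernel exposes this mapping in /proc/self/uid_map, where each line holds three numbers: the first ID inside the namespace, the ID it corresponds to in the parent namespace, and the length of the mapped range. A minimal sketch that turns those lines into readable text; describe_uid_map is a hypothetical helper name, and the UID 1000 mentioned below is only an example value.

```shell
# Read uid_map-style lines ("inside outside count") on stdin
# and describe each mapping in plain words.
# (describe_uid_map is a hypothetical helper name used for illustration.)
describe_uid_map() {
    while read -r inside outside count; do
        echo "uid $inside in the namespace is uid $outside outside (range of $count)"
    done
}
```

Inside the namespace, with /proc mounted, `describe_uid_map < /proc/self/uid_map` would report something like uid 0 mapped to an ordinary host UID such as 1000, with a range of 1.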
You can limit the container's resource usage using cgroups. All subsequent settings are made outside the container, on the host node itself. Let's create a new cgroup and name it accordingly.
host# mkdir /sys/fs/cgroup/myalpine
host# echo "20000 100000" > /sys/fs/cgroup/myalpine/cpu.max
host# echo $((50*1024*1024)) > /sys/fs/cgroup/myalpine/memory.max
host# ps -ef | grep /bin/sh
voleg 1314499 1314163 0 14:05 pts/3 00:00:00 /bin/sh
..
host# echo 1314499 > /sys/fs/cgroup/myalpine/cgroup.procs
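As a result, PID 1314499 and any child processes it starts from now on are limited to 20% of one CPU (20000 microseconds of run time in every 100000-microsecond period) and 50 MiB of memory.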
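The two values written above are simple arithmetic, so they are easy to derive for other limits. A minimal sketch, assuming the cgroup v2 format "quota period" in microseconds for cpu.max and a byte count for memory.max; cpu_max_line and mib_to_bytes are hypothetical helper names.

```shell
# Print a cpu.max line ("quota period", both in microseconds) for a CPU percentage.
# Usage: cpu_max_line PERCENT [PERIOD_US]   (period defaults to 100000)
cpu_max_line() {
    percent=$1
    period=${2:-100000}
    echo "$(( percent * period / 100 )) $period"
}

# Convert mebibytes to the byte count expected by memory.max.
# Usage: mib_to_bytes MIB
mib_to_bytes() {
    echo "$(( $1 * 1024 * 1024 ))"
}
```

With these helpers, `cpu_max_line 20` yields the same "20000 100000" line used above, and `mib_to_bytes 50` yields the value written to memory.max.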
To set up networking in our isolated namespace, the namespace needs a name that we can refer to. Let's create that name and attach it to our existing isolated namespace.
host# ip netns attach myalpine 1314499
Here, myalpine is the name given to the isolated network namespace used by PID 1314499.
Now let's create a virtual ethernet cable with two connectors veth0 and veth1.
host# ip link add veth0 type veth peer name veth1
Now move one cable connector into the container namespace.
host# ip link set veth1 netns myalpine
Looking inside our container we see a new interface added.
root@myalpine:/ # ip add
1: lo:mtu 65536 qdisc noop state DOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
34: veth1@if35: mtu 1500 qdisc noop state DOWN qlen 1000
link/ether 22:dd:eb:28:69:b8 brd ff:ff:ff:ff:ff:ff
Let's continue setting up the internal network.
host# ip addr add 10.1.1.1/24 dev veth0
host# ip link set dev veth0 up
root@myalpine:/ # ip addr add 10.1.1.2/24 dev veth1
root@myalpine:/ # ip link set dev veth1 up
root@myalpine:/ # ping -c1 10.1.1.1
PING 10.1.1.1 (10.1.1.1): 56 data bytes
64 bytes from 10.1.1.1: seq=0 ttl=64 time=0.185 ms
--- 10.1.1.1 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.185/0.185/0.185 ms
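The host-side steps above always follow the same order, so it can help to generate them for review before running anything as root. The sketch below only prints the commands and changes nothing; veth_plan is a hypothetical helper name, and its arguments are the namespace name, the PID, the two interface names, and the host-side address.

```shell
# Print (do not run) the host-side commands that wire a veth pair into
# the network namespace of a given PID.
# Usage: veth_plan NAME PID HOST_IF CONTAINER_IF HOST_ADDR
# (veth_plan is a hypothetical helper name used for illustration.)
veth_plan() {
    ns=$1 pid=$2 host_if=$3 cont_if=$4 host_addr=$5
    cat <<EOF
ip netns attach $ns $pid
ip link add $host_if type veth peer name $cont_if
ip link set $cont_if netns $ns
ip addr add $host_addr dev $host_if
ip link set dev $host_if up
EOF
}
```

Calling `veth_plan myalpine 1314499 veth0 veth1 10.1.1.1/24` prints the same five host commands used in this walkthrough; the address and link setup for veth1 still has to be done inside the container.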