Container technology in CLI commands

This article demonstrates the technologies behind containers using nothing but Linux commands.

File system

A container image is, at its core, an archive created by the tar command. Let's download and examine a popular image, alpine:latest. I will use the skopeo utility, which works with container images; with its help, I will download the image directly from the Docker registry into my /tmp directory:

$ cd /tmp
/tmp $ skopeo copy docker://alpine docker-archive:alpine_latest.tar
Getting image source signatures
Copying blob da9db072f522 done   | 
Copying config 63b790fccc done   | 
Writing manifest to image destination
/tmp $ tar tvf alpine_latest.tar  
-r--r--r-- 0/0         8081920 1970-01-01 02:00 75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558.tar
-r--r--r-- 0/0             600 1970-01-01 02:00 63b790fccc9078ab8bb913d94a5d869e19fca9b77712b315da3fa45bb8f14636.json
l--------- 0/0               0 1970-01-01 02:00 be347e7910e909949a9117603d2b45b45d07fd42e75b90e254fb86f44c4b956a/layer.tar -> ../75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558.tar
-r--r--r-- 0/0               3 1970-01-01 02:00 be347e7910e909949a9117603d2b45b45d07fd42e75b90e254fb86f44c4b956a/VERSION
-r--r--r-- 0/0             346 1970-01-01 02:00 be347e7910e909949a9117603d2b45b45d07fd42e75b90e254fb86f44c4b956a/json
-r--r--r-- 0/0             180 1970-01-01 02:00 manifest.json
-r--r--r-- 0/0               2 1970-01-01 02:00 repositories

Although the image is a plain tar file, its contents follow a format defined by the container tooling. To create such an image manually, you would need to know the format of the JSON files inside it. Fortunately, you don't need to know any of that to unpack it.

Let's create a base directory for our sandbox and unpack the largest layer's tar file into it.

/tmp $ tar xf alpine_latest.tar
/tmp $ mkdir -p c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558
/tmp $ tar -x -C c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558 -f 75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558.tar
/tmp $ ll c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558/
total 0
drwxr-xr-x  2 voleg voleg 1680 Sep  6 14:34 bin
drwxr-xr-x  2 voleg voleg   40 Sep  6 14:34 dev
drwxr-xr-x 17 voleg voleg  740 Sep  6 14:34 etc
drwxr-xr-x  2 voleg voleg   40 Sep  6 14:34 home
drwxr-xr-x  6 voleg voleg  260 Sep  6 14:34 lib
drwxr-xr-x  5 voleg voleg  100 Sep  6 14:34 media
drwxr-xr-x  2 voleg voleg   40 Sep  6 14:34 mnt
drwxr-xr-x  2 voleg voleg   40 Sep  6 14:34 opt
dr-xr-xr-x  2 voleg voleg   40 Sep  6 14:34 proc
drwx------  2 voleg voleg   40 Sep  6 14:34 root
drwxr-xr-x  2 voleg voleg   40 Sep  6 14:34 run
drwxr-xr-x  2 voleg voleg 1260 Sep  6 14:34 sbin
drwxr-xr-x  2 voleg voleg   40 Sep  6 14:34 srv
drwxr-xr-x  2 voleg voleg   40 Sep  6 14:34 sys
drwxr-xr-x  2 voleg voleg   40 Sep  6 14:34 tmp
drwxr-xr-x  7 voleg voleg  140 Sep  6 14:34 usr
drwxr-xr-x 12 voleg voleg  260 Sep  6 14:34 var

As you can see, this tar file is a backup of some root filesystem. The long number is a hash calculated when the image was created, and it serves as a unique identifier for this layer. Other container images built on top of this root filesystem will not contain this tar file again, but will refer to it by this hash. That is why it is usually unpacked on the host filesystem into a directory named after the hash, just as we have done here.
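
The manifest.json we listed above is what ties these pieces together: it points to the config JSON and lists the layer tar files in their stacking order. Its content looks roughly like this (reformatted for readability; fields other than Config and Layers are elided):

/tmp $ cat manifest.json
[{"Config":"63b790fccc9078ab8bb913d94a5d869e19fca9b77712b315da3fa45bb8f14636.json",
  ...
  "Layers":["be347e7910e909949a9117603d2b45b45d07fd42e75b90e254fb86f44c4b956a/layer.tar"]}]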

Since this image is, by design, shared by multiple containers, it must remain read-only for our running container. Running processes, however, usually want to write to their filesystem, and this is where the overlay filesystem comes in. Assuming we call our container myalpine, we will create a directory to store its read-write changes.

/tmp $ mkdir c/myalpine
/tmp $ mkdir c/runtime c/work
/tmp $ sudo mount myalpine -t overlay -o lowerdir=c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558/,upperdir=c/myalpine,workdir=c/work c/runtime
/tmp $ mount | grep myalpine
myalpine on /tmp/c/runtime type overlay (rw,relatime,lowerdir=c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558/,upperdir=c/myalpine,workdir=c/work,uuid=on)

Conventional container tools perform a mount like this every time a container starts. Our process will run in the c/runtime directory, which is the result of merging the read-only lowerdir with the read-write upperdir. The workdir directory holds temporary state used by the overlay mount itself. Let's modify a file in our runtime directory.

/tmp $ echo "# added in runtime" >> c/runtime/etc/hosts
/tmp $ cat c/75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558/etc/hosts
127.0.0.1       localhost localhost.localdomain
::1             localhost localhost.localdomain
/tmp $ cat c/myalpine/etc/hosts
127.0.0.1       localhost localhost.localdomain
::1             localhost localhost.localdomain
# added in runtime

This way, the contents of the base image are read-only and can be used by other containers too. Any changes are saved to the upperdir and persist between container restarts.
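
Deletions work through a trick as well: removing a lower-layer file from the merged view makes overlayfs place a "whiteout" entry in the upperdir, a character device with device number 0:0 that masks the file underneath. A quick way to see it (output abridged; exact details may differ):

/tmp $ rm c/runtime/etc/motd
/tmp $ ls -l c/myalpine/etc/motd
c--------- 1 root root 0, 0 ... c/myalpine/etc/motd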

chroot

The chroot command is a very old jailing technique. Having prepared our root filesystem in the previous step, let's enter our container using chroot. The command requires root privileges, so we use sudo.

/tmp $ sudo chroot c/runtime /bin/sh
/ # df -h
Filesystem                Size      Used Available Use% Mounted on
df: /proc/mounts: No such file or directory
/ # whoami
root
/ # ps -ef
PID   USER     TIME  COMMAND

/proc is not mounted, so some commands don't work correctly. Let's mount it since we are root in the container.

/ # mount -t proc proc /proc
/ # df -h
Filesystem                Size      Used Available Use% Mounted on
myalpine                 30.0G    266.3M     29.7G   1% /
/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      0:39 /usr/lib/systemd/systemd --switched-root --system --deserialize=36 single
 ..
/ # ls -l /proc/kcore 
-r--------    1 root     root     140737471594496 Nov 28 11:19 /proc/kcore
/ # mount -t devtmpfs devtmpfs /dev
/ # fdisk -l /dev/nvme0n1
Found valid GPT with protective MBR; using GPT

Disk /dev/nvme0n1: 2000409264 sectors, 1914M
Logical sector size: 512
Disk identifier (GUID): 3c1c882b-dbb5-4dcb-b0e1-1eaee936a06b
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 2000409230

Number  Start (sector)    End (sector)  Size Name
     1            2048          534527  260M EFI System Partition
     2          534528          567295 16.0M Microsoft reserved partition
     3          567296       243947674  116G Basic data partition
     4       243949568       246323199 1159M 
     5       246327296       246851583  256M 
     6       246851584      2000408575  836G 
/ # mount -t auto /dev/nvme0n1p1 /mnt
/ # ls /mnt/
$RECYCLE.BIN               BOOT                       EFI                        System Volume Information
/ # ip addr
 ..

Even though we cannot leave our jailed filesystem, we can still reach the external system and even its devices: we can mount disks and overwrite them, read system memory via /proc/kcore, and inspect and manipulate the network.

Let's exit the chrooted environment.

/ # df
Filesystem           1K-blocks      Used Available Use% Mounted on
myalpine              31457280    272720  31184560   1% /
devtmpfs                  4096         0      4096   0% /dev
/dev/nvme0n1p1          262144     41724    220420  16% /mnt
/ # umount /mnt
/ # umount /dev
/ # umount /proc
/ # exit

Namespaces

Namespaces are intended to solve the security problems we pointed out in the chroot chapter. A newly created namespace can be isolated along several dimensions; we will turn on all of them:

/tmp $ unshare --fork --pid --mount --uts --ipc --net --user --map-root-user  /bin/bash
basename: missing operand
Try 'basename --help' for more information.
/tmp # id
uid=0(root) gid=0(root) groups=0(root),65534(nobody)

The result is that we are still in the same host filesystem (unlike chroot), but isolated from the host's networking, processes, user ids, and mounts. Because of the new user namespace, we got uid 0, corresponding to root, while in the original namespace we are still running under our own uid. If you omit the --map-root-user option, you get the user nobody instead.
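
The mapping is visible in /proc. Inside the namespace, uid 0 is mapped onto our original host uid (I assume 1000 here; the last column is the number of ids covered by the mapping):

/tmp # cat /proc/self/uid_map
         0       1000          1

We now have an isolated environment, but we are still outside our container, so let's chroot into it.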

/tmp # chroot c/runtime /bin/sh
/ # export PATH=/sbin:/usr/sbin:/bin:/usr/bin
/ # hostname myalpine
root@myalpine:/ # mount -t proc proc /proc
root@myalpine:/ # mount -t devtmpfs devtmpfs /dev
mount: permission denied (are you root?)
root@myalpine:/ # strings /proc/kcore 
strings: /proc/kcore: Permission denied
root@myalpine:/ # df
Filesystem           1K-blocks      Used Available Use% Mounted on
myalpine              31457280    272720  31184560   1% /
root@myalpine:/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/bash
   18 root      0:00 /bin/sh
   27 root      0:00 ps -ef
root@myalpine:/ # ip add
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

All these permission denied messages appear because our root access is fake: uid 0 exists only inside our user namespace and maps back to our unprivileged uid on the host.
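
The fakeness is easy to observe: files owned by the real root have no valid mapping in our user namespace, so they appear to belong to nobody (the overflow uid 65534), and our fake root has no rights over them:

root@myalpine:/ # ls -l /proc/kcore
-r--------    1 nobody   nobody   140737471594496 Nov 28 11:19 /proc/kcore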

cgroups

You can limit a container's resource usage using cgroups (the examples below use the cgroup v2 interface). All of the following settings are made outside the container, on the host itself. Let's create a new cgroup and name it accordingly.

host# mkdir /sys/fs/cgroup/myalpine
host# echo "20000 100000" > /sys/fs/cgroup/myalpine/cpu.max
host# echo $((50*1024*1024)) > /sys/fs/cgroup/myalpine/memory.max
host# ps -ef | grep /bin/sh
voleg    1314499 1314163  0 14:05 pts/3    00:00:00 /bin/sh
 ..
host# echo 1314499 > /sys/fs/cgroup/myalpine/cgroup.procs

As a result, we limited PID 1314499 and its child processes to 20% of one CPU (20 ms of CPU time per 100 ms period) and 50 MB of memory.
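
To see the limits in action, we can try to exceed them from inside the container. Piping /dev/zero into tail makes tail buffer everything it reads in memory, so once the shell's cgroup crosses the 50 MB limit, the kernel's OOM killer should step in (a sketch; the exact message may differ):

root@myalpine:/ # head -c 100000000 /dev/zero | tail
Killed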

Network

To set up networking in our isolated namespace, the namespace first needs a name that tools can reference. Let's create that name and attach it to our existing namespace.

host# ip netns attach myalpine 1314499

Here myalpine becomes the name of the isolated network namespace used by PID 1314499.
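
We can verify that the name exists and really points into the container's namespace (the loopback device shown is the same one we saw from inside earlier):

host# ip netns list
myalpine
host# ip netns exec myalpine ip link show lo
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00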

Now let's create a virtual Ethernet cable with two connectors, veth0 and veth1.

host# ip link add veth0 type veth peer name veth1

Now move one cable connector into the container namespace.

host# ip link set veth1 netns myalpine

Looking inside our container, we see that a new interface has appeared.

root@myalpine:/ # ip add
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
34: veth1@if35: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 22:dd:eb:28:69:b8 brd ff:ff:ff:ff:ff:ff

Let's continue setting up the internal network.

host# ip addr add 10.1.1.1/24 dev veth0
host# ip link set dev veth0 up
root@myalpine:/ # ip addr add 10.1.1.2/24 dev veth1
root@myalpine:/ # ip link set dev veth1 up
root@myalpine:/ # ping -c1 10.1.1.1
PING 10.1.1.1 (10.1.1.1): 56 data bytes
64 bytes from 10.1.1.1: seq=0 ttl=64 time=0.185 ms

--- 10.1.1.1 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.185/0.185/0.185 ms
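
At this point the container can talk only to the host. To reach the outside world, conventional tools add NAT on the host; here is a minimal sketch of the same idea, assuming iptables is available and the host's uplink interface is named eth0:

host# sysctl -w net.ipv4.ip_forward=1
host# iptables -t nat -A POSTROUTING -s 10.1.1.0/24 -o eth0 -j MASQUERADE
root@myalpine:/ # ip route add default via 10.1.1.1

This is essentially what Docker does with its default bridge network: traffic leaving 10.1.1.0/24 is rewritten to the host's address on the way out.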
