Monday, November 1, 2021

Containers vs. Pods – Taking a Deeper Look

Containers could have become a lightweight VM replacement. However, the most widely used form of containers, standardized by Docker/OCI, encourages you to have just one service per container. Such an approach has a bunch of pros - increased isolation, simplified horizontal scaling, higher reusability, etc. However, there is a big con - in the wild, virtual (or physical) machines rarely run just one service.

While Docker tries to offer some workarounds to create multi-service containers, Kubernetes makes a bolder step and chooses a group of cohesive containers, called a Pod, as the smallest deployable unit.

When I stumbled upon Kubernetes a few years ago, my prior VM and bare-metal experience allowed me to get the idea of Pods pretty quickly. Or so I thought... 🙈

When you start working with Kubernetes, one of the first things you learn is that every pod gets a unique IP and hostname and that, within a pod, containers can talk to each other via localhost. So, it's kinda obvious - a pod is like a tiny little server.

After a while, though, you realize that every container in a pod gets an isolated filesystem and that from inside one container, you don't see processes running in other containers of the same pod. Ok, fine! Maybe a pod is not a tiny little server but just a group of containers with a shared network stack.

But then you learn that containers in one pod can communicate via shared memory! So, probably the network namespace is not the only shared thing...

This last finding was the final straw for me. So, I decided to take a deep dive and see with my own eyes:

  • How Pods are implemented under the hood
  • What is the actual difference between a Pod and a Container
  • How one can create Pods using Docker

And along the way, I hope to solidify my Linux, Docker, and Kubernetes skills.

Examining a container

The OCI Runtime Spec doesn't limit container implementations to only Linux containers, i.e., the ones implemented with namespaces and cgroups. However, unless explicitly stated otherwise, the word container in this article refers to this rather traditional form.

Setting up a playground

Before taking a look at the namespaces and cgroups constituting a container, let's set up a playground real quick:

$ cat > Vagrantfile <<EOF
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "debian/buster64"
  config.vm.hostname = "docker-host"
  config.vm.define "docker-host"
  config.vagrant.plugins = ['vagrant-vbguest']

  config.vm.provider "virtualbox" do |vb|
    vb.cpus = 2
    vb.memory = "2048"
  end

  config.vm.provision "shell", inline: <<-SHELL
    apt-get update
    apt-get install -y curl vim
  SHELL

  config.vm.provision "docker"
end
EOF

$ vagrant up
$ vagrant ssh

Last but not least, let's start a guinea-pig container:

$ docker run --name foo --rm -d --memory='512MB' --cpus='0.5' nginx

Inspecting container's namespaces

Let's see what isolation primitives were created when I started the container:

# Look up the container in the process tree.
$ ps auxf
USER       PID  ...  COMMAND
...
root      4707       /usr/bin/containerd-shim-runc-v2 -namespace moby -id cc9466b3e...
root      4727        \_ nginx: master process nginx -g daemon off;
systemd+  4781            \_ nginx: worker process
systemd+  4782            \_ nginx: worker process

# Find the namespaces used by 4727 process.
$ sudo lsns
        NS TYPE   NPROCS   PID USER    COMMAND
...
4026532157 mnt         3  4727 root    nginx: master process nginx -g daemon off;
4026532158 uts         3  4727 root    nginx: master process nginx -g daemon off;
4026532159 ipc         3  4727 root    nginx: master process nginx -g daemon off;
4026532160 pid         3  4727 root    nginx: master process nginx -g daemon off;
4026532162 net         3  4727 root    nginx: master process nginx -g daemon off;

Ok, so the namespaces used to isolate the above container are:

  • mnt (Mount) - the container has an isolated mount table.
  • uts (UNIX Time-Sharing) - the container is able to have its own hostname and domain name.
  • ipc (Interprocess Communication) - processes inside the container can communicate via system-level IPC only to processes inside the same container.
  • pid (Process ID) - processes inside the container are only able to see other processes inside the same container or inside the same pid namespace.
  • net (Network) - the container gets its own network stack.
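
Since the PID of the container's main process (4727) is known, these namespaces can be poked at from the host with nsenter. A quick sketch - the interface name, address, and outputs below are illustrative and abridged:

# Enter the container's network namespace and inspect it
# (the ip binary comes from the host's mount namespace).
$ sudo nsenter --target 4727 --net ip addr show eth0
... inet 172.17.0.2/16 ...

# Enter the uts namespace and check the hostname - Docker
# defaults it to the short container ID.
$ sudo nsenter --target 4727 --uts hostname
cc9466b3eb67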

Notice how the User ID namespace wasn't used! The OCI Runtime Spec mentions the user namespace support. However, while Docker can use this namespace for its containers, it doesn't do so by default due to the inherent limitations of the feature. Thus, the root user in a container is likely the root user from your host system. Beware!

Another namespace that could have been on the list is Cgroup. It took me a while to understand that the cgroup namespace is not the same as the cgroups mechanism. The cgroup namespace just gives a container an isolated view of the cgroup hierarchy. Again, it seems like Docker supports putting containers into private cgroup namespaces but doesn't do it by default.
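
A quick way to double-check this on the playground (which runs cgroups v1, where Docker defaults to the host cgroup namespace) - compare the cgroup namespace of the container's main process with the host's; the inode number below is illustrative:

# The container's cgroup namespace...
$ sudo readlink /proc/4727/ns/cgroup
cgroup:[4026531835]

# ...is the same as the host's.
$ readlink /proc/self/ns/cgroup
cgroup:[4026531835]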

Inspecting container's cgroups

Linux namespaces make processes inside a container think they run on a dedicated machine. However, not seeing any neighbors doesn't mean being fully protected from them. Some hungry neighbors can accidentally consume an unfair share of the host's resources.

cgroups to the rescue!

The cgroups limits for a given process can be checked by examining the corresponding subtree in the cgroup virtual filesystem. The cgroupfs is usually mounted at /sys/fs/cgroup, and the process-specific part of the path can be found at /proc/<PID>/cgroup:

PID=$(docker inspect --format '{{.State.Pid}}' foo)

# Check cgroupfs node for the container main process (4727).
$ cat /proc/${PID}/cgroup
11:freezer:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
10:blkio:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
9:rdma:/
8:pids:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
7:devices:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
6:cpuset:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
5:cpu,cpuacct:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
4:memory:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
3:net_cls,net_prio:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
2:perf_event:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
1:name=systemd:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
0::/system.slice/containerd.service

Hm... Seems like Docker uses the /docker/<container-id> pattern. Ok, anyways:

ID=$(docker inspect --format '{{.Id}}' foo)

# Check the memory limit.
$ cat /sys/fs/cgroup/memory/docker/${ID}/memory.limit_in_bytes
536870912  # Yay! It's the 512MB we requested!

# See the CPU limits.
$ ls /sys/fs/cgroup/cpu/docker/${ID}

It's interesting that starting a container without explicitly setting any resource limits configures a cgroup anyway. I haven't really checked, but my guess is that while CPU and RAM consumption is unrestricted by default, cgroups might be used to limit device access from inside a container.
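
That guess can be partially verified - start a container without limits and look at its cgroup files (the exact "unlimited" value may vary by kernel, and the devices list below is abridged):

$ docker run --name bar --rm -d nginx
$ ID=$(docker inspect --format '{{.Id}}' bar)

# No memory limit was requested, so the value is effectively "unlimited".
$ cat /sys/fs/cgroup/memory/docker/${ID}/memory.limit_in_bytes
9223372036854771712

# The devices controller, however, does carry a non-trivial allowlist.
$ sudo cat /sys/fs/cgroup/devices/docker/${ID}/devices.list
c 1:5 rwm
c 1:3 rwm
...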

Here is how I picture containers in my head after this investigation: a container is a group of processes sharing a set of namespaces, with resource usage capped by a dedicated cgroup subtree.

Examining a pod

Now, let's take a look at Kubernetes Pods. Much like with containers, the implementation of pods can vary between different CRI runtimes. For instance, when Kata Containers is used as one of the supported runtime classes, some of the pods can be true virtual machines! And expectedly, the VM-based pods differ in implementation and capabilities from the pods implemented with traditional Linux containers.

To keep the comparison between Containers and Pods fair, the Pod examination will be done on a Kubernetes cluster that uses the containerd/runc runtime - exactly what Docker uses under the hood to run containers.

Setting up a playground

This time, the playground is set up using minikube with the VirtualBox driver and the containerd runtime. To quickly install minikube and kubectl, you can use the handy arkade tool written by Alex Ellis:

# Install arkade
$ curl -sLS https://get.arkade.dev | sh

$ arkade get kubectl minikube

$ minikube start --driver virtualbox --container-runtime containerd

For the guinea-pig pod, the following would do:

$ kubectl --context=minikube apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - name: app
      image: docker.io/kennethreitz/httpbin
      ports:
        - containerPort: 80
      resources:
        limits:
          memory: "256Mi"
    - name: sidecar
      image: curlimages/curl
      command: ["/bin/sleep", "3650d"]
      resources:
        limits:
          memory: "128Mi"
EOF

Inspecting pod's containers

The actual pod inspection should be done on the Kubernetes cluster node:

$ minikube ssh

Let's look up the pod's processes there:

$ ps auxf
USER       PID  ...  COMMAND
...
root      4947         \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
root      4966             \_ /pause
root      4981         \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
root      5001             \_ /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
root      5016                 \_ /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
root      5018         \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
100       5035             \_ /bin/sleep 3650d

Based on the uptime, the above three (!) process groups were likely created during the pod startup. That's interesting because in the manifest, only two containers, httpbin and sleep, were requested.

It's possible to cross-check the above finding using the containerd command-line client called ctr:

$ sudo ctr --namespace=k8s.io containers ls
CONTAINER      IMAGE                                   RUNTIME
...
097d4fe8a7002  docker.io/curlimages/curl@sha256:1a220  io.containerd.runtime.v1.linux
...
dfb1cd29ab750  docker.io/kennethreitz/httpbin:latest   io.containerd.runtime.v1.linux
...
f0e87a9330466  k8s.gcr.io/pause:3.1                    io.containerd.runtime.v1.linux

Indeed, three containers were created. At the same time, crictl - another command-line client compatible with any CRI runtime - shows just two containers:

$ sudo crictl ps
CONTAINER      IMAGE          CREATED            STATE    NAME     ATTEMPT  POD ID
097d4fe8a7002  bcb0c26a91c90  About an hour ago  Running  sidecar  0        f0e87a9330466
dfb1cd29ab750  b138b9264903f  About an hour ago  Running  app      0        f0e87a9330466

But notice how the POD ID field above matches the pause:3.1 container ID from ctr's output! It seems like the pod has an auxiliary container. So, what is it for?

I'm not aware of any OCI Runtime Spec equivalent for Pods. So, when I'm not satisfied with the information provided by the Kubernetes API spec, I usually go directly to the Kubernetes Container Runtime Interface (CRI) protobuf file:

// kubelet expects any compatible container runtime
// to implement the following gRPC methods:

service RuntimeService {
    ...
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}    
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}    
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}    
    rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
    rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}

    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}    
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}    
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
    rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}    
    rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}    
    rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}    
    rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}

    // ...    
}

message CreateContainerRequest {
    // ID of the PodSandbox in which the container should be created.
    string pod_sandbox_id = 1;
    // Config of the container.
    ContainerConfig config = 2;
    // Config of the PodSandbox. This is the same config that was passed
    // to RunPodSandboxRequest to create the PodSandbox. It is passed again
    // here just for easy reference. The PodSandboxConfig is immutable and
    // remains the same throughout the lifetime of the pod.
    PodSandboxConfig sandbox_config = 3;
}

So, Pods are actually defined in terms of sandboxes and containers that can be started in these sandboxes. The sandbox manages resources common to all the containers in the pod, and the pause container is started during the RunPodSandbox() call. A little bit of Internet searching reveals that this container has just an idle process inside.
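
The CRI flow can even be driven by hand with crictl from inside minikube ssh. A rough sketch - the config files and values below are illustrative, not taken from this article's setup:

# A minimal PodSandboxConfig.
$ cat > pod-sandbox.json <<EOF
{
  "metadata": {
    "name": "bar",
    "namespace": "default",
    "uid": "bar-uid-1",
    "attempt": 0
  },
  "log_directory": "/tmp",
  "linux": {}
}
EOF

# RunPodSandbox() - this is the call that starts the pause container.
$ POD_ID=$(sudo crictl runp pod-sandbox.json)

# CreateContainer() + StartContainer() inside that sandbox
# (container.json is a ContainerConfig along the same lines).
$ CID=$(sudo crictl create ${POD_ID} container.json pod-sandbox.json)
$ sudo crictl start ${CID}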

Inspecting pod's namespaces

Here is how the namespaces look on the cluster node:

$ sudo lsns
        NS TYPE   NPROCS   PID USER            COMMAND
4026532614 net         4  4966 root            /pause
4026532715 mnt         1  4966 root            /pause
4026532716 uts         4  4966 root            /pause
4026532717 ipc         4  4966 root            /pause
4026532718 pid         1  4966 root            /pause
4026532719 mnt         2  5001 root            /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532720 pid         2  5001 root            /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532721 mnt         1  5035 100             /bin/sleep 3650d
4026532722 pid         1  5035 100             /bin/sleep 3650d

So, much like the Docker container in the first section, the pause container gets all five namespaces - net, mnt, uts, ipc, and pid. But apparently, the httpbin and sleep containers get just two namespaces each - mnt and pid. How come?

Turns out, lsns isn't the best tool to check a process' namespaces - it lists each namespace only once, next to the lowest-PID process using it. Instead, to check the namespaces used by a certain process, the /proc/${PID}/ns location can be examined:

# httpbin container
sudo ls -l /proc/5001/ns
...
lrwxrwxrwx 1 root root 0 Oct 24 14:05 ipc -> 'ipc:[4026532717]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 mnt -> 'mnt:[4026532719]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 net -> 'net:[4026532614]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 pid -> 'pid:[4026532720]'
lrwxrwxrwx 1 root root 0 Oct 24 14:05 uts -> 'uts:[4026532716]'

# sleep container
sudo ls -l /proc/5035/ns
...
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 ipc -> 'ipc:[4026532717]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 mnt -> 'mnt:[4026532721]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 net -> 'net:[4026532614]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 pid -> 'pid:[4026532722]'
lrwxrwxrwx 1 100 101 0 Oct 24 14:05 uts -> 'uts:[4026532716]'

While it might be tricky to notice, the httpbin and sleep containers actually reuse the net, uts, and ipc namespaces of the pause container - compare the namespace IDs in the listings above!

Again, this can be cross-checked with crictl:

# Inspect httpbin container.
$ sudo crictl inspect dfb1cd29ab750
{
  ...
  "namespaces": [
    {
      "type": "pid"
    },
    {
      "type": "ipc",
      "path": "/proc/4966/ns/ipc"
    },
    {
      "type": "uts",
      "path": "/proc/4966/ns/uts"
    },
    {
      "type": "mount"
    },
    {
      "type": "network",
      "path": "/proc/4966/ns/net"
    }
  ],
  ...
}

# Inspect sleep container.
$ sudo crictl inspect 097d4fe8a7002
...

I think the above findings perfectly explain the ability of containers in the same pod:

  • to talk to each other
    • via localhost and/or
    • using IPC means (shared memory, message queues, etc.)
  • to have a shared domain and hostname.
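
Both abilities can be quickly demonstrated on the guinea-pig pod (outputs abridged):

# localhost communication - curl the httpbin app from the sidecar.
$ kubectl --context=minikube exec foo -c sidecar -- curl -s localhost:80/uuid
{"uuid": "..."}

# /dev/shm is backed by a common shm mount set up by the runtime
# alongside the shared ipc namespace, so both containers see it.
$ kubectl --context=minikube exec foo -c app -- sh -c 'echo hi > /dev/shm/test'
$ kubectl --context=minikube exec foo -c sidecar -- cat /dev/shm/test
hi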

However, after seeing how all these namespaces are freely reused between containers, I started to suspect that the default boundaries can be shaken. And indeed, a more thorough read of the Pod API spec showed that with the shareProcessNamespace flag set to true, the pod's containers will have four common namespaces instead of the default three. But there was a more shocking finding - the hostIPC, hostNetwork, and hostPID flags can make the containers use the corresponding host namespaces 🤯
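
For illustration only, here is what such a (rather invasive) hypothetical pod spec might look like - a sketch, not something you'd want in production:

apiVersion: v1
kind: Pod
metadata:
  name: nosy
spec:
  shareProcessNamespace: true  # one pid namespace for all pod's containers
  hostNetwork: true            # reuse the node's network namespace
  hostIPC: true                # reuse the node's ipc namespace
  # hostPID: true              # would conflict with shareProcessNamespace
  containers:
    - name: app
      image: docker.io/kennethreitz/httpbin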

Interestingly, the CRI API spec seems to be even more flexible. At least syntactically, it allows scoping the net, pid, and ipc namespaces to either CONTAINER, POD, or NODE. So, hypothetically, a pod where containers cannot talk to each other via localhost can be constructed 🙈
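
The relevant part of the same CRI protobuf file looks roughly like this (abridged, comments mine):

message NamespaceOption {
    // Network namespace scope for this container/sandbox.
    NamespaceMode network = 1;
    // PID namespace scope.
    NamespaceMode pid = 2;
    // IPC namespace scope.
    NamespaceMode ipc = 3;
}

enum NamespaceMode {
    POD       = 0;  // shared with the other containers of the pod
    CONTAINER = 1;  // own, per-container namespace
    NODE      = 2;  // host namespace
    TARGET    = 3;  // namespace of another container
}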

Inspecting pod's cgroups

Ok, what's up with the pod's cgroups? systemd-cgls can nicely visualize the cgroup hierarchy:

$ sudo systemd-cgls
Control group /:
-.slice
├─kubepods
│ ├─burstable
│ │ ├─pod4a8d5c3e-3821-4727-9d20-965febbccfbb
│ │ │ ├─f0e87a93304666766ab139d52f10ff2b8d4a1e6060fc18f74f28e2cb000da8b2
│ │ │ │ └─4966 /pause
│ │ │ ├─dfb1cd29ab750064ae89613cb28963353c3360c2df913995af582aebcc4e85d8
│ │ │ │ ├─5001 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ │ │ └─5016 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ │ └─097d4fe8a7002d69d6c78899dcf6731d313ce8067ae3f736f252f387582e55ad
│ │ │   └─5035 /bin/sleep 3650d
...

So, the pod itself gets a parent cgroup node, and every container can be tweaked separately as well. This matches my expectations, based on the fact that in the Pod manifest, resource limits can be set for every container individually.
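
A quick check (still on the cluster node, reusing the IDs from the systemd-cgls output above) seems to confirm it - the app container's branch carries the 256Mi limit from the manifest:

$ cat /sys/fs/cgroup/memory/kubepods/burstable/pod4a8d5c3e-3821-4727-9d20-965febbccfbb/dfb1cd29ab750064ae89613cb28963353c3360c2df913995af582aebcc4e85d8/memory.limit_in_bytes
268435456  # 256Mi, as requested for the app container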

At this moment, a Pod in my head looks something like this: every container keeps its own mnt and pid namespaces, the net, uts, and ipc namespaces of the pause container are shared by all of them, and the whole bunch sits under one parent cgroup node.

Implementing Pods with Docker

If a Pod under the hood is implemented as a bunch of semi-fused containers with a common cgroup parent, will it be possible to reproduce a Pod-like construct using Docker?

Recently, I tried doing something similar to make multiple containers listen on the same socket, and I know that Docker allows creating a container that reuses an existing network namespace with the docker run --network container:<other-container-name> syntax. I also know that the OCI Runtime Spec defines only create and start commands. So, when you execute a command inside an existing container with docker exec <existing-container> <command>, you actually run, i.e., create then start, a completely fresh container that just happens to reuse all the namespaces of the target container (proofs 1 & 2). This makes me pretty confident that Pods can be reproduced using the standard Docker commands.

As a playground, the original vagrant box with Docker would do. However, I'll use an extra package to simplify dealing with cgroups:

$ sudo apt-get install cgroup-tools

Firstly, a parent cgroup entry needs to be created. For the sake of brevity, I'll use only the cpu and memory controllers:

$ sudo cgcreate -g cpu,memory:/pod-foo

# Check that the corresponding folders were created:
$ ls -l /sys/fs/cgroup/cpu/pod-foo/
$ ls -l /sys/fs/cgroup/memory/pod-foo/

Secondly, a sandbox container should be created:

$ docker run -d --rm \
  --name foo_sandbox \
  --cgroup-parent /pod-foo \
  --ipc 'shareable' \
  alpine sleep infinity

Lastly, starting the actual containers reusing the namespaces of the sandbox container:

# app (httpbin)
$ docker run -d --rm \
  --name app \
  --cgroup-parent /pod-foo \
  --network container:foo_sandbox \
  --ipc container:foo_sandbox \
  kennethreitz/httpbin

# sidecar (sleep)
$ docker run -d --rm \
  --name sidecar \
  --cgroup-parent /pod-foo \
  --network container:foo_sandbox \
  --ipc container:foo_sandbox \
  curlimages/curl sleep 365d

Have you noticed which namespace I omitted? Right, I couldn't share the uts namespace between the containers. Seems like this possibility is not currently exposed in the docker run command - its --uts flag accepts only host. Well, that's a pity, of course. But apart from the uts namespace, it's a success!
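
The consequence is easy to observe - each container keeps its own hostname, which Docker defaults to the short container ID (the values below match the cgroup listing that follows):

$ docker exec foo_sandbox hostname
c7b0ec46b16b
$ docker exec app hostname
488d76cade54
$ docker exec sidecar hostname
9166a87f9a96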

The cgroups look much like the ones created by Kubernetes itself:

$ sudo systemd-cgls memory
Controller memory; Control group /:
├─pod-foo
│ ├─488d76cade5422b57ab59116f422d8483d435a8449ceda0c9a1888ea774acac7
│ │ ├─27865 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ └─27880 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ ├─9166a87f9a96a954b10ec012104366da9f1f6680387ef423ee197c61d37f39d7
│ │ └─27977 sleep 365d
│ └─c7b0ec46b16b52c5e1c447b77d67d44d16d78f9a3f93eaeb3a86aa95e08e28b6
│   └─27743 sleep infinity

The global list of namespaces also looks familiar:

$ sudo lsns
        NS TYPE   NPROCS   PID USER    COMMAND
...
4026532157 mnt         1 27743 root    sleep infinity
4026532158 uts         1 27743 root    sleep infinity
4026532159 ipc         4 27743 root    sleep infinity
4026532160 pid         1 27743 root    sleep infinity
4026532162 net         4 27743 root    sleep infinity
4026532218 mnt         2 27865 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532219 uts         2 27865 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532220 pid         2 27865 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532221 mnt         1 27977 _apt    sleep 365d
4026532222 uts         1 27977 _apt    sleep 365d
4026532223 pid         1 27977 _apt    sleep 365d

And the httpbin and sidecar containers seem to share the ipc and net namespaces:

# app container
$ sudo ls -l /proc/27865/ns
lrwxrwxrwx 1 root root 0 Oct 28 07:56 ipc -> 'ipc:[4026532159]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 mnt -> 'mnt:[4026532218]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 net -> 'net:[4026532162]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 pid -> 'pid:[4026532220]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 uts -> 'uts:[4026532219]'

# sidecar container
$ sudo ls -l /proc/27977/ns
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 ipc -> 'ipc:[4026532159]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 mnt -> 'mnt:[4026532221]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 net -> 'net:[4026532162]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 pid -> 'pid:[4026532223]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 uts -> 'uts:[4026532222]'
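
And as a final smoke test, the sidecar can reach the app via localhost, just like in a real pod (response abridged):

$ docker exec sidecar curl -s localhost:80/uuid
{"uuid": "..."}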

Yay! 🎉

Summarizing

Containers and Pods are alike. Under the hood, they heavily rely on Linux namespaces and cgroups. However, Pods aren't just groups of containers. A Pod is a self-sufficient higher-level construct. All of a pod's containers run on the same machine (cluster node), their lifecycles are synchronized, and mutual isolation is weakened to simplify inter-container communication. This makes Pods much closer to traditional VMs, bringing back familiar deployment patterns like sidecar or reverse proxy.
