Tuesday, April 25, 2023

How to use Podman inside of a container

One of the most frequently asked questions for people working on upstream container technologies is how to run Podman within a container. Historically, most of these questions were about Docker in Docker (DIND), but people now also want to run Podman in Podman (PINP) or Podman in Docker (PIND).

But Podman can be run in multiple ways, rootful and rootless. We end up with people wanting to run various combinations of rootful and rootless Podman:

  • Rootful Podman in rootful Podman
  • Rootless Podman in rootful Podman
  • Rootful Podman in rootless Podman
  • Rootless Podman in rootless Podman

You get the picture.

This blog will attempt to cover each combination, starting with a discussion of privileges. We'll start with the PINP scenario here in part one. In part two of the series, we'll cover similar ground but do so within the context of Kubernetes. Be sure to read both articles for a complete picture.

Container engines require privileges

In order to run a container engine like Podman within a container, the first thing you need to understand is that you need a fair amount of privilege.

  • Containers require multiple UIDs. Most container images need more than one UID to work. For example, you might have an image with most of the files owned by root, but some owned by the apache user (UID=60).
  • Container engines mount file systems and use the system call clone to create user namespaces.

Note: You might need a newer version of Podman.  Examples in this blog were run with Podman 3.2.

Our test image

For the examples in this blog, we'll use the quay.io/podman/stable image, which was built with the idea of finding the best way to run Podman within a container. You can examine how we build this image from the Dockerfile and containers.conf files in the github.com repo.

# stable/Dockerfile
#
# Build a Podman container image from the latest
# stable version of Podman on the Fedora Updates System.
# https://bodhi.fedoraproject.org/updates/?search=podman
# This image can be used to create a secured container
# that runs safely with privileges within the container.
#
FROM registry.fedoraproject.org/fedora:latest

# Don't include container-selinux and remove
# directories used by yum that are just taking
# up space.
RUN dnf -y update; yum -y reinstall shadow-utils; \
yum -y install podman fuse-overlayfs --exclude container-selinux; \
rm -rf /var/cache /var/log/dnf* /var/log/yum.*

RUN useradd podman; \
echo podman:10000:5000 > /etc/subuid; \
echo podman:10000:5000 > /etc/subgid;

VOLUME /var/lib/containers
VOLUME /home/podman/.local/share/containers

ADD https://raw.githubusercontent.com/containers/libpod/master/contrib/podmanimage/stable/containers.conf /etc/containers/containers.conf
ADD https://raw.githubusercontent.com/containers/libpod/master/contrib/podmanimage/stable/podman-containers.conf /home/podman/.config/containers/containers.conf

RUN chown podman:podman -R /home/podman

# chmod containers.conf and adjust storage.conf to enable Fuse storage.
RUN chmod 644 /etc/containers/containers.conf; sed -i -e 's|^#mount_program|mount_program|g' -e '/additionalimage.*/a "/var/lib/shared",' -e 's|^mountopt[[:space:]]*=.*$|mountopt = "nodev,fsync=0"|g' /etc/containers/storage.conf
RUN mkdir -p /var/lib/shared/overlay-images /var/lib/shared/overlay-layers /var/lib/shared/vfs-images /var/lib/shared/vfs-layers; touch /var/lib/shared/overlay-images/images.lock; touch /var/lib/shared/overlay-layers/layers.lock; touch /var/lib/shared/vfs-images/images.lock; touch /var/lib/shared/vfs-layers/layers.lock

ENV _CONTAINERS_USERNS_CONFIGURED=""

Let’s examine the Dockerfile.

FROM registry.fedoraproject.org/fedora:latest

# Don't include container-selinux and remove
# directories used by yum that are just taking
# up space.
RUN dnf -y update; yum -y reinstall shadow-utils; \
yum -y install podman fuse-overlayfs --exclude container-selinux; \
rm -rf /var/cache /var/log/dnf* /var/log/yum.*

First, the image pulls fedora:latest and updates to the latest packages. Note that it reinstalls shadow-utils, since there is a known issue in the shadow-utils install on the Fedora image where the file capabilities on newuidmap and newgidmap are not set. Reinstalling shadow-utils fixes the problem. Next, it installs Podman as well as fuse-overlayfs. We don’t install container-selinux because it is not needed within the container.
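
One way to confirm that the reinstall took effect is to inspect the file capabilities of the two ID-mapping helpers that shadow-utils ships, newuidmap and newgidmap. The sketch below is a minimal, hypothetical helper (check_idmap_caps is not part of the image) that reads getcap output on stdin and reports whether each helper carries a capability. It assumes getcap is available in the container, which the Dockerfile above does not guarantee.

```shell
# Hypothetical helper: read `getcap` output on stdin and report whether
# newuidmap and newgidmap each carry a cap_set* file capability.
check_idmap_caps() {
    caps_out=$(cat)
    for bin in newuidmap newgidmap; do
        if printf '%s\n' "$caps_out" | grep -q "/$bin.*cap_set"; then
            echo "$bin: ok"
        else
            echo "$bin: missing file capability"
        fi
    done
}

# Usage inside the container (assumes getcap is installed):
#   getcap /usr/bin/newuidmap /usr/bin/newgidmap | check_idmap_caps
```

The pattern only looks for the cap_set prefix, so it should match the output of both older and newer libcap releases, which print the capability in slightly different formats.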

RUN useradd podman; \
echo podman:10000:5000 > /etc/subuid; \
echo podman:10000:5000 > /etc/subgid;

Next, I create a user named podman and set up the /etc/subuid and /etc/subgid files to give it 5000 UIDs and GIDs. These are used to set up the user namespace within the container. 5000 is an arbitrary number and potentially too small. We picked this number because it is smaller than the 65k typically allocated to rootless users. If you were only running the container as root, 65k would have been a better number.
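
To make the arithmetic behind an entry such as podman:10000:5000 concrete, here is a small sketch. The subid_range function is a hypothetical helper written for this article, not a Podman or shadow-utils command. It splits a name:start:count entry and prints the ID range it grants.

```shell
# Hypothetical helper: describe one /etc/subuid or /etc/subgid entry
# of the form name:start:count.
subid_range() {
    entry=$1
    name=${entry%%:*}      # text before the first colon
    rest=${entry#*:}       # start:count
    start=${rest%%:*}
    count=${rest#*:}
    last=$(( start + count - 1 ))
    echo "$name: IDs $start-$last ($count total)"
}

# The entry written by the Dockerfile.
subid_range "podman:10000:5000"
# prints: podman: IDs 10000-14999 (5000 total)
```

Any container image run by the podman user inside the container has to fit its UIDs and GIDs into that range.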

VOLUME /var/lib/containers
VOLUME /home/podman/.local/share/containers

Since we can run rootful and rootless containers with this image, we create two volumes. Rootful Podman uses /var/lib/containers for its container storage, and rootless Podman uses /home/podman/.local/share/containers. Overlay over overlay is often denied by the kernel, so this creates non-overlay volumes to be used within the container.

ADD https://raw.githubusercontent.com/containers/libpod/master/contrib/podmanimage/stable/containers.conf /etc/containers/containers.conf
ADD https://raw.githubusercontent.com/containers/libpod/master/contrib/podmanimage/stable/podman-containers.conf /home/podman/.config/containers/containers.conf

I have pre-configured two containers.conf files to make it easier to run containers in each mode.

The image is set up to run with fuse-overlayfs by default. In certain cases, you could run the kernel's overlay file system for rootful mode, and you'll soon be able to do this in rootless mode. However, for now, we use fuse-overlayfs as our container storage within the container. Other people have used the VFS storage driver, but it is not as efficient.
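
The switch to fuse-overlayfs comes from the sed line in the Dockerfile, which edits /etc/containers/storage.conf. The sketch below wraps the same three sed expressions in a hypothetical function, storage_edits, and applies them to a made-up storage.conf fragment so the effect is visible. The sample contents are an assumption for illustration only, and the one-line `a` command assumes GNU sed, as shipped in Fedora.

```shell
# Hypothetical wrapper around the three sed expressions from the Dockerfile.
# Reads a storage.conf on stdin and writes the edited version to stdout.
storage_edits() {
    sed -e 's|^#mount_program|mount_program|g' \
        -e '/additionalimage.*/a "/var/lib/shared",' \
        -e 's|^mountopt[[:space:]]*=.*$|mountopt = "nodev,fsync=0"|g'
}

# Sample fragment (assumed contents, not the real Fedora file).
storage_edits <<'EOF'
[storage.options]
additionalimagestores = [
]
[storage.options.overlay]
#mount_program = "/usr/bin/fuse-overlayfs"
mountopt = "nodev,metacopy=on"
EOF
```

On this sample, the first expression uncomments mount_program, the second appends "/var/lib/shared", after the additionalimagestores line, and the third replaces the mountopt value.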

The --privileged flag

The easiest way to run Podman inside of a container is to use the --privileged flag.

Rootful Podman in rootful Podman with --privileged

# podman run --privileged quay.io/podman/stable podman run ubi8 echo hello
Resolved "ubi8" as an alias (/etc/containers/registries.conf.d/shortnames.conf)
Trying to pull registry.access.redhat.com/ubi8:latest...
Getting image source signatures
Copying blob sha256:a591faa84ab05242a17131e396a336da172b0e1ec66d921c9f130b7c4c24586d
Copying blob sha256:76b9354adec626b01ffb0faae4a217cebd616661fd90c4b54ba4415f53392fb8
Copying config sha256:dc080723f596f2407300cca2c19a17accad89edcf39f7b8b33e6472dd41e30f1
Writing manifest to image destination
Storing signatures
hello

To save time, since I will be running a lot of experiments, I created a directory on my host, ./mycontainers, and volume mount it into the container so that I don't have to pull the image each time.

# podman run --privileged -v ./mycontainers:/var/lib/containers quay.io/podman/stable podman run ubi8 echo hello
hello

Rootless Podman in rootful Podman with --privileged

The quay.io/podman/stable image is set up with a podman user that you can use to run rootless containers.

# podman run --user podman --privileged quay.io/podman/stable podman run ubi8 echo hello
Resolved "ubi8" as an alias (/etc/containers/registries.conf.d/shortnames.conf)
...
hello

Note that in this case, the Podman inside the container runs as the user podman. The containerized Podman uses the user namespace to create a confined container within the privileged container.

Rootful Podman in Docker with --privileged

Similar to rootful Podman in Podman, you can also run rootful Podman within Docker with the --privileged option.

# docker run --privileged quay.io/podman/stable podman run ubi8 echo hello

Rootless Podman in Docker with --privileged

# docker run --user podman --privileged quay.io/podman/stable podman run ubi8 echo hello
Resolved "ubi8" as an alias (/etc/containers/registries.conf.d/shortnames.conf)
...
hello

Can we do this more securely?

Notice that even though we ran the outer containers with --privileged above, the inner containers run in locked-down mode. The rootless Podman running within the container is really locked down and would have a very difficult time escaping. Even so, I am not a fan of using the --privileged flag. I believe we can do better from a security perspective.

Running without the --privileged flag

Let's look at how we can remove the --privileged flag for better security.

Rootful Podman in rootful Podman without --privileged

# podman run --cap-add=sys_admin,mknod --device=/dev/fuse --security-opt label=disable quay.io/podman/stable podman run ubi8-minimal echo hello
hello

We can eliminate the --privileged flag from rootful Podman but still have to disable some security features to make rootful Podman within the container work.

  1. Capabilities: --cap-add=sys_admin,mknod adds two Linux capabilities.
    1. CAP_SYS_ADMIN is required for the Podman running as root inside of the container to mount the required file systems.
    2. CAP_MKNOD is required for Podman running as root inside of the container to create the devices in /dev. (Note that Docker allows this by default.)
  2. Devices: The --device /dev/fuse flag is required to use fuse-overlayfs inside the container. This option tells Podman on the host to add /dev/fuse to the container so that containerized Podman can use it.
  3. Disable SELinux: The --security-opt label=disable option tells the host's Podman to disable SELinux separation for the container. SELinux does not allow containerized processes to mount all of the file systems required to run inside a container.
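
To see whether a container actually received these two capabilities, you can decode the CapEff mask from /proc/self/status. The following is a minimal sketch, assuming a Linux /proc file system and an awk binary in the container. The has_cap helper is hypothetical, while the bit numbers 21 (CAP_SYS_ADMIN) and 27 (CAP_MKNOD) come from linux/capability.h.

```shell
# Capability bit numbers from linux/capability.h.
CAP_SYS_ADMIN=21
CAP_MKNOD=27

# has_cap MASK BIT: succeed if BIT is set in the hex MASK (CapEff format).
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Report on the current process, if /proc is available.
if [ -r /proc/self/status ]; then
    cap_eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
    if [ -n "$cap_eff" ]; then
        for cap in CAP_SYS_ADMIN:$CAP_SYS_ADMIN CAP_MKNOD:$CAP_MKNOD; do
            if has_cap "$cap_eff" "${cap#*:}"; then
                echo "${cap%%:*}: present"
            else
                echo "${cap%%:*}: absent"
            fi
        done
    fi
fi
```

Run inside the outer container, this gives a quick check that the --cap-add flags had the intended effect before you try the inner podman run.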

Rootful Podman in Docker without --privileged

# docker run --cap-add=sys_admin --cap-add mknod --device=/dev/fuse --security-opt seccomp=unconfined --security-opt label=disable quay.io/podman/stable podman run ubi8-minimal echo hello
hello
  1. Docker does not support the comma-separated --cap-add syntax, so I had to add sys_admin and mknod separately.
  2. We still need --device /dev/fuse, since the image defaults to fuse-overlayfs storage.
  3. Docker always creates built-in volumes owned by root:root, so we need to create a volume and mount it for Podman in the container to use for storage.
  4. As always, I need to disable SELinux separation.
  5. We also need to disable seccomp, since Docker has a slightly stricter seccomp policy than Podman. Alternatively, you could apply Podman's seccomp policy with --security-opt seccomp=/usr/share/containers/seccomp.json:
# docker run --cap-add=sys_admin --cap-add mknod --device=/dev/fuse --security-opt seccomp=/usr/share/containers/seccomp.json --security-opt label=disable quay.io/podman/stable podman run ubi8-minimal echo hello
hello

Rootless Podman in rootful Podman without --privileged

Run a non-privileged container with Podman inside, as a non-root user, using the user namespace.

# podman run --user podman --security-opt label=disable --security-opt unmask=ALL --device /dev/fuse -ti quay.io/podman/stable podman run -ti docker.io/busybox echo hello
hello
  1. Note that unlike the rootful within rootful case before, we don't have to add the dangerous security capabilities sys_admin and mknod
  2. In this case, I am running with --user podman, which automatically causes the Podman within the container to run within the user namespace
  3. Still disabling SELinux since it blocks the mounting
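  4. Still need --device /dev/fuse to use fuse-overlayfs within the container
  5. The --security-opt unmask=ALL option removes the default masking of paths such as parts of /proc and /sys, which rootless Podman in the container needs in order to mount its own /proc and /sys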
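
The two non-privileged Podman in Podman variants above differ only in the flags passed to the outer podman run. As a summary, here is a small sketch that prints those flags for each mode. The pinp_flags function is a hypothetical convenience wrapper, not part of Podman, and it only echoes the flags used in this article without running anything.

```shell
# Hypothetical helper: print the outer `podman run` flags used in this
# article for running quay.io/podman/stable without --privileged.
pinp_flags() {
    case "$1" in
        rootful)
            echo "--cap-add=sys_admin,mknod --device=/dev/fuse --security-opt label=disable"
            ;;
        rootless)
            echo "--user podman --security-opt label=disable --security-opt unmask=ALL --device /dev/fuse"
            ;;
        *)
            echo "usage: pinp_flags rootful|rootless" >&2
            return 1
            ;;
    esac
}

# Print (not run) the full command for the rootless case.
echo "podman run $(pinp_flags rootless) quay.io/podman/stable podman run docker.io/busybox echo hello"
```

Keeping the flag sets side by side makes the trade-off visible: the rootless variant needs no added capabilities.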

Podman-remote in rootful Podman with a leaked Podman socket from the host

# podman run -v /run:/run --security-opt label=disable quay.io/podman/stable podman --remote run busybox echo hi
hi

In this case, we are leaking the /run directory from the host into the container. This allows podman --remote to communicate with the Podman socket on the host and start the container on the host OS. This is often how people execute Docker In Docker, especially Docker builds. You could also execute Podman builds this way and take advantage of images previously pulled to the system.

Note, however, this is extremely insecure. The processes within the container can totally take over the host machine.

  1. You still need to disable SELinux separation because SELinux would block the container processes from using sockets leaked in /run.
  2. The podman --remote flag is added to tell Podman to work in remote mode. Note you could also just install the podman-remote executable into a container and use this.
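
Before trying podman --remote, it can help to check that the socket is really visible at the expected path. The sketch below assumes the default rootful socket location, /run/podman/podman.sock, which is normally created on the host by the podman.socket systemd unit. The sock_status function is a hypothetical helper.

```shell
# sock_status PATH: report whether PATH exists and is a Unix socket.
sock_status() {
    if [ -S "$1" ]; then
        echo "$1: socket present"
    else
        echo "$1: no socket"
    fi
}

# Default location of the rootful Podman API socket (assumed here).
sock_status /run/podman/podman.sock
```

If the socket is reported as missing inside the container, the volume mount or the host service is the first thing to check.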

Podman-remote in Docker with a leaked Podman socket from the host

# docker run -v /run:/run --security-opt label=disable quay.io/podman/stable podman --remote run busybox echo hi
hi

The same example works for a Docker container.

The next example shows a fully locked-down container, other than SELinux being disabled, with only the Podman socket directory leaked into the container. SELinux would block this access, as it should.

# /bin/podman run --security-opt=label=disable -v /run/podman:/run/podman quay.io/podman/stable podman --remote run alpine echo hi
hi

Rootless Podman with containerized rootful Podman

$ podman run --privileged quay.io/podman/stable podman run ubi8 echo hello
Resolved "ubi8" as an alias (/etc/containers/registries.conf.d/shortnames.conf)
...
hello

Rootless Podman running rootless Podman

$ podman run --security-opt label=disable --user podman --device /dev/fuse quay.io/podman/stable podman run alpine echo hello

Final thoughts

Now you have some context for Podman in Podman options, using both rootful and rootless modes in various combinations. You also have a better sense of the necessary privileges and the considerations surrounding the --privileged flag.

Part two in this series looks at the use of Podman and Kubernetes. The article covers similar territory but within the context of Kubernetes.
