In August 2022, Dan Walsh (one of the authors of this article) moved out of his role as container runtimes architect at Red Hat to architect for the Red Hat Enterprise Linux (RHEL) for Edge team. Specifically, he has moved to the Red Hat In-Vehicle Operating System (RHIVOS) Containers On Wheels (COW) team.
You might notice some Podman enhancements coming directly from the RHIVOS COW team, described in articles like "Make systemd better for Podman with Quadlet" and "Deploying a multi-container application using Podman and Quadlet." Alexander Larsson, a COW team member, created Quadlet to make running containers under systemd easier. One of the cornerstones of RHIVOS is using systemd to manage the life cycle of containers created by Podman.
Satisfy the need for speed
During Podman's development, as with most container engines, the speed requirements were mainly around pulling container images. Search the internet and you'll find thousands of discussions on shrinking the size of container images. Slow image pulls have always been the number one complaint about container engines. No one pays attention if it takes a second or two to start a container at the command line or in Kubernetes.
When we examine running containers in a car, the priorities flip. In a car, most container images are preinstalled and then updated as part of the operating system or at specific times, but not on startup. If a user command triggers the installation of a container image, the user must wait while the image downloads. However, applications critical to driving cars aren't updated in this way.
What is important is the speed at which the applications start. When you turn the key in a car, you expect the applications to be up and running as fast as possible. Some countries enforce a legal requirement that when you put the car into reverse, the backup camera must start within a couple of seconds.
When our team measured the time to start a Podman container on a low-end system (Raspberry Pi), we found it took almost two seconds just to start the application. If the backup camera or other sensors were to run as containers, we needed to improve the starting speed significantly.
The goal became removing microseconds from the container startup time.
In this article, I cover Podman's speed—primarily the speed to start a container. The chart below provides an overview of progress. If you absorb nothing else from this article, at least understand what the chart tells you.
The rest of the article explains how we improved Podman's speed.
One of the first things we did was analyze what happens when Podman starts a container and why it takes so long. It turns out there was a lot of low-hanging fruit.
Catch the details
When working with a large codebase with hundreds of contributors, sometimes small inefficiencies get added to the code. Since each one adds only tens of microseconds, they are easy to miss. They just have to be found and fixed by grinding with a profiler.
Here are a few that we addressed:
- Don't make unnecessary deep copies of large structures.
- Use pidfd_open() to avoid sleeping in a loop to wait for process exit.
- Avoid APIs that take a long time, such as retrieving the whole system configuration to read a single configuration value, especially when there are simpler ways.
- Properly fix races in image event shutdown routines instead of sleeping for 100 ms.
- Avoid repeatedly creating the same large data structure, loading it, and sometimes writing it to disk; cache it in memory instead.
Compile regular expressions with Go
Podman is written in Go. Go lets you initialize package-level (global) variables where they are declared, and it is fairly common to initialize regular expressions (regex) this way with code like:
var AlphaRegexp = regexp.MustCompile(`[a-zA-Z]`)
If this is done in the global space, then every start of an application pays the price of executing the slow operation of compiling the regex, even when the program never uses this global variable.
Go also encourages the idea of "vendoring," which allows users to include—or vendor—other people's code directly into their executable, instead of using shared libraries.
While the compilation might take only a few microseconds, through vendoring and code reuse, we found that same regex-init construct multiple times throughout the code and in many of the vendored sub-libraries. A new package that compiled these variables on demand rather than on initialization eliminated the inadvertent overhead everywhere. We opened multiple pull requests for vendored code to get those teams to remove the global regex compiles.
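The on-demand idea can be sketched with sync.Once from the Go standard library. This is a minimal illustration, not the package Podman actually uses: the lazyRegexp type, the delayed constructor, and the get method are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// lazyRegexp is a hypothetical wrapper (not Podman's actual package) that
// postpones regexp.MustCompile until the pattern is first used.
type lazyRegexp struct {
	pattern string
	once    sync.Once
	re      *regexp.Regexp
}

// delayed records the pattern only; nothing is compiled at program start.
func delayed(pattern string) *lazyRegexp {
	return &lazyRegexp{pattern: pattern}
}

// get compiles the pattern on the first call and reuses the result afterward.
func (l *lazyRegexp) get() *regexp.Regexp {
	l.once.Do(func() {
		l.re = regexp.MustCompile(l.pattern)
	})
	return l.re
}

// MatchString mirrors the method of the same name on *regexp.Regexp.
func (l *lazyRegexp) MatchString(s string) bool {
	return l.get().MatchString(s)
}

// Declaring the global is now cheap: only a small struct is allocated.
var alphaRegexp = delayed(`[a-zA-Z]`)

func main() {
	fmt.Println(alphaRegexp.MatchString("podman")) // compiles on first use; prints true
	fmt.Println(alphaRegexp.MatchString("12345"))  // reuses the compiled regex; prints false
}
```

A program that never touches alphaRegexp never pays for the compilation. The tradeoff is that an invalid pattern panics at first use instead of at program start.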
Drop virtual networks
One of the most time-consuming parts of setting up a container is creating the virtual networks. By default, Podman sets up private networking by executing netavark and sometimes aardvark-dns. Just running a sub-program can take some time since the kernel needs to duplicate all of the code and then wait for the program to start. Switching to --network=host to use the host network, or using --network=none if the container does not use the network, greatly sped up the container startup. Since most applications within the car can probably use the host network or don't need a network, we suggest running with one of these flags.
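To see how much the network mode matters on your own hardware, a rough timing harness along these lines might help. It is only a sketch: the podmanArgs and timeCommand helpers are hypothetical names, and the ubi9 image with the true command is a placeholder workload. It assumes the image has already been pulled, so the measurement is not dominated by the download.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// podmanArgs builds the arguments for "podman run" with the given network
// mode. An empty mode omits the flag so Podman uses its default network.
// The image (ubi9) and command (true) are placeholders for a real workload.
func podmanArgs(network string) []string {
	args := []string{"run", "--rm"}
	if network != "" {
		args = append(args, "--network="+network)
	}
	return append(args, "ubi9", "true")
}

// timeCommand runs a command to completion and reports the wall-clock time.
func timeCommand(name string, args ...string) (time.Duration, error) {
	start := time.Now()
	err := exec.Command(name, args...).Run()
	return time.Since(start), err
}

func main() {
	if _, err := exec.LookPath("podman"); err != nil {
		fmt.Println("podman not found in PATH; nothing to measure")
		return
	}
	for _, mode := range []string{"", "host", "none"} {
		elapsed, err := timeCommand("podman", podmanArgs(mode)...)
		if err != nil {
			fmt.Printf("network=%q failed: %v\n", mode, err)
			continue
		}
		fmt.Printf("network=%q: %v\n", mode, elapsed)
	}
}
```

A single run per mode is noisy. For numbers worth comparing, repeat each measurement many times or use a benchmarking tool such as hyperfine.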
Use crun improvements
Over the years, Giuseppe Scrivano has continuously improved the speed of Podman's default OCI runtime crun. Runc, a popular alternative OCI runtime written in Go, takes considerably longer to start and uses more resources than crun. Giuseppe wrote an article describing all of the crun speedups.
Precompile seccomp
Most of the improvements were made over the last few years, but when the COW team got involved, we found that compiling the seccomp rules cost us considerable time. Seccomp rules are usually defined in the /usr/share/containers/seccomp.json file. Almost everyone who runs Podman uses this file, and yet it was compiled into BPF bytecode on every container start. crun now uses a precompiled version of the seccomp.json file, if it exists, eliminating the recompilation.
As Giuseppe points out in his article:
With that in place, the cost of compiling the seccomp profile is paid only when the generated BPF filter is not in the cache. This is what I have now:
# hyperfine 'crun-from-the-future run foo'
Benchmark 1: 'crun-from-the-future run foo'
Time (mean ± σ): 5.6 ms ± 3.0 ms [User: 1.0 ms, System: 4.5 ms]
Range (min … max): 4.2 ms … 26.8 ms 101 runs
This demonstrates considerable improvement from the original 159 ms in 2017.
Avoid executing programs during initialization
Podman does a series of checks when it starts to figure out what the kernel supports and which OCI runtime version the system uses. In some cases, this involves a fork or exec of the OCI runtime to check the version. We found it no longer needs to do this and we removed the check, saving startup time.
Work around kernel issues
RHIVOS uses a real-time kernel variant that changes some behavior, making container setup slower. In particular, the real-time kernel changes the default behavior of the read-copy-update (RCU) framework. RCU is a kernel synchronization mechanism that avoids the use of lock primitives. Unfortunately, some optimizations in the RCU framework (something called "expedited grace periods") are not compatible with real-time guarantees, so they are disabled by default on the real-time kernel.
It turns out that these optimizations are important for events during container setup, like mounts, unmounts, and cgroup setup. So container startup on real-time kernels can be quite a lot slower.
You can work around this by using the rcupdate.rcu_normal_after_boot=0 kernel option, but this affects real-time guarantees. We are currently working on better fixes for this.
Use transient storage
By default, Podman keeps storage on physical partitions in /var/lib/containers for rootful users and $HOME/.local/share/containers for rootless users. When running a container, Podman performs many locking operations in the storage directories and often creates JSON files there. These activities involve many writes and kernel syncs, each slowing container startup. Podman also stores its internal database information in the container storage directories.
In RHIVOS, we do not intend to preserve containers over reboot, meaning all containers are destroyed when the car is off. We want to allow container image storage to be permanent but containers to be temporary, so we added the concept of transient storage.
You can see more information about this feature in Podman's man pages:
$ man podman
...
--transient-store
Enables a global transient storage mode where all container metadata is
stored on non-persistent media (i.e. in the location specified by
--runroot). This mode allows starting containers faster, as well as
guaranteeing a fresh state on boot in case of unclean shutdowns or
other problems. However it is not compatible with a traditional model
where containers persist across reboots.
Default value for this is configured in containers-storage.conf(5).
$ man containers-storage.conf
...
transient_store = "false" | "true"
Transient store mode makes all container metadata be saved in temporary
storage (i.e. runroot above). This is faster, but doesn't persist
across reboots. Additional garbage collection must also be performed
at boot-time, so this option should remain disabled in most
configurations. (default: false)
You can run containers with transient storage by providing the --transient-store command line flag:
# podman --transient-store run ubi9 echo hi
This approach is similar to running all your containers with the podman run --rm option. All container locking, reads, and writes, as well as the Podman database, are moved to /run, which is a temporary filesystem (tmpfs). This dramatically increases the speed of starting a container.
Note that you can't run containers in a mixed mode, where some are transient and others persist. If you are running an edge device or server, where the speed of starting containers is critically important and persisting the containers over reboot is not, then using --transient-store is an excellent idea.
Wrap up
We continue to work on finding and fixing performance issues in container startup in Podman. At this point, we have successfully improved it from around 2 seconds on the Raspberry Pi to under 0.3 seconds, a more than sixfold speedup.