Exploiting CVE-2021-3490 for Container Escapes
Today, containers are the preferred approach to deploy software or create build environments in CI/CD lifecycles. However, since the emergence of container solutions and environments like Docker and Kubernetes, security researchers have consistently found ways to escape from containers once they are compromised. Most attacks are based on configuration errors. But it is also possible to escalate privileges and escape to the container’s host system by exploiting vulnerabilities in the host’s operating system.
This blog shows how to modify an existing Linux kernel exploit in order to use it for container escapes and how the CrowdStrike Falcon® platform can help to prevent and hunt for similar threats.
Schedule a demo to see how CrowdStrike’s CNAPP solution protects the cloud and cloud workloads: Cloud Security Posture Management | Cloud Workload Protection | Container Security | Cloud Infrastructure Entitlement Management
Before we outline the modifications required to turn the exploit into a container escape, we first look at what the original exploit achieved.
Valentina Palmiotti published a full exploit for CVE-2021-3490 that can be used to locally escalate privileges to root on affected systems. The vulnerability was rooted in the eBPF subsystem of the Linux kernel and fixed in version 5.10.37. eBPF allows user space processes to load custom programs into the kernel and attach them to so-called events, thus giving user space the ability to observe kernel internals and, in specifically supported cases, to implement custom logic for networking, access control and other tasks. These eBPF programs have to pass a verifier before being loaded, which is supposed to guarantee that the code does not contain loops and does not write to memory outside of its dedicated area. This step should ensure that eBPF programs terminate and are not able to manipulate kernel memory, which would potentially allow attackers to escalate privileges. However, this verifier contained several vulnerabilities in the past. CVE-2021-3490 is one of them and can ultimately be used to achieve a kernel read and write primitive.
Building on the kernel read primitive, it is possible to leak a kernel pointer. eBPF programs can communicate with processes running in user space using so-called “eBPF maps.” Every eBPF map is described by a
struct bpf_map object, which contains a field
ops pointing to a
struct bpf_map_ops. That struct contains several function pointers for working with the eBPF map. eBPF maps come in different kinds with different definitions for
ops stored at known offsets. For array maps,
ops will be set to point to the kernel symbol
array_map_ops. The exploit will leak that address and then use it as a starting point to further scan the kernel’s memory space and read pointers from the kernel’s symbol table.
The kernel exports pointers to certain variables, objects and functions in a symbol table to make them accessible by kernel modules. This table is called
ksymtab. In order to look up the actual name of a stored symbol address, a second table, called
kstrtab, is utilized. A pointer to the string in
kstrtab that contains the name is stored as part of every
ksymtab entry, right after the pointer to the symbol itself. To find the address of a kernel symbol, the exploit first reads memory from kernel space starting at the leaked address of
array_map_ops using the arbitrary read primitive. This is done until the string containing the symbol name of interest is found in
kstrtab is mapped after
ksymtab, the previously read memory region should contain the pointer to the string in
kstrtab. Therefore, the exploit then proceeds to search for that pointer, and the pointer to the actual symbol is stored right before it.
One clarification has to be made about the above code excerpt: There are two different formats of
ksymtab. Which one is used is decided when the Linux kernel is built from source by the configuration parameter
HAVE_ARCH_PREL32_RELOCATIONS. In one case, the actual addresses of the symbols and
kstrtab entries are stored. However, in many kernel builds it does not actually contain pointers but offsets such that the address where the offset is stored plus the offset itself is the symbol’s address or the string’s address in
Nevertheless, using this technique, the exploit identifies the address of
init_pid_ns. This object is a
struct pid_namespace and describes the default process ID namespace new processes are started in.
Namespaces have become a fundamental feature of Linux and are crucial to the idea of container environments. They allow separating system resources between different processes such that one process can observe a completely different set than others. For example, mount namespaces control observable filesystem mount points such that two processes can have different views of the filesystem. This allows a container’s filesystem to have a different root directory than the host. Process ID namespaces on the other side give processes a completely unique process tree. The first process in a process ID namespace always has the identifier (PID) 1. It is considered as the
init process that initializes the operating system and from which new processes originate. Therefore, if this process is stopped, all other processes in the particular process ID namespace are stopped as well.
init_pid_ns, it is possible to enumerate all
struct task objects of the processes running in that namespace as those are stored in a traversable radix tree in the field
idr. The exploit identifies the correct
task object by its PID. Those
task objects store a pointer to a
struct cred object that contains the UID and GID (user and group identifier) associated with the process and therefore holds the granted permissions. By overwriting the
cred object of a process, it is possible to escalate privileges by setting the UID and GID to 0, which is associated with the
However, this approach does not work if a container was compromised and the attacker’s goal is to escape into the container’s host environment.
Why This Doesn’t Work in Containers
Linux kernel exploits are an alternative method to escape container environments to the host in case no mistakes in the container configuration were made. They can be used because containers share the host’s kernel and therefore its vulnerabilities, regardless of the Linux distribution the container is based on. However, exploit developers have to pay attention to some obstacles compared to privilege escalation outside of container environments.
First, container solutions are able to restrict the capabilities of processes running inside a container. For example, the capability
SYS_ADMIN is normally not granted to processes running in containers, which can therefore not mount file systems or execute various other privileged actions. Moreover, it is possible to restrict the set of syscalls a userland process can call by utilizing
seccomp. For example, in the default configuration of Docker, an exploit would not be able to use eBPF at all. Nevertheless, in the default Kubernetes configuration,
seccomp does not restrict the available syscalls at all. For the remainder of this post, though, we will assume that the container is configured such that eBPF could be used by userland processes.
Second, on a more practical note, the techniques of the original exploit described above will not work out-of-the-box. As already described, containers rely heavily on namespaces. Because containers typically have their own associated process ID namespace, it is not as straightforward to identify the exploit process running in the container by its PID, because, for example, the exploit may have PID 42 from the container’s perspective but PID 1337 from the host’s perspective. However, the parent namespace can still observe all processes running in child namespaces. Therefore, those processes have a PID in both parent and child namespace. Ultimately, the initial process ID namespace described by
init_pid_ns can observe any process running on a particular system. Nevertheless, even if we identify the
task structure of our exploit process within a container, overwriting its
cred object as described previously will simply elevate privileges within the container but not allow container escape.
Changes for Container Escapes
It is possible to modify the exploit so that a container escape is conducted and privileges are escalated to
root on the host. To easily find the exploit process in a container, an exploit can search for the symbol
pcpu_base_addr symbols in
current_task stores the offset to the running process’s
task object based on the address stored in
pcpu_base_addr is unique per CPU core, the process must be pinned before on one core using the
sched_setaffinity() system call.
Using this technique it is possible to identify the correct
task object without traversing the radix tree of all processes stored in
This allows the attacker to overwrite the correct
cred object and therefore obtain
root privileges. Due to the usage of namespaces, the observable file system is still that of the container, though. Nevertheless, it is possible to overcome that obstacle as well. The
task object contains a pointer to a
struct fs_struct object. This object contains information about the observable file system, i.e.,which directory is considered as the processes’ file system root. Using the leaked pointer to
init_pid_ns, it is possible to traverse the process radix tree and identify the host’s
init process, which has PID 1. Next, it is possible to retrieve the
fs pointer from this process’s
task object. Lastly, while overwriting the
cred object of the exploit process, the
fs pointer must be overwritten as well using the
fs pointer. The exploit process can then observe the complete host file system.
One last addition must be made. As stated above, containers normally have limited capabilities. Capabilities are used to restrict the permissions of processes running in containers. To obtain full privileges, the exploit also has to overwrite the capabilities mask of the exploit’s process in the
task object. How exactly the values must be set to obtain full capabilities without any restrictions can be investigated in the definition of the
init process’ credentials.
The technique described in this blog to identify the
task object of the exploit’s process only works on Linux kernel version before 5.15, as
pcpu_base_addr is no longer exported as a symbol to
ksymtab. Nevertheless, alternative methods exist to find the correct
task object, e.g., by traversing the radix tree of all processes from
init_pid_ns and matching on features of the exploit process other than the PID, such as the
comm member of
struct task that contains the executable name.
Container Escape Mitigations
Detecting this and similar exploits is very hard as they are data-only and misuse only legitimate system calls. The CrowdStrike Falcon platform can assist in preventing attacks using similar techniques for privilege escalation. As a defense-in-depth strategy, the following steps can be taken to harden Linux hosts and container environments to prevent exploitation of CVE-2021-3490 and future attacks.
- Upgrade the kernel version. With a critical kernel vulnerability like CVE-2021-3490, it is paramount that available fixes are applied by upgrading the kernel version.
- Provide only required capabilities to the container. By limiting the capabilities of the container, the root account of the container becomes limited in its capabilities, which significantly reduces the chances of container escape and exploitation of kernel vulnerabilities. For example, to exploit the CVE-2021-3490 using the described technique, the attacker needs CAP_BPF or CAP_SYS_ADMIN granted. Note that privileged containers have those capabilities. Therefore, you should monitor your environment for such containers with CrowdStrike Falcon® Cloud Workload Protection (CWP), as discussed in point 4 below.
- Use a seccomp profile. While Kubernetes does not apply a seccomp profile without configuration, Docker’s default seccomp profiles protect against a number of dangerous system calls that can help attackers to break out of the container environment. Correct Seccomp profiles can help significantly reduce the container attack surface. CVE-2021-3490 requires the
bpfsystem call to exploit the vulnerability, which is blocked in Docker’s default seccomp profile. Hence, exploitation of CVE-2021-3490 in a container environment using a strong seccomp profile would fail.
- Monitor host and containerized environment for a breach. In case a privileged workload or a host is compromised by attackers, the organization needs state-of-the-art monitoring and detection capabilities to prevent and detect advanced persistent threats (APTs), eCrime and nation-state actors. CrowdStrike can help with this. Falcon Cloud Workload Protection identifies any indicators of misconfiguration (IOMs) in your containerized environment to uncover a weakness. Falcon Cloud Workload Protection prevents and detects malicious activity on your host and containers to prevent and detect — in real time — breaches by eCrime and nation-state adversaries. For example, if a privileged container or a container without a seccomp profile is executed, the following notifications would appear:
Also, Falcon CWP helps to hunt for threats using the eBPF subsystem to escalate privileges by logging if the
bpf system call was used by a process.
Container technology is a good solution to separate and fine-tune resources to different processes. However, while existing solutions add another layer of security due to the restriction of capabilities and available syscalls, the available attack surface inside a container still contains the host’s kernel. Every eased restriction — for example, allowing the use of eBPF — will increase the attack surface. If a threat actor is able to take advantage of a vulnerability inside the host’s kernel and an exploit is available, the host can be compromised, regardless of other security layers and restrictions such as namespaces.
This blog showed exactly that: Not much effort is needed to turn a full exploit chain for a local privilege escalation into one that is able to escape containers as well. The basic rules of network hygiene (patch early and often) not only apply to containers but to the hosts that deploy those in a cloud environment as well. Moreover, solutions such as Docker and Kubernetes can reduce the attack surface drastically if configured properly. CrowdStrike Falcon Cloud Workload Protection can assist in identifying and hunting for weaknesses in the deployed configuration that could lead to a compromise.
- Request a free trial of the industry-leading CrowdStrike Falcon platform.
- Read about adversaries tracked by CrowdStrike in 2021 in the 2022 CrowdStrike Global Threat Report and in the 2022 Falcon OverWatch Threat Hunting Report.
- Request a free CrowdStrike Intelligence threat briefing and learn how to stop adversaries targeting your organization.
- Watch an introductory video on the CrowdStrike Falcon console and register for an on-demand demo of the market-leading CrowdStrike Falcon platform in action.