@sbrivio-rh sbrivio-rh commented Oct 8, 2025

Note: this is the corrected version of moby/moby#51130, which I opened against the wrong repository. I'm just copying over the whole description from there.


I'm a maintainer and author of passt (https://passt.top/), a user-mode networking implementation, that's used to connect containers, with pasta(1), and virtual machines, with passt(1), in an unprivileged way, without creating network interfaces.

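By the way, Moby optionally uses pasta(1) to connect rootless containers via rootlesskit:
  https://github.com/rootless-containers/rootlesskit/blob/236f31ec2258a1da1b1a9b62b168dd5f9a840f83/pkg/network/pasta/pasta.go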

Given that these tools deal with network packets from untrusted workloads, we pay particular attention to their security posture.

The project implements a rather substantial sandboxing mechanism, so that, once the initialisation phase completes, passt(1) and pasta(1) only have access to an empty filesystem with a zero-size limit, and relinquish access possibilities to any resources they don't need, by means of detaching namespaces:
  https://passt.top/passt/tree/isolation.c
  https://passt.top/#security

Users report that they can't use passt(1) in Docker containers, with one notable example at:
  https://bugs.passt.top/show_bug.cgi?id=116

and resort to running modified builds of passt:
  https://bugs.passt.top/show_bug.cgi?id=116#c6

with sandboxing features entirely disabled. This is of course not something we support, so it's not a particular concern in terms of maintainability, but it still forces users to disable important security features, and it's a rather alarming trend.

As a side note, Flatpak has a similar issue:
  flatpak/flatpak#5921

and, same there, users routinely run custom builds of applications that ship strict native sandboxing features (including passt, Chromium, and Firefox) with those features disabled. This is not in the best interest of security and surely not in the best interest of those users.

To fix this, enable unshare() regardless of the CAP_SYS_ADMIN capability, so that unprivileged applications can perform appropriate, strict sandboxing.

I'm well aware of CVE-2022-0185 and CVE-2022-0492, but, since then, there have been significant hardening efforts going on in the affected portions of the kernel and the current situation appears substantially different, now.

Despite the original intention, a blanket ban on unprivileged unshare() appears nowadays to be detrimental to the security of containerised applications, instead of contributing to it, as an increasing number of applications finally start using namespaces for their own sandboxing, which is generally stricter than what any container runtime can provide.

Link: https://bugs.passt.top/show_bug.cgi?id=116
Reported-by: [email protected]
Signed-off-by: Stefano Brivio [email protected]

- What I did
I took unshare(2), the system call, out of the CAP_SYS_ADMIN gate in the default seccomp profile.
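
For readers unfamiliar with the profile layout, here is a minimal sketch of what taking a system call "out of the CAP_SYS_ADMIN gate" amounts to. The `PROFILE` dictionary is a hypothetical, heavily trimmed stand-in (not the real default profile), and `ungate_syscall` is an illustrative helper, not something that exists in Moby. Only the field names (`names`, `action`, `includes`, `caps`) are meant to mirror the JSON layout of the profile.

```python
import copy

# Hypothetical, trimmed stand-in for a seccomp profile. The real default
# profile is much longer; only the field layout is meant to match.
PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        # Unconditional allow list.
        {"names": ["uname", "unlink", "unlinkat"], "action": "SCMP_ACT_ALLOW"},
        # Allowed only if the container holds CAP_SYS_ADMIN.
        {
            "names": ["mount", "setns", "unshare"],
            "action": "SCMP_ACT_ALLOW",
            "includes": {"caps": ["CAP_SYS_ADMIN"]},
        },
    ],
}


def is_unconditional_allow(rule):
    """True for an allow rule with no capability or argument conditions."""
    return (
        rule["action"] == "SCMP_ACT_ALLOW"
        and not rule.get("includes")
        and not rule.get("excludes")
        and not rule.get("args")
    )


def ungate_syscall(profile, name, cap="CAP_SYS_ADMIN"):
    """Return a copy of profile where `name` no longer requires `cap`."""
    result = copy.deepcopy(profile)
    moved = False
    for rule in result["syscalls"]:
        caps = rule.get("includes", {}).get("caps", [])
        if cap in caps and name in rule["names"]:
            rule["names"].remove(name)
            moved = True
    if not moved:
        return result
    for rule in result["syscalls"]:
        if is_unconditional_allow(rule):
            if name not in rule["names"]:
                rule["names"] = sorted(rule["names"] + [name])
            break
    else:
        # No unconditional allow rule found: add one for this syscall.
        result["syscalls"].append({"names": [name], "action": "SCMP_ACT_ALLOW"})
    return result


patched = ungate_syscall(PROFILE, "unshare")
print(patched["syscalls"][0]["names"])
```

The helper works on a copy, so the original dictionary is left untouched and the two versions can be compared.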

- How I did it
I did it proudly, with a keyboard. I used so-called shortcuts that allowed me to conceptually cut one line of text file and paste it to another location.

- How to verify it
Run passt in a Docker container.
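
If passt itself is not at hand, a rough probe along these lines might help tell whether unshare(2) gets through the seccomp filter at all. This is only a sketch: `probe_unshare` is a made-up helper name, it calls the libc `unshare()` wrapper through `ctypes`, and it assumes a Linux system with a Python interpreter inside the container. Note that a failure can also come from the kernel itself (for example, user namespaces disabled via sysctl), not just from seccomp.

```python
import ctypes
import errno
import os

CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>


def probe_unshare(flags=CLONE_NEWUSER):
    """Try unshare(flags) in a throwaway child process.

    Returns "allowed", an errno name such as "EPERM", or a short
    description if the child was killed by a signal.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    pid = os.fork()
    if pid == 0:
        # Child: freshly forked, hence single-threaded, so the kernel
        # won't reject CLONE_NEWUSER because of other threads.
        rc = libc.unshare(flags)
        os._exit(0 if rc == 0 else (ctypes.get_errno() or 255))
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    code = os.WEXITSTATUS(status)
    if code == 0:
        return "allowed"
    return errno.errorcode.get(code, "errno %d" % code)


print("unshare(CLONE_NEWUSER):", probe_unshare())
```

With a filter that denies the call, I would expect an errno name to be reported (presumably EPERM), and "allowed" once the call is permitted, but the exact result depends on the host kernel configuration as well.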

- Human readable description for the release notes

The unshare(2) system call is now permitted in the default seccomp profile, enabling users to run applications that provide native sandboxing capabilities based on Linux namespaces.

- A picture of a cute animal (not mandatory but encouraged)

Inspired by a submission at https://user.xmission.com/~emailbox/ascii_cats.htm:

fsc              ._
              .-'  `-.
           .-'        \
          ;    .-'\    ;
          `._.'    ;   |
                   |   |
                   ;   :
                  ;   :
                  ;   :
                 /   /
                ;   :                   ,
                ;   |               .-"7|
              .-'"  :            .-' .' :
           .-'       \         .'  .'   `.
         .'           `-. ""-.-'`""    `",`-._..--"7
         ;    .          `-.J `-,    ;"`.;|,_,    ;
       _.'    |         `"" `. ."""--. o \:.-. _.'
    .""       :            ,--`;   ,  `--/}o,' ;
    ;   .___.'        /     ,--.`-. `-..7_.-  /_
     \   :   `..__.._;    .'__;    `---..__.-'-.`"-,
     .'   `--. |   \_;    \'   `-._.-")     \\  `-,
     `.   -.`_):      `.   `-"""`.   ;__.' ;/ ;   "
       `-.__7"  `-..._.'`7     -._;'  ``"-''
                         `--.,__.'  let me run unshare() or isolation code in passt will face GRAVITY
                         

I'm a maintainer and author of passt (https://passt.top/), a user-mode
networking implementation, that's used to connect containers, with
pasta(1), and virtual machines, with passt(1), in an unprivileged way,
without creating network interfaces.

By the way, Moby optionally uses pasta(1) to connect rootless
containers via rootlesskit:
  https://github.com/rootless-containers/rootlesskit/blob/236f31ec2258a1da1b1a9b62b168dd5f9a840f83/pkg/network/pasta/pasta.go

Given that these tools deal with network packets from untrusted
workloads, we pay particular attention to their security posture.

The project implements a rather substantial sandboxing mechanism, so
that, once the initialisation phase completes, passt(1) and pasta(1)
only have access to an empty filesystem with a zero-size limit, and
relinquish access possibilities to any resources they don't need, by
means of detaching namespaces:
  https://passt.top/passt/tree/isolation.c
  https://passt.top/#security

Users report that they can't use passt(1) in Docker containers, with
one notable example at:
  https://bugs.passt.top/show_bug.cgi?id=116

and resort to running modified builds of passt:
  https://bugs.passt.top/show_bug.cgi?id=116#c6

with sandboxing features entirely disabled. This is of course not
something we support, so it's not a particular concern in terms of
maintainability, but it still forces users to disable important
security features, and it's a rather alarming trend.

As a side note, Flatpak has a similar issue:
  flatpak/flatpak#5921

and, same there, users routinely run custom builds of applications
that ship strict native sandboxing features (including passt,
Chromium, and Firefox) with those features disabled. This is not
in the best interest of security and surely not in the best interest
of those users.

To fix this, enable unshare() regardless of the CAP_SYS_ADMIN
capability, so that unprivileged applications can perform appropriate
sandboxing.

I'm well aware of CVE-2022-0185 and CVE-2022-0492, but, since then,
there have been significant hardening efforts going on in the affected
portions of the kernel and the current situation appears substantially
different, now.

Despite the original intention, a blanket ban on unprivileged
unshare() appears nowadays to be detrimental to the security of
containerised applications, instead of contributing to it, as an
increasing number of applications finally start using namespaces for
their own sandboxing, which is generally stricter than what any
container runtime can provide.

Link: https://bugs.passt.top/show_bug.cgi?id=116
Reported-by: [email protected]
Signed-off-by: Stefano Brivio <[email protected]>
@sbrivio-rh
Author

I just found #4 as I moved this merge request to the right repository. I'm not sure what to do with this one, as it's partially a duplicate, but passt(1) and pasta(1) need unshare(2) flags that are not covered by that one.

"uname",
"unlink",
"unlinkat",
"unshare",
Member

Can we have this as a non-default built-in profile like --security-opt seccomp=allow-unshare-user?

Or if we are going to have this as the default, we will need to provide seccomp=disallow-unshare-user option.

Originally posted by @AkihiroSuda in #42441

Author

Thanks, I wasn't aware of moby/moby#42441.

I would argue that unshare() should be the default, otherwise container developers will hit https://bugs.passt.top/show_bug.cgi?id=116#c0 and keep distributing less secure builds of software because they have no practical way to ask users to add options when they run containers. See also https://bugs.passt.top/show_bug.cgi?id=116#c9.

I can take care of adjusting this pull request (if it makes sense at all) along the lines of moby/moby#42455, which already implemented your suggestion.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that user namespaces are blocked by default is that they expose a massive amount of kernel attack surface. This makes it much easier for an application within the container to break out.

For passt, I’m curious if the same goal could be achieved with just seccomp and possibly Landlock. Whether passt has permission to open files doesn’t matter if it can’t make any filesystem syscalls, and Landlock can cut off the remaining filesystem access except chdir(). seccomp can also prevent passt from sending signals to any process that isn’t itself.

Author

> The reason that user namespaces are blocked by default is that they expose a massive amount of kernel attack surface. This makes it much easier for an application within the container to break out.

I think I covered that part here:

> I'm well aware of CVE-2022-0185 and CVE-2022-0492, but, since then, there have been significant hardening efforts going on in the affected portions of the kernel and the current situation appears substantially different, now.

but that's a quantitative and somewhat arbitrary evaluation. And while at it, I'm myself responsible for CVE-2022-2078, but again, we've been hardening things a lot in the past years, also as a result of exposure from rootless containers (Podman can do all this). Exposure is actually a good thing in the long term.

Much less arbitrary, though, is what the author of #4 pointed out in #4 (comment): it's not Docker's job to mitigate kernel vulnerabilities. There are Linux security modules, including Landlock, with configurable and appropriately flexible profiles, which makes them the right tool for this.

> For passt, I’m curious if the same goal could be achieved with just seccomp

passt already ships rather restrictive seccomp profiles:

$ make
seccomp profile passt allows:  accept accept4 bind clock_gettime close connect
   epoll_ctl epoll_pwait epoll_wait exit_group fallocate fcntl fsync ftruncate
   getsockname getsockopt listen lseek read recvfrom recvmmsg recvmsg sendmmsg
   sendmsg sendto setsockopt shutdown socket timerfd_create timerfd_gettime
   timerfd_settime write writev
seccomp profile pasta allows:  accept accept4 bind clock_gettime clone close connect
   epoll_ctl epoll_pwait epoll_wait exit exit_group fallocate fcntl fsync ftruncate
   getsockname getsockopt ioctl listen lseek openat pipe2 read recvfrom recvmmsg
   recvmsg rt_sigprocmask rt_sigreturn sendmmsg sendmsg sendto setns setsockopt
   shutdown socket splice timerfd_create timerfd_gettime timerfd_settime waitid
   write writev
seccomp profile vu allows:  accept accept4 bind clock_gettime close connect
   epoll_ctl epoll_pwait epoll_wait exit_group fallocate fcntl fsync ftruncate
   getsockname getsockopt ioctl listen lseek mmap munmap read recvfrom recvmmsg
   recvmsg sendmmsg sendmsg sendto setsockopt shutdown socket timerfd_create
   timerfd_gettime timerfd_settime write writev

> and possibly Landlock.

...as well as AppArmor and SELinux policies. Of course, all contributions including a new shiny Landlock profile are warmly welcome, but Landlock wouldn't cover much more than what we're already covering with "traditional" LSMs.

> Whether passt has permission to open files doesn’t matter if it can’t make any filesystem syscalls,

pasta(1) needs connect(2) and bind(2), as well as openat(2) for a number of reasons (see git log), even though we can probably drop the latter with a bit of extra work. But it's not just about filesystem access, it's also about seeing other PIDs (not necessarily to send signals).

> and Landlock can cut off the remaining filesystem access except chdir().

Right, I don't exclude that Landlock might provide some slightly finer tailored access control compared to what we have with AppArmor and SELinux.

> seccomp can also prevent passt from sending signals to any process that isn’t itself.

I don't see a way (unless we're talking of something based on further system call argument examination via e.g. seccomp_unotify(2) and seitan), but, in any case, kill(2) is not enabled in the seccomp profiles, so that's not a concern.
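
As a quick cross-check of the point about kill(2), the `make` output quoted earlier in this comment can be parsed and compared mechanically. The snippet below is just a sketch: `parse_profiles` is an ad-hoc helper written for this purpose, and the text is copied from the output above rather than read from a real build.

```python
# Output copied from the `make` run quoted above.
MAKE_OUTPUT = """\
seccomp profile passt allows:  accept accept4 bind clock_gettime close connect
   epoll_ctl epoll_pwait epoll_wait exit_group fallocate fcntl fsync ftruncate
   getsockname getsockopt listen lseek read recvfrom recvmmsg recvmsg sendmmsg
   sendmsg sendto setsockopt shutdown socket timerfd_create timerfd_gettime
   timerfd_settime write writev
seccomp profile pasta allows:  accept accept4 bind clock_gettime clone close connect
   epoll_ctl epoll_pwait epoll_wait exit exit_group fallocate fcntl fsync ftruncate
   getsockname getsockopt ioctl listen lseek openat pipe2 read recvfrom recvmmsg
   recvmsg rt_sigprocmask rt_sigreturn sendmmsg sendmsg sendto setns setsockopt
   shutdown socket splice timerfd_create timerfd_gettime timerfd_settime waitid
   write writev
seccomp profile vu allows:  accept accept4 bind clock_gettime close connect
   epoll_ctl epoll_pwait epoll_wait exit_group fallocate fcntl fsync ftruncate
   getsockname getsockopt ioctl listen lseek mmap munmap read recvfrom recvmmsg
   recvmsg sendmmsg sendmsg sendto setsockopt shutdown socket timerfd_create
   timerfd_gettime timerfd_settime write writev
"""


def parse_profiles(text):
    """Map each profile name to the set of syscalls it allows."""
    profiles = {}
    current = None
    for line in text.splitlines():
        if line.startswith("seccomp profile "):
            head, _, rest = line.partition(" allows:")
            current = head.split()[2]  # "seccomp profile <name>"
            profiles[current] = set(rest.split())
        elif current and line.strip():
            # Continuation line: more syscalls for the current profile.
            profiles[current].update(line.split())
    return profiles


profiles = parse_profiles(MAKE_OUTPUT)
for name, allowed in profiles.items():
    print(name, "allows kill:", "kill" in allowed)
print("pasta only:", sorted(profiles["pasta"] - profiles["passt"]))
```

For this input, none of the three profiles lists kill, and the pasta-only set comes out as clone, exit, ioctl, openat, pipe2, rt_sigprocmask, rt_sigreturn, setns, splice and waitid.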

In any case, while the original user report behind this was about passt(1), with a blanket ban on unshare(2), you can't run pasta(1) in Docker itself (it obviously needs clone(CLONE_NEWNET)), which is rather absurd. And that's not even about sandboxing, it's about basic functionality we can't provide otherwise.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> > and Landlock can cut off the remaining filesystem access except chdir().

> Right, I don't exclude that Landlock might provide some slightly finer tailored access control compared to what we have with AppArmor and SELinux.

The huge advantages of Landlock are that it is unprivileged and does not expose a large amount of kernel attack surface.

> Much less arbitrary, though, is what the author of #4 pointed out in #4 (comment): it's not Docker's job to mitigate kernel vulnerabilities. There are Linux security modules, including Landlock, with configurable and appropriately flexible profiles, which makes them the right tool for this.

It actually somewhat is Docker's job. Seccomp is the only approach I know of to restricting namespaces that is distribution-agnostic and allows generating policy at runtime. LSMs are very distribution-specific: some use SELinux, others use AppArmor, and there may be others that use neither. Also, I don’t expect changing SELinux policies to be in scope for Docker, especially on distributions like RHEL that use monolithic policy. AppArmor policies can be dynamically generated but I don’t know if they are flexible enough for this purpose. Landlock is not enabled universally yet.

What I absolutely do support is having the decision to allow user namespaces be separate from the decision to allow CAP_SYS_ADMIN. The latter should imply the former, but not the other way around.
