seccomp: Take unshare() out of CAP_SYS_ADMIN gate #5
I'm a maintainer and author of passt (https://passt.top/), a user-mode networking implementation that's used to connect containers, with pasta(1), and virtual machines, with passt(1), in an unprivileged way, without creating network interfaces.

By the way, Moby optionally uses pasta(1) to connect rootless containers via rootlesskit:

https://github.com/rootless-containers/rootlesskit/blob/236f31ec2258a1da1b1a9b62b168dd5f9a840f83/pkg/network/pasta/pasta.go

Given that these tools deal with network packets from untrusted workloads, we pay particular attention to their security posture.

The project implements a rather substantial sandboxing mechanism, so that, once the initialisation phase completes, passt(1) and pasta(1) only have access to an empty filesystem with a zero-size limit, and relinquish access to any resources they don't need, by means of detaching namespaces:

https://passt.top/passt/tree/isolation.c
https://passt.top/#security

Users report that they can't use passt(1) in Docker containers, with one notable example at:

https://bugs.passt.top/show_bug.cgi?id=116

and resort to running modified builds of passt:

https://bugs.passt.top/show_bug.cgi?id=116#c6

with sandboxing features entirely disabled. This is of course not something we support, so it's not a particular concern in terms of maintainability, but it still forces users to disable important security features, and it's a rather alarming trend.

As a side note, Flatpak has a similar issue:

flatpak/flatpak#5921

and, same there, users routinely run custom builds of applications that ship strict native sandboxing features (including passt, Chromium, and Firefox) with those features disabled. This is not in the best interest of security and surely not in the best interest of those users.

To fix this, enable unshare() regardless of the CAP_SYS_ADMIN capability, so that unprivileged applications can perform appropriate sandboxing.

I'm well aware of CVE-2022-0185 and CVE-2022-0492, but, since then, there have been significant hardening efforts going on in the affected portions of the kernel and the current situation appears substantially different now.

Despite the original intention, a blanket ban on unprivileged unshare() nowadays appears to be detrimental to the security of containerised applications, instead of contributing to it, as an increasing number of applications finally start using namespaces for their own sandboxing, which is generally stricter than what any container runtime can provide.

Link: https://bugs.passt.top/show_bug.cgi?id=116
Reported-by: [email protected]
Signed-off-by: Stefano Brivio <[email protected]>
I just found #4 as I moved this merge request to the right repository. I'm not sure what to do with this one, as it's partially a duplicate, but passt(1) and pasta(1) need unshare(2) flags that are not covered by that one.
"uname",
"unlink",
"unlinkat",
"unshare",
Can we have this as a non-default built-in profile like --security-opt seccomp=allow-unshare-user?
Or if we are going to have this as the default, we will need to provide a seccomp=disallow-unshare-user option.
Originally posted by @AkihiroSuda in #42441
Thanks, I wasn't aware of moby/moby#42441.
I would argue that unshare() should be allowed by default, otherwise container developers will hit https://bugs.passt.top/show_bug.cgi?id=116#c0 and keep distributing less secure builds of software because they have no practical way to ask users to add options when they run containers. See also https://bugs.passt.top/show_bug.cgi?id=116#c9.
I can take care of adjusting this pull request (if it makes sense at all) along the lines of moby/moby#42455, which already implemented your suggestion.
The reason that user namespaces are blocked by default is that they expose a massive amount of kernel attack surface. This makes it much easier for an application within the container to break out.
For passt, I’m curious if the same goal could be achieved with just seccomp and possibly Landlock. Whether passt has permission to open files doesn’t matter if it can’t make any filesystem syscalls, and Landlock can cut off the remaining filesystem access except chdir(). seccomp can also prevent passt from sending signals to any process that isn’t itself.
> The reason that user namespaces are blocked by default is that they expose a massive amount of kernel attack surface. This makes it much easier for an application within the container to break out.
I think I covered that part here:
> I'm well aware of CVE-2022-0185 and CVE-2022-0492, but, since then, there have been significant hardening efforts going on in the affected portions of the kernel and the current situation appears substantially different, now.
but that's a quantitative and somewhat arbitrary evaluation. And while we're at it, I'm myself responsible for CVE-2022-2078, but again, we've been hardening things a lot in the past years, also as a result of exposure from rootless containers (Podman can do all this). Exposure is actually a good thing in the long term.
Much less arbitrary, though, is what the author of #4 pointed out in #4 (comment): it's not Docker's job to mitigate kernel vulnerabilities. There are Linux security modules, including Landlock, with configurable and appropriately flexible profiles, which makes them the right tool for this.
> For passt, I’m curious if the same goal could be achieved with just seccomp
passt already ships rather restrictive seccomp profiles:
$ make
seccomp profile passt allows: accept accept4 bind clock_gettime close connect
epoll_ctl epoll_pwait epoll_wait exit_group fallocate fcntl fsync ftruncate
getsockname getsockopt listen lseek read recvfrom recvmmsg recvmsg sendmmsg
sendmsg sendto setsockopt shutdown socket timerfd_create timerfd_gettime
timerfd_settime write writev
seccomp profile pasta allows: accept accept4 bind clock_gettime clone close connect
epoll_ctl epoll_pwait epoll_wait exit exit_group fallocate fcntl fsync ftruncate
getsockname getsockopt ioctl listen lseek openat pipe2 read recvfrom recvmmsg
recvmsg rt_sigprocmask rt_sigreturn sendmmsg sendmsg sendto setns setsockopt
shutdown socket splice timerfd_create timerfd_gettime timerfd_settime waitid
write writev
seccomp profile vu allows: accept accept4 bind clock_gettime close connect
epoll_ctl epoll_pwait epoll_wait exit_group fallocate fcntl fsync ftruncate
getsockname getsockopt ioctl listen lseek mmap munmap read recvfrom recvmmsg
recvmsg sendmmsg sendmsg sendto setsockopt shutdown socket timerfd_create
timerfd_gettime timerfd_settime write writev
> and possibly Landlock.
...as well as AppArmor and SELinux policies. Of course, all contributions including a new shiny Landlock profile are warmly welcome, but Landlock wouldn't cover much more than what we're already covering with "traditional" LSMs.
> Whether passt has permission to open files doesn’t matter if it can’t make any filesystem syscalls,
pasta(1) needs connect(2) and bind(2), as well as openat(2) for a number of reasons (see git log), even though we can probably drop the latter with a bit of extra work. But it's not just about filesystem access, it's also about seeing other PIDs (not necessarily to send signals).
> and Landlock can cut off the remaining filesystem access except chdir().
Right, I don't exclude that Landlock might provide some slightly finer tailored access control compared to what we have with AppArmor and SELinux.
> seccomp can also prevent passt from sending signals to any process that isn’t itself.
I don't see a way (unless we're talking of something based on further system call argument examination via e.g. seccomp_unotify(2) and seitan), but, in any case, kill(2) is not enabled in the seccomp profiles, so that's not a concern.
In any case, while the original user report behind this was about passt(1), with a blanket ban on unshare(2) you can't run pasta(1) in Docker itself (it obviously needs clone(CLONE_NEWNET)), which is rather absurd. And that's not even about sandboxing, it's about basic functionality we can't provide otherwise.
> > and Landlock can cut off the remaining filesystem access except chdir().
>
> Right, I don't exclude that Landlock might provide some slightly finer tailored access control compared to what we have with AppArmor and SELinux.
The huge advantages of Landlock are that it is unprivileged and does not expose a large amount of kernel attack surface.
> Much less arbitrary, though, is what the author of #4 pointed out in #4 (comment): it's not Docker's job to mitigate kernel vulnerabilities. There are Linux security modules, including Landlock, with configurable and appropriately flexible profiles, which makes them the right tool for this.
It actually somewhat is Docker's job. Seccomp is the only approach I know of to restricting namespaces that is distribution-agnostic and allows generating policy at runtime. LSMs are very distribution-specific: some use SELinux, others use AppArmor, and there may be others that use neither. Also, I don’t expect changing SELinux policies to be in scope for Docker, especially on distributions like RHEL that use monolithic policy. AppArmor policies can be dynamically generated but I don’t know if they are flexible enough for this purpose. Landlock is not enabled universally yet.
What I absolutely do support is having the decision to allow user namespaces be separate from the decision to allow CAP_SYS_ADMIN. The latter should imply the former, but not the other way around.
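One way to picture a user-namespace decision that is independent of CAP_SYS_ADMIN is a seccomp rule keyed on the flags argument of unshare(2) rather than on a capability. The following is only a minimal sketch: the rule dictionary imitates the general shape of a Moby seccomp profile entry, but `RULE_NO_USERNS` and `rule_allows` are hypothetical names invented for illustration, and the evaluator models only the masked-equality comparison, `(arg & mask) == datum`.

```python
# Namespace flags, values as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000
CLONE_NEWUSER = 0x10000000
CLONE_NEWNET = 0x40000000

# Hypothetical rule (not taken from the real default profile): allow
# unshare() only if the CLONE_NEWUSER bit is clear in argument 0.
# For SCMP_CMP_MASKED_EQ, "value" is the mask and "valueTwo" the datum.
RULE_NO_USERNS = {
    "names": ["unshare"],
    "action": "SCMP_ACT_ALLOW",
    "args": [
        {
            "index": 0,
            "value": CLONE_NEWUSER,
            "valueTwo": 0,
            "op": "SCMP_CMP_MASKED_EQ",
        }
    ],
}


def rule_allows(rule, syscall, args):
    """Return True if this allow rule matches the syscall and its arguments.

    False only means "this rule does not match"; a real filter would then
    fall through to other rules or to the profile's default action.
    """
    if syscall not in rule["names"]:
        return False
    for cond in rule.get("args", []):
        if cond["op"] != "SCMP_CMP_MASKED_EQ":
            raise NotImplementedError(cond["op"])
        if (args[cond["index"]] & cond["value"]) != cond.get("valueTwo", 0):
            return False
    return True
```

Under this sketch, `rule_allows(RULE_NO_USERNS, "unshare", [CLONE_NEWNET])` matches, while any flags value containing `CLONE_NEWUSER` does not. Whether a rule like this belongs in the default profile or in an opt-in one is exactly the policy question discussed in this thread.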
Note: this is the corrected version of moby/moby#51130, which I opened against the wrong repository. I'm just copying over the whole description from there.
- What I did
I took unshare(2), the system call, out of the CAP_SYS_ADMIN gate in the default seccomp profile.
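For readers unfamiliar with the profile layout, the change can be pictured as moving one name between two rule lists. This is a sketch on a toy profile, not the real default profile: `TOY_PROFILE` and `ungate_syscall` are hypothetical names, and the toy data assumes the general `names` / `action` / `includes.caps` layout of Moby seccomp profiles.

```python
import copy

# Toy profile with one unconditional allow rule and one rule that only
# applies when the container has CAP_SYS_ADMIN. Contents are illustrative.
TOY_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["uname", "unlink", "unlinkat"], "action": "SCMP_ACT_ALLOW"},
        {
            "names": ["mount", "setns", "unshare"],
            "action": "SCMP_ACT_ALLOW",
            "includes": {"caps": ["CAP_SYS_ADMIN"]},
        },
    ],
}


def ungate_syscall(profile, name, cap="CAP_SYS_ADMIN"):
    """Return a copy of profile with `name` moved out of the `cap` gate."""
    result = copy.deepcopy(profile)
    gated = None
    unconditional = None
    for rule in result["syscalls"]:
        if rule.get("action") != "SCMP_ACT_ALLOW":
            continue
        caps = rule.get("includes", {}).get("caps", [])
        if cap in caps and name in rule["names"]:
            gated = rule
        elif unconditional is None and not (
            rule.get("includes") or rule.get("excludes") or rule.get("args")
        ):
            # First allow rule with no conditions attached.
            unconditional = rule
    if gated is None or unconditional is None:
        raise ValueError(f"cannot move {name!r} out of the {cap} gate")
    gated["names"].remove(name)
    unconditional["names"].append(name)
    unconditional["names"].sort()  # keep the list alphabetical
    return result
```

Calling `ungate_syscall(TOY_PROFILE, "unshare")` returns a new profile in which `unshare` sits in the unconditional list, and leaves the input untouched.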
- How I did it
I did it proudly, with a keyboard. I used so-called shortcuts that allowed me to conceptually cut one line of a text file and paste it at another location.
- How to verify it
Run passt in a Docker container.
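If passt is not at hand, a rough stand-in check is to probe whether unshare(2) is permitted at all from inside the container. The sketch below is an assumption-laden illustration, not part of passt or Moby: `probe_unshare` is a hypothetical helper that calls the libc `unshare()` wrapper through ctypes in a throwaway child process, so the calling process keeps its own namespaces.

```python
import ctypes
import errno
import os

# Namespace flags, values as defined in <linux/sched.h>.
CLONE_NEWUSER = 0x10000000
CLONE_NEWNET = 0x40000000


def probe_unshare(flags):
    """Try unshare(flags) in a child process.

    Returns 0 on success, the errno value on failure, or -1 if the child
    did not exit normally (for example, killed by a seccomp kill action).
    """
    pid = os.fork()
    if pid == 0:
        # Child: report the outcome through the exit status only.
        code = errno.ENOSYS
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(flags) == 0:
                code = 0
            else:
                code = ctypes.get_errno()
        finally:
            os._exit(code & 0xFF)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -1


if __name__ == "__main__":
    result = probe_unshare(CLONE_NEWUSER | CLONE_NEWNET)
    if result == 0:
        print("unshare(CLONE_NEWUSER | CLONE_NEWNET) is permitted")
    elif result > 0:
        print("unshare failed:", errno.errorcode.get(result, result))
    else:
        print("child terminated abnormally")
```

A result of `EPERM` from an unprivileged container would be consistent with the seccomp gate described above, but other causes (for instance a sysctl limiting user namespaces) can produce the same error, so this is a hint rather than a proof.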
- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)
Inspired from a submission at https://user.xmission.com/~emailbox/ascii_cats.htm: