2 changes: 1 addition & 1 deletion seccomp/default_linux.go
@@ -398,6 +398,7 @@ func DefaultProfile() *Seccomp {
"uname",
"unlink",
"unlinkat",
+ "unshare",
Member
Can we have this as a non-default built-in profile like --security-opt seccomp=allow-unshare-user?

Or, if we are going to have this as the default, we will need to provide a seccomp=disallow-unshare-user option.

Originally posted by @AkihiroSuda in #42441
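
The two options proposed above can be pictured as one transformation over the allow-list. This is only a minimal sketch: `withSyscall` is a hypothetical helper, not part of moby, and the name list is just the excerpt visible around the changed line in this diff. The option names `allow-unshare-user` and `disallow-unshare-user` are proposals from this thread, not existing flags.

```go
package main

import (
	"fmt"
	"sort"
)

// withSyscall is a hypothetical helper (not moby code). It returns a sorted
// copy of names that contains name only when allow is true. The input slice
// is left untouched.
func withSyscall(names []string, name string, allow bool) []string {
	out := make([]string, 0, len(names)+1)
	for _, n := range names {
		if n != name {
			out = append(out, n)
		}
	}
	if allow {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Excerpt of the unconditional allow-list around the changed line.
	base := []string{"uname", "unlink", "unlinkat", "uretprobe", "utime", "utimensat"}

	// Roughly what a profile that allows unshare would carry.
	fmt.Println(withSyscall(base, "unshare", true))
	// Roughly what a profile that disallows it would carry.
	fmt.Println(withSyscall(base, "unshare", false))
}
```

Either variant could be the default; the sketch only shows that the two profiles differ by a single entry in the unconditional list.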

Author
Thanks, I wasn't aware of moby/moby#42441.

I would argue that allowing unshare() should be the default. Otherwise, container developers will hit https://bugs.passt.top/show_bug.cgi?id=116#c0 and keep distributing less secure builds of software, because they have no practical way to ask users to add options when they run containers. See also https://bugs.passt.top/show_bug.cgi?id=116#c9.

I can take care of adjusting this pull request (if it makes sense at all) along the lines of moby/moby#42455, which already implemented your suggestion.

The reason that user namespaces are blocked by default is that they expose a massive amount of kernel attack surface. This makes it much easier for an application within the container to break out.

For passt, I’m curious if the same goal could be achieved with just seccomp and possibly Landlock. Whether passt has permission to open files doesn’t matter if it can’t make any filesystem syscalls, and Landlock can cut off the remaining filesystem access except chdir(). seccomp can also prevent passt from sending signals to any process that isn’t itself.

Author
> The reason that user namespaces are blocked by default is that they expose a massive amount of kernel attack surface. This makes it much easier for an application within the container to break out.

I think I covered that part here:

> I'm well aware of CVE-2022-0185 and CVE-2022-0492, but, since then, there have been significant hardening efforts going on in the affected portions of the kernel and the current situation appears substantially different, now.

but that's a quantitative and somewhat arbitrary evaluation. While we're at it, I'm responsible for CVE-2022-2078 myself, but again, we've hardened things a lot in the past years, partly as a result of the exposure from rootless containers (Podman can do all this). Exposure is actually a good thing in the long term.

Much less arbitrary, though, is what the author of #4 pointed out in #4 (comment): it's not Docker's job to mitigate kernel vulnerabilities. There are Linux security modules, including Landlock, with configurable and appropriately flexible profiles, which makes them the right tool for this.

> For passt, I’m curious if the same goal could be achieved with just seccomp

passt already ships rather restrictive seccomp profiles:

$ make
seccomp profile passt allows:  accept accept4 bind clock_gettime close connect
   epoll_ctl epoll_pwait epoll_wait exit_group fallocate fcntl fsync ftruncate
   getsockname getsockopt listen lseek read recvfrom recvmmsg recvmsg sendmmsg
   sendmsg sendto setsockopt shutdown socket timerfd_create timerfd_gettime
   timerfd_settime write writev
seccomp profile pasta allows:  accept accept4 bind clock_gettime clone close connect
   epoll_ctl epoll_pwait epoll_wait exit exit_group fallocate fcntl fsync ftruncate
   getsockname getsockopt ioctl listen lseek openat pipe2 read recvfrom recvmmsg
   recvmsg rt_sigprocmask rt_sigreturn sendmmsg sendmsg sendto setns setsockopt
   shutdown socket splice timerfd_create timerfd_gettime timerfd_settime waitid
   write writev
seccomp profile vu allows:  accept accept4 bind clock_gettime close connect
   epoll_ctl epoll_pwait epoll_wait exit_group fallocate fcntl fsync ftruncate
   getsockname getsockopt ioctl listen lseek mmap munmap read recvfrom recvmmsg
   recvmsg sendmmsg sendmsg sendto setsockopt shutdown socket timerfd_create
   timerfd_gettime timerfd_settime write writev

> and possibly Landlock.

...as well as AppArmor and SELinux policies. Of course, all contributions including a new shiny Landlock profile are warmly welcome, but Landlock wouldn't cover much more than what we're already covering with "traditional" LSMs.

> Whether passt has permission to open files doesn’t matter if it can’t make any filesystem syscalls,

pasta(1) needs connect(2) and bind(2), as well as openat(2) for a number of reasons (see git log), even though we can probably drop the latter with a bit of extra work. But it's not just about filesystem access, it's also about seeing other PIDs (not necessarily to send signals).

> and Landlock can cut off the remaining filesystem access except chdir().

Right, I don't rule out that Landlock might provide slightly more finely tailored access control compared to what we have with AppArmor and SELinux.

> seccomp can also prevent passt from sending signals to any process that isn’t itself.

I don't see a way (unless we're talking of something based on further system call argument examination via e.g. seccomp_unotify(2) and seitan), but, in any case, kill(2) is not enabled in the seccomp profiles, so that's not a concern.

In any case, while the original user report behind this was about passt(1), a blanket ban on unshare(2) means you can't run pasta(1) in Docker at all (it obviously needs clone(CLONE_NEWNET)), which is rather absurd. And that's not even about sandboxing, it's about basic functionality we can't provide otherwise.

> > and Landlock can cut off the remaining filesystem access except chdir().

> Right, I don't rule out that Landlock might provide slightly more finely tailored access control compared to what we have with AppArmor and SELinux.

The huge advantages of Landlock are that it is unprivileged and does not expose a large amount of kernel attack surface.

> Much less arbitrary, though, is what the author of #4 pointed out in #4 (comment): it's not Docker's job to mitigate kernel vulnerabilities. There are Linux security modules, including Landlock, with configurable and appropriately flexible profiles, which makes them the right tool for this.

It actually somewhat is Docker's job. Seccomp is the only approach I know of for restricting namespaces that is distribution-agnostic and allows generating policy at runtime. LSMs are very distribution-specific: some distributions use SELinux, others use AppArmor, and there may be others that use neither. Also, I don’t expect changing SELinux policies to be in scope for Docker, especially on distributions like RHEL that use monolithic policy. AppArmor policies can be dynamically generated, but I don’t know if they are flexible enough for this purpose. Landlock is not enabled universally yet.

What I absolutely do support is having the decision to allow user namespaces be separate from the decision to allow CAP_SYS_ADMIN. The latter should imply the former, but not the other way around.
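
A minimal sketch of that separation, under stated assumptions: `Options`, `userNSAllowed` and `extraSyscalls` are hypothetical names, not moby's API, and `sysAdminOnly` holds only the three CAP_SYS_ADMIN-gated entries visible in the second hunk of this diff, not the full list.

```go
package main

import "fmt"

// Options is a hypothetical pair of knobs, not an existing moby type.
type Options struct {
	CapSysAdmin bool // container was granted CAP_SYS_ADMIN
	AllowUserNS bool // user namespaces were explicitly allowed
}

// sysAdminOnly is an abbreviated excerpt of the CAP_SYS_ADMIN-gated list,
// limited to the entries visible in this diff.
var sysAdminOnly = []string{"syslog", "umount", "umount2"}

// userNSAllowed encodes the one-way implication: CAP_SYS_ADMIN implies that
// user namespaces are allowed, but not the other way around.
func userNSAllowed(o Options) bool {
	return o.CapSysAdmin || o.AllowUserNS
}

// extraSyscalls returns the syscalls added on top of the unconditional list.
func extraSyscalls(o Options) []string {
	var out []string
	if o.CapSysAdmin {
		out = append(out, sysAdminOnly...)
	}
	if userNSAllowed(o) {
		out = append(out, "unshare")
	}
	return out
}

func main() {
	for _, o := range []Options{
		{},
		{AllowUserNS: true},
		{CapSysAdmin: true},
	} {
		fmt.Printf("%+v -> %v\n", o, extraSyscalls(o))
	}
}
```

Note that allowing unshare by name permits every namespace flag, not only CLONE_NEWUSER. A real profile would probably also need to consider clone and its flag arguments, which this sketch leaves out.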

"uretprobe", // kernel v6.11, libseccomp v2.6.0
"utime",
"utimensat",
@@ -618,7 +619,6 @@ func DefaultProfile() *Seccomp {
"syslog",
"umount",
"umount2",
- "unshare",
},
Action: specs.ActAllow,
},