-
Couldn't load subscription status.
- Fork 6
seccomp: Take unshare() out of CAP_SYS_ADMIN gate #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
sbrivio-rh
wants to merge
1
commit into
moby:main
Choose a base branch
from
sbrivio-rh:main
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally posted by @AkihiroSuda in #42441
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I wasn't aware of moby/moby#42441.
I would argue that unshare() should be the default, otherwise container developers will hit https://bugs.passt.top/show_bug.cgi?id=116#c0 and keep distributing less secure builds of software because they have no practical way to ask users to add options when they run containers. See also https://bugs.passt.top/show_bug.cgi?id=116#c9.
I can take care of adjusting this pull request (if it makes sense at all) in the sense of moby/moby#42455, which already implemented your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that user namespaces are blocked by default is that they expose a massive amount of kernel attack surface. This makes it much easier for an application within the container to break out.
For passt, I’m curious if the same goal could be achieved with just seccomp and possibly Landlock. Whether passt has permission to open files doesn’t matter if it can’t make any filesystem syscalls, and Landlock can cut off the remaining filesystem access except chdir(). seccomp can also prevent passt from sending signals to any process that isn’t itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I covered that part here:
but that's a quantitative and somewhat arbitrary evaluation. And while at it, I'm myself responsible for CVE-2022-2078, but again, we've been hardening things a lot in the past years, also as a result of exposure from rootless containers (Podman can do all this). Exposure is actually a good thing in the long term.
Much less arbitrary, though, is what the author of #4 pointed out in #4 (comment): it's not Docker's job to mitigate kernel vulnerabilities. There are Linux security modules, including Landlock, with configurable and appropriately flexible profiles, which makes them the right tool for this.
passt already ships rather restrictive seccomp profiles:
...as well as AppArmor and SELinux policies. Of course, all contributions including a new shiny Landlock profile are warmly welcome, but Landlock wouldn't cover much more than what we're already covering with "traditional" LSMs.
pasta(1)needsconnect(2)andbind(2), as well asopenat(2)for a number of reasons (see git log), even though we can probably drop the latter with a bit of extra work. But it's not just about filesystem access, it's also about seeing other PIDs (not necessarily to send signals).Right, I don't exclude that Landlock might provide some slightly finer tailored access control compared to what we have with AppArmor and SELinux.
I don't see a way (unless we're talking of something based on further system call argument examination via e.g.
seccomp_unotify(2)and seitan), but, in any case,kill(2)is not enabled in the seccomp profiles, so that's not a concern.In any case, while the original user report behind this was about
passt(1), with a blanket ban onunshare(2), you can't runpasta(1)in Docker itself (it obviously needsclone(CLONE_NEWNET)) which is rather absurd. And that's not even about sandboxing, it's about basic functionality we can't provide otherwise.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The huge advantages of Landlock are that it is unprivileged and does not expose a large amount of kernel attack surface.
It actually somewhat is Docker's job. Seccomp is the only approach I know of to restricting namespaces that is distribution-agnostic and allows generating policy at runtime. LSMs are very distribution-specific: some use SELinux, others use AppArmor, and there may be others that use neither. Also, I don’t expect changing SELinux policies to be in scope for Docker, especially on distributions like RHEL that use monolithic policy. AppArmor policies can be dynamically generated but I don’t know if they are flexible enough for this purpose. Landlock is not enabled universally yet.
What I absolutely do support is having the decision to allow user namespaces be separate from the decision to allow
CAP_SYS_ADMIN. The latter should imply the former, but not the other way around.