-
Notifications
You must be signed in to change notification settings - Fork 2.2k
libct/nsenter: improve error reporting #4951
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libct/nsenter: improve error reporting #4951
Conversation
aabdd90 to
600e983
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request refactors error handling in the nsenter component to improve error reporting clarity by distinguishing between errors that include errno information and those that don't. The changes introduce a new bailx macro for errno-free errors, add helper functions for safe I/O operations with proper error handling, and ensure consistent cleanup of child processes when errors occur.
Key Changes:
- Introduced
bailxmacro for error reporting without errno, reorganizing the existingbailmacro to use it - Added
xread,xwrite,iobail, and improvedsane_killhelper functions for consistent error handling and process cleanup - Replaced direct read/write calls with helper functions that handle partial I/O and automatically kill child processes on failure
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| libcontainer/nsenter/log.h | Adds bailx macro for errno-free error reporting and refactors bail to use it |
| libcontainer/nsenter/nsexec.c | Refactors error handling throughout using new bailx macro and I/O helper functions, improves sane_kill to preserve errno, and ensures proper child process cleanup on errors |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cyphar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor nits, otherwise looks good!
rata
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Left a few comments :)
I guess this PR is not yet improving the error message in the case of: #4916, right? In that case, please ping me when that PR is not draft, I'd very much like to see that improvement :)
libcontainer/nsenter/nsexec.c
Outdated
| if (pid > 0) { | ||
| int ret, saved_errno; | ||
|
|
||
| saved_errno = errno; | ||
| ret = kill(pid, signum); | ||
| errno = saved_errno; | ||
| return ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this returns the ret value of kill, but errno is set to something else, and we are missing what it was. It seems confusing.
Why not log the previous error (not sure if here or on the caller) and have this function return the ret val of kill and errno set to it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to complicate things further than they are.
In fact, since no one is using the return value of sane_kill, and this is a last resort kill, and I think some of the calls will return ESRCH, we can just change its signature to void sane_kill.
We could also log a warning from the kill (unless it's ESRCH) but it's hard to see how it could be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree, sane_kill is used in cleanup in the error path so we probably want to keep the errno untouched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, making it void sound good. I had checked the same yesterday (no one uses the ret code). I would log the kill error, though.
Since sane_kill after a failed read or write, but before reporting the error from that read or write, it may change the errno value in case kill(2) fails. Save and restore the errno around the call to kill. While at it, - change the code to return early; - don't return kill return value as no one is using it, and the errno value no longer correlates. Signed-off-by: Kir Kolyshkin <[email protected]>
We use bail to report fatal errors, and bail always append %m (aka strerror(errno)). In case an error condition did not set errno, the log message will end up with ": Success" or an error from a stale errno value. Either case is confusing for users. Introduce bailx which is the same as bail except it does not append %m, and use it where appropriate. The naming follows libc's err(3) and errx(3). PS we still use bail in a few cases after read or write, even if that read/write did not return an error, because the code does not distinguish between short read/write and error (-1). This will be addressed by the next commit. Signed-off-by: Kir Kolyshkin <[email protected]>
Add a few missing sane_kill calls where they make sense. Remove one useless sane_kill of stage2_pid, as during SYNC_USERMAP stage2 is not yet started. It is harmless yet it makes the code slightly harder to read. Set the child pid to -1 upon receiving SYNC_CHILD_FINISH to minimize the chances of killing an unrelated process. When a child sends SYNC_CHILD_FINISH it is about to exit (although theoretically it could be stuck during debug logging). Signed-off-by: Kir Kolyshkin <[email protected]>
Introduce and use iobail, xread, and xwrite wrappers so that we can properly check read/write return value and call either bail or bailx on error, with proper diagnostics (distinguishing failed read/write from a short read/write). This prevents the "Success" prefix in errors like: failed to sync with stage-1: next state: Success Signed-off-by: Kir Kolyshkin <[email protected]>
600e983 to
6c18b25
Compare
|
OK I think I've addressed all the comments, @rata can you take another look? |
You can definitely review #4928 right now; it is just a combination of patches from this PR plus the runc init fatal messages collection (the last commit). It's currently a draft because we need to merge this PR first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
nits, but feel free to ignore:
- I'd log the kill error in sane_kill().
- I think I'd just reimplement iobail() inside xread and xwrite, I think it's too much indirection and cognitive load to have
iobail()inside our custom read/write. It's just a few lines to deal with that in xread/xwrite withoutiobail()
There are places when both parent and stage1_pid kill stage2_pid, thus it may result in kill errors and I don't want more confusion for the user.
See, xread and xwrite mostly end up with success, while CHECK_IO(write, syncfd, &s, sizeof(s), "failed to sync with parent: write(SYNC_USERMAP_PLS)");
CHECK_IO_KILL(write, syncfd, &s, sizeof(s), "some error message", stage1_pid, stage2_pid);but @lifubang was more in favor of functions rather than macros (and I tend to agree). Let's merge this as is for now; feel free to further improve upon this in subsequent PRs. |
|
SGTM, thanks! |
Separated from #4928 as per @lifubang's suggestion.
See individual commits for details.