libct/nsenter: improve error reporting #4951

kolyshkin · 2025-10-25T03:49:41Z

Separated from #4928 as per @lifubang's suggestion.

See individual commits for details.

Copilot

Pull Request Overview

This pull request refactors error handling in the nsenter component to improve error reporting clarity by distinguishing between errors that include errno information and those that don't. The changes introduce a new bailx macro for errno-free errors, add helper functions for safe I/O operations with proper error handling, and ensure consistent cleanup of child processes when errors occur.

Key Changes:

- Introduced bailx macro for error reporting without errno, reorganizing the existing bail macro to use it
- Added xread, xwrite, iobail, and improved sane_kill helper functions for consistent error handling and process cleanup
- Replaced direct read/write calls with helper functions that handle partial I/O and automatically kill child processes on failure

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| libcontainer/nsenter/log.h | Adds `bailx` macro for errno-free error reporting and refactors `bail` to use it |
| libcontainer/nsenter/nsexec.c | Refactors error handling throughout using new `bailx` macro and I/O helper functions, improves `sane_kill` to preserve errno, and ensures proper child process cleanup on errors |

cyphar

Just some minor nits, otherwise looks good!

rata

Thanks for the PR. Left a few comments :)

I guess this PR is not yet improving the error message in the case of: #4916, right? In that case, please ping me when that PR is not draft, I'd very much like to see that improvement :)

rata · 2025-10-28T10:33:03Z

libcontainer/nsenter/nsexec.c

+	if (pid > 0) {
+		int ret, saved_errno;
+
+		saved_errno = errno;
+		ret = kill(pid, signum);
+		errno = saved_errno;
+		return ret;


Hmm, this returns the ret value of kill, but errno is set to something else, and we are missing what it was. It seems confusing.

Why not log the previous error (not sure if here or on the caller) and have this function return the ret val of kill and errno set to it?

I don't want to complicate things further than they are.

In fact, since no one is using the return value of sane_kill, and this is a last resort kill, and I think some of the calls will return ESRCH, we can just change its signature to void sane_kill.

We could also log a warning from the kill (unless it's ESRCH) but it's hard to see how it could be useful.

Yeah, I agree, sane_kill is used in cleanup in the error path so we probably want to keep the errno untouched.

Yep, making it void sounds good. I checked the same thing yesterday (no one uses the return code). I would log the kill error, though.

Since sane_kill is called after a failed read or write, but before the error from that read or write is reported, it may change the errno value if kill(2) fails. Save and restore errno around the call to kill.

While at it:
- change the code to return early;
- don't return the kill return value, as no one is using it and the errno value no longer correlates with it.

Signed-off-by: Kir Kolyshkin <[email protected]>
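The errno-preserving behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration based only on the description above, not the code merged in the PR. The `demo_errno_survives` helper is hypothetical, and it sends signal 0 (an existence check only) so the example cannot harm another process.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Best-effort kill for error paths; leaves the caller's errno untouched. */
static void sane_kill(pid_t pid, int signum)
{
	int saved_errno;

	/* Return early: there is no single process to kill for pid <= 0. */
	if (pid <= 0)
		return;

	saved_errno = errno;
	(void)kill(pid, signum);	/* result deliberately ignored */
	errno = saved_errno;
}

/* Hypothetical demo: kill(2) fails with ESRCH, yet errno stays EIO. */
static void demo_errno_survives(void)
{
	pid_t child = fork();

	if (child < 0)
		return;			/* fork failed; nothing to demonstrate */
	if (child == 0)
		_exit(0);
	waitpid(child, NULL, 0);	/* reap it, so the pid no longer exists */

	errno = EIO;			/* pretend a read(2) has just failed */
	sane_kill(child, 0);		/* signal 0 only checks that the pid exists */
	assert(errno == EIO);
}
```

Because the helper returns nothing, a caller can invoke it between a failed read or write and the error report without worrying about which errno ends up in the message.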

We use bail to report fatal errors, and bail always appends %m (aka strerror(errno)). In case an error condition did not set errno, the log message will end up with ": Success" or an error from a stale errno value. Either case is confusing for users.

Introduce bailx, which is the same as bail except it does not append %m, and use it where appropriate. The naming follows libc's err(3) and errx(3).

PS: we still use bail in a few cases after read or write, even if that read/write did not return an error, because the code does not distinguish between a short read/write and an error (-1). This will be addressed by the next commit.

Signed-off-by: Kir Kolyshkin <[email protected]>
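To make the bail/bailx split concrete, here is a small sketch. It assumes GNU C (`##__VA_ARGS__`) and a libc whose printf family supports `%m`. The macros print to stderr for simplicity, whereas the real ones in log.h may route messages differently. `alloc_checked` and `demo_stale_errno` are hypothetical helpers, not part of runc.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* errx(3)-like: print the message as is, then exit. */
#define bailx(fmt, ...)                                             \
	do {                                                        \
		fprintf(stderr, "FATAL: " fmt "\n", ##__VA_ARGS__); \
		exit(1);                                            \
	} while (0)

/* err(3)-like: the same, with ": <strerror(errno)>" appended via %m. */
#define bail(fmt, ...) bailx(fmt ": %m", ##__VA_ARGS__)

/* Hypothetical caller: pick the macro by whether errno is meaningful. */
static void *alloc_checked(size_t len, size_t max)
{
	void *p;

	if (len > max)
		bailx("length %zu exceeds limit %zu", len, max);	/* no errno involved */
	p = malloc(len);
	if (p == NULL)
		bail("malloc(%zu) failed", len);	/* malloc sets errno on failure */
	return p;
}

/* Shows why %m misleads when nothing has set errno. */
static void demo_stale_errno(void)
{
	char buf[128];

	errno = 0;
	snprintf(buf, sizeof(buf), "short read: %m");
	/* buf now ends with strerror(0), which is "Success" on glibc. */
	assert(strstr(buf, strerror(0)) != NULL);
}
```

The limit check never touches errno, so appending `%m` there would only add noise; the allocation failure does set errno, so the extra text is useful.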

Add a few missing sane_kill calls where they make sense.

Remove one useless sane_kill of stage2_pid, as during SYNC_USERMAP stage2 is not yet started. It is harmless yet it makes the code slightly harder to read.

Set the child pid to -1 upon receiving SYNC_CHILD_FINISH to minimize the chances of killing an unrelated process. When a child sends SYNC_CHILD_FINISH it is about to exit (although theoretically it could be stuck during debug logging).

Signed-off-by: Kir Kolyshkin <[email protected]>
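One way to picture the pid reset described above is the sketch below. The `on_sync_message` and `kill_if_tracked` helpers, the made-up pid, and the numeric value given to `SYNC_CHILD_FINISH` are assumptions for illustration only and are not taken from nsexec.c.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>

/* Illustrative value only; the real constant lives in nsexec.c. */
enum { SYNC_CHILD_FINISH = 0x45 };

/* Signals only a pid we still track; returns 1 if kill(2) was called. */
static int kill_if_tracked(pid_t pid, int signum)
{
	if (pid <= 0)
		return 0;	/* forgotten or never started: do nothing */
	(void)kill(pid, signum);
	return 1;
}

/* Forget the child once it reports that it is about to exit. */
static void on_sync_message(int msg, pid_t *child_pid)
{
	if (msg == SYNC_CHILD_FINISH)
		*child_pid = -1;
}

static void demo_forget_child(void)
{
	pid_t stage1_pid = 12345;	/* made-up pid, never signalled here */

	on_sync_message(SYNC_CHILD_FINISH, &stage1_pid);
	assert(stage1_pid == -1);
	/* A later cleanup call is now a no-op instead of hitting a reused pid. */
	assert(kill_if_tracked(stage1_pid, 0) == 0);
}
```

The `pid <= 0` guard matters for more than tidiness: kill(2) with a pid of -1 targets every process the caller is allowed to signal, so a forgotten pid must never reach it.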

Introduce and use iobail, xread, and xwrite wrappers so that we can properly check the read/write return value and call either bail or bailx on error, with proper diagnostics (distinguishing a failed read/write from a short read/write).

This prevents the "Success" suffix in errors like:

	failed to sync with stage-1: next state: Success

Signed-off-by: Kir Kolyshkin <[email protected]>
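The distinction between a failed and a short transfer might look roughly like the sketch below. The signatures are guesses, not the ones merged in the PR. For brevity this version prints to stderr directly instead of calling bail or bailx, and it omits the child-killing step that the real helpers are said to perform. The two demo functions are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Report a failed or short transfer and exit; never returns. */
static void __attribute__((noreturn)) iobail(ssize_t got, size_t want, const char *what)
{
	if (got < 0)	/* real error: errno is meaningful */
		fprintf(stderr, "FATAL: %s: %s\n", what, strerror(errno));
	else		/* short transfer: errno says nothing useful */
		fprintf(stderr, "FATAL: %s: short transfer (%zd of %zu bytes)\n", what, got, want);
	exit(1);
}

static void xread(int fd, void *buf, size_t len, const char *what)
{
	ssize_t n = read(fd, buf, len);

	if (n < 0 || (size_t)n != len)
		iobail(n, len, what);
}

static void xwrite(int fd, const void *buf, size_t len, const char *what)
{
	ssize_t n = write(fd, buf, len);

	if (n < 0 || (size_t)n != len)
		iobail(n, len, what);
}

/* Happy path: a value written to a pipe is read back unchanged. */
static void demo_roundtrip(void)
{
	int fds[2], sent = 0x40, received = 0;
	int rc = pipe(fds);

	assert(rc == 0);
	xwrite(fds[1], &sent, sizeof(sent), "write(sync)");
	xread(fds[0], &received, sizeof(received), "read(sync)");
	assert(received == sent);
	close(fds[0]);
	close(fds[1]);
}

/* Short read: the child wants sizeof(int) bytes but only one arrives. */
static int demo_short_read_status(void)
{
	int fds[2], status = 0;
	pid_t child;
	ssize_t w;

	if (pipe(fds) != 0)
		return -1;
	child = fork();
	if (child < 0)
		return -1;
	if (child == 0) {
		int value;

		close(fds[1]);
		xread(fds[0], &value, sizeof(value), "read(sync)");
		_exit(0);	/* not reached: xread bails on the short read */
	}
	close(fds[0]);
	w = write(fds[1], "x", 1);
	(void)w;
	close(fds[1]);
	if (waitpid(child, &status, 0) != child)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Keeping the reporting in a separate noreturn function means the wrappers stay a few lines each, and the compiler knows that control never continues past a failed transfer.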

kolyshkin · 2025-10-29T00:30:32Z

OK I think I've addressed all the comments, @rata can you take another look?

kolyshkin · 2025-10-29T00:37:52Z

> I guess this PR is not yet improving the error message in the case of: #4916, right? In that case, please ping me when that PR is not draft, I'd very much like to see that improvement :)

You can definitely review #4928 right now; it is just a combination of patches from this PR plus the runc init fatal messages collection (the last commit). It's currently a draft because we need to merge this PR first.

rata

LGTM.

nits, but feel free to ignore:

- I'd log the kill error in sane_kill().
- I think I'd just reimplement iobail() inside xread and xwrite, I think it's too much indirection and cognitive load to have iobail() inside our custom read/write. It's just a few lines to deal with that in xread/xwrite without iobail()

kolyshkin · 2025-10-29T16:54:04Z

> I'd log the kill error in sane_kill().

There are places where both the parent and stage1_pid kill stage2_pid, so logging could produce kill errors, and I don't want more confusion for the user.

> I think I'd just reimplement iobail() inside xread and xwrite, I think it's too much indirection and cognitive load to have iobail() inside our custom read/write. It's just a few lines to deal with that in xread/xwrite without iobail()

See, xread and xwrite mostly end up with success, while iobail always results in an exit and is thus marked as noreturn. Also, initially (see #4928 (comment)) I had both read and write as two macros (with and without kill):

CHECK_IO(write, syncfd, &s, sizeof(s), "failed to sync with parent: write(SYNC_USERMAP_PLS)");
CHECK_IO_KILL(write, syncfd, &s, sizeof(s), "some error message", stage1_pid, stage2_pid);

but @lifubang was more in favor of functions rather than macros (and I tend to agree).

Let's merge this as is for now; feel free to further improve upon this in subsequent PRs.

rata · 2025-10-29T17:17:00Z

SGTM, thanks!

kolyshkin requested a review from lifubang October 25, 2025 03:49

kolyshkin mentioned this pull request Oct 25, 2025

Better errors from runc init #4928

Open

kolyshkin force-pushed the better-nsenter-errors branch 2 times, most recently from aabdd90 to 600e983 Compare October 26, 2025 23:15

lifubang approved these changes Oct 27, 2025

lifubang requested review from AkihiroSuda, Copilot and cyphar October 27, 2025 14:21

Copilot AI reviewed Oct 27, 2025

cyphar approved these changes Oct 28, 2025

rata reviewed Oct 28, 2025

kolyshkin added 4 commits October 28, 2025 17:21

kolyshkin force-pushed the better-nsenter-errors branch from 600e983 to 6c18b25 Compare October 29, 2025 00:29

rata approved these changes Oct 29, 2025

kolyshkin merged commit fb01482 into opencontainers:main Oct 29, 2025
36 checks passed
