Skip to content

Conversation

@klihub
Copy link
Member

@klihub klihub commented Oct 1, 2025

The primary motivation for this PR is to clean up some annoying bits in the NRI wire protocol.

The most prominent such bit is the lack of proper, dedicated wire RPC calls and messages for some pod and container lifecycle events. These are instead multiplexed over a single RPC call and message (StateChange, StateChangeEvent), lacking any event-specific returned data and making any future extension unnecessarily difficult. Other, smaller annoyances are the usage of a single Empty response type in multiple other RPC calls, with similar negative side effects.

Addressing these alone is a more intrusive protocol change than the usual addition of new NRI inputs or controls and the related protocol extension with new fields. Therefore this PR defines a new v1beta1 revision of the protocol, initially identical to v1alpha1, which is then updated address the aforementioned annoyances.

Some of the original protocol problems are also reflected in the stub and the plugin/stub API. This PR tries to clean some of those up as well. It defines a new v1beta1 stub, turning the original one into a v1alpha1 revision.

In an attempt to provide a painless transition path for plugins (and to experiment with it in practice) this PR also implements an internal transparent protocol conversion from v1alpha1 to v1beta1, for plugins which register themselves talking the earlier version of the protocol.

This PR is still work-in-progress, but I'd like to gather some initial thoughts and general feedback from anybody who has the time and interest to take a look at this, already in its current state.

Here is the list of things (I think) this PR addresses in v1beta1:

  • eliminate on wire StateChange in favor of dedicated event-specific requests
  • eliminate remaining usage of Empty (plugin registration, plugin shutdown, wasm plugin log bridging)
  • implement plugin shutdown, use it where it makes sense instead of just closing the connection
  • delay closing plugin connections for fatal errors, giving error propagation back to plugins a chance

Here is a list of things I either know are still missing or I'd like to additionally address. This is probably still incomplete. I'll try to update it as I can.

  • adaptation test suite tests for protocol conversion
  • documentation updates
  • additional stub/plugin API cleanups

@mikebrow @chrishenzie @marquiz If you have some BW, PTAL.

@klihub klihub force-pushed the devel/api/v1beta1 branch from e3187d6 to c3f5db9 Compare October 1, 2025 16:39
@mikebrow
Copy link
Member

mikebrow commented Oct 1, 2025

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

@klihub klihub force-pushed the devel/api/v1beta1 branch from c3f5db9 to cd2e51f Compare October 1, 2025 18:32
@klihub
Copy link
Member Author

klihub commented Oct 1, 2025

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

Okay, let's have a chat about this next week.

@klihub klihub force-pushed the devel/api/v1beta1 branch 4 times, most recently from 9ca998a to fdb6329 Compare October 2, 2025 09:40
@klihub klihub force-pushed the devel/api/v1beta1 branch 3 times, most recently from 9c1f1a2 to f04f8ff Compare October 3, 2025 07:30
klihub added 15 commits October 3, 2025 10:54
Neither assume that we have a single proto definition
nor hardcode proto file locations for post-processing
actions based on such an assumption.

Signed-off-by: Krisztian Litkey <[email protected]>
Make importing pkg/api an effective alias of pkg/api/v1alpha1.

Signed-off-by: Krisztian Litkey <[email protected]>
Copy v1alpha1 API as a starting point for the v1beta1 revision.

Signed-off-by: Krisztian Litkey <[email protected]>
Add dedicated regular RPC calls and messages for requests we used to
all handle in v1alpha1 with StateChange tagged with Event's to tell
them apart. Remove unused StateChange and StateChangeEvent. Replace
Empty similarly as a response type with dedicated reponses specific
to each request. Remove Empty as an unused type.

Update Shutdown, allowing a reason for the shutdown to be passed to
the plugin.

Signed-off-by: Krisztian Litkey <[email protected]>
Make importing pkg/stub an effective alias of pkg/stub/v1alpha1.

Signed-off-by: Krisztian Litkey <[email protected]>
Add protocol translation for plugins that talk v1alpha1.

Signed-off-by: Krisztian Litkey <[email protected]>
A plugin having received a sync request is not enough to ensure it
has been fully registered on the runtime side (IOW appended to the
list of active plugins). Use a sync block/unblock to ensure it has.

Signed-off-by: Krisztian Litkey <[email protected]>
Implement plugin shutdown. On error paths where this seems the right
choice, shut down plugins instead of closing them. This should give
plugins a chance to gather more context (from the shutdown reason)
than just closing the connection. On other error code paths where we
close the connection, delay it to allow any errors to propagate back
to the plugin before the connection is closed.

Signed-off-by: Krisztian Litkey <[email protected]>
@klihub klihub force-pushed the devel/api/v1beta1 branch from f04f8ff to 837c25f Compare October 3, 2025 07:54
@klihub
Copy link
Member Author

klihub commented Oct 9, 2025

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

@mikebrow @chrishenzie Some context about the motivation to ditch once and for all StateChange on the wire and replace it with proper dedicated services and related request/response messages. Our API both on the runtime and the client side has never exposed StateChange. Instead we try to pretend that there are dedicated calls and messages for each one multiplexed over StateChange. In practice what happens is that while both the runtime and the plugins see, for instance, something like RunPodSandox(pod), StopPodSandbox(pod), PostCreateContainer(pod, container), StartContainer(pod, container), in reality we pass this over on the wire with something like

&StateChangeEvent{
    Event: Event_{RUN, STOP, POST_UDPATE,REMOVE}_POD_SANDBOX,
    Pod: pod,
    Container: nil,
}
&StateChangeEvent{
    Event: Event_{POST_CREATE,START,POST_START,POST_UPDATE,REMOVE}_CONTAINER,
    Pod: pod,
    Container: container,
}

while on the return path we always have

type (
    {Run,Stop,PostUpdate,Remove}PodSandboxResponse = Empty
    {PostCreate,Start,PostStart,PostUpdate,Remove}ContainerResponse = Empty
    Empty struct {}
)

This was a futile and pretty misguided attempt, by me in momentary lapse of reason/worse than usual brain damage, to try and keep the protocol looking (slightly) small(er than otherwise). IMO, it did not help or make much sense at that time, and it does even less now. The only thing it causes is problems, and artificially created ones at that.

The initial collective property for all requests mutiplexed as StateChange was that they all had event-like semantics, IOW the plugin/client could not request altering anything in response. A secondary one was that, initially, all these requests had either (PodSandbox) or (PodSandbox, Container) inputs, so the 'union content' approach for StateChangeEvent did not look too ugly. But this already became a problem when @chrishenzie was adding the UpdatePodSandbox request. That, too, was event-like in this sense, but the inputs already did not fit into StateChangeEvent. So there was a dilemma whether to 1) go down the path of further uglification by adding Overhead *LinuxContainerResource, Resources *LinuxContainerResources to the 'union of inputs' or 2) handle it with a proper dedicate request and messages and blow the rest of StateChange apart in a similar fashion(... eventually, while living with the now inconsistent approach for the time being).

We decided to go with the second option, but haven't blown apart StateChange yet. I'd like to do that now, while we're rolling a next wire protocol version as part of this PR.

@mikebrow
Copy link
Member

mikebrow commented Oct 9, 2025

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

@mikebrow @chrishenzie Some context about the motivation to ditch once and for all StateChange on the wire and replace it with proper dedicated services and related request/response messages. Our API both on the runtime and the client side has never exposed StateChange. Instead we try to pretend that there are dedicated calls and messages for each one multiplexed over StateChange. In practice what happens is that while both the runtime and the plugins see, for instance, something like RunPodSandox(pod), StopPodSandbox(pod), PostCreateContainer(pod, container), StartContainer(pod, container), in reality we pass this over on the wire with something like

&StateChangeEvent{
    Event: Event_{RUN, STOP, POST_UDPATE,REMOVE}_POD_SANDBOX,
    Pod: pod,
    Container: nil,
}
&StateChangeEvent{
    Event: Event_{POST_CREATE,START,POST_START,POST_UPDATE,REMOVE}_CONTAINER,
    Pod: pod,
    Container: container,
}

while on the return path we always have

type (
    {Run,Stop,PostUpdate,Remove}PodSandboxResponse = Empty
    {PostCreate,Start,PostStart,PostUpdate,Remove}ContainerResponse = Empty
    Empty struct {}
)

This was a futile and pretty misguided attempt, by me in momentary lapse of reason/worse than usual brain damage, to try and keep the protocol looking (slightly) small(er than otherwise). IMO, it did not help or make much sense at that time, and it does even less now. The only thing it causes is problems, and artificially created ones at that.

The initial collective property for all requests mutiplexed as StateChange was that they all had event-like semantics, IOW the plugin/client could not request altering anything in response. A secondary one was that, initially, all these requests had either (PodSandbox) or (PodSandbox, Container) inputs, so the 'union content' approach for StateChangeEvent did not look too ugly. But this already became a problem when @chrishenzie was adding the UpdatePodSandbox request. That, too, was event-like in this sense, but the inputs already did not fit into StateChangeEvent. So there was a dilemma whether to 1) go down the path of further uglification by adding Overhead *LinuxContainerResource, Resources *LinuxContainerResources to the 'union of inputs' or 2) handle it with a proper dedicate request and messages and blow the rest of StateChange apart in a similar fashion(... eventually, while living with the now inconsistent approach for the time being).

We decided to go with the second option, but haven't blown apart StateChange yet. I'd like to do that now, while we're rolling a next wire protocol version as part of this PR.

Thanks for the detailed explanation. SGTM!!

events is a very overloaded term.. Calling them something else might help :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants