sled-agent's setup of OPTE port NAT entries is unreliable #9700

@jgallagher

(This ticket is written from the perspective of a rack that has uplinks configured on both switches. Some of this still applies to a rack with uplinks configured on only one switch, but I don't think going into the details of the difference is super useful. I just wanted to note this because some comments below are inaccurate if only one switch is expected to have an uplink.)

When starting a zone that has external connectivity (Nexus, boundary NTP, or external DNS), sled-agent's code path for setting up the zone's OPTE ports includes blocking until either:

  1. it finds the IDs of both switches
  2. it finds the ID of one switch and a 5-minute timeout has passed

This is currently implemented here; #9699 makes changes in this area to make the 5-minute timeout more obvious. After it finds one or both switch zone IPs, sled-agent then blocks until it can successfully ensure the NAT entry exists on the dendrite instance for every IP it found (here).
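
For concreteness, here's a rough sketch of that flow (synchronous pseudo-Rust; `lookup` and `ensure_nat_entry` are hypothetical stand-ins for sled-agent internals, which are async in practice):

```rust
use std::net::Ipv6Addr;
use std::time::{Duration, Instant};

/// Poll until both switch zone IPs are found, or until one is found and the
/// 5-minute timeout has passed.
fn wait_for_switch_zone_ips(mut lookup: impl FnMut() -> Vec<Ipv6Addr>) -> Vec<Ipv6Addr> {
    const ONE_SWITCH_TIMEOUT: Duration = Duration::from_secs(5 * 60);
    let start = Instant::now();
    loop {
        let ips = lookup();
        if ips.len() >= 2 {
            return ips; // case 1: found both switches
        }
        if !ips.is_empty() && start.elapsed() >= ONE_SWITCH_TIMEOUT {
            return ips; // case 2: found one switch; give up on the other
        }
        std::thread::sleep(Duration::from_secs(1));
    }
}

/// Hypothetical dendrite call that can fail (e.g., dendrite unreachable).
fn ensure_nat_entry(_dendrite: Ipv6Addr) -> Result<(), String> {
    Ok(())
}

/// Phase two: block until the NAT entry is ensured on every dendrite found
/// above. The IP list was captured once, and each entry is retried forever;
/// this is the potentially-infinite part.
fn ensure_nat_entries(ips: &[Ipv6Addr]) {
    for &ip in ips {
        while ensure_nat_entry(ip).is_err() {
            std::thread::sleep(Duration::from_secs(1));
        }
    }
}
```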

If Nexus is online, all of this is effectively an optimization. Nexus has a background task that periodically ensures all NAT entries for all these services are correct on all dendrite instances, so if anything goes wrong here, we expect that Nexus will come around and fix things up in short order.

However, this code path is critical when Nexus is not online: during rack setup and during cold boot. In both cases we must set up NAT entries for the boundary NTP zone in particular, so that the rack can sync time and proceed with setup / booting.

There are several problems with the current implementation:

  1. All of this happens synchronously in the "start a zone" path. We have a potentially-5-minute wait followed by a potentially-infinite retry loop, both of which delay zone startup (and anything waiting on zones to start, like the general config reconciliation loop).
  2. If we only find one switch zone IP and give up on the other after 5 minutes, we'll only set up NAT entries on the one dendrite we found and not the other.
  3. If something goes wrong with dendrite in between when we found the switch zone IP and when we attempt to ensure NAT entries, we'll wait forever for the IP we found earlier to succeed. (An unlikely-in-practice but possible scenario is: we find both switch zones, then a scrimlet dies and has to be replaced before we contact that scrimlet's dendrite; we're now stuck forever waiting for the broken scrimlet to come back, and won't recover when the replacement comes online.)

@JustinAzoff recently saw problem 2 in practice on madrid. Both switches had upstream connectivity, but the sled hosting boundary NTP was only able to find one of the switch zone IPs, so it only set up NAT entries on one of the dendrite instances. That left the rack in an invalid state: we ought to have connectivity through both switches, since both are successfully connected upstream, but we've failed to set up NAT entries on one. Until NTP syncs (which it may never be able to do, if replies always arrive at the switch that's missing a NAT entry), Nexus can't come online to restore the missing entry.

At risk of being "guy with hammer finds another nail-shaped problem", I'm strongly tempted to suggest we add a "NAT entry reconciliation" task to sled-agent itself. That task would periodically attempt to find all available dendrite instances and would ensure NAT entries exist for any service zones hosted on that sled. This would overlap with the Nexus RPW; I'm not sure if we'd want to remove the Nexus RPW altogether, keep the Nexus RPW but have it only remove NAT entries that are no longer needed, keep the Nexus RPW as it is today, or something else. (@internet-diglett may have ideas here? I think the Nexus RPW syncs with both dendrite and CRDB; sled-agent can't sync with CRDB, so maybe we still need both?)
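
As a sketch of what that task might look like (hypothetical names throughout; `discover_dendrite_instances`, `nat_entries_for_local_zones`, and `ensure_nat_entry` are illustrative stand-ins, not real omicron APIs):

```rust
use std::net::{IpAddr, Ipv6Addr};
use std::time::Duration;

/// Illustrative NAT entry; the real one carries MAC, VNI, port range, etc.
struct NatEntry {
    external_ip: IpAddr,
}

// Hypothetical stand-ins for sled-agent internals.
fn discover_dendrite_instances() -> Vec<Ipv6Addr> {
    Vec::new() // e.g., re-resolve switch zones via internal DNS each pass
}
fn nat_entries_for_local_zones() -> Vec<NatEntry> {
    Vec::new() // derived from the service zones this sled is running
}
fn ensure_nat_entry(_dendrite: Ipv6Addr, _entry: &NatEntry) -> Result<(), String> {
    Ok(())
}

/// One pass of the proposed reconciliation task: re-discover whichever
/// dendrite instances are reachable right now and ensure every NAT entry
/// for locally-hosted service zones exists on each of them. Failures are
/// logged and retried on the next pass rather than blocking zone startup.
fn reconcile_nat_entries() {
    let dendrites = discover_dendrite_instances();
    let entries = nat_entries_for_local_zones();
    for dendrite in dendrites {
        for entry in &entries {
            if let Err(e) = ensure_nat_entry(dendrite, entry) {
                eprintln!(
                    "failed to ensure NAT entry for {} on {dendrite}: {e}",
                    entry.external_ip
                );
            }
        }
    }
}

fn main() {
    // Each pass starts from fresh discovery, so a replaced scrimlet is
    // picked up on the next pass instead of being waited on forever.
    loop {
        reconcile_nat_entries();
        std::thread::sleep(Duration::from_secs(30));
    }
}
```

Because each pass starts from fresh discovery, this shape would also address problem 3: a replaced scrimlet's dendrite gets picked up on the next pass instead of the task blocking forever on the dead one.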
