@jgallagher commented Jan 21, 2026

sled-agent attempts to find the IPs of all switch zones in two places:

  1. During rack setup (to know what to tell Nexus during handoff)
  2. When starting a zone that requires external connectivity (to set up NAT entries in dendrite)

Prior to this PR, both of these would attempt to find both switches for 5 minutes, and after that point would proceed as long as they'd found at least one. For NAT entries this is sort of okay (but not really; more on this in another issue shortly), because Nexus has a background task that will eventually come back around and sync NAT entries for services. But it's not fine for rack setup: if we proceed with only one switch found when the RSS config specifies uplinks for both, we'll fail to hand off to Nexus (details in #9678).

With this change, rack setup waits forever for all switches that have a configured uplink. This means that if a switch hasn't come up yet, RSS won't proceed, but that should be okay. (It seems better if we could come up with one switch and then have Nexus reconcile things after the fact, but that will be a larger change with more risk and more testing difficulty, I think.)
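
To make the two policies concrete, here is a minimal sketch of the decision each one makes after a lookup attempt. It is not the sled-agent implementation: the `Switch` and `Decision` types, the function names, and the idea of passing the timeout in as a parameter are all hypothetical, chosen only to illustrate the behavior described above.

```rust
use std::collections::BTreeSet;
use std::time::Duration;

// Hypothetical stand-in for the two switch slots in a rack.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Switch {
    Switch0,
    Switch1,
}

// Hypothetical outcome of one lookup attempt.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Proceed,
    Retry,
}

/// Timeout-based policy (the behavior before this PR, and still the
/// behavior on the zone startup path): keep retrying until both switches
/// are found, but once `timeout` has elapsed, settle for at least one.
fn decide_with_timeout(
    found: &BTreeSet<Switch>,
    elapsed: Duration,
    timeout: Duration,
) -> Decision {
    let found_both =
        found.contains(&Switch::Switch0) && found.contains(&Switch::Switch1);
    if found_both || (elapsed >= timeout && !found.is_empty()) {
        Decision::Proceed
    } else {
        Decision::Retry
    }
}

/// Rack setup policy after this PR: there is no timeout; proceed only
/// once every switch with a configured uplink (`required`) has been found.
fn decide_wait_for_required(
    found: &BTreeSet<Switch>,
    required: &BTreeSet<Switch>,
) -> Decision {
    if required.is_subset(found) {
        Decision::Proceed
    } else {
        Decision::Retry
    }
}
```

With only `Switch0` found and uplinks configured on both switches, `decide_with_timeout` returns `Proceed` once the timeout has elapsed, while `decide_wait_for_required` keeps returning `Retry` until `Switch1` shows up.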

Fixes #9678.

The zone startup path still uses a 5-minute timeout. If Nexus is up, this
is much longer than necessary, since Nexus's background task will also do
this work. But it's critical that sled-agent autonomously set up NAT
entries for boundary NTP during cold boot, so it seems a little dicey to
lower this timeout without much more rigorous testing.
@jgallagher

Testing notes from dublin: I manually shut off one of the switch zones and kicked off RSS. (This reproduced #9678 prior to this change.)

RSS got stuck, as expected, waiting to find both switch zones:

00:16:10.642Z WARN SledAgent (RSS): Failed to look up switch zone locations
    error = Only found one switch (expected two): {Switch0: fd00:1122:3344:101::2}
    file = sled-agent/src/bootstrap/early_networking.rs:195
    requested_wait_time = 18446744073709551615.999999999s
    retry_after = 17.88171432s
    total_elapsed = 539.955145719s
...
00:16:33.820Z WARN SledAgent (RSS): Failed to look up switch zone locations
    error = Only found one switch (expected two): {Switch0: fd00:1122:3344:101::2}
    file = sled-agent/src/bootstrap/early_networking.rs:195
    requested_wait_time = 18446744073709551615.999999999s
    retry_after = 19.06689036s
    total_elapsed = 563.133314023s
...
00:16:52.959Z WARN SledAgent (RSS): Failed to look up switch zone locations
    error = Only found one switch (expected two): {Switch0: fd00:1122:3344:101::2}
    file = sled-agent/src/bootstrap/early_networking.rs:195
    requested_wait_time = 18446744073709551615.999999999s
    retry_after = 18.115925884s
    total_elapsed = 582.272329612s
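
A note on reading these logs: the `requested_wait_time` value matches the debug rendering of Rust's `std::time::Duration::MAX` (`u64::MAX` seconds plus 999,999,999 nanoseconds), which is consistent with "wait forever". That the code passes `Duration::MAX` is an assumption based only on the printed value; the helper name below is hypothetical.

```rust
use std::time::Duration;

/// Render the largest representable `Duration` the way `{:?}` prints it.
/// (Hypothetical helper, used only to compare against the log output.)
fn wait_forever_display() -> String {
    format!("{:?}", Duration::MAX)
}
```

Calling `wait_forever_display()` yields `18446744073709551615.999999999s`, the same string shown in the `requested_wait_time` field above.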

Once I restarted the missing switch zone, we found it and proceeded with RSS:

00:17:11.140Z INFO SledAgent (RSS): Successfully looked up all expected switch zone underlay addresses
    addrs = {Switch0: fd00:1122:3344:101::2, Switch1: fd00:1122:3344:103::2}
    file = sled-agent/src/bootstrap/early_networking.rs:159

and successfully handed off to Nexus:

19:31:49.938Z INFO SledAgent (RSS): Handing off control to Nexus
    file = sled-agent/src/rack_setup/service.rs:778
19:31:52.640Z INFO SledAgent (RSS): Handoff to Nexus is complete
    file = sled-agent/src/rack_setup/service.rs:1076

Successfully merging this pull request may close this issue: Rack fails to come up when one switch does not come up in a timely manner
