We've been debugging some frustrating issues with transceivers in our manufacturing pipeline. We generally attempt to bring up all the links at once, and then go through various, mostly-manual steps when that fails. We often look at the Tofino software state machines, optical power, or BER counters, and then use that to diagnose the issue.
This process doesn't work very well. We have limited visibility into what the Tofino SDE is actually doing: what requests has it made, what state does it think the modules are in, what is it waiting for before moving on? We can capture all the messages we send to the SP about the transceivers on its behalf, but those are usually hard to sift through or interpret. At the same time, we have debugged and fixed numerous issues in the past around the SDE's state machines. In general, the machines only go forward. They also never attempt to verify that the actual state of the module matches the state the software expects. The SDE asks for something to happen, checks the return code, and then assumes the module stays that way forever.
The only exception to this is when a module is physically unplugged. At that point, the SDE resets its state machines entirely and starts over. For many of these manufacturing issues, this is the only reliable way to fix them. @Aaron-Hartwig and I feel that we should consider adding some machinery in Dendrite, the transceiver-control crate, and the front I/O FPGA board to more cleanly reset modules in some circumstances.
@Aaron-Hartwig suggested adding some bits in the FPGA to track whether a module is "new". This would be set for each module if the presence pin changes state. Dendrite could receive this in the extra space in the ExtendedStatus messages. When it finds new modules, it goes through a complete reset process: hide them from the Tofino SDE, fully reset them, and then unhide them. It can also ACK to the SP that it has handled these, so the FPGA can clear the "new" bit.
This doesn't cover every case, or even most. But it does let us physically unplug the module, and be sure that it will be fully reset and handled from the beginning of the SDE state machines when it is reinserted.
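To make the proposed handshake concrete, here is a rough Rust sketch of the "new" bit flow: the FPGA latches a bit on a presence change, Dendrite hides, resets, and unhides each flagged module, and then ACKs only the modules it actually handled. Everything named here is hypothetical and for illustration only: `Fpga`, `ModuleControl`, `RecordingControl`, `handle_new_modules`, the 32-port count, and the bitmask encoding are assumptions, not existing APIs in Dendrite, Hubris, or the transceiver-control crate. Leaving a module hidden when its reset fails is also just one possible policy.

```rust
/// Assumed number of front I/O ports, for illustration only.
const N_PORTS: u8 = 32;

/// Hypothetical stand-in for the FPGA's per-module "new" bits.
#[derive(Debug, Default)]
pub struct Fpga {
    new_mask: u32,
}

impl Fpga {
    /// Latch the "new" bit when a presence pin changes state (port < N_PORTS).
    pub fn presence_changed(&mut self, port: u8) {
        assert!(port < N_PORTS);
        self.new_mask |= 1u32 << port;
    }

    /// The bits that would ride in the spare ExtendedStatus space.
    pub fn new_mask(&self) -> u32 {
        self.new_mask
    }

    /// ACK: clear only the bits that were acknowledged.
    pub fn ack(&mut self, mask: u32) {
        self.new_mask &= !mask;
    }
}

/// Hypothetical operations Dendrite would need for each module.
pub trait ModuleControl {
    fn hide_from_sde(&mut self, port: u8);
    fn reset(&mut self, port: u8) -> Result<(), String>;
    fn unhide_from_sde(&mut self, port: u8);
}

/// Toy implementation that records calls and can fail one port's reset.
#[derive(Debug, Default)]
pub struct RecordingControl {
    pub fail_port: Option<u8>,
    pub log: Vec<String>,
}

impl ModuleControl for RecordingControl {
    fn hide_from_sde(&mut self, port: u8) {
        self.log.push(format!("hide {port}"));
    }
    fn reset(&mut self, port: u8) -> Result<(), String> {
        self.log.push(format!("reset {port}"));
        if self.fail_port == Some(port) {
            Err(format!("reset failed on port {port}"))
        } else {
            Ok(())
        }
    }
    fn unhide_from_sde(&mut self, port: u8) {
        self.log.push(format!("unhide {port}"));
    }
}

/// Hide, reset, and unhide every module flagged as "new", then ACK the
/// ones that completed. Returns the mask of handled ports.
pub fn handle_new_modules<C: ModuleControl>(fpga: &mut Fpga, ctl: &mut C) -> u32 {
    let pending = fpga.new_mask();
    let mut handled = 0u32;
    for port in 0..N_PORTS {
        if pending & (1u32 << port) == 0 {
            continue;
        }
        ctl.hide_from_sde(port);
        if ctl.reset(port).is_ok() {
            ctl.unhide_from_sde(port);
            handled |= 1u32 << port;
        }
        // On failure the module stays hidden and its bit stays set,
        // so the next poll retries it.
    }
    fpga.ack(handled);
    handled
}
```

Acknowledging per-module rather than clearing the whole mask means a presence change that lands while Dendrite is mid-reset on another port is not lost, and a failed reset is retried on the next status poll.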
This is a Dendrite issue because it's the brains of this operation, but there's work in the FPGA code, Hubris, and the transceiver-control crate as well.