FSD Timing
Here are some thoughts on event ordering and first-fault finding in the JLab Fast ShutDown (FSD) system.
The primary purpose of the fast shutdown system is to cut off the beam before it can burn a hole in the beam line. The beam is turned off when a device in a dedicated, separate FSD network of hierarchical nodes drops a positive voltage. The drop propagates up the "tree" until it is detected at the root, and the root node shuts off the beam. The system is supposed to respond within 20 µs because in 1990 it was estimated that the beam would rupture the vacuum wall in roughly 50 µs [1].
A secondary goal of the system is to allow determination of WHICH device faulted and WHEN. A second control system network, EPICS, is used to monitor and report which node dropped the voltage and when. FSD devices are tied to IOCs, which report when they notice that a connected device has faulted. An EPICS client registers EPICS CA monitors to be notified by IOCs about faults.
As with any loosely synchronized distributed system, there are two timestamps in play for a given message: the time the EPICS CA client received the message, and the timestamp bundled inside the message by the remote IOC indicating when it sent the message. In this base case there are also two different clocks: the EPICS CA client clock and the remote IOC clock. Since we're dealing with a large distributed system, there are actually many IOCs, each with its own clock. The clocks are loosely synchronized using NTP, typically to within roughly 10 ms. That is 500 times longer than 20 µs, so the synchronization is orders of magnitude too coarse to adequately order events happening on the order of 20 µs.
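To make the mismatch concrete, here is a minimal Python sketch. The `StampedEvent` type, the `can_order` helper, and the device names are hypothetical, introduced only for illustration. It assumes that any two clocks may disagree by up to the stated sync bound, so two timestamps from different clocks can only be ordered with confidence when they differ by more than that bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StampedEvent:
    name: str           # hypothetical device label
    timestamp_s: float  # time reported by the sender's own clock, in seconds


def can_order(a: StampedEvent, b: StampedEvent, max_skew_s: float) -> bool:
    """Return True only if the order of a and b is unambiguous.

    Assumption: the two clocks may disagree by up to max_skew_s, so any
    timestamp difference at or below that bound could be pure clock skew.
    """
    return abs(a.timestamp_s - b.timestamp_s) > max_skew_s


NTP_SKEW_S = 10e-3   # roughly 10 ms sync quality, from the text
FSD_SCALE_S = 20e-6  # 20 µs response time, from the text

first = StampedEvent("device_a", 100.0)
second = StampedEvent("device_b", 100.0 + FSD_SCALE_S)

print(can_order(first, second, NTP_SKEW_S))  # False: 20 µs is lost in the skew
print(round(NTP_SKEW_S / FSD_SCALE_S))       # 500
```

Under these assumptions, two faults 20 µs apart are indistinguishable from simultaneous ones; the skew bound would need to drop below the event spacing before the comparison becomes meaningful.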
Any given EPICS CA client can simply monitor the set of FSD-related IOCs and trust the timestamps provided by each independent IOC. In addition to the clock sync problem, however, each IOC introduces a varying processing delay before it reports that it noticed a device fault (scan rates are variable and not frequent enough).
There are also cases where an IOC-provided timestamp is wildly wrong, either because of NTP drift or because of processing delays.
An EPICS CA client can optionally ignore the IOC-provided timestamp and use its own message-received timestamp to impose a global ordering. It's also possible to combine the two: the MYA archiver uses the IOC-provided timestamp unless it differs from the archiver's message-received timestamp by more than a threshold. When EPICS CA client message-received timestamps are used instead of IOC-provided timestamps, the variable IOC processing delays are still an issue, and different network path lengths and variable network congestion are added to the mix.
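The combined rule described above can be sketched in a few lines of Python. This is not the MYA archiver's actual code: the function name `choose_timestamp` and the 5-second threshold are assumptions made for illustration, since the real threshold value is not given here.

```python
def choose_timestamp(ioc_ts: float, received_ts: float, max_diff_s: float) -> float:
    """Prefer the IOC-provided timestamp unless it is implausibly far
    from the time the client received the message.

    All times are in seconds; max_diff_s is a hypothetical threshold.
    """
    if abs(received_ts - ioc_ts) > max_diff_s:
        # The IOC clock looks wrong (drift or a long delay): fall back
        # to the client's own receive time.
        return received_ts
    return ioc_ts


THRESHOLD_S = 5.0  # assumed value, for illustration only

# IOC timestamp close to the receive time: trust the IOC.
print(choose_timestamp(1000.002, 1000.010, THRESHOLD_S))  # 1000.002

# IOC timestamp hours off: use the receive time instead.
print(choose_timestamp(-6200.0, 1000.010, THRESHOLD_S))   # 1000.01
```

Note that the fallback only guards against wildly wrong IOC clocks. Any timestamp inside the threshold is accepted as is, so this rule does nothing to improve ordering at the 20 µs scale.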
Critically, our current FSD system has no mechanism that lets us determine the first device to fault, or a globally synchronized event ordering, with absolute certainty. Any given EPICS CA client can impose its own ordering of received messages or rely on the ordering provided by remotely distributed IOC clocks. In neither case is the timing accuracy sufficient. The latter case has the nice side effect of being consistent (though not correct) regardless of EPICS CA client, but it suffers from having to trust remote clocks, one or more of which may be wildly incorrect. Our current strategy is to group concurrent faults together into a larger window called a trip and use heuristics and rules to guess the root cause of the trip and speculate on the likely order.
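As a rough illustration of the grouping step only (not the heuristics that guess a root cause), the following Python sketch collects faults into trips. The function name `group_into_trips`, the rule that a trip spans a fixed window measured from its first fault, the 0.5-second window, and the device names are all assumptions, not a description of the production logic.

```python
from typing import List, Tuple

Fault = Tuple[float, str]  # (reported timestamp in seconds, device name)


def group_into_trips(faults: List[Fault], window_s: float) -> List[List[Fault]]:
    """Group faults into trips.

    A fault joins the current trip if its timestamp is within window_s of
    the first fault in that trip; otherwise it starts a new trip.
    """
    trips: List[List[Fault]] = []
    for fault in sorted(faults):  # sort by reported timestamp
        if trips and fault[0] - trips[-1][0][0] <= window_s:
            trips[-1].append(fault)
        else:
            trips.append([fault])
    return trips


# Hypothetical fault reports, deliberately out of order.
faults = [(10.300, "C"), (10.000, "A"), (42.000, "D"), (10.004, "B")]
trips = group_into_trips(faults, window_s=0.5)

for trip in trips:
    print([name for _, name in trip])
# ['A', 'B', 'C']
# ['D']
```

The order of faults inside a trip still comes from the reported timestamps, so it carries all of the clock and delay uncertainty discussed above. It is a starting point for the heuristics, not a verified sequence.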