With the current Dolphin drivers, when one of the nodes goes down (say c1lsc), all the other nodes apparently freeze for up to 20 seconds.
A freeze of that length pushes a model's cycle time over the 60 microsecond limit and is long enough to make the FE time out. Having all your front ends go down because you rebooted a single computer is bad. The front-end kernel log shows messages like:
[40184.120912] c1rfm: sync error my=0x3a6b2d5d00000000 remote=0x0
[40184.120914] c1rfm: sync error my=0x3a6b2d5d00000000 remote=0x0
[44472.627831] c1pem: ADC TIMEOUT 0 7718 38 7782
[44472.627835] c1mcs: ADC TIMEOUT 0 7718 38 7782
[44472.627849] c1sus: ADC TIMEOUT 0 7718 38 7782
[44472.644677] c1rfm: cycle 1945 time 17872; adcWait 15; write1 0; write2 0; longest write2 0
[44472.644682] c1x02: cycle 7782 time 17849; adcWait 12; write1 0; write2 0; longest write2 0
[44472.646898] c1rfm: ADC TIMEOUT 0 8133 5 7941
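To see at a glance which models need restarting, the timeout lines above can be filtered by model name. The following is only a sketch: the function name timed_out_models is made up for illustration, and it assumes the log lines have the same "[timestamp] model: ADC TIMEOUT ..." layout as the excerpt above.

```shell
# Hypothetical helper (not part of the RTS tools): list each model that
# logged an ADC TIMEOUT. Reads dmesg-style lines on stdin.
timed_out_models() {
    # Expected line layout, taken from the log excerpt above:
    #   [44472.627831] c1pem: ADC TIMEOUT 0 7718 38 7782
    # Keep only the model name from matching lines, then de-duplicate.
    sed -n 's/^\[[^]]*\] *\([^:]*\): ADC TIMEOUT.*/\1/p' | sort -u
}
```

It would be used as `dmesg | timed_out_models` on the front-end machine.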
The workaround for the moment is either to start the computers at exactly the same time, so that Dolphin is up before the front-end models start, or to restart the models by hand once the computer is up, Dolphin is running, and the models have timed out. A model is restarted by hand with:
sudo rmmod c1SYSfe
sudo insmod /opt/rtcds/caltech/c1/target/c1SYS/bin/c1SYSfe.ko
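When several models have timed out, it is convenient to generate the two commands per model rather than typing them each time. The sketch below assumes that c1SYS in the commands above is a placeholder for the model name (for example c1sus); the function name reload_cmds is hypothetical. It only prints the commands, so they can be checked before anything is run.

```shell
# Hypothetical helper: print (do not run) the commands that reload one
# front-end model. Assumes "c1SYS" above stands for the model name.
reload_cmds() {
    model="$1"
    if [ -z "$model" ]; then
        echo "usage: reload_cmds MODEL" >&2
        return 1
    fi
    # Unload the model's kernel module, then load it again from the
    # target directory, following the path layout of the commands above.
    echo "sudo rmmod ${model}fe"
    echo "sudo insmod /opt/rtcds/caltech/c1/target/${model}/bin/${model}fe.ko"
}
```

For example, `reload_cmds c1sus` prints the rmmod and insmod lines for c1sus, which can then be pasted into a terminal on that front end.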
Alex and Rolf have been working with the vendors to get this fixed, and we may simply need to update our Dolphin drivers. I'm trying to get in contact with them to see whether this is the case.