40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Message ID: 4208     Entry time: Wed Jan 26 12:04:31 2011
Author: josephb 
Type: Update 
Category: CDS 
Subject: Explanation of why c1sus and c1lsc models crash when the other one goes down 

So apparently with the current Dolphin drivers, when one of the nodes goes down (say c1lsc), it causes all the other nodes to freeze for up to 20 seconds.

This 20 seconds can force a model to go over the 60 microseconds limit and is sufficiently long enough to force the FE to time out.  Alex and Rolf have been working with the vendors to get this problem fixed, as having all your front ends go down because you rebooted a single computer is bad.

[40184.120912] c1rfm: sync error my=0x3a6b2d5d00000000 remote=0x0
[40184.120914] c1rfm: sync error my=0x3a6b2d5d00000000 remote=0x0
[44472.627831] c1pem: ADC TIMEOUT 0 7718 38 7782
[44472.627835] c1mcs: ADC TIMEOUT 0 7718 38 7782
[44472.627849] c1sus: ADC TIMEOUT 0 7718 38 7782
[44472.644677] c1rfm: cycle 1945 time 17872; adcWait 15; write1 0; write2 0; longest write2 0
[44472.644682] c1x02: cycle 7782 time 17849; adcWait 12; write1 0; write2 0; longest write2 0
[44472.646898] c1rfm: ADC TIMEOUT 0 8133 5 7941

The solution for the moment is to start the computers at exactly the same time, so the dolphin is up before the front ends, or start the models by hand after the computer is up and dolphin running, but after they have timed out.  This is done by:

sudo rmmod c1SYSfe

sudo insmod /opt/rtcds/caltech/c1/target/c1SYS/bin/c1SYSfe.ko

 

Alex and Rolf have been working with the vendors to get this fixed, and we may simply need to update our Dolphin drivers.  I'm trying to get in contact with them and see if this is the case.

ELOG V3.1.3-