40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Message ID: 16650     Entry time: Mon Feb 7 16:14:37 2022
Author: Tega 
Type: Update 
Category: Computers 
Subject: realtime system reboot problem 

I was looking into plotting temperature sensor data trend and why we currently do not have frame data written to file (on /frames) since Friday, and noticed that the FE models were not running. So I spoke to Anchal about it and he mentioned that we are currently unable to ssh into the FE machines, therefore we have been unable to start the models. I recalled the last time we enountered this problem Koji resolved it on Chiara, so I search the elog for Koji's fix and found it here, https://nodus.ligo.caltech.edu:8081/40m/16310. I followed the procedure and restarted c1sus and c1lsc machine and we are now able to ssh into these machines. Also restarted the remaining FE machines and confirm that can ssh into them. Then to start models, I ssh into each FE machine (c1lsc, c1sus, c1ioo, c1iscex, c1iscey, c1sus2) and ran the command

rtcds start --all

to start all models on the FE machine. This procedure worked for all the FE machines but failed for c1lsc. For some reason after starting the first the IOP model - c1x04, c1lsc and c1ass, the ssh connection to the machine drops. When we try to ssh into c1lsc after this event, we get the following error :  "ssh: connect to host c1lsc port 22: No route to host".  I reset the c1lsc machine and deecided to to start the IOP model for now. I'll wait for Anchal or Paco to resolve this issue.


[Anchal, Tega]

I informed Anchal of the problem and ask if he could take a look. It turn out 9 FE models across 3 FE machines (c1lsc, c1sus, c1ioo) have a certain interdependece that requires careful consideration when starting the FE model. In a nutshell, we need to first start the IOP models in all three FE machines before we start the other models in these machines. So we turned off all the models and shutdown the FE machines mainly bcos of a daq issue, since the DC (data concentrator) indicator was not initialised. Anchal looked around in fb1 to figure out why this was happening and eventually discovered that it was the same as the ms_stream issue encountered earlier in fb1 clone (https://nodus.ligo.caltech.edu:8081/40m/16372). So we restarted fb1 to see if things clear up given that chiara dhcp sever is now working fine. Upon restart of fb1, we use the info in a previous elog that shows if the DAQ network is working or not, r.e. we ran the command

$ /opt/mx/bin/mx_info
MX:fb1:mx_init:querying driver:error 5(errno=2):No MX device entry in /dev.

 The output shows that MX device was not initialiesd during the reboot as can also be seen below.

$ sudo systemctl status daqd_dc -l

● daqd_dc.service - Advanced LIGO RTS daqd data concentrator
   Loaded: loaded (/etc/systemd/system/daqd_dc.service; enabled)
   Active: failed (Result: exit-code) since Mon 2022-02-07 18:02:02 PST; 12min ago
  Process: 606 ExecStart=/usr/bin/daqd_dc_mx -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.dc (code=exited, status=1/FAILURE)
 Main PID: 606 (code=exited, status=1/FAILURE)

Feb 07 18:01:56 fb1 systemd[1]: Starting Advanced LIGO RTS daqd data concentrator...
Feb 07 18:01:56 fb1 systemd[1]: Started Advanced LIGO RTS daqd data concentrator.
Feb 07 18:02:00 fb1 daqd_dc_mx[606]: [Mon Feb  7 18:01:57 2022] Unable to set to nice = -20 -error Unknown error -1
Feb 07 18:02:00 fb1 daqd_dc_mx[606]: Failed to do mx_get_info: MX not initialized.
Feb 07 18:02:00 fb1 daqd_dc_mx[606]: 263596
Feb 07 18:02:02 fb1 systemd[1]: daqd_dc.service: main process exited, code=exited, status=1/FAILURE
Feb 07 18:02:02 fb1 systemd[1]: Unit daqd_dc.service entered failed state.


NOTE: We commented out the line

Restart=always

in the file "/etc/systemd/system/daqd_dc.service" in order to see the error, BUT MUST UNDO THIS AFTER THE PROBLEM IS FIXED!

ELOG V3.1.3-