40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Entry  Thu Dec 14 19:41:00 2017, gautam, Update, CDS, CDS recovery, NFS woes CDS_14Dec2017.pngCDS_errors.png
    Reply  Fri Dec 15 00:26:40 2017, johannes, Update, CDS, Re: CDS recovery, NFS woes 
    Reply  Fri Dec 15 01:53:37 2017, jamie, Update, CDS, CDS recovery, NFS woes 
       Reply  Fri Dec 15 11:19:11 2017, gautam, Update, CDS, CDS recovery, NFS woes NFS.pngMCautolocker.png
Message ID: 13477     Entry time: Thu Dec 14 19:41:00 2017     Reply to this: 13479   13480
Author: gautam 
Type: Update 
Category: CDS 
Subject: CDS recovery, NFS woes 

[Koji, Jamie(remote), gautam]

Summary: The CDS system seems to be back up and functioning. But there seems to be some pending problems with the NFS that should be looked into.

We locked Y-arm, hand aligned transmission to 1yes. Some pending problems with ASS model (possibly symptomatic of something more general). Didn't touch Xarm because we don't know what exactly the status of ETMX is.

Problems raised in elogs in the thread of 13474 and also 13436 seem to be solved.


I would make a detailed post with how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can addto /correct me):

  1. There is a "known" problem with unloading models on c1lsc. Sometimes, running rtcds stop <model> will kill the c1lsc frontend.
  2. Sometimes, when one machine on the dolphin network goes down, all 3 go down.
  3. The new FB/RCG means that some of the old commands now no longer work. Specifically, instead of telnet fb 8087 followed by shutdown (to fix DC errors) no longer works. Instead, ssh into fb1, and run sudo systemctl restart daqd_*.
  4. Timing error on c1sus machine was linked to the mx_stream processes somehow not being automatically started. The "!mxstream restart" button on the CDS overview MEDM screen should run the necessary commands to restart it. However, today, I had to manually run sudo systemctl start mx_stream on c1sus to fix this error. It is a mystery why the automatic startup of this process was disabled in the first place. Jamie has now rectified this problem, so keep an eye out.
  5. c1oaf persistently reported DC errors (0x2bad) that couldn't be addressed by running mxstream restart or restarting the daqd processes on FB1. Restarting the model itself (i.e. rtcds restart c1oaf) fixed this issue (though of course I took the risk of having to go into the lab and hard-reboot 3 machines).
  6. At some point, we thought we had all the CDS lights green - but at that point, the END FEs crashed, necessitating Koji->EX and Gautam->EY hard reboots. This is a new phenomenon. Note that the vertex machines were unaffected.
  7. At some point, all the DC lights on the CDS overview screen went white - at the same time, we couldn't ssh into FB1, although it was responding to ping. After ~2mins, the green lights came back and we were able to connect to FB1. Not sure what to make of this.
  8. While trying to run the dither alignment scripts for the Y-arm, we noticed some strange behaviour:
    • Even when there was no signal (looking at EPICS channels) at the input of the ASS servos, the output was fluctuating wildly by ~20cts-pp.
    • This is not simply an EPICS artefact, as we could see corresponding motion of the suspension on the CCD.
    • A possible clue is that when I run the "Start Dither" script from the MEDM screen, I get a bunch of error messages (see Attachment #2).
    • Similar error messages show up when running the LSC offset script for example. Seems like there are multiple ports open somehow on the same machine?
    • There are no indicator lights on the CDS overview screen suggesting where the problem lies.
    • Will continue investigating tomorrow.

Some other general remarks:

  1. ETMX watchdog remains shutdown.
  2. ITMY and BS oplevs have been hijacked for HeNe RIN / Oplev sensing noise measurement, and so are not enabled.
  3. Y arm trans QPD (Thorlabs) has large 60Hz harmonics. These can be mitigated by turning on a 60Hz comb filter, but we should check if this is some kind of ground loop. The feature is much less evident when looking at the TRANS signal on the QPD.

UPDATE 8:20pm:

Koji suggested trying to simply retsart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted sometime ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs?

Attachment 1: CDS_14Dec2017.png  23 kB  | Hide | Hide all
CDS_14Dec2017.png
Attachment 2: CDS_errors.png  18 kB  | Hide | Hide all
CDS_errors.png
ELOG V3.1.3-