40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Entry  Thu Dec 14 19:41:00 2017, gautam, Update, CDS, CDS recovery, NFS woes CDS_14Dec2017.pngCDS_errors.png
    Reply  Fri Dec 15 00:26:40 2017, johannes, Update, CDS, Re: CDS recovery, NFS woes 
    Reply  Fri Dec 15 01:53:37 2017, jamie, Update, CDS, CDS recovery, NFS woes 
       Reply  Fri Dec 15 11:19:11 2017, gautam, Update, CDS, CDS recovery, NFS woes NFS.pngMCautolocker.png
Message ID: 13480     Entry time: Fri Dec 15 01:53:37 2017     In reply to: 13477     Reply to this: 13481
Author: jamie 
Type: Update 
Category: CDS 
Subject: CDS recovery, NFS woes 

I would make a detailed post with how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can addto /correct me):

  1. There is a "known" problem with unloading models on c1lsc. Sometimes, running rtcds stop <model> will kill the c1lsc frontend.
  2. Sometimes, when one machine on the dolphin network goes down, all 3 go down.
  3. The new FB/RCG means that some of the old commands now no longer work. Specifically, instead of telnet fb 8087 followed by shutdown (to fix DC errors) no longer works. Instead, ssh into fb1, and run sudo systemctl restart daqd_*.

This should still work, but the address has changed.  The daqd was split up into three separate binaries to get around the issue with the monolithic build that we could never figure out.  The address of the data concentrator (DC) (which is the thing that needs to be restarted) is now 8083.


UPDATE 8:20pm:

Koji suggested trying to simply retsart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted sometime ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs?

I don't think the problem is fb1.  The fb1 NFS is mostly only used during front end boot.  It's the rtcds mount that's the one that sees all the action, which is being served from chiara.

ELOG V3.1.3-