Quote: |
I would make a detailed post with how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can addto /correct me):
- There is a "known" problem with unloading models on c1lsc. Sometimes, running rtcds stop <model> will kill the c1lsc frontend.
- Sometimes, when one machine on the dolphin network goes down, all 3 go down.
- The new FB/RCG means that some of the old commands now no longer work. Specifically, instead of telnet fb 8087 followed by shutdown (to fix DC errors) no longer works. Instead, ssh into fb1, and run sudo systemctl restart daqd_*.
|
This should still work, but the address has changed. The daqd was split up into three separate binaries to get around the issue with the monolithic build that we could never figure out. The address of the data concentrator (DC) (which is the thing that needs to be restarted) is now 8083.
Quote: |
UPDATE 8:20pm:
Koji suggested trying to simply retsart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted sometime ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs?
|
I don't think the problem is fb1. The fb1 NFS is mostly only used during front end boot. It's the rtcds mount that's the one that sees all the action, which is being served from chiara. |