[Anchal, Paco]
Bringing back CDS took a lot of work yesterday. I'm gonna try to summarize the main points here.
mx_start_stop
For some reason, fb1 was not able to mount mx devices automatically on system boot. This was an issue I earlier faced in fb1(clone) too. The fix to this problem is to run the script:
controls@fb1:/opt/mx/sbin/mx_start_stop start
To make this persistent, I've configured a daemon (/etc/systemd/system/mx_start_stop.service) in fb1 to run once on system boot and mount the mx devices as mentioned above. We did not see this issue of later reboots yesterday.
gpstime
Next was the issue of gpstime module out of date on fb1. This issue is also known in the past and requires us to do the following:
controls@fb1:~ 0$ sudo modprobe -r gpstime
controls@fb1:~ 1$ sudo modprobe gpstime
Again, to make this persistent, I've configured a daemon (/etc/systemd/system/re-add-gpstime.service) in fb1 to run the above commands once on system boot. This corrected gpstime automatically and we did not face these problems again.
time synchornization
Later we found that fb1-FE computers, ntp time synchronization was not working and the main reason was that fb1 was unable to access internet. As a rule of thumb, it is always a good idea to try pinging www.google.com on fb1 to ensure that it is connected to internet. The issue had to do with fb1 not being able to find any namespace server. We fixed this issue by reloading bind9 service on chiara a couple of times. We're not really sure why it wasn't working.
~>sudo service bind9 stop
~>sudo service bind9 start
~>sudo service bind9 status
* bind9 is running
After the above, we saw that fb1 ntp server is working fine. You see following output on fb1 when that is the case:
controls@fb1:~ 0$ ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
-table-moral.bnr 110.142.180.39 2 u 399 512 377 195.034 -14.618 0.122
*server1.quickdr .GPS. 1 u 67 64 377 130.483 -1.621 1.077
+ntp2.tecnico.ul 56.99.239.27 2 u 473 512 377 184.648 -0.775 2.231
+schattenbahnhof 129.69.1.153 2 u 365 512 377 144.848 3.841 1.092
192.168.123.255 .BCST. 16 u - 64 0 0.000 0.000 0.000
On the FE models, timedatectl should show that NTP synchronized feild is yes. That wasn't happening even after us restarting the systemd-timesyncd service. After this, I just tried restarting all FE computers and it started working.
CDS
We had removed all db9 enabling plugs on the new SOSs beforehand to keep coils off just in case CDS does not come back online properly.
Everything in CDS loaded properly except the c1oaf model which kepy showing 0x2bad status. This meant that some IPC flags are red on c1sus, c1mcs and c1lsc as well. But everything else is green. See attachment 1. I then burtrestroed everything in the /opt/rtcds/caltech/c1/burt/autoburt/snapshots/2022/Feb/4/12:19 directory. This includes the snapshot of c1vac as well that I added on autoburt that day. All burt restore statuses were green OK. I think we are in good state now to start watchdogs on the new SOSs and put back the db9 enabling plugs.
Future work:
When somebody gets time, we should make cutom service files in fb1:/etc/systemd/system/ symbolic links to a repo directory and version control these important services. We should also make sure that their dependencies and startup order is correctly configured. I might have done a half-assed job there since I recently learned how to make unit files. We should do the same on nodus and chiara too. Our hope is that on one glorious day, the lab can be restarted without spending more than 20 min on booting up the computers and network.
|