40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Message ID: 16988     Entry time: Mon Jul 11 19:29:23 2022
Author: Paco 
Type: Summary 
Category: General 
Subject: Finalizing recovery -- timing issues, cds, MC1 

[Yuta, Koji, Paco]

Restarting CDS

We were having some trouble restarting all the models on the FEs. The error was the famous 0x4000 DC error, which has to do with time de-synchronization between fb1 and a given FE. We tried a combination of things haphazardly, such as reloading the gpstime process using

controls@fb1:~ 0$ sudo systemctl stop daqd_*
:~ 0$ sudo modprobe -r gpstime
controls@fb1:~ 0$ sudo modprobe gpstime
controls@fb1:~ 0$ sudo systemctl start daqd_*
controls@fb1:~ 0$ sudo systemctl restart open-mx.service

without much success, even when doing this again after hard rebooting FE + IO chassis combinations around the lab. Koji prompted us to check the local times as reported by the gpstime module, and comparing it to network reported times we saw the expected offset of ~ 3.5 s. On a given FE ("c1***") and fb1 separately, we ran:

controls@c1***:~ 0$ timedatectl
  Local time: Mon 2022-07-11 16:22:39 PDT
  Universal time: Tue 2022-07-11 23:22:39 UTC
       Time zone: America/Los_Angeles (PDT, -0700)
       NTP enabled: yes
       NTP synchronized: no
 RTC in local TZ: no
       DST active: yes
 Last DST change: DST began at
                  Sun 2022-03-13 01:59:59 PST
                  Sun 2022-03-13 03:00:00 PDT
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2022-11-06 01:59:59 PDT
                  Sun 2022-11-06 01:00:00 PST
controls@fb1:~ 0$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
============================================================================== .BCST.          16 u    -   64    0    0.000    0.000   0.000

which meant a couple of things:

  1. fb1 was serving its time (broadcast to local (martian) network)
  2. fb1 was not getting its time from the internet
  3. c1*** was not synchronized even though fb1 was serving the time

By looking at previous elogs with similar issues, we tried two things;

  1. First, from the FEs, run sudo systemctl restart systemd-timesyncd to get the FE in sync; this didn't immediately solve anything.
  2. Then, from fb1, we tried pinging google.com and failed! The fb1 was not connected to the internet!!!

We tried rebooting fb1 to see if it connected, but eventually what solved this was restarting the bind9 service on chiara! Now we could ping google, and saw this output

controls@fb1:~ 0$ ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
+tor.viarouge.ne   2 u  244 1024  377  144.478    0.761   0.566
*ntp.exact-time. .GPS.            1 u   93 1024  377  174.450   -1.741   0.613
 time.nullrouten .STEP.          16 u    - 1024    0    0.000    0.000   0.000
+ntp.as43588.net      2 u  39m 1024  314  189.152    4.244   0.733 .BCST.          16 u    -   64    0    0.000    0.000   0.000 

meaning fb1 was getting its time served. Going back to the FEs, we still couldn't see the ntp synchronized flag up, but it just took time after a few minutes we saw the FEs in sync! This also meant that we could finally restart all FE models, which we successfully did following the script described in the wiki. Then we had to reload the modbusIOC service in all the slow machines (sometimes this required us to call sudo systemctl daemon-reload) and performed burt restore to a last Friday's snap file collection.

IMC realign and MC1 glitch?

With Koji's help PMC locked, and then Yuta and Paco manually increased the input power to the IFO by rotating the waveplate picomotor to 37.0 deg. After this, we noticed that the MC REFL spot was not hitting the camera, so maybe MC1 was misaligned. Paco checked the AP table and saw the spot horizontally misaligned on the camera, which gave us the initial YAW correction on MC1. After some IMC recovery, we saw only MC1 got spontaneously kicked along both PIT and YAW, making our alignment futile. Though not hard to recover, we wondered why this happened.

We went into the 1X4 rack and pushed MC1 suspension cables in to disregard loose connections, but as we came back into the control room we again saw it being kicked randomly! We even turned damping off for a little while and this random kicking didn't stop. There was no significant seismic motion at the time so it is still unclear of what is happening.

ELOG V3.1.3-