40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Entry  Wed Sep 29 17:10:09 2021, Anchal, Summary, CDS, c1teststand problems summary c1teststand_issues_summary.pdf
    Reply  Thu Sep 30 14:09:37 2021, Anchal, Summary, CDS, New way to ssh into c1teststand 
    Reply  Mon Oct 4 11:05:44 2021, Anchal, Summary, CDS, c1teststand problems summary 
       Reply  Mon Oct 4 18:00:16 2021, Koji, Summary, CDS, c1teststand problems summary 
          Reply  Tue Oct 5 17:58:52 2021, Anchal, Summary, CDS, c1teststand problems summary 
       Reply  Tue Oct 5 18:00:53 2021, Anchal, Summary, CDS, c1teststand time synchronization working now 
          Reply  Mon Oct 11 17:31:25 2021, Anchal, Summary, CDS, Fixed mounting of mx devices in fb. daqd_dc is running now. 
             Reply  Mon Oct 11 18:29:35 2021, Anchal, Summary, CDS, Moving forward? 
                Reply  Tue Oct 12 17:20:12 2021, Anchal, Summary, CDS, Connected c1sus2 to martian network 
                   Reply  Tue Oct 12 23:42:56 2021, Koji, Summary, CDS, Connected c1sus2 to martian network 
                      Reply  Wed Oct 13 11:25:14 2021, Anchal, Summary, CDS, Ran c1sus2 models in martian CDS. All good! CDS_screens_running.png
                         Reply  Tue Oct 19 18:20:33 2021, Ian MacMillan, Summary, CDS, c1sus2 DAC to ADC test data3_Plots.pdfdata2_Plots.pdf
                            Reply  Tue Oct 19 23:43:09 2021, Koji, Summary, CDS, c1sus2 DAC to ADC test P_20211019_224433.jpgP_20211019_224122.jpgP_20211019_224400.jpgP_20211019_224411.jpg
                               Reply  Wed Oct 20 11:48:27 2021, Anchal, Summary, CDS, Power supple configured correctly. 
                               Reply  Tue Oct 26 18:24:00 2021, Ian MacMillan, Summary, CDS, c1sus2 DAC to ADC test data2_Plots.pdfdata3_Plots.pdf
             Reply  Tue Oct 12 17:10:56 2021, Anchal, Summary, CDS, Some more information 
Message ID: 16372     Entry time: Mon Oct 4 11:05:44 2021     In reply to: 16365     Reply to this: 16376   16382
Author: Anchal 
Type: Summary 
Category: CDS 
Subject: c1teststand problems summary 

[Anchal, Paco]

We tried to fix the ntp synchronization in c1teststand today by repeating the steps listed in 40m/16302. Even though teh cloned fb1 now has the exact same package version, conf & service files, and status, the FE machines (c1bhd and c1sus2) fail to sync to the time. the timedatectl shows the same stauts 'Idle'. We also, dug bit deeper into the error messages of daq_dc on cloned fb1 and mx_stream on FE machines and have some error messages to report here.


Attempt on fixing the ntp

  • We copied the ntp package version 1:4.2.6 deb file from /var/cache/apt/archives/ntp_1%3a4.2.6.p5+dfsg-7+deb8u3_amd64.deb on the martian fb1 to the cloned fb1 and ran.
    controls@fb1:~ 0$ sudo dbpg -i ntp_1%3a4.2.6.p5+dfsg-7+deb8u3_amd64.deb
  • We got error messages about missing dependencies of libopts25 and libssl1.1. We downloaded oldoldstable jessie versions of these packages from here and here. We ensured that these versions are higher than the required versions for ntp. We installed them with:
    controls@fb1:~ 0$ sudo dbpg -i libopts25_5.18.12-3_amd64.deb 
    controls@fb1:~ 0$ sudo dbpg -i libssl1.1_1.1.0l-1~deb9u4_amd64.deb
  • Then we installed the ntp package as described above. It asked us if we want to keep the configuration file, we pressed Y.
  • However, we decided to make the configuration and service files exactly same as martian fb1 to make it same in cloned fb1. We copied /etc/ntp.conf and /etc/systemd/system/ntp.service files from martian fb1 to cloned fb1 in the same positions. Then we enabled ntp, reloaded the daemon, and restarted ntp service:
    controls@fb1:~ 0$ sudo systemctl enable ntp
    controls@fb1:~ 0$ sudo systemctl daemon-reload
    controls@fb1:~ 0$ sudo systemctl restart ntp
  • But ofcourse, since fb1 doesn't have internet access, we got some errors in status of the ntp.service:
    controls@fb1:~ 0$ sudo systemctl status ntp
    ● ntp.service - NTP daemon (custom service)
       Loaded: loaded (/etc/systemd/system/ntp.service; enabled)
       Active: active (running) since Mon 2021-10-04 17:12:58 UTC; 1h 15min ago
     Main PID: 26807 (code=exited, status=0/SUCCESS)
       CGroup: /system.slice/ntp.service
               ├─30408 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:107
               └─30525 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:107
    
    Oct 04 17:48:42 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
    Oct 04 17:48:52 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
    Oct 04 18:05:05 fb1 ntpd_intres[30525]: host name not found: 0.debian.pool.ntp.org
    Oct 04 18:05:15 fb1 ntpd_intres[30525]: host name not found: 1.debian.pool.ntp.org
    Oct 04 18:05:25 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
    Oct 04 18:05:35 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
    Oct 04 18:21:48 fb1 ntpd_intres[30525]: host name not found: 0.debian.pool.ntp.org
    Oct 04 18:21:58 fb1 ntpd_intres[30525]: host name not found: 1.debian.pool.ntp.org
    Oct 04 18:22:08 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
    Oct 04 18:22:18 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
  • But the ntpq command is giving the saem output as given by ntpq comman in martian fb1 (except for the source servers), that the broadcasting is happening in the same manner:
    controls@fb1:~ 0$ ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
     192.168.123.255 .BCST.          16 u    -   64    0    0.000    0.000   0.000
    
  • On the FE machines side though, the systemd-timesyncd are still unable to read the time signal from fb1 and show the status as idle:
    controls@c1bhd:~ 3$ timedatectl
          Local time: Mon 2021-10-04 18:34:38 UTC
      Universal time: Mon 2021-10-04 18:34:38 UTC
            RTC time: Mon 2021-10-04 18:34:38
           Time zone: Etc/UTC (UTC, +0000)
         NTP enabled: yes
    NTP synchronized: no
     RTC in local TZ: no
          DST active: n/a
    controls@c1bhd:~ 0$ systemctl status systemd-timesyncd -l
    ● systemd-timesyncd.service - Network Time Synchronization
       Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled)
       Active: active (running) since Mon 2021-10-04 17:21:29 UTC; 1h 13min ago
         Docs: man:systemd-timesyncd.service(8)
     Main PID: 244 (systemd-timesyn)
       Status: "Idle."
       CGroup: /system.slice/systemd-timesyncd.service
               └─244 /lib/systemd/systemd-timesyncd
  • So the time synchronization is still not working. We expected the FE machined to just synchronize to fb1 even though it doesn't have any upstream ntp server to synchronize to. But that didn't happen.
  • I'm (Anchal) working on getting internet access to c1teststand computers.

Digging into mx_stream/daqd_dc errors:

  • We went and changed the Restart fileld in /etc/systemd/system/daqd_dc.service on cloned fb1 to 2. This allows the service to fail and stop restarting after two attempts. This allows us to see the real error message instead of the systemd error message that the service is restarting too often. We got following:
    controls@fb1:~ 3$ sudo systemctl status daqd_dc -l
    ● daqd_dc.service - Advanced LIGO RTS daqd data concentrator
       Loaded: loaded (/etc/systemd/system/daqd_dc.service; enabled)
       Active: failed (Result: exit-code) since Mon 2021-10-04 17:50:25 UTC; 22s ago
      Process: 715 ExecStart=/usr/bin/daqd_dc_mx -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.dc (code=exited, status=1/FAILURE)
     Main PID: 715 (code=exited, status=1/FAILURE)
    
    Oct 04 17:50:24 fb1 systemd[1]: Started Advanced LIGO RTS daqd data concentrator.
    Oct 04 17:50:25 fb1 daqd_dc_mx[715]: [Mon Oct  4 17:50:25 2021] Unable to set to nice = -20 -error Unknown error -1
    Oct 04 17:50:25 fb1 daqd_dc_mx[715]: Failed to do mx_get_info: MX not initialized.
    Oct 04 17:50:25 fb1 daqd_dc_mx[715]: 263596
    Oct 04 17:50:25 fb1 systemd[1]: daqd_dc.service: main process exited, code=exited, status=1/FAILURE
    Oct 04 17:50:25 fb1 systemd[1]: Unit daqd_dc.service entered failed state.
    
  • It seemed like the only thing daqd_dc process doesn't like is that mx_stream services are in failed state in teh FE computers. So we did the same process on FE machines to get the real error messages:
    controls@fb1:~ 0$ sudo chroot /diskless/root
    fb1:/ 0#
    fb1:/ 0# sudo nano /etc/systemd/system/mx_stream.service
    fb1:/ 0#
    fb1:/ 0# exit
  • Then I ssh'ed into c1bhd to see the error message on mx_stream service properly.
    controls@c1bhd:~ 0$ sudo systemctl daemon-reload
    controls@c1bhd:~ 0$ sudo systemctl restart mx_stream
    controls@c1bhd:~ 0$ sudo systemctl status mx_stream -l
    ● mx_stream.service - Advanced LIGO RTS front end mx stream
       Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
       Active: failed (Result: exit-code) since Mon 2021-10-04 17:57:20 UTC; 24s ago
      Process: 11832 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
     Main PID: 11832 (code=exited, status=1/FAILURE)
    
    Oct 04 17:57:20 c1bhd systemd[1]: Starting Advanced LIGO RTS front end mx stream...
    Oct 04 17:57:20 c1bhd systemd[1]: Started Advanced LIGO RTS front end mx stream.
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: send len = 263596
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: mx_connect failed Nic ID not Found in Peer Table
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: c1x06_daq mmapped address is 0x7f516a97a000
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: c1bhd_daq mmapped address is 0x7f516697a000
    Oct 04 17:57:20 c1bhd systemd[1]: mx_stream.service: main process exited, code=exited, status=1/FAILURE
    Oct 04 17:57:20 c1bhd systemd[1]: Unit mx_stream.service entered failed state.
    
  • c1sus2 shows the same error. I'm not sure I understand these errors at all. But they seem to have nothing to do with timing issuessurprise!

As usual, some help would be helpful

ELOG V3.1.3-