40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Entry  Mon Aug 23 11:51:26 2021, Koji, Update, General, Campus Wide Power Glitch Reported: Monday, 8/23/21 at 9:30am  
    Reply  Mon Aug 23 19:00:05 2021, Koji, Update, General, Campus Wide Power Glitch Reported: Monday, 8/23/21 at 9:30am  
       Reply  Mon Aug 23 22:51:44 2021, Anchal, Update, General, Time synchronization efforts 
          Reply  Tue Aug 24 09:22:48 2021, Anchal, Update, General, Time synchronization working now 
             Reply  Tue Aug 24 18:11:27 2021, Paco, Update, General, Time synchronization not really working 
                Reply  Tue Aug 24 22:37:40 2021, Anchal, Update, General, Time synchronization not really working 
Message ID: 16293     Entry time: Tue Aug 24 18:11:27 2021     In reply to: 16292     Reply to this: 16295
Author: Paco 
Type: Update 
Category: General 
Subject: Time synchronization not really working 

tl;dr: NTP servers and clients were never synchronized, are not synchronizing even with ntp... nodus is synchronized but uses chronyd; should we use chronyd everywhere?


Spent some time investigating the ntp synchronization. In the morning, after Anchal set up all the ntp servers / FE clients I tried restarting the rts IOPs with no success. Later, with Tega we tried the usual manual matching of the date between c1iscex and fb1 machines but we iterated over different n-second offsets from -10 to +10, also without success.

This afternoon, I tried debugging the FE and fb1 timing differences. For this I inspected the ntp configuration file under /etc/ntp.conf in both the fb1 and /diskless/root.jessie/etc/ntp.conf (for the FE machines) and tried different combinations with and without nodus, with and without restrict lines, all while looking at the output of sudo journalctl -f on c1iscey. Everytime I changed the ntp config file, I restarted the service using sudo systemctl restart ntp.service . Looking through some online forums, people suggested basic pinging to see if the ntp servers were up (and broadcasting their times over the local network) but this failed to run (read-only filesystem) so I went into fb1, and ran sudo chroot /diskless/root.jessie/ /bin/bash to allow me to change file permissions. The test was first done with /bin/ping which couldn't even open a socket (root access needed) by running chmod 4755 /bin/ping then ssh-ing into c1iscey and pinging the fb1 machine successfully. After this, I ran chmod 4755 /usr/sbin/ntpd so that the ntp daemon would have no problem in reaching the server in case this was blocking the synchronization. I exited the chroot shell and the ntp daemon in c1iscey; but the ntpstat still showed unsynchronised status. I also learned that when running an ntp query with ntpq -p if a client has succeeded in synchronizing its time to the server time, an asterisk should be appended at the end. This was not the case in any FE machine... and looking at fb1, this was also not true. Although the fb1 peers are correctly listed as nodus, the caltech ntp server, and a broadcast (.BCST.) server from local time (meant to serve the FE machines), none appears to have synchronized... Going one level further, in nodus I checked the time synchronization servers by running chronyc sources the output shows

controls@nodus|~> chronyc sources
210 Number of sources = 4
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* testntp1.superonline.net      1  10   377   280  +1511us[+1403us] +/-   92ms
^+ 38.229.59.9                   2  10   377   206  +8219us[+8219us] +/-  117ms
^+ tms04.deltatelesystems.ru     2  10   377   23m    -17ms[  -17ms] +/-  183ms
^+ ntp.gnc.am                    3  10   377   914  -8294us[-8401us] +/-  168ms

I then ran chronyc clients to find if fb1 was listed (as I would have expected) but the output shows this --

Hostname                   Client    Peer CmdAuth CmdNorm  CmdBad  LstN  LstC
=========================  ======  ======  ======  ======  ======  ====  ====
501 Not authorised

So clearly chronyd succeeded in synchronizing nodus' time to whatever server it was pointed at but downstream from there, neither the fb1 or any FE machines seem to be synchronizing properly. It may be as simple as figuring out the correct ntp configuration file, or switching to chronyd for all machines (for the sake of homogeneity?)

ELOG V3.1.3-