40m Log
Message ID: 6911     Entry time: Wed Jul 4 17:33:04 2012
Author: Jamie 
Type: Update 
Category: CDS 
Subject: timing, possibly leap second, brought down CDS 

I got a call from Koji and Yuta that something was wrong with the CDS system.  I somehow had an immediate suspicion that it had something to do with the recent leap second.

It took a while for nodus to respond, and once he finally let me in I found a bunch of the following in his dmesg, repeated and filling the buffer:

Jul  3 22:41:34 nodus xntpd[306]: [ID 774427 daemon.notice] time reset (step) 0.998366 s
Jul  3 22:46:20 nodus xntpd[306]: [ID 774427 daemon.notice] time reset (step) -1.000847 s
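To see how often and by how much the clock was being stepped, lines like the two above can be reduced to a timestamp and a step size. This is only a sketch: the function name xntpd_steps is made up for illustration, it assumes the syslog line format shown above, and it reads from stdin, so where the log lives on nodus is left to the caller.

```shell
# Hypothetical helper (not part of nodus's setup): read syslog-format lines
# on stdin and print "month day time step_seconds" for each xntpd step reset.
xntpd_steps() {
  # The step value is the second-to-last field; the last field is the unit "s".
  awk '/xntpd.*time reset [(]step[)]/ { print $1, $2, $3, $(NF-1) }'
}

# Example using the two lines quoted in this entry.
printf '%s\n' \
  'Jul  3 22:41:34 nodus xntpd[306]: [ID 774427 daemon.notice] time reset (step) 0.998366 s' \
  'Jul  3 22:46:20 nodus xntpd[306]: [ID 774427 daemon.notice] time reset (step) -1.000847 s' \
  | xntpd_steps
```

A pattern of alternating roughly +1 s and -1 s steps, as in these two lines, would be consistent with the daemon flipping between sources that disagree about the leap second.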

Looking at date on all the front end systems, including fb, I could tell that they all looked a second fast, which is what you would expect if they had missed the leap second.  Everything syncs against nodus, so given nodus's problems above, that might explain everything.

I stopped daqd and nds on fb, and unloaded the mx drivers, which seemed to be showing problems.  I also stopped nodus's xntp:

  sudo /etc/init.d/xntpd stop

His ntp config file is in /etc/inet/ntp.conf, which is definitely the WRONG PLACE, given that the ntp server is not, as far as I can tell, being controlled by inetd.  (nodus is WAY out of date and desperately needs an overhaul; it's nearly impossible to figure out what the hell is going on in there.)  I found an old elog of Rana's that mentioned updating his config to point him to the Caltech NTP server, which is now listed in the config, so I tried manually resyncing against that:

  sudo ntpdate -s -b -u 131.215.239.14

Unfortunately that didn't seem to have any effect, which made me wonder whether the Caltech server is off.  Anyway, I tried resyncing against the global NTP pool:

  sudo ntpdate -s -b -u pool.ntp.org

This seemed to work: the clock came back in sync with others that are known good.  Once nodus's time was good I reloaded the mx drivers on fb and restarted daqd and nds.  They seemed to come up fine.  At this point the front ends started coming back on their own.  I went and restarted all the models on the machines that didn't (c1iscey and c1ioo).  Currently everything is looking ok.

I'm worried that there is still a problem with one of the NTP servers that nodus is sync'ing against, and that the problem might come back.  I'll check in again later tonight.
