I got a call from Koji and Yuta that something was wrong with the CDS system. I somehow had an immediate suspicion that it had something to do with the recent leap second.
It took a while for nodus to respond, and once he finally let me in I found a bunch of the following in his dmesg, repeated and filling the buffer:
Jul 3 22:41:34 nodus xntpd[306]: [ID 774427 daemon.notice] time reset (step) 0.998366 s
Jul 3 22:46:20 nodus xntpd[306]: [ID 774427 daemon.notice] time reset (step) -1.000847 s
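For picking just the step sizes out of a buffer full of these, a small filter along the following lines should do. This is only a sketch: `ntp_steps` is a made-up helper name, and it assumes the messages keep exactly the format shown above (step value second to last on the line, followed by `s`).

```shell
# Hypothetical helper: read syslog lines on stdin and print the size (in
# seconds) of every xntpd "time reset (step)" event, one per line.
# Assumes the message format shown in the log lines above.
ntp_steps() {
  awk '/xntpd.*time reset \(step\)/ { print $(NF-1) }'
}

# Usage (the log path is an assumption, substitute wherever syslog writes):
#   ntp_steps < /var/adm/messages
```

Steps that alternate between roughly +1 s and -1 s, as in the two lines above, are then easy to spot.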
Looking at date on all the front end systems, including fb, I could tell that they all looked a second fast, which is what you would expect if they had missed the leap second. Everything syncs against nodus, so given nodus's problems above, that might explain everything.
I stopped daqd and nds on fb, and unloaded the mx drivers, which seemed to be showing problems. I also stopped nodus's xntp:
sudo /etc/init.d/xntpd stop
His ntp config file is /etc/inet/ntp.conf, which is definitely the WRONG PLACE, given that the ntp server is not, as far as I can tell, being controlled by inetd. (nodus is WAY out of date and desperately needs an overhaul. It's nearly impossible to figure out what the hell is going on in there.) I found an old elog of Rana's that mentioned updating his config to point him to the Caltech NTP server, which is now listed in the config, so I tried manually resyncing against that:
sudo ntpdate -s -b -u 131.215.239.14
Unfortunately that didn't seem to have any effect, which makes me wonder whether the Caltech server itself is off. Anyway, I tried resyncing against the global NTP pool:
sudo ntpdate -s -b -u pool.ntp.org
This seemed to work: the clock came back in sync with others that are known good. Once nodus's time was good I reloaded the mx drivers on fb and restarted daqd and nds. They seemed to come up fine. At this point the front ends started coming back on their own. I went and restarted all the models on the machines that didn't (c1iscey and c1ioo). Currently everything is looking ok.
I'm worried that there is still a problem with one of the NTP servers that nodus is sync'ing against, and that the problem might come back. I'll check in again later tonight.
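For the later check, one way to look at each server without touching the clock is ntpdate's query-only mode (`-q`). A rough sketch, with assumptions flagged: `check_ntp_offset` and the `NTPDATE` override variable are made-up names, and I'm assuming the ntpdate on nodus accepts `-q` the same way current versions do.

```shell
# Hypothetical helper: query each NTP server given as an argument and print
# whatever offset report ntpdate returns, WITHOUT setting the local clock.
#   -q  query only, do not step or slew the clock
#   -u  use an unprivileged source port (same flag as in the commands above)
# NTPDATE is an assumed override so another binary can be substituted.
check_ntp_offset() {
  for server in "$@"; do
    echo "== $server"
    "${NTPDATE:-ntpdate}" -q -u "$server" || echo "query failed: $server"
  done
}

# Usage, with the two servers tried above:
#   check_ntp_offset 131.215.239.14 pool.ntp.org
```

If the two servers report offsets that differ by about a second, that would point at one of them having mishandled the leap second.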