Subject: Power glitch 

Today's recovery seems to be a lot more complicated than usual.

  • The vertex area of the lab is pretty warm - I think the ACs are not running. The wall switch-box (see Attachment #1) shows some red lights which I'm pretty sure are usually green. I pressed the push-buttons above the red lights; hopefully this fixes the AC and the lab cools down soon.
  • Related to the above - C1IOO has a bunch of orange warning indicator lights ON that suggest it is feeling the heat. Not sure if that is the reason, but I am unable to bring any of the C1IOO models back online - the rtcds compilation just fails, after which I am also unable to ssh back into the machine.
  • C1SUS was problematic as well. I found that the expansion chassis was not powered. This was fixed by simply switching to the one free socket on the power strip that powers a bunch of stuff on 1X4 - this brought the expansion chassis back alive, and after a soft reboot of c1sus, I was able to get these models up and running. Fortunately, none of the electronics seem to have been damaged. Perhaps it is time for surge-protecting power strips inside the lab area as well (if they aren't already)?
  • I was unable to resolve the dmesg problem alluded to earlier. Looking through some forums, I gather that the output of dmesg should be written to a file in /var/log/. But no such file exists on any of our 5 front-ends (though it does on Megatron, for example). So is this way of setting up the front-end machines deliberate? Why does this matter? Because it seems that the buffer which we see when we simply run "dmesg" on the console gets periodically cleared. So some time back, when I was trying to verify that the installed DACs are indeed 16-bit DACs by looking at dmesg, running "dmesg | head" showed a first line that was written well after the last reboot of the machine. Anyway, this probably isn't a big deal, and I also verified during the model recompilation that all our DACs are indeed 16-bit.
  • I was also trying to set up the Upstart processes on megatron such that the MC autolocker and FSS slow control scripts start up automatically when the machine is rebooted. But since C1IOO isn't co-operating, I wasn't able to get very far on this front either...

So the current status is that all front-end models except those hosted on C1IOO are back up and running. Further recovery efforts are in progress.

GV Jun 5 6pm: From my discussion with Jamie, I gather that the dmesg output is not written to file because our front-ends are diskless (this is also why the ring buffer, which is what we are reading from when running "dmesg", gets cleared periodically).
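
If we ever want to keep the early boot messages around, one possible workaround (just a sketch, not something that is set up on any of our machines) would be to snapshot the ring buffer to a timestamped file on some shared, disk-backed directory before the older lines disappear. The function name, the output directory, and the idea of running this by hand or from a periodic job are all assumptions of mine. It needs Python 3.7+, and on some systems reading dmesg may need elevated privileges.

```python
import subprocess
import time
from pathlib import Path


def snapshot_dmesg(out_dir, cmd=("dmesg",)):
    """Run `cmd` (dmesg by default) and save its stdout to a timestamped file.

    out_dir is a hypothetical destination; on a diskless front-end it would
    have to be a network-mounted directory that lives on a real disk.
    Returns the path of the file that was written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # check=True raises CalledProcessError if the command exits non-zero
    result = subprocess.run(list(cmd), capture_output=True, text=True, check=True)

    stamp = time.strftime("%Y%m%d_%H%M%S")
    out_path = out_dir / f"dmesg_{stamp}.log"
    out_path.write_text(result.stdout)
    return out_path
```

Usage would be something like snapshot_dmesg("/path/to/shared/logs"), run shortly after a reboot so that the lines we care about (e.g. the DAC identification) are still in the buffer.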



Looks like there was a power glitch at around 10am today.

All frontends, FB, Megatron, Optimus were offline. Chiara reports an uptime of 666 days, so it looks like its UPS works fine. PSL was tripped, probably the end lasers too (yet to check). Slow machines seem alright (they respond to ping, and I can also telnet into them).
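
The "can I telnet into it" check for the slow machines could be scripted roughly as below. This is only a sketch: it assumes the machines accept connections on the default telnet port (23), it only tests that the TCP port opens (it does not log in), and the host names passed to survey() would have to be filled in by whoever runs it.

```python
import socket


def port_open(host, port=23, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers connection refused, unreachable host, DNS failure and timeout
        return False


def survey(hosts, port=23, timeout=2.0):
    """Check each host in turn and return a {host: reachable} dictionary."""
    return {host: port_open(host, port, timeout) for host in hosts}
```

Calling survey() with the list of slow machine host names gives a quick True/False table after a glitch, which is a bit faster than pinging and telnetting into each one by hand.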

Since all the frontends have to be restarted manually, I am taking this opportunity to investigate some CDS issues like the lack of a dmesg log file on some of the frontends. So the IFO will be offline for some time.


