Entry  Fri Jun 2 12:32:16 2017, gautam, Update, General, Power glitch 
    Reply  Fri Jun 2 16:02:34 2017, gautam, Update, General, Power glitch [IMG_7399.JPG]
       Reply  Fri Jun 2 22:01:52 2017, gautam, Update, General, Power glitch - recovery [power_glitch_recovery.png, IMG_7406.JPG, IMG_7407.JPG, IMG_7400.JPG]
          Reply  Sun Jun 4 15:59:50 2017, gautam, Update, General, Power glitch - recovery [powerGlitchRecovery.png]
Message ID: 13036     Entry time: Fri Jun 2 22:01:52 2017     In reply to: 13035     Reply to this: 13038
Author: gautam 
Type: Update 
Category: General 
Subject: Power glitch - recovery 

[Koji, Rana, Gautam]

Attachment #1 - CDS status at the end of today's efforts. There is one red indicator light showing an RFM error which couldn't be fixed by running the "global diag reset" or "mxstream restart" scripts, but getting to this point was a journey so we decided to call it for today.

We started from the state described in the previous elog - c1ioo wasn't ssh-able, but was responding to ping. We then did the following:

  1. Killed all models on the four front ends other than c1ioo. 
  2. Hard rebooted c1ioo - at this point, we could ssh into c1ioo. With all other models killed, we restarted the c1ioo models one by one. They all came online smoothly.
  3. We then set about restarting the models on the other machines.
    • We started with the IOP models, and then restarted the others one by one
    • We then tried running "global diag reset", "mxstream restart" and "telnet fb 8087 -> shutdown" to get rid of all the red indicator fields on the CDS overview screen.
    • All models came back online, but the models on c1sus indicated a DC (data concentrator?) error. 
  4. After a few minutes, I noticed that all the models on c1iscex had stalled
    • dmesg pointed to a synchronization error when trying to initialize the ADC
    • The field that normally pulses at ~1pps on the CDS overview MEDM screen when the models are running normally was stuck
    • Repeated attempts to restart the models kept throwing up the same error in dmesg 
    • We even tried killing all models on all other frontends and restarting just those on c1iscex as detailed earlier in this elog for c1ioo - to no avail.
    • A walk to the end station to do a hard reboot of c1iscex revealed that both green indicator lights on the slave timing card in the expansion chassis were OFF.
    • The corresponding lights on the Master Timing Sequencer (which supplies the synchronization signal to all the front ends via optical fiber) were also off.
    • Some time ago, Eric and I had noticed a similar problem. Back then, we simply switched the connection on the Master Timing Sequencer to the one unused available port, and this fixed the problem. This time, switching the fiber connection on the Master Timing Sequencer had no effect.
    • Power cycling the Master Timing Sequencer had no effect
    • However, swapping the optical fiber connections going to the X and Y ends led to the green LED on the suspect port on the Master Timing Sequencer (where the X end fiber was originally plugged in) turning back ON when the Y end fiber was plugged in.
    • This suggested a problem with the slave timing card, and not the master. 
  5. Koji and I then did the following at the X-end electronics rack:
    • Shut down c1iscex, toggled the switches in the front and back of the expansion chassis
    • Disconnected AC power from the rear of c1iscex as well as the expansion chassis. At this point all LEDs in the expansion chassis went off, except a single one labelled "+5AUX" on the PCB - to make this go off, we had to disconnect a jumper on the PCB (see Attachment #2), and then toggle the power switches on the front and back of the expansion chassis (with the AC power still disconnected). Finally all lights were off.
    • Confident we had completely cut all power to the board, we then started re-connecting AC power. First we re-started the expansion chassis, and then re-booted c1iscex.
    • The lights on the slave timing card came on (including the one that pulses at ~1pps, which indicates normal operation)!
  6. Then we went back to the control room, and essentially repeated steps 2 and 3, but starting with c1iscex instead of c1ioo.
  7. The last twist in this tale was that though all the models came back online, the DC errors on c1sus models persisted. No amount of "mxstream restart", "global diag reset", or restarting fb would make these go away.
  8. Eventually, Koji noticed a large discrepancy between the gpstime indicated in c1x02 (the IOP model on c1sus) and that of all the other IOP models (even though the PDT displayed was correct). There were also a large number of IRIG-B errors indicated on the same c1x02 status screen, and the "TIM" indicator in the status word was red.
  9. It turns out that running ntpdate before restarting all the models somehow doesn't sync the gps time, and this mismatch was what was causing the DC errors. 
  10. So we did a hard reboot of c1sus (and for good measure, repeated the steps of item 5 above on c1sus and its expansion chassis). Then we tried starting the c1x02 model without running ntpdate first. (On startup, there is an 8 hour mismatch between the actual time in Pasadena and the system time - but the system time is 8 hours behind, so it isn't even somehow syncing to UTC or any other real timezone?)
    • Model started up smoothly
    • But there was still a 1 second discrepancy between the gpstime on c1x02 and all the other IOPs (and the 8 hour discrepancy between displayed PDT and actual time in Pasadena)
    • So we tried running ntpdate after starting c1x02 - this finally fixed the problem, gpstime and PDT on c1x02 agreed with the other frontends and the actual time in Pasadena.
    • However, the models on c1lsc and c1ioo crashed
    • So we restarted the IOPs on both these machines, and then the rest of the models.
  11. Finally, we ran "mxstream restart", "global diag reset", and restarted fb, to make the CDS overview screen look like it does now.
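
For next time, a quick way to spot the kind of gpstime mismatch we saw on c1x02 is to compare the value reported by each IOP against the majority. The following is only a minimal sketch: the per-host readings are hypothetical numbers typed in by hand (nothing here talks to the front ends or to EPICS), the sign of the 1 second offset is made up, and the UTC-to-GPS conversion assumes the 18 second GPS-UTC leap second offset that has applied since 2017-01-01.

```python
from datetime import datetime
from statistics import median_low

GPS_EPOCH = datetime(1980, 1, 6)  # GPS time zero, expressed in UTC
GPS_MINUS_UTC = 18                # leap-second offset in seconds, valid from 2017-01-01


def utc_to_gps(utc):
    """Convert a naive UTC datetime to integer GPS seconds."""
    return int((utc - GPS_EPOCH).total_seconds()) + GPS_MINUS_UTC


def find_outliers(gps_by_host, tolerance=0):
    """Return {host: offset in seconds} for hosts that disagree with the majority value."""
    reference = median_low(gps_by_host.values())
    return {
        host: gps - reference
        for host, gps in gps_by_host.items()
        if abs(gps - reference) > tolerance
    }


# Hypothetical readings: three front ends agree, c1sus is off by one second.
now = utc_to_gps(datetime(2017, 6, 3, 5, 1, 52))  # entry time of this elog, in UTC
readings = {"c1ioo": now, "c1lsc": now, "c1iscex": now, "c1sus": now - 1}
print(find_outliers(readings))
```

Using the median as the reference only works while most front ends agree with each other, which was the case here (only c1x02 was off).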

Why does ntpdate behave this way? And only on one of the frontends? And what is the remaining RFM error? 
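
To spell out the timezone arithmetic from step 10: PDT is UTC-7, so a clock that reads 8 hours behind Pasadena time corresponds to UTC-15, which is outside the range of real timezone offsets (UTC-12 to UTC+14). A small sketch of that check, using the entry time of this elog as an example instant (the function name is just for illustration):

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")


def implied_utc_offset_hours(actual_local, displayed_naive):
    """UTC offset (in hours) a clock would need in order to show
    `displayed_naive` at the instant `actual_local`."""
    utc_naive = actual_local.astimezone(timezone.utc).replace(tzinfo=None)
    return (displayed_naive - utc_naive) / timedelta(hours=1)


# Example instant: actual Pasadena time, and a system clock reading 8 hours behind it.
actual = datetime(2017, 6, 2, 22, 1, 52, tzinfo=PDT)
displayed = actual.replace(tzinfo=None) - timedelta(hours=8)

offset = implied_utc_offset_hours(actual, displayed)
is_real_zone = -12 <= offset <= 14  # range of UTC offsets in actual use
print(offset, is_real_zone)
```

This only restates the puzzle in numbers; it does not explain where the 8 hours comes from.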

Koji then restarted the IMC autolocker and FSS slow processes on megatron. The IMC locked almost immediately. The MC2 transmon indicated a large shift in the spot position, and the PMC transmission is also pretty low (while the lab temperature equilibrates after the AC being off during peak daytime heat). So the MC transmission is ~14,500 counts, while we are used to more like 16,500 counts nowadays.

Re-alignment of the IFO remains to be done. I also did not restart the end lasers, or set up the Marconi with nominal params. 

Attachment #3 - Status of the Master Timing Sequencer after various reboots and power cycling of front ends and associated electronics.

Attachment #4 - Warning lights on C1IOO


Today's recovery seems to be a lot more complicated than usual.

So the current status is that all front-end models except those hosted on C1IOO are back up and running. Further recovery efforts are in progress.  


Attachment 1: power_glitch_recovery.png  23 kB
Attachment 2: IMG_7406.JPG  2.621 MB  Uploaded Sat Jun 3 16:59:40 2017
Attachment 3: IMG_7407.JPG  1.203 MB  Uploaded Sat Jun 3 17:01:19 2017
Attachment 4: IMG_7400.JPG  1.745 MB  Uploaded Sat Jun 3 17:02:10 2017