40m Log
Entry  Wed Dec 9 20:14:49 2020, gautam, Update, VAC, UPS failure (attachments: rackBeforenAfter.pdf, IMG_0008.jpg, IMG_0009.jpg, vacStatus.png)
    Reply  Thu Dec 10 13:05:52 2020, Jon, Update, VAC, UPS failure (attachments: TP2_time_history.png, slow_controls_monitors.png)
       Reply  Thu Dec 10 14:29:26 2020, gautam, Update, VAC, UPS failure (attachments: vacDiag1.png)
Message ID: 15724     Entry time: Thu Dec 10 13:05:52 2020     In reply to: 15721     Reply to this: 15725
Author: Jon 
Type: Update 
Category: VAC 
Subject: UPS failure 

I've investigated the vacuum controls failure that occurred last night. Here's what I believe happened.

The system logs show a sudden loss of power to the control computer (c1vac), and the system was then down for about three and a half hours. The syslog shows normal EPICS channel writes (pressure readback updates, etc., many per minute) that stop abruptly at 4:12 pm. There are no error or shutdown messages in the syslog or in the interlock log. The next activity is the normal start-up messaging at 7:39 pm. This is all consistent with the UPS suddenly failing.
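
The downtime window can be read straight off the syslog by looking for the longest silence between consecutive entries. Below is a minimal sketch, assuming the classic syslog timestamp prefix ("Dec  9 16:12:03 ..."); the function name, the sample lines, and the seconds values are illustrative and are not taken from the actual c1vac log.

```python
from datetime import datetime


def find_largest_gap(lines, year=2020):
    """Return (gap, last_before, first_after) for the longest silence
    between consecutive timestamped lines, or None if fewer than two
    lines could be parsed."""
    stamps = []
    for line in lines:
        try:
            # Assumed format: 15-character prefix "Mon DD HH:MM:SS" with no
            # year, so the year is supplied by the caller.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        stamps.append(stamp)

    if len(stamps) < 2:
        return None

    # Compare each timestamp with its successor and keep the widest gap.
    pairs = zip(stamps, stamps[1:])
    before, after = max(pairs, key=lambda p: p[1] - p[0])
    return after - before, before, after


# Illustrative lines only (hypothetical message text and seconds).
sample = [
    "Dec  9 16:12:01 c1vac channel write",
    "Dec  9 16:12:02 c1vac channel write",
    "Dec  9 19:39:10 c1vac start-up message",
]
gap, before, after = find_largest_gap(sample)
print(f"Longest silence: {gap} (from {before} to {after})")
```

With channel writes arriving many times per minute, any gap of more than a few minutes is a strong sign that the machine itself was off rather than idle.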

According to the Tripp Lite manual, the FAULT icon indicates "the battery-supported outlets are overloaded." The failure of the TP2 dry pump appears to have caused this. After the dry pump failed, the rising pressure in the TP2 foreline drove TP2's current draw well above its normal operating range. Attachment 1 shows anomalously high TP2 current and foreline pressure in the minutes just before the failure. The critical system-wide problem is that this overloaded the UPS before it tripped TP2's internal protection circuitry, which would have shut down the pump, triggering interlocks and auto-notifications.

Preventing this in the future:

First, there is too much equipment on the 1 kVA UPS. The reason I asked us to buy a dual 208/120V UPS is to relieve the smaller 120V UPS. I envision moving the turbo pumps, gauge controllers, etc. all to the 5 kVA unit and reserving the smaller 1 kVA unit for the c1vac computer and its peripherals. We now have the dual 208/120V UPS in hand, so we should make it a priority to get it installed.

Second, there are 1 Hz "blinker" channels exposed for c1vac and all the slow controls machines, each reporting the machine's alive status. I don't think they're being monitored by any auto-notification program (running on a central machine), but they could be. Maybe there already exists code that could be co-opted for this purpose? There is an MEDM screen displaying the slow machine statuses at Sitemap > CDS > SLOW CONTROLS STATUS, pictured in Attachment 2. Monitoring these channels is the only way I know to catch sudden failures of the control computer itself.
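
To make the blinker idea concrete, here is a rough sketch of a watchdog that a central machine could run: it samples a blinker channel and declares the machine dead if the value has not changed within a timeout. The class name, the 5 s timeout, and the injected `read_value` callable are all assumptions for illustration; this is not existing 40m code.

```python
import time


class BlinkerWatchdog:
    """Flag a machine as dead when its blinker value stops changing.

    read_value: zero-argument callable returning the current blinker value,
                or None if the channel could not be read (hypothetical hook).
    timeout_s:  seconds without a change before the machine counts as dead
                (5 s is an arbitrary choice, several periods of a 1 Hz blinker).
    clock:      time source, injectable so the logic can be tested offline.
    """

    def __init__(self, read_value, timeout_s=5.0, clock=time.monotonic):
        self._read = read_value
        self._timeout = timeout_s
        self._clock = clock
        self._last_value = None
        self._last_change = clock()

    def poll(self):
        """Sample the channel once; return True if the machine looks alive."""
        try:
            value = self._read()
        except Exception:
            value = None  # treat a failed read the same as no update
        now = self._clock()
        if value is not None and value != self._last_value:
            # The blinker toggled, so the machine was alive at this instant.
            self._last_value = value
            self._last_change = now
        return (now - self._last_change) <= self._timeout
```

In practice `read_value` could wrap an EPICS read such as pyepics' `epics.caget` on the blinker channel (the channel names are not listed in this entry), with `poll()` called about once per second and a False result handed to whatever notification mechanism is already in use. Because the check runs on a different machine, it still fires when c1vac loses power outright, which is exactly the case the interlocks cannot report.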

Attachment 1: TP2_time_history.png  54 kB
Attachment 2: slow_controls_monitors.png  9 kB  Uploaded Thu Dec 10 14:14:32 2020