The acromags are on the UPS. I suspect the transient came in on one of the signal lines. Chub tells me he unplugged one of the signal cables from the chassis around the time things died on Monday, although we couldn't reproduce the problem doing that again today.
In this situation it wasn't the software that died, but the acromag units themselves. I have an idea to detect future occurrences using a "blinker" signal. One acromag outputs a periodic signal which is directly sensed by another acromag. The can be implemented as another polling condition enforced by the interlock code.
If the acromags lock up whenever there is an electrical spike, shouldn't we have them on UPS to smooth out these ripples? And wasn't the idea to have some handshake/watchdog system to avoid silently dying computers?
The problem encountered with the vac controls was indeed resolved via the recommendation I posted yesterday. The Acromags had gone into a protective state (likely caused by an electrical transient in one of the signals)
While working on the vac controls today, I also took care of some of the remaining to-do items. Below is a summary of what was done, and what still remains.
sudo mkfs -t ext4 /dev/sdb
sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync
controls@c1vac:~$ sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync
[sudo] password for controls:
^C283422+0 records in
283422+0 records out
18574344192 bytes (19 GB) copied, 719.699 s, 25.8 MB/s
I'm rebooting the IOLAN server to load new serial ports. The interlocks might trip when the pressure gauge readbacks cut out.
Today I implemented protection of the vac system against extended power losses. Previously, the vac controls system (both old and new) could not communicate with the APC Smart-UPS 2200 providing backup power. This was not an issue for short glitches, but for extended outages the system had no way of knowing it was running on dwindling reserve power. An intelligent system should sense the outage and put the IFO into a controlled shutdown, before the batteries are fully drained.
What enabled this was a workaround Gautam and I found for communicating with the UPS serially. Although the UPS has a serial port, neither the connector pinout nor the low-level command protocol are released by APC. The only official way to communicate with the UPS is through their high-level PowerChute software. However, we did find "unofficial" documentation of APC's protocol. Using this information, I was able to interface the the UPS to the IOLAN serial device server. This allowed the UPS status to be queried using the same Python/TCP sockets model as all the other serial devices (gauges, pumps, etc.). I created a new service called "serial_UPS.service" to persistently run this Python process like the others. I added a new EPICS channel "C1:Vac-UPS_status" which is updated by this process.
With all this in place, I added new logic to the interlock.py code which closes all valves and stops all pumps in the event of a power failure. To be conservative, this interlock is also tripped when the communications link with the UPS is disconnected (i.e., when the power state becomes unknown). I tested the new conditions against both communication failure (by disconnecting the serial cable) and power failure (by pressing the "Test" button on the UPS front panel). This protects TP2 and TP3. However, I discovered that TP1---the pump that might be most damaged by a sudden power failure---is not on the UPS. It's plugged directly into a 240V outlet along the wall. This is because the current UPS doesn't have any 240V sockets. I'd recommend we get one that can handle all the turbo pumps.
Pin 1: RxD
Pin 2: TxD
Pin 5: GND
Baud rate: 2400
Data bits: 8
Stop bits: 1
Work is completed and the vac system is back in its nominal state.
The vac controls system is going down for migration from Python 2.7 to 3.4. Will advise when it is back up.
I've converted all the vac control system code to run on Python 3.4, the latest version available through the Debian package manager. Note that these codes now REQUIRE Python 3.x. We decided there was no need to preserve Python 2.x compatibility. I'm leaving the vac system returned to its nominal state ("vacuum normal + RGA").
agreed - we need all pumps on UPS for their safety and also so that we can spin them down safely. Can you and Chub please find a suitable UPS?
However, I discovered that TP1---the pump that might be most damaged by a sudden power failure---is not on the UPS. It's plugged directly into a 240V outlet along the wall. This is because the current UPS doesn't have any 240V sockets. I'd recommend we get one that can handle all the turbo pumps.
While glancing at my Vacuum striptool, I noticed that the IFO pressure is 2e-4 torr. There was an "AC power loss" reported by C1Vac about 4 hours (14:07 local time) ago. We are investigating. I closed the PSL shutter.
Jon and I investigated at the vacuum rack. The UPS was reporting a normal status ("On Line"). Everything looked normal so we attempted to bring the system back to the nominal state. But TP2 drypump was making a loud rattling noise, and the TP2 foreline pressure was not coming down at a normal rate. We wonder if the TP2 drypump has somehow been damaged - we leave it for Chub to investigate and give a more professional assessment of the situation and what the appropriate course of action is.
The PSL shutter will remain closed overning, and the main volume and annuli are valved off. We spun up TP1 and TP3 and decided to leave them on (but they have negligible load).
Overnight pressure trends don't suggest anything went awry after the initial interlock trip. Some watchdog script that monitors vacuum pressure and closes the PSL shutter in the event of pressure exceeding some threshold needs to be implemented. Another pending task is to make sure that backup disk for c1vac actually is bootable and is a plug-and-play replacement.
Bob and Chub concluded that the drypump that serves as TP2's forepump had failed. Steve had told me the whereabouts of a spare Agilent IDP-7. This was meant to be a replacement for the TP3 foreline pump when it failed, but we decided to swap it in while diagnosing the failed drypump (which had 2182 hours continuous running according to the hour counter). Sure enough, the spare pump spun up and the TP2fl pressure dropped at a rate consistent with what is expected. I was then able to spin up TP1, TP2 and TP3.
However, when opening V4 (the foreline of TP1 pumped by TP2), I heard a loud repeated click track (~5Hz) from the electronics rack. Shortly after, the interlocks shut down all the TPs again, citing "AC power loss". Something is not right, I leave it to Jon and Chub to investigate.
I can't explain the mechanical switching sound Gautam reported. The relay controlling power to the TP2 forepump is housed in the main AC relay box under the arm tube, not in the Acromag chassis, so it can't be from that. I've cycled through the pumpdown sequence several times and can't reproduce the effect. The Acromag switches for TP2 still work fine.
In any case, I've made modifications to the vacuum interlocks that will help with two of the issues:
C1:AUX-PSL_ShutterRqst --> 0
After finishing this vac work, I began a new pumpdown at ~4:30pm. The pressure fell quickly and has already reached ~1e-5 torr. TP2 current and temp look fine.
PSL shutter was re-opened at 6pm local time. IMC was locked. As of 10pm, the main volume pressure is already back down to the 8e-6 level.
Is this one close to failure as well?
This happened again, about 30,000 seconds (~2:06pm local time according to the logfile) ago. The cited error was the same -
Hard to believe there was any real power loss, nothing else in the lab seems to have been affected so I am inclined to suspect a buggy UPS communication channel. The PSL shutter was not closed - I believe the condition is for P1a to exceed 3 mtorr (it is at 1 mtorr right now), but perhaps this should be modified to close the PSL shutter in the event of any interlock tripping. Also, probably not a bad idea to send an email alert to the lab mailing list in the event of a vac interlock failure.
For tonight, I only plan to work with the EX ALS system anyways so I'm closing the PSL shutter, I'll work with Chub to restore the vacuum if he deems it okay tomorrow.
After getting the go ahead from Chub and Jon, I restored the Vacuum state to "Vacuum normal", see Attachment #1. Steps:
controls@c1vac:/opt/target/python/interlocks$ git diff interlock.py
diff --git a/python/interlocks/interlock.py b/python/interlocks/interlock.py
index 28d3366..46a39fc 100755
@@ -52,8 +52,8 @@ class Interlock(object):
self.pumps = 
for pump in interlocks['pumps']:
pm = PumpManager(pump['name'])
- for condition in pump['conditions']:
+ #for condition in pump['conditions']:
+ # pm.register_condition(*condition)
So far the pressure is coming down smoothly, see Attachment #2. I'll keep an eye on it.
PSL shutter was opened at 645pm local time. IMC locked almost immediately.
Update 11pm: The pressure has reached 8.5e-6 torr without hiccup.
I slightly cleaned up Gautam's disabling of the UPS-predicated vac interlock and restarted the interlock service. This interlock is intended to protect the turbo pumps after a power outage, but it has proven disruptive to normal operations with too many false triggers. It will be reenabled once a new UPS has been installed. For now, as it has been since 2001, the vac pumps are unprotected against an extended power outage.
This activity seems to have closed the PSL shutter (actually I'm not sure why that happened - the interlock should only trip if P1a exceeds 3 mtorr, and looking at the time series for the last 2 hours, it did not ever exceed this threshold). I saw no reason for it to remain closed so I re-opened it just now.
I vote for not remotely rebooting any of the vacuum / PSL subsystems. In the event of something going catastrophically wrong, someone should be on hand to take action in the lab.
I did the following:
I think this completes the pre-pumpdown alignment checks we usually do. The detailed plan for tomorrow is here: please have a look and lmk if I missed something.
[chub, koji, gautam]
Close up photos of the EY and IY chambers may be found here.
Update on the display manager of c1vac: I was able to get it working again by running sudo systemctl restart display-manager. Now I can interact with the MEDM screens on c1vac. It is a bit annoying that this machine doesn't have the users directory so I don't have access to the many convenient StripTool templates though - maybe I'll make local copies tomorrow for the pumpdown.
Overnight, the pressure of the main volume only rose by 10 mtorr, so there was no need to run the roughing pumps again. So we went straight to the turbos - hooked up the AUX drypump and set it up to back TP2. Initially, we tried having both TP2 and TP3 act as backing pumps for TP1, but the wimpy TP3 current was always passing the interlock threshold. So we decided to pump down with TP3 valved off, only TP2 backing TP1. This went smooth - we had to keep an eye on P2, to make sure it stayed below 1 torr. It took ~ 1 hour to go from 500 mtorr to 100 mtorr, but after that, I could almost immediately open up RV2 completely. A safe setting to run at seems to be to have RV2 open by between 0.5 and 1 turn (out of the full range of 7 turns) until the pressure drops to ~100 mtorr. Then we can crank it open. We are, at the time of writing, at ~8e-5 torr and the pressure is coming down steadily.
I had to manually clear the IG error on the CC1 gauge, and re-enabled the High Voltage, so that we have a readback of the main volume pressure in that range. I made a script to do this (enable the HV, the IG error still has to be cleared by pushing the appropriate buttons on the Hornet), it lives at /opt/target/python/serial/turnHornetON.py. I guess it'll take a few days to hit 8e-6 torr, but I don't see any reason to not leave the turbos running over the weekend.
Remaining tasks are (i) disconnect the roughing pump line and (ii) pump down the annuli, which will be done later today. Both were done at ~2pm, now we are in the vacuum normal config. I'll turn the two small turbos to run on "Standby Mode" before I head home today. I think TP3 may be close to end-of-life - the TP3 current went up to 1A even while evacuating the small volume of the annular line (which was already at 1 torr) with the AUX drypump backing it. The interlock condition is set to trip at 1.2A, and this pump is nominally supposed to be able to back TP1 during the pumpdown of the main volume from 500 mtorr, which it wasn't able to do.
I've been monitoring the status of the pumpdown remotely with ndscope lookbacks of C1:Vac-CC1_pressure. Today morning, I saw that the channel was putting out a constant value (signature of EPICS server being frozen). caget did not work either. Then I tried ssh-ing into c1vac to see if there were any issues but I was unable to. The machine isn't responding to ping either. The EPICS value has been frozen since ~1030pm PDT 26 May 2019.
I will try and head to campus later today to check on it. Isn't an email alert or soemthing supposed to be sent out in such an event?
The vacuum itself was fine - CC1 gauge reported a pressure of 1.3e-5 torr. Note to self: the C1:Vac-CC1_HORNET_PRESSURE channel, which is the analog readback of the Hornet gauge and which is hooked up to an Acromag ADC in the c1auxex chassis, is independent of the status of the c1vac machine, and so can serve as a diagnostic.
However, I was unable to interact with c1vac in any way, the monitor hooked up directly to it was showing a frozen display. So I hard-rebooted the system. It took a few minutes to come back online - but even after 10 minutes of waiting, still no display. In the process of the reboot, several valves were closed off - when the EPICS processes restart, there are momentary instances where the readback channels get an "undefined" value, which prompts the main interlock process to transition to a "SAFE" state.
Running df -h, I saw that the /var partition was completely full. Maybe this was somehow interfering with the machine running smoothly? Two files in particular, daemon.log and daemon.log.1 were ~1GB each. The contents of these files seemed to be just the readbacks for the caget and caput commands. So I cleared both these files, and now the /var partition usage is only 26%. I also got the display back up and running on the physical monitor hooked up to the c1vac machine's VGA port. Let's see if this has improved the stability situation. The CPU load is still high (~6-7), with most of this coming from the modbus process. Why is this so high? c1susaux has more Acromag units but claims a much lower load of 0.71. Is the CPU of the c1vac machine somehow inferior?
In the meantime, I ssh-ed into c1vac and restored the "Vacuum normal" valve config. During this little escapade, the main volume pressure rose to ~6e-5 torr. It's coming back down smoothly.
Unrelated to this work: we had turned the RGA off for the vent, I powered it back on and re-initialized it this morning.
Gautam and I debugged a communications problem with TP3 that was causing its python service to fail. We traced the problem back to the querying of the pump controller for its operational parameters (speed, voltage, temp). Some small percentage of the time (~5%, indeterministically), the pump controller is returning an invalid response which causes the service to shut itself down and signal a NO COMM error.
As a temporary fix, I wrapped the failing query in an exception handler to continue past this particular error. However, we suspect the microprocessor in the TP3 controller may be beginning to fail. There is a spare controller sitting right next to it in the vacuum rack. We will ask Chub to install the replacement in the near future.
gautam: this pump is responsible for pumping the annular volume under normal operations. while this problem is being resolved, the annular volume is valved off (as it has been since July 2019 anyway which is when this problem first manifested).
There was a jump in the main volume pressure at ~6pm PDT yesterday. The cause is unknown, but the pressure doesn't seem to be coming back down (but also isn't increasing alarmingly).
I wanted to look at the RGA scans to see if there were any clues as to what changed, but looks like the daily RGA scans stopped updating on Dec 24 2019. The c0rga machine responsible for running these scans doesn't respond to ssh. Not much to be done until the lockdown is over i guess...
I was in the lab at the time. But did not notice anything (like turbo sound etc). I was around ETMX/Y (1X9, 1Y4) rack and SUS rack (1X4/5), but did not go into the Vac region.
The email address in the N2 checking script wasn't right - I now updated it to email the 40m list if the sum of reserve tank pressures fall below 800 PSI. The checker itself is only run every 3 hours (via cron on c1vac).
I reset the remote of this git repo to the 40m version instead of Jon's personal one, to ensure consistency between what's on the vacuum machine and in the git repo. There is now a N2 checker python mailer that will email the 40m list if all the tank pressures are below 600 PSI (>12 hours left for someone to react before the main N2 line pressure drops and the interlocks kick in). For now, the script just runs as a cron job every 3 hours, but perhaps we should integrate it with the interlock process
Four nitrogen cylinders replaced the empties in the rack at the west entrance. Additionally, Airgas will now deliver only once a week. Let me know via email or text when the there are four empties in the rack and I'll order the next round.
The new nitrogen cylinders were delivered to the rack at the west entrance. We only get one Airgas delivery per week during the stay-at-home order, but so far they've not let us down.
There appears to have been some sort of vacuum failure.
ldas-pcdev1 was down, so the summary pages weren't being generated. I have now switched over to ldas-pcdev6. I suspect some forepump failure, will check up later today unless someone else wants to take care of this.
There was no interlock action, and I don't check the vacuum status every half hour, so there was a period of time last night there was high circulating power in the arm cavities when the main volume pressure was higher than nominal. I have now closed the PSL shutter until the issue is resolved.
It looks like the main vacuum interlock was tripped due to a serial communication error from the TP2 controller. With Rana/Koji's permission, I will open V1 and expose the main volume to TP1 again (#2 in last section).
Recommended course of action:
The pumpspool UPS has its "Replace Battery" indicator light on. Might be a good chance to change the UPS, but at the very least, we should put in fresh batteries (last replacement was in Aug 2017).
I'll say this again - the pumpspool area is noisier than I remember it being, I think one / both of the roughing pumps backing TP2 / TP3 need tip-seal replacements.
BTW, EX is 5C hotter than EY, by virtue of the tarnac outside? In fact, judging by Steve's thermometers, EX reports a 12C swing in 24 hours between 30 C and 18 C (so almost no temperature control) while EY reports a 5C swing between 20 and 25 C. This is borne out by the ETM Oplev data I think...
1. I agree that it's likely that it was the temp signal glitch.
Recom #2: I approve to reopen the valves to pump down the main volume. As long as there is no frequent glitch, we can just bring the vacuum back to normal with the current software setup.
2. Recom #1 is also reasonable. You can use simple logic like if we register 10 consecutive samples that exceed the threshold, we can activate the interlock. I feel we should still keep the temp interlock. Switching between pumping mode and the normal operation may cause unexpected omission of the interlocks when it is necessary.
3. We should purchase the UPS battery / replacement rotary TIP seal. Once they are in hand, we can stop the vacuum and execute the replacement. Can one person (who?) accomplish everything with some remote help?
4. The lab temp: you mean, 12degC swing with the AC on!?
Jon and Koji remotely supported Jordan's resetting the TP2 controller.
From the operator's console in front of the vac rack:
Open a terminal window (click the LXTerminal icon on the desktop)
Type "control" + enter to open the vac controls screen
Toggle all the open valves closed (edit by KA: and manually close RV2 by rotating the gate valve handle )
Turn OFF TP2 by clicking the "Off' button. Make sure the status changes and the rotation speed falls to zero (you'll also hear the pump spinning down)
The other pumps (TP1, TP3) can be left running
Once TP2 has stopped spinning, go to the back of the rack and locate the ethernet cable running from the back of the TP2 controller to the IOLAN server (near the top of the rack). Disconnect and reconnect the cable at each end, verifying it is firmly locked in place.
From the front of the rack, power down the TP2 controller (I don't quite remember for the Agilent, but you might have to move the slider on the front from "Remote" to "Local" first)
Wait about 30 seconds, then power it back on. If you had to move the slider to shut it down, revert it back to the "Remote" position.
Go back to the controls screen on the console. If the pump came back up and is communicating serially again, its status will say something other than "NO COMM"
Turn TP2 back on. Verify that it spins up to its nominal speed (66 kRPM)
At this point you can reopen any valves you initially closed (any that were already closed before, leave closed)
TP2 was stopped and at this moment the glitches were gone. Jordan powercycled the TP2 controller and we brought up the TP2 back at the full speed.
However, the glitches came back as before. Obviously we can't go on from here, and we've decided to stop the recovery process here today.
- We left TP1/2/3 running while the valves including RV2 were closed.
- When Jordan is back in the lab next week, we'll try to use TP3 as the backing of TP1 so that we can resume the main volume pumping.
- Currently, TP3 does not have interlocking and that is a risk. Jon is going to implement it.
- Meanwhile, we will try to replace the controller of TP2. We are supposed to have this in the lab. Ask Chub about the location.
- Once we confirm the stability of the diagnostic signals for TP2, we will come back to the nominal pumping scheme.
I still don't understand why restoring the vacuum is contingent on this functionality working. All the TPs have their own internal logic to shutdown the pump if some damage threshold is exceeded. Plus, we have the pressure-sensor based interlocks to protect the main volume as well as pumps. While the extra redundancy from the readbacks from the controller is useful, clearly it isn't the first line of defense?
The main volume pressure is currently ~10mTorr. If we pump down before this reaches 500mTorr, the procedure is pretty straightforward. Otherwise, we have to do the dance with the manual throttling valve (judging by current rate of increase, unlikely to exceed this over the weekend, but I lose IFO time).
Obviously I don't want to rush this and have some permanent damage, so I'll stay out of this unless otherwise instructed.
The vacuum safety policy and design are not clear to me, and I don't know what the first and second defense is. Since we had limited time and bandwidth during the remotely-supported recovery work today, we wanted to work step by step.
The pressure rising rate is 20mtorr/day, and turning on TP3 early next week will resume the main-volume pumping without too much hustle. If you need the IFO time now, contact with Jon and use backing with TP3.
Didn't mean to sound whiny. I will wait until the vacuum team tells me it is okay.
[Jon, Jordan, Koji]
Today Jordan reconfigured the vac system to allow pumping of the main volume resume, with Jon and Koji remotely advising. All clear to resume normal IFO activities. However, the vac system is operating in a temporary configuration that will have to be reverted as we locate replacement components. Details below.
Since serial readback of the TP2 controller seems to be failing, we reconfigured the system with TP3 now backing for TP1. TP2 was valved off (at V4) and shut down until we can replace its controller.
TP3 has its own problems, however. It was valved off in January after its temperature readback began glitching and spuriously triggering the interlocks [ELOG 15140]. However the problem appears to be limited only one readback (rotation speed, current, voltage are fine) and there is enough redundancy in the pump-dependent interlock conditions to safely connect it to the main volume.
We also discovered that sometime since January, the TP3 dry pump has failed. The foreline pressure had risen to 165 torr. Since the TP2 and TP3 dry pumps are not interchangeable (Agilent vs. Varian), we instead valved in the auxiliary dry pump and disconnected the failed dry pump using a KF blank. This is a temporary arrangement until the permanent dry pump can be repaired. Jordan removed it to replace the tip seals and will test it in the bake lab before reinstalling.
With this configuration in place, we proceeded to pump down the main volume without issue (attachment 1). We monitored the pumpdown for about 45 min., until the pressure had reached ~1E-5 torr and TP3 had been transitioned to standby (low-speed) mode.
I missed the vacuum discussion on the call today, but I have some questions/comments:
At the very least, I think we should consider making the interlock code have levels (like interrupts on a micro controller). So if the pressure gauges are communicating and are reporting acceptable pressure readings, we should be able to reject unphysical readbacks from the TP controllers.
I still don’t understand why TP2 can’t back TP1, but we just disable all the software interlock conditions contingent on TP2 readbacks. This pump is far newer than TP3, and unless I’ve misunderstood something major about the vacuum infrastructure, I don’t really see why we should trust this flaky serial readbacks for any actionable interlocks, at least without some AND logic (since temperature, current and speed aren’t really independent variables).
I also think we should finally implement the email alert in the event the vacuum interlock is tripped. I can implement this if no one else volunteers.
This might also be a good reminder to get the documentation in order about the new vacuum system.
Looking at images of the old vac screens, the TP2/3 rotation speed and status string were digitally monitored. However I don't know if there were software interlocks predicated on those.
The temperature and current interlocks are implemented precisely because the pumps can shut themselves off. The concern is not about damaging the pumps (their internal logic protects against that); it's that a pump could automatically shut down and back-vent the IFO to atmosphere. Another interlock (e.g., the pressure differentials) might catch it, but it would depend on the back-vent rate and the scenario has never been tested. The temperature and current interlocks are set to trip just before the pump reaches its internal shut-down threshold.
One way we might be able to reduce our reliance on the flaky serial readbacks is to implement rotation-speed hardware interlocks. The old vac documentation alludes to these, but as far as Chub and I could determine in 2018, they never actually existed. The older turbo controllers, at least, had an analog output proportional to speed which could be used to control a relay to interrupt the V4/5 control signals. I'll look into this for the new controllers. If it could be done, we could likely eliminate the layer of serial-readback interlocks altogether.
That would be awesome if you're willing to volunteer. I agree this would be great to have.
I agree there were MEDM fields, but I can't find any record of these channels being recorded till 2018 December, so I don't agree that they were being digitally monitored. You can also look back in the elog (e.g. here and here) and see that the display fields are just blank. I would then assume that no interlocks were dependent on these channels, because otherwise the vacuum interlocks would be perpetually tripped.
Sorry but I'm having trouble imagining a scenario how the pressure gauges wouldn't register this before the IFO volume is compromised. Is there some back of the envelope calculations I can do to understand this? Since both the pressure gauges and the TP diagnostic channels are being monitored via EPICS, the refresh rate is similar, so I don't see how we can have a pump temperature / speed / current threshold tripped but NOT have this be registered on all the pressure gauges, seems like a bit of a contrived scenario to me. Our thresholds currently seem to be arbitrary numbers anyway, or are they based on some expected backstreaming rate? Isn't this scenario degenerate with a leak elsewhere in the vacuum envelope that would be caught by the differential pressure interlocks?
For the email alert, can you expose a soft channel that is a flag - if this flag is not 1, then the service will send out an email.
Right, I doubt they were ever recorded or used for interlocks. But the readbacks did work at one point in the past. There's a photo of the old vac monitor screen on p. 19 of E1500239 (last updated 2017) which shows the fields once alive.
I don't disagree that the pressure gauges would register the change. What I'm not sure about is whether the change would violate any of the existing interlock conditions, triggering a shutdown. Looking at what we have now, the only non-pump-related conditions I see that might catch it are the diffpres conditions:
abs(P2 - PTP2) > 1 torr (for a TP2 failure)
abs(P3 - PTP3) > 1 torr (for a TP3 failure)
abs(P1a - P2) > 1 torr (for either a TP2 or TP3 failure)
For the P1a-P2 differential, the threshold of 1 torr is the smallest value that in practice still allows us to pump down the IFO without having to disable the interlocks (P1a-P2 is the TP1 intake/exhaust differential). The purpose of the P2-PTP2/P3-PTP3 differentials is to prevent V4/5 from opening and suddenly exposing the spinning turbo to high pressure. I'm not aware of a real damage threshold calculation that any one has done; I think < 1 torr is lore passed down by Steve.
If a turbo pump fails, the rate it would backstream is unknown (to me, at least) and likely depends on the failure mode. The scenario I'm concerned about is if the backstream rate is slower than the conduction time through the pumspool and into the main volume. In that case, the pressure gauges will rise more or less together all the way up to atmosphere, likely never crossing the 1 torr differential pressure thresholds.
There's already a channel C1:Vac-error_status, where if the value is anything other than an empty string, there is an interlock tripped. Does that work?
I removed the backing pumps for TP2 and TP3 today to test ultimate pressure and determine if they need a tip seal replacement. This was done with Jon backing me on Zoom. We closed off TP3 and powered down TP3 and the auxilliary pump, in order to remove the forepumps from the exhaust line.
Once pumps were removed I connected a Pirani gauge to the pump directly and pumped down, results as follows:
TP2 Forepump (Agilent IDP 7):
TP3 Forepump (Varian SH 110):
TP3 forepump defintely needs a new tip seal, and while the pressure on TP2 Forepump was good there was a significant amount of particulate that came out of the exhaust line, so a new tip seal might not be needed but it is recommeded.
So why not just have a special mode for the interlock code during pumpdown and venting, and during normal operation we expect the main volume pressure to be <100uTorr so the interlock trips if this condition is violated? These can just be EPICS buttons on the Vac control MEDM screen. Both of these procedures are not "business as usual", and even if we script them in the future, it's likely to have some operator supervising, so I don't think it's unreasonable to have to switch between these modes. I just think the pressure gauges have demonstrated themselves to be much more reliable than these TP serial readbacks (as you say, they worked once upon a time, but that is already evidence of its flakiness?). The Pirani gauges are not ultra-reliable, they have failed in the past, but at least less frequently than this serial comm glitching. In fact, if these readbacks are so flaky, it's not impossible that they don't signal a TP shutdown? I just think the real power of having these multi-channel diagnostics is lost without some AND logic - a turbopump failure is likely to result in an increase in pump current and temperature increase and pump speed decrease, so it's not the individual channel values that should be determining if an interlock is tripped.
I definitely think that protecting the vacuum envelope is a priority - but I don't think it should be at the expense of commissioning time. But if you think these extra interlocks are essential to the safety of the vacuum system, I withdraw my request.
It would be better to have a flag channel, might be useful for the summary pages too. I will make it if it is too much trouble.
I agree with your assessment, Jordan. If I'm not mistaken the scroll pump for TP2 is new; we had a very early failure with the last new scroll pump (the forepump for TP3) tip seals at just over 5000 hours. Glad to see my replacement seals held up for over 60K hours. If this is the trend with these pumps, we can simply run them to around 60000 hours and replace the seals at that time, rather than waiting for failure! - Chub
I've created a purchase list of hardware needed to restore the aging vacuum system. This wasn't planned as part of the BHD upgrade, but I've added it to the BHD procurement list since hardware replacements have become necessary.
The list proposes replacing the aging TP3 Varian turbo pump with the newer Agilent model which has already replaced TP2. It seems I was mistaken in believing we already had a second Agilent pump on hand. A thorough search of the lab has not turned it up, and Steve himself has told me he doesn't remember ordering a second one. Fortunately Steve did leave us a detailed Agilent parts list [ELOG 14322].
It also proposes replacing the glitching TP2 Agilent controller with a new one. The existing one can be sent back for repair and then retained as a spare. Considering that one of these controllers is already malfunctioning after < 2 years, I think it's a very good idea to have a spare on hand.
Below is our current list of vacuum hardware issues. Items that this purchase list will address (limited to only the most urgent) are highlighted in yellow.
I think we should discuss interlock possibilities at a 40m meeting. I'm reluctant to make the system more complicated, but perhaps we can find ways to reduce the reliance on the turbo pump readbacks. I agree they've proven to be the least reliable.
While we may be able to improve the tolerance to certain kinds of hardware malfunctions (and if so, we should), I don't see interlocks triggering on abnormal behavior of critical equipment as the root problem. As I see it, our bigger problem is with all the malfunctioning, mostly end-of-lifetime pieces of vacuum equipment still in use. If we can address the hardware problems, as I'm trying to do with replacements [ELOG 15412], I think that in itself will make the interlocking much less of an issue.
Ok, this can be added pretty easily. Its value will just be toggled between 1 and 0 every time the interlock server raises/clears the existing string channel. Adding the channel will require restarting the whole vac IOC, so I'll do it at a time when Jordan is on hand in case something fails to come back up.
For this particular email service, ideally the email should be sent out as soon as the interlock is tripped, so this would require a line of code to be added to the main interlock code. Which I guess would require a restart of the interlock service. So let me know when you guys plan to do the dry-pump tip seal replacement operation (when I presume valves will be closed anyways) so that we can do this in a minimally invasive way.
Tip Seals were replaced on the forepumps for TP2 and TP3, and both are ready to be installed back onto the forelines.
TP2 Forepump Ultimate Pressure: 180 mtorr
TP3 Forepump Ultimate Pressure: 120 mtorr