[Larry (on site), Koji & Gautam (remote)]
Network recovery (Larry/KA)
Asked Larry to get into the lab.
14:30 Larry went to the lab office area. He power cycled the edge switch (on the rack next to the printer). This recovered ssh access to nodus.
Larry also turned on the CAD WS, and Koji confirmed remote access to it.
Nodus recovery (KA)
Apr 12, 22:43 nodus was restarted.
Apache (dokuwiki, svn, etc.) was recovered using the systemctl command, bringing the wiki back up.
ELOG was recovered by running its startup script.
Control Machines / RT FE / Acromag server Status
Judging by uptime, basically only the machines that are on UPS (all control room workstations + chiara) survived the power outage. All RT FEs are down. Apart from c1susaux, the acromag servers are back up (but the modbus processes have NOT been restarted yet). Vacuum machine is not visible on the network (could just be a networking issue and the local subnet to valves/pumps is connected, but no way to tell remotely).
KA imagines that FB took some finite time to come up after the power was restored. However, the RT machines require FB to be up in order to download (netboot) their OS, so this would have left the RT FEs down. If so, what we need is to power cycle them.
Acromag: unknown state
The power was lost at Apr 12 22:39:42, according to the vacuum pressure log. The outage lasted only a few minutes.
In the past year, pygwinc has expanded to support not just fundamental noise calculations (e.g., quantum, thermal) but also any number of user-defined noises. These custom noise definitions can do anything, from evaluating an empirical model (e.g., electronics, suspension) to loading real noise measurements (e.g., laser AM/PM noise). Here is an example of the framework applied to H1.
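For concreteness, a custom noise in this framework is just a class with a calc() method returning a PSD. Here is a minimal sketch following the pygwinc nb.Noise interface - the noise model and the numbers below are invented purely for illustration:

    import numpy as np
    from gwinc import nb

    class ElectronicsNoise(nb.Noise):
        """Toy empirical electronics noise, displacement-referred.

        A real budget term would evaluate a measured model or load data;
        this flat-with-1/f-knee shape is just a stand-in.
        """
        style = dict(label='Electronics', color='xkcd:puce')

        def calc(self):
            f = self.freq
            asd = 1e-19 * np.sqrt(1 + (10.0 / f)**2)  # m/rtHz; 1/f below 10 Hz
            return asd**2  # budget terms are PSDs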
Starting with the BHD review-era noises, I have set up the 40m pygwinc fork with a working noise budget which we can easily expand. Specific actions:
I set up our fork in this way to keep the 40m separate from the main pygwinc code (i.e., not added to as a built-in IFO type). With the 40m code all contained within one root-level directory (with a 40m-specific name), we should now always be able to upgrade to the latest pygwinc without creating intractable merge conflicts.
I have a query out to Dolphin asking:
Answers from Dolphin:
Since upgrading every front end is out of the question, our only option is to install an old OS (Linux kernel < 3.x) on the two new machines. Based on Keith's advice, I think we should go with Debian 8. (Link to Keith's Debian 8 instructions.)
I don't have a recent measurement of the optical gain of this config so I can't undo the loop, but in-loop performance doesn't suggest any excess in the 10-100 Hz band. Interestingly, there is considerable improvement below 10 Hz. Maybe some of this is reduced A2L noise because of the better angular stability, but there is also improvement at frequencies where the FF isn't doing anything, so could be some bilinear coupling. The two datasets were collected at approximately the same time in the evening, ~5pm, but on two different days.
I wonder how much noise is getting injected into the PRC length at 10-100 Hz due to this. Any change in the PRC error signal?
that's pretty great performance. maybe you can also upload some code so that we can do it later too - or maybe in the 40m GIT
Using the data I collected yesterday, the POP angular FF filters have been trained. The offline time-domain performance looks (unbelievably) good; online performance will be verified at the next available opportunity (see update).
The sequence of steps followed is the same as that done for the MCL FF filters. The trace that is missing from Attachment #1 is the measured online subtraction. Some rough notes:
Update Apr 5, 11:45 pm:
This afternoon, I kept the PRM locked for ~1hour and then measured transfer functions from the PRM angular actuators to the POP QPD spot motion for pitch and yaw between ~1pm and 4pm. After this work, the PRM was misaligned again. I will now work on the feedforward filter design.
I wanted to pass along a complication pointed out by K. Thorne re: our plan to use Gen1 (old) Dolphin IPC cards in the new real-time machines: c1bhd, c1sus2. The implication is that we may be forced to install a very old OS (e.g., Debian 8) for compatibility with the IPC card driver, which could lead to other complications like an incompatibility with the modern network interface.
I'll add more info if I hear back from them.
We want to migrate the end shutter controls from c1aux to the end Acromags. Could you add them to the list if they're not on it yet?
This will let us remove c1aux from the rack, I believe.
Yehonathan's list does include C1:AUX-GREEN_Y_Shutter and I copied its definition from /cvs/cds/caltech/target/c1aux/ShutterInterlock.db into the new ETMYaux.db file.
I noticed ShutterInterlock.db still contains about a dozen channels. Some of them appear to be ghosts (like the C1:AUX-PSL_Shutter[...] set, which has since become C1:PSL-PSL_Shutter[...] hosted on c1psl) but others like C1:AUX-GREEN_X_Shutter appear to still be in active use.
I have made a wiring + channel list that needs to be included in the new C1AUXEY Acromag.
I used Yehonathan's wiring assignments to lay the rest of groundwork for the final slow controls machine upgrade, c1auxey. Actions completed:
The "1" will be dropped after the new system is permanently installed.
Hardware-wise, this system will require:
I know that we have these quantities on hand. The next steps are to set up the Supermicro host and begin assembling the Acromag chassis. Both of these activities require an in-person presence, so I think this is as far as we can advance this project for now.
Retraining the MCL filters resulted in a slight improvement in the performance. Compared to no FF, the RMS in the 0.5-5 Hz range is reduced by approximately a factor of 3.
Attachment #1 shows my re-measurement of the MC2 position drive to MCL transfer function.
Attachment #2 shows the IIR fits to the FIR filters calculated here.
Attachment #3 shows several MCL spectra.
Conclusions + next steps
The problem is that foton does not inherit the model sample rate when launched from DTT/awggui. This is likely some shared/linked/dynamic library issue; the binaries we are running are precompiled, presumably for some other OS. I've never gotten this to work since we changed to SL7 (but I did use it successfully in 2017 with the Ubuntu 12 install).
Do you really mean awggui cannot make shaped noise injections via its foton text box? That has always worked for me in the past.
If this is broken, I'm suspicious there have been some package installs to the shared dirs by someone.
I'd like to re-measure the transfer function from driving MC2 position to the MC_L_DQ channel (for feedforward purposes). Swept sine would be one option, but I can't get the "Envelope" feature of DTT to work: the excitation amplitude isn't getting scaled as specified in the envelope, so I'm unable to make the measurement near 1 Hz (which is where the FF is effective). I see some scattered mentions of such an issue in past elogs but no mention of a fix (I also feel like I have gotten the envelope function to work for some other loop measurement templates). So then I thought I'd try broadband noise injection, since that seems to have been the approach followed in the past. Again, the noise injection needs to be shaped around ~1 Hz to avoid knocking the IMC out of lock, but I can't get Foton to do shaped noise injections because it doesn't inherit the sample rate when launched from inside DTT/awggui. This is not a new issue - does anyone know the fix?
Note that we are using the gds2.15 install of foton, but the pre-packaged foton that comes with the SL7 installation doesn't work either.
The envelope feature for swept-sine wasn't working because I had specified the frequency grid in the wrong order, apparently. Eric von Reis has been notified to include a sorting algorithm in future DTT so that the grid can be in arbitrary order. Fixing that allows me to run a swept sine with enveloped excitation amplitude and hence get the TF I want, but still no shaped noise injections via foton 😢
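As a workaround until the foton/awggui sample rate issue is fixed, the shaped noise could be pre-computed offline and then played back (e.g. with awgstream, assuming that works on our workstations). A sketch, with the model rate assumed:

    import numpy as np
    from scipy import signal

    fs = 16384                  # assumed rate of the excitation channel
    dur = 120                   # seconds of injection
    # band-pass the noise around ~1 Hz so the IMC isn't knocked out of lock
    sos = signal.butter(4, [0.3, 3.0], btype='bandpass', fs=fs, output='sos')
    shaped = signal.sosfilt(sos, np.random.randn(int(dur * fs)))
    shaped *= 1000 / np.abs(shaped).max()   # scale peak to a safe level (cts)
    np.savetxt('mc2_shaped_noise.txt', shaped)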
Yesterday evening I took nearly all of the masks, gloves, gowns, alcohol wipes, hats, and shoe covers. These were the ones in the cleanroom cabinets at the east end of the Y-arm, as well as in the many boxes under the Y-arm near those cabinets.
This photo album shows the stuff, plus some other random photos I took around the same time (6-7 PM) of the state of parts of the lab.
It was mostly copied from C1AUXEX.
I ignored the IPANG channels since IPANG is going to be removed from the table.
Since there has been a proliferation of BHD Google docs recently, I've linked them all from the BHD wiki page. Let's continue adding any new docs to this central list.
The email address in the N2 checking script wasn't right - I have now updated it to email the 40m list if the sum of the reserve tank pressures falls below 800 PSI. The checker itself only runs every 3 hours (via cron on c1vac).
I reset the remote of this git repo to the 40m version instead of Jon's personal one, to ensure consistency between what's on the vacuum machine and in the git repo. There is now an N2 checker python mailer that will email the 40m list if all the tank pressures are below 600 PSI (>12 hours left for someone to react before the main N2 line pressure drops and the interlocks kick in). For now, the script just runs as a cron job every 3 hours, but perhaps we should integrate it with the interlock process.
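The gist of the checker is something like the sketch below - the channel names, addresses, and SMTP setup here are placeholders, not what's actually on c1vac (the real script lives in the vacuum git repo):

    import smtplib
    from email.message import EmailMessage

    import epics  # pyepics

    TANK_CHANNELS = ['C1:Vac-N2T1_pressure', 'C1:Vac-N2T2_pressure']  # hypothetical
    THRESHOLD_PSI = 600

    pressures = [epics.caget(ch) for ch in TANK_CHANNELS]
    if all(p is not None and p < THRESHOLD_PSI for p in pressures):
        msg = EmailMessage()
        msg['Subject'] = 'N2 reserve tank pressure low'
        msg['From'] = 'c1vac@example.org'   # placeholder
        msg['To'] = '40m@example.org'       # placeholder
        msg.set_content(f'Tank pressures (PSI): {pressures}. '
                        'Roughly 12 hours to swap a tank before the interlocks kick in.')
        with smtplib.SMTP('localhost') as s:
            s.send_message(msg)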
I think the feedforward filters used for stabilizing MCL with vertex seismometers would benefit from a retraining (last trained in Sep 2015).
I wanted to re-familiarize myself with the seismic feedforward methodology. Getting good stabilization of the PRC angular motion as we have been able to in the past will be a big help for lock acquisition. But remotely, it is easier to work with the IMC length feedforward (IMC is locked more often than the PRC). So I collected 2 hours of data from early Sunday morning and went through the set of steps (partially).
Attachment #1 shows the performance of a first attempt.
Attachment #2 shows a comparison between the filter used in Attachment #1 and the filters currently loaded into the OAF system.
Attachment #3 is the ASD after implementing a time-domain Wiener filter, while Attachment #4 is an actual measurement from earlier today - it's not quite as good as Attachment #3 would have me expect, but that might also be due to the time of day.
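For reference, the core of the time-domain Wiener filter calculation is just the normal equations. A minimal single-witness sketch (the actual training uses multiple seismometer witnesses, plus pre-filtering/weighting not shown here):

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def wiener_fir(witness, target, ntaps):
        """FIR Wiener filter from the normal equations R w = p,
        with R the witness autocorrelation (Toeplitz matrix) and
        p the witness-target cross-correlation."""
        n = len(witness)
        r = np.array([np.dot(witness[:n - k], witness[k:])
                      for k in range(ntaps)]) / n
        p = np.array([np.dot(witness[:n - k], target[k:])
                      for k in range(ntaps)]) / n
        return solve_toeplitz(r, p)

    # w = wiener_fir(seis, mcl, 2048)
    # mcl_est = np.convolve(seis, w)[:len(mcl)]  # predicted MCL to subtract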
Conclusions and next steps:
On the basis of Attachments #3 and #4, I'd say it's worth it to complete the remaining steps for online implementation: FIR to IIR fitting and conversion to sos coefficients that Foton likes (preferably all in python). Once I've verified that this works, I'll see if I can get some data for the motion on the POP QPD with the PRMI locked on carrier. That'll be the target signal for the PRC angular FF training. Probably can't hurt to have this implemented for the arms as well.
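The sos conversion step, at least, is a scipy one-liner. A sketch with a placeholder filter standing in for the actual FIR-to-IIR fit result (the assumed model rate and the exact text format Foton wants should be double-checked):

    from scipy import signal

    fs = 2048   # assumed OAF model rate

    # placeholder (b, a); the real coefficients come from the FIR -> IIR fit
    b, a = signal.ellip(4, 1, 40, 5, btype='lowpass', fs=fs)

    sos = signal.tf2sos(b, a)   # one row per biquad
    for b0, b1, b2, a0, a1, a2 in sos:
        print(f'{b0:.6e} {b1:.6e} {b2:.6e} {a1:.6e} {a2:.6e}')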
While this set of steps follows the traditional approach, it'd be interesting if someone wants to try Gabriele's code which I think directly gives a z-domain representation and has been very successful at the sites.
* The y-axes on the spectra are labelled in um/rtHz but I don't actually know if the calibration has been updated anytime recently. As I type this, I'm also reminded that I have to check what the whitening situation is on the Pentek board that digitizes MCL.
Some short notes, more details tomorrow.
Attachment #1 shows time series of some signals, from the time I ramp off ALS CARM control to a lockloss. With this limited set of signals, I don't see any clear indication of the cause of the lockloss, but I was never able to keep the lock going for more than a couple of minutes.
Attachment #2 shows the CARM OLTF. Compared to last week, I was able to get the UGF a little higher. This particular measurement doesn't show it, but I was also able to engage the regular boost. I did a zeroth order test looking at the CM_SLOW input to make sure that I wasn't increasing the gain so much that the ADC was getting saturated. However, I did notice that the pk-to-pk error signal in this locked, 5kHz UGF state was still ~1000 cts, which seems large?
Attachment #3 shows the DTT measurement of the relative gains of the DARM A and B paths. This measurement was taken when the DARM_A gain was 1, and the DARM_B gain was 0.015. On the basis of this measurement, DARM_B (=AS55) sees the excitation injected 16 dB above the ALS signal, and so the gain of the DARM_B path should be ~0.16 for the same UGF. But I was never able to get the DARM_B gain above 0.02 without breaking the lock (admittedly the lockloss may have been due to something else).
Attachment #4 shows a zoomed in version of Attachment #1 around the time when the lock was lost. Maybe POP_YAW experienced too large an excursion?
Some other misc points:
I was in the lab at the time, but did not notice anything (like turbo sound etc.). I was around the ETMX/Y (1X9, 1Y4) racks and the SUS rack (1X4/5), but did not go into the Vac region.
There was a jump in the main volume pressure at ~6 pm PDT yesterday. The cause is unknown, but the pressure doesn't seem to be coming back down (though it also isn't increasing alarmingly).
I wanted to look at the RGA scans to see if there were any clues as to what changed, but it looks like the daily RGA scans stopped updating on Dec 24, 2019. The c0rga machine responsible for running these scans doesn't respond to ssh. Not much to be done until the lockdown is over, I guess...
No real progress tonight - I made it a bunch of times to the point where CARM was RF only, but I never got to run a measurement to determine what the DARM_B loop gain should be to make the control fully RF.
I measured the cross-calibration of the two PDs on the PSL table.
I used the existing flip mounted BS that routes the beam into a PDA255, the same as in the IMC transmission.
I placed a PDA520, the same as the one measuring TRY_OUT on the ETMY table, on the transmission of the BS (Attachment 1).
I used the SR785 to measure the frequency response of the PDA520 with reference to the PDA255 (Attachment 2). Indeed, the relative calibration is quite significant.
I calibrated the Y arm frequency response measurement.
However, the data seem to fit well to 1/sqrt(f^2 + fp^2) - the field response - but not to 1/(f^2 + fp^2) - the intensity response (Attachment 3).
Also, the extracted fp in the good fit is 3.8 kHz (finesse = FSR/(2*fp) ~ 500, taking FSR = 3.9 MHz) -> fp is too small, since that finesse exceeds the maximal ~450 expected for the arms.
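For what it's worth, comparing the two candidate models is a quick scipy exercise - a sketch with synthetic stand-in data (the real fit of course uses the measured points of Attachment 3):

    import numpy as np
    from scipy.optimize import curve_fit

    f = np.logspace(2, 5, 200)               # Hz; stand-in for measured points
    mag = 1 / np.sqrt(1 + (f / 3.8e3)**2)    # synthetic |TF| for illustration

    def field_model(f, a, fp):               # ~ 1/sqrt(f^2 + fp^2)
        return a / np.sqrt(f**2 + fp**2)

    def intensity_model(f, a, fp):           # ~ 1/(f^2 + fp^2)
        return a / (f**2 + fp**2)

    for model in (field_model, intensity_model):
        (a, fp), _ = curve_fit(model, f, mag, p0=[1e4, 3e3])
        print(f'{model.__name__}: fp = {fp:.3g} Hz, '
              f'finesse ~ {3.9e6 / (2 * fp):.0f}')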
When I did this measurement for the IMC in the past I fitted the response to 1/sqrt(f^2+fp^2) by mistake but I didn't notice it because I got a pole frequency that was consistent with ringdown measurements.
I also cross-calibrated the PDs participating in the IMC measurement, but found that the calibration accounted for distortions no bigger than 1 dB.
Today I finished implementing loopback monitors of the up/down state of the slow controls machines. They are visible on a new MEDM screen accessible from Sitemap > CDS > Slow Machine Status (pictured in attachment 1). Each monitor is a single EPICS binary channel hosted by the slow machine, which toggles its state at 1 Hz (an alive "blinker"). For each machine, the monitor is defined by a separate database file named c1[machine]_state.db located in the target directory.
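Since the blinkers are ordinary EPICS channels, the aliveness check can be scripted from anywhere on the network. A sketch using pyepics (the channel name is made up; the real ones follow the c1[machine]_state.db definitions):

    import time
    import epics  # pyepics

    def ioc_alive(chan, n=6, dt=0.5):
        """True if the 1 Hz blinker is reachable and actually toggling."""
        vals = []
        for _ in range(n):
            vals.append(epics.caget(chan, timeout=1.0))
            time.sleep(dt)
        return None not in vals and len(set(vals)) > 1

    print(ioc_alive('C1:AUX-C1PSL_BLINK'))  # hypothetical channel name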
This is implemented for all upgraded machines, which includes every slow machine except for c1auxey. This is the next and final one slated for replacement.
The blinkers are currently implemented as soft channels, but I'd like to ultimately convert them to hard channels using two sinking/sourcing BIO units. This will require new wiring inside each Acromag chassis, however. For now, as soft channels, the monitors are sensitive to a failure of the host machine or a failure of the EPICS IOC. As hard channels, they will additionally be sensitive to a failure of the secondary network interface, as has been known to happen.
Each slow machine's IOC had to be restarted this afternoon to pick up the new channels. The IOCs were restarted according to the following procedure.
The initial recovery of c1susaux did not succeed. Most visibly, the alignment state of the IFO was not restored. After some debugging, we found that the restart of the modbus service was partially failing at the final burt-restore stage. The latest snapshot file /opt/rtcds/caltech/c1/burt/autoburt/latest/c1susaux.snap was not found. I manually restored a known good snapshot from earlier in the day (15:19) and we were able to relock the IMC and XARM. GV and I were just talking earlier today about eliminating these burt-restores from the systemd files. I think we should.
I want to monitor the PMC TRANS and REFL levels on the PSL table - previously there were some cables going to the oscilloscope on the shelf but someone had removed these. I re-installed them just now. While there, I disconnected the drive to the AOM - there must've been some DC signal going to it because when I removed the cable, the PMC and IMC transmission were recovered to their nominal levels.
Updated the cert in /etc/httpd/ssl. The new cert is good until March 12, 2022.
Finally, some RF only CARM, see Attachment #1. During this time, DARM was also on a blend of IR and ALS control, but I couldn't turn the ALS path off in ~4-5 attempts tonight (mostly me pressing a wrong button). Attachment #2 shows the CARM OLTF, with ~2kHz UGF - for now, I didn't bother turning any boosts on. PRCL and MICH are still on 3f signals.
The recycling gain is ~7-8 (so losses >200ppm), but there may be some offset in some loop. I'll look at REFL DC tomorrow.
Can we please make an effort to keep the IFO in this state for the next week or two - it really helped tonight that I didn't have to spend 2 hours fixing some random stuff and could focus on the task at hand.
The long DB25 cable to connect the Acromag chassis to the temperature sensor interface box arrived, and we laid it out today. This cable does the following:
Both signals now show up in the EPICS channels, but are noisy - I suspect this is because the return pin of the Acromag is not shorted to ground (this is a problem I've seen on the bench before). We will rectify this tomorrow as well.
We took this opportunity to remove the bench supply and temporary Acromag crate (formerly known as c1psl2) from under the PSL table. While trying to find some space to store the bench supply, we came across a damaged oscilloscope in the second "Electronics" cabinet along the Y-arm, see Attachment #1.
After this work, I found that the IMC autolocker was reliably failing to run the mcup script at the stage where the FSS gains are ramped up to their final values. I was, however, able to smoothly transition to the low-noise locked state if I was manipulating the EPICS sliders by hand. So I added an extra 2 seconds of sleep time between the increasing of the VCO gain to the final value and the ramping of the FSS gains in the mcup script (where previously there was none). Now the autolocker performs reliably.
Of course the reboot wiped any logs we could have used for clues as to what happened. Next time it'll be good to preserve this info. I suspect the local subnet went down.
P.S. For some reason the system logs are privileged now - I ran sudo sysctl kernel.dmesg_restrict=0 on c1psl to make them readable by any user. This change won't persist on reboot.
I restarted the IOC but it didn't help.
I am now rebooting c1psl... That seemed to help. The PMC screen seems to be working again. I am able to lock the PMC now.
Came in this morning to find the PMC had been unlocked since 6 AM. The laser is still on, but PMC REFL PD DC shows a dead white constant 0 V on the PMC screen. Actually, all the controls on the PMC screen show a constant 0 V except for PMC_ERR_OUTPUT, which is a fast channel.
Is PSL Acromag already failing?
IMC was locking easily once some switches on the MC servo screen were put back to their normal states.
TTs were grossly misaligned. Once they were aligned, the arm cavities locked easily. Dither align for the X arm is very slow though...
when doing the AM sweeps of cavities
make sure to cross-calibrate the detectors
else you'll make of science much frivolities
much like the U.S. elections electors
It's been a while since I've attempted any locking, so tonight was mostly getting the various subsystems back together.
I opened the packages sent from Syracuse.
- The components are not vacuum clean. We need C&B.
- Some large parts are there, but many parts are missing to build complete SOSs.
- No OSEMs.
- Left and right panels for 6 towers
- 3 base blocks
- 1 suspension block
- 8 OSEM plates. (1 SOS needs 2 plates)
- The parts look like old versions. The side panels need insert pins to hold the OSEMs in place. We need to check what needs to be inserted there.
- An unrelated tower was also included.
Attachment #1 shows the relevant parts of the schematic of the WFS demod board (not whitening board).
Before removing the boards from the eurocrate:
After Koji effected the fix, the boards were re-installed, HV supplies were dialled back up to nominal voltage/currents, and the PMC/IMC were re-locked. The WFS DC channels now no longer saturate even when the IMC is unlocked 👏 👏 . I leave it to Yehonathan / Jon to calibrate these EPICS channels into physical units of mW of power. We should also fix the MEDM screen and remove the unnecessary EPICS channels.
Later in the evening, I took advantage of the non-saturated readbacks to center the beams better on the WFS heads. Then, with the WFS servos disabled, I manually aligned the IMC mirrors till REFLDC was minimized. Then I centered the beam on the MC2 transmission QPD (looking at individual quadrants), and set the WFS1/2 RF offsets and MC2 Trans QPD offsets in this condition.
WFS DC channels are saturating when the IMC is unlocked.
My old scheme was flawed as I used pitch as the readback. The pitch signal could not distinguish the cross-coupling due to coil imbalance and that due to the natural suspension L2P. A new scheme based on yaw alone has been developed and will be integrated into ifo_test. For now we revert the C1:SUS-MC2_UL/UR/LR/LLCOIL gains back to 1, -1, 1, -1.
We did some quick DC balancing of the MC2 coil drivers to reduce the l2a coupling. We updated the gains in the C1:SUS-MC2_UL/UR/LR/LLCOIL to be 1, -0.99, 0.937,-0.933, respectively. The previous values were 1, -1, 1, -1.
The procedures are the following:
Drive UL+LR and change the gain of LR to zero pitch.
Drive UR+LL and change the gain of LL to zero pitch.
Lastly, drive all 4 coils and change UR & LR together to zero yaw.
We used C1:SUS-MC2_LOCKIN1_OSC to create the excitations at 33 Hz w/ 30,000 cts. The angular error signals were derived from IMC WFSs.
While this time we did things by hand, in the future this should be automated, as the procedure is sufficiently straightforward - a sketch of what the automation could look like is below.
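The measurement side is just lock-in demodulation of the WFS error signal at the drive frequency; the stepping/nulling logic around it is outlined in the comments (channel access and exact names are approximate, untested):

    import numpy as np

    def demod_amplitude(x, fs, f0):
        """Lock-in style demodulation: amplitude of x at drive frequency f0."""
        t = np.arange(len(x)) / fs
        i = np.mean(x * np.cos(2 * np.pi * f0 * t))
        q = np.mean(x * np.sin(2 * np.pi * f0 * t))
        return 2 * np.hypot(i, q)

    # Sketch of the balancing loop (EPICS access via e.g. pyepics):
    #   1. set up C1:SUS-MC2_LOCKIN1_OSC at 33 Hz, 30,000 cts
    #   2. drive UL+LR; step the LR coil gain, and for each step grab a
    #      stretch of the WFS pitch signal, minimizing
    #      demod_amplitude(wfs_pit, fs, 33)
    #   3. repeat with UR+LL, stepping LL; finally drive all 4 coils and
    #      trim UR & LR together to null the yaw readback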
I want to measure the transfer function of the arm cavities to extract the pole frequencies and get more insight into what is going on with the DC loss measurements.
The idea is to modulate the light using the PSL AOM, measure the light transmitted from the arm cavities, and use the light transmitted through the IMC as a reference.
I tried to start measuring the X arm, but the transmission PD is connected to the QPD whitening filter board with a 4-pin LEMO for which I couldn't find an adapter.
Could this be because of the PDA520's limited bandwidth? I tried playing with the PD gain/bandwidth switch, but it seems like it was already set to high bandwidth/low gain.
In any case, the extracted pole frequency of ~2.9 kHz implies a finesse > 600 (finesse = FSR/(2*fp) ~ 3.9 MHz / 5.8 kHz ~ 670, assuming FSR = 3.9 MHz), which is way above the maximal finesse (~450) for the arm cavities.
I disconnected the source from the AOM but left the other two BNCs connected to the SR785. Also, the TRY PD is still teed off, and the long BNC cable is still on the ground.
I returned the triggering threshold to normal values (5/3).
Meanwhile, I want to block the Y arm trans PD (Thorlabs). To do it, the PD<->QPD thresholds were changed from 5.0/3.0 to 0.5/0.3.
ETMX was grossly misaligned.
I re-aligned it and the X arm now locks.
7:00PM with Koji
The alignment of both the X and Y arms was recovered.
~>z avg 10 C1:LSC-TRX_OUT C1:LSC-TRY_OUT
We are running ASS for the X arm to recover the X arm alignment.
An earthquake around 03:30 UTC (= 7:30 pm yesterday evening) tripped the ITMX, ITMY and ETMX watchdogs. ITMX got stuck. I released the stuck optic and re-enabled the local damping loops just now.
I did some preliminary debugging of this, and have localized the problem to the output path (after MC slow) on the IMC Servo card. Basically, I monitored the spectrum of the ALS beat frequency fluctuations under a few different conditions:
Toggling C1:IOO-MC_FASTSW, which supposedly isolates the post-MC slow (a.k.a. MCL) part of the servo, I see no difference. I am also reasonably confident this switch itself works, because I can break the IMC lock by toggling it. So pending a more detailed investigation, I am forced to conclude that the problem originates in the part of the IMC servo board after the MCL pickoff. Some cabling was removed at 1X2 on Tuesday between the times when there was no excess and when it showed up, but it's hard to imagine how this could have created this particular problem.
Sometime between 1PM and 6PM on Tuesday, excess laser frequency noise shows up in MCF at around 800 Hz, as shown in Attachment #1. Sigh.
While I show the MCF spectrum here, I confirmed that this noise is not injected by the IMC loop (with the PSL shutter closed, and the IMC servo board disconnected from the feedback path to the NPRO, the PMC error and control points still show the elevated noise, see Attachment #2). I don't think the problem is from the PMC loop - see Attachment #3 which is the ALS beat out-of-loop noise with the PMC unlocked (the PSL beam doesn't see the cavity before it gets to the ALS setup, and we only actuate on the cavity length for that loop, so this wasn't even really necessary).
Was there some work on the PSL table on Tuesday afternoon that can explain this?
It seems like the AO path gain stages on the IMC Servo board work just fine. The weird results I reported earlier were likely a measurement error arising from the fact that I did not disconnect the LEMO IN2 cable while measuring using the BNC IN2 connector, which probably made some parasitic path to ground that was screwing up the measurement. Today, I re-did the measurement with the signal injected at the IN2 BNC, and the TF measured being the ratio of TP3 on the board to a split-off of the SR785 source (T-eed off). Attachments #1 and #2 show the result - the gain deficit from the "expected" value is now consistent with that seen on other sliders.
Note that the signal from the CM board in the LSC rack is sent single-ended over a 2-pin LEMO cable (whose return pin is shorted to ground). But it is received differentially on the IMC Servo board. I took this chance to look for evidence of extra power line noise due to potential ground loops by looking at the IMC error point with various auxiliary cables connected to the board - but got distracted by some excess noise (next elog).
I am running some tests on the IMC servo board with an extender card so the IMC will not be locking for a couple of hours.
We've completed almost all of the in-situ testing of the c1psl channels. During this process, we identified several channels which needed to be rewired to different Acromags (BIO sinking v. sourcing). We also elected to change the connector type of a few channels for practical advantages. Those modifications and other issues found during testing are detailed below. Also attached are the updated channel assignments, with a column indicating the in-situ testing status of each channel.
With the Acromag chassis now permanently installed, we tested the C1PSL channels going over the channel list one by one, excluding the IMC channels which Gautam is taking responsibility for (the servo board itself is also in question).
The strategy is to check the response of input channels to specific output channels for expected behaviour whenever possible.
We marked on the channel list spreadsheet the status of the channels that were tested.
In more detail
PMC Servo Card
Unlocked the PMC by switching C1:PSL-PMC_SW1. Tweaked C1:PSL-PMC_RAMP and observed a change in C1:PSL-PMC_PZT.
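This kind of check is easy to script with pyepics if we want to repeat it after future maintenance - a sketch using the channels above (the step size and pass/fail threshold are arbitrary):

    import time
    import epics  # pyepics

    pzt0 = epics.caget('C1:PSL-PMC_PZT')
    ramp0 = epics.caget('C1:PSL-PMC_RAMP')
    epics.caput('C1:PSL-PMC_RAMP', ramp0 + 0.1, wait=True)
    time.sleep(1.0)
    pzt1 = epics.caget('C1:PSL-PMC_PZT')
    epics.caput('C1:PSL-PMC_RAMP', ramp0, wait=True)  # restore the slider
    print('PASS' if abs(pzt1 - pzt0) > 1e-3 else 'FAIL')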
We misaligned MC1 to get a measurable signal in WFS channels. NDScoped the corresponding C1:IOO-WFS*_SEG*_I&Q channels and observed a change in those channels in response to switching the attenuation on and off.
The signals were compared to previous values for consistency. Then they were unplugged from the Acromag chassis to confirm their values went to 0 and returned to the same values after being reconnected.
The C1PSL crate has now been installed in a more permanent way in the rack.
After this work, I disabled logging and restarted the modbus service (and copied the current version of the systemd service file to the target directory for backup). The PMC and IMC lock alright. The system is now ready to be tested in-situ. I will separately continue my IMC Servo board tests in the evening.
One thought about how to protect against this kind of silent failure - how about we always run the modbus service with logging enabled, and then send out a warning email and stop the service if the logfile size suddenly blows up (which is characteristic of when the communications process dies)? This should be done in addition to the ping-ing of the individual IPs.
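Something as simple as the following cron-able sketch would do it (the logfile path, unit name, and threshold are all placeholders to be filled in):

    import os
    import subprocess
    import time

    LOGFILE = '/var/log/modbusIOC.log'   # placeholder path
    SERVICE = 'modbusIOC.service'        # placeholder unit name
    MAX_BYTES_PER_MIN = 1_000_000        # what counts as "blowing up"

    size0 = os.path.getsize(LOGFILE)
    time.sleep(60)
    if os.path.getsize(LOGFILE) - size0 > MAX_BYTES_PER_MIN:
        subprocess.run(['systemctl', 'stop', SERVICE], check=True)
        # ...then email the 40m list, as the N2 checker does...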
Regarding the burt-restore step that the systemd service runs after starting up the IOC - this is not even that useful, at least in the way it is currently set up (restore the "latest" burt snapshot file). If the maintenance takes >1 hour, as it often does, the "latest" snapshot for the system under maintenance is just garbage. So either the burt-restore should be for a "known good time" (dangerous, because this will require frequent updates of the systemd service every time we find a new safe state) or we should just do it manually (my preference). Then there is no need to install custom packages on the server machine. Anyway, for now, I have not commented this step out.
Jordan is going to take pictures of all the electronics racks and update the relevant wiki pages.
Jon is going to write up the details of today's adventures. But the C1PSL Acromag chassis is sitting on the floor between the IMC beamtube and the 1X1 electronics rack, and is very much a trip hazard. Be careful if you're in that area.
I investigated the problem reported earlier today with the BIO1 channels. By logging the systemd messages generated when the IOC starts, I was immediately able to determine that the problem was not limited to BIO1. The modbus communications were failing for several other units as well.
Because some in-situ rewiring of a handful of channels had recently been done (more on this soon), I initially suspected that one of the Acromags had been damaged in the process. However, removing BIO1 (or other non-communicating modules) did not restore communications with the rest of the modules. To test whether the chassis was the source of the problem at all, we set up a fresh ADC (new out of the package) and directly connected it to the secondary Ethernet interface of c1psl. With only the one new ADC connected, the modbus IOC failed in exactly the same way.
To confirm that the new ADC did in fact work, we connected it to c1auxex in the same configuration. The unit worked fine connected to c1auxex. This established that the source of the problem was the c1psl host. After some extensive debugging, I traced the problem to a pre-execution script (part of the modbus IOC systemd service) which resets the secondary network interface (the one connected to the Acromag chassis) prior to launching the IOC. This was to ensure the secondary interface always had the correct IP address. It appears this reset was somehow creating a race condition that allowed the modbus initializations (first communications with the Acromags) to sometimes start before the network interface had actually come back up.
I still don't understand how this was happening, or why the pre script worked just fine up until yesterday, but eliminating the network interface reset fixes the problem in 100% of the trials we ran. Unfortunately we lost the entire day to debugging this problem, so the final round of testing is still to be completed. We plan to pick it back up tomorrow afternoon.
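One way to harden the pre-start step without resetting the interface might be to gate the IOC launch on the Acromags actually being reachable. A sketch (the IP is a placeholder on the chassis subnet; 502 is the standard Modbus/TCP port):

    import socket
    import time

    ACROMAG_IP = '192.168.114.20'   # placeholder address
    MODBUS_PORT = 502               # standard Modbus/TCP port

    def wait_for_acromags(timeout=60):
        """Block until an Acromag answers on Modbus/TCP, or give up."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with socket.create_connection((ACROMAG_IP, MODBUS_PORT),
                                              timeout=2):
                    return True
            except OSError:
                time.sleep(1)
        return False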
We are going to replace the old Sun c1ioo with a modernized Supermicro. At that opportunity, we'll remove the DAC and BIO cards to use them with the new machines. BTW I also have ~4 32ch BIO cards in my office.