At the moment, c1sus and c1mcs on the c1sus machine seem to be dead in the water. At this point, it is unclear to me why.
Apparently during the 40m meeting, Alex was able to get test points working for the c1mcs model. He said he "had to slow down mx_stream startup on c1sus". When we returned at 2pm, things were running fine.
We began updating all the matrix values on the medm screens. Somewhere towards the end of this the c1sus model seemed to have crashed, leaving only c1x02 and c1mcs running. There were no obvious error messages I saw in dmesg and the target/c1sus/logs/log.txt file (although that seems to empty these days). We quickly saved to burt snap shots, one of c1sus and one of c1mcs and saved them to /opt/rtcds/catlech/c1/target/snapshots directory temporarily. We then ran the killc1sus script on c1sus, and then after confirming the code was removed, ran the startup script, startc1sus. The code seemed to come back partly. It was syncing up and finding the ADC/DAC boards, but not doing any real computations. The cycle time was reporting reasonably, but the usr time (representing computation done for the model) was 0. There were no updating monitor channels on the medm screens and filters would not turn on.
At this point I tried bringing down all 3 models, and restarting c1x02, then c1sus and c1mcs. At this point, both c1sus and c1mcs came back partly, doing no real calculations. c1x02 appears to be working normally (or at least the two filter banks in that model are showing changing channels from ADCs properly). I then tried rebooting the c1sus machine. It came back in the same state, working c1x02, non-calculating c1sus and c1mcs.
This problem has been resolved.
Apparently during one of Alex's debugging sessions, he had commented out the feCode function call on line 1532 of the controller.c file (located in /opt/rtcds/caltech/c1/core/advLigoRTS/src/fe/ directory).
This function is the one that actually calls all the front end specific code and without it, the code just doesn't do any computations. We had to then rebuild the front end codes with this corrected file.
Around noon, Yuta and I were trying to figure out why we were getting no signal out to the mode cleaner coils. It turns out the mode cleaner optic control model was not talking to the IOP model.
Alex and I were working under the incorrect assumption that you could use the same DAC piece in multiple models, and simply use a subset of the channels. He finally went and asked Rolf, who said that the same DAC simulink piece in different models doesn't work. You need to use shared memory locations to move the data to the model with the DAC card. Rolf says there was a discussion (probably a long while back) where it was asked if we needed to support DAC cards in multiple models and the decision was that it was not needed.
Rolf and Alex have said they'd come over and discuss the issue.
In the meantime, I'm moving forward by adding shared memory locations for all the mode cleaner optics to talk to the DAC in the c1sus model.
Note by KA: Important fact that is worth remembering
As noted by Koji, Alex and Rolf stopped by.
We discussed the feasibility of getting multiple models using the same DAC. We decided that we infact did need it. (I.e. 8 optics through 3 DACs does not divide nicely), and went about changing the controller.c file so as to gracefully handle that case. Basically it now writes a 0 to the channel rather than repeating the last output if a particular model goes down that is sharing a DAC.
In a separate issue, we found that when skipping DACs in a model (say using DACs 1 and 2 only) there was a miscommunication to the IOP, resulting in the wrong DACs getting the data. the temporary solution is to have all DACs in each model, even if they are not used. This will eventually be fixed in code.
At this point, we *seem* to be able to control and damp optics. Look for a elog from Yuta confirming or denying this later tonight (or maybe tomorrow).
Currently trying to understand why the ssh connections to c1sus are flaky. This morning, every time I tried to make the c1sus model on the c1sus machine, the ssh session would be terminated at a random spot midway through the build process. Eventually restarting c1sus fixed the problem for the moment.
However, previously in the last 48 hours, the c1sus machine had stopped responding to ssh logins while still appearing to be running the front end code. The next time this occurs, we should attach a monitor and keyboard and see what kind of state the computer is in. Its interesting to note we didn't have these problems before we switched over to the Gentoo kernel from the real-time linux Centos 5.5 kernel.
The c1sus model was split into 2, so that c1sus controls BS, PRM, SRM, ITMX, ITMY, while c1mcs controls MC1, MC2, MC3. The c1mcs uses shared memory to tell c1sus what signals to the binary outputs (which control analog whitening/dewhitening filters), since two models can't control a binary output.
This split was done because the CPU time was running above 60 microseconds (the limit allowable since we're trying to run at 16kHz). Apparently the work Alex had done getting testpoints working had put a greater load on the cpu and pushed it over an acceptable maximum. After removing the MC optics controls, the CPU time dropped to about 47 microseconds from about 67 microseconds. The c1mcs is taking about 20 microseconds per cycle.
The new model is using the top_names functionality to still call the channels C1SUS-XXX_YYY. However, the directory to find the actual medm filter modules is /opt/rtcds/caltech/c1/medm/c1mcs, and the gds testpoint screen for that model is called C1MCS-GDS_TP.adl. I'm currently in the process of updating the medm screens to point to the correct location.
Also, while plugging in the cables from the coil dewhitening boards, we realized I (Joe) had made a mistake in the assignment of channels to the binary output boards. I need to re-examine Jay's old drawings and fix the simulink model binary outputs.
I ran a DAC to ADC test on c1sus2 channels where I hooked up the outputs on the DAC to the input channels on the ADC. We used different combinations of ADCs and DACs to make sure that there were no errors that cancel each other out in the end. I took a transfer function across these channel combinations to reproduce figure 1 in T2000188.
As seen in the two attached PDFs the channels seem to be working properly they have a flat response with a gain of 0.5 (-6 dB). This is the response that is expected and is the result of the DAC signal being sent as a single ended signal and the ADC receiving as a differential input signal. This should result in a recorded signal of 0.5 the amplitude of the actual output signal.
The drop off on the high frequency end is the result of the anti-aliasing filter and the anti-imaging filter. Both of these are 8-pole elliptical filters so when combined we should get a drop off of 320dB per decade. I measured the slope on the last few points of each filter and the averaged value was around 347dB per decade. This is slightly steeper than expected but since it is to cut off higher frequencies it shouldn't have an effect on the operation of the system. Also it is very close to the expected value.
The ripples seen before the drop off are also an effect of the elliptical filters and are seen in T2000188.
Note: the transfer function that doesn't seem to match the others is the heartbeat timing signal.
(Because of a totally unrelated reason) I was checking the electronics units for the upgrade. And I realized that the electronics units at the test stand have not been properly powered.
I found that the AA/AI stack at the test stand (Attachment 1) has an unusual powering configuration (Attachment 2).
- Only the positive power supply was used / - The supply voltage is only +15V / - The GND reference is not connected to anywhere.
For confirmation, I checked the voltage across the DC power strip (Attachments 3/4). The positive was +5.3V and the negative was -9.4V. This is subject to change depending on the earth potential.
This is not a good condition at all. The asymmetric powering of the circuit may cause damages to the opamps. So I turned off the switches of the units.
The power configuration should be immediately corrected.
[Ian, Anchal, Paco]
After the Koji found that there was a problem with the power source Anchal and I fixed the power then reran the measurment. The only change this time around is that I increased the excitation amplitude to 100. In the first run the excitation amplitude was 1 which seemed to come out noise free but is too low to give a reliable value.
link to previous results
The new plots are attached.
From the 40m wiki, I was able to use the instructions here to map out what to do to get the IPC issue resolved. Here is a summary of my findings.
I updated the /etc/dis/dishost.conf file on the frame builder machine to include the c1sus2 machine which runs the sender model, c1hpc, see below. After this, the file becomes available on c1sus2 machine, see attachment 1, and the c1sus2 node shows up in the dxadmin GUI, see attachment 2. However, the c1sus2 machine was not active. I noticed that the log file for the dis_nodemgr service, see attachment 3, which is responsible for setting things up, indicated that the dis_irm service may not be up, so I checked and confirmed that this was indeed the case, see attachment 4. I tried restarting this service but was unsuccessful. I restarted the machine but this did not help either. I have reached out to Jonathan Hanks for assistance.
We decided to give the dolphin debugging another go. Firstly, we noticed that c1sus2 was no longer recogonising the dolphin card, which can be checked using
lspci | grep Stargen
or looking at the status light on the dolphin card of c1sus2, which was orange for both ports A and B.
We decided to do a hard reboot of c1sus2 and turned off the DAQ chassis for a few minutes, then restared c1sus2. This solved the card recognition problem as well as the 'dis_irm' driver loading issue (I think the driver does not get loaded if the system does not recognise a valid card, as I also saw the missing dis_irm driver module on c1testand).
Next, we confirmed the status of all dolphin cards on fb1, using
It looks like the dolphin card on c1sus2 has now been configured and is availabe to all other nodes. We then restated the all FE machines and models to see if we are in the clear. Unfortunately, we are not so lucky since the problem persisted.
Looking at the output of 'dmesg', we could only identity two notable difference between the operational dolphin cards on c1sus/c1ioo/c1lsc and c1sus2, namely: the card number being equal to zero and the memory addresses which are also zero, see image below.
Anyways, at least we can now eliminate driver issues and would move on to debugging the models next.
IPC issue still unresolved.
Updated shared memory tag so that 'SUS' -> 'SU2' in c1hpc, c1bac and c1su2. Removed obsolete 'HPC/BAC-SUS' references from IPC file, C1.ipc. Restarted the FE models but the c1sus2 machine froze, so I did a manual reboot. This brought down the vertex machines---which I restarted using /opt/rtcds/caltech/c1/scripts/cds/rebootC1LSC.sh---and the end machines which I restarted manually. Everything but the BHD optics now have their previous values. So need to burtrestore these.
# IPC file:
# Model file locations:
# Log files:
SUS overview medm screen :
[Yuta, Tega, Chris]
We did it!
Following Chris's suggestion, we added "pciRfm=1" to the CDS parameter block in c1x07.mdl - the IOP model for c1sus2. Then restarted the FE machines and this solved the dolphin IPC problem on c1sus2. We no longer see the RT Netstat error for 'C1:HPC-LSC_DCPD_A' and 'C1:HPC-LSC_DCPD_B' on the LSC IPC status page, see attachement 1.
Attachment 2 shows the module dependencies before and after the change was made, which confirms that the IOP model was not using the dolphin driver before the change.
We encountered a burt restore problem with missing snapfiles from yesterday when we tried restoring the EPICS values after restarting the FE machines. Koji helped us debug the problem, but the summary is that restarting the FE models somehow fixed the issue.
We were able to fix the shared memory issue by updating the receiver model name from ''SUS' to 'SU2' and the ADC zero issue by including both ADC0 and ADC1 in the c1hpc and c1bac models as well as removing the grounding of the unused ADC channels (including chn#16 and chn#17 which are actually used in c1hpc) in c1su2. We also used shared memory to move the DCPD_A/B error signals (after signal conditioning and mixing A/B; now named A_ERR and B_ERR) from c1hpc to c1bac.
C1:HPC-DCPD_A_IN1 and C1:HPC-DCPD_B_IN1 are now available (they are essentially the same as C1:LSC-DCPD_A_IN1 and C1:LSC-DCPD_B_IN1, except for they are ADC-ed with different ADC; see elog 40m/16954 and Attachment #1).
Dolphin IPC error in seding signal from c1hpc to c1lsc still remains.
We have added C1:HPC-DCPD_A_ERR and C1:HPC-DCPD_B_ERR testpoints, which can be used as A+B, A-B etc.
Restarting c1hpc crashed c1sus2, and also made c1lsc/ioo/sus models red.
We run /opt/rtcds/caltech/c1/Git/40m/scripts/cds/restartAllModels.sh to restart all the machines. It worked perfectly without manually pressing power buttons! Wow!
We have also edited /opt/rtcds/caltech/c1/medm/c1su2/C1SU2_WATCHDOGS.adl so that it will use new /opt/rtcds/caltech/c1/Git/40m/scripts/SUS/medm/resetFromWatchdogTrip.sh instead of old /opt/rtcds/caltech/c1/scripts/SUS/damprestore.py.
Today I tested the remaining Acromag channels and retested the non-functioning channels found yesterday, which Chub repaired this morning. We're still not quite ready for an in situ test. Here are the issues that remain.
I further diagnosed these channels by connecting a calibrated DC voltage source directly to the ADC terminals. The EPICS channels do sense this voltage, so the problem is isolated to the wiring between the ADC and DB37 feedthrough.
No output signal
To further diagnose these channels, I connected a voltmeter directly to the DAC terminals and toggled each channel output. The DACs are outputting the correct voltage, so these problems are also isolated to the wiring between DAC and feedthrough.
In testing the DC bias channels, I did not check the sign of the output signal, but only that the output had the correct magnitude. As a result my bench test is insensitive to situations where either two degrees of freedom are crossed or there is a polarity reversal. However, my susPython scripting tests for exactly this, fetching and applying all the relevant signal gains between pitch/yaw input and coil bias output. It would be very time consuming to propagate all these gains by hand, so I've elected to wait for the automated in situ test.
Jon suggested to reboot the acromag chassis, then the computer, and we did this without success. Then, Koji suggested we try running ifup eth0, so we ran `sudo /sbin/ifup eth0` and it worked to put c1susaux back in the martian network, but the modbus service was still down. We switched off the chassis and rebooted the computer and we had to do sudo /sbin/ifup eth0` again (why do we need to do this manually everytime?). Switched on the chassis but still no channels. `sudo systemctl status modbusioc.service' gave us inactive (dead) status. So we ran sudo systemctl restart modbusioc.service'.
The status became:
● modbusIOC.service - ModbusIOC Service via procServ
Loaded: loaded (/etc/systemd/system/modbusIOC.service; enabled)
Active: inactive (dead)
start condition failed at Thu 2021-06-17 16:10:42 PDT; 12min ago
ConditionPathExists=/opt/rtcds/caltech/c1/burt/autoburt/latest/c1susaux.snap was not met`
After another iteration we finally got a modbusIOC.service OK status, and we then repeated Jon's reboot procedure. This time, the acromags were on but reading 0.0, so we just needed to run `sudo /sbin/ifup eth1`and finally some sweet slow channels were read. As a final step we burt restored to 05:19 AM today c1susaux.snap file and managed to relock the IMC >> will keep an eye on it.... Finally, in the process of damping all the suspended optics, we noticed some OSEM channels on BS and PRM are reading 0.0 (they are red as we browse them)... We succeeded in locking both arms, but this remains an unknown for us.
For the in-situ test, I decided that we will use the physical SRM to test the c1susaux Acromag replacement crate functionality for all 8 optics (PRM, BS, ITMX, ITMY, SRM, MC1, MC2, MC3). To facilitate this, I moved the backplane connector of the SRM SUS PD whitening board from the P1 connector to P2, per Koji's mods at ~5:10PM local time. Watchdog was shutdown, and the backplane connectors for the SRM coil driver board was also disconnected (this is interfaced now to the Acromag chassis).
I had to remove the backplane connector for the BS coil driver board in order to have access to the SRM backplane connector. Room in the back of these eurocrate boxes is tight in the existing config...
At ~6pm, I manually powered down c1susaux (as I did not know of any way to turn off the EPICS server run by the old VME crate in a software way). The point was to be able to easily interface with the MEDM screens. So the slow channels prefixed C1:SUS-* are now being served by the Supermicro called c1susaux2.
A critical wiring error was found. The channel mapping prepared by Johannes lists the watchdog enable BIO channels as "C1:SUS-<OPTIC>_<COIL>_ENABLE", which go to pins 23A-27A on the P1 connector, with returns on the corresponding C pins. However, we use the "TEST" inputs of the coil driver boards for sending in the FAST actuation signals. The correct BIO channels for switching this input is actually "C1:SUS-<OPTIC>_<COIL>_TEST", which go to pins 28A-32A on the P1 connector. For todays tests, I voted to fix this inside the Acromag crate for the SRM channels, and do our tests. Chub will unfortunately have to fix the remaining 7 optics, see Attachment #1 for the corrections required. I apportion 70% of the blame to Johannes for the wrong channel assignment, and accept 30% for not checking it myself.
The good news: the tests for the SRM channels all passed!
Additionally, I confirmed that the watchdog tripped when the RMS OSEM PD voltage exceeded 200 counts. Ideally we'd have liked to test the stability of the EPICS server, but we have shut it down and brought the crate back out to the electronics bench for Chub to work on tomorrow.
I restarted the old VME c1susaux at 915pm local time as I didn't want to leave the watchdogs in an undefined state. Unsurprisingly, ITMY is stuck. Also, the BS (cable #22) and SRM (cable #40) coil drivers are physically disconnected at the front DB15 output because of the undefined backplane inputs. I also re-opened the PSL shutter.
c1susaux (which controls watchdogs and alignments for all non-ETM optics) was down, the last BURT was done yesterday around 2PM.
I restarted via keying the crate. I restored the BURT snapshot from yesterday.
For future reference:
Now that the Acromag upgrade of c1vac is complete, the next system to be upgraded will be c1susaux. We chose c1susaux because it is one of the highest-priority systems awaiting upgrade, and because Johannes has already partially assembled its Acromag replacement (see photos below). I've assessed the partially-assembled Acromag chassis and the mostly-set-up host computer and propose we do the following to complete the system.
As I go, I'm writing step-by-step documentation here so that others can follow this procedure for future systems. The goal is to create a standard procedure that can be followed for all the remaining upgrades.
The bulk of the remaining work is the wiring and testing of the rackmount chassis housing the Acromag units. This system consists of 17 units: 10 ADCs, 4 DACs, and 3 digitial I/O modules. Johannes has already created a full list of channel wiring assignments. He has installed DB37-to-breakout board feedthroughs for all the signal cable connections. It looks like about 40% of the wiring from the breakout boards to Acromag terminals is already done.
The Acromag units have to be initially configured using the Windows laptop connected by USB. Last week I wasn't immediately able to check their configuration because I couldn't power on the units. Although the DC power wiring is complete, when I connected a 24V power supply to the chassis connector and flipped on the switch, the voltage dropped to ~10V irrespective of adjusting the current limit. The 24V indicator lights on the chassis front and back illuminated dimly, but the Acromag lights did not turn on. I suspect there is a short to ground somewhere, but I didn't have time to investigate further. I'll check again this week unless someone else looks at it first.
The host computer has already been mostly configured by Johannes. So far I've only set up IP forwarding rules between the martian-facing and Acromag-facing ethernet interfaces (the Acromags are on a subnet inaccessible from the outside). This is documented in the link above. I also plan to set up local installations of modbus and EPICS, as explained below. The new EPICS command file (launches the IOC) and database files (define the channels) have already been created by Johannes. I think all that remains is to set up the IOC as a persistent system service.
Recommendation from Keith Thorne:
For CDS lab-wide, Jamie Rollins and Ryan Blair have been maintaining Debian 8 and 9 repos with some of these.
They have somewhat older EPICS versions and may not include all the modules we have for SL7.
One worry is whether they will keep up Debian 9 maintained, as Debian 10 is already out.
I would likely choose Debian 9 instead of Ubuntu 18.04.02, as not sure of Ubuntu repos for EPICS libraries.
For CDS lab-wide, Jamie Rollins and Ryan Blair have been maintaining Debian 8 and 9 repos with some of these.
They have somewhat older EPICS versions and may not include all the modules we have for SL7.
One worry is whether they will keep up Debian 9 maintained, as Debian 10 is already out.
I would likely choose Debian 9 instead of Ubuntu 18.04.02, as not sure of Ubuntu repos for EPICS libraries.
Based on this, I propose we use Debian 9 for our Acromag systems. I don't see a strong reason to switch to SL7, especially since c1vac and c1susaux are already set-up using Debian 8. Although Debian 8 is one version out of date, I think it's better to get a well-documented and tested procedure in place before we upgrade the working c1vac and c1susaux computers. When we start building the next system, let's install Debian 9 (or 10, if it's available), get it working with EPICS/modbus, then loop back to c1vac and c1susaux for the OS upgrade.
The current convention is for all machines to share a common installation which is hosted on the /cvs/cds network drive. This seems appealing because only a single central EPICS distribution needs to be maintained. However, from experience attempting this on c1vac, I'm convinced this is a bad design for the new Acromag systems.
The problem is that any network outage, even routine maintenance or brief glitches, wreaks havoc on Acromags set up this way. When the network is interrupted, the modbus executable disappears mid-execution, crashing the process and hanging the OS (I think related to the deadlocked NFS mount), so that the only way to recover is to manually power-cycle. Still worse, this can happen silently (channel values freeze), meaning that, e.g., watchdog protections might fail.
To avoid this, I'm planning to install a local EPICS distribution from source on c1susaux, just as I did for c1vac. This only takes a few minutes to do, and I will include the steps in the documented procedure. Building from source also better protects against OS-dependent buginess.
Just a few remarks, since I heard from Gautam that c1susaux is next in line for upgrade.
All units have already been configured with IP addresses and settings following the scheme explained on the slow controls wiki page. I did this while powering the units in the chassis, so I'm not sure where the short is coming from. Is the power supply maybe not sourcing enough current? Powering all units at the same time takes significant current, something like >1.5 Amps if I remember correctly. These are the IPs I assigned before I left:
I used black/white twisted-pair wires for A/D, red/white for D/A, and green/white for BIO channels. I found it easiest to remove the blue terminal blocks from the Acromag units for doing the majority of the wiring, but wasn't able to finish it. I had also done the analog channel calibrations using the windows untility using multimeters and one of the precision voltage sources I had brought over from the Bridge labs, but it's probably a good idea to check it and correct if necessary. I also recommend to check that the existing wiring particularly for MC1 and MC2 is correct, as I had swapped their order in the channel assignment in the past.
While looking through the database files I noticed two glaring mistakes which I fixed:
[Anchal, JC, Ian, Paco]
We have now fixed all issues with the PD mons of c1susaux2 chassis. The slow channels are now reading same values as the fast channels and there is no arbitrary offset. The binary channels are all working now except for LO2 UL which keeps showing ENABLE OFF. This was an issue earlier on LO1 UR and it magically disappeared and now is on LO2. I think the optical isolators aren't very robust. But anyways, now our watchdog system is fully functional for all BHD suspended optics.
[Anchal, Yehonathan, Ian]
We installed c1susaux2 acromag chassis in 1Y0 with c1susaux2 computer. We connected PD monitors, Binary inputs, Binary outputs, and Run/Acquire RTS signals for 6 of the 7 suspensions. We ran out of DB9 cables to connect PR3. Of the ones that were connected, LO2, AS1, AS4, SR2, and PR2 are showing no issues in the functionality from the chassis. For LO1, everything is working except for UR EnableMon channel. The enable monitor does not show an ON state for the coil even though the coil driver chassis shows that it is ON via the LED lights. A possible reason could be that a wire got disconnected when we closed the chassis (there are a lot of wires pushing against each other. Another reason could be that the optical isolator ISO10 could have developed a bad channel on channel 2. The circuit was tested before closing the chassis, so not sure what went wrong after closing it.
PR2 is showing a non-acromag chassis related issue. As soon as we close the loop by enabling the coils, the watchdog triggers because the loop is unstable. Not sure what has changed for PR2, but someone should take a look at it.
For the issue with LO1, I suggest we keep a note that the C1:SUS-LO1_UR_ENABLEMon channel is faulty and don't take its value seriously. We should diagnose and fix this issue once we have more reasons to disconnect the chassis and open it.
I tried to perform a simple enabling test of coils using c1susaux2 modbus channels but failed. I'm able to do the enabling of coils using the windows GUI of acromag card but I can not do it when the cards are connected to the computer subnetwork. The issue is two-fold:
There's also an issue in reading back the ENABLE_MON channels. Here we suspect that one of the optical isolator box that we have been using might have a short in one of it's output channel. I'll investigate this more tomorrow. Again, the issue is two-fold. The EPICS channel values do not really change. So there is clearly some issue of communicating with the acromag cards.
I took the c1teststand computer from teststand and converted it into c1susaux2. To do so, I installed a fresh copy of debian 10 on it and followed the steps on this wiki page. I did some parts slightly differently though. The directory /cvs/cds/caltecg/c1susaux2 is a repository and contains the service unit file modbusIOC.service as well. A symbolic link is created at /etc/systemd/system to use this service file for creating the modbusIOC service. All db files are generated by parsing the acromag chassis wiring file using this python script.
The service file is running without any errors now and all channels are available. The leftmost bench on EEshop at 40m is now ready to do LO1 slow controls and monitor testing. If someone gets time today, they can hookup an unused coil driver to the chassis and verify ENABLE switching and monitoring through the optical isolators. We can also drive some voltage on the PD monitors and verify the functioning of our ADCs. Once this test passes, it is straight forward to finish the remaining 6 SOS wiring and we would be good to install the chassis.
Attaching wiring diagram of c1susuaux2 acromag chassis. Any comments/modification suggestions should come soon as we'll go ahead and wire it soon.
Note: While accessing channels using caget on c1susuaux2, you might get a warning "Identical process variable names on multiple servers". You can safely ignore it. It just means that channel is accessible on that particular computer via two different network interfaces (martian network eno1 and acromag subnetwork eno2) and it will just pick one of them.
Yesterday Gautam and I ran final tests of the eight suspensions controlled by c1susaux, using PyIFOTest. All of the optics pass a set of basic signal-routing tests, which are described in more detail below. The only issue found was with ITMX having an apparent DC bias polarity reversal (all four front coils) relative to the other seven susaux optics. However, further investigation found that ETMX and ETMY have the same reversal, and there is documentation pointing to the magnets being oppositely-oriented on these two optics. It seems likely that this is the case for ITMX as well.
I conclude that all the new c1susaux wiring/EPICS interfacing works correctly. There are of course other tests that can still be scripted, but at this point I'm satisfied that the new Acromag machine itself is correctly installed. PyIFOTest has been morphed into a powerful general framework for automating IFO tests. Anything involving fast/slow IO can now be easily scripted. I highly encourage others to think of more applications this may have at the 40m.
The code is currently located in /users/jon/pyifotest although we should find a permanent location for it. From the root level it is executed as
$ ./IFOTest <PARAMETER_FILE>
where PARAMETER_FILE is the filepath to a YAML config file containing the test parameters. I've created a config file for each of the suspended optics. They are located in the root-level directory and follow the naming convention SUS-<OPTIC>.yaml.
The code climbs a hierarchical "ladder" of actuation/readback-paired tests, with the test at each level depending on signals validated in the preceding level. At the base is the fast data system, which provides an independent reference against which the slow channels are tested. There are currently three scripted tests for the slow SUS channels, listed in order of execution:
c1susvme1 is behaving weirdly. I've restarted it several times but its computation time is hanging out around 260 usec, making it useless for suspension control and locking. I also found a PS/2 keyboard plugged in, which doesn't work, so I unplugged it. It needs to be plugged into a PS/2 keyboard/mouse Y-splitter cable.
Power cycling c1dcuepics seems to have fixed the EPICs channel problems, and c1lsc, c1asc, and c1iovme are talking again.
I burt restored c1iscepics and c1Iosepics from the snapshot at 6 am this morning.
However, c1susvme1 never came back after the last power cycle of its crate that it shared with c1susvme2. I connected a monitor and keyboard per the reboot instructions. I hit ctrl-x, and it proceeded to boot, however, it displays that there's a media error, PXE-E61, suggests testing the cable, and only offers an option to reboot. From a cursory inspection of the front, the cables seem to look okay. Also, this machine had eventually come back after the first power cycle and I'm pretty sure no cables were moved in between.
I had a go at trying to bring c1susvme1 back online. The first few times I hit the physical reset button, I saw the same error that Joe mentioned, about needing to check some cables. I tried one round of rebooting c1sosvme, c1susvme2 and c1susvme1, with no success. After a few iterations of jiggle cables/reset button/ctrl-x on c1susvme1, it came back. I ran the startup.cmd script, and re-enabled the suspensions, and Mode Cleaner is now locked. So, all systems are back online, and I'm crossing my fingers and toes that they stay that way, at least for a little while.
Today I noticed that the FE SYNC counters of c1susvme1/2 on the RFM network screen were stuck at 16384. I tried to reboot the machines to fix the problem but it didn't work.
The BS watchdog tripped off when I did that, because I had forgotten to disable it. I had to wait for a few minutes before it settled down again.
Later I also re-locked the mode cleaner. But before I could do it, Rana had to reduce the MC_L offset for me.
We've also been having problems with timing for c1susvme2. Attached is a one-hour plot of timing data for this cpu, known as SRM. Each spike is an instance of lateness, and a potential cause of lock loss. This has been going on for a quite a while.
Attached is a 3 day trend of SRM CPU timing info. It clearly gets better (though still problematic) at some point, but I don't know why as it doesn't correspond with any work done. I've labeled a reboot, which was done to try to clear out the timing issues. It can also be seen that it gets worse during locking work, but maybe that's a coincidence.
The attached shows the 200 day '10-minute' trend of the CPU meters and also the room temperature.
To my eye there is no correlation between the signals. Its clear that c1susvme2 (SRM LOAD) is going up and no evidence that its temperature.
It got worse again, starting with locking last night, but it has not recovered. Attached is a 3-day trend of SRM cpu load showing the good spell.
Last week, Alex recompiled the c1susvme2 code without the decimation filters for the OUT16 channels, so these channels are now as aliased as the rest of them. This appears to have helped with the timing issues: although it's not completely cured it is much better. Attached is a five day trend.
When I came in earlier today, I noticed that c1susvme2 was red on the DAQ screens. Since the vme computers always seem to be happier as a set, I hit the physical reset buttons on sosvme, susvme1 and susvme2. I then did the telnet or ssh in as appropriate for each computer in turn. sosvme and susvme1 came back just fine. However, I couldn't cd to /cvs/cds/caltech/target/c1susvme2 while ssh-ed in to susvme2. I could cd to /cvs/cds, and then did an ls, and it came back totally blank. There was nothing at all in the folder.
Yoichi showed me how to do 'df' to figure out what filesystems are mounted, and it looked as though the filesystem was mounted. But then Yoichi tried to unmount the filesystem, and it claimed that it wasn't mounted at all. We then remounted the filesystem, and things were good again. I was able to continue the regular restart procedure, and the computer is back up again.
Recap: c1susvme2 mysteriously got unmounted from /cvs/cds! But it's back, and the computers are all good again.
c1susvme2 has been running just a bit late for about a week. I rebooted it.
The plot shows SRM_FE_SYNC, which is the number of times in the last second that c1susvme2 was late for the 16k cycle. Similarly for ETMX.
The reboot appears to have worked.
[JC, Tega, Chris]
After moving the test stand front-ends, chiara (name server) and fb1 (boot server) to the new rack behind 1X7, we powered everything up and checked that we can reach c1teststand via pianosa and that the front-ends are still able to boot from fb1. After confirming these tests, we decided to start the software upgrade to debian 10. We installed buster on fb1 and are now in the process of setting up diskless boot. I have been looking around for cds instructions on how to do this and I found the CdsFrontEndDebian10page which contains most of the info we require. The page suggests that it may be cleaner to start the debian10 installation on a front-end that is connected to an I/O chassis with at least 1 ADC and 1 DAC card, then move the installation disk to the boot server and continue from there, so I moved the disk from fb1 to one of the front-ends but I had trouble getting it to boot. I decided to do a clean install on another disk on the c1lsc front-end which has a host adapter card that can be connected to the c1bhd I/O chassis. We can then mount this disk on fb1 and use it to setup the diskless boot OS.
We went and collected some information for the overlords to fix the c1teststand DAQ network issue.
We would try to get internet access to c1teststand soon. Meanwhile, someone with more experience and knowledge should look into this situation and try to fix it. We need to test the c1teststand within few weeks now.
We tried to fix the ntp synchronization in c1teststand today by repeating the steps listed in 40m/16302. Even though teh cloned fb1 now has the exact same package version, conf & service files, and status, the FE machines (c1bhd and c1sus2) fail to sync to the time. the timedatectl shows the same stauts 'Idle'. We also, dug bit deeper into the error messages of daq_dc on cloned fb1 and mx_stream on FE machines and have some error messages to report here.
controls@fb1:~ 0$ sudo dbpg -i ntp_1%3a4.2.6.p5+dfsg-7+deb8u3_amd64.deb
controls@fb1:~ 0$ sudo dbpg -i libopts25_5.18.12-3_amd64.deb
controls@fb1:~ 0$ sudo dbpg -i libssl1.1_1.1.0l-1~deb9u4_amd64.deb
controls@fb1:~ 0$ sudo systemctl enable ntp
controls@fb1:~ 0$ sudo systemctl daemon-reload
controls@fb1:~ 0$ sudo systemctl restart ntp
controls@fb1:~ 0$ sudo systemctl status ntp
● ntp.service - NTP daemon (custom service)
Loaded: loaded (/etc/systemd/system/ntp.service; enabled)
Active: active (running) since Mon 2021-10-04 17:12:58 UTC; 1h 15min ago
Main PID: 26807 (code=exited, status=0/SUCCESS)
├─30408 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:107
└─30525 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:107
Oct 04 17:48:42 fb1 ntpd_intres: host name not found: 2.debian.pool.ntp.org
Oct 04 17:48:52 fb1 ntpd_intres: host name not found: 3.debian.pool.ntp.org
Oct 04 18:05:05 fb1 ntpd_intres: host name not found: 0.debian.pool.ntp.org
Oct 04 18:05:15 fb1 ntpd_intres: host name not found: 1.debian.pool.ntp.org
Oct 04 18:05:25 fb1 ntpd_intres: host name not found: 2.debian.pool.ntp.org
Oct 04 18:05:35 fb1 ntpd_intres: host name not found: 3.debian.pool.ntp.org
Oct 04 18:21:48 fb1 ntpd_intres: host name not found: 0.debian.pool.ntp.org
Oct 04 18:21:58 fb1 ntpd_intres: host name not found: 1.debian.pool.ntp.org
Oct 04 18:22:08 fb1 ntpd_intres: host name not found: 2.debian.pool.ntp.org
Oct 04 18:22:18 fb1 ntpd_intres: host name not found: 3.debian.pool.ntp.org
controls@fb1:~ 0$ ntpq -p
remote refid st t when poll reach delay offset jitter
192.168.123.255 .BCST. 16 u - 64 0 0.000 0.000 0.000
controls@c1bhd:~ 3$ timedatectl
Local time: Mon 2021-10-04 18:34:38 UTC
Universal time: Mon 2021-10-04 18:34:38 UTC
RTC time: Mon 2021-10-04 18:34:38
Time zone: Etc/UTC (UTC, +0000)
NTP enabled: yes
NTP synchronized: no
RTC in local TZ: no
DST active: n/a
controls@c1bhd:~ 0$ systemctl status systemd-timesyncd -l
● systemd-timesyncd.service - Network Time Synchronization
Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled)
Active: active (running) since Mon 2021-10-04 17:21:29 UTC; 1h 13min ago
Main PID: 244 (systemd-timesyn)
controls@fb1:~ 3$ sudo systemctl status daqd_dc -l
● daqd_dc.service - Advanced LIGO RTS daqd data concentrator
Loaded: loaded (/etc/systemd/system/daqd_dc.service; enabled)
Active: failed (Result: exit-code) since Mon 2021-10-04 17:50:25 UTC; 22s ago
Process: 715 ExecStart=/usr/bin/daqd_dc_mx -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.dc (code=exited, status=1/FAILURE)
Main PID: 715 (code=exited, status=1/FAILURE)
Oct 04 17:50:24 fb1 systemd: Started Advanced LIGO RTS daqd data concentrator.
Oct 04 17:50:25 fb1 daqd_dc_mx: [Mon Oct 4 17:50:25 2021] Unable to set to nice = -20 -error Unknown error -1
Oct 04 17:50:25 fb1 daqd_dc_mx: Failed to do mx_get_info: MX not initialized.
Oct 04 17:50:25 fb1 daqd_dc_mx: 263596
Oct 04 17:50:25 fb1 systemd: daqd_dc.service: main process exited, code=exited, status=1/FAILURE
Oct 04 17:50:25 fb1 systemd: Unit daqd_dc.service entered failed state.
controls@fb1:~ 0$ sudo chroot /diskless/root
fb1:/ 0# sudo nano /etc/systemd/system/mx_stream.service
fb1:/ 0# exit
controls@c1bhd:~ 0$ sudo systemctl daemon-reload
controls@c1bhd:~ 0$ sudo systemctl restart mx_stream
controls@c1bhd:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
Active: failed (Result: exit-code) since Mon 2021-10-04 17:57:20 UTC; 24s ago
Process: 11832 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
Main PID: 11832 (code=exited, status=1/FAILURE)
Oct 04 17:57:20 c1bhd systemd: Starting Advanced LIGO RTS front end mx stream...
Oct 04 17:57:20 c1bhd systemd: Started Advanced LIGO RTS front end mx stream.
Oct 04 17:57:20 c1bhd mx_stream_exec: send len = 263596
Oct 04 17:57:20 c1bhd mx_stream_exec: OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
Oct 04 17:57:20 c1bhd mx_stream_exec: mx_connect failed Nic ID not Found in Peer Table
Oct 04 17:57:20 c1bhd mx_stream_exec: c1x06_daq mmapped address is 0x7f516a97a000
Oct 04 17:57:20 c1bhd mx_stream_exec: c1bhd_daq mmapped address is 0x7f516697a000
Oct 04 17:57:20 c1bhd systemd: mx_stream.service: main process exited, code=exited, status=1/FAILURE
Oct 04 17:57:20 c1bhd systemd: Unit mx_stream.service entered failed state.
As usual, some help would be helpful