40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log, Page 278 of 339  Not logged in ELOG logo
ID Date Author Type Category Subjectup
  17026   Fri Jul 22 15:05:26 2022 TegaConfigurationBHDc1sus2 shared memory and ADC fix

[Tega, Yuta]

We were able to fix the shared memory issue by updating the receiver model name from ''SUS' to 'SU2' and the ADC zero issue by including both ADC0 and ADC1 in the c1hpc and c1bac models as well as removing the grounding of the unused ADC channels (including chn#16 and chn#17 which are actually used in c1hpc) in c1su2. We also used shared memory to move the DCPD_A/B error signals (after signal conditioning and mixing A/B; now named A_ERR and B_ERR) from c1hpc to c1bac.
C1:HPC-DCPD_A_IN1 and C1:HPC-DCPD_B_IN1 are now availableangel (they are essentially the same as C1:LSC-DCPD_A_IN1 and C1:LSC-DCPD_B_IN1, except for they are ADC-ed with different ADC; see elog 40m/16954 and Attachment #1).
Dolphin IPC error in seding signal from c1hpc to c1lsc still remains.crying

Attachment 1: Screenshot_2022-07-22_15-04-33_DCPD.png
Attachment 2: Screenshot_2022-07-22_15-12-19_models.png
Attachment 3: Screenshot_2022-07-22_15-15-11_ERR.png
Attachment 4: Screenshot_2022-07-22_15-32-19_GDS.png
  17028   Fri Jul 22 17:46:10 2022 yutaConfigurationBHDc1sus2 watchdog update and DCPD ERR channels

[Tega, Yuta]

We have added C1:HPC-DCPD_A_ERR and C1:HPC-DCPD_B_ERR testpoints, which can be used as A+B, A-B etc.
Restarting c1hpc crashed c1sus2, and also made c1lsc/ioo/sus models red.
We run /opt/rtcds/caltech/c1/Git/40m/scripts/cds/restartAllModels.sh to restart all the machines. It worked perfectly without manually pressing power buttons! Wow!heart

We have also edited /opt/rtcds/caltech/c1/medm/c1su2/C1SU2_WATCHDOGS.adl so that it will use new /opt/rtcds/caltech/c1/Git/40m/scripts/SUS/medm/resetFromWatchdogTrip.sh instead of old /opt/rtcds/caltech/c1/scripts/SUS/damprestore.py.

Attachment 1: Screenshot_2022-07-22_17-48-25.png
  14563   Tue Apr 23 18:48:25 2019 JonUpdateSUSc1susaux bench testing completed

Today I tested the remaining Acromag channels and retested the non-functioning channels found yesterday, which Chub repaired this morning. We're still not quite ready for an in situ test. Here are the issues that remain.

Analog Input Channels

Channel Issue
C1:SUS-MC2_URPDMon No response
C1:SUS-MC2_LRPDMon No response

I further diagnosed these channels by connecting a calibrated DC voltage source directly to the ADC terminals. The EPICS channels do sense this voltage, so the problem is isolated to the wiring between the ADC and DB37 feedthrough.

Analog Output Channels

Channel Issue
C1:SUS-ITMX_ULBiasAdj No output signal
C1:SUS-ITMX_LLBiasAdj No output signal
C1:SUS-ITMX_URBiasAdj No output signal
C1:SUS-ITMX_LRBiasAdj No output signal
C1:SUS-ITMY_ULBiasAdj No output signal
C1:SUS-ITMY_LLBiasAdj No output signal
C1:SUS-ITMY_URBiasAdj No output signal
C1:SUS-ITMY_LRBiasAdj No output signal
C1:SUS-MC1_ULBiasAdj No output signal
C1:SUS-MC1_LLBiasAdj No output signal
C1:SUS-MC1_URBiasAdj No output signal
C1:SUS-MC1_LRBiasAdj No output signal

To further diagnose these channels, I connected a voltmeter directly to the DAC terminals and toggled each channel output. The DACs are outputting the correct voltage, so these problems are also isolated to the wiring between DAC and feedthrough.

In testing the DC bias channels, I did not check the sign of the output signal, but only that the output had the correct magnitude. As a result my bench test is insensitive to situations where either two degrees of freedom are crossed or there is a polarity reversal. However, my susPython scripting tests for exactly this, fetching and applying all the relevant signal gains between pitch/yaw input and coil bias output. It would be very time consuming to propagate all these gains by hand, so I've elected to wait for the automated in situ test.

Digital Output Channels

yes Everything works.

  16210   Thu Jun 17 16:37:23 2021 Anchal, PacoUpdateSUSc1susaux computer rebooted

Jon suggested to reboot the acromag chassis, then the computer, and we did this without success. Then, Koji suggested we try running ifup eth0, so we ran `sudo /sbin/ifup eth0` and it worked to put c1susaux back in the martian network, but the modbus service was still down. We switched off the chassis and rebooted the computer and we had to do sudo /sbin/ifup eth0` again (why do we need to do this manually everytime?). Switched on the chassis but still no channels. `sudo systemctl status modbusioc.service' gave us inactive (dead) status. So  we ran sudo systemctl restart modbusioc.service'.

The status became:

● modbusIOC.service - ModbusIOC Service via procServ
   Loaded: loaded (/etc/systemd/system/modbusIOC.service; enabled)
   Active: inactive (dead)
           start condition failed at Thu 2021-06-17 16:10:42 PDT; 12min ago
           ConditionPathExists=/opt/rtcds/caltech/c1/burt/autoburt/latest/c1susaux.snap was not met`

After another iteration we finally got a modbusIOC.service OK status, and we then repeated Jon's reboot procedure. This time, the acromags were on but reading 0.0, so we just needed to run `sudo /sbin/ifup eth1`and finally some sweet slow channels were read. As a final step we burt restored to 05:19 AM today c1susaux.snap file and managed to relock the IMC >> will keep an eye on it.... Finally, in the process of damping all the suspended optics, we noticed some OSEM channels on BS and PRM are reading 0.0 (they are red as we browse them)... We succeeded in locking both arms, but this remains an unknown for us.

  14567   Wed Apr 24 17:07:39 2019 gautamUpdateSUSc1susaux in-situ testing [and future of IFOtest]

[jon, gautam]

For the in-situ test, I decided that we will use the physical SRM to test the c1susaux Acromag replacement crate functionality for all 8 optics (PRM, BS, ITMX, ITMY, SRM, MC1, MC2, MC3). To facilitate this, I moved the backplane connector of the SRM SUS PD whitening board from the P1 connector to P2, per Koji's mods at ~5:10PM local time. Watchdog was shutdown, and the backplane connectors for the SRM coil driver board was also disconnected (this is interfaced now to the Acromag chassis).

I had to remove the backplane connector for the BS coil driver board in order to have access to the SRM backplane connector. Room in the back of these eurocrate boxes is tight in the existing config...

At ~6pm, I manually powered down c1susaux (as I did not know of any way to turn off the EPICS server run by the old VME crate in a software way). The point was to be able to easily interface with the MEDM screens. So the slow channels prefixed C1:SUS-* are now being served by the Supermicro called c1susaux2.

A critical wiring error was found. The channel mapping prepared by Johannes lists the watchdog enable BIO channels as "C1:SUS-<OPTIC>_<COIL>_ENABLE", which go to pins 23A-27A on the P1 connector, with returns on the corresponding C pins. However, we use the "TEST" inputs of the coil driver boards for sending in the FAST actuation signals. The correct BIO channels for switching this input is actually "C1:SUS-<OPTIC>_<COIL>_TEST", which go to pins 28A-32A on the P1 connector. For todays tests, I voted to fix this inside the Acromag crate for the SRM channels, and do our tests. Chub will unfortunately have to fix the remaining 7 optics, see Attachment #1 for the corrections required. I apportion 70% of the blame to Johannes for the wrong channel assignment, and accept 30% for not checking it myself.

The good news: the tests for the SRM channels all passed!

  • Attachment #2: Output of Jon's testing code. My contribution is the colored logs courtesy of python's coloredlogs package, but this needs a bit more work - mainly the PASS mssage needs to be green. This test applies bias voltages to PIT/YAW, and looks for the response in the PDmon channels. It backs out the correct signs for the four PDs based on the PIT/YAW actuation matrix, and checks that the optic has moved "sufficiently" for the applied bias. You can also see that the PD signals move with consistent signs when PIT/YAW misalignment is applied. Additionally, the DC values of the PDMon channels reported by the Acromag system are close to what they were using the VME system. I propose calling the next iteration of IFOtest "Sherlock".
  • Attachment #3: Confirmation (via spectra) that the SRM OSEM PD whitening can still be switched even after my move of the signals from the P1 connector to the P2 connector. I don't have an explanation right now for the shape of the SIDE coil spectrum.
  • Attachment #4: Applied 100 cts (~ 100*10/2**15/2 ~ 15mV at the monitor point) offset at the bias input of the coil output filters on SRM (this is a fast channel). Looked for the response in the Coil Vmon channels (these are SLOW channels). The correct coil showed consistent response across all 5 channels.

Additionally, I confirmed that the watchdog tripped when the RMS OSEM PD voltage exceeded 200 counts. Ideally we'd have liked to test the stability of the EPICS server, but we have shut it down and brought the crate back out to the electronics bench for Chub to work on tomorrow.

I restarted the old VME c1susaux at 915pm local time as I didn't want to leave the watchdogs in an undefined state. Unsurprisingly, ITMY is stuck. Also, the BS (cable #22) and SRM (cable #40) coil drivers are physically disconnected at the front DB15 output because of the undefined backplane inputs. I also re-opened the PSL shutter.

Attachment 1: 2019-04-24_20-29.pdf
Attachment 2: Screenshot_from_2019-04-24_20-05-54.png
Attachment 3: SRM_OSEMPD_WHT_ACROMAG.pdf
Attachment 4: DCVmon.png
  12336   Tue Jul 26 09:56:34 2016 ericqUpdateCDSc1susaux restarted

c1susaux (which controls watchdogs and alignments for all non-ETM optics) was down, the last BURT was done yesterday around 2PM. 

I restarted via keying the crate. I restored the BURT snapshot from yesterday.

  14590   Thu May 2 15:35:54 2019 JonOmnistructureUpgradec1susaux upgrade documentation

For future reference:

  • The updated list of c1susaux channel wiring (includes the "coil enable" --> "coil test" digital outputs change)
  • Step-by-step instructions on how to set up an Acromag system from scratch
  14495   Mon Mar 25 10:21:05 2019 JonUpdateUpgradec1susaux upgrade plan

Now that the Acromag upgrade of c1vac is complete, the next system to be upgraded will be c1susaux. We chose c1susaux because it is one of the highest-priority systems awaiting upgrade, and because Johannes has already partially assembled its Acromag replacement (see photos below). I've assessed the partially-assembled Acromag chassis and the mostly-set-up host computer and propose we do the following to complete the system.


As I go, I'm writing step-by-step documentation here so that others can follow this procedure for future systems. The goal is to create a standard procedure that can be followed for all the remaining upgrades.

Acromag Chassis Status

The bulk of the remaining work is the wiring and testing of the rackmount chassis housing the Acromag units. This system consists of 17 units: 10 ADCs, 4 DACs, and 3 digitial I/O modules. Johannes has already created a full list of channel wiring assignments. He has installed DB37-to-breakout board feedthroughs for all the signal cable connections. It looks like about 40% of the wiring from the breakout boards to Acromag terminals is already done.

The Acromag units have to be initially configured using the Windows laptop connected by USB. Last week I wasn't immediately able to check their configuration because I couldn't power on the units. Although the DC power wiring is complete, when I connected a 24V power supply to the chassis connector and flipped on the switch, the voltage dropped to ~10V irrespective of adjusting the current limit. The 24V indicator lights on the chassis front and back illuminated dimly, but the Acromag lights did not turn on. I suspect there is a short to ground somewhere, but I didn't have time to investigate further. I'll check again this week unless someone else looks at it first.

Host Computer Status

The host computer has already been mostly configured by Johannes. So far I've only set up IP forwarding rules between the martian-facing and Acromag-facing ethernet interfaces (the Acromags are on a subnet inaccessible from the outside). This is documented in the link above. I also plan to set up local installations of modbus and EPICS, as explained below. The new EPICS command file (launches the IOC) and database files (define the channels) have already been created by Johannes. I think all that remains is to set up the IOC as a persistent system service.

Host computer OS

Recommendation from Keith Thorne:

For CDS lab-wide, Jamie Rollins and Ryan Blair have been maintaining Debian 8 and 9 repos with some of these.  
They have somewhat older EPICS versions and may not include all the modules we have for SL7.
One worry is whether they will keep up Debian 9 maintained, as Debian 10 is already out.

I would likely choose Debian 9 instead of Ubuntu 18.04.02, as not sure of Ubuntu repos for EPICS libraries.

Based on this, I propose we use Debian 9 for our Acromag systems. I don't see a strong reason to switch to SL7, especially since c1vac and c1susaux are already set-up using Debian 8. Although Debian 8 is one version out of date, I think it's better to get a well-documented and tested procedure in place before we upgrade the working c1vac and c1susaux computers. When we start building the next system, let's install Debian 9 (or 10, if it's available), get it working with EPICS/modbus, then loop back to c1vac and c1susaux for the OS upgrade.

Local vs. central modbus/EPICS installation

The current convention is for all machines to share a common installation which is hosted on the /cvs/cds network drive. This seems appealing because only a single central EPICS distribution needs to be maintained. However, from experience attempting this on c1vac, I'm convinced this is a bad design for the new Acromag systems.

The problem is that any network outage, even routine maintenance or brief glitches, wreaks havoc on Acromags set up this way. When the network is interrupted, the modbus executable disappears mid-execution, crashing the process and hanging the OS (I think related to the deadlocked NFS mount), so that the only way to recover is to manually power-cycle. Still worse, this can happen silently (channel values freeze), meaning that, e.g., watchdog protections might fail.

To avoid this, I'm planning to install a local EPICS distribution from source on c1susaux, just as I did for c1vac. This only takes a few minutes to do, and I will include the steps in the documented procedure. Building from source also better protects against OS-dependent buginess.

Main TODO items

  • Debug issue with Acromag DC power wiring
  • Complete wiring from chassis feedthroughs to Acromag terminals, following this wiring diagram
  • Check/set the configuration of each Acromag unit using the software on the Windows laptop
  • Set the analog channel calibrations in the EPICS database file
  • Test each channel ex situ. Chub and I discussed an idea to use two DB-37F breakout boards, with the wiring between the board terminals manually set. One DAC channel would be calibrated and driven to test other ADC channels. A similar approach could be used for the digital input/output channels.
Attachment 1: IMG_3136.jpg
Attachment 2: IMG_3138.jpg
Attachment 3: IMG_3137.jpg
  14496   Tue Mar 26 04:25:13 2019 JohannesUpdateUpgradec1susaux upgrade plan

Main TODO items

  • Debug issue with Acromag DC power wiring
  • Complete wiring from chassis feedthroughs to Acromag terminals, following this wiring diagram
  • Check/set the configuration of each Acromag unit using the software on the Windows laptop
  • Set the analog channel calibrations in the EPICS database file
  • Test each channel ex situ. Chub and I discussed an idea to use two DB-37F breakout boards, with the wiring between the board terminals manually set. One DAC channel would be calibrated and driven to test other ADC channels. A similar approach could be used for the digital input/output channels.

Just a few remarks, since I heard from Gautam that c1susaux is next in line for upgrade.

All units have already been configured with IP addresses and settings following the scheme explained on the slow controls wiki page. I did this while powering the units in the chassis, so I'm not sure where the short is coming from. Is the power supply maybe not sourcing enough current? Powering all units at the same time takes significant current, something like >1.5 Amps if I remember correctly. These are the IPs I assigned before I left:

Acromag Unit IP Address

I used black/white twisted-pair wires for A/D, red/white for D/A, and green/white for BIO channels. I found it easiest to remove the blue terminal blocks from the Acromag units for doing the majority of the wiring, but wasn't able to finish it. I had also done the analog channel calibrations using the windows untility using multimeters and one of the precision voltage sources I had brought over from the Bridge labs, but it's probably a good idea to check it and correct if necessary. I also recommend to check that the existing wiring particularly for MC1 and MC2 is correct, as I had swapped their order in the channel assignment in the past.

While looking through the database files I noticed two glaring mistakes which I fixed:

  1. The definition of C1SUSAUX_BIO2 was missing in /cvs/cds/caltech/target/c1susaux2/C1SUSAUX.cmd. I added it after the assignments for C1SUSAUX_BIO1
  2. Due to copy/paste the database files /cvs/cds/caltech/target/c1susaux2/C1_SUS-AUX_<OPTIC>.db files were still pointing to C1AUXEX. I overwrote all instances of this in all database files with C1SUSAUX.


  16756   Mon Apr 4 17:03:47 2022 AnchalSummaryCDSc1susaux2 slow controls acromag chassis fixed and installed

[Anchal, JC, Ian, Paco]

We have now fixed all issues with the PD mons of c1susaux2 chassis. The slow channels are now reading same values as the fast channels and there is no arbitrary offset. The binary channels are all working now except for LO2 UL which keeps showing ENABLE OFF. This was an issue earlier on LO1 UR and it magically disappeared and now is on LO2. I think the optical isolators aren't very robust. But anyways, now our watchdog system is fully functional for all BHD suspended optics.

Attachment 1: Screenshot_2022-04-04_17-03-26.png
  16724   Mon Mar 14 12:20:05 2022 AnchalSummaryCDSc1susaux2 slow controls acromag chassis installed

[Anchal, Yehonathan, Ian]

We installed c1susaux2 acromag chassis in 1Y0 with c1susaux2 computer. We connected PD monitors, Binary inputs, Binary outputs, and Run/Acquire RTS signals for 6 of the 7 suspensions. We ran out of DB9 cables to connect PR3. Of the ones that were connected, LO2, AS1, AS4, SR2, and PR2 are showing no issues in the functionality from the chassis. For LO1, everything is working except for UR EnableMon channel. The enable monitor does not show an ON state for the coil even though the coil driver chassis shows that it is ON via the LED lights. A possible reason could be that a wire got disconnected when we closed the chassis (there are a lot of wires pushing against each other. Another reason could be that the optical isolator ISO10 could have developed a bad channel on channel 2. The circuit was tested before closing the chassis, so not sure what went wrong after closing it.

PR2 is showing a non-acromag chassis related issue. As soon as we close the loop by enabling the coils, the watchdog triggers because the loop is unstable. Not sure what has changed for PR2, but someone should take a look at it.

For the issue with LO1, I suggest we keep a note that the C1:SUS-LO1_UR_ENABLEMon channel is faulty and don't take its value seriously. We should diagnose and fix this issue once we have more reasons to disconnect the chassis and open it.


Attachment 1: BHD_WatchDogs.png
Attachment 2: 40mBHD_C1SUSAUX2_Acromag_Chassis.pdf
  16712   Mon Mar 7 19:38:47 2022 AnchalSummaryCDSc1susaux2 slow controls issues

I tried to perform a simple enabling test of coils using c1susaux2 modbus channels but failed. I'm able to do the enabling of coils using the windows GUI of acromag card but I can not do it when the cards are connected to the computer subnetwork. The issue is two-fold:

  • The enable channels such as C1:SUS-LO1_UL_ENABLE are not changing values when their DOL changes a value. In this case, I created a calc channel C1:SUS-LO1_ALL_CALC which takes the AND of all coil's individual CALC channels which are normally used as DOL for the ENABLE channels. But even though the changes are reflected properly to C1:SUS-LO1_ALL_CALC, it does not affect C1:SUS-LO1_UL_ENABLE. See the db files here for more info.
  • I tried to directly change the value of C1:SUS-LO1_UL_ENABLE using caput and even though in soft value the channel changes, it does not propagate a change at the output of Acromag card. So my suspicion is that something might be off with the setting of the Acromag card or c1susuaux2.cmd file. I followed this wiki page instructions, but if anyone can find an error, it would be useful.

There's also an issue in reading back the ENABLE_MON channels. Here we suspect that one of the optical isolator box that we have been using might have a short in one of it's output channel. I'll investigate this more tomorrow. Again, the issue is two-fold. The EPICS channel values do not really change. So there is clearly some issue of communicating with the acromag cards.

  16700   Fri Mar 4 11:04:34 2022 AnchalSummaryCDSc1susaux2 system setup and running

I took the c1teststand computer from teststand and converted it into c1susaux2. To do so, I installed a fresh copy of debian 10 on it and followed the steps on this wiki page. I did some parts slightly differently though. The directory /cvs/cds/caltecg/c1susaux2 is a repository and contains the service unit file modbusIOC.service as well. A symbolic link is created at /etc/systemd/system to use this service file for creating the modbusIOC service. All db files are generated by parsing the acromag chassis wiring file using this python script.

The service file is running without any errors now and all channels are available. The leftmost bench on EEshop at 40m is now ready to do LO1 slow controls and monitor testing. If someone gets time today, they can hookup an unused coil driver to the chassis and verify ENABLE switching and monitoring through the optical isolators. We can also drive some voltage on the PD monitors and verify the functioning of our ADCs. Once this test passes, it is straight forward to finish the remaining 6 SOS wiring and we would be good to install the chassis.

Attaching wiring diagram of c1susuaux2 acromag chassis. Any comments/modification suggestions should come soon as we'll go ahead and wire it soon.

Note: While accessing channels using caget on c1susuaux2, you might get a warning "Identical process variable names on multiple servers". You can safely ignore it. It just means that channel is accessible on that particular computer via two different network interfaces (martian network eno1 and acromag subnetwork eno2) and it will just pick one of them.

Attachment 1: 40mBHD_C1SUSAUX2_Acromag_Chassis.pdf
  14588   Thu May 2 10:59:58 2019 JonUpdateSUSc1susux in situ wiring testing completed


Yesterday Gautam and I ran final tests of the eight suspensions controlled by c1susaux, using PyIFOTest. All of the optics pass a set of basic signal-routing tests, which are described in more detail below. The only issue found was with ITMX having an apparent DC bias polarity reversal (all four front coils) relative to the other seven susaux optics. However, further investigation found that ETMX and ETMY have the same reversal, and there is documentation pointing to the magnets being oppositely-oriented on these two optics. It seems likely that this is the case for ITMX as well. 

I conclude that all the new c1susaux wiring/EPICS interfacing works correctly. There are of course other tests that can still be scripted, but at this point I'm satisfied that the new Acromag machine itself is correctly installed. PyIFOTest has been morphed into a powerful general framework for automating IFO tests. Anything involving fast/slow IO can now be easily scripted. I highly encourage others to think of more applications this may have at the 40m.

Usage and Design

The code is currently located in /users/jon/pyifotest although we should find a permanent location for it. From the root level it is executed as


where PARAMETER_FILE is the filepath to a YAML config file containing the test parameters. I've created a config file for each of the suspended optics. They are located in the root-level directory and follow the naming convention SUS-<OPTIC>.yaml.

The code climbs a hierarchical "ladder" of actuation/readback-paired tests, with the test at each level depending on signals validated in the preceding level. At the base is the fast data system, which provides an independent reference against which the slow channels are tested. There are currently three scripted tests for the slow SUS channels, listed in order of execution:

  1. VMon test:  Validates the low-frequency sensing of SUS actuation (VMon channels). A DC offset is applied in the final filter module of the fast coil outputs, one coil at a time. The test confirms that the VMon of the actuated coil, and only this VMon, senses the displacement, and that the response has the correct polarity. The screen output is a matrix showing the change in VMon responses with actuation of each coil. A passing test, roughly, is diagonal values >> 0 and off-diagonal values << diagonal.

  2. Coil Enable test:  Validates the slow watchdog control of the fast coil outputs (Coil-Enable channels). Analogously to (1), this test also applies a DC offset via the fast system to one coil at a time and analyzes the VMon responses. However, in this case, the offset is enabled to all five coils simulataneously and only one coil output is enabled at a time. The screen output is again a \Delta VMon matrix interpreted in the same way as above.


  3. PDMon/DC Bias test:  Validates slow alignment control and readback (BiasAdj and PDMon channels). A DC misalignment is introduced first in pitch, then in yaw, with the OSEM PDMon responses measured in both cases. Using the gains from the PIT/YAW---> COIL output coupling matrix, the script verifies that each coil moves in the correct direction and by a sufficiently large magnitude for the applied DC bias. The screen output shows the change in PDMon responses with a pure pitch actuation, and with a pure yaw actuation. The output filter matrix coefficients have already been divided out, so a passing test is a sufficiently large, positive change under both pitch and yaw actuations.


  405   Wed Mar 26 22:26:15 2008 JohnUpdateComputersc1susvme
I removed the fan and tweaked the timing cables to see if they were the source of our problems. I saw no effect. I'm leaving the fan off for the moment to see if that helps. It is on top of the filing cabinet next to my desk.
  1303   Sat Feb 14 16:15:19 2009 robConfigurationComputersc1susvme1

c1susvme1 is behaving weirdly.  I've restarted it several times but its computation time is hanging out around 260 usec, making it useless for suspension control and locking.  I also found a PS/2 keyboard plugged in, which doesn't work, so I unplugged it.  It needs to be plugged into a PS/2 keyboard/mouse Y-splitter cable. 

  900   Fri Aug 29 12:43:44 2008 josephbSummaryComputersc1susvme1 down
Around noon today, c1susvme was having problems. The C0DAQ_RFMNETWORK light was red. The status light was off, the sig det light was amber and the own data light was green. I could also ssh in, but could not not run startup. I switched off the watchdogs for c1susvme2 (the watchdogs for c1susvme1 had already been tripped), and manually power cycled the crate.

However, when c1susvme1 when it came back up it had not mounted the usual cvs/cds/ directories. c1susvme2 did however. c1susvme1 has been on the new network for awhile, while c1susvme2 was switch over today. So apparently switching networks doesn't help this particular problem.

I did a remote reboot of c1susvme1, and it came up with the correct files mounted. Both machines ran their approriate startup.cmd files and are currently green.
  2069   Thu Oct 8 14:41:46 2009 jenneUpdateComputersc1susvme1 is back online


Power cycling c1dcuepics seems to have fixed the EPICs channel problems, and c1lsc, c1asc, and c1iovme are talking again.

I burt restored c1iscepics and c1Iosepics from the snapshot at 6 am this morning.

However, c1susvme1 never came back after the last power cycle of its crate that it shared with c1susvme2.  I connected a monitor and keyboard per the reboot instructions.  I hit ctrl-x, and it proceeded to boot, however, it displays that there's a media error, PXE-E61, suggests testing the cable, and only offers an option to reboot.  From a cursory inspection of the front, the cables seem to look okay.  Also, this machine had eventually come back after the first power cycle and I'm pretty sure no cables were moved in between.


 I had a go at trying to bring c1susvme1 back online.  The first few times I hit the physical reset button, I saw the same error that Joe mentioned, about needing to check some cables.  I tried one round of rebooting c1sosvme, c1susvme2 and c1susvme1, with no success.  After a few iterations of jiggle cables/reset button/ctrl-x on c1susvme1, it came back.  I ran the startup.cmd script, and re-enabled the suspensions, and Mode Cleaner is now locked.  So, all systems are back online, and I'm crossing my fingers and toes that they stay that way, at least for a little while.

  822   Mon Aug 11 11:36:11 2008 josephb, SteveConfigurationComputersc1susvme1 minor problems
Around 11 am c1susvme1 start having issues. Namely C1:SUS-PRM_FE_SYNC was railing at some large value like 16384 (2^14). I presume this means the computer was running catastophically late.

I turned off the BS and ITM watch dogs (the PRM was already off), tried hitting reset and sshing in, and running startup, but this didn't help. I then turned off the c1susvme2 associated watch dogs (MC1-3, SRM) and went out to do a hard reboot by switching the crate power off. c1susvme2 came back up fine, was restarted and associated watch dogs turned back on. However, c1susvme1 came back up without mounting /cvs/cds/.

As a test, I replaced the ethernet connection with a CAT6 cable to the Prosafe switch in 1Y6, and then ran reboot on c1susvme1. When it came back up, it had mounted properly, and I was able to run the ./startup.cmd file. At this point it seems to be happy. The new cable is in the trays coming in from the top of the 1Y4 and 1Y6 and approriately labeled.

Edit: Apparently ITMX and ITMY became excited after the reboot (perhaps I turned the watchdogs back on too early? Although that was after the DAQ light was listed as green for c1susvme). Steve noticed this when the alarms went off again (I had turned them off after the reboot seemed successful), and he damped them. Interestingly, the BS remained unexcited.
  358   Tue Mar 4 23:22:32 2008 robDAQComputersc1susvme1&2 rebooted

I found that some channels from c1susvme1 and c1susvme2 were not being recording by the DAQ (and were not showing up in DV). I rebooted these processors, which fix the problem. If you see other cases of this (signal exactly zero, but not a testpoint problem), just reboot the corresponding processor.
  3239   Fri Jul 16 16:12:31 2010 AlbertoConfigurationComputersc1susvme1/2 rebooted

Today I noticed that the FE SYNC counters of c1susvme1/2 on the RFM network screen were stuck at 16384. I tried to reboot the machines to fix the problem but it didn't work.

The BS watchdog tripped off when I did that, because I had forgotten to disable it. I had to wait for a few minutes before it settled down again.

Later I also re-locked the mode cleaner. But before I could do it, Rana had to reduce the MC_L offset for me.

  399   Mon Mar 24 20:15:03 2008 JohnSummaryComputersc1susvme2
c1susvme2 isn't behaving itself. It keeps getting out of sync and/or giving a red status light.

After going through the usual restart procedures a few times (unsuccessfully) we power cycled the c1susvme & c1sosvme crates. We think everything came back okay.

We still can't get the status and CRC (cyclic redundancy check) to return to normal on c1susvme2. If Alex is around tomorrow please ask him to take a look.
  400   Tue Mar 25 10:44:24 2008 robUpdateComputersc1susvme2

c1susvme2 isn't behaving itself. It keeps getting out of sync and/or giving a red status light.

After going through the usual restart procedures a few times (unsuccessfully) we power cycled the c1susvme & c1sosvme crates. We think everything came back okay.

We still can't get the status and CRC (cyclic redundancy check) to return to normal on c1susvme2. If Alex is around tomorrow please ask him to take a look.

I rebooted it again this morning. The ASS machine is currently not running its process, for whatever reason (someone turn it off?). Let's leave it like this for a day and see how the c1susvme2 does. The other recent change is Steve's install of a cooling fan--maybe that's causing the problem.
  403   Tue Mar 25 16:34:47 2008 robUpdateComputersc1susvme2


c1susvme2 isn't behaving itself. It keeps getting out of sync and/or giving a red status light.

After going through the usual restart procedures a few times (unsuccessfully) we power cycled the c1susvme & c1sosvme crates. We think everything came back okay.

We still can't get the status and CRC (cyclic redundancy check) to return to normal on c1susvme2. If Alex is around tomorrow please ask him to take a look.

I rebooted it again this morning. The ASS machine is currently not running its process, for whatever reason (someone turn it off?). Let's leave it like this for a day and see how the c1susvme2 does. The other recent change is Steve's install of a cooling fan--maybe that's causing the problem.

Now c1susvme1 is joining the action. Since leaving the ASS off doesn't change anything, we can probably absolve it of blame. I now suspect the 4-pin LEMO cables going from the CLK DRIVER modules to the clock fanout modules. These cables are being squeezed/shaken by Steve's new fan setup, and may have been the culprit all along. John will do some testing to see if they are indeed the problem.
  401   Tue Mar 25 13:21:25 2008 AndreyUpdateComputersc1susvme2 is not behaving itself again
  406   Fri Mar 28 16:18:18 2008 robUpdateComputersc1susvme2 status
c1susvme2 is getting worse and worse. it won't run for more than ~45 minutes without fatally de-syncing. for now I've turned off c1iovme (which sends the MCL signal) to see if that's causing the problem. next I'll swap the boards for c1susvme1 and c1susvme2 to see if it's the cpu (or maybe the RFM card) itself, rather than the timing/pentek systems.
  408   Mon Mar 31 14:14:16 2008 robUpdateComputersc1susvme2 status

c1susvme2 is getting worse and worse. it won't run for more than ~45 minutes without fatally de-syncing. for now I've turned off c1iovme (which sends the MCL signal) to see if that's causing the problem. next I'll swap the boards for c1susvme1 and c1susvme2 to see if it's the cpu (or maybe the RFM card) itself, rather than the timing/pentek systems.

I swapped the processors for c1susvme1 and c1susvme2. So for now, to startup, you should ssh into c1susvme1 and run the startup.cmd for c1susvme2, and vice versa.
  2037   Thu Oct 1 15:42:55 2009 robUpdateLockingc1susvme2 timing problems update


We've also been having problems with timing for c1susvme2.  Attached is a one-hour plot of timing data for this cpu, known as SRM.  Each spike is an instance of lateness, and a potential cause of lock loss.  This has been going on for a quite a while.



Attached is a 3 day trend of SRM CPU timing info.  It clearly gets better (though still problematic) at some point, but I don't know why as it doesn't correspond with any work done.  I've labeled a reboot, which was done to try to clear out the timing issues.  It can also be seen that it gets worse during locking work, but maybe that's a coincidence.

Attachment 1: srmcpu2.png
  2041   Fri Oct 2 14:52:55 2009 ranaUpdateComputersc1susvme2 timing problems update

The attached shows the 200 day '10-minute' trend of the CPU meters and also the room temperature.

To my eye there is no correlation between the signals. Its clear that c1susvme2 (SRM LOAD) is going up and no evidence that its temperature.


Attachment 1: Untitled.png
  2042   Fri Oct 2 15:11:44 2009 robUpdateComputersc1susvme2 timing problems update update

It got worse again, starting with locking last night, but it has not recovered.  Attached is a 3-day trend of SRM cpu load showing the good spell.

Attachment 1: srmcpu3.png
  2080   Mon Oct 12 14:51:41 2009 robUpdateComputersc1susvme2 timing problems update update update


It got worse again, starting with locking last night, but it has not recovered.  Attached is a 3-day trend of SRM cpu load showing the good spell.

 Last week, Alex recompiled the c1susvme2 code without the decimation filters for the OUT16 channels, so these channels are now as aliased as the rest of them.  This appears to have helped with the timing issues: although it's not completely cured it is much better.  Attached is a five day trend.

Attachment 1: srmcpu.png
  1905   Fri Aug 14 15:29:43 2009 JenneUpdateComputersc1susvme2 was unmounted from /cvs/cds

When I came in earlier today, I noticed that c1susvme2 was red on the DAQ screens.  Since the vme computers always seem to be happier as a set, I hit the physical reset buttons on sosvme, susvme1 and susvme2.  I then did the telnet or ssh in as appropriate for each computer in turn.  sosvme and susvme1 came back just fine. However, I couldn't cd to /cvs/cds/caltech/target/c1susvme2 while ssh-ed in to susvme2.  I could cd to /cvs/cds, and then did an ls, and it came back totally blank.  There was nothing at all in the folder. 

Yoichi showed me how to do 'df' to figure out what filesystems are mounted, and it looked as though the filesystem was mounted.  But then Yoichi tried to unmount the filesystem, and it claimed that it wasn't mounted at all.  We then remounted the filesystem, and things were good again.  I was able to continue the regular restart procedure, and the computer is back up again.

Recap: c1susvme2 mysteriously got unmounted from /cvs/cds!  But it's back, and the computers are all good again.

  1634   Sat May 30 12:36:52 2009 robUpdateComputersc1susvme2, c1iscex running late

c1susvme2 has been running just a bit late for about a week.  I rebooted it. 

The plot shows SRM_FE_SYNC, which is the number of times in the last second that c1susvme2 was late for the 16k cycle.   Similarly for ETMX.


Attachment 1: srmsync.jpg
Attachment 2: etmxsync.jpg
  1635   Mon Jun 1 13:25:00 2009 robUpdateComputersc1susvme2, c1iscex running late


c1susvme2 has been running just a bit late for about a week.  I rebooted it. 

The plot shows SRM_FE_SYNC, which is the number of times in the last second that c1susvme2 was late for the 16k cycle.   Similarly for ETMX.



The reboot appears to have worked.

Attachment 1: doublesync.jpg
  16365   Wed Sep 29 17:10:09 2021 AnchalSummaryCDSc1teststand problems summary

[anchal, ian]

We went and collected some information for the overlords to fix the c1teststand DAQ network issue.

  • from c1teststand, c1bhd and c1sus2 computers were not accessible through ssh. (No route to host). So we restarted both the computers (the I/O chassis were ON).
  • After the computers restarted, we were able to ssh into c1bhd and c1sus, ad we ran rtcds start c1x06 and rtcds start c1x07.
  • The first page in attachment shows the screenshot of GDS_TP screens of the IOP models after this step.
  • Then we started teh user models by running rtcds start c1bhd and rtcds start c1su2.
  • The second page shows the screenshot of GDS_TP screens. You can notice that DAQ status is red in all the screens and the DC statuses are blank.
  • So we checked if daqd_ services are running in the fb computer. They were not. So we started them all by sudo systemctl start daqd_*.
  • Third page shows the status of all services after this step. the daqd_dc.service remained at failed state.
  • open-mx_stream.service was not even loaded in fb. We started it by running sudo systemctl start open-mx_stream.service.
  • The fourth page shows the status of this service. It started without any errors.
  • However, when we went to check the status of mx_stream.service in c1bhd and c1sus2, they were not loaded and we we tried to start them, they showed failed state and kept trying to start every 3 seconds without success. (See page 5 and 6).
  • Finally, we also took a screenshot of timedatectl command output on the three computers fb, c1bhd, and c1sus2 to show that their times were not synced at all.
  • The ntp service is running on fb but it probably does not have access to any of the servers it is following.
  • The timesyncd on c1bhd and c1sus2 (FE machines) is also running but showing status 'Idle' which suggested they are unable to find the ntp signal from fb.
  • I believe this issue is similar to what jamie ficed in the fb1 on martian network in 40m/16302. Since the fb on c1teststand network was cloned before this fix, it might have this dysfunctional ntp as well.

We would try to get internet access to c1teststand soon. Meanwhile, someone with more experience and knowledge should look into this situation and try to fix it. We need to test the c1teststand within few weeks now.

Attachment 1: c1teststand_issues_summary.pdf
c1teststand_issues_summary.pdf c1teststand_issues_summary.pdf c1teststand_issues_summary.pdf c1teststand_issues_summary.pdf c1teststand_issues_summary.pdf c1teststand_issues_summary.pdf c1teststand_issues_summary.pdf
  16372   Mon Oct 4 11:05:44 2021 AnchalSummaryCDSc1teststand problems summary

[Anchal, Paco]

We tried to fix the ntp synchronization in c1teststand today by repeating the steps listed in 40m/16302. Even though teh cloned fb1 now has the exact same package version, conf & service files, and status, the FE machines (c1bhd and c1sus2) fail to sync to the time. the timedatectl shows the same stauts 'Idle'. We also, dug bit deeper into the error messages of daq_dc on cloned fb1 and mx_stream on FE machines and have some error messages to report here.

Attempt on fixing the ntp

  • We copied the ntp package version 1:4.2.6 deb file from /var/cache/apt/archives/ntp_1%3a4.2.6.p5+dfsg-7+deb8u3_amd64.deb on the martian fb1 to the cloned fb1 and ran.
    controls@fb1:~ 0$ sudo dbpg -i ntp_1%3a4.2.6.p5+dfsg-7+deb8u3_amd64.deb
  • We got error messages about missing dependencies of libopts25 and libssl1.1. We downloaded oldoldstable jessie versions of these packages from here and here. We ensured that these versions are higher than the required versions for ntp. We installed them with:
    controls@fb1:~ 0$ sudo dbpg -i libopts25_5.18.12-3_amd64.deb 
    controls@fb1:~ 0$ sudo dbpg -i libssl1.1_1.1.0l-1~deb9u4_amd64.deb
  • Then we installed the ntp package as described above. It asked us if we want to keep the configuration file, we pressed Y.
  • However, we decided to make the configuration and service files exactly same as martian fb1 to make it same in cloned fb1. We copied /etc/ntp.conf and /etc/systemd/system/ntp.service files from martian fb1 to cloned fb1 in the same positions. Then we enabled ntp, reloaded the daemon, and restarted ntp service:
    controls@fb1:~ 0$ sudo systemctl enable ntp
    controls@fb1:~ 0$ sudo systemctl daemon-reload
    controls@fb1:~ 0$ sudo systemctl restart ntp
  • But ofcourse, since fb1 doesn't have internet access, we got some errors in status of the ntp.service:
    controls@fb1:~ 0$ sudo systemctl status ntp
    ● ntp.service - NTP daemon (custom service)
       Loaded: loaded (/etc/systemd/system/ntp.service; enabled)
       Active: active (running) since Mon 2021-10-04 17:12:58 UTC; 1h 15min ago
     Main PID: 26807 (code=exited, status=0/SUCCESS)
       CGroup: /system.slice/ntp.service
               ├─30408 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:107
               └─30525 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:107
    Oct 04 17:48:42 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
    Oct 04 17:48:52 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
    Oct 04 18:05:05 fb1 ntpd_intres[30525]: host name not found: 0.debian.pool.ntp.org
    Oct 04 18:05:15 fb1 ntpd_intres[30525]: host name not found: 1.debian.pool.ntp.org
    Oct 04 18:05:25 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
    Oct 04 18:05:35 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
    Oct 04 18:21:48 fb1 ntpd_intres[30525]: host name not found: 0.debian.pool.ntp.org
    Oct 04 18:21:58 fb1 ntpd_intres[30525]: host name not found: 1.debian.pool.ntp.org
    Oct 04 18:22:08 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
    Oct 04 18:22:18 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
  • But the ntpq command is giving the saem output as given by ntpq comman in martian fb1 (except for the source servers), that the broadcasting is happening in the same manner:
    controls@fb1:~ 0$ ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ============================================================================== .BCST.          16 u    -   64    0    0.000    0.000   0.000
  • On the FE machines side though, the systemd-timesyncd are still unable to read the time signal from fb1 and show the status as idle:
    controls@c1bhd:~ 3$ timedatectl
          Local time: Mon 2021-10-04 18:34:38 UTC
      Universal time: Mon 2021-10-04 18:34:38 UTC
            RTC time: Mon 2021-10-04 18:34:38
           Time zone: Etc/UTC (UTC, +0000)
         NTP enabled: yes
    NTP synchronized: no
     RTC in local TZ: no
          DST active: n/a
    controls@c1bhd:~ 0$ systemctl status systemd-timesyncd -l
    ● systemd-timesyncd.service - Network Time Synchronization
       Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled)
       Active: active (running) since Mon 2021-10-04 17:21:29 UTC; 1h 13min ago
         Docs: man:systemd-timesyncd.service(8)
     Main PID: 244 (systemd-timesyn)
       Status: "Idle."
       CGroup: /system.slice/systemd-timesyncd.service
               └─244 /lib/systemd/systemd-timesyncd
  • So the time synchronization is still not working. We expected the FE machined to just synchronize to fb1 even though it doesn't have any upstream ntp server to synchronize to. But that didn't happen.
  • I'm (Anchal) working on getting internet access to c1teststand computers.

Digging into mx_stream/daqd_dc errors:

  • We went and changed the Restart fileld in /etc/systemd/system/daqd_dc.service on cloned fb1 to 2. This allows the service to fail and stop restarting after two attempts. This allows us to see the real error message instead of the systemd error message that the service is restarting too often. We got following:
    controls@fb1:~ 3$ sudo systemctl status daqd_dc -l
    ● daqd_dc.service - Advanced LIGO RTS daqd data concentrator
       Loaded: loaded (/etc/systemd/system/daqd_dc.service; enabled)
       Active: failed (Result: exit-code) since Mon 2021-10-04 17:50:25 UTC; 22s ago
      Process: 715 ExecStart=/usr/bin/daqd_dc_mx -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.dc (code=exited, status=1/FAILURE)
     Main PID: 715 (code=exited, status=1/FAILURE)
    Oct 04 17:50:24 fb1 systemd[1]: Started Advanced LIGO RTS daqd data concentrator.
    Oct 04 17:50:25 fb1 daqd_dc_mx[715]: [Mon Oct  4 17:50:25 2021] Unable to set to nice = -20 -error Unknown error -1
    Oct 04 17:50:25 fb1 daqd_dc_mx[715]: Failed to do mx_get_info: MX not initialized.
    Oct 04 17:50:25 fb1 daqd_dc_mx[715]: 263596
    Oct 04 17:50:25 fb1 systemd[1]: daqd_dc.service: main process exited, code=exited, status=1/FAILURE
    Oct 04 17:50:25 fb1 systemd[1]: Unit daqd_dc.service entered failed state.
  • It seemed like the only thing daqd_dc process doesn't like is that mx_stream services are in failed state in teh FE computers. So we did the same process on FE machines to get the real error messages:
    controls@fb1:~ 0$ sudo chroot /diskless/root
    fb1:/ 0#
    fb1:/ 0# sudo nano /etc/systemd/system/mx_stream.service
    fb1:/ 0#
    fb1:/ 0# exit
  • Then I ssh'ed into c1bhd to see the error message on mx_stream service properly.
    controls@c1bhd:~ 0$ sudo systemctl daemon-reload
    controls@c1bhd:~ 0$ sudo systemctl restart mx_stream
    controls@c1bhd:~ 0$ sudo systemctl status mx_stream -l
    ● mx_stream.service - Advanced LIGO RTS front end mx stream
       Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
       Active: failed (Result: exit-code) since Mon 2021-10-04 17:57:20 UTC; 24s ago
      Process: 11832 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
     Main PID: 11832 (code=exited, status=1/FAILURE)
    Oct 04 17:57:20 c1bhd systemd[1]: Starting Advanced LIGO RTS front end mx stream...
    Oct 04 17:57:20 c1bhd systemd[1]: Started Advanced LIGO RTS front end mx stream.
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: send len = 263596
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: mx_connect failed Nic ID not Found in Peer Table
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: c1x06_daq mmapped address is 0x7f516a97a000
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: c1bhd_daq mmapped address is 0x7f516697a000
    Oct 04 17:57:20 c1bhd systemd[1]: mx_stream.service: main process exited, code=exited, status=1/FAILURE
    Oct 04 17:57:20 c1bhd systemd[1]: Unit mx_stream.service entered failed state.
  • c1sus2 shows the same error. I'm not sure I understand these errors at all. But they seem to have nothing to do with timing issuessurprise!

As usual, some help would be helpful

  16376   Mon Oct 4 18:00:16 2021 KojiSummaryCDSc1teststand problems summary

I don't know anything about mx/open-mx, but you also need open-mx,don't you?

controls@c1ioo:~ 0$ systemctl status *mx*
● open-mx.service - LSB: starts Open-MX driver
   Loaded: loaded (/etc/init.d/open-mx)
   Active: active (running) since Wed 2021-09-22 11:54:39 PDT; 1 weeks 5 days ago
  Process: 470 ExecStart=/etc/init.d/open-mx start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/open-mx.service
           └─620 /opt/3.2.88-csp/open-mx-1.5.4/bin/fma -d

● mx_stream.service - Advanced LIGO RTS front end mx stream
   Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
   Active: active (running) since Wed 2021-09-22 12:08:00 PDT; 1 weeks 5 days ago
 Main PID: 5785 (mx_stream)
   CGroup: /system.slice/mx_stream.service
           └─5785 /usr/bin/mx_stream -e 0 -r 0 -w 0 -W 0 -s c1x03 c1ioo c1als c1omc -d fb1:0


  16381   Tue Oct 5 17:58:52 2021 AnchalSummaryCDSc1teststand problems summary

open-mx service is running successfully on the fb1(clone), c1bhd and c1sus.


I don't know anything about mx/open-mx, but you also need open-mx,don't you?

  17083   Tue Aug 16 18:22:59 2022 TegaUpdateComputersc1teststand rack mounting for CDS upgrade

[Tega, Yuta]

I keep getting confused about the purpose of the teststand. The view I am adopting going forward is its use as a platform for testing the compatibility of new hardware upgrade, instead of thinking of it as an independent system that works with old hardware.

The initial idea of clearing 1X7 cannot be done for now, because I missed the deadline for providing a detailed enough plan before Monday power up of the lab, so we are just going to go ahead and use the new rack as was initially intended and get the latest hardware and software tested here.

We mounted the DAQ, subnet and dolphin IX switches, see attachement 1. The mounting ears that came with the dolphin switch did not fit and so could not be used for mounting. We looked around the lab and decided to used one of the NavePoint mounting brackets which we found next to the teststand, see attachment 2.

We plan to move the new rack to the current location of the teststand and use the power connection from there. It is also closer to 1X7 so that moving the front-ends and switches to 1X7 should be straight forward after we complete all CDS upgrade testing.

Attachment 1: IMG_20220816_180157132.jpg
Attachment 2: IMG_20220816_175125874.jpg
  17088   Wed Aug 17 11:10:51 2022 ranaUpdateComputersc1teststand rack mounting for CDS upgrade

we want to be able to run SimPlant on the teststand, test our new controls algorithms, test watchdogs, and any other software upgrades. Ideally in the steady state it will run some plants with suspensions and cavities and we will develop our measurement scripts on there also (e.g. IFOtest).


[Tega, Yuta]

I keep getting confused about the purpose of the teststand. The view I am adopting going forward is its use as a platform for testing the compatibility of new hardware upgrade, instead of thinking of it as an independent system that works with old hardware.

  16697   Thu Mar 3 15:37:40 2022 AnchalSummaryCDSc1teststand restructured

c1teststand has been restructured. There is no port computer called 'c1teststand' anymore. When you ssh into the c1teststand network using ssh c1teststand from inside martian or from outside network using the method mentioned in this wiki page , you would land into chiara (clone) computer and you can navigate into any teststand network computer from there.

I'll be repurposing 1U c1teststand computer into the new c1susaux2 slow machine now. All files from home directory and from /etc directory of former c1teststand have been zipped and stored in /home/controls of chiara (clone). Just a aside, the network configuration of teststand can be done from inside the teststand network, by going to a browser on either fb1 (clone) or chair (clone) and going to address The login and password are same as our usual workstation username and password.

  16271   Fri Aug 6 13:13:28 2021 AnchalUpdateBHDc1teststand subnetwork now accessible remotely

c1teststand subnetwork is now accessible remotely. To log into this network, one needs to do following:

  • Log into nodus or pianosa. (This will only work from these two computers)
  • ssh -CY controls@
  • Password is our usual workstation password.
  • This will log you into c1teststand network.
  • From here, you can log into fb1, chiara, c1bhd and c1sus2  which are all part of the teststand subnetwork.

Just to document the IT work I did, doing this connection was bit non-trivial than usual.

  • The martian subnetwork is created by a NAT router which connects only nodus to outside GC network and all computers within the network have ip addresses 192.168.113.xxx with subnet mask of
  • The cloned test stand network was also running on the same IP address scheme, mostly because fb1 and chiara are clones in this network. So every computer in this network also had ip addresses 192.168.113.xxx.
  • I setup a NAT router to connect to martian network forwarding ssh requests to c1teststand computer. My NAT router creates a separate subnet with IP addresses 10.0.1.xxx and suubnet mask gated through
  • However, the issue is for c1teststand, there are now two networks accessible which have same IP addresses 192.168.113.xxx. So when you try to do ssh, it always search in its local c1teststand subnetwork instead of routing through the NAT router to the martian network.
  • To work around this, I had to manually provide an ip router to c1teststand for connecting to two of the computers (nodus and pianosa) in martian network. This is done by:
    ip route add via dev eno1
    ip route add via dev eno1
  • This gives c1teststand specific path for ssh requests to/from these computers in the martian network.
  16273   Mon Aug 9 10:38:48 2021 AnchalUpdateBHDc1teststand subnetwork now accessible remotely

I had to add following two lines in the /etc/network/interface file to make the special ip routes persistent even after reboot:

post-up ip route add via dev eno1
post-up ip route add via dev eno1

  16382   Tue Oct 5 18:00:53 2021 AnchalSummaryCDSc1teststand time synchronization working now

Today I got a new router that I used to connect the c1teststand, fb1 and chiara. I was able to see internet access in c1teststand and fb1, but not in chiara. I'm not sure why that is the case.

The good news is that the ntp server on fb1(clone) is working fine now and both FE computers, c1bhd and c1sus2 are succesfully synchronized to the fb1(clone) ntpserver. This resolves any possible timing issues in this DAQ network.

On running the IOP and user models however, I see the same errors are mentioned in 40m/16372. Something to do with:

Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: mx_connect failed Nic ID not Found in Peer Table
Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: c1x07_daq mmapped address is 0x7fa4819cc000
Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: c1su2_daq mmapped address is 0x7fa47d9cc000

Thu Oct 7 17:04:31 2021

I fixed the issue of chiara not getting internet. Now c1teststand, fb1 and chiara, all have internet connections. It was the issue of default gateway and interface and findiing the DNS. I have found the correct settings now.

  14239   Tue Oct 9 16:05:29 2018 gautamConfigurationASCc1tst deleted, c1asy deployed.

Setting up c1asy:

  • Backed up old c1tst.mdl as c1tst_old_bak.mdl in /opt/rtcds/userapps/release/cds/c1/models
  • Copied the c1tst model to /opt/rtcds/userapps/release/isc/c1/models/c1asy.mdl as this is where the c1asx.mdl file resides.
  • Backed up original c1rfm.mdl as c1rfm_old.mdl in /opt/rtcds/userapps/release/cds/c1/models (since the old c1tst had an RFM block which is unnecessary).
  • Deleted offending RFM block from c1rfm.mdl.
  • Recompiled and re-installed c1rfm.mdl. Model has not yet been restarted, as I'd like suspension watchdogs to be shutdown, but c1susaux EPICS channels are presently not responsive.
  • Removed c1tst model (C-node91) from /opt/rtcds/caltech/c1/target/gds/param/testpoints.
  • Removed /opt/rtcds/caltech/c1/target/gds/param/tpchn_c1tst.par (at this point, DCUID 91 is free for use by c1asy).
  • Moved c1tst line in /opt/rtcds/caltech/c1/target/daqd/master to "old model definitions models" section.
  • Added /opt/rtcds/caltech/c1/target/gds/param/tpchn_c1asy.par to the master file.
  • Edited/diskless/root.jessie/etc/rtsystab to allow c1asy to be run on c1iscey.
  • Finally, I followed the instructions here to get the channels into frames and make all the indicators green.

Now Yuki can work on copying the simulink model (copy c1asx structure) and implementing the autoalignment servo.

Attachment 1: CDSoverview_ASY.png
  14507   Tue Apr 2 14:53:57 2019 gautamUpdateCDSc1vac added to burt

I deleted references to c1vac1 and c1vac2 (which no longer exist) and added c1vac to the autoburt request file list at /opt/rtcds/caltech/c1/burt/autoburt/requestfilelist

  14641   Tue May 28 09:51:33 2019 gautamUpdateVACc1vac hard-rebooted

The vacuum itself was fine - CC1 gauge reported a pressure of 1.3e-5 torr. Note to self: the C1:Vac-CC1_HORNET_PRESSURE channel, which is the analog readback of the Hornet gauge and which is hooked up to an Acromag ADC in the c1auxex chassis, is independent of the status of the c1vac machine, and so can serve as a diagnostic.

However, I was unable to interact with c1vac in any way, the monitor hooked up directly to it was showing a frozen display. So I hard-rebooted the system. It took a few minutes to come back online - but even after 10 minutes of waiting, still no display. In the process of the reboot, several valves were closed off - when the EPICS processes restart, there are momentary instances where the readback channels get an "undefined" value, which prompts the main interlock process to transition to a "SAFE" state. 

Running df -h, I saw that the /var partition was completely full. Maybe this was somehow interfering with the machine running smoothly? Two files in particular, daemon.log and daemon.log.1 were ~1GB each. The contents of these files seemed to be just the readbacks for the caget and caput commands. So I cleared both these files, and now the /var partition usage is only 26%. I also got the display back up and running on the physical monitor hooked up to the c1vac machine's VGA port. Let's see if this has improved the stability situation. The CPU load is still high (~6-7), with most of this coming from the modbus process. Why is this so high? c1susaux has more Acromag units but claims a much lower load of 0.71. Is the CPU of the c1vac machine somehow inferior?

In the meantime, I ssh-ed into c1vac and restored the "Vacuum normal" valve config. During this little escapade, the main volume pressure rose to ~6e-5 torr. It's coming back down smoothly.

Unrelated to this work: we had turned the RGA off for the vent, I powered it back on and re-initialized it this morning.

Attachment 1: Screen_Shot_2019-05-31_at_12.44.54_PM.png
  14640   Mon May 27 11:37:13 2019 gautamUpdateVACc1vac is unresponsive

I've been monitoring the status of the pumpdown remotely with ndscope lookbacks of C1:Vac-CC1_pressure. Today morning, I saw that the channel was putting out a constant value (signature of EPICS server being frozen). caget did not work either. Then I tried ssh-ing into c1vac to see if there were any issues but I was unable to. The machine isn't responding to ping either. The EPICS value has been frozen since ~1030pm PDT 26 May 2019.

I will try and head to campus later today to check on it. Isn't an email alert or soemthing supposed to be sent out in such an event?

  17081   Mon Aug 15 18:06:07 2022 AnchalUpdateGeneralc1vac issues, 1 pressure gauge died

[Anchal, Paco, Tega]

Disk full issue:

c1vac was showing /var disk to be full. We moved all gunzipped backup logs to /home/controls/logBackUp. This emptied 36% of space on /var. Ideally, we need not log so much. Some solution needs to be found for reducing these log sizes or monitoring them for smart handling.

Pressure sensor malfunctioning:

We were unable to opel the PSL shuttter, due to the interlock with C1:Vac-P1a_pressure. We found that C1:Vac-P1a_pressure is not being written by serial_MKS937a service on c1vac. The issue was the the sensor itself has become bad and needs to be replaced. We believe that "L 0E-04" in the status (C1:Vac-P1a_status) message indicates a malfunctioning sensor.

Quick fix:

We removed writing of C1:Vac-P1a_pressure and C1:Vac-P1a_status from MKS937a and mvoed them to XGS600 which is using the sensor 1 from main volume. See this commit.

Now we are able to open PSL shutter. The sensor should be replaced ASAP and this commit can be reverted then.

  17082   Mon Aug 15 20:09:18 2022 KojiUpdateGeneralc1vac issues, 1 pressure gauge died

- Disk Full: Just use the usual /etc/logrotate thing

- Vacuum gauge

I rather feel not replacing P1a. We used to have Ps and CCs as they didn't cover the entire pressure range. However, this new FRG (=Full Range Gauge) does cover from 1atm to 4nTorr.

Why don't we have a couple of FRG spares, instead?

Questions to Tega: How many FRGs can our XGS-600 controller handle?


ELOG V3.1.3-