I found all the EPICS channels for the model c1ioo on the FE c1ioo to be blank just now. The realtime model itself seemed to be running fine, judging by the IMC alignment (as the WFS loops seemed to still be running okay). I couldn't find any useful info in demsg but I don't know what I'm looking for. So my guess is that somehow the EPICS process for that model died. Unclear why.
Clearly, there is a discrepancy for f>20Hz. Why?
Kira informed me that she was having trouble accessing past data for her PID tuning tests. Looking at the last day of data, it looks like there are frequent EPICS data dropouts, each up to a few hours. Observations (see Attachment #1 for evidence):
It is difficult to diagnose how long this has been going on for, as once you start pulling longer stretches of data on dataviewer, any "data freezes" are washed out in the extent of data plotted.
All slow machines (except c1auxex) were dead today, so I had to key them all. While I was at it, I also decided to update MC autolocker screen. Kira pointed out that I needed to change the EPCIS input type (in the RTCDS model) to a "binary input", as opposed to an "analog input", which I did. Model recompilation and restart went smooth. I had to go into the epics record manually to change the two choices to "ENABLE" and "DISABLE" as opposed to the default "ON" and "OFF". Anyways, long story short, MC autolocker controls are a bit more intuitive now I think.
Reboot for c1susaux and c1iscaux today. ITMX precautions were followed. Reboots went smoothly.
IMC is shuttered while Jon does PLL characterization...
When c1oaf starts up there are 446 gain channels that should be set to 0.0 but which end up at 1.0. An example channel is C1:OAF-ADAPT_CARM_ADPT_ACC1_GAIN. The safe.snap file states that it should be set to 0. After model start up it is at 1.0.
We ran some tests, including modifying the safe.snap to make sure it was reading the snap file we were expecting. For this I set the setpoint to 0.5. After restart of the model we saw that the setpoint went to 0.5 but the epics value remained at 1.0. I then set the snap file back to its original setting. I ran the epics sequencer by hand in a gdb session and verified that the sequencer was setting the field to 0. I also built a custom sequencer that would catch writes by the sdf system to the channel. I only saw one write, the initial write that pushed a 0. I have reverted my changes to the sequencer.
The gain channel can be caput to the correct value and it is not pushed back to 1.0. So there does not appear to be a process actively pushing the value to 1.0. On Rolfs sugestion we ran the sequencer w/o the kernel object loaded, and saw the same behavior.
This will take some thought.
FSS slow wasn't running so PSL PZT voltage was swinging around a lot. Reason was that was c1psl unresponsive. I keyed the crate, now it's okay. Now ITMX is stuck - Johannes just told be about an un-elogged c1susaux reboot. Seems that ITMX got stuck at ~4:30pm yesterday PT. After some shaking, the optic was loosened. Please follow the procedure in future and if you do a reboot, please elog it and verify that the optic didn't get stuck.
Unfortunately, this has happened (and seems like it will happen) enough times that I set up a script for rebooting the machine in a controlled way, hopefully it will negate the need to repeatedly go into the VEA and hard-reboot the machines. Script lives at /opt/rtcds/caltech/c1/scripts/cds/rebootC1LSC.sh. SVN committed. It worked well for me today. All applicable CDS indicator lights are now green again. Be aware that c1oaf will probably need to be restarted manually in order to make the DC light green. Also, this script won't help you if you try to unload a model on c1lsc and the FE crashes. It relies on c1lsc being ssh-able. The basic logic is:
Why is this happening so frequently now? Last few lines of error log:
I fixed it by running the reboot script.
Per this elog, we don't need any AIOut channels or Oplev channels. However, the latest wiring diagram I can find for the EX Acromag situation suggests that these channels are hooked up (physically). If this is true, there are 12 ADC channels that are occupied which we can use for other purposes. Question for Johannes: Is this true? If so, Kira has plenty of channels available for her Temperature control stuff..
As an aside, we found that the EPICS channel names for the TRX/TRY QPD gain stages are somewhat strangely named. Looking closely at the schematic (which has now been added to the 40m DCC tree, we can add out custom mods later), they do (somewhat) add up, but I think we should definitely rename them in a more systematic manner, and use an MEDM screen to indicate stuff like x4 or x20 or "Active" etc. BTW, the EX and EY QPDs have different settings. But at least the settings are changed synchronously for all four quadrants, unlike the WFS heads...
Unrelated: I had to key the c1iscaux and c1auxey crates.
I went through the wiring of the c1auxex crate today to disentangle the pin assignments. The full detail can be found in attachment #1, #2 has less detail but is more eye candy. The red flagged channels are now marked for removal at the next opportunity. This will free up DAQ channels as follows:
This should be enough for temperature sensing, NPRO diagnostics, and even eventual remote PDH control with new servo boxes.
Do we really have 2 free ADC channels at EX now? I was under the impression we had ZERO free, which is why we wanted to put a new ADC unit in. I think in the wiring diagram, the Vacuum gauge monitor channel, Seis Can Temp Sensor monitor, and Seis Can Heater channels are missing. It would also be good to have, in the wiring diagram, a mapping of which signals go to which I/O ports (Dsub, front panel BNC etc) on the 4U(?) box housing all the Acromags, this would be helpful in future debugging sessions.
Bad wording, sorry. Should have been channels in excess of ETMX controls. I'll add the others to the list as well.
Updated channel list and wiring diagram attached. Labels are 'F' for 'Front' and 'R' for - you guessed it - 'Rear', the number identifies the slot panel the breakout is attached to.
pianosa has been upgraded to SL7. I've made a controls user account, added it to sudoers, did the network config, and mounted /cvs/cds using /etc/fstab. Other capabilities are being slowly added, but it may be a while before this workstation has all the kinks ironed out. For now, I'm going to follow the instructions on this wiki to try and get the usual LSC stuff working.
MEDM, EPICS and dataviewer seem to work, but diaggui still doesn't work (it doesn't work on Rossa either, same problem as reported here, does a fix exist?). So looks like only donatella can run diaggui for now. I had to disable the systemd firewall per the instructions page in order to get EPICS to work. Also, there is no MATLAB installed on this machine yet. sshd has been enabled.
Seems like DTT also works now. The trick seems to be to run sudo /usr/bin/diaggui instead of just diaggui. So this is indicative of some conflict between the yum installed gds and the relic gds from our shared drive. I also have to manually change the NDS settings each time, probably there's a way to set all of this up in a more smooth way but I don't know what it is. awggui still doesn't get the correct channels, not sure where I can change the settings to fix that.
DON"T RUN DIAGGUI AS ROOT
I moved the N2 check script and the disk usage checking script from the (sudo) crontab of nodus to the controls user crontab on megatron .
The fardest I can go back on channel C1: Vac_N2pres is 320 days
C1:Vac-CC1_Hornet Presuure gauge started logging Feb. 23, 2018
Did you update the " low N2 message" email addresses?
we disabled logging the N2 Pressure to a text file, since it was filling up disk space. Now it just sends an email to our 40m mailing list, so we'll all get a warning.
The crontab uses the 'bash' version of output redirection '2>&1', which redirects stdout and stderr, but probably we just want stderr, since stdout contains messages without issues and will just fill up again.
Since the lab-wide computer shutdown last Wednesday, all the realtime models running on c1lsc have been flaky. The error is always the same:
[58477.149254] c1cal: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1daf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1ass: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1oaf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1lsc: ADC TIMEOUT 0 10963 19 11027
[58478.148001] c1x04: timeout 0 1000000
[58479.148017] c1x04: timeout 1 1000000
[58479.148017] c1x04: exiting from fe_code()
This has happened at least 4 times since Wednesday. The reboot script makes recovery easier, but doing it once in 2 days is getting annoying, especially since we are running many things (e.g. ASS) in custom configurations which have to be reloaded each time. I wonder why the problem persists even though I've power-cycled the expansion chassis? I want to try and do some IFO characterization today so I'm going to run the reboot script again but I'll get in touch with J Hanks to see if he has any insight (I don't think there are any logfiles on the FEs anyways that I'll wipe out by doing a reboot). I wonder if this problem is connected to DuoTone? But if so, why is c1lsc the only FE with this problem? c1sus also does not have the DuoTone system set up correctly...
The last time this happened, the problem apparently fixed itself so I still don't have any insight as to what is causing the problem in the first place . Maybe I'll try disabling c1oaf since that's the configuration we've been running in for a few weeks.
I spent most of today fighting various CDS errors.
Let's see how stable this configuration is. Onto some locking now...
Stability was short-lived it seems. When I came in this morning, all models on c1lsc were dead already, and now c1sus is also dead (Attachment #1). Moreover, MC1 shadow sensors failed for a brief period again this afternoon (Attachment #2). I'm going to wait for some CDS experts to take a look at this since any fix I effect seems to be short-lived. For the MC1 shadow sensors, I wonder if the Trillium box (and associated Sorensen) failure somehow damaged the MC1 shadow sensor/coil driver electronics.
I've left the c1lsc frontend shutdown for now, to see if c1sus and c1ioo can survive without any problems overnight. In parallel, we are going to try and debug the MC1 OSEM Sensor problem - the idea will be to disable the bias voltage to the OSEM LEDs, and see if the readback channels still go below zero, this would be a clear indication that the problem is in the readback transimpedance stage and not the LED. Per the schematic, this can be done by simply disconnecting the two D-sub connectors going to the vacuum flange (this is the configuration in which we usually use the sat box tester kit for example). Attachment #1 shows the current setup at the PD readout board end. The dark DC count (i.e. with the OSEM LEDs off) is ~150 cts, while the nominal level is ~1000 cts, so perhaps this is already indicative of something being broken but let's observe overnight.
Overnight, all models on c1sus and c1ioo seem to have had no stability issues, supporting the hypothesis that timing issues stem from c1lsc. Moreover, the MC1 shadow sensor readouts showed no negative values over a ~12hour period. I think we should just observe this for another day, in any case I don't think there is any urgent IFO related activity scheduled.
I am starting the c1x04 model (IOP) on c1lsc to see how it behaves overnight.
Well, there was apparently an immediate reaction - all the models on c1sus and c1ioo reported an ADC timeout and crashed. I'm going to reboot them and still have c1x04 IOP running, to see what happens.
[97544.431561] c1pem: ADC TIMEOUT 3 8703 63 8767
[97544.431574] c1mcs: ADC TIMEOUT 1 8703 63 8767
[97544.431576] c1sus: ADC TIMEOUT 1 8703 63 8767
[97544.454746] c1rfm: ADC TIMEOUT 0 9033 9 8841
As part of this slow but systematic debugging, I am turning on the c1lsc model overnight to see if the model crashes return.
The model seems to have run without issues overnight. Not completely related, but the MC1 shadow sensor signals also don't show any abnormal excursions to negative values in the last 48 hours. I'm thinking about re-connecting the satellite box (but preserving the breakout setup at 1X6 for a while longer) and re-locking the IMC. I'll also start c1ass on the c1lsc frontend. I would say that the other models on c1lsc (i.e. c1oaf, c1cal, c1daf) aren't really necessary for basic IFO operation.
After this work of increasing the series resistance on ETMX, there have been numerous occassions where the insufficient misalignment of ETMX has caused problems in locking vertex cavities. Today, I modified the script (located at /opt/rtcds/caltech/c1/medm/MISC/ifoalign/AlignSoft.py) to avoid such problems. The way the misalign script works is to write an offset value to the "TO_COIL" filter bank (accessed via "Output Filters" button on the Suspension master MEDM screen - not the most intuitive place to put an offset but okay). So I just increased the value of this offset from 250 counts to 2500 counts (for ETMX only). I checked that the script works, now when both ETMs are misaligned, the AS55Q signal shows a clean Michelson-like sine wave as it fringes instead of having the arm cavity PDH fringes as well .
Note that the svn doesn't seem to work on the newly upgraded SL7 machines: svn status gives me the following output.
Is it safe to run 'svn upgrade'? Or is it time to migrate to git.ligo.org/40m/scripts?
Starting c1cal now, let's see if the other c1lsc FE models are affected at all... Moreover, since MC1 seems to be well-behaved, I'm going to restore the nominal eurocrate configuration (sans extender board) tomorrow.
When I came in this morning no light was reaching the MC. One fast machine was dead, c1lsc, and a number of the slow machines: c1susaux, c1iool0, c1auxex, c1auxey, c1iscaux. Gautam walked me through reseting the slow machines manually and the fast machines via the reboot script. The computers are all back online and the MC is again able to lock.
I found c1lsc unresponsive again today. Following the procedure in elog #13935, I ran the rebootC1LSC.sh script to perform a soft reboot of c1lsc and restart the epics processes on c1lsc, c1sus, and c1ioo. It worked. I also manually restarted one unresponsive slow machine, c1aux.
After the restarts, the CDS overview page shows the first three models on c1lsc are online (image attached). The above elog references c1oaf having to be restarted manually, so I attempted to do that. I connect via ssh to c1lsc and ran the script startc1oaf. This failed as well, however.
In this state I was able to lock the MICH configuration, which is sufficient for my purposes for now, but I was not able to lock either of the arm cavities. Are some of the still-dead models necessary to lock in resonant configurations?
c1lsc crashed again. I've contacted Rolf/JHanks for help since I'm out of ideas on what can be done to fix this problem.
Rolf came by today morning. For now, we've restarted the FE machine and the expansion chassis (note that the correct order in which to do this is: turn off computer--->turn off expansion chassis--->turn on expansion chassis--->turn on computer). The debugging measures Rolf suggested are (i) to replace the old generation ADC card in the expansion chassis which has a red indicator light always on and (ii) to replace the PCIe fiber (2010 make) running from the c1lsc front-end machine in 1X6 to the expansion chassis in 1Y3, as the manufacturer has suggested that pre-2012 versions of the fiber are prone to failure. We will do these opportunistically and see if there is any improvement in the situation.
Another tip from Rolf: if the c1lsc FE is responsive but the models have crashed, then doing sudo reboot by ssh-ing into c1lsc should suffice* (i.e. it shouldn't take down the models on the other vertex FEs, although if the FE is unresponsive and you hard reboot it, this may still be a problem). I'll modify I've modified the c1lsc reboot script accordingly.
* Seems like this can still lead to the other vertex FEs crashing, so I'm leaving the reboot script as is (so all vertex machines are softly rebooted when c1lsc models crash).
Todd E. came by this morning and gave us (i) 1x new ADC card and (ii) 1x roll of 100m (2017 vintage) PCIe fiber. This afternoon, I replaced the old ADC card in the c1lsc expansion chassis, and have returned the old card to Todd. The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3), but hopefully the problem was the ADC card with red indicator light, and replacing it has solved the issue. CDS is back to what is now the nominal state (Attachment #1) and Yarm is locked for Jon to work on his IFOcoupling study. We will monitor the stability in the coming days.
(i) to replace the old generation ADC card in the expansion chassis which has a red indicator light always on and (ii) to replace the PCIe fiber (2010 make) running from the c1lsc front-end machine in 1X6 to the expansion chassis in 1Y3, as the manufacturer has suggested that pre-2012 versions of the fiber are prone to failure. We will do these opportunistically and see if there is any improvement in the situation.
Looks like the ADC was not to blame, same symptoms persist.
The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3), but hopefully the problem was the ADC card with red indicator light, and replacing it has solved the issue.
Gautam and I restarted the models on c1lsc, c1ioo, and c1sus. The LSC system is functioning again. We found that only restarting c1lsc as Rolf had recommended did actually kill the models running on the other two machines. We simply reverted the rebootC1LSC.sh script to its previous form, since that does work. I'll keep using that as required until the ongoing investigations find the source of the problem.
[steve, yuki, gautam]
The plastic tubing/housing for the fiber arrived a couple of days ago. We routed ~40m of fiber through roughly that length of the tubing this morning, using some custom implements Steve sourced. To make sure we didn't damage the fiber during this process, I'm now testing the vertex models with the plastic tubing just routed casually (= illegally) along the floor from 1X4 to 1Y3 (NOTE THAT THE WIKI PAGE DIAGRAM IS OUT OF DATE AND NEEDS TO BE UPDATED), and have plugged in the new fiber to the expansion chassis and the c1lsc front end machine. But I'm seeing a DC error (0x4000), which is indicative of some sort of timing error (Attachment #1) **. Needs more investigation...
Pictures + more procedural details + proper routing of the protected fiber along cable trays after lunch. If this doesn't help the stability problem, we are out of ideas again, so fingers crossed...
** In the past, I have been able to fix the 0x4000 error by manually rebooting fb (simply restarting the daqd processes on fb using sudo systemctl restart daqd_* doesn't seem to fix the problem). Sure enough, seems to have done the job this time as well (Attachment #2). So my initial impression is that the new fiber is functioning alright .
The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3)
This didn't go as smoothly as planned. While there were no issues with the new fiber over the ~3 hours that I left it plugged in, I didn't realize the fiber has distinct ends for the "HOST" and "TARGET" (-5 points to me I guess). So while we had plugged in the ends correctly (by accident) for the pre-lunch test, while routing the fiber on the overhead cable tray, we switched the ends (because the "HOST" end of the cable is close to the reel and we felt it would be easier to do the routing the other way.
Anyway, we will fix this tomorrow. For now, the old fiber was re-connected, and the models are running. IMC is locked.
[steve, koji, gautam]
We took another pass at this today, and it seems to have worked - see Attachment #1. I'm leaving CDS in this configuration so that we can investigate stability. IMC could be locked. However, due to the vacuum slow machine having failed, we are going to leave the PSL shutter closed over the weekend.
Multiple realtime processes on c1sus are suffering from frequent time outs. It eventually knocks out c1sus (process).
Obviously this has started since the fiber swap this afternoon.
gautam 10pm: there are no clues as to the origin of this problem on the c1sus frontend dmesg logs. The only clue (see Attachment #3) is that the "ADC" error bit in the CDS status word is red - but opening up the individual ADC error log MEDM screens show no errors or overflows. Not sure what to make of this. The IOP model on this machine (c1x02) reports an error in the "Timing" bit of the CDS status word, but from the previous exchange with Rolf / J Hanks, this is down to a misuse of ADC0 Ch31 which is supposed to be reserved for a DuoTone diagnostic signal, but which we use for some other signal (one of the MC suspension shadow sensors iirc). The response is also not consistent with this CDS manual - which suggests that an "ADC" error should just kill the models. There are no obvious red indicator lights in the c1sus expansion chassis either.
We had another crash of c1sus and Gautam did full power cycling of c1sus. It was a sturggle to recover all the frontends, but this solved the timing issue.
We went through full reset of c1sus, and rebooting all the other RT hosts, as well as daqd and fb1.
DASWG is not what we want to use for config; we should use the K. Thorne LLO instructions, like I did for ROSSA.
Let's install Jamie's new Data Viewer
This problem resurfaced, which I noticed when I couldn't get the single arm locks going.
The fix was NOT restarting the c1rfm model, which just brought the misery of all vertex FEs crashing and the usual dance to get everything back.
Restarting the sender models (i.e. c1scx and c1scy) seems to have done the trick though.
NDscope is now running on pianosa. To be really useful, we need the templates, so I've made /users/Templates/NDScope_templates where these will be stored. Perhaps someone can write a parser to convert dataviewer .xml to something ndscope can understand. To get it installed, I had to run:
sudo yum install ndscope
sudo yum install python34-gpstime
sudo yum install python34-dateutil
sudo yum install python34-requests
I also changed the pythonpath variable to include the python3.4 site-packages library in .bashrc
We looked into the /frames situation a bit tonight. Here is a summary:
Plan of action:
BTW - the last chiara (shared drive) backup was October 16 6 am. dmesg showed a bunch of errors, Koji is now running fsck in a tmux session on chiara, let's see if that repairs the errors. We missed the opportunity to swap in the 4TB backup disk, so we will do this at the next opportunity.
fsck of chiara backup disk (UUID="90a5c98a-22fb-4685-9c17-77ed07a5e000") was done. But this required many files to be fixed. So the backed-up files are not reliable now.
On the top of that, the disk became not recognized from the machine.
I went to the disk and disconnected the USB and then the power supply, which was/is connected to the UPS.
Then they are reconnected again. This made the disk came back as /media/90a5c98a-22fb-4685-9c17-77ed07a5e000. (*)
After unmounting this disk, I ran "sudo mount -a" to follow the way of mounting as fstab does.
Now I am running the backup script manually so that we can pretend to maintain a snapshot of the day at least.
(*) This is the same situation we found at the recovery from the power shutdown. So my hypothesis is that on Oct 16 at 7 AM during the backup there was a USB failure or disk failure or something which unmounted the disk. This caused some files got damaged. Also this caused the disk mounted as /media/90a5c98a-22fb-4685-9c17-77ed07a5e000. So since then, we did not have the backup.
Update (20:00): The disk connection failed again. I think this disk is no longer reliable.
sudo fsck -yV UUID="90a5c98a-22fb-4685-9c17-77ed07a5e000" [238/276]
[sudo] password for controls:
fsck from util-linux 2.20.1
[/sbin/fsck.ext4 (1) -- /media/40mBackup] fsck.ext4 -y /dev/sde1
e2fsck 1.42 (29-Nov-2011)
/dev/sde1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Error reading block 527433852 (Attempt to read block from filesystem resulted in
short read) while getting next inode from scan. Ignore error? yes
Added the following channels to C0EDCU.ini:
Also modified the old P1 channel to
Unfortunately, we realized too late that we don't have these channels in the frames, so we don't have the data from this test pumpdown logged, but we will have future stuff. I say we should also log diagnostics from the pumps, such as temperature, current etc. After making the changes, I restarted the daqd processes.
Things to add to ASA wiki page once the wiki comes back online: