[ Gautam , Steve ]
c1susaux & c1iscaux were rebooted manually.
Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. PMC has been relocked. ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet, it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.
The script can be executed by ./testSlowMachines.bash.
None of the 3 dd backups I made were bootable - at boot, selecting the drive put me into grub rescue mode, which seemed to suggest that the /boot partition did not exist on the backed up disk, despite the fact that I was able to mount this partition on a booted computer. Perhaps for the same reason, but maybe not.
After going through various StackOverflow posts / blogs / other googling, I decided to try cloning the drives using ddrescue instead of dd.
This seems to have worked for nodus - I was able to boot to console on the machine called rosalba which was lying around under my desk. I deliberately did not have this machine connected to the martian network during the boot process for fear of some issues because of having multiple "nodus"-es on the network, so it complained a bit about starting the elog and other network related issues, but seems like we have a plug-and-play version of the nodus root filesystem now.
chiara and fb1 rootfs backups (made using ddrescue) are still not bootable - I'm working on it.
Nov 6 2017: I am now able to boot the chiara backup as well - although mysteriously, I cannot boot it from the machine called rosalba, but can boot it from ottavia. Anyways, seems like we have usable backups of the rootfs of nodus and chiara now. FB1 is still a no-go, working on it.
Looks to have worked this time around.
controls@fb1:~ 0$ sudo dd if=/dev/sda of=/dev/sdc bs=64K conv=noerror,sync
33554416+0 records in
33554416+0 records out
2199022206976 bytes (2.2 TB) copied, 55910.3 s, 39.3 MB/s
You have new mail in /var/mail/controls
I was able to mount all the partitions on the cloned disk. Will now try booting from this disk on the spare machine I am testing in the office area now. That'd be a "real" test of if this backup is useful in the event of a disk failure.
Eurocrate key turning reboots today morning for c1psl and c1aux.c1auxex and c1auxey are also down but I didn't bother keying them for now. PSL FSS slow loop is now active again (its inactivity was what prompted me to check status of the slow machines).
Note that the EPCIS channels for PSL shutter are hosted on c1aux.But looks like the slow machine became unresponsive at some point during the weekend, so plotting the trend data for the PSL shutter channel would have you believe that the PSL shutter was open all the time. But the MC_REFL DC channel tells a different story - it suggests that the PSL shutter was closed at ~4AM on Sunday, presumably by the vacuum interlock system. I wonder:
Eurocrate key turning reboots today morning for and c1susaux, c1auxex and c1auxey. Usual precautions were taken to minimize risk of ITMX getting stuck.
The IFO hasn't been aligned in ~1week, so I recovered arm and PRM alignment by locking individual arms and also PRMI on carrier. I will try recovering DRMI locking in the evening.
As far as MC1 glitching is concerned, there hasn't been any major one (i.e. one in which MC1 is kicked by such a large amount that the autolocker can't lock the IMC) for the past 2 months - but the MC WFS offsets are an indication of when smaller glitches have taken place, and there were large DC offsets on the MC WFS servo outputs, which I offloaded to the DC MC suspension sliders using the MC WFS relief script.
I'd like for the save-restore routine that runs when the slow machines reboot to set the watchdog state default to OFF (currently, after a key-turning reboot, the watchdogs are enabled by default), but I'm not really sure how this whole system works. The relevant files seem to be in the directory /cvs/cds/caltech/target/c1susaux. There is a script in there called startup.cmd, which seems to be the initialization script that runs when the slow machine is rebooted. But looking at this file, it is not clear to me where the default values are loaded from? There are a few "saverestore" files in this directory as well:
Are the "default" channel values loaded from one of these?
I wanted to use the foton.py utility for my NB tool, and I remember Chris telling me that it was shipping as standard with the newer versions of gds. It wasn't available in the versions of gds available on our workstations - the default version is 2.15.1. So I downloaded gds-2.17.15 from http://software.ligo.org/lscsoft/source/, and installed it to /ligo/apps/linux-x86_64/gds-2.17.15/gds-2.17.15. In it, there is a file at GUI/foton/foton.py.in - this is the one I needed.
Turns out this was more complicated than I expected. Building the newer version of gds throws up a bunch of compilation errors. Chris had pointed me to some pre-built binaries for ubuntu12 on the llo cds wiki, but those versions of gds do not have foton.py. I am dropping this for now.
We probably want to get a dedicated machine that will handle the EPICS channel serving for the Acromag system
This is the machine that Larry suggested when I asked him for his opinion on a low workload rack-mount unit. It only has an atom processor, but I don't think it needs anything particularly powerful under the hood. He said that we will likely be able to let us borrow one of his for a couple days to see if it's up to the task. The dual ethernet is a nice touch, maybe we can keep the communication between the server and the DAQ units on their separate local network.
I noticed yesterday evening that I wasn't able to engage the single arm locking servos - turned out that they weren't getting triggered, which in turn pointed me to the fact that the arm transmssion channels seemed dead. Poking around a little, I found that there was a red light on the CDS overview screen for c1rfm.
Not sure how to debug further...
* Fix seems to be to restart the sender RFM models (c1scx, c1scy, c1asx, c1asy).
[Koji, Jamie(remote), gautam]
Problems raised in elogs in the thread of 13474 and also 13436 seem to be solved.
I would make a detailed post with how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can addto /correct me):
Some other general remarks:
Koji suggested trying to simply retsart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted sometime ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs?
The Xarm is currently in its original state, all cables are connected and c1auxex is hosting the slow channels.
This should still work, but the address has changed. The daqd was split up into three separate binaries to get around the issue with the monolithic build that we could never figure out. The address of the data concentrator (DC) (which is the thing that needs to be restarted) is now 8083.
I don't think the problem is fb1. The fb1 NFS is mostly only used during front end boot. It's the rtcds mount that's the one that sees all the action, which is being served from chiara.
Looking at the dmesg on c1iscex for example, at least part of the problem seems to be associated with FB1 (192.168.113.201, see Attachment #1). The "server" can be unresponsive for O(100) seconds, which is consistent with the duration for which we see the MEDM status lights go blank, and the EPICS records get frozen. Note that the error timestamped ~4000 was from last night, which means there have been at least 2 more instances of this kind of freeze-up overnight.
I don't know if this is symptomatic of some more widespread problem with the 40m networking infrastructure. In any case, all the CDS overview screen lights were green today morning, and MC autolocker seems to have worked fine overnight.
I have also updated the wiki page with the updated daqd restart commands.
Unrelated to this work - Koji fixed up the MC overview screen such that the MC autolocker button is now visible again. The problem seems to do with me migrating some of the c1ioo EPICS channels from the slow machine to the fast system, as a result of which the EPICS variable type changed from "ENUM" to something that was not "ENUM". In any case, the button exists now, and the MC autolocker blinky light is responsive to its state.
Eurocrate key turning reboots today morning for and c1susaux, c1auxey and c1iscaux. These were responding to ping but not telnet-able. Usual precautions were taken to minimize risk of ITMX getting stuck.
MC autolocker got stuck (judging by wall StripTool traces, it has been this way for ~7 hours) because c1psl was unresponsive so I power cycled it. Now MC is locked.
We'd like to setup the recording of the PSL diagnostic connector Acromag channels in a more robust way - the objective is to assess the long term performance of the Acromag DAQ system, glitch rates etc. At the Wednesday meeting, Rana suggested using c1ioo to run the IOC processes - the advantage being that c1ioo has the systemd utility, which seems to be pretty reliable in starting up various processes in the event of the computer being rebooted for whatever reason. Jamie pointed out that this may not be the best approach however - because all the FEs get the list of services to run from their common shared drive mount point, it may be that in the event of a power failure for example, all of them try and start the IOC processes, which is presumably undesirable. Furthermore, Johannes reported the necessity for the procServ utility to be able to run the modbusIOC process in the background - this utility is not available on any of the FEs currently, and I didn't want to futz around with trying to install it.
One alternative is to connect the PSL Acromag also to the Supermicro computer Johannes has set up at the Xend - it currently has systemd setup to run the modbusIOC, so it has all the utilities necessary. Or else, we could use optimus, which has systemd, and all the EPICS dependencies required. I feel less wary of trying to install procServ on optimus too. Thoughts?
c1psl, c1susaux, and c1auxey today
I was poking around at the LSC rack to try and set up a temporary arrangement whereby I take the signals from the DAC differentially and route them to the D990694 differentially. The situation is complicated by the fact that, afaik, we don't have any break out boards for the DIN96 connectors on the back of all our Eurocrate cards (or indeed for many of the other funky connecters we have like IDE/IDC 10,50 etc etc). I've asked Steve to look into ordering a few of these. So I tried to put together a hacky solution with an expansion card and an IDC64 connector. I must have accidentally shorted a pair of DAC pins or something, because all models on the c1lsc FE crashed. On attempting to restart them (c1lsc was still ssh-able), the usual issue of all vertex FEs crashing happened. It required several iterations of me walking into the lab to hard-reboot FEs, but everything is back green now, and I see the AS beam on the camera so the input pointing of the TTs is roughly back where it was. Y arm TEM00 flashes are also seen. I'm not going to re-align the IFO tonight. Maybe I'll stick to using a function generator for the THD tests, probably routing non AI-ed signals directly is as bad as any timing asynchronicity between funcGen and DAQ system...
I wanted to lock the single arm POX/POY config to do some tests on the BeatMouth. But I was unable to.
Not sure what to make of all this, but I can lock the arms now.
To make this setup more permanent, I modified the c1lsc model to pipe the LO power monitor signals from the Demod chassis to unused channels ADC_0_25 (X channel LO) and ADC_0_26 (Y channel LO) in the c1lsc model. I also added a couple of CDS filter blocks inside the "ALS" namespace block in c1lsc so as to allow for calibration from counts to dBm. I didn't add any DQ channels for now as I think the slow EPICS records will be sufficient for diagnostics. It is kind of overkill to use the fast channels for DC voltage monitoring, but until we have acromag channels readily accessible at 1Y2, this will do.
Modified model compiled and installed successfully, though I have yet to restart it given that I'll likely have to do a major reboot of all vertex FEs
It's been a while - but today, all slow machines (with the exception of c1auxex) were un-telnetable. c1psl, c1iool0, c1susaux, c1iscaux1, c1iscaux2, c1aux and c1auxey were rebooted. Usual satellite box unplugging was done to avoid ITMX getting stuck.
I'm probably doing something stupid - but I've not been able to figure this out. In the MC1 and MC3 coil driver filter banks, we have a digital "28HzELP" filter module in FM9. Attachment #1 shows the MC1 filterbanks. In the shown configuration, I would expect the only difference between the "IN1" and "OUT" testpoints to be the transfer function of said ELP filter, after all, it is just a bunch of multiplications by filter coefficients. But yesterday looking at some DTT traces, their shapes looked suspicious. So today, I did the analysis entirely offline (motivation being to rule out DTT weirdness) using scipy's welch. Attachment #2 shows the ASDs of the IN1 and OUT testpoint data (collected for 30s, fft length is set to 2 seconds, and hanning window from scipy is used). I've also plotted the "expected" spectral shape, by loading the sos coefficients from the filter file and using scipy to compute the transfer function.
Clearly, there is a discrepancy for f>20Hz. Why?
Code used to generate this plot (and also a datafile to facilitate offline plotting) is attached in the tarball Attachment #3. Note that I am using a function from my Noise Budget repo to read in the Foton filter file...
*ChrisW suggested ruling out spectral leakage. I re-ran the script with (i) 180 seconds of data (ii) fft length of 15 seconds and (iii) blackman-harris window instead of Hanning. Attachment #4 shows similar discrepancy between expectation and measurement...
I found all the EPICS channels for the model c1ioo on the FE c1ioo to be blank just now. The realtime model itself seemed to be running fine, judging by the IMC alignment (as the WFS loops seemed to still be running okay). I couldn't find any useful info in demsg but I don't know what I'm looking for. So my guess is that somehow the EPICS process for that model died. Unclear why.
Kira informed me that she was having trouble accessing past data for her PID tuning tests. Looking at the last day of data, it looks like there are frequent EPICS data dropouts, each up to a few hours. Observations (see Attachment #1 for evidence):
It is difficult to diagnose how long this has been going on for, as once you start pulling longer stretches of data on dataviewer, any "data freezes" are washed out in the extent of data plotted.
All slow machines (except c1auxex) were dead today, so I had to key them all. While I was at it, I also decided to update MC autolocker screen. Kira pointed out that I needed to change the EPCIS input type (in the RTCDS model) to a "binary input", as opposed to an "analog input", which I did. Model recompilation and restart went smooth. I had to go into the epics record manually to change the two choices to "ENABLE" and "DISABLE" as opposed to the default "ON" and "OFF". Anyways, long story short, MC autolocker controls are a bit more intuitive now I think.
Reboot for c1susaux and c1iscaux today. ITMX precautions were followed. Reboots went smoothly.
IMC is shuttered while Jon does PLL characterization...
When c1oaf starts up there are 446 gain channels that should be set to 0.0 but which end up at 1.0. An example channel is C1:OAF-ADAPT_CARM_ADPT_ACC1_GAIN. The safe.snap file states that it should be set to 0. After model start up it is at 1.0.
We ran some tests, including modifying the safe.snap to make sure it was reading the snap file we were expecting. For this I set the setpoint to 0.5. After restart of the model we saw that the setpoint went to 0.5 but the epics value remained at 1.0. I then set the snap file back to its original setting. I ran the epics sequencer by hand in a gdb session and verified that the sequencer was setting the field to 0. I also built a custom sequencer that would catch writes by the sdf system to the channel. I only saw one write, the initial write that pushed a 0. I have reverted my changes to the sequencer.
The gain channel can be caput to the correct value and it is not pushed back to 1.0. So there does not appear to be a process actively pushing the value to 1.0. On Rolfs sugestion we ran the sequencer w/o the kernel object loaded, and saw the same behavior.
This will take some thought.
FSS slow wasn't running so PSL PZT voltage was swinging around a lot. Reason was that was c1psl unresponsive. I keyed the crate, now it's okay. Now ITMX is stuck - Johannes just told be about an un-elogged c1susaux reboot. Seems that ITMX got stuck at ~4:30pm yesterday PT. After some shaking, the optic was loosened. Please follow the procedure in future and if you do a reboot, please elog it and verify that the optic didn't get stuck.
Unfortunately, this has happened (and seems like it will happen) enough times that I set up a script for rebooting the machine in a controlled way, hopefully it will negate the need to repeatedly go into the VEA and hard-reboot the machines. Script lives at /opt/rtcds/caltech/c1/scripts/cds/rebootC1LSC.sh. SVN committed. It worked well for me today. All applicable CDS indicator lights are now green again. Be aware that c1oaf will probably need to be restarted manually in order to make the DC light green. Also, this script won't help you if you try to unload a model on c1lsc and the FE crashes. It relies on c1lsc being ssh-able. The basic logic is:
Why is this happening so frequently now? Last few lines of error log:
I fixed it by running the reboot script.
Per this elog, we don't need any AIOut channels or Oplev channels. However, the latest wiring diagram I can find for the EX Acromag situation suggests that these channels are hooked up (physically). If this is true, there are 12 ADC channels that are occupied which we can use for other purposes. Question for Johannes: Is this true? If so, Kira has plenty of channels available for her Temperature control stuff..
As an aside, we found that the EPICS channel names for the TRX/TRY QPD gain stages are somewhat strangely named. Looking closely at the schematic (which has now been added to the 40m DCC tree, we can add out custom mods later), they do (somewhat) add up, but I think we should definitely rename them in a more systematic manner, and use an MEDM screen to indicate stuff like x4 or x20 or "Active" etc. BTW, the EX and EY QPDs have different settings. But at least the settings are changed synchronously for all four quadrants, unlike the WFS heads...
Unrelated: I had to key the c1iscaux and c1auxey crates.
I went through the wiring of the c1auxex crate today to disentangle the pin assignments. The full detail can be found in attachment #1, #2 has less detail but is more eye candy. The red flagged channels are now marked for removal at the next opportunity. This will free up DAQ channels as follows:
This should be enough for temperature sensing, NPRO diagnostics, and even eventual remote PDH control with new servo boxes.
Do we really have 2 free ADC channels at EX now? I was under the impression we had ZERO free, which is why we wanted to put a new ADC unit in. I think in the wiring diagram, the Vacuum gauge monitor channel, Seis Can Temp Sensor monitor, and Seis Can Heater channels are missing. It would also be good to have, in the wiring diagram, a mapping of which signals go to which I/O ports (Dsub, front panel BNC etc) on the 4U(?) box housing all the Acromags, this would be helpful in future debugging sessions.
Bad wording, sorry. Should have been channels in excess of ETMX controls. I'll add the others to the list as well.
Updated channel list and wiring diagram attached. Labels are 'F' for 'Front' and 'R' for - you guessed it - 'Rear', the number identifies the slot panel the breakout is attached to.
pianosa has been upgraded to SL7. I've made a controls user account, added it to sudoers, did the network config, and mounted /cvs/cds using /etc/fstab. Other capabilities are being slowly added, but it may be a while before this workstation has all the kinks ironed out. For now, I'm going to follow the instructions on this wiki to try and get the usual LSC stuff working.
MEDM, EPICS and dataviewer seem to work, but diaggui still doesn't work (it doesn't work on Rossa either, same problem as reported here, does a fix exist?). So looks like only donatella can run diaggui for now. I had to disable the systemd firewall per the instructions page in order to get EPICS to work. Also, there is no MATLAB installed on this machine yet. sshd has been enabled.
Seems like DTT also works now. The trick seems to be to run sudo /usr/bin/diaggui instead of just diaggui. So this is indicative of some conflict between the yum installed gds and the relic gds from our shared drive. I also have to manually change the NDS settings each time, probably there's a way to set all of this up in a more smooth way but I don't know what it is. awggui still doesn't get the correct channels, not sure where I can change the settings to fix that.
DON"T RUN DIAGGUI AS ROOT
I moved the N2 check script and the disk usage checking script from the (sudo) crontab of nodus to the controls user crontab on megatron .
The fardest I can go back on channel C1: Vac_N2pres is 320 days
C1:Vac-CC1_Hornet Presuure gauge started logging Feb. 23, 2018
Did you update the " low N2 message" email addresses?
we disabled logging the N2 Pressure to a text file, since it was filling up disk space. Now it just sends an email to our 40m mailing list, so we'll all get a warning.
The crontab uses the 'bash' version of output redirection '2>&1', which redirects stdout and stderr, but probably we just want stderr, since stdout contains messages without issues and will just fill up again.
Since the lab-wide computer shutdown last Wednesday, all the realtime models running on c1lsc have been flaky. The error is always the same:
[58477.149254] c1cal: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1daf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1ass: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1oaf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1lsc: ADC TIMEOUT 0 10963 19 11027
[58478.148001] c1x04: timeout 0 1000000
[58479.148017] c1x04: timeout 1 1000000
[58479.148017] c1x04: exiting from fe_code()
This has happened at least 4 times since Wednesday. The reboot script makes recovery easier, but doing it once in 2 days is getting annoying, especially since we are running many things (e.g. ASS) in custom configurations which have to be reloaded each time. I wonder why the problem persists even though I've power-cycled the expansion chassis? I want to try and do some IFO characterization today so I'm going to run the reboot script again but I'll get in touch with J Hanks to see if he has any insight (I don't think there are any logfiles on the FEs anyways that I'll wipe out by doing a reboot). I wonder if this problem is connected to DuoTone? But if so, why is c1lsc the only FE with this problem? c1sus also does not have the DuoTone system set up correctly...
The last time this happened, the problem apparently fixed itself so I still don't have any insight as to what is causing the problem in the first place . Maybe I'll try disabling c1oaf since that's the configuration we've been running in for a few weeks.
I spent most of today fighting various CDS errors.
Let's see how stable this configuration is. Onto some locking now...
Stability was short-lived it seems. When I came in this morning, all models on c1lsc were dead already, and now c1sus is also dead (Attachment #1). Moreover, MC1 shadow sensors failed for a brief period again this afternoon (Attachment #2). I'm going to wait for some CDS experts to take a look at this since any fix I effect seems to be short-lived. For the MC1 shadow sensors, I wonder if the Trillium box (and associated Sorensen) failure somehow damaged the MC1 shadow sensor/coil driver electronics.
I've left the c1lsc frontend shutdown for now, to see if c1sus and c1ioo can survive without any problems overnight. In parallel, we are going to try and debug the MC1 OSEM Sensor problem - the idea will be to disable the bias voltage to the OSEM LEDs, and see if the readback channels still go below zero, this would be a clear indication that the problem is in the readback transimpedance stage and not the LED. Per the schematic, this can be done by simply disconnecting the two D-sub connectors going to the vacuum flange (this is the configuration in which we usually use the sat box tester kit for example). Attachment #1 shows the current setup at the PD readout board end. The dark DC count (i.e. with the OSEM LEDs off) is ~150 cts, while the nominal level is ~1000 cts, so perhaps this is already indicative of something being broken but let's observe overnight.
Overnight, all models on c1sus and c1ioo seem to have had no stability issues, supporting the hypothesis that timing issues stem from c1lsc. Moreover, the MC1 shadow sensor readouts showed no negative values over a ~12hour period. I think we should just observe this for another day, in any case I don't think there is any urgent IFO related activity scheduled.
I am starting the c1x04 model (IOP) on c1lsc to see how it behaves overnight.
Well, there was apparently an immediate reaction - all the models on c1sus and c1ioo reported an ADC timeout and crashed. I'm going to reboot them and still have c1x04 IOP running, to see what happens.
[97544.431561] c1pem: ADC TIMEOUT 3 8703 63 8767
[97544.431574] c1mcs: ADC TIMEOUT 1 8703 63 8767
[97544.431576] c1sus: ADC TIMEOUT 1 8703 63 8767
[97544.454746] c1rfm: ADC TIMEOUT 0 9033 9 8841
As part of this slow but systematic debugging, I am turning on the c1lsc model overnight to see if the model crashes return.
The model seems to have run without issues overnight. Not completely related, but the MC1 shadow sensor signals also don't show any abnormal excursions to negative values in the last 48 hours. I'm thinking about re-connecting the satellite box (but preserving the breakout setup at 1X6 for a while longer) and re-locking the IMC. I'll also start c1ass on the c1lsc frontend. I would say that the other models on c1lsc (i.e. c1oaf, c1cal, c1daf) aren't really necessary for basic IFO operation.
After this work of increasing the series resistance on ETMX, there have been numerous occassions where the insufficient misalignment of ETMX has caused problems in locking vertex cavities. Today, I modified the script (located at /opt/rtcds/caltech/c1/medm/MISC/ifoalign/AlignSoft.py) to avoid such problems. The way the misalign script works is to write an offset value to the "TO_COIL" filter bank (accessed via "Output Filters" button on the Suspension master MEDM screen - not the most intuitive place to put an offset but okay). So I just increased the value of this offset from 250 counts to 2500 counts (for ETMX only). I checked that the script works, now when both ETMs are misaligned, the AS55Q signal shows a clean Michelson-like sine wave as it fringes instead of having the arm cavity PDH fringes as well .
Note that the svn doesn't seem to work on the newly upgraded SL7 machines: svn status gives me the following output.
Is it safe to run 'svn upgrade'? Or is it time to migrate to git.ligo.org/40m/scripts?