ID   Date   Author   Type   Category   Subject
  13399   Tue Oct 24 16:43:11 2017   Steve   Update   CDS   slow machine bootfest

[ Gautam , Steve ]

c1susaux & c1iscaux were rebooted manually.

Quote:

Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. PMC has been relocked. ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet, it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.

The script can be executed by ./testSlowMachines.bash.
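For reference, a minimal sketch of this kind of netcat check is shown below. This is only an illustration of the approach, not the actual contents of testSlowMachines.bash, and the host list is just an example:

#!/bin/bash
# Check which slow machines accept connections on the telnet port (23).
# Example host list -- edit to match the machines you care about.
for host in c1psl c1susaux c1auxex c1auxey c1iscaux c1aux; do
    if nc -z -w 2 "$host" 23 > /dev/null 2>&1; then
        echo "$host : responding"
    else
        echo "$host : NOT responding (probably needs a key-turn reboot)"
    fi
done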

 

  13404   Sat Oct 28 00:36:26 2017   gautam   Update   CDS   40m files backup situation - ddrescue

None of the 3 dd backups I made were bootable - at boot, selecting the drive dropped me into grub rescue mode, which suggests that the /boot partition did not exist on the backed-up disk, even though I was able to mount this partition on a booted computer. Perhaps all three failed for the same reason, but maybe not.

After going through various StackOverflow posts / blogs / other googling, I decided to try cloning the drives using ddrescue instead of dd.
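A typical ddrescue invocation for this kind of whole-disk clone is sketched below - the device names and map-file path are placeholders, not necessarily what was used here:

# clone the whole source disk to the target disk, keeping a map file so the
# copy can be resumed and bad sectors retried (device names are placeholders)
sudo ddrescue -f -n /dev/sdX /dev/sdY rescue.map
sudo ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map   # retry bad areas up to 3 times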

This seems to have worked for nodus - I was able to boot to console on the machine called rosalba which was lying around under my desk. I deliberately did not have this machine connected to the martian network during the boot process for fear of some issues because of having multiple "nodus"-es on the network, so it complained a bit about starting the elog and other network related issues, but seems like we have a plug-and-play version of the nodus root filesystem now.

chiara and fb1 rootfs backups (made using ddrescue) are still not bootable - I'm working on it.

Nov 6 2017: I am now able to boot the chiara backup as well - although mysteriously, I cannot boot it from the machine called rosalba, but can boot it from ottavia. Anyways, seems like we have usable backups of the rootfs of nodus and chiara now. FB1 is still a no-go, working on it.

Quote:

Looks to have worked this time around.

controls@fb1:~ 0$ sudo dd if=/dev/sda of=/dev/sdc bs=64K conv=noerror,sync
33554416+0 records in
33554416+0 records out
2199022206976 bytes (2.2 TB) copied, 55910.3 s, 39.3 MB/s
You have new mail in /var/mail/controls

I was able to mount all the partitions on the cloned disk. Will now try booting from this disk on the spare machine I am testing in the office area. That'd be a "real" test of whether this backup is useful in the event of a disk failure.

 

 

Attachment 1: 415E2F09-3962-432C-B901-DBCB5CE1F6B6.jpeg
Attachment 2: BFF8F8B5-1836-4188-BDF1-DDC0F5B45B41.jpeg
  13408   Mon Oct 30 11:15:02 2017   gautam   Update   CDS   slow machine bootfest + vacuum snafu

Eurocrate key-turning reboots this morning for c1psl and c1aux. c1auxex and c1auxey are also down, but I didn't bother keying them for now. PSL FSS slow loop is now active again (its inactivity was what prompted me to check the status of the slow machines).

Note that the EPICS channels for the PSL shutter are hosted on c1aux. But it looks like the slow machine became unresponsive at some point during the weekend, so plotting the trend data for the PSL shutter channel would have you believe that the PSL shutter was open all the time. The MC_REFL DC channel tells a different story - it suggests that the PSL shutter was closed at ~4AM on Sunday, presumably by the vacuum interlock system. I wonder:

  1. How does the vacuum interlock close the PSL shutter? Is there a non-EPICS channel path? Because if the slow machine happens to be unresponsive when the interlock wants to close the PSL shutter via EPICS commands, it will be unable to. The fact that the PSL shutter did close suggests that there is indeed another path.
  2. We should add some feature to the vacuum interlock (if it doesn't already exist) such that the PSL shutter isn't accidentally re-opened until any vacuum related issues are resolved. Steve was immediately able to identify that the problem was vacuum related, but I think I would have just re-opened the PSL shutter thinking that the issue was slow computer related.
  13410   Mon Nov 6 11:15:43 2017   gautam   Update   CDS   slow machine bootfest + IFO re-alignment

Eurocrate key-turning reboots this morning for c1susaux, c1auxex and c1auxey. Usual precautions were taken to minimize the risk of ITMX getting stuck.

The IFO hasn't been aligned in ~1week, so I recovered arm and PRM alignment by locking individual arms and also PRMI on carrier. I will try recovering DRMI locking in the evening.

As far as MC1 glitching is concerned, there hasn't been any major one (i.e. one in which MC1 is kicked by such a large amount that the autolocker can't lock the IMC) for the past 2 months - but the MC WFS offsets are an indication of when smaller glitches have taken place, and there were large DC offsets on the MC WFS servo outputs, which I offloaded to the DC MC suspension sliders using the MC WFS relief script.

I'd like the save-restore routine that runs when the slow machines reboot to set the watchdog state to OFF by default (currently, after a key-turning reboot, the watchdogs are enabled by default), but I'm not really sure how this whole system works. The relevant files seem to be in the directory /cvs/cds/caltech/target/c1susaux. There is a script in there called startup.cmd, which seems to be the initialization script that runs when the slow machine is rebooted. But looking at this file, it is not clear to me where the default values are loaded from. There are a few "saverestore" files in this directory as well:

  • saverestore.sav
  • saverestore.savB
  • saverestore.sav.bu
  • saverestore.req

Are the "default" channel values loaded from one of these?

  13420   Wed Nov 8 17:04:21 2017   gautam   Update   CDS   gds-2.17.15 [not] installed

I wanted to use the foton.py utility for my NB tool, and I remember Chris telling me that it was shipping as standard with the newer versions of gds. It wasn't available in the versions of gds available on our workstations - the default version is 2.15.1. So I downloaded gds-2.17.15 from http://software.ligo.org/lscsoft/source/, and installed it to /ligo/apps/linux-x86_64/gds-2.17.15/gds-2.17.15. In it, there is a file at GUI/foton/foton.py.in - this is the one I needed. 


Turns out this was more complicated than I expected. Building the newer version of gds throws up a bunch of compilation errors. Chris had pointed me to some pre-built binaries for ubuntu12 on the llo cds wiki, but those versions of gds do not have foton.py. I am dropping this for now.

  13422   Thu Nov 9 15:33:08 2017   johannes   Update   CDS   revisiting Acromag
Quote:

We probably want to get a dedicated machine that will handle the EPICS channel serving for the Acromag system

http://www.supermicro.com/products/system/1U/5015/SYS-5015A-H.cfm?typ=H

This is the machine that Larry suggested when I asked him for his opinion on a low-workload rack-mount unit. It only has an Atom processor, but I don't think it needs anything particularly powerful under the hood. He said that he will likely be able to let us borrow one of his for a couple of days to see if it's up to the task. The dual ethernet is a nice touch - maybe we can keep the communication between the server and the DAQ units on their own separate local network.

  13436   Tue Nov 21 11:21:26 2017   gautam   Update   CDS   RFM network down

I noticed yesterday evening that I wasn't able to engage the single arm locking servos - turned out that they weren't getting triggered, which in turn pointed me to the fact that the arm transmission channels seemed dead. Poking around a little, I found that there was a red light on the CDS overview screen for c1rfm.

  • The error seems to be in the receiving model only, i.e. c1rfm, all the sending models (e.g. c1scx) don't report any errors, at least on the CDS overview screen.
  • Judging by dataviewer trending of the c1rfm status word, seems like this happened on Sunday morning, around 11am.
  • I tried restarting both sender and receiver models, but error persists.
  • I got no useful information from the dmesg logs of either c1sus (which runs c1rfm), or c1iscex (which runs c1scx).
  • There are no physical red lights in the expansion chassis that I could see - in the past, when we have had some timing errors, this would be a signature.

Not sure how to debug further...

* Fix seems to be to restart the sender RFM models (c1scx, c1scy, c1asx, c1asy).

Attachment 1: RFMerrors.png
  13477   Thu Dec 14 19:41:00 2017   gautam   Update   CDS   CDS recovery, NFS woes

[Koji, Jamie(remote), gautam]

Summary: The CDS system seems to be back up and functioning. But there seems to be some pending problems with the NFS that should be looked into.

We locked the Y-arm, hand aligned transmission to 1. Some pending problems with the ASS model (possibly symptomatic of something more general). Didn't touch the X-arm because we don't know what exactly the status of ETMX is.

Problems raised in elogs in the thread of 13474 and also 13436 seem to be solved.


I would make a detailed post with how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can add to / correct me):

  1. There is a "known" problem with unloading models on c1lsc. Sometimes, running rtcds stop <model> will kill the c1lsc frontend.
  2. Sometimes, when one machine on the dolphin network goes down, all 3 go down.
  3. The new FB/RCG means that some of the old commands no longer work. Specifically, telnet fb 8087 followed by shutdown (to fix DC errors) no longer works. Instead, ssh into fb1 and run sudo systemctl restart daqd_* (see the command summary after this list).
  4. Timing error on the c1sus machine was linked to the mx_stream processes somehow not being automatically started. The "!mxstream restart" button on the CDS overview MEDM screen should run the necessary commands to restart it. However, today, I had to manually run sudo systemctl start mx_stream on c1sus to fix this error. It is a mystery why the automatic startup of this process was disabled in the first place. Jamie has now rectified this problem, so keep an eye out.
  5. c1oaf persistently reported DC errors (0x2bad) that couldn't be addressed by running mxstream restart or restarting the daqd processes on FB1. Restarting the model itself (i.e. rtcds restart c1oaf) fixed this issue (though of course I took the risk of having to go into the lab and hard-reboot 3 machines).
  6. At some point, we thought we had all the CDS lights green - but at that point, the END FEs crashed, necessitating Koji->EX and Gautam->EY hard reboots. This is a new phenomenon. Note that the vertex machines were unaffected.
  7. At some point, all the DC lights on the CDS overview screen went white - at the same time, we couldn't ssh into FB1, although it was responding to ping. After ~2mins, the green lights came back and we were able to connect to FB1. Not sure what to make of this.
  8. While trying to run the dither alignment scripts for the Y-arm, we noticed some strange behaviour:
    • Even when there was no signal (looking at EPICS channels) at the input of the ASS servos, the output was fluctuating wildly by ~20cts-pp.
    • This is not simply an EPICS artefact, as we could see corresponding motion of the suspension on the CCD.
    • A possible clue is that when I run the "Start Dither" script from the MEDM screen, I get a bunch of error messages (see Attachment #2).
    • Similar error messages show up when running the LSC offset script for example. Seems like there are multiple ports open somehow on the same machine?
    • There are no indicator lights on the CDS overview screen suggesting where the problem lies.
    • Will continue investigating tomorrow.
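A minimal command summary for points 3 and 4 above (hostnames and service names as given in the text; sudo may ask for a password):

# on fb1: restart the daqd processes (the new way to clear DC errors)
ssh -t fb1 "sudo systemctl restart daqd_*"

# on the affected front end (e.g. c1sus): restart mx_stream if the
# "!mxstream restart" MEDM button doesn't do it
ssh -t c1sus "sudo systemctl start mx_stream"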

Some other general remarks:

  1. ETMX watchdog remains shutdown.
  2. ITMY and BS oplevs have been hijacked for HeNe RIN / Oplev sensing noise measurement, and so are not enabled.
  3. Y arm trans QPD (Thorlabs) has large 60Hz harmonics. These can be mitigated by turning on a 60Hz comb filter, but we should check if this is some kind of ground loop. The feature is much less evident when looking at the TRANS signal on the QPD.

UPDATE 8:20pm:

Koji suggested trying to simply restart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted sometime ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs?

Attachment 1: CDS_14Dec2017.png
Attachment 2: CDS_errors.png
  13479   Fri Dec 15 00:26:40 2017   johannes   Update   CDS   Re: CDS recovery, NFS woes
Quote:

Didn't touch Xarm because we don't know what exactly the status of ETMX is.

The Xarm is currently in its original state, all cables are connected and c1auxex is hosting the slow channels.

  13480   Fri Dec 15 01:53:37 2017   jamie   Update   CDS   CDS recovery, NFS woes
Quote:

I would make a detailed post with how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can add to / correct me):

  1. There is a "known" problem with unloading models on c1lsc. Sometimes, running rtcds stop <model> will kill the c1lsc frontend.
  2. Sometimes, when one machine on the dolphin network goes down, all 3 go down.
  3. The new FB/RCG means that some of the old commands no longer work. Specifically, telnet fb 8087 followed by shutdown (to fix DC errors) no longer works. Instead, ssh into fb1 and run sudo systemctl restart daqd_*.

This should still work, but the address has changed.  The daqd was split up into three separate binaries to get around the issue with the monolithic build that we could never figure out.  The address of the data concentrator (DC) (which is the thing that needs to be restarted) is now 8083.
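So, per this correction, the old-style restart would now look something like this (note fb1, not fb, and port 8083):

telnet fb1 8083
# then type "shutdown" at the daqd prompt -- the data concentrator is the
# process that needs the restart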

Quote:

UPDATE 8:20pm:

Koji suggested trying to simply restart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted sometime ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs?

I don't think the problem is fb1.  The fb1 NFS is mostly only used during front end boot.  It's the rtcds mount that's the one that sees all the action, which is being served from chiara.

  13481   Fri Dec 15 11:19:11 2017   gautam   Update   CDS   CDS recovery, NFS woes

Looking at the dmesg on c1iscex for example, at least part of the problem seems to be associated with FB1 (192.168.113.201, see Attachment #1). The "server" can be unresponsive for O(100) seconds, which is consistent with the duration for which we see the MEDM status lights go blank, and the EPICS records get frozen. Note that the error timestamped ~4000 was from last night, which means there have been at least 2 more instances of this kind of freeze-up overnight.

I don't know if this is symptomatic of some more widespread problem with the 40m networking infrastructure. In any case, all the CDS overview screen lights were green today morning, and MC autolocker seems to have worked fine overnight.

I have also updated the wiki page with the updated daqd restart commands.

Unrelated to this work - Koji fixed up the MC overview screen such that the MC autolocker button is now visible again. The problem seems to have to do with me migrating some of the c1ioo EPICS channels from the slow machine to the fast system, as a result of which the EPICS variable type changed from "ENUM" to something that was not "ENUM". In any case, the button exists now, and the MC autolocker blinky light is responsive to its state.

Quote:

I don't think the problem is fb1.  The fb1 NFS is mostly only used during front end boot.  It's the rtcds mount that's the one that sees all the action, which is being served from chiara.

 

Attachment 1: NFS.png
Attachment 2: MCautolocker.png
  13518   Tue Jan 9 11:52:29 2018   gautam   Update   CDS   slow machine bootfest

Eurocrate key-turning reboots this morning for c1susaux, c1auxey and c1iscaux. These were responding to ping but not telnet-able. Usual precautions were taken to minimize the risk of ITMX getting stuck.

 

  13522   Wed Jan 10 12:24:52 2018   gautam   Update   CDS   slow machine bootfest

MC autolocker got stuck (judging by wall StripTool traces, it has been this way for ~7 hours) because c1psl was unresponsive so I power cycled it. Now MC is locked.

  13536   Thu Jan 11 21:09:33 2018   gautam   Update   CDS   revisiting Acromag

We'd like to setup the recording of the PSL diagnostic connector Acromag channels in a more robust way - the objective is to assess the long term performance of the Acromag DAQ system, glitch rates etc. At the Wednesday meeting, Rana suggested using c1ioo to run the IOC processes - the advantage being that c1ioo has the systemd utility, which seems to be pretty reliable in starting up various processes in the event of the computer being rebooted for whatever reason. Jamie pointed out that this may not be the best approach however - because all the FEs get the list of services to run from their common shared drive mount point, it may be that in the event of a power failure for example, all of them try and start the IOC processes, which is presumably undesirable. Furthermore, Johannes reported the necessity for the procServ utility to be able to run the modbusIOC process in the background - this utility is not available on any of the FEs currently, and I didn't want to futz around with trying to install it.

One alternative is to connect the PSL Acromag also to the Supermicro computer Johannes has set up at the Xend - it currently has systemd setup to run the modbusIOC, so it has all the utilities necessary. Or else, we could use optimus, which has systemd, and all the EPICS dependencies required. I feel less wary of trying to install procServ on optimus too. Thoughts?

 

  13558   Fri Jan 19 11:13:21 2018   gautam   Update   CDS   slow machine bootfest

c1psl, c1susaux, and c1auxey today

Quote:

MC autolocker got stuck (judging by wall StripTool traces, it has been this way for ~7 hours) because c1psl was unresponsive so I power cycled it. Now MC is locked.

 

  13620   Thu Feb 8 00:01:08 2018   gautam   Update   CDS   Vertex FEs all crashed

I was poking around at the LSC rack to try and set up a temporary arrangement whereby I take the signals from the DAC differentially and route them to the D990694 differentially. The situation is complicated by the fact that, afaik, we don't have any breakout boards for the DIN96 connectors on the back of all our Eurocrate cards (or indeed for many of the other funky connectors we have, like IDE/IDC 10, 50 etc.). I've asked Steve to look into ordering a few of these. So I tried to put together a hacky solution with an expansion card and an IDC64 connector. I must have accidentally shorted a pair of DAC pins or something, because all models on the c1lsc FE crashed. On attempting to restart them (c1lsc was still ssh-able), the usual issue of all vertex FEs crashing happened. It required several iterations of me walking into the lab to hard-reboot FEs, but everything is back green now, and I see the AS beam on the camera so the input pointing of the TTs is roughly back where it was. Y arm TEM00 flashes are also seen. I'm not going to re-align the IFO tonight. Maybe I'll stick to using a function generator for the THD tests; probably routing non AI-ed signals directly is as bad as any timing asynchronicity between funcGen and DAQ system...

Attachment 1: CDSrecovery_20180207.png
  13643   Tue Feb 20 21:14:59 2018   gautam   Update   CDS   RFM network errors

I wanted to lock the single arm POX/POY config to do some tests on the BeatMouth. But I was unable to.

  • I tracked the problem down to the fact that the TRX and TRY triggers weren't getting piped correctly to the LSC model
  • In fact, all RFM channels from the end machines were showing error rates of 16384/sec (i.e. every sample).
  • After watchdogging ETMX, I tried restarting just the c1scx model - this promptly took down the whole c1iscex machine.
  • Then I tried the same with c1iscey - this time the models restarted successfully without the c1iscey machine crashing, but the RFM errors persisted for the c1scy channels.
  • I walked down to EX and hard rebooted c1iscex.
  • c1iscex came back online, and I ssh-ed in and did rtcds start --all.
  • This brought all the models back online, and the RFM errors on both c1iscex and c1iscey channels vanished.

Not sure what to make of all this, but I can lock the arms now.

  13646   Wed Feb 21 12:17:04 2018   gautam   Update   CDS   LO Power mon channels added to c1lsc

To make this setup more permanent, I modified the c1lsc model to pipe the LO power monitor signals from the Demod chassis to unused channels ADC_0_25 (X channel LO) and ADC_0_26 (Y channel LO) in the c1lsc model. I also added a couple of CDS filter blocks inside the "ALS" namespace block in c1lsc so as to allow for calibration from counts to dBm. I didn't add any DQ channels for now as I think the slow EPICS records will be sufficient for diagnostics. It is kind of overkill to use the fast channels for DC voltage monitoring, but until we have acromag channels readily accessible at 1Y2, this will do.

Modified model compiled and installed successfully, though I have yet to restart it, given that I'll likely have to do a major reboot of all vertex FEs.

  13727   Wed Apr 4 16:23:39 2018   gautam   Update   CDS   slow machine bootfest

[johannes, gautam]

It's been a while - but today, all slow machines (with the exception of c1auxex) were un-telnetable. c1psl, c1iool0, c1susaux, c1iscaux1, c1iscaux2, c1aux and c1auxey were rebooted. Usual satellite box unplugging was done to avoid ITMX getting stuck.

  13729   Thu Apr 5 10:38:38 2018   gautam   Update   CDS   CDS puzzle

I'm probably doing something stupid - but I've not been able to figure this out. In the MC1 and MC3 coil driver filter banks, we have a digital "28HzELP" filter module in FM9. Attachment #1 shows the MC1 filterbanks. In the shown configuration, I would expect the only difference between the "IN1" and "OUT" testpoints to be the transfer function of said ELP filter, after all, it is just a bunch of multiplications by filter coefficients. But yesterday looking at some DTT traces, their shapes looked suspicious. So today, I did the analysis entirely offline (motivation being to rule out DTT weirdness) using scipy's welch. Attachment #2 shows the ASDs of the IN1 and OUT testpoint data (collected for 30s, fft length is set to 2 seconds, and hanning window from scipy is used). I've also plotted the "expected" spectral shape, by loading the sos coefficients from the filter file and using scipy to compute the transfer function.

Clearly, there is a discrepancy for f>20Hz. Why?

Code used to generate this plot (and also a datafile to facilitate offline plotting) is attached in the tarball Attachment #3. Note that I am using a function from my Noise Budget repo to read in the Foton filter file...

*ChrisW suggested ruling out spectral leakage. I re-ran the script with (i) 180 seconds of data (ii) fft length of 15 seconds and (iii) blackman-harris window instead of Hanning. Attachment #4 shows similar discrepancy between expectation and measurement...

Attachment 1: MC1_outputs.png
Attachment 2: EllipTFCheck_MC1.pdf
Attachment 3: MC1_ELP.tgz
Attachment 4: EllipTFCheck_MC1.pdf
  13732   Thu Apr 5 19:31:17 2018   gautam   Update   CDS   EPICS processes for c1ioo dead

I found all the EPICS channels for the model c1ioo on the FE c1ioo to be blank just now. The realtime model itself seemed to be running fine, judging by the IMC alignment (the WFS loops seemed to still be running okay). I couldn't find any useful info in dmesg, but I don't know what I'm looking for. So my guess is that somehow the EPICS process for that model died. Unclear why.

  13733   Fri Apr 6 10:00:29 2018   gautam   Update   CDS   CDS puzzle
Quote:

Clearly, there is a discrepancy for f>20Hz. Why?

Spectral leakage

  13734   Fri Apr 6 16:07:18 2018   gautam   Update   CDS   Frequent EPICS dropouts

Kira informed me that she was having trouble accessing past data for her PID tuning tests. Looking at the last day of data, it looks like there are frequent EPICS data dropouts, each up to a few hours. Observations (see Attachment #1 for evidence):

  1. Problem seems to be with writing these EPICS channels to frames, as the StripTool traces do not show the same discontinuities.
  2. Problem does not seem local to c1auxex (new Acromag machine). These dropouts are also happening in other EPICS channel records. I have plotted PMC REFL, which is logged by the slow machine c1psl, and you can see the dropouts happen at the same times.
  3. Problem does not seem to happen to the fast channel DQ records (see ETMX Sensor record plotted for 24 hours, there are no discontinuities).

It is difficult to diagnose how long this has been going on for, as once you start pulling longer stretches of data on dataviewer, any "data freezes" are washed out in the extent of data plotted.

Attachment 1: EPICS_dropout.png
  13758   Wed Apr 18 10:44:45 2018   gautam   Update   CDS   slow machine bootfest

All slow machines (except c1auxex) were dead today, so I had to key them all. While I was at it, I also decided to update the MC autolocker screen. Kira pointed out that I needed to change the EPICS input type (in the RTCDS model) to a "binary input", as opposed to an "analog input", which I did. Model recompilation and restart went smoothly. I had to go into the EPICS record manually to change the two choices to "ENABLE" and "DISABLE" as opposed to the default "ON" and "OFF". Anyways, long story short, MC autolocker controls are a bit more intuitive now I think.

Attachment 1: MCautolocker_MEDM_revamp.png
  13812   Thu May 3 12:19:13 2018   gautam   Update   CDS   slow machine bootfest

Reboot for c1susaux and c1iscaux today. ITMX precautions were followed. Reboots went smoothly.

IMC is shuttered while Jon does PLL characterization...


Now relocked.

  13898   Wed May 30 16:12:30 2018   Jonathan Hanks   Summary   CDS   Looking at c1oaf issues

When c1oaf starts up there are 446 gain channels that should be set to 0.0 but which end up at 1.0.  An example channel is C1:OAF-ADAPT_CARM_ADPT_ACC1_GAIN.  The safe.snap file states that it should be set to 0.  After model start up it is at 1.0.

We ran some tests, including modifying the safe.snap to make sure it was reading the snap file we were expecting.  For this I set the setpoint to 0.5.  After restart of the model we saw that the setpoint went to 0.5 but the epics value remained at 1.0.  I then set the snap file back to its original setting.  I ran the epics sequencer by hand in a gdb session and verified that the sequencer was setting the field to 0.  I also built a custom sequencer that would catch writes by the sdf system to the channel.  I only saw one write, the initial write that pushed a 0.  I have reverted my changes to the sequencer.

The gain channel can be caput to the correct value and it is not pushed back to 1.0.  So there does not appear to be a process actively pushing the value to 1.0.  On Rolf's suggestion we ran the sequencer without the kernel object loaded, and saw the same behavior.

This will take some thought.

  13925   Thu Jun 7 12:20:53 2018   gautam   Update   CDS   slow machine bootfest

FSS slow wasn't running, so the PSL PZT voltage was swinging around a lot. The reason was that c1psl was unresponsive. I keyed the crate, now it's okay. Now ITMX is stuck - Johannes just told me about an un-elogged c1susaux reboot. Seems that ITMX got stuck at ~4:30pm yesterday PT. After some shaking, the optic was loosened. Please follow the procedure in future, and if you do a reboot, please elog it and verify that the optic didn't get stuck.

Attachment 1: ITMX_stuck.png
  13934   Fri Jun 8 14:40:55 2018   c1lsc   Update   CDS   i am dead
Attachment 1: 31.png
  13935   Fri Jun 8 20:15:08 2018   gautam   Update   CDS   Reboot script

Unfortunately, this has happened (and seems like it will happen) enough times that I set up a script for rebooting the machines in a controlled way; hopefully it will negate the need to repeatedly go into the VEA and hard-reboot them. The script lives at /opt/rtcds/caltech/c1/scripts/cds/rebootC1LSC.sh. SVN committed. It worked well for me today. All applicable CDS indicator lights are now green again. Be aware that c1oaf will probably need to be restarted manually in order to make the DC light green. Also, this script won't help you if you try to unload a model on c1lsc and the FE crashes - it relies on c1lsc being ssh-able. The basic logic (a skeleton sketch follows the list) is:

  1. Ask for confirmation.
  2. Shutdown all vertex optic watchdogs, PSL shutter.
  3. ssh into c1sus and c1ioo, shutdown all models on these machines, soft reboot them.
  4. ssh into c1lsc, soft reboot the machine. No attempt is made to unload the models.
  5. Wait 2 minutes for all machines to come back online.
  6. Restart models on all 3 vertex FEs (IOPs first, then rest).
  7. Prompt user for confirmation to re-enable watchdog status and open PSL shutter.
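In skeleton form, the logic above looks something like the following - a simplified sketch, not the actual contents of rebootC1LSC.sh; the watchdog/shutter handling (done via caput on the relevant channels) is elided:

#!/bin/bash
# Simplified sketch of the reboot logic -- NOT the real rebootC1LSC.sh
# (1) ask for confirmation
read -p "This will reboot c1lsc, c1sus and c1ioo. Proceed? [y/N] " ans
[ "$ans" = "y" ] || exit 1

# (2) shut down all vertex optic watchdogs and close the PSL shutter here

# (3) stop models on c1sus and c1ioo, then soft reboot them
for fe in c1sus c1ioo; do
    ssh -t "$fe" "rtcds stop --all && sudo reboot"
done

# (4) soft reboot c1lsc without trying to unload its models
ssh -t c1lsc "sudo reboot"

# (5) wait for the machines to come back
sleep 120

# (6) restart models on all 3 vertex FEs (IOPs first, then the rest)
for fe in c1lsc c1sus c1ioo; do
    ssh "$fe" "rtcds start --all"
done

# (7) prompt before re-enabling watchdogs and opening the PSL shutter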
Attachment 1: 31.png
  13942   Mon Jun 11 18:49:06 2018   gautam   Update   CDS   c1lsc dead again

Why is this happening so frequently now? Last few lines of error log:

[  575.099793] c1oaf: DAQ EPICS: Int = 199  Flt = 706 Filters = 9878 Total = 10783 Fast = 113
[  575.099793] c1oaf: DAQ EPICS: Number of Filter Module Xfers = 11 last = 98
[  575.099793] c1oaf: crc length epics = 43132
[  575.099793] c1oaf:  xfer sizes = 128 788 100988 100988 
[240629.686307] c1daf: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1cal: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1ass: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1oaf: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1lsc: ADC TIMEOUT 0 43039 31 43103
[240630.684493] c1x04: timeout 0 1000000 
[240631.684938] c1x04: timeout 1 1000000 
[240631.684938] c1x04: exiting from fe_code()

I fixed it by running the reboot script.

Attachment 1: 36.png
  13947   Mon Jun 11 23:22:53 2018   gautam   Update   CDS   EX wiring confusion

 [Koji, gautam]

Per this elog, we don't need any AIOut channels or Oplev channels. However, the latest wiring diagram I can find for the EX Acromag situation suggests that these channels are hooked up (physically). If this is true, there are 12 ADC channels that are occupied which we can use for other purposes. Question for Johannes: Is this true? If so, Kira has plenty of channels available for her temperature control stuff.

As an aside, we found that the EPICS channel names for the TRX/TRY QPD gain stages are somewhat strangely named. Looking closely at the schematic (which has now been added to the 40m DCC tree; we can add our custom mods later), they do (somewhat) add up, but I think we should definitely rename them in a more systematic manner, and use an MEDM screen to indicate stuff like x4 or x20 or "Active" etc. BTW, the EX and EY QPDs have different settings. But at least the settings are changed synchronously for all four quadrants, unlike the WFS heads...


Unrelated: I had to key the c1iscaux and c1auxey crates.

  13958   Wed Jun 13 23:23:44 2018   johannes   Update   CDS   EX wiring confusion

It's true.

I went through the wiring of the c1auxex crate today to disentangle the pin assignments. The full detail can be found in attachment #1, #2 has less detail but is more eye candy. The red flagged channels are now marked for removal at the next opportunity. This will free up DAQ channels as follows:

TYPE           Total   Available now   Available after
ADC            24      2               14
DAC            16      8               12
BIO sinking    16      7               7
BIO sourcing   8       8               8

This should be enough for temperature sensing, NPRO diagnostics, and even eventual remote PDH control with new servo boxes.

Attachment 1: c1auxex_channels.pdf
Attachment 2: XEND_slow_wiring.pdf
  13961   Thu Jun 14 10:41:00 2018   gautam   Update   CDS   EX wiring confusion

Do we really have 2 free ADC channels at EX now? I was under the impression we had ZERO free, which is why we wanted to put a new ADC unit in. I think in the wiring diagram, the Vacuum gauge monitor channel, Seis Can Temp Sensor monitor, and Seis Can Heater channels are missing. It would also be good to have, in the wiring diagram, a mapping of which signals go to which I/O ports (D-sub, front panel BNC etc.) on the 4U(?) box housing all the Acromags; this would be helpful in future debugging sessions.

Quote:
 
TYPE   Total   Available now   Available after
ADC    24      2               14

 

  13965   Thu Jun 14 15:31:18 2018   johannes   Update   CDS   EX wiring confusion

Bad wording, sorry. Should have been channels in excess of ETMX controls. I'll add the others to the list as well.

Updated channel list and wiring diagram attached. Labels are 'F' for 'Front' and 'R' for - you guessed it - 'Rear', the number identifies the slot panel the breakout is attached to.

Attachment 1: XEND_slow_wiring.pdf
Attachment 2: c1auxex_channels.pdf
  14000   Thu Jun 21 22:13:12 2018   gautam   Update   CDS   pianosa upgrade

pianosa has been upgraded to SL7. I've made a controls user account, added it to sudoers, did the network config, and mounted /cvs/cds using /etc/fstab. Other capabilities are being slowly added, but it may be a while before this workstation has all the kinks ironed out. For now, I'm going to follow the instructions on this wiki to try and get the usual LSC stuff working.
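For anyone repeating this, the /etc/fstab approach looks something like the following - the server name and mount options here are assumptions, so copy the actual line from one of the working workstations:

# illustrative /etc/fstab entry (server name and options are assumptions)
# chiara:/home/cds    /cvs/cds    nfs    rw,bg,soft,intr    0 0
sudo mount -a    # (re)mount everything listed in /etc/fstab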

  14003   Fri Jun 22 00:59:43 2018   gautam   Update   CDS   pianosa functional, but NO DTT

MEDM, EPICS and dataviewer seem to work, but diaggui still doesn't work (it doesn't work on Rossa either, same problem as reported here, does a fix exist?). So looks like only donatella can run diaggui for now. I had to disable the systemd firewall per the instructions page in order to get EPICS to work. Also, there is no MATLAB installed on this machine yet. sshd has been enabled.

  14007   Fri Jun 22 15:13:47 2018   gautam   Update   CDS   DTT working

Seems like DTT also works now. The trick seems to be to run sudo /usr/bin/diaggui instead of just diaggui. So this is indicative of some conflict between the yum installed gds and the relic gds from our shared drive. I also have to manually change the NDS settings each time, probably there's a way to set all of this up in a more smooth way but I don't know what it is. awggui still doesn't get the correct channels, not sure where I can change the settings to fix that.

Attachment 1: Screenshot_from_2018-06-22_15-12-37.png
  14008   Fri Jun 22 15:22:39 2018   sudo   Update   CDS   DTT working
Quote:

Seems like DTT also works now. The trick seems to be to run sudo /usr/bin/diaggui instead of just diaggui. So this is indicative of some conflict between the yum installed gds and the relic gds from our shared drive. I also have to manually change the NDS settings each time, probably there's a way to set all of this up in a more smooth way but I don't know what it is. awggui still doesn't get the correct channels, not sure where I can change the settings to fix that.

DON'T RUN DIAGGUI AS ROOT

  14027   Wed Jun 27 21:18:00 2018   gautam   Update   CDS   Lab maintenance scripts from NODUS---->MEGATRON

I moved the N2 check script and the disk usage checking script from the (sudo) crontab of nodus to the controls user crontab on megatron.

  14028   Thu Jun 28 08:09:51 2018   Steve   Update   CDS   vacuum pneumatic N2 pressure

The farthest I can go back on channel C1:Vac_N2pres is 320 days.

The C1:Vac-CC1_Hornet pressure gauge started logging Feb. 23, 2018.

Did you update the "low N2 message" email addresses?

 

Quote:

I moved the N2 check script and the disk usage checking script from the (sudo) crontab of nodus to the controls user crontab on megatron.

 

Attachment 1: 320d_N2.png
  14029   Thu Jun 28 10:28:27 2018   rana   Update   CDS   vacuum pneumatic N2 pressure

we disabled logging the N2 Pressure to a text file, since it was filling up disk space. Now it just sends an email to our 40m mailing list, so we'll all get a warning.

The crontab uses the bash-style output redirection '2>&1', which sends stderr to the same place as stdout, so both end up in the email; probably we just want stderr, since stdout contains messages even when there is no issue and will just fill things up again.
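Concretely, the difference is something like this (illustrative lines, not the actual crontab entry - the script name and address are placeholders):

check_N2.sh 2>&1 | mail -s "N2 status" somelist@example.org
    # 2>&1 folds stderr into stdout, so the mail gets every run's output

check_N2.sh 2>&1 >/dev/null | mail -s "N2 status" somelist@example.org
    # stderr still goes to the pipe, stdout is discarded, so only errors get mailed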

  14133   Sun Aug 5 13:28:43 2018   gautam   Update   CDS   c1lsc flaky

Since the lab-wide computer shutdown last Wednesday, all the realtime models running on c1lsc have been flaky. The error is always the same:

[58477.149254] c1cal: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1daf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1ass: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1oaf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1lsc: ADC TIMEOUT 0 10963 19 11027
[58478.148001] c1x04: timeout 0 1000000 
[58479.148017] c1x04: timeout 1 1000000 
[58479.148017] c1x04: exiting from fe_code()

This has happened at least 4 times since Wednesday. The reboot script makes recovery easier, but doing it once in 2 days is getting annoying, especially since we are running many things (e.g. ASS) in custom configurations which have to be reloaded each time. I wonder why the problem persists even though I've power-cycled the expansion chassis? I want to try and do some IFO characterization today so I'm going to run the reboot script again but I'll get in touch with J Hanks to see if he has any insight (I don't think there are any logfiles on the FEs anyways that I'll wipe out by doing a reboot). I wonder if this problem is connected to DuoTone? But if so, why is c1lsc the only FE with this problem? c1sus also does not have the DuoTone system set up correctly...

The last time this happened, the problem apparently fixed itself, so I still don't have any insight as to what is causing it in the first place. Maybe I'll try disabling c1oaf since that's the configuration we've been running in for a few weeks.

  14136   Mon Aug 6 00:26:21 2018   gautam   Update   CDS   More CDS woes

I spent most of today fighting various CDS errors.

  • I rebooted c1lsc around 3pm, my goal was to try and do some vertex locking and figure out what the implications were of having only ~30% power we used to have at the AS port.
  • Shortly afterwards (~4pm), c1lsc crashed.
  • Using the reboot script, I was able to bring everything back up. But the DC lights on c1sus models were all red, and a 0x4000 error was being reported.
  • This error is indicative of some timing issue, but all the usual tricks (reboot vertex FEs in various order, restart the mx_streams etc) didn't clear this error.
  • I checked the Tempus GPS unit, that didn't report any obvious problems (i.e. front display was showing the correct UTC time).
  • Finally, I decided to shut down all watchdogs, soft reboot all the FEs, soft reboot FB, power cycle all expansion chassis.
  • This seems to have done the trick - I'm leaving c1oaf disabled for now.
  • The remaining red indicators are due to c1dnn and c1oaf being disabled.

Let's see how stable this configuration is. Onto some locking now...

Attachment 1: CDSoverview.png
  14139   Mon Aug 6 14:38:38 2018   gautam   Update   CDS   More CDS woes

Stability was short-lived it seems. When I came in this morning, all models on c1lsc were dead already, and now c1sus is also dead (Attachment #1). Moreover, MC1 shadow sensors failed for a brief period again this afternoon (Attachment #2). I'm going to wait for some CDS experts to take a look at this since any fix I effect seems to be short-lived. For the MC1 shadow sensors, I wonder if the Trillium box (and associated Sorensen) failure somehow damaged the MC1 shadow sensor/coil driver electronics.

Quote:
 

Let's see how stable this configuration is. Onto some locking now...

Attachment 1: CDScrash.png
Attachment 2: MC1failures.png
  14140   Mon Aug 6 19:49:09 2018   gautam   Update   CDS   More CDS woes

I've left the c1lsc frontend shut down for now, to see if c1sus and c1ioo can survive without any problems overnight. In parallel, we are going to try and debug the MC1 OSEM sensor problem - the idea will be to disable the bias voltage to the OSEM LEDs and see if the readback channels still go below zero; this would be a clear indication that the problem is in the readback transimpedance stage and not the LED. Per the schematic, this can be done by simply disconnecting the two D-sub connectors going to the vacuum flange (this is the configuration in which we usually use the sat box tester kit, for example). Attachment #1 shows the current setup at the PD readout board end. The dark DC count (i.e. with the OSEM LEDs off) is ~150 cts, while the nominal level is ~1000 cts, so perhaps this is already indicative of something being broken, but let's observe overnight.

Attachment 1: IMG_7106.JPG
  14142   Tue Aug 7 11:30:46 2018   gautam   Update   CDS   More CDS woes

Overnight, all models on c1sus and c1ioo seem to have had no stability issues, supporting the hypothesis that timing issues stem from c1lsc. Moreover, the MC1 shadow sensor readouts showed no negative values over a ~12hour period. I think we should just observe this for another day, in any case I don't think there is any urgent IFO related activity scheduled.

  14143   Tue Aug 7 22:28:23 2018   gautam   Update   CDS   More CDS woes

I am starting the c1x04 model (IOP) on c1lsc to see how it behaves overnight.

Well, there was apparently an immediate reaction - all the models on c1sus and c1ioo reported an ADC timeout and crashed. I'm going to reboot them and still have c1x04 IOP running, to see what happens.

[97544.431561] c1pem: ADC TIMEOUT 3 8703 63 8767
[97544.431574] c1mcs: ADC TIMEOUT 1 8703 63 8767
[97544.431576] c1sus: ADC TIMEOUT 1 8703 63 8767
[97544.454746] c1rfm: ADC TIMEOUT 0 9033 9 8841
Quote:

Overnight, all models on c1sus and c1ioo seem to have had no stability issues, supporting the hypothesis that timing issues stem from c1lsc. Moreover, the MC1 shadow sensor readouts showed no negative values over a ~12hour period. I think we should just observe this for another day, in any case I don't think there is any urgent IFO related activity scheduled.

  14146   Wed Aug 8 23:03:42 2018   gautam   Update   CDS   c1lsc model started

As part of this slow but systematic debugging, I am turning on the c1lsc model overnight to see if the model crashes return.

  14149   Thu Aug 9 12:31:13 2018   gautam   Update   CDS   CDS status update

The model seems to have run without issues overnight. Not completely related, but the MC1 shadow sensor signals also don't show any abnormal excursions to negative values in the last 48 hours. I'm thinking about re-connecting the satellite box (but preserving the breakout setup at 1X6 for a while longer) and re-locking the IMC. I'll also start c1ass on the c1lsc frontend. I would say that the other models on c1lsc (i.e. c1oaf, c1cal, c1daf) aren't really necessary for basic IFO operation.

Quote:

As part of this slow but systematic debugging, I am turning on the c1lsc model overnight to see if the model crashes return.

  14151   Thu Aug 9 22:50:13 2018   gautam   Update   CDS   AlignSoft script modified

After this work of increasing the series resistance on ETMX, there have been numerous occasions where insufficient misalignment of ETMX has caused problems in locking the vertex cavities. Today, I modified the script (located at /opt/rtcds/caltech/c1/medm/MISC/ifoalign/AlignSoft.py) to avoid such problems. The way the misalign script works is to write an offset value to the "TO_COIL" filter bank (accessed via the "Output Filters" button on the Suspension master MEDM screen - not the most intuitive place to put an offset, but okay). So I just increased the value of this offset from 250 counts to 2500 counts (for ETMX only). I checked that the script works: now when both ETMs are misaligned, the AS55Q signal shows a clean Michelson-like sine wave as it fringes, instead of having the arm cavity PDH fringes as well.
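For reference, the offset write itself is just an EPICS caput; something like the line below (the channel name here is illustrative - check the actual TO_COIL filter bank channel names on the suspension MEDM screen):

caput C1:SUS-ETMX_TO_COIL_1_1_OFFSET 2500    # hypothetical channel name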

Note that the svn doesn't seem to work on the newly upgraded SL7 machines: svn status gives me the following output.

svn: E155036: Please see the 'svn upgrade' command
svn: E155036: Working copy '/cvs/cds/rtcds/userapps/trunk/cds/c1/medm/MISC/ifoalign' is too old (format 10, created by Subversion 1.6)

 Is it safe to run 'svn upgrade'? Or is it time to migrate to git.ligo.org/40m/scripts?

Attachment 1: MichelsonFringing.png