ID | Date | Author | Type | Category | Subject
  13935   Fri Jun 8 20:15:08 2018   gautam | Update | CDS | Reboot script

Unfortunately, this has happened (and seems like it will keep happening) often enough that I set up a script for rebooting the machine in a controlled way; hopefully it will negate the need to repeatedly go into the VEA and hard-reboot the machines. The script lives at /opt/rtcds/caltech/c1/scripts/cds/rebootC1LSC.sh. SVN committed. It worked well for me today. All applicable CDS indicator lights are now green again. Be aware that c1oaf will probably need to be restarted manually in order to make the DC light green. Also, this script won't help you if you try to unload a model on c1lsc and the FE crashes - it relies on c1lsc being ssh-able. The basic logic is as follows (a sketch is given after the list):

  1. Ask for confirmation.
  2. Shut down all vertex optic watchdogs and close the PSL shutter.
  3. ssh into c1sus and c1ioo, shut down all models on these machines, soft reboot them.
  4. ssh into c1lsc, soft reboot the machine. No attempt is made to unload the models.
  5. Wait 2 minutes for all machines to come back online.
  6. Restart models on all 3 vertex FEs (IOPs first, then the rest).
  7. Prompt the user for confirmation to re-enable the watchdogs and open the PSL shutter.
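For reference, here is a minimal bash sketch of that logic. This is NOT a copy of rebootC1LSC.sh: the caput channel names and the rtcds invocations are illustrative placeholders only.

#!/bin/bash
# Sketch of the reboot sequence described above; channel names are hypothetical.
read -p "Soft-reboot c1lsc/c1sus/c1ioo? [y/N] " ans
[ "$ans" = "y" ] || exit 1

caput C1:PSL-SHUTTER_REQ 0                   # close PSL shutter (hypothetical channel)
for sus in MC1 MC2 MC3 BS ITMX ITMY PRM SRM; do
    caput C1:SUS-${sus}_WATCHDOG 0           # shut down watchdogs (hypothetical channel)
done

for fe in c1sus c1ioo; do
    ssh $fe "rtcds stop all; sudo reboot"    # unload models, then soft reboot
done
ssh c1lsc "sudo reboot"                      # models on c1lsc are not unloaded

sleep 120                                    # wait for the machines to come back

for fe in c1lsc c1sus c1ioo; do
    ssh $fe "rtcds start all"                # starts the IOP first, then the rest
done

read -p "Re-enable watchdogs and open the PSL shutter? [y/N] " ans
# (re-enabling caputs elided)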
  13942   Mon Jun 11 18:49:06 2018   gautam | Update | CDS | c1lsc dead again

Why is this happening so frequently now? Last few lines of error log:

[  575.099793] c1oaf: DAQ EPICS: Int = 199  Flt = 706 Filters = 9878 Total = 10783 Fast = 113
[  575.099793] c1oaf: DAQ EPICS: Number of Filter Module Xfers = 11 last = 98
[  575.099793] c1oaf: crc length epics = 43132
[  575.099793] c1oaf:  xfer sizes = 128 788 100988 100988 
[240629.686307] c1daf: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1cal: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1ass: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1oaf: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1lsc: ADC TIMEOUT 0 43039 31 43103
[240630.684493] c1x04: timeout 0 1000000 
[240631.684938] c1x04: timeout 1 1000000 
[240631.684938] c1x04: exiting from fe_code()

I fixed it by running the reboot script.

  13947   Mon Jun 11 23:22:53 2018   gautam | Update | CDS | EX wiring confusion

 [Koji, gautam]

Per this elog, we don't need any AIOut channels or Oplev channels. However, the latest wiring diagram I can find for the EX Acromag situation suggests that these channels are (physically) hooked up. If this is true, there are 12 occupied ADC channels that we could use for other purposes. Question for Johannes: is this true? If so, Kira has plenty of channels available for her temperature control stuff.

As an aside, we found that the EPICS channels for the TRX/TRY QPD gain stages are somewhat strangely named. Looking closely at the schematic (which has now been added to the 40m DCC tree; we can add our custom mods later), they do (somewhat) add up, but I think we should definitely rename them in a more systematic manner, and use an MEDM screen to indicate stuff like x4 or x20 or "Active" etc. BTW, the EX and EY QPDs have different settings. But at least the settings are changed synchronously for all four quadrants, unlike the WFS heads...


Unrelated: I had to key the c1iscaux and c1auxey crates.

  13958   Wed Jun 13 23:23:44 2018   johannes | Update | CDS | EX wiring confusion

It's true.

I went through the wiring of the c1auxex crate today to disentangle the pin assignments. The full detail can be found in Attachment #1; Attachment #2 has less detail but is more eye candy. The red-flagged channels are now marked for removal at the next opportunity. This will free up DAQ channels as follows:

TYPE           Total   Available now   Available after
ADC            24      2               14
DAC            16      8               12
BIO sinking    16      7               7
BIO sourcing   8       8               8

This should be enough for temperature sensing, NPRO diagnostics, and even eventual remote PDH control with new servo boxes.

  13961   Thu Jun 14 10:41:00 2018   gautam | Update | CDS | EX wiring confusion

Do we really have 2 free ADC channels at EX now? I was under the impression we had ZERO free, which is why we wanted to put in a new ADC unit. I think the Vacuum gauge monitor, Seis Can Temp Sensor monitor, and Seis Can Heater channels are missing from the wiring diagram. It would also be good to have, in the wiring diagram, a mapping of which signals go to which I/O ports (D-sub, front panel BNC, etc.) on the 4U(?) box housing all the Acromags; this would be helpful in future debugging sessions.

Quote:
 
TYPE   Total   Available now   Available after
ADC    24      2               14

 

  13965   Thu Jun 14 15:31:18 2018   johannes | Update | CDS | EX wiring confusion

Bad wording, sorry. I should have said: channels in excess of the ETMX controls. I'll add the others to the list as well.

Updated channel list and wiring diagram attached. Labels are 'F' for 'Front' and 'R' for - you guessed it - 'Rear', the number identifies the slot panel the breakout is attached to.

  14000   Thu Jun 21 22:13:12 2018   gautam | Update | CDS | pianosa upgrade

pianosa has been upgraded to SL7. I've made a controls user account, added it to sudoers, did the network config, and mounted /cvs/cds using /etc/fstab. Other capabilities are being slowly added, but it may be a while before this workstation has all the kinks ironed out. For now, I'm going to follow the instructions on this wiki to try and get the usual LSC stuff working.

  14003   Fri Jun 22 00:59:43 2018   gautam | Update | CDS | pianosa functional, but NO DTT

MEDM, EPICS and dataviewer seem to work, but diaggui still doesn't work (it doesn't work on Rossa either, same problem as reported here; does a fix exist?). So it looks like only donatella can run diaggui for now. I had to disable the systemd firewall per the instructions page in order to get EPICS to work. Also, there is no MATLAB installed on this machine yet. sshd has been enabled.

  14007   Fri Jun 22 15:13:47 2018   gautam | Update | CDS | DTT working

Seems like DTT also works now. The trick seems to be to run sudo /usr/bin/diaggui instead of just diaggui. So this is indicative of some conflict between the yum-installed gds and the relic gds from our shared drive. I also have to manually change the NDS settings each time; there's probably a way to set all of this up more smoothly, but I don't know what it is. awggui still doesn't get the correct channels; not sure where I can change the settings to fix that.

  14008   Fri Jun 22 15:22:39 2018   sudo | Update | CDS | DTT working
Quote:

Seems like DTT also works now. The trick seems to be to run sudo /usr/bin/diaggui instead of just diaggui. So this is indicative of some conflict between the yum-installed gds and the relic gds from our shared drive. I also have to manually change the NDS settings each time; there's probably a way to set all of this up more smoothly, but I don't know what it is. awggui still doesn't get the correct channels; not sure where I can change the settings to fix that.

DON"T RUN DIAGGUI AS ROOT

  14027   Wed Jun 27 21:18:00 2018   gautam | Update | CDS | Lab maintenance scripts from NODUS---->MEGATRON

I moved the N2 check script and the disk usage checking script from the (sudo) crontab of nodus to the controls user crontab on megatron.

  14028   Thu Jun 28 08:09:51 2018   Steve | Update | CDS | vacuum pneumatic N2 pressure

The farthest I can go back on channel C1:Vac_N2pres is 320 days.

The C1:Vac-CC1_Hornet pressure gauge started logging Feb. 23, 2018.

Did you update the "low N2 message" email addresses?

 

Quote:

I moved the N2 check script and the disk usage checking script from the (sudo) crontab of nodus to the controls user crontab on megatron.

 

  14029   Thu Jun 28 10:28:27 2018   rana | Update | CDS | vacuum pneumatic N2 pressure

We disabled logging the N2 pressure to a text file, since it was filling up disk space. Now it just sends an email to our 40m mailing list, so we'll all get a warning.

The crontab uses the bash form of output redirection, '2>&1', which redirects both stdout and stderr; but we probably just want stderr, since stdout contains routine messages without issues and will just fill the disk up again.
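For example, in a crontab entry (the script path here is illustrative), the difference is:

0 */6 * * * /path/to/n2check.sh 2>&1           # stdout and stderr both end up in the log/mail
0 */6 * * * /path/to/n2check.sh 1>/dev/null    # stderr only: routine stdout is discarded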

  14133   Sun Aug 5 13:28:43 2018   gautam | Update | CDS | c1lsc flaky

Since the lab-wide computer shutdown last Wednesday, all the realtime models running on c1lsc have been flaky. The error is always the same:

[58477.149254] c1cal: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1daf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1ass: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1oaf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1lsc: ADC TIMEOUT 0 10963 19 11027
[58478.148001] c1x04: timeout 0 1000000 
[58479.148017] c1x04: timeout 1 1000000 
[58479.148017] c1x04: exiting from fe_code()

This has happened at least 4 times since Wednesday. The reboot script makes recovery easier, but doing it once every 2 days is getting annoying, especially since we are running many things (e.g. ASS) in custom configurations which have to be reloaded each time. I wonder why the problem persists even though I've power-cycled the expansion chassis? I want to do some IFO characterization today, so I'm going to run the reboot script again, but I'll get in touch with J Hanks to see if he has any insight (I don't think there are any logfiles on the FEs that I'll wipe out by rebooting). I wonder if this problem is connected to DuoTone? But if so, why is c1lsc the only FE with this problem? c1sus also does not have the DuoTone system set up correctly...

The last time this happened, the problem apparently fixed itself, so I still don't have any insight into what is causing it in the first place. Maybe I'll try disabling c1oaf, since that's the configuration we've been running in for a few weeks.

  14136   Mon Aug 6 00:26:21 2018   gautam | Update | CDS | More CDS woes

I spent most of today fighting various CDS errors.

  • I rebooted c1lsc around 3pm; my goal was to try some vertex locking and figure out the implications of having only ~30% of the power we used to have at the AS port.
  • Shortly afterwards (~4pm), c1lsc crashed.
  • Using the reboot script, I was able to bring everything back up. But the DC lights on c1sus models were all red, and a 0x4000 error was being reported.
  • This error is indicative of some timing issue, but all the usual tricks (reboot vertex FEs in various order, restart the mx_streams etc) didn't clear this error.
  • I checked the Tempus GPS unit, that didn't report any obvious problems (i.e. front display was showing the correct UTC time).
  • Finally, I decided to shut down all watchdogs, soft reboot all the FEs, soft reboot FB, power cycle all expansion chassis.
  • This seems to have done the trick - I'm leaving c1oaf disabled for now.
  • The remaining red indicators are due to c1dnn and c1oaf being disabled.

Let's see how stable this configuration is. Onto some locking now...

  14139   Mon Aug 6 14:38:38 2018   gautam | Update | CDS | More CDS woes

Stability was short-lived, it seems. When I came in this morning, all the models on c1lsc were already dead, and now c1sus is also dead (Attachment #1). Moreover, the MC1 shadow sensors failed for a brief period again this afternoon (Attachment #2). I'm going to wait for some CDS experts to take a look at this, since any fix I effect seems to be short-lived. For the MC1 shadow sensors, I wonder if the Trillium box (and associated Sorensen) failure somehow damaged the MC1 shadow sensor/coil driver electronics.

Quote:
 

Let's see how stable this configuration is. Onto some locking now...

  14140   Mon Aug 6 19:49:09 2018   gautam | Update | CDS | More CDS woes

I've left the c1lsc frontend shut down for now, to see if c1sus and c1ioo can survive without any problems overnight. In parallel, we are going to try to debug the MC1 OSEM sensor problem - the idea is to disable the bias voltage to the OSEM LEDs and see if the readback channels still go below zero; this would be a clear indication that the problem is in the readback transimpedance stage and not the LED. Per the schematic, this can be done by simply disconnecting the two D-sub connectors going to the vacuum flange (this is the configuration in which we usually use the sat box tester kit, for example). Attachment #1 shows the current setup at the PD readout board end. The dark DC count (i.e. with the OSEM LEDs off) is ~150 cts, while the nominal level is ~1000 cts, so perhaps this is already indicative of something being broken, but let's observe overnight.

  14142   Tue Aug 7 11:30:46 2018   gautam | Update | CDS | More CDS woes

Overnight, all models on c1sus and c1ioo had no stability issues, supporting the hypothesis that the timing issues stem from c1lsc. Moreover, the MC1 shadow sensor readouts showed no negative values over a ~12 hour period. I think we should just observe this for another day; in any case I don't think there is any urgent IFO-related activity scheduled.

  14143   Tue Aug 7 22:28:23 2018   gautam | Update | CDS | More CDS woes

I am starting the c1x04 model (IOP) on c1lsc to see how it behaves overnight.

Well, there was apparently an immediate reaction - all the models on c1sus and c1ioo reported an ADC timeout and crashed. I'm going to reboot them while keeping the c1x04 IOP running, to see what happens.

[97544.431561] c1pem: ADC TIMEOUT 3 8703 63 8767
[97544.431574] c1mcs: ADC TIMEOUT 1 8703 63 8767
[97544.431576] c1sus: ADC TIMEOUT 1 8703 63 8767
[97544.454746] c1rfm: ADC TIMEOUT 0 9033 9 8841
Quote:

Overnight, all models on c1sus and c1ioo had no stability issues, supporting the hypothesis that the timing issues stem from c1lsc. Moreover, the MC1 shadow sensor readouts showed no negative values over a ~12 hour period. I think we should just observe this for another day; in any case I don't think there is any urgent IFO-related activity scheduled.

  14146   Wed Aug 8 23:03:42 2018   gautam | Update | CDS | c1lsc model started

As part of this slow but systematic debugging, I am turning on the c1lsc model overnight to see if the model crashes return.

  14149   Thu Aug 9 12:31:13 2018   gautam | Update | CDS | CDS status update

The model seems to have run without issues overnight. Not completely related, but the MC1 shadow sensor signals also don't show any abnormal excursions to negative values in the last 48 hours. I'm thinking about re-connecting the satellite box (but preserving the breakout setup at 1X6 for a while longer) and re-locking the IMC. I'll also start c1ass on the c1lsc frontend. I would say that the other models on c1lsc (i.e. c1oaf, c1cal, c1daf) aren't really necessary for basic IFO operation.

Quote:

As part of this slow but systematic debugging, I am turning on the c1lsc model overnight to see if the model crashes return.

  14151   Thu Aug 9 22:50:13 2018   gautam | Update | CDS | AlignSoft script modified

After this work of increasing the series resistance on ETMX, there have been numerous occasions where insufficient misalignment of ETMX has caused problems in locking the vertex cavities. Today, I modified the script (located at /opt/rtcds/caltech/c1/medm/MISC/ifoalign/AlignSoft.py) to avoid such problems. The misalign script works by writing an offset value to the "TO_COIL" filter bank (accessed via the "Output Filters" button on the Suspension master MEDM screen - not the most intuitive place to put an offset, but okay). So I just increased the value of this offset from 250 counts to 2500 counts (for ETMX only). I checked that the script works: now when both ETMs are misaligned, the AS55Q signal shows a clean Michelson-like sine wave as it fringes, instead of also having the arm cavity PDH fringes.
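The change amounts to something like the following; the channel name here is a hypothetical stand-in for the actual TO_COIL offset channel the script writes to:

caput C1:SUS-ETMX_TO_COIL_OFFSET 2500    # misalignment offset, was 250 cts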

Note that svn doesn't seem to work on the newly upgraded SL7 machines: svn status gives me the following output.

svn: E155036: Please see the 'svn upgrade' command
svn: E155036: Working copy '/cvs/cds/rtcds/userapps/trunk/cds/c1/medm/MISC/ifoalign' is too old (format 10, created by Subversion 1.6)

 Is it safe to run 'svn upgrade'? Or is it time to migrate to git.ligo.org/40m/scripts?

  14166   Wed Aug 15 21:27:47 2018   gautam | Update | CDS | CDS status update

Starting c1cal now, let's see if the other c1lsc FE models are affected at all... Moreover, since MC1 seems to be well-behaved, I'm going to restore the nominal eurocrate configuration (sans extender board) tomorrow.

  14171   Mon Aug 20 15:16:39 2018   Jon | Update | CDS | Rebooted c1lsc, slow machines

When I came in this morning, no light was reaching the MC. One fast machine was dead, c1lsc, as were a number of the slow machines: c1susaux, c1iool0, c1auxex, c1auxey, c1iscaux. Gautam walked me through resetting the slow machines manually and the fast machines via the reboot script. The computers are all back online and the MC is again able to lock.

  14187   Tue Aug 28 18:39:41 2018   Jon | Update | CDS | C1LSC, C1AUX reboots

I found c1lsc unresponsive again today. Following the procedure in elog #13935, I ran the rebootC1LSC.sh script to perform a soft reboot of c1lsc and restart the epics processes on c1lsc, c1sus, and c1ioo. It worked. I also manually restarted one unresponsive slow machine, c1aux.

After the restarts, the CDS overview page shows the first three models on c1lsc are online (image attached). The above elog references c1oaf having to be restarted manually, so I attempted to do that. I connected via ssh to c1lsc and ran the script startc1oaf. However, this failed as well.

In this state I was able to lock the MICH configuration, which is sufficient for my purposes for now, but I was not able to lock either of the arm cavities. Are some of the still-dead models necessary to lock in resonant configurations?

  14192   Tue Sep 4 10:14:11 2018   gautam | Update | CDS | CDS status update

c1lsc crashed again. I've contacted Rolf/JHanks for help since I'm out of ideas on what can be done to fix this problem.

Quote:

Starting c1cal now, let's see if the other c1lsc FE models are affected at all... Moreover, since MC1 seems to be well-behaved, I'm going to restore the nominal eurocrate configuration (sans extender board) tomorrow.

  14193   Wed Sep 5 10:59:23 2018   gautam | Update | CDS | CDS status update

Rolf came by this morning. For now, we've restarted the FE machine and the expansion chassis (note that the correct order in which to do this is: turn off computer ---> turn off expansion chassis ---> turn on expansion chassis ---> turn on computer). The debugging measures Rolf suggested are (i) replacing the old-generation ADC card in the expansion chassis, which has a red indicator light always on, and (ii) replacing the PCIe fiber (2010 make) running from the c1lsc front-end machine in 1X6 to the expansion chassis in 1Y3, as the manufacturer has suggested that pre-2012 versions of the fiber are prone to failure. We will do these opportunistically and see if there is any improvement in the situation.

Another tip from Rolf: if the c1lsc FE is responsive but the models have crashed, then doing sudo reboot by ssh-ing into c1lsc should suffice* (i.e. it shouldn't take down the models on the other vertex FEs, although if the FE is unresponsive and you hard reboot it, this may still be a problem). I've modified the c1lsc reboot script accordingly.

* Seems like this can still lead to the other vertex FEs crashing, so I'm leaving the reboot script as is (so all vertex machines are softly rebooted when c1lsc models crash).
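For the record, the one-liner in question is just the following (per the footnote above, this can still crash the other vertex FEs, so use with caution):

ssh c1lsc "sudo reboot"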

Quote:

c1lsc crashed again. I've contacted Rolf/JHanks for help since I'm out of ideas on what can be done to fix this problem.

  14194   Thu Sep 6 14:21:26 2018   gautam | Update | CDS | ADC replacement in c1lsc expansion chassis

Todd E. came by this morning and gave us (i) 1x new ADC card and (ii) 1x roll of 100m (2017 vintage) PCIe fiber. This afternoon, I replaced the old ADC card in the c1lsc expansion chassis and returned the old card to Todd. The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3), but hopefully the problem was the ADC card with the red indicator light, and replacing it has solved the issue. CDS is back to what is now the nominal state (Attachment #1) and the Y arm is locked for Jon to work on his IFO coupling study. We will monitor the stability in the coming days.

Quote:

(i) replacing the old-generation ADC card in the expansion chassis, which has a red indicator light always on, and (ii) replacing the PCIe fiber (2010 make) running from the c1lsc front-end machine in 1X6 to the expansion chassis in 1Y3, as the manufacturer has suggested that pre-2012 versions of the fiber are prone to failure. We will do these opportunistically and see if there is any improvement in the situation.

  14195   Fri Sep 7 12:35:14 2018   gautam | Update | CDS | ADC replacement in c1lsc expansion chassis

Looks like the ADC was not to blame; the same symptoms persist.

Quote:

The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3), but hopefully the problem was the ADC card with the red indicator light, and replacing it has solved the issue.

  14196   Mon Sep 10 12:44:48 2018   Jon | Update | CDS | ADC replacement in c1lsc expansion chassis

Gautam and I restarted the models on c1lsc, c1ioo, and c1sus. The LSC system is functioning again. We found that restarting only c1lsc, as Rolf had recommended, did in fact kill the models running on the other two machines. We simply reverted the rebootC1LSC.sh script to its previous form, since that does work. I'll keep using that as required until the ongoing investigations find the source of the problem.

Quote:

Looks like the ADC was not to blame; the same symptoms persist.

Quote:

The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3), but hopefully the problem was the ADC card with the red indicator light, and replacing it has solved the issue.

 

  14202   Thu Sep 20 11:29:04 2018   gautam | Update | CDS | New PCIe fiber housed

[steve, yuki, gautam]

The plastic tubing/housing for the fiber arrived a couple of days ago. We routed ~40m of fiber through roughly that length of tubing this morning, using some custom implements Steve sourced. To make sure we didn't damage the fiber in the process, I'm now testing the vertex models with the plastic tubing just routed casually (= illegally) along the floor from 1X4 to 1Y3 (NOTE THAT THE WIKI PAGE DIAGRAM IS OUT OF DATE AND NEEDS TO BE UPDATED), and I have plugged the new fiber into the expansion chassis and the c1lsc front-end machine. But I'm seeing a DC error (0x4000), which is indicative of some sort of timing error (Attachment #1)**. Needs more investigation...

Pictures + more procedural details + proper routing of the protected fiber along cable trays after lunch. If this doesn't help the stability problem, we are out of ideas again, so fingers crossed...

** In the past, I have been able to fix the 0x4000 error by manually rebooting fb (simply restarting the daqd processes on fb using sudo systemctl restart daqd_* doesn't seem to fix the problem). Sure enough, this seems to have done the job this time as well (Attachment #2). So my initial impression is that the new fiber is functioning alright.

Quote:

The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3)

  14203   Thu Sep 20 16:19:04 2018   gautam | Update | CDS | New PCIe fiber install postponed to tomorrow

[steve, gautam]

This didn't go as smoothly as planned. While there were no issues with the new fiber over the ~3 hours that I left it plugged in, I didn't realize the fiber has distinct ends for the "HOST" and "TARGET" (-5 points to me, I guess). So while we had plugged in the ends correctly (by accident) for the pre-lunch test, we switched the ends while routing the fiber on the overhead cable tray (because the "HOST" end of the cable is close to the reel, and we felt it would be easier to do the routing the other way).

Anyway, we will fix this tomorrow. For now, the old fiber was re-connected, and the models are running. IMC is locked.

Quote:

Pictures + more procedural details + proper routing of the protected fiber along cable trays after lunch. If this doesn't help the stability problem, we are out of ideas again, so fingers crossed...

  14206   Fri Sep 21 16:46:38 2018   gautam | Update | CDS | New PCIe fiber installed and routed

[steve, koji, gautam]

We took another pass at this today, and it seems to have worked - see Attachment #1. I'm leaving CDS in this configuration so that we can investigate stability. The IMC could be locked. However, since the vacuum slow machine has failed, we are going to leave the PSL shutter closed over the weekend.

  14208   Fri Sep 21 19:50:17 2018   Koji | Update | CDS | Frequent time out

Multiple realtime processes on c1sus are suffering from frequent time-outs. This eventually knocks out the c1sus (process).

Obviously this has started since the fiber swap this afternoon.

gautam 10pm: there are no clues as to the origin of this problem in the c1sus frontend dmesg logs. The only clue (see Attachment #3) is that the "ADC" error bit in the CDS status word is red - but opening up the individual ADC error log MEDM screens shows no errors or overflows. Not sure what to make of this. The IOP model on this machine (c1x02) reports an error in the "Timing" bit of the CDS status word, but from the previous exchange with Rolf / J Hanks, this is down to a misuse of ADC0 Ch31, which is supposed to be reserved for a DuoTone diagnostic signal but which we use for some other signal (one of the MC suspension shadow sensors, iirc). The response is also not consistent with this CDS manual, which suggests that an "ADC" error should just kill the models. There are no obvious red indicator lights in the c1sus expansion chassis either.

  14210   Sat Sep 22 00:21:07 2018   Koji | Update | CDS | Frequent time out

[Gautam, Koji]

We had another crash of c1sus, and Gautam did a full power cycle of it. It was a struggle to recover all the frontends, but this solved the timing issue.

We went through a full reset of c1sus and rebooted all the other RT hosts, as well as daqd and fb1.

  14253   Sun Oct 14 16:55:15 2018   not gautam | Update | CDS | pianosa upgrade

DASWG is not what we want to use for config; we should use the K. Thorne LLO instructions, like I did for ROSSA.

Quote:

pianosa has been upgraded to SL7. I've made a controls user account, added it to sudoers, did the network config, and mounted /cvs/cds using /etc/fstab. Other capabilities are being slowly added, but it may be a while before this workstation has all the kinks ironed out. For now, I'm going to follow the instructions on this wiki to try and get the usual LSC stuff working.

  14267   Fri Nov 2 12:07:16 2018   rana | Update | CDS | NDScope

https://alog.ligo-wa.caltech.edu/aLOG/index.php?callRep=44971

Let's install Jamie's new Data Viewer

  14293   Tue Nov 13 21:53:19 2018   gautam | Update | CDS | RFM errors

This problem resurfaced, which I noticed when I couldn't get the single arm locks going.

Restarting the c1rfm model was NOT the fix - that just brought the misery of all the vertex FEs crashing and the usual dance to get everything back.

Restarting the sender models (i.e. c1scx and c1scy) seems to have done the trick though.

  14344   Tue Dec 11 14:33:29 2018   gautam | Update | CDS | NDScope

NDscope is now running on pianosa. To be really useful, we need the templates, so I've made /users/Templates/NDScope_templates, where these will be stored. Perhaps someone can write a parser to convert dataviewer .xml templates to something ndscope can understand. To get it installed, I had to run:

sudo yum install ndscope
sudo yum install python34-gpstime
sudo yum install python34-dateutil
sudo yum install python34-requests

I also changed the PYTHONPATH variable in .bashrc to include the python3.4 site-packages directory.
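The .bashrc addition was along these lines (the exact site-packages path is illustrative and may differ on pianosa):

export PYTHONPATH=/usr/lib/python3.4/site-packages:$PYTHONPATH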

Quote:

https://alog.ligo-wa.caltech.edu/aLOG/index.php?callRep=44971

Let's install Jamie's new Data Viewer

  14356   Thu Dec 13 22:56:28 2018   gautam | Update | CDS | Frames

[koji, gautam]

We looked into the /frames situation a bit tonight. Here is a summary:

  1. We have already lost some second trend data from the period since the new FB has been running (~August 2017).
  2. The minute trend data is still safe from that period, we believe.
  3. The Jetstor has ~2TB of trend data in the /frames/trend folder.
    • This is a combination of "second", "minute_raw" and "minute".
    • It is not clear to us what the distinction is between "minute_raw" and "minute", except that the latter seems to go back farther in time than the former.
    • Even so, the minute trend folder from October 2011 is empty - so how did we manage to get the long-term trend data?? From the folder volumes, it appears that the oldest available trend data is from ~July 24, 2015.

Plan of action:

  1. The wiper script needs to be tweaked a bit to allow more storage for the minute trends (which we presumably want to keep for long term).
  2. We need to clear up some space on FB1 to transfer the old trend data from Jetstor to FB1.
  3. We need to revive the data backup via LDAS. Also summary pages.

BTW - the last chiara (shared drive) backup was October 16 6 am. dmesg showed a bunch of errors, Koji is now running fsck in a tmux session on chiara, let's see if that repairs the errors. We missed the opportunity to swap in the 4TB backup disk, so we will do this at the next opportunity.

  14359   Fri Dec 14 14:25:36 2018   Koji | Update | CDS | chiara backup

fsck of the chiara backup disk (UUID="90a5c98a-22fb-4685-9c17-77ed07a5e000") is done, but it required many files to be fixed, so the backed-up files are not reliable now.
On top of that, the disk then stopped being recognized by the machine.

I went to the disk and disconnected the USB and then the power supply, which was/is connected to the UPS.
Then they were reconnected. This made the disk come back as /media/90a5c98a-22fb-4685-9c17-77ed07a5e000. (*)
After unmounting this disk, I ran "sudo mount -a" so that it gets mounted the way fstab specifies.
Now I am running the backup script manually, so that we can pretend to maintain a snapshot of the day at least.

(*) This is the same situation we found during the recovery from the power shutdown. So my hypothesis is that on Oct 16 at 7 AM, during the backup, there was a USB failure or disk failure or something which unmounted the disk. This caused some files to get damaged, and also caused the disk to be mounted as /media/90a5c98a-22fb-4685-9c17-77ed07a5e000. So since then, we have not had a backup.
Update (20:00): The disk connection failed again. I think this disk is no longer reliable.

 

  14374   Thu Dec 20 17:17:41 2018   gautam | Update | CDS | Logging of new Vacuum channels

Added the following channels to C0EDCU.ini:

[C1:Vac-P1b_pressure]
units=torr
[C1:Vac-PRP_pressure]
units=torr
[C1:Vac-PTP2_pressure]
units=torr
[C1:Vac-PTP3_pressure]
units=torr
[C1:Vac-TP2_rot]
units=kRPM
[C1:Vac-TP3_rot]
units=kRPM

Also modified the old P1 channel to

[C1:Vac-P1a_pressure]
units=torr

Unfortunately, we realized too late that we didn't have these channels in the frames, so the data from this test pumpdown was not logged; future data will be, though. I say we should also log diagnostics from the pumps, such as temperature, current, etc. After making the changes, I restarted the daqd processes.
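The daqd restart was done with the systemd units on fb1, following the same pattern quoted elsewhere in this log:

sudo systemctl restart daqd_*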


Things to add to the ASA wiki page once the wiki comes back online:

  1. What is the safe way to clean the cryo pump if we want to use it again?
  2. What are safe conditions to turn the RGA on?
  14376   Fri Dec 21 11:11:51 2018   gautam | Update | CDS | Logging of new Vacuum channels

The N2 pressure channel name was also wrong in C0EDCU.ini, so I updated it this morning to the correct name and units:

[C1:Vac-N2_pressure]
units=psi

Now it too is being recorded to frames.

  14386   Fri Jan 4 17:43:24 2019   gautam | Update | CDS | Timing issues

[J Hanks (remote), koji, gautam]

Summary:

The problem stems from the way GPS timing signals are handled by the FEs and FB. We effected a partial fix:

  • Now, old frame data is no longer being overwritten
  • For the channels that are indeed being recorded now, the correct time stamp is being applied so they can be found in /frames by looking for the appropriate gpstime.

Details:

  • The usual FE/FB power cycling did not fix the problem.
  • The gps time used by FB and the associated RT processes may be found by using cat /proc/gps (i.e. this is different from the system time found by using date, or gpstime); see the sketch after this list.
  • This was off by 2 years.
  • The way this worked up till now was by adding a fixed offset to this time.
    • This offset can be found as a line saying set symm_gps_offset=31622382 in daqdrc.fw (for example)
    • There were similar lines in daqdrc.rcv and daqdrc.dc - however, they were not all the same offset! We couldn't figure out why.
    • All these files live in /opt/rtcds/caltech/c1/target/daqd/.
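A quick way to see the mismatch described in the list above, on FB or any FE:

cat /proc/gps    # gps time as seen by daqd / the RT processes
gpstime          # actual current gps time
date             # system time, for comparison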

Changes effected:

  1. First, we tried changing the offset in the daqdrc.fw file only.
    • Incremented it by 24*60*60*365 = number of seconds in a year with no leap seconds/days.
    • This did not fix the problem.
  2. So J Hanks decided to rebuild the Spectracom driver (the steps below may not be comprehensive, but I think I got everything).
    • The relevant file is spectracomGPS.c (he made a copy of /usr/src/symmetricom-3.3~rc1, called symmetricom-3.3~rc1-patched; this file is in /usr/src/symmetricom-3.3~rc1-patched/include/drv)
    • Added the following lines:
      /* 2018 had no leap seconds or leap days, so adjust for that */
             pHardware->gpsOffset += 31536000;
    • re-built and installed the modified symmetricom driver.
    • Checked that cat /proc/gps now yields the correct time.
    • Reset the gps time offsets in daqdrc.fw, daqdrc.rcv and daqdrc.dc to 0
    • With these steps, the frames were being written to /frames with the correct timestamp.
  3. Next, we checked the timing on the FEs
    • Basically, J Hanks rebuilt the version of the symmetricom driver that is used by the rtcds models to mimic the changes made for FB.
    • This did the trick for c1lsc and c1ioo - cat /proc/gps now returns the correct time on those FEs.
    • However, c1sus remains problematic (it initially reported a GPS time from 2 years ago, and even after the driver re-install, it is 4 days behind) - he suspects that this is because c1sus is the only FE with a Symmetricom/Spectracom card installed in the I/O chassis. So c1sus reports a gpstime that is ~4 days behind the "correct" gpstime.
    • He is going to check with Rolf/other CDS experts to figure out if it's okay for us to simply remove the card and run the models, or if we need to take other steps.
    • As part of this work, the c1x02 IOP model was recompiled, re-installed and re-started.

The realtime models were not restarted (although all the vertex FEs are running) - we can regroup next week and decide what is the correct course of action.

Quote:
 
  • Attachment #2 shows the minute trend of the pressure gauges for a 12 day period - it looks like there is some issue with the frame builder clock, perhaps this issue resurfaced? But checking the system time on FB doesn't suggest anything is wrong.. I double checked with dataviewer as well that the trends don't exist... But checking the status of the individual daqd processes indeed showed that the dates were off by 1 year, so I just restarted all of them and now the time seems correct. How can we fix this problem more permanently? Also, the P1b readout looks suspicious - why are there periods where it seems like we are reading values better than the LSB of the device?
  14392   Wed Jan 9 11:33:35 2019   gautam | Update | CDS | Timing issues still persist

Summary:

The gps time mismatch between /proc/gps and gpstime seems to be resolved. However, the 0x4000 DC errors still persist. It is not clear to me why.

Details:

On the phone with J Hanks on Friday, he reminded me that c1sus seems to be the only machine with an IRIG-B timing card installed. I can't find the elog, but I remember that Jamie, ericq and I did this work in 2016 (?), and I also remember Jamie saying it wasn't working exactly as expected. Since the DAQ was working fine before this card was installed, and since there are no problems with the recording of channels from the other four FE machines without this card installed, I decided to simply pull the card out of the expansion chassis. The card has been stored in the CDS/FE cabinet along the Y arm for now. There was also a cable interfacing to the card, which brings over the 1pps from the GPS unit; it has also been stored in the CDS/FE cabinet.

This seems to have resolved the mismatch between the gpstime reported by cat /proc/gps and the gpstime command - Attachment #1 (the <1 second mismatch is presumably due to the dead time between commands). However, the 0x4000 DC errors still persist. I'll try the full power cycle of the FEs and FB, which has fixed this kind of error in the past, but apart from that, I'm out of ideas.

Update 1215:

Following the instructions in this elog did not fix the problem. The problem seems to be with the daqd_fw service, which reports the following:

controls@fb1:~ 0$ sudo systemctl status daqd_fw.service 
● daqd_fw.service - Advanced LIGO RTS daqd frame writer
   Loaded: loaded (/etc/systemd/system/daqd_fw.service; enabled)
   Active: failed (Result: start-limit) since Wed 2019-01-09 12:17:12 PST; 2min 0s ago
  Process: 2120 ExecStart=/usr/bin/daqd_fw -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.fw (code=killed, signal=ABRT)
 Main PID: 2120 (code=killed, signal=ABRT)

Jan 09 12:17:12 fb1 systemd[1]: Unit daqd_fw.service entered failed state.
Jan 09 12:17:12 fb1 systemd[1]: daqd_fw.service holdoff time over, scheduling restart.
Jan 09 12:17:12 fb1 systemd[1]: Stopping Advanced LIGO RTS daqd frame writer...
Jan 09 12:17:12 fb1 systemd[1]: Starting Advanced LIGO RTS daqd frame writer...
Jan 09 12:17:12 fb1 systemd[1]: daqd_fw.service start request repeated too quickly, refusing to start.
Jan 09 12:17:12 fb1 systemd[1]: Failed to start Advanced LIGO RTS daqd frame writer.
Jan 09 12:17:12 fb1 systemd[1]: Unit daqd_fw.service entered failed state.
                                                     

Update 1530:

The frame-writer error was tracked down to a C0EDCU issue. Jon told me that the Hornet CC1 pressure gauge channel was renamed to C1:Vac-CC1_pressure, and I made the change in the C0EDCU file. However, the channel returns a value of 9990000000.0, which the frame writer is not happy about... Keeping the old channel name makes the frame-writer run again (although the actual data is bunk).
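For the record, the (since reverted) C0EDCU entry would look like the other pressure channels; units here are assumed to be torr:

[C1:Vac-CC1_pressure]
units=torr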

Update 1755:

J Hanks suggested adding a 1 second offset to the daqdrc config files. This has now fixed the 0x4000 errors, and we are back to the "nominal" RTCDS status screen now - Attachment #2.
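Presumably, assuming the same parameter as in the daqdrc.fw example above, the added line is of the form:

set symm_gps_offset=1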

  14455   Thu Feb 14 23:14:12 2019   gautam | Update | CDS | c1rfm errors

The pressure is still 2e-4 torr according to CC1, so I thought I'd give ASS debugging a go tonight. But the arm transmission signal isn't coming through to the LSC model from the end PDs - a resurfacing of this problem. Rebooting the sender model, c1scy, did not fix the problem. Moreover, c1susaux is dead. The last time I rebooted it, ITMY got stuck, so I'm not going to attempt a revival tonight.

  14457   Fri Feb 15 15:22:08 2019   gautam | Update | CDS | c1rfm errors persist

I restarted c1scy and c1rfm (so both sender and receiver models were cycled) and power-cycled the c1iscey and c1sus machines. The TRY PD is certainly seeing light - it is just not getting piped over to c1rfm. dmesg doesn't give any clues. I'm out of ideas.

P.S. The new reality seems to be that getting ITMY stuck in the event of a c1susaux reboot is inevitable. As is the practice for ITMX, I tried slowly ramping the PIT and YAW biases to 0 - but in the process of ramping YAW down, the optic got stuck. I am ramping in steps of 0.1 (in units of the PIT/YAW sliders), waiting ~3 seconds between steps; I guess I can try ramping even more slowly.

Update: I power cycled the physical RFM switch. This necessitated a reboot of all vertex FEs, but it seems like things are back to normal now...

Note: to unstick ITMY, the best approach seems to be:

  1. Jiggle the bias until the SIDE shadow sensor is on average above its half-light level. This is the critical step. A bias of +20000 cts on the fast SIDE output seems to help.
  2. Set the YAW bias to -10, then ramp down the bias in steps of 0.1, watching the shadow sensor levels to ensure the optic doesn't get stuck again (a sketch of the ramp follows the list).
  3. Hope for the best. Iterate if necessary.
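A sketch of the ramp in step 2, assuming a caput-able bias channel (the channel name is hypothetical):

for v in $(seq -10 0.1 0); do
    caput C1:SUS-ITMY_YAW_BIAS $v    # hypothetical slider channel
    sleep 3                          # ~3 s between 0.1-sized steps
done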
Quote:

The pressure is still 2e-4 torr according to CC1, so I thought I'd give ASS debugging a go tonight. But the arm transmission signal isn't coming through to the LSC model from the end PDs - a resurfacing of this problem. Rebooting the sender model, c1scy, did not fix the problem. Moreover, c1susaux is dead. The last time I rebooted it, ITMY got stuck, so I'm not going to attempt a revival tonight.

  14472   Sat Mar 2 14:19:35 2019   gautam | Update | CDS | FSS Slow servo gains not burt-ed

The PSL NPRO PZT voltage showed large low-frequency (hour-timescale) excursions on the control room StripTool trace, leading me to suspect the slow servo wasn't working as expected. Yesterday evening, I keyed the unresponsive c1psl crate at ~9 PM PST and had to run the burtrestore to get the PMC locking working. I must have pressed the wrong button on burtgooey or something, because all the FSS_SLOW channels were reset to 0. What's more, their values were not being saved by the hourly burt-snap script, so I don't have any lookback on what these values were. There isn't any detailed record in the elog of the optimal values; the most recent reference I could find was Ki=0.1, Kp=Kd=0, which is what I've now set them to. The servo isn't running away, so I'm leaving things in this state; PID tuning can be done later.

I also added the FSS Slow servo channels to the burt snapshot requirement file at /cvs/cds/caltech/target/c1psl/autoBurt.req, and confirmed that the snapshots are getting the channels from now onwards.

While looking at the req file, I saw a bunch of *_MOPA* channels and also several other currently unused channels. It would probably be beneficial to go through these and comment out all the legacy channels, to minimize wasted disk space (though we compress the snapshot files every few years anyway, I guess).

Reminder that this (unrelated) issue still needs to be looked into... Note also that the new vacuum system does not have burt snapshots set up (i.e. it is still trying to get the old channels from the c1vac1 and c1vac2 databases, which, while having significant overlap with the new system, should probably be set up correctly).

  14492   Thu Mar 21 18:09:36 2019   Koji | Update | CDS | db file preparation for acromag c1susaux

I have updated the google doc spreadsheet to indicate the required action for the new dbfile generation.

There are three types of actions:

1. COPY - Just duplicate the old EPICS db entry. This is for soft channels, calc channels.
2. DELETE - Delete the entry for some physical channels that will not be implemented on Acromag (oplev, dewhitening mon, AI monitor, etc)
3. REPLACE - For the physical channels, we want to replace the port names.

The blue part of the spreadsheet indicates the action for each channel. If it is a physical channel, the assigned module and channel are indicated there. What we still want to do is use this information to generate the port names, which look like "@asynMask(C1VAC_XT1221A_ADC 1 -16)MODBUS_DATA" (a sketch is given below).
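A sketch of that string generation, using the example values from the text above:

module="C1VAC_XT1221A_ADC"    # assigned module, from the spreadsheet
chan=1                        # assigned channel number
printf '@asynMask(%s %d -16)MODBUS_DATA\n' "$module" "$chan"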

The links to the spreadsheets can be found on 40m wiki: https://wiki-40m.ligo.caltech.edu/CDS/SlowControls/c1susaux

  14505   Mon Apr 1 12:01:52 2019   Jon | Update | CDS

I brought c1susaux back online this morning for suspension-channel test scripting. It had been dead for some time. I followed the procedure outlined in #12542. ITMY became stuck during this process, which Gautam tells me always happens since the last vacuum access, but ITMX is not stuck.
