ID | Date | Author | Type | Category | Subject
13732 | Thu Apr 5 19:31:17 2018 | gautam | Update | CDS | EPICS processes for c1ioo dead
I found all the EPICS channels for the model c1ioo on the FE c1ioo to be blank just now. The realtime model itself seemed to be running fine, judging by the IMC alignment (the WFS loops seemed to still be running okay). I couldn't find any useful info in dmesg, but I don't know exactly what I'm looking for. So my guess is that somehow the EPICS process for that model died. Unclear why. |
13733 | Fri Apr 6 10:00:29 2018 | gautam | Update | CDS | CDS puzzle
Quote: |
Clearly, there is a discrepancy for f>20Hz. Why?
|
Spectral leakage |
13734 | Fri Apr 6 16:07:18 2018 | gautam | Update | CDS | Frequent EPICS dropouts
Kira informed me that she was having trouble accessing past data for her PID tuning tests. Looking at the last day of data, it looks like there are frequent EPICS data dropouts, each up to a few hours. Observations (see Attachment #1 for evidence):
- Problem seems to be with writing these EPICS channels to frames, as the StripTool traces do not show the same discontinuities.
- Problem does not seem local to c1auxex (new Acromag machine). These dropouts are also happening in other EPICS channel records. I have plotted PMC REFL, which is logged by the slow machine c1psl, and you can see the dropouts happen at the same times.
- Problem does not seem to happen to the fast channel DQ records (see the ETMX sensor record plotted for 24 hours; there are no discontinuities).
It is difficult to tell how long this has been going on, because once you start pulling longer stretches of data in dataviewer, any "data freezes" get washed out over the span of data plotted. |
Attachment 1: EPICS_dropout.png
13758 | Wed Apr 18 10:44:45 2018 | gautam | Update | CDS | slow machine bootfest
All slow machines (except c1auxex) were dead today, so I had to key them all. While I was at it, I also decided to update the MC autolocker screen. Kira pointed out that I needed to change the EPICS input type (in the RTCDS model) to a "binary input", as opposed to an "analog input", which I did. Model recompilation and restart went smoothly. I had to go into the EPICS record manually to change the two choices to "ENABLE" and "DISABLE" as opposed to the default "ON" and "OFF". Anyway, long story short, the MC autolocker controls are a bit more intuitive now, I think. |
Attachment 1: MCautolocker_MEDM_revamp.png
13812 | Thu May 3 12:19:13 2018 | gautam | Update | CDS | slow machine bootfest
Reboot for c1susaux and c1iscaux today. ITMX precautions were followed. Reboots went smoothly.
IMC is shuttered while Jon does PLL characterization...
Now relocked. |
13898 | Wed May 30 16:12:30 2018 | Jonathan Hanks | Summary | CDS | Looking at c1oaf issues
When c1oaf starts up there are 446 gain channels that should be set to 0.0 but which end up at 1.0. An example channel is C1:OAF-ADAPT_CARM_ADPT_ACC1_GAIN. The safe.snap file states that it should be set to 0. After model start up it is at 1.0.
We ran some tests, including modifying the safe.snap to make sure it was reading the snap file we were expecting. For this I set the setpoint to 0.5. After a restart of the model we saw that the setpoint went to 0.5 but the EPICS value remained at 1.0. I then set the snap file back to its original setting. I ran the EPICS sequencer by hand in a gdb session and verified that the sequencer was setting the field to 0. I also built a custom sequencer that would catch writes by the SDF system to the channel. I only saw one write, the initial write that pushed a 0. I have reverted my changes to the sequencer.
The gain channel can be caput to the correct value and it is not pushed back to 1.0. So there does not appear to be a process actively pushing the value to 1.0. On Rolf's suggestion we ran the sequencer w/o the kernel object loaded, and saw the same behavior.
This will take some thought. |
13925 | Thu Jun 7 12:20:53 2018 | gautam | Update | CDS | slow machine bootfest
FSS slow wasn't running, so the PSL PZT voltage was swinging around a lot. The reason was that c1psl was unresponsive. I keyed the crate, and now it's okay. Now ITMX is stuck: Johannes just told me about an un-elogged c1susaux reboot. It seems that ITMX got stuck at ~4:30pm yesterday PT. After some shaking, the optic was loosened. Please follow the procedure in the future, and if you do a reboot, please elog it and verify that the optic didn't get stuck. |
Attachment 1: ITMX_stuck.png
13934 | Fri Jun 8 14:40:55 2018 | c1lsc | Update | CDS | i am dead
Attachment 1: 31.png
13935 | Fri Jun 8 20:15:08 2018 | gautam | Update | CDS | Reboot script
Unfortunately, this has happened (and seems like it will happen) enough times that I set up a script for rebooting the machine in a controlled way; hopefully it will negate the need to repeatedly go into the VEA and hard-reboot the machines. The script lives at /opt/rtcds/caltech/c1/scripts/cds/rebootC1LSC.sh. SVN committed. It worked well for me today. All applicable CDS indicator lights are now green again. Be aware that c1oaf will probably need to be restarted manually in order to make the DC light green. Also, this script won't help you if you try to unload a model on c1lsc and the FE crashes. It relies on c1lsc being ssh-able. The basic logic is (see the sketch after this list):
- Ask for confirmation.
- Shutdown all vertex optic watchdogs, PSL shutter.
- ssh into c1sus and c1ioo, shutdown all models on these machines, soft reboot them.
- ssh into c1lsc, soft reboot the machine. No attempt is made to unload the models.
- Wait 2 minutes for all machines to come back online.
- Restart models on all 3 vertex FEs (IOPs first, then rest).
- Prompt user for confirmation to re-enable watchdog status and open PSL shutter.
|
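A minimal sketch of that sequence (this is NOT the actual rebootC1LSC.sh; the watchdog/shutter channel names, model lists and rtcds invocations below are placeholders for illustration only):
#!/bin/bash
read -p "Soft-reboot c1lsc/c1sus/c1ioo? [y/N] " ans; [ "$ans" = "y" ] || exit 1
for opt in MC1 MC2 MC3 BS ITMX ITMY PRM SRM; do
    caput C1:SUS-${opt}_WATCHDOG_ENABLE 0      # hypothetical watchdog shutdown channel
done
caput C1:PSL-SHUTTER_CMD 0                     # hypothetical PSL shutter channel
for fe in c1sus c1ioo; do
    ssh controls@${fe} "rtcds stop all; sudo reboot"   # stop models, then soft reboot
done
ssh controls@c1lsc "sudo reboot"               # no attempt to unload models on c1lsc
sleep 120                                      # wait for the machines to come back
ssh controls@c1lsc "rtcds start c1x04 c1lsc c1ass c1oaf c1cal c1daf"   # IOP first, then the rest
ssh controls@c1sus "rtcds start c1x02 c1sus c1mcs c1rfm c1pem"
ssh controls@c1ioo "rtcds start c1x03 c1ioo"
read -p "Re-enable watchdogs and open PSL shutter now? [y/N] " ans     # operator restores these by hand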
Attachment 1: 31.png
13942 | Mon Jun 11 18:49:06 2018 | gautam | Update | CDS | c1lsc dead again
Why is this happening so frequently now? Last few lines of error log:
[ 575.099793] c1oaf: DAQ EPICS: Int = 199 Flt = 706 Filters = 9878 Total = 10783 Fast = 113
[ 575.099793] c1oaf: DAQ EPICS: Number of Filter Module Xfers = 11 last = 98
[ 575.099793] c1oaf: crc length epics = 43132
[ 575.099793] c1oaf: xfer sizes = 128 788 100988 100988
[240629.686307] c1daf: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1cal: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1ass: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1oaf: ADC TIMEOUT 0 43039 31 43103
[240629.686307] c1lsc: ADC TIMEOUT 0 43039 31 43103
[240630.684493] c1x04: timeout 0 1000000
[240631.684938] c1x04: timeout 1 1000000
[240631.684938] c1x04: exiting from fe_code()
I fixed it by running the reboot script. |
Attachment 1: 36.png
13947 | Mon Jun 11 23:22:53 2018 | gautam | Update | CDS | EX wiring confusion
[Koji, gautam]
Per this elog, we don't need any AIOut channels or Oplev channels. However, the latest wiring diagram I can find for the EX Acromag situation suggests that these channels are hooked up (physically). If this is true, there are 12 occupied ADC channels that we could free up for other purposes. Question for Johannes: is this true? If so, Kira has plenty of channels available for her temperature control stuff.
As an aside, we found that the EPICS channel names for the TRX/TRY QPD gain stages are somewhat strangely named. Looking closely at the schematic (which has now been added to the 40m DCC tree; we can add our custom mods later), they do (somewhat) add up, but I think we should definitely rename them in a more systematic manner, and use an MEDM screen to indicate stuff like x4 or x20 or "Active" etc. BTW, the EX and EY QPDs have different settings. But at least the settings are changed synchronously for all four quadrants, unlike the WFS heads...
Unrelated: I had to key the c1iscaux and c1auxey crates. |
13958 | Wed Jun 13 23:23:44 2018 | johannes | Update | CDS | EX wiring confusion
It's true.
I went through the wiring of the c1auxex crate today to disentangle the pin assignments. The full detail can be found in Attachment #1; Attachment #2 has less detail but is more eye candy. The red-flagged channels are now marked for removal at the next opportunity. This will free up DAQ channels as follows:
TYPE | Total | Available now | Available after
ADC | 24 | 2 | 14
DAC | 16 | 8 | 12
BIO sinking | 16 | 7 | 7
BIO sourcing | 8 | 8 | 8
This should be enough for temperature sensing, NPRO diagnostics, and even eventual remote PDH control with new servo boxes. |
Attachment 1: c1auxex_channels.pdf
Attachment 2: XEND_slow_wiring.pdf
13961 | Thu Jun 14 10:41:00 2018 | gautam | Update | CDS | EX wiring confusion
Do we really have 2 free ADC channels at EX now? I was under the impression we had ZERO free, which is why we wanted to put a new ADC unit in. I think the Vacuum gauge monitor channel, Seis Can Temp Sensor monitor, and Seis Can Heater channels are missing from the wiring diagram. It would also be good to have, in the wiring diagram, a mapping of which signals go to which I/O ports (D-sub, front panel BNC etc.) on the 4U(?) box housing all the Acromags; this would be helpful in future debugging sessions.
Quote: |
TYPE | Total | Available now | Available after
ADC | 24 | 2 | 14
|
13965 | Thu Jun 14 15:31:18 2018 | johannes | Update | CDS | EX wiring confusion
Bad wording, sorry; that should have read "channels in excess of the ETMX controls". I'll add the others to the list as well.
Updated channel list and wiring diagram attached. Labels are 'F' for 'Front' and 'R' for (you guessed it) 'Rear'; the number identifies the slot panel the breakout is attached to. |
Attachment 1: XEND_slow_wiring.pdf
Attachment 2: c1auxex_channels.pdf
14000 | Thu Jun 21 22:13:12 2018 | gautam | Update | CDS | pianosa upgrade
pianosa has been upgraded to SL7. I've made a controls user account, added it to sudoers, did the network config, and mounted /cvs/cds using /etc/fstab. Other capabilities are being slowly added, but it may be a while before this workstation has all the kinks ironed out. For now, I'm going to follow the instructions on this wiki to try and get the usual LSC stuff working. |
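For reference, the /cvs/cds mount in /etc/fstab is an NFS mount along these lines (the server name, export path and mount options here are assumptions, not copied from the actual file):
# hypothetical fstab entry for the shared drive on the SL7 workstation
chiara:/home/cds    /cvs/cds    nfs    rw,bg,soft,intr    0 0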
14003 | Fri Jun 22 00:59:43 2018 | gautam | Update | CDS | pianosa functional, but NO DTT
MEDM, EPICS and dataviewer seem to work, but diaggui still doesn't work (it doesn't work on Rossa either; same problem as reported here. Does a fix exist?). So it looks like only donatella can run diaggui for now. I had to disable the systemd firewall per the instructions page in order to get EPICS to work. Also, there is no MATLAB installed on this machine yet. sshd has been enabled. |
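For the record, "disabling the systemd firewall" on SL7 amounts to something like the following (a sketch; whether firewalld was stopped outright or just had the EPICS Channel Access ports opened is not recorded here):
sudo systemctl stop firewalld
sudo systemctl disable firewalld
# less drastic alternative: open only the EPICS CA ports
# sudo firewall-cmd --permanent --add-port=5064-5065/tcp --add-port=5064-5065/udp
# sudo firewall-cmd --reload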
14007 | Fri Jun 22 15:13:47 2018 | gautam | Update | CDS | DTT working
Seems like DTT also works now. The trick seems to be to run sudo /usr/bin/diaggui instead of just diaggui. So this is indicative of some conflict between the yum-installed gds and the relic gds from our shared drive. I also have to manually change the NDS settings each time; there's probably a way to set all of this up more smoothly, but I don't know what it is. awggui still doesn't get the correct channels, and I'm not sure where I can change the settings to fix that. |
Attachment 1: Screenshot_from_2018-06-22_15-12-37.png
14008 | Fri Jun 22 15:22:39 2018 | sudo | Update | CDS | DTT working
Quote: |
Seems like DTT also works now. The trick seems to be to run sudo /usr/bin/diaggui instead of just diaggui. So this is indicative of some conflict between the yum installed gds and the relic gds from our shared drive. I also have to manually change the NDS settings each time, probably there's a way to set all of this up in a more smooth way but I don't know what it is. awggui still doesn't get the correct channels, not sure where I can change the settings to fix that.
|
DON"T RUN DIAGGUI AS ROOT |
14027 | Wed Jun 27 21:18:00 2018 | gautam | Update | CDS | Lab maintenance scripts from NODUS---->MEGATRON
I moved the N2 check script and the disk usage checking script from the (sudo) crontab of nodus to the controls user crontab on megatron. |
14028 | Thu Jun 28 08:09:51 2018 | Steve | Update | CDS | vacuum pneumatic N2 pressure
The farthest I can go back on channel C1:Vac_N2pres is 320 days.
The C1:Vac-CC1_Hornet pressure gauge started logging on Feb 23, 2018.
Did you update the "low N2 message" email addresses?
Quote: |
I moved the N2 check script and the disk usage checking script from the (sudo) crontab of nodus to the controls user crontab on megatron .
|
Attachment 1: 320d_N2.png
14029 | Thu Jun 28 10:28:27 2018 | rana | Update | CDS | vacuum pneumatic N2 pressure
We disabled logging the N2 pressure to a text file, since it was filling up disk space. Now it just sends an email to our 40m mailing list, so we'll all get a warning.
The crontab uses the 'bash' style of output redirection '2>&1', which sends both stdout and stderr to the log; probably we just want stderr, since stdout contains messages without issues and will just fill the disk up again. |
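A sketch of the difference in the crontab (the script path, schedule and log file below are placeholders, not the actual entries):
# current style: stdout and stderr both appended, so the log grows with routine output
0 */2 * * * /path/to/checkN2.sh >> /path/to/N2check.log 2>&1
# stderr-only style: discard routine stdout, keep only errors in the log
0 */2 * * * /path/to/checkN2.sh > /dev/null 2>> /path/to/N2check.log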
14133 | Sun Aug 5 13:28:43 2018 | gautam | Update | CDS | c1lsc flaky
Since the lab-wide computer shutdown last Wednesday, all the realtime models running on c1lsc have been flaky. The error is always the same:
[58477.149254] c1cal: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1daf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1ass: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1oaf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1lsc: ADC TIMEOUT 0 10963 19 11027
[58478.148001] c1x04: timeout 0 1000000
[58479.148017] c1x04: timeout 1 1000000
[58479.148017] c1x04: exiting from fe_code()
This has happened at least 4 times since Wednesday. The reboot script makes recovery easier, but doing it every couple of days is getting annoying, especially since we are running many things (e.g. ASS) in custom configurations which have to be reloaded each time. I wonder why the problem persists even though I've power-cycled the expansion chassis. I want to try and do some IFO characterization today, so I'm going to run the reboot script again, but I'll get in touch with J Hanks to see if he has any insight (I don't think there are any logfiles on the FEs anyway that I'll wipe out by doing a reboot). I wonder if this problem is connected to DuoTone? But if so, why is c1lsc the only FE with this problem? c1sus also does not have the DuoTone system set up correctly...
The last time this happened, the problem apparently fixed itself, so I still don't have any insight as to what is causing it in the first place. Maybe I'll try disabling c1oaf, since that's the configuration we've been running in for a few weeks. |
14136 | Mon Aug 6 00:26:21 2018 | gautam | Update | CDS | More CDS woes
I spent most of today fighting various CDS errors.
- I rebooted c1lsc around 3pm; my goal was to try and do some vertex locking and figure out the implications of having only ~30% of the power we used to have at the AS port.
- Shortly afterwards (~4pm), c1lsc crashed.
- Using the reboot script, I was able to bring everything back up. But the DC lights on c1sus models were all red, and a 0x4000 error was being reported.
- This error is indicative of some timing issue, but all the usual tricks (reboot vertex FEs in various order, restart the mx_streams etc) didn't clear this error.
- I checked the Tempus GPS unit, that didn't report any obvious problems (i.e. front display was showing the correct UTC time).
- Finally, I decided to shut down all watchdogs, soft reboot all the FEs, soft reboot FB, power cycle all expansion chassis.
- This seems to have done the trick - I'm leaving c1oaf disabled for now.
- The remaining red indicators are due to c1dnn and c1oaf being disabled.
Let's see how stable this configuration is. Onto some locking now... |
Attachment 1: CDSoverview.png
14139 | Mon Aug 6 14:38:38 2018 | gautam | Update | CDS | More CDS woes
Stability was short-lived, it seems. When I came in this morning, all models on c1lsc were dead already, and now c1sus is also dead (Attachment #1). Moreover, the MC1 shadow sensors failed for a brief period again this afternoon (Attachment #2). I'm going to wait for some CDS experts to take a look at this, since any fix I effect seems to be short-lived. For the MC1 shadow sensors, I wonder if the Trillium box (and associated Sorensen) failure somehow damaged the MC1 shadow sensor/coil driver electronics.
Quote: |
Let's see how stable this configuration is. Onto some locking now...
|
Attachment 1: CDScrash.png
Attachment 2: MC1failures.png
14140 | Mon Aug 6 19:49:09 2018 | gautam | Update | CDS | More CDS woes
I've left the c1lsc frontend shut down for now, to see if c1sus and c1ioo can survive without any problems overnight. In parallel, we are going to try and debug the MC1 OSEM sensor problem: the idea will be to disable the bias voltage to the OSEM LEDs and see if the readback channels still go below zero, which would be a clear indication that the problem is in the readback transimpedance stage and not the LED. Per the schematic, this can be done by simply disconnecting the two D-sub connectors going to the vacuum flange (this is the configuration in which we usually use the sat box tester kit, for example). Attachment #1 shows the current setup at the PD readout board end. The dark DC count (i.e. with the OSEM LEDs off) is ~150 cts, while the nominal level is ~1000 cts, so perhaps this is already indicative of something being broken, but let's observe overnight. |
Attachment 1: IMG_7106.JPG
14142 | Tue Aug 7 11:30:46 2018 | gautam | Update | CDS | More CDS woes
Overnight, all models on c1sus and c1ioo seem to have had no stability issues, supporting the hypothesis that timing issues stem from c1lsc. Moreover, the MC1 shadow sensor readouts showed no negative values over a ~12hour period. I think we should just observe this for another day, in any case I don't think there is any urgent IFO related activity scheduled. |
14143 | Tue Aug 7 22:28:23 2018 | gautam | Update | CDS | More CDS woes
I am starting the c1x04 model (IOP) on c1lsc to see how it behaves overnight.
Well, there was apparently an immediate reaction: all the models on c1sus and c1ioo reported an ADC timeout and crashed. I'm going to reboot them and still keep the c1x04 IOP running, to see what happens.
[97544.431561] c1pem: ADC TIMEOUT 3 8703 63 8767
[97544.431574] c1mcs: ADC TIMEOUT 1 8703 63 8767
[97544.431576] c1sus: ADC TIMEOUT 1 8703 63 8767
[97544.454746] c1rfm: ADC TIMEOUT 0 9033 9 8841
Quote: |
Overnight, all models on c1sus and c1ioo seem to have had no stability issues, supporting the hypothesis that timing issues stem from c1lsc. Moreover, the MC1 shadow sensor readouts showed no negative values over a ~12hour period. I think we should just observe this for another day, in any case I don't think there is any urgent IFO related activity scheduled.
|
14146 | Wed Aug 8 23:03:42 2018 | gautam | Update | CDS | c1lsc model started
As part of this slow but systematic debugging, I am turning on the c1lsc model overnight to see if the model crashes return. |
14149 | Thu Aug 9 12:31:13 2018 | gautam | Update | CDS | CDS status update
The model seems to have run without issues overnight. Not completely related, but the MC1 shadow sensor signals also don't show any abnormal excursions to negative values in the last 48 hours. I'm thinking about re-connecting the satellite box (but preserving the breakout setup at 1X6 for a while longer) and re-locking the IMC. I'll also start c1ass on the c1lsc frontend. I would say that the other models on c1lsc (i.e. c1oaf, c1cal, c1daf) aren't really necessary for basic IFO operation.
Quote: |
As part of this slow but systematic debugging, I am turning on the c1lsc model overnight to see if the model crashes return.
|
14151 | Thu Aug 9 22:50:13 2018 | gautam | Update | CDS | AlignSoft script modified
After this work of increasing the series resistance on ETMX, there have been numerous occasions where insufficient misalignment of ETMX has caused problems in locking the vertex cavities. Today, I modified the script (located at /opt/rtcds/caltech/c1/medm/MISC/ifoalign/AlignSoft.py) to avoid such problems. The way the misalign script works is to write an offset value to the "TO_COIL" filter bank (accessed via the "Output Filters" button on the Suspension master MEDM screen; not the most intuitive place to put an offset, but okay). So I just increased the value of this offset from 250 counts to 2500 counts (for ETMX only). I checked that the script works: now when both ETMs are misaligned, the AS55Q signal shows a clean Michelson-like sine wave as it fringes, instead of having the arm cavity PDH fringes as well.
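Conceptually the change amounts to something like this (a sketch only, not the actual AlignSoft.py code; the TO_COIL offset channel name below is a guess):
caput C1:SUS-ETMX_TO_COIL_1_1_OFFSET 2500    # misalign: hypothetical channel name; the offset was 250 counts before this change
caput C1:SUS-ETMX_TO_COIL_1_1_OFFSET 0       # restore: remove the offset so the optic returns to its aligned position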
Note that the svn doesn't seem to work on the newly upgraded SL7 machines: svn status gives me the following output.
svn: E155036: Please see the 'svn upgrade' command
svn: E155036: Working copy '/cvs/cds/rtcds/userapps/trunk/cds/c1/medm/MISC/ifoalign' is too old (format 10, created by Subversion 1.6)
Is it safe to run 'svn upgrade'? Or is it time to migrate to git.ligo.org/40m/scripts? |
Attachment 1: MichelsonFringing.png
14166 | Wed Aug 15 21:27:47 2018 | gautam | Update | CDS | CDS status update
Starting c1cal now, let's see if the other c1lsc FE models are affected at all... Moreover, since MC1 seems to be well-behaved, I'm going to restore the nominal eurocrate configuration (sans extender board) tomorrow. |
14171 | Mon Aug 20 15:16:39 2018 | Jon | Update | CDS | Rebooted c1lsc, slow machines
When I came in this morning, no light was reaching the MC. One fast machine was dead, c1lsc, along with a number of the slow machines: c1susaux, c1iool0, c1auxex, c1auxey, c1iscaux. Gautam walked me through resetting the slow machines manually and the fast machines via the reboot script. The computers are all back online and the MC is again able to lock. |
14187 | Tue Aug 28 18:39:41 2018 | Jon | Update | CDS | C1LSC, C1AUX reboots
I found c1lsc unresponsive again today. Following the procedure in elog #13935, I ran the rebootC1LSC.sh script to perform a soft reboot of c1lsc and restart the epics processes on c1lsc, c1sus, and c1ioo. It worked. I also manually restarted one unresponsive slow machine, c1aux.
After the restarts, the CDS overview page shows the first three models on c1lsc are online (image attached). The above elog references c1oaf having to be restarted manually, so I attempted to do that. I connected via ssh to c1lsc and ran the script startc1oaf, but this failed as well.
In this state I was able to lock the MICH configuration, which is sufficient for my purposes for now, but I was not able to lock either of the arm cavities. Are some of the still-dead models necessary to lock in resonant configurations? |
Attachment 1: CDS_FE_STATUS.png
14192 | Tue Sep 4 10:14:11 2018 | gautam | Update | CDS | CDS status update
c1lsc crashed again. I've contacted Rolf/JHanks for help since I'm out of ideas on what can be done to fix this problem.
Quote: |
Starting c1cal now, let's see if the other c1lsc FE models are affected at all... Moreover, since MC1 seems to be well-behaved, I'm going to restore the nominal eurocrate configuration (sans extender board) tomorrow.
|
14193 | Wed Sep 5 10:59:23 2018 | wgautam | Update | CDS | CDS status update
Rolf came by today morning. For now, we've restarted the FE machine and the expansion chassis (note that the correct order in which to do this is: turn off computer--->turn off expansion chassis--->turn on expansion chassis--->turn on computer). The debugging measures Rolf suggested are (i) to replace the old generation ADC card in the expansion chassis which has a red indicator light always on and (ii) to replace the PCIe fiber (2010 make) running from the c1lsc front-end machine in 1X6 to the expansion chassis in 1Y3, as the manufacturer has suggested that pre-2012 versions of the fiber are prone to failure. We will do these opportunistically and see if there is any improvement in the situation.
Another tip from Rolf: if the c1lsc FE is responsive but the models have crashed, then doing sudo reboot by ssh-ing into c1lsc should suffice* (i.e. it shouldn't take down the models on the other vertex FEs, although if the FE is unresponsive and you hard-reboot it, this may still be a problem). I've modified the c1lsc reboot script accordingly.
* Seems like this can still lead to the other vertex FEs crashing, so I'm leaving the reboot script as is (so all vertex machines are softly rebooted when c1lsc models crash).
Quote: |
c1lsc crashed again. I've contacted Rolf/JHanks for help since I'm out of ideas on what can be done to fix this problem.
|
14194 | Thu Sep 6 14:21:26 2018 | gautam | Update | CDS | ADC replacement in c1lsc expansion chassis
Todd E. came by this morning and gave us (i) 1x new ADC card and (ii) 1x roll of 100m (2017 vintage) PCIe fiber. This afternoon, I replaced the old ADC card in the c1lsc expansion chassis and returned the old card to Todd. The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3), but hopefully the problem was the ADC card with the red indicator light, and replacing it has solved the issue. CDS is back to what is now the nominal state (Attachment #1) and the Y arm is locked for Jon to work on his IFO coupling study. We will monitor the stability in the coming days.
Quote: |
(i) to replace the old generation ADC card in the expansion chassis which has a red indicator light always on and (ii) to replace the PCIe fiber (2010 make) running from the c1lsc front-end machine in 1X6 to the expansion chassis in 1Y3, as the manufacturer has suggested that pre-2012 versions of the fiber are prone to failure. We will do these opportunistically and see if there is any improvement in the situation.
|
Attachment 1: CDSoverview.png
14195 | Fri Sep 7 12:35:14 2018 | gautam | Update | CDS | ADC replacement in c1lsc expansion chassis
Looks like the ADC was not to blame, same symptoms persist.
Quote: |
The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3), but hopefully the problem was the ADC card with red indicator light, and replacing it has solved the issue.
|
Attachment 1: Screenshot_from_2018-09-07_12-34-52.png
14196 | Mon Sep 10 12:44:48 2018 | Jon | Update | CDS | ADC replacement in c1lsc expansion chassis
Gautam and I restarted the models on c1lsc, c1ioo, and c1sus. The LSC system is functioning again. We found that restarting only c1lsc, as Rolf had recommended, did in fact kill the models running on the other two machines. We simply reverted the rebootC1LSC.sh script to its previous form, since that does work. I'll keep using that as required until the ongoing investigations find the source of the problem.
Quote: |
Looks like the ADC was not to blame, same symptoms persist.
Quote: |
The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3), but hopefully the problem was the ADC card with red indicator light, and replacing it has solved the issue.
|
|
14202 | Thu Sep 20 11:29:04 2018 | gautam | Update | CDS | New PCIe fiber housed
[steve, yuki, gautam]
The plastic tubing/housing for the fiber arrived a couple of days ago. We routed ~40m of fiber through roughly that length of tubing this morning, using some custom implements Steve sourced. To make sure we didn't damage the fiber during this process, I'm now testing the vertex models with the plastic tubing just routed casually (= illegally) along the floor from 1X4 to 1Y3 (NOTE THAT THE WIKI PAGE DIAGRAM IS OUT OF DATE AND NEEDS TO BE UPDATED), and have plugged the new fiber into the expansion chassis and the c1lsc front-end machine. But I'm seeing a DC error (0x4000), which is indicative of some sort of timing error (Attachment #1)**. Needs more investigation...
Pictures + more procedural details + proper routing of the protected fiber along cable trays after lunch. If this doesn't help the stability problem, we are out of ideas again, so fingers crossed...
** In the past, I have been able to fix the 0x4000 error by manually rebooting fb (simply restarting the daqd processes on fb using sudo systemctl restart daqd_* doesn't seem to fix the problem). Sure enough, that seems to have done the job this time as well (Attachment #2). So my initial impression is that the new fiber is functioning alright.
Quote: |
The PCIe fiber replacement is a more involved project (Steve is acquiring some protective tubing to route it from the FE in 1X6 to the expansion chassis in 1Y3)
|
Attachment 1: PCIeFiberSwap.png
Attachment 2: PCIeFiberSwap_FBrebooted.png
14203 | Thu Sep 20 16:19:04 2018 | gautam | Update | CDS | New PCIe fiber install postponed to tomorrow
[steve, gautam]
This didn't go as smoothly as planned. While there were no issues with the new fiber over the ~3 hours that I left it plugged in, I didn't realize the fiber has distinct ends for the "HOST" and "TARGET" (-5 points to me I guess). So while we had plugged in the ends correctly (by accident) for the pre-lunch test, we switched the ends while routing the fiber on the overhead cable tray (because the "HOST" end of the cable is close to the reel and we felt it would be easier to do the routing the other way).
Anyway, we will fix this tomorrow. For now, the old fiber was re-connected, and the models are running. IMC is locked.
Quote: |
Pictures + more procedural details + proper routing of the protected fiber along cable trays after lunch. If this doesn't help the stability problem, we are out of ideas again, so fingers crossed...
|
14206 | Fri Sep 21 16:46:38 2018 | gautam | Update | CDS | New PCIe fiber installed and routed
[steve, koji, gautam]
We took another pass at this today, and it seems to have worked - see Attachment #1. I'm leaving CDS in this configuration so that we can investigate stability. IMC could be locked. However, due to the vacuum slow machine having failed, we are going to leave the PSL shutter closed over the weekend. |
Attachment 1: PCIeFiber.png
Attachment 2: IMG_5878.JPG
14208 | Fri Sep 21 19:50:17 2018 | Koji | Update | CDS | Frequent time out
Multiple realtime processes on c1sus are suffering from frequent time outs. It eventually knocks out c1sus (process).
Obviously this has started since the fiber swap this afternoon.
gautam 10pm: there are no clues as to the origin of this problem in the c1sus frontend dmesg logs. The only clue (see Attachment #3) is that the "ADC" error bit in the CDS status word is red, but opening up the individual ADC error log MEDM screens shows no errors or overflows. Not sure what to make of this. The IOP model on this machine (c1x02) reports an error in the "Timing" bit of the CDS status word, but from the previous exchange with Rolf / J Hanks, this is down to a misuse of ADC0 Ch31, which is supposed to be reserved for a DuoTone diagnostic signal but which we use for some other signal (one of the MC suspension shadow sensors, iirc). The response is also not consistent with this CDS manual, which suggests that an "ADC" error should just kill the models. There are no obvious red indicator lights in the c1sus expansion chassis either. |
Attachment 1: 33.png
Attachment 2: 49.png
Attachment 3: Screenshot_from_2018-09-21_21-52-54.png
14210 | Sat Sep 22 00:21:07 2018 | Koji | Update | CDS | Frequent time out
[Gautam, Koji]
We had another crash of c1sus, and Gautam did a full power cycle of c1sus. It was a struggle to recover all the frontends, but this solved the timing issue.
We went through a full reset of c1sus and rebooted all the other RT hosts, as well as daqd and fb1. |
Attachment 1: 23.png
14253 | Sun Oct 14 16:55:15 2018 | not gautam | Update | CDS | pianosa upgrade
DASWG is not what we want to use for config; we should use the K. Thorne LLO instructions, like I did for ROSSA.
Quote: |
pianosa has been upgraded to SL7. I've made a controls user account, added it to sudoers, did the network config, and mounted /cvs/cds using /etc/fstab. Other capabilities are being slowly added, but it may be a while before this workstation has all the kinks ironed out. For now, I'm going to follow the instructions on this wiki to try and get the usual LSC stuff working.
|
14267 | Fri Nov 2 12:07:16 2018 | rana | Update | CDS | NDScope
https://alog.ligo-wa.caltech.edu/aLOG/index.php?callRep=44971
Let's install Jamie's new Data Viewer |
14293 | Tue Nov 13 21:53:19 2018 | gautam | Update | CDS | RFM errors
This problem resurfaced, which I noticed when I couldn't get the single arm locks going.
The fix was NOT restarting the c1rfm model, which just brought the misery of all vertex FEs crashing and the usual dance to get everything back.
Restarting the sender models (i.e. c1scx and c1scy) seems to have done the trick though. |
Attachment 1: RFMerrors.png
14344 | Tue Dec 11 14:33:29 2018 | gautam | Update | CDS | NDScope
NDscope is now running on pianosa. To be really useful, we need the templates, so I've made /users/Templates/NDScope_templates where these will be stored. Perhaps someone can write a parser to convert dataviewer .xml to something ndscope can understand. To get it installed, I had to run:
sudo yum install ndscope
sudo yum install python34-gpstime
sudo yum install python34-dateutil
sudo yum install python34-requests
I also changed the PYTHONPATH variable in .bashrc to include the python3.4 site-packages directory.
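That change was along these lines (the exact site-packages path is an assumption):
# in ~/.bashrc: let python pick up the python34 packages that ndscope depends on
export PYTHONPATH=${PYTHONPATH}:/usr/lib/python3.4/site-packages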
Quote: |
https://alog.ligo-wa.caltech.edu/aLOG/index.php?callRep=44971
Let's install Jamie's new Data Viewer
|
Attachment 1: ndscope.png
14356 | Thu Dec 13 22:56:28 2018 | gautam | Update | CDS | Frames
[koji, gautam]
We looked into the /frames situation a bit tonight. Here is a summary:
- We have already lost some second trend data from the period since the new FB has been running (~August 2017).
- The minute trend data is still safe from that period, we believe.
- The Jetstor has ~2TB of trend data in the /frames/trend folder.
- This is a combination of "second", "minute_raw" and "minute".
- It is not clear to us what the distinction is between "minute_raw" and "minute", except that the latter seems to go back farther in time than the former.
- Even so, the minute trend folder from October 2011 is empty, so how did we manage to get the long-term trend data? From the folder volumes, it appears that the oldest available trend data is from ~July 24 2015.
Plan of action:
- The wiper script needs to be tweaked a bit to allow more storage for the minute trends (which we presumably want to keep for long term).
- We need to clear up some space on FB1 to transfer the old trend data from Jetstor to FB1.
- We need to revive the data backup via LDAS. Also summary pages.
BTW, the last chiara (shared drive) backup was October 16, 6 am. dmesg showed a bunch of errors; Koji is now running fsck in a tmux session on chiara, so let's see if that repairs the errors. We missed the opportunity to swap in the 4TB backup disk, so we will do this at the next opportunity. |
14359 | Fri Dec 14 14:25:36 2018 | Koji | Update | CDS | chiara backup
fsck of the chiara backup disk (UUID="90a5c98a-22fb-4685-9c17-77ed07a5e000") was done, but it required many files to be fixed, so the backed-up files are not reliable now.
On top of that, the disk was no longer recognized by the machine.
I went to the disk and disconnected the USB and then the power supply, which was/is connected to the UPS.
Then they were reconnected. This made the disk come back as /media/90a5c98a-22fb-4685-9c17-77ed07a5e000. (*)
After unmounting this disk, I ran "sudo mount -a" so that it is mounted the way fstab specifies.
Now I am running the backup script manually so that we can pretend to maintain a snapshot of the day at least.
(*) This is the same situation we found during the recovery from the power shutdown. So my hypothesis is that on Oct 16 at 7 AM, during the backup, there was a USB failure or disk failure or something that unmounted the disk. This caused some files to get damaged. It also caused the disk to be mounted as /media/90a5c98a-22fb-4685-9c17-77ed07a5e000. So since then, we have not had a backup.
Update (20:00): The disk connection failed again. I think this disk is no longer reliable.
|
Attachment 1: fsck_log.log
|
sudo fsck -yV UUID="90a5c98a-22fb-4685-9c17-77ed07a5e000" [238/276]
[sudo] password for controls:
fsck from util-linux 2.20.1
[/sbin/fsck.ext4 (1) -- /media/40mBackup] fsck.ext4 -y /dev/sde1
e2fsck 1.42 (29-Nov-2011)
/dev/sde1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Error reading block 527433852 (Attempt to read block from filesystem resulted in
short read) while getting next inode from scan. Ignore error? yes
... 283 more lines ...
|
14374 | Thu Dec 20 17:17:41 2018 | gautam | Update | CDS | Logging of new Vacuum channels
Added the following channels to C0EDCU.ini:
[C1:Vac-P1b_pressure]
units=torr
[C1:Vac-PRP_pressure]
units=torr
[C1:Vac-PTP2_pressure]
units=torr
[C1:Vac-PTP3_pressure]
units=torr
[C1:Vac-TP2_rot]
units=kRPM
[C1:Vac-TP3_rot]
units=kRPM
Also modified the old P1 channel to
[C1:Vac-P1a_pressure]
units=torr
Unfortunately, we realized too late that we didn't have these channels in the frames, so the data from this test pumpdown is not logged, but future data will be. I'd say we should also log diagnostics from the pumps, such as temperature, current, etc. After making the changes, I restarted the daqd processes.
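Restarting the daqd processes on fb1 is along these lines (a sketch; the systemd unit glob is the one quoted earlier in this log):
# on fb1: restart the daqd family so the new EDCU channels start being written to frames
sudo systemctl restart daqd_*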
Things to add to ASA wiki page once the wiki comes back online:
- What is the safe way to clean the cryo pump if we want to use it again?
- What are safe conditions to turn the RGA on?
|