As part of preparing for the SURF projects this summer, I grabbed ~50 minutes of MCL and STS_1 data from early this morning to do a little MISO wiener filtering. It was pretty straightforward to use the misofw.m code to achieve an offline subtraction factor of ~10 from 1-3Hz. This isn't the best ever, but doesn't compare so unfavorably to older work, especially given that I did no prefiltering, and didn't use all that long of a data stretch.
Code and plot (but not data) is attached.
There was an earthquake: M4.0 - 40km SSW of South Dos Palos, California
No noticeable effects on the IFO. MC did not lose lock; however the arms did unlock.
Given some of the things we've facing lately, it occurs to me that we could be better served by having some sort of unified human-alerting scheme in place, for things like:
Currently, many of these things are just checked sporadically when it occurs to someone to do so, or when debugging random issues. Smoother IFO operation and peace of mind could be gained if we're confident that the relevant people are notified in a timely manner.
Thoughts? Suggestions on other things to monitor, like maybe frontend/model crashes?
Starting on the 14th (five days ago) the local chiara rsync backup of /cvs/cds to an external HDD has been failing:
2015-05-13 07:00:01,614 INFO Updating backup image of /cvs/cds
2015-05-13 07:49:46,266 INFO Backup rsync job ran successfully, transferred 6504 files.
2015-05-14 07:00:01,826 INFO Updating backup image of /cvs/cds
2015-05-14 07:50:18,709 ERROR Backup rysnc job failed with exit code 24!
2015-05-15 07:00:01,385 INFO Updating backup image of /cvs/cds
2015-05-15 08:09:18,527 ERROR Backup rysnc job failed with exit code 24!
Code 24 apparently means "Partial transfer due to vanished source files."
Manually running the backup command on chiara worked fine, returning a code of 0 (success), so we are backed up. For completeness, the command is controls@chiara: sudo rsync -av --delete --stats /home/cds/ /media/40mBackup
controls@chiara: sudo rsync -av --delete --stats /home/cds/ /media/40mBackup
Are the summary page jobs moving files around at this time of day? If so, one of the two should be rescheduled to not conflict.
There's a few hours so far after today's c1cal shut off that the summary page shows no dropouts. I'm not yet sure that this is related, but it seems like a clue.
The c1cal model was maxing out its CPU meter so I logged onto c1lsc and did 'rtcds c1cal stop'. Let's see if this changes any of our FB / DAQD problems.
Too weird. I undid me changes. We'll have to make the C1: stuff work inside each python script.
This makes things act weird:
controls@pianosa|MC 1> z avg 1 "C1:LSC-TRY_OUT"
IFO environment variable not specified.
Today at 5 PM we replaced the east N2 cylinder. The east pressure was 500 and the west cylinder pressure was 1000. Since Steve's elogs say that the consumption can be as high as 800 per day we wanted to be safe.
1) Checked the N2 pressures: the unregulated cylinder pressures are both around 1500 PSI. How long until they get to 1000?
controls@pianosa|MC 1> z avg 1 "C1:LSC-TRY_OUT"
IFO environment variable not specified.
4) Noticed that DAQD is restarting once per hour on the hour. Why?
It looks like daqd isn't being restarted, but in fact crashing every hour.
Going into the logs in target/fb/logs/old, it looks like at 10 seconds past the hour, every hour, daqd starts spitting out:
[Mon May 18 12:00:10 2015] main profiler warning: 1 empty blocks in the buffer
[Mon May 18 12:00:11 2015] main profiler warning: 0 empty blocks in the buffer
[Mon May 18 12:00:12 2015] main profiler warning: 0 empty blocks in the buffer
[Mon May 18 12:00:13 2015] main profiler warning: 0 empty blocks in the buffer
An ELOG search on this kind of phrase will get you a lot of talk about FB transfer problems.
I noticed the framebuilder had 100% usage on its internal, non-RAID, non /frames/, HDD, which hosts the root filesystem (OS files, home directory, diskless boot files, etc), largely due to a ~110GB directory of frames from our first RF lock that had been copied over to the home directory. The HDD only has 135GB capacity. I thought that maybe this was somehow a bottleneck for files moving around, but after deleting the huge directory, daqd still died at 4PM.
The offsite LDAS rsync happens at ten minutes past the hour, so is unlikely to be the culprit. I don't have any other clues at this point.
Measuring the voltage noise and frequency response of the Analog Delay-line Frequency Discriminator (DFD)
The schematic and an actual photo of the setup is shown below. The setup was checked to be physically sturdy with no loose connections or moving parts.
The voltage noise at the output of the DFD was measured using an SR785 signal analyzer while simultaneously monitoring the signal on an oscilloscope.
The noise at the output of the DFD was measured for no RF input and at several RF input frequencies including the zero crossing frequency and the optimum operating frequency of the DFD (20MHz).
The plot below show the voltage noise for different RF inputs to the DFD. It can be seen that the noise level is slightly lower at the zero crossing frequency where the amplitude noise is eliminated by the DFD.
I also did measurements to obtain the frequency response of the setup as the cable length difference has changed from the prior setup. The cable length difference is 21cm and the obtained linear signal at the output of the DFD extends over ~ 380MHz which is good enough for our purposes in FOL. A cosine fit to the data was done as before. //edit- Manasa: The gain of SR560 was set to 20 to obtain the data shown below//
Fit Coefficients (with 95% confidence bounds):
a = -0.8763 (-1.076, -0.6763)
b = 3.771 (3.441, 4.102)
Data and matlab scripts are zipped and attached.
Still seems to be running without causing FB issues.
I'm not so sure. I just was experiencing some severe network latency / EPICS channel freezes that was alleviated by killing the rsync job on nodus. It started a few minutes after ten past the hour, when the rysnc job started.
Unrelated to this, for some odd reason, there is some weirdness going on with ssh'ing to martian machines from the control room computers. I.e. on pianosa, ssh nodus fails with a failure to resolve hostaname message, but ssh nodus.martian succeeds.
Yes - my rampdown.py script correctly ramps down the watchdog thresholds. This replaces the old rampdown.pl Perl script that Rob and Dave Barker wrote.
Unfortunately, cron doesn't correctly inherit the bashrc environment variables so its having trouble running.
On a positive note, I've resurrected the MEDM Screenshot taking cron job, so now this webpage is alive (mostly) and you can check screens from remote:
Added this to the megatron crontab and commented out the op340m crontab line. IF this works for awhile we can retire our last Solaris machine.
For some reason, my email address is the one that megatron complains to when cron commands fail; since 11:15PM last night, I've been getting emails that the rampdown.py line is failing, with the super-helpful message: expr: syntax error
expr: syntax error
Looking at the summary page trends from today, you can see that the MC transmission is pretty flat after I zeroed the MCWFS offsets. In addition, the transmission from both arms is also flat, indicating that our previous observation of long term drift in the Y arm transmission probably had more to do with bad Y-arm initial alignment than unbalanced ETMY coil-magnets.
Much like checking the N2 pressure, amount of coffee beans, frames backups, etc. we should put MC WFS offset adjustment into our periodic checklist. Would be good to have a reminder system that pings us to check these items and wait for confirmation that we have done so.
Tried swapping cables at the Guralp interface box side. It seems that all of our seismic signal problems have to do with the GUR2 cable being flaky (not surprising since it looks like it was patched with Orange Electrical tape!! rather than proper mechanical strain relief).
After swapping the cables today, the GUR2 DAQ channels all look fine: i.e. GUR1 (the one at the Y end) is fine, as is its cable and the GUR2 analog channels inside the interface box.
OTOH, the GUR1 DAQ channels (which have GUR2 (EX) connected into it) are too small by a factor of ~1000. Seems like that end of the cable will need to be remade. Luckily Jenne is still around this week and can point us to the pinout / instructions. Looks like there could be some shorting inside the backshell, so I've left it disconnected rather than risk damaging the seismometer. We should get a GUR1 style backshell to remake this cable. It might also be possible that the end at the seismometer is bad - Steve was supposed to swap the screws on the granite-aluminum plate on Thursday; I'll double check.
2) The IMC has been flaky for a day or so; don't know why. I moved the gains in the autolocker so now the input gain slider to the MC board is 10 dB higher and the output slider is 10 dB lower. This is updated in the mcdown and mcup scripts and both committed to SVN. The trend shows that the MC was wandering away after ~15 minutes of lock, so I suspected the WFS offsets. I ran the offsets script (after flipping the z servo signs and adding 'C1:' prefix). So far powers are good and stable.
3) pianosa was unresponsive and I couldn't ssh to it. I powered it off and then it came back.
5) Many (but not all) EPICS readbacks are whiting out every several minutes. I remote booted c1susaux since it was one of the victims, but it didn't change any behavior.
6) The ETMX and ITMX have very different bounce mode response: should add to our Vent Todo List. Double checked that the bounce/roll bandstop is on and at the right frequency for the bounce mode. Increased the stopband from 40 to 50 dB to see if that helps.
7) op340 is still running ! The only reason to keep it alive is its crontab:
07 * * * * /opt/rtcds/caltech/c1/burt/autoburt/burt.cron >> /opt/rtcds/caltech/c1/burt/burtcron.log
#46 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/FSSSlowServo > /cvs/cds/caltech/logs/scripts/FSSslow.cronlog 2>&1
#14,44 * * * * /cvs/cds/caltech/conlog/bin/check_conlogger_and_restart_if_dead
15,45 * * * * /opt/rtcds/caltech/c1/scripts/SUS/rampdown.pl > /dev/null 2>&1
#10 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1
#27 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/RCthermalPID.pl >/cvs/cds/caltech/logs/scripts/RCthermalPID.cronlog 2>&1
00 0 * * * /var/scripts/ntp.sh > /dev/null 2>&1
#00 4 * * * /opt/rtcds/caltech/c1/scripts/RGA/RGAlogger.cron >> /cvs/cds/caltech/users/rward/RGA/RGAcron.out 2>&1
#00 6 * * * /cvs/cds/scripts/backupScripts.pl
00 7 * * * /opt/rtcds/caltech/c1/scripts/AutoUpdate/update_conlog.cron
00 8 * * * /opt/rtcds/caltech/c1/scripts/crontab/backupCrontab
added a new script (scripts/SUS/rampdown.py) which decrements every 30 minutes if needed. Added this to the megatron crontab and commented out the op340m crontab line. IF this works for awhile we can retire our last Solaris machine.
8) To see if we could get rid of the wandering PCDRIVE noise, I looked into the NPRO temperatures: was - T_crystal = 30.89 C, T_diode1 = 21 C, T_diode2 = 22 C. I moved up the crystal temp to 33.0 C, to see if it could make the noise more stable. Then I used the trimpots on the front of the controller to maximimize the laseroutput at these temperatures; it was basically maximized already. Lets see if there's any qualitative difference after a week. I'm attaching the pinout for the DSUB25 diagnostics connector on the back of the box. Aidan is going to help us record this stuff with AcroMag tech so that we can see if there's any correlation with PCDRIVE. The shifts in FSS_SLOW coincident with PCDRIVE noise corresponds to ~100 MHz, so it seems like it could be NPRO related.
The CDSUTILS package has a feature where it substitutes in a C1 or H1 or L1 prefix depending upon what site you are at. The idea is that this should make code portable between LLO and LHO.
Here at the 40m, we have no need to do that, so its better for us to be able to copy and paste channel names directly from MEDM or whatever without having to remove the "C1:" from all over the place.
the way to do this on the command line is (in bash) to type:
To make this easier on us, I have implemented this in our shared .bashrc so that its always the case. This might break some scripts which have been adapted to use the weird CDSUTILS convention, so beware and fix appropriately.
Vacuum Operation Guide is up loaded into the 40m-wiki. This is an old master copy. Not exact in terms of real action, but it is still a good guide of logic.
Rana has promissed to watch the N2 supply and change cylinder when it is empty. I will be Hanford next week.
Today Steve and I tried to recenter the Guralps. The breakout box technique didn't work for us, so we just turned the leveling screws until we got the mass position outputs within +/-50 mV for all DoF as read out by the breakout box.
The attachment shows the 6 channels after our work. You can see that GUR2_X/Y still look deadish. I tried wiggling the cables at the interface box and powering on/off, but no luck. Next, we swap cables.
Tried to bring the weather station back to life, but no luck. The unit on the wall is alive and so is the EPICS IOC (c1pem1). But there is apparently no communication between them. telnet into c1pem and the error message repeating at the prompt is:
Weather Monitor Output: NO COMM
Might be related to the flaky connector situation that Liz and I found there a couple summers ago, but I tried jiggling and reseating that one with no luck. Looks like it stopped working around 8 PM on March 24, 2014. That's the same time as a ~30s power outage, so perhaps we just need some more power cycling? Tried hitting the reset button on the VME card for c1pem1, but didn't change anything.
Let's try power cycling that crate (which has c1pem1, c0daqawg, and some GPS receiver)...nope - no luck.
Also tried power cycling the weather box which is near the BS chamber on the wall. This didn't change the error message at the c1pem1 telnet prompt.
COD_Sugar napolion is due to Steve: Item delivered, model CMG-SCU-0013, sn G9536
Reward being offered for the safe return of this thing:
Still seems to be running without causing FB issues. One thought is that we could look through the FB status channel trends and see if there is some excess of FB problems at 10 min after the hour to see if its causing problems.
I also looked into our minute trend situation. Looks like the files are comrpessed and have checksum enabled. The size changes sometimes, but its roughly 35 MB per hour. So 840 MB per day.
According to the wiper.pl script, its trying to keep the minute-trend directory to below some fixed fraction of the total /frames disk. The comment in the scripts says 0.005%,
but I'm dubious since that's only 13TB*5e-5 = 600 MB, and that would only keep us for a day. Maybe the comment should read 0.5% instead...
The rsync job to sync our frames over to the cluster has been on a 20 MB/s BW limit for awhile now.
Dan Kozak has now set up a cronjob to do this at 10 min after the hour, every hour. Let's see how this goes.
You can find the script and its logfile name by doing 'crontab -l' on nodus.
Baking both CC1 at 85 C for 60 hrs did not help.
The temperature is increased to 125 C and it is being repeated.
CC1s are not reading any longer. It is an attempt to clean them over the weekend at 85C
These brand new gauges "10421002" sn 11823-vertical, sn 11837 horizontal replaced 11 years old 421 http://nsei.missouri.edu/manuals/hps-mks/421%20Cold%20Cathode%20Ionization%20Guage.pdf on 09-06-2012 http://nodus.ligo.caltech.edu:8080/40m/7441
We have two cold cathode gauges at the pump spool and one signal cable to controller. CC1 in horizontal position and CC1 in vertical position.
CC1 h started not reading so I moved cable over to CC1 v
* Relocked IMC. I guess it was stuck somewhere in the autlocker loop. I disabled autolocker and locked it manually. Autolocker has been reenabled and seems to be running just fine.
* The X arm has been having trouble staying locked. There seemed to be some amount of gain peaking. I reduced the gain from 0.007 to 0.006.
* I disabled the triggered BounceRG filter : FM8 in the Xarm filter module. We already have a triggered Bounce filter: FM6 that takes care of the noise at bounce/roll frequencies. FM8 was just adding too much gain at 16.5Hz. Once this filter was disabled the X arm lock has been much more stable.
Also, the Y arm doesn't use FM8 for locking either.
was this change not elogged??
This is my sin.
Back in Febuary (around the 25th) I modified c1sus.mdl, removing the simulated plant connections we weren't using from c1lsp and c1sup. This was included in the model's svn log, but not elogged.
The models don't start with the rtcds restart shortcut, because I removed them from the c1lsc line in FB:/diskless/root/etc/rtsystab (or c1lsc:/etc/rtsystab). There is a commented out line in there that can be uncommented to restore them to the list of models c1lsc is allowed to run.
However, I wouldn't suspect that the models not running should affect the suspension drift, since the connections from them to c1sus have been removed. If we still have trends from early February, we could look and see if the drift was happening before I made this change.
I saw that entry, but it doesn't state what the calibration is in units of Hz/counts. It just gives the final calibrated spectrum.
Arm powers had drifted to ~ 0.5 in transmission.
X and Y arms were locked and ASS'd to bring the arm transmission powers to ~1.
I just found out that c1lsp and c1sup models no more exist on the FE status medm screens. I am assuming some changes were done to the models as well.
Earlier today, I was looking at some of the old medm screens running on Donatella that did not reflect this modification.
Did I miss any elogs about this or was this change not elogged??
I found the c1lsp and c1sup models not running anymore on c1lsc (white blocks for status lights on medm).
To fix this, I ssh'd into c1lsc. c1lsc status did not show c1lsp and c1sup models running on it.
I tried the usual rtcds restart <model name> for both and that returned error "Cannot start/stop model 'c1XXX' on host c1lsc".
I also tried rtcds restart all on c1lsc, but that has NOT brought back the models alive.
Does anyone know how I can fix this??
c1sup runs some the suspension controls. So I am afraid that the drift and frequent unlocking of the arms we see might be related to this.
P.S. We might also want to add the FE status channels to the summary pages.
The last MC_F calibration was done by Ayaka : Elog 7823
And does anyone know what the MC_F calibration is?
I have created a wiki page with introductory info about the summary page configuration: https://wiki-40m.ligo.caltech.edu/Daily summary help
We can also use that to collect tips for editing the configuration files, etc.
I have set up new summary pages for the 40m: http://www.ligo.caltech.edu/~misi/summary/
This website shows plots (time series, spectra, spectrograms, Rayleigh statistics) of relevant channels and is updated with new data every 30 min.
The content and structure of the pages is determined by configuration files stored in nodus:/users/public_html/gwsumm-ini/ . The code looks at all files in that directory matching c1*.ini. You can look at the c1hoft.ini file to see how this works. Besides, a quick guide to the format can be found here http://www.ligo.caltech.edu/~misi/iniguide.pdf
Please look at the pages and edit the config files to make them useful to you. The files are under version control, so don’t worry about breaking anything.
Do let me know if you have any questions (or leave a comment in the pages).
Like Steve pointed out, the summary pages show that the y-arm transmission drifts a lot when locked. The OL summary page shows that this is all due to ITMY yaw.
Could be either that they coil driver / DAC is bad or that the suspension is poorly built. We need to dig into ITMY OL trends over long term to see if this is new or now.
Also, weather station needs a reboot. And does anyone know what the MC_F calibration is?
Also, EQ gave us a better (and not pwd protected) URL for the summary pages. Please replace your previous links with this new one:
Why is the Y arm drifting so much?
The " PSL FSS Slow Actuator Adjust " was brought back to range from 1.5 to 0.3 yesterday as ususual. Nothing else was touched.
I'm not sure if the timing scale is working correctly on theses summery plots. What is the definition of today?
The y-arm became much better as I noticed it at 5pm
As it was requested by the Bos.
It would be nice to read from the epic screen C1:Vac-state_mon.......Current State: Vacuum Normal, valve configuration
Maglev is the main pump of our vacuum system below 500 mTorr
It's long term pressure has to be <500 mTorr
Each chamber has it's own annulos. These small volumes are indipendent from main volume. Their pressure ranges are <5 mTorr at vac. normal valve configuration.
CC1=cold cathode gauge (low emmision), Pressure range: 1e-4 to 1e-10 Torr,
In vac- normal configuration CC1= 2e-6 Torr
The N2 supply is regulated to 60-80 PSI out put at the auto cylinder changer.
Each cylinder pressure will be measured before the regulator and summed for warning message to be send
at 1000 PSI
ln -s /users/public_html/$MYPAGE /export/home/
Koji suggested that I make a cosine fit for the curve instead of a linear fit.
I fit the data to
where L - cable length asymmetry (27 cm) , fb - beat frequency and v - velocity of light in the cable (2*108 m/s)
The plot with the cosine fit is attached.
Fit coefficients (with 95% confidence bounds):
A = 0.4177 (0.3763, 0.4591)
B = 2.941 (2.89, 2.992)
I left the arms locked last night. Looks like the drift in the Y arm power is related to the Y arm control signal being much bigger than X.
Attached is the schematic of the analog DFD and the plot showing the zero-crossing for a delay line length of 27cm. The bandwidth for the linear output signal obtained roughly matches what is expected from the length difference (370MHz) .
We could use a smaller cable to further increase our bandwidth. I propose we use this analog DFD to determine the range at which the frequency counter needs to be set and then use the frequency counter readout as the error signal for FOL.
X arm was far out in yaw, so I reran the ASS for Y and then X. Ran OK; the offload from ASS outputs to SUS bias is still pretty violent - needs smoother ramping.
After this I recentered the ITMX OL- it was off by 50 microradians in pitch. Just like the BS/PRM OLs, this one has a few badly assembled & flimsly mounts. Steve, please prepare for replacing the ITMX OL mirror mounts with the proper base/post/Polaris combo. I think we need ~3 of them. Pit/yaw loop measurements attached.
Based on the PEM-SEIS summary page, it looked like GUR1 was oscillating (and thereby saturating and suppressing the Z channel). So I power cycled both Guralps by turning off the interface box for ~30 seconds and the powering back on. Still not fixed; looks like the oscillations at 110 and 520 Hz have moved but GUR2_X/Y are suppressed above 1 Hz, and GUR1_Z is suppressed below 1 Hz. We need Jenne or Zach to come and use the Gur Paddle on these things to make them OK.
From the SUS-WatchDog summary page, it looked like the PRM tripped during the little 3.8 EQ at 4AM, so I un-tripped it.
Caryn's temperature sensors look like they're still plugged in. Does anyone know where they're connected?
Looks like some of our seismometers are oscillating, not mounted well, or something like that. No reason for them to be so different.
Which Guralp is where? And where are our accelerometers mounted?
Same thing again today. So I renamed the /etc/init/elog.conf so that it doesn't keep respawning bootlessly. Until then restart elog using the start script in /cvs/cds/caltech/elog/ as usual.
I'll let EQ debug when he gets back - probably we need to pause the elog respawn so that it waits until nodus is up for a few minutes before starting.
Since the nodus upgrade, Eric/Diego changed the old csh restart procedures to be more UNIX standard. The instructions are in the wiki.
After doing some software updates on nodus today, apache and elogd didn't come back OK. Maybe because of some race condition, elog tried to start but didn't get apache. Apache couldn't start because it found that someone was already binding the ELOGD port. So I killed ELOGD several times (because it kept trying to respawn). Once it stopped trying to come back I could restart Apache using the Wiki instructions. But the instructions didn't work for ELOGD, so I had to restart that using the usual .csh script way that we used to use.
Still processing, but I think it should work fine once we have a day of data. Until then, here's the summary pages so far, including Vac channels:
Rana asked me to include add slow outputs (OUT16) of the seismometer BLRMS channels to the frames.
All of the PEM slow channels are already set up in c1/chans/daq/C1EDCU_PEM.ini, but up to this point, daqd had no knowledge of this file, since it wasn't included in c1/target/fb/master, which defines all the places to look for files describing channels to be written to disk. This file already includes lines for C1EDCU_LSC.ini and such, which from old elogs, looks like was set up by hand for subsystems we care about.
Hence, since we now care about slow trends for the PEM subsystem, I have added a line to the daqd master file to tell it to save the PEM slow channels. This looks to have increased the size of the individual 16 second frame files from 57MB to 59MB, which isn't so bad.
We have 2 transduser PX303-3KG5V http://www.omega.com/pressure/pdf/PX303.pdf They will be installed on the out put of the N2 cylinders to read the supply pressure.
I will order one DC power supply http://www.omega.com/pptst/PSU93_FPW15.html PSU-93
One full cylinder pressure is ~ 2400 PSI max so two of them will give us ~9Vdc
The email reminder should be send at 1000 PSI = 1.8 V
Installed libmotif3 and libmotif4 on nodus so that we can run dataviewer on there.
Also, the lscsoft stuff wasn't installed for apt-get, so I did so following the instructions on the DASWG website:
Then I installed libmetaio1, libfftw3-3. Now, rather than complain about missing librarries, diaggui just silently dies.
Then I noticed that the awggui error message tells us to use 'ssh -Y' instead of 'ssh -X'. Using that I could run DTT on nodus from my office.
We want to have a VAC page in the summaries, so Steve - please put a list of important channel names for the vacuum system into the elog so that we can start monitoring for trouble.
Also, anyone that has any ideas can feel free to just add a comment to the summary pages DisQus comment section with the 40m shared account or make your own account.