Apparently, some time ago Larry Wallace installed a new, fast ethernet switch in the old nodus rack. Q and I have just now moved nodus' GC ethernet cable over to the new switch. Dan Kozak is going to use this faster connection to make the data flow over to the cluster not so lag-y.
We ran a Cat 6+ Ethernet cable from the 1X7 rack (where the new nodus is located) to the fast GC switch in the control room rack; now I will learn how to set up the 'outside world' network, iptables, and the like.
As a reminder, the current hardware/software status is posted in elog 10697; if additions or corrections are needed, let me know.
After I check a couple of things, we can use the new nodus (which is currently known in the martian network as rosalba) as a local test to see that everything is working. After that (and, mostly, once I have the network working), we will sync the data from the old nodus to the new one and make the switch.
I finished polishing the scripts/FOL directory. This is the current status, and this post replaces my two previous posts on the subject:
With some advice from Jamie, I've gotten the lock loss plotting script that is used at LHO working on our machines. The other night, I modified the ALSwatch.py script to log lockloss times. Tying it together, I've written a small wrapper script that grabs the last time from the lockloss log, and plots it.
It is: scripts/LSC/LocklossData/lastlock.sh
Jamie's going to make an adjustment to the pydv codebase that will let me implement the auto y-scaling that we like. We also will need to get a feel for the right timing window, once we see what kind of delay in the ALSwatch script is typical.
Here's an example of the output, with the window of [-10,+2] seconds from the logged GPS time:
The other day, I hooked up the agilent analyzer to OUT2 of the MC board, which is currently set to output the MC refl error signal. I've written a GPIB-based program that continuously polls the analyzer, and plots the live spectrum, an exponentially weighted running mean, and the first measured spectrum.
The intended use case is to see if the FSS or MC loops are going crazy when we're locking. Sometimes the GPIB interface hangs/loses its connection, and the script needs a restart.
The script lives in scripts/MC/MCerrmon
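For reference, the exponentially weighted running mean mentioned above can be sketched in a few lines of Python. This is only a minimal illustration and not the code in scripts/MC/MCerrmon; the function name ewma_update and the smoothing factor alpha are assumptions made for the example.

```python
def ewma_update(mean, new, alpha=0.1):
    """Fold one new spectrum into an exponentially weighted running mean.

    mean  -- previous running mean (list of floats), or None before the first spectrum
    new   -- latest measured spectrum (list of floats, same length as mean)
    alpha -- weight given to the newest spectrum (hypothetical default)
    """
    if mean is None:
        # The first measured spectrum seeds the running mean.
        return list(new)
    return [alpha * n + (1.0 - alpha) * m for m, n in zip(mean, new)]


# Usage: feed spectra in one at a time as they are polled.
running = None
for spectrum in ([0.0, 0.0], [10.0, 20.0]):
    running = ewma_update(running, spectrum, alpha=0.5)
print(running)
```

A larger alpha makes the mean follow the live spectrum more closely; a smaller one averages over more polls.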
Update: work is almost completed; the old nodus is still online, as I don't feel confident to make the switch and leave it on its own for the weekend. However, the new nodus is online with the IP address 188.8.131.52, so everyone can check that everything works. From my tests I can say that:
Once everything is in place, I will save every reasonably important configuration file of nodus into the svn.
As a reminder, every change made while accessing the 184.108.40.206 machine will be purged during the sync and switch.
Nodus (solaris) is dead, long live Nodus (ubuntu).
Diego and I are smoothing out the kinks as they appear, but the ELOG is running smoothly on our new machine.
SVN is working, but your checkouts may complain because they expect https, and we haven't turned SSL on yet...
SSL, https and backups are now working too!
A backup of nodus's configuration (with some explanation) will be done soon.
Nodus should be visible again from outside the Caltech Network; I added some basic configuration for postfix and smartmontools; configuration files and instructions for everything are in the svn in the nodus_config folder
Since the Nodus switch, the offsite backup scripts (scripts/backup/rsync.backup) had not been running successfully. I tracked it down to the weird NFS file ownership issues we've been seeing since making Chiara the fileserver. Since the backup script uses rsync's "archive" mode, which preserves ownership, permissions, modification dates, etc, not seeing the proper ownership made everything wacky.
Despite 99% of the searches you do about this problem saying you just need to match your user's uid and gid on the NFS client and server, it turns out NFSv4 doesn't use this mechanism at all, opting instead for some ID mapping service (idmapd), which I have no inclination of figuring out at this time.
Thus, I've configured /etc/fstab on Nodus (and the control room machines) to use NFSv3 when mounting /cvs/cds. Now, all the file ownerships show up correctly, and the offsite backup of /cvs/cds is churning along happily.
I just stumbled upon this while poking around:
Since the great crash of June 2014, the scripts backup script has not been working on op340m. For some reason, it's only grabbing the PRFPMI folder, and nothing else.
Megatron seems to be able to run it. I've moved the job to megatron's crontab for now.
elog was not responding for unknown reasons, since the elogd process on nodus was alive; anyway, I restarted it.
I've set up nodus to start the ELOG on boot, through /etc/init/elog.conf. Also, thanks to this, we don't need to use the start-elog.csh script any more. We can now just do:
controls@nodus:~ $ sudo initctl restart elog
I also tweaked some of the ELOG settings, so that image thumbnails are produced at higher resolution and quality.
Given that op340m showed some undesired behavior, and that the FSS slow seems prone to railing lately, I've moved the FSS slow servo job over to megatron in the same way I did for the MC autolocker.
Namely, there is an upstart configuration (megatron:/etc/init/FSSslow.conf), that invokes the slow servo. Log file is in the same old place (/cvs/cds/caltech/logs/scripts), and the servo can be (re)started by running:
controls@megatron|~ > sudo initctl start FSSslow
Maybe this won't really change the behavior. We'll see
Today Q moved the FSS slow servo over to some init thing on megatron, and some time ago he did the same thing to the MC auto locker script. It isn't working though.
Even though megatron was rebooted, neither script started up automatically. As Diego mentioned in elog 10823, we ran sudo initctl start MCautolocker and sudo initctl start FSSslow, and the blinky lights for both of the scripts started. However, that seems to be the only thing that the scripts are doing. The MC auto locker is not detecting lock losses, and is not resetting things to allow the MC to relock. The MC is happy to lock if I do it by hand though. Similarly, the blinky light for the FSS is on, but the PSL temperature is moving a lot faster than normal. I expect that it will hit one of the rails in under an hour or so.
The MC autolocker and the FSS loop were both running earlier today, so maybe Q had some magic that he used when he started them up, that he didn't include in the elog instructions?
I ssh'd in, and was able to run each script manually successfully. I ran the initctl commands, and they started up fine too.
We've seen this kind of behavior before, generally after reboots; see ELOGS 10247 and 10572.
The plot shows the behaviour of the PSL-FSS_SLOWDC signal during the last week; the blue rectangle marks an approximate estimate of the time when the scripts were moved to megatron. Apart from the bad things that happened on Friday during the big crash, and the work ongoing since yesterday, it seems that something is not working well. The scripts on megatron are actually running, but I'll try and have a look at it.
I reset the threshold to +6666 counts (the aligned MC transmission is ~16000 for the TEM00 mode) so that it only turns on when we're in a good locked state.
I've installed the very fresh ELOG 3.0, for nothing else than the new built in text editor which has a LATEX capable equation editor built right in.
Check out this sweet limerick:
I've upgraded our cdsutils installation to v382; there have been some changes to pydv which will allow me to implement the auto y-scaling on our lockloss plots.
After some brief testing, things seem to still work...
To use instafoton, right click an MEDM screen, open the Execute menu, and choose "Foton". Then click on the EPICS channel of a filter module as displayed on the screen.
Here's how it was set up:
export MEDM_EXEC_LIST="Edit this screen;medm &A &:Probe;probe &P &:Foton (Pick filter PV);/opt/rtcds/caltech/c1/scripts/instafoton.py &P &"
After recompiling medm with a patch for dumping screens (attached), I added a time machine to the right-click Execute menu. It's installed under /cvs/cds/caltech/users/wipf/src/medm_time_machine. Dependencies include the python CA server module (pcaspy) and the latest nds2-client 0.11.2. These were also installed under my users directory, to avoid interfering with other tools.
--- /ligo/apps/epics-3.14.12_long/extensions/src/medm/medm/utils.c.orig 2015-01-13 18:56:44.867720104 -0800
+++ /ligo/apps/epics-3.14.12_long/extensions/src/medm/medm/utils.c 2015-01-13 22:49:56.636820963 -0800
@@ -4156,6 +4156,37 @@
timeOffset = time900101 - time700101;
+#if((2*MAX_TRACES)+2) > MAX_PENS
+#define MAX_COUNT 2*MAX_TRACES+2
+#define MAX_COUNT MAX_PENS
I have installed kerberos on Rossa, so that I don't have to type my name and password every time I do an svn checkin, since I'm making some modifications and want to be sure that everything is checked in before and afterwards.
I ran sudo apt-get install krb5-user. I didn't put in a default_realm when it prompted me to during install, so I went into the /etc/krb5.conf file and changed the default_realm line to read default_realm = LIGO.ORG.
Now we can use kinit, but we must (as usual) remember to kdestroy our credentials when we're done.
As a reminder, to use:
> kinit albert.einstein
Password for albert.einstein@LIGO.ORG: (type your pw here)
When you're finished, run
> kdestroy
WARNING: since the workstations are all shared user, if you forget to kdestroy the next user can commit under your user ID. It might be good to set the timeout to be something much shorter than 24 hours, like maybe 1, or 2.
Good call. I added a line ticket_lifetime = 3600, which should make it destroy the credentials after an hour.
I forgot to elog about these ones, my bad... The new/updated laptops are giada, viviana and paola; paola is already in the lab, while giada and viviana are in the control room waiting for a new home. The Pool of Names Wiki page has already been updated to reflect the changes.
I've fixed the gpib scripts for the SR785 and AG4395A to output data in the same format as expected by older scripts when called by them. In addition, there are now some easier modes of operation through the measurement scripts SRmeasure and AGmeasure. These are on the $PATH for the main control room machines, and live in scripts/general/netgpib
Case 1: I manually set up a measurement on the analyzer, and just want to download / plot the data.
Make sure you have a yellow prologix box plugged in, and can ping the address it is labeled with (e.g. 'vanna'). Then, in the directory where you want to save the data, run:
SRmeasure -i vanna -f mydata --getdata --plot
This saves mydata_(datetime).txt and mydata_(datetime).pdf in the current directory.
In all cases, AGmeasure has the identical syntax. If the GPIB address is something other than 10, specify it with -a, but this is rarely the case.
Case 2: I want to remotely specify a measurement
Rather than a series of command line arguments, which may get lost to the mists of time, I've set the scripts up to use parameter files that serve as arguments to the scripts.
Get the templates for spectrum and TF measurements in your current directory by running
Set the parameters with your text editor of choice, such as frequency span, filename output, whether to create a plot or not, then run the measurement:
Case 3: I want to compare my data with previous measurements
In the template parameter files, there is an option 'plotRefs', that will automatically plot the data from files whose filenames start with the same string as the current measurement.
If, in the "#" commented out header of the data file, there is a line that contains "memo:" or "timestamp:", it will include the text that follows in the plot legend.
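As an illustration of the legend rule just described, here is a minimal Python sketch of how such header lines could be picked out. It is not the code used by SRmeasure/AGmeasure; the function name legend_from_header and the choice to join the pieces with a comma are assumptions.

```python
def legend_from_header(lines):
    """Collect legend text from '#' header lines containing 'memo:' or 'timestamp:'."""
    parts = []
    for line in lines:
        if not line.startswith("#"):
            continue  # only the commented-out header is searched
        for key in ("memo:", "timestamp:"):
            if key in line:
                # Keep whatever follows the keyword, without surrounding whitespace.
                parts.append(line.split(key, 1)[1].strip())
    return ", ".join(parts)


# Usage with a made-up data file header:
example = ["# memo: MC refl", "# timestamp: 2015-01-20", "1 2 3"]
print(legend_from_header(example))
```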
There are also methods to remotely trigger an already configured measurement, or remotely reset an unresponsive instrument. Options can be perused by looking at the help in SRmeasure -h
I've tested, debugged, and used them for a bit, but wrinkles may remain. They've been svn40m committed, and I also set up a separate git repository for them at github.com/e-q/netgpibdata
Created a new medm screen C1ALS_FOL_PID.adl for FOL PID loop control in /medm/als/master/
This is not currently linked to the sitemap screen.
Over the past few days, I've occasionally been peeking at the framebuilder IO load to see if I could correlate anything with it, but it's usually been low when I looked. I.e. with daqd and all models running, the %wa time was a few percent at most.
Just now, I was seeing some EPICS sluggishness, and sure enough, the %wa was in the 50-60 range. I used iostat -xmh 5 on the framebuilder to see that /dev/sda, the /frames drive, was at 100% utilization, which means it was reading and writing as fast as it possibly could.
I ssh'd over to nodus, and with iotop found that an rsync job was running (rsync -am --exclude .*.gwf full 220.127.116.11::40m/full), and its IO rates corresponded very closely to the data read rates on the framebuilder from /frames.
I killed the rsync process on nodus, and the %wa time on the framebuilder dropped to near zero. The ASS striptools, where I had noticed the sluggishness, immediately started updating faster.
While rsync is supposed to play nice with a system's IO demands, maybe it only knows about nodus's IO usage, not that of fb, which is the underlying NFS server where the frames live. I think it would be good to throttle these jobs to a specific bandwidth. 50MB/s seemed like too much, so maybe 10MB/s is ok?
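One way to apply such a cap is rsync's --bwlimit option. The Python sketch below only builds the command line; it assumes --bwlimit is interpreted in units of 1024 bytes per second (worth checking against the man page of the installed rsync), and the helper name throttled_rsync_cmd and the destination HOST::40m/full are placeholders, not the real job.

```python
def throttled_rsync_cmd(src, dest, limit_mb_per_s, extra_args=("-am",)):
    """Build an rsync command list with a bandwidth cap.

    limit_mb_per_s is in MB/s (10**6 bytes); it is converted to the
    KiB/s figure that --bwlimit is assumed to expect.
    """
    if limit_mb_per_s <= 0:
        raise ValueError("bandwidth limit must be positive")
    kib_per_s = int(limit_mb_per_s * 1_000_000 / 1024)
    return ["rsync", *extra_args, f"--bwlimit={kib_per_s}", src, dest]


# Usage: a 10 MB/s cap on a transfer to a placeholder destination.
cmd = throttled_rsync_cmd("full", "HOST::40m/full", 10)
print(" ".join(cmd))
```

The list form can be handed to subprocess.run as-is, which avoids shell quoting problems with patterns such as the --exclude argument.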
** along the way, I noticed that the reason this notebook hasn't been working since last night is that someone sadly installed a new anaconda python distro today without telling anyone by ELOG. This new distro didn't have all the packages of the previous one. I've updated it with astropy and uncertainties packages.
My bad, sorry!
Yesterday, I was trying to install a package with anaconda's package manager, conda, but it was crashing in some weird way. I wasn't able to fix it, which led me to create a fresh installation.
The rsync job to sync our frames over to the cluster has been on a 20 MB/s BW limit for a while now.
Dan Kozak has now set up a cronjob to do this at 10 min after the hour, every hour. Let's see how this goes.
You can find the script and its logfile name by doing 'crontab -l' on nodus.
Back when Diego and I were getting all of the web services running up on the new nodus, we inexplicably were not able to get the hosting of the public_html directory and wikis to share the same port of 30889. In ELOG 10793, we stated that public_html was hosted on a new port, 30888, though we didn't really bring much attention to that new fact.
Unbeknownst to us at the time, this broke other links/bookmarks/sites that people had been using. Koji pointed this out to me the other day, but I have not made any sort of resolution. For now, the public_html directory, and the sites therein, have been taken offline.
In other nodus news, Jamie has set up Nodus' apache service with a certificate for SSL goodness. We want to extend this to the ELOG, which uses a built-in webserver, rather than apache.
He set up a proxy at the https address which will later host the secured elog: https://nodus.ligo.caltech.edu:8081/
When we make the switch to running the ELOG with HTTPS on by default, living on port 8081, we will set up apache to point 8080 at 8081, to preserve all of the old links.
I.e. this change should effectively be invisible to ELOG users if we implement it right.
I have created new slow channels for FOL. To do so, I have edited the fcreadout.db file in Domenica and the C0EDCU.ini file in /chans/daq
Domenica and frame builder were restarted after the edits.
Koji has moved the following files from /opt/rtcds/caltech/c1/chans/daq/ to /opt/rtcds/caltech/c1/chans/daq/trash as they are not being used anymore.
Since none of us here are experts in Perl, I have put together a python script for a simple PID controller. This can be imported into any main script that will run the actual PID loop. The script, PID.py, exists in /scripts/general/
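For readers unfamiliar with the idea, a simple discrete PID controller looks roughly like the sketch below. This is a generic illustration and not the contents of PID.py; the class name, gain names, and optional output limits are all assumptions.

```python
class PID:
    """Minimal discrete PID controller (illustrative, not the lab's PID.py)."""

    def __init__(self, kp, ki, kd, out_min=None, out_max=None):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        """Return the control output for the current error and time step dt (seconds)."""
        if dt <= 0:
            raise ValueError("dt must be positive")
        self.integral += error * dt
        # No derivative term on the first call, since there is no previous error yet.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp to the optional output limits.
        if self.out_max is not None:
            out = min(out, self.out_max)
        if self.out_min is not None:
            out = max(out, self.out_min)
        return out


# Usage: a pure integrator stepping twice with a constant error.
loop = PID(kp=0.0, ki=1.0, kd=0.0)
first = loop.update(1.0, 0.5)
second = loop.update(1.0, 0.5)
print(first, second)
```

A main script would call update() once per loop iteration with the measured error and write the result to the actuator channel.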
cdsutils has been updated to the newest version, 474; there are some matrix interface methods that will make our locking scripts easier to read, modify, and maintain.
I've tested the ALS and CARM down scripts, and the LSC offsets script, and they all work fine.
The SUS align/misalign scripts don't work after the new CDS utils upgrade.
I don't know if it's looking for the _SWSTAT channel to confirm that the offset has been turned on/off, or if it is trying to set that channel, to do the switching, but either way, the script is failing. Recall that our version of the RCG still has _SW1R and _SW2R, rather than the newer _SWSTAT for the filter banks.
ezca.ezca.EzcaConnectError: Could not connect to channel (timeout=2s): C1:SUS-PRM_OL_PIT_SWSTAT
Q, can you please (please, please, pretty please) undo this upgrade, and then hold off on any further changes to the system for a few weeks?
Q remotely reverted this change. Scripts seem to work again.
Q: please update this Wiki page with the go-back procedure:
Since the nodus upgrade, Eric/Diego changed the old csh restart procedures to be more UNIX standard. The instructions are in the wiki.
After doing some software updates on nodus today, apache and elogd didn't come back OK. Maybe because of some race condition, elog tried to start but didn't get apache. Apache couldn't start because it found that someone was already binding the ELOGD port. So I killed ELOGD several times (because it kept trying to respawn). Once it stopped trying to come back I could restart Apache using the Wiki instructions. But the instructions didn't work for ELOGD, so I had to restart that using the usual .csh script way that we used to use.
Installed libmotif3 and libmotif4 on nodus so that we can run dataviewer on there.
Also, the lscsoft stuff wasn't installed for apt-get, so I did so following the instructions on the DASWG website:
Then I installed libmetaio1, libfftw3-3. Now, rather than complain about missing libraries, diaggui just silently dies.
Then I noticed that the awggui error message tells us to use 'ssh -Y' instead of 'ssh -X'. Using that I could run DTT on nodus from my office.
Same thing again today. So I renamed the /etc/init/elog.conf so that it doesn't keep respawning bootlessly. Until then restart elog using the start script in /cvs/cds/caltech/elog/ as usual.
I'll let EQ debug when he gets back - probably we need to pause the elog respawn so that it waits until nodus is up for a few minutes before starting.
ln -s /users/public_html/$MYPAGE /export/home/
Also, EQ gave us a better (and not pwd protected) URL for the summary pages. Please replace your previous links with this new one:
Like Steve pointed out, the summary pages show that the y-arm transmission drifts a lot when locked. The OL summary page shows that this is all due to ITMY yaw.
Could be either that the coil driver / DAC is bad or that the suspension is poorly built. We need to dig into the long-term ITMY OL trends to see whether this is new or not.
Also, weather station needs a reboot. And does anyone know what the MC_F calibration is?
Still seems to be running without causing FB issues. One thought is that we could look through the FB status channel trends and see if there is some excess of FB problems at 10 min after the hour, to see if it's causing problems.
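A minimal sketch of that check, assuming the times of FB problems have already been extracted from the trends as Unix timestamps (the extraction itself is not shown, and the function name problems_by_minute is hypothetical):

```python
from collections import Counter


def problems_by_minute(unix_times):
    """Count events by minute-of-hour (0-59) from Unix timestamps in seconds."""
    return Counter(int(t // 60) % 60 for t in unix_times)


# Usage with made-up timestamps: three events at 10 past, one at half past.
counts = problems_by_minute([600, 4205, 7859, 1800])
print(counts.most_common(3))
```

An excess in the bins at and shortly after minute 10 would point at the hourly rsync job.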
I also looked into our minute trend situation. Looks like the files are compressed and have checksum enabled. The size changes sometimes, but it's roughly 35 MB per hour. So 840 MB per day.
According to the wiper.pl script, it's trying to keep the minute-trend directory below some fixed fraction of the total /frames disk. The comment in the script says 0.005%, but I'm dubious since that's only 13TB*5e-5 ≈ 650 MB, and that would keep us for less than a day. Maybe the comment should read 0.5% instead...
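The retention estimate can be checked with a few lines of Python, using only the numbers quoted above (35 MB per hour of minute trends, a 13 TB /frames disk) and decimal units; the helper name retention_days is made up for the example.

```python
MB = 1e6
TB = 1e12

trend_mb_per_day = 35 * 24   # 35 MB per hour of minute trends
disk_bytes = 13 * TB         # total size of /frames


def retention_days(fraction):
    """Days of minute trends that fit in the given fraction of the disk."""
    return disk_bytes * fraction / (trend_mb_per_day * MB)


print(trend_mb_per_day)                 # MB per day
print(round(retention_days(5e-5), 2))   # days at 0.005% of the disk
print(round(retention_days(5e-3), 1))   # days at 0.5% of the disk
```

At 0.005% the quota is about 650 MB, which holds less than one day of trends, while 0.5% is about 65 GB, or roughly 77 days.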
Still seems to be running without causing FB issues.
I'm not so sure. I was just experiencing some severe network latency / EPICS channel freezes that were alleviated by killing the rsync job on nodus. They started a few minutes after ten past the hour, when the rsync job started.
Unrelated to this, for some odd reason, there is some weirdness going on with ssh'ing to martian machines from the control room computers. I.e. on pianosa, ssh nodus fails with a failure to resolve hostname message, but ssh nodus.martian succeeds.
Starting on the 14th (five days ago) the local chiara rsync backup of /cvs/cds to an external HDD has been failing:
2015-05-13 07:00:01,614 INFO Updating backup image of /cvs/cds
2015-05-13 07:49:46,266 INFO Backup rsync job ran successfully, transferred 6504 files.
2015-05-14 07:00:01,826 INFO Updating backup image of /cvs/cds
2015-05-14 07:50:18,709 ERROR Backup rysnc job failed with exit code 24!
2015-05-15 07:00:01,385 INFO Updating backup image of /cvs/cds
2015-05-15 08:09:18,527 ERROR Backup rysnc job failed with exit code 24!
Code 24 apparently means "Partial transfer due to vanished source files."
Manually running the backup command on chiara worked fine, returning a code of 0 (success), so we are backed up. For completeness, the command is controls@chiara: sudo rsync -av --delete --stats /home/cds/ /media/40mBackup
Are the summary page jobs moving files around at this time of day? If so, one of the two should be rescheduled to not conflict.
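If rescheduling is not practical, another option is for the backup wrapper to log exit code 24 as a warning rather than a failure, since it only means some source files disappeared mid-transfer. The sketch below is one possible policy and not what the current backup script does; the function name classify_rsync_exit and the three labels are assumptions.

```python
RSYNC_OK = 0
RSYNC_VANISHED = 24  # "Partial transfer due to vanished source files"


def classify_rsync_exit(code):
    """Map an rsync exit code to a log severity (one possible policy)."""
    if code == RSYNC_OK:
        return "success"
    if code == RSYNC_VANISHED:
        # Files were removed between the file-list scan and the copy;
        # everything that still existed was transferred.
        return "warning"
    return "error"


# Usage: the exit code would normally come from subprocess.run(...).returncode.
for code in (0, 24, 23):
    print(code, classify_rsync_exit(code))
```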
Given some of the things we've been facing lately, it occurs to me that we could be better served by having some sort of unified human-alerting scheme in place, for things like:
Currently, many of these things are just checked sporadically when it occurs to someone to do so, or when debugging random issues. Smoother IFO operation and peace of mind could be gained if we're confident that the relevant people are notified in a timely manner.
Thoughts? Suggestions on other things to monitor, like maybe frontend/model crashes?