I ssh'd in, and was able to run each script manually successfully. I ran the initctl commands, and they started up fine too.
We've seen this kind of behavior before, generally after reboots; see ELOGS 10247 and 10572.
The plot shows the behaviour of the PSL-FSS_SLOWDC signal during the last week; the blue rectangle marks the approximate time when the scripts were moved to megatron. Apart from the bad things that happened on Friday during the big crash, and the work ongoing since yesterday, it seems that something is not working well. The scripts on megatron are actually running, but I'll try and have a look at it.
I reset the threshold to +6666 counts (the aligned MC transmission is ~16000 for the TEM00 mode) so that it only turns on when we're in a good locked state.
I've installed the very fresh ELOG 3.0, for nothing else than the new built in text editor which has a LATEX capable equation editor built right in.
Check out this sweet limerick:
I've upgraded our cdsutils installation to v382; there have been some changes to pydv which will allow me to implement the auto y-scaling on our lockloss plots.
After some brief testing, things seem to still work...
To use instafoton, right click an MEDM screen, open the Execute menu, and choose "Foton". Then click on the EPICS channel of a filter module as displayed on the screen.
Here's how it was set up:
export MEDM_EXEC_LIST="Edit this screen;medm &A &:Probe;probe &P &:Foton (Pick filter PV);/opt/rtcds/caltech/c1/scripts/instafoton.py &P &"
After recompiling medm with a patch for dumping screens (attached), I added a time machine to the right-click Execute menu. It's installed under /cvs/cds/caltech/users/wipf/src/medm_time_machine. Dependencies include the python CA server module (pcaspy) and the latest nds2-client 0.11.2. These were also installed under my users directory, to avoid interfering with other tools.
I have installed kerberos on Rossa, so that I don't have to type my name and password every time I do an svn checkin, since I'm making some modifications and want to be sure that everything is checked in before and afterwards.
I ran sudo apt-get install krb5-user. I didn't put in a default_realm when it prompted me to during install, so I went into the /etc/krb5.conf file and changed the default_realm line to read default_realm = LIGO.ORG.
Now we can use kinit, but we must (as usual) remember to kdestroy our credentials when we're done.
As a reminder, to use:
> kinit albert.einstein
Password for albert.einstein@LIGO.ORG: (type your pw here)
When you're finished, run
> kdestroy
WARNING: since the workstations are all shared user, if you forget to kdestroy the next user can commit under your user ID. It might be good to set the timeout to be something much shorter than 24 hours, like maybe 1, or 2.
Good call. I added a line ticket_lifetime = 3600, which should make it destroy the credentials after an hour.
I forgot to elog about these ones, my bad... The new/updated laptops are giada, viviana and paola; paola is already in the lab, while giada and viviana are in the control room waiting for a new home. The Pool of Names Wiki page has already been updated to reflect the changes.
I've fixed the gpib scripts for the SR785 and AG4395A to output data in the format expected by older scripts when called by them. In addition, there are now some easier modes of operation through the measurement scripts SRmeasure and AGmeasure. These are on the $PATH for the main control room machines, and live in scripts/general/netgpib.
Case 1: I manually set up a measurement on the analyzer, and just want to download / plot the data.
Make sure you have a yellow prologix box plugged in, and can ping the address it is labeled with (e.g. 'vanna'). Then, in the directory where you want to save the data, run:
SRmeasure -i vanna -f mydata --getdata --plot
This saves mydata_(datetime).txt and mydata_(datetime).pdf in the current directory.
In all cases, AGmeasure has identical syntax. If the GPIB address is something other than 10, specify it with -a, but this is rarely the case.
Case 2: I want to remotely specify a measurement
Rather than a series of command line arguments, which may get lost to the mists of time, I've set the scripts up to use parameter files that serve as arguments to the scripts.
Get the templates for spectrum and TF measurements in your current directory by running
Set the parameters with your text editor of choice, such as frequency span, filename output, whether to create a plot or not, then run the measurement:
Case 3: I want to compare my data with previous measurements
In the template parameter files, there is an option 'plotRefs', that will automatically plot the data from files whose filenames start with the same string as the current measurement.
If, in the "#" commented out header of the data file, there is a line that contains "memo:" or "timestamp:", it will include the text that follows in the plot legend.
There are also methods to remotely trigger an already configured measurement, or remotely reset an unresponsive instrument. Options can be perused by looking at the help in SRmeasure -h
I've tested, debugged, and used them for a bit, but wrinkles may remain. They've been svn40m committed, and I also set up a separate git repository for them at github.com/e-q/netgpibdata
Created a new medm screen C1ALS_FOL_PID.adl for FOL PID loop control in /medm/als/master/
This is not currently linked to the sitemap screen.
Over the past few days, I've occasionally been peeking at the framebuilder IO load to see if I could correlate anything with it, but it's usually been low when I looked. I.e. with daqd and all models running, the %wa time was in the few percents at most.
Just now, I was seeing some EPICS sluggishness, and sure enough, the %wa was in the 50-60 range. I used iostat -xmh 5 on the framebuilder to see that /dev/sda, the /frames drive, was at 100% utilization, which means it was reading and writing as fast as it possibly could.
iostat -xmh 5
I ssh'd over to nodus, and with iotop found that an rsync job was running (rsync -am --exclude .*.gwf full 126.96.36.199::40m/full), and its IO rates corresponded very closely to the data read rates on the framebuilder from /frames.
rsync -am --exclude .*.gwf full 188.8.131.52::40m/full
I killed the rsync process on nodus, and the %wa time on the framebuilder dropped to near zero. The ASS striptools, where I had noticed the sluggishness, immediately started updating faster.
While rsync is supposed to play nice with a system's IO demands, maybe it only knows about nodus's IO usage, not fb which is the underlying NFS server where the frames live. I think it would be good to throttle the bandwidth of these jobs to a specific bandwidth. 50MB/s seemed like too much, so maybe 10MB/s is ok?
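As a rough sketch of the throttling idea, the snippet below builds an rsync argument list with a bandwidth cap. rsync's --bwlimit option takes its rate in units of 1024 bytes per second when no suffix is given, so the MB/s figure is converted first. The function name and the destination string are hypothetical placeholders, not the actual job on nodus.

```python
import shlex


def throttled_rsync_cmd(src, dest, limit_mb_per_s, extra_args=("-am",)):
    """Build an rsync argument list with a bandwidth cap.

    rsync's --bwlimit is given in units of 1024 bytes per second by
    default, so convert from MB/s (10**6 bytes per second).
    """
    kib_per_s = int(limit_mb_per_s * 1_000_000 / 1024)
    return ["rsync", *extra_args, f"--bwlimit={kib_per_s}", src, dest]


# Hypothetical destination; substitute the real rsync module path.
cmd = throttled_rsync_cmd("full", "remote.example::40m/full", 10)
print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run, which avoids any shell quoting issues with exclude patterns.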
** along the way, I noticed that the reason this notebook hasn't been working since last night is that someone sadly installed a new anaconda python distro today without telling anyone by ELOG. This new distro didn't have all the packages of the previous one. I've updated it with astropy and uncertainties packages.
My bad, sorry!
Yesterday, I was trying to install a package with anaconda's package manager, conda, but it was crashing in some weird way. I wasn't able to fix it, which led me to create a fresh installation.
The rsync job to sync our frames over to the cluster has been on a 20 MB/s BW limit for a while now.
Dan Kozak has now set up a cronjob to do this at 10 min after the hour, every hour. Let's see how this goes.
You can find the script and its logfile name by doing 'crontab -l' on nodus.
Back when Diego and I were getting all of the web services running up on the new nodus, we inexplicably were not able to get the hosting of the public_html directory and wikis to share the same port of 30889. In ELOG 10793, we stated that public_html was hosted on a new port, 30888, though we didn't really bring much attention to that new fact.
Unbeknownst to us at the time, this broke other links/bookmarks/sites that people had been using. Koji pointed this out to me the other day, but I have not made any sort of resolution. For now, the public_html directory, and the sites therein, have been taken offline.
In other nodus news, Jamie has set Nodus' apache service with a certificate for SSL goodness. We want to extend this to the ELOG, which uses a built in webserver, rather than apache.
He set up a proxy at the https address which will later host the secured elog: https://nodus.ligo.caltech.edu:8081/
When we make the switch to running the ELOG with HTTPS on by default, living on port 8081, we will set up apache to point 8080 at 8081, to preserve all of the old links.
I.e. this change should effectively be invisible to ELOG users if we implement it right.
I have created new slow channels for FOL. To do so, I have edited the fcreadout.db file in Domenica and the C0EDCU.ini file in /chans/daq
Domenica and frame builder were restarted after the edits.
Koji has moved the following files from /opt/rtcds/caltech/c1/chans/daq/ to /opt/rtcds/caltech/c1/chans/daq/trash as they are not being used anymore.
Since none of us here are experts in Perl, I have put together a python script for a simple PID controller. This can be imported into any main script that will run the actual PID loop. The script, PID.py, exists in /scripts/general/
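For readers who have not seen one, a minimal discrete PID controller might look like the sketch below. This is not the contents of PID.py; the class name, argument names, and the output clamping are assumptions made for illustration only.

```python
class PID:
    """Illustrative discrete PID controller with optional output clamping."""

    def __init__(self, kp, ki, kd, setpoint=0.0, out_limits=(None, None)):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.lo, self.hi = out_limits
        self._integral = 0.0
        self._prev_error = None

    def _clamp(self, value):
        if self.lo is not None and value < self.lo:
            return self.lo
        if self.hi is not None and value > self.hi:
            return self.hi
        return value

    def update(self, measurement, dt):
        """Return the control output for one time step of length dt seconds."""
        error = self.setpoint - measurement
        self._integral += error * dt
        if self._prev_error is None:
            derivative = 0.0  # no history yet on the first call
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        out = self.kp * error + self.ki * self._integral + self.kd * derivative
        return self._clamp(out)


# Toy usage: drive a simple integrating "plant" towards the setpoint.
pid = PID(kp=0.5, ki=0.1, kd=0.0, setpoint=1.0, out_limits=(-1.0, 1.0))
value = 0.0
for _ in range(50):
    value += pid.update(value, dt=1.0)
print(f"value after 50 steps: {value:.3f}")
```

A main script would replace the toy plant with a channel read and a channel write on each iteration.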
cdsutils has been updated to the newest version, 474; there are some matrix interface methods that will make our locking scripts easier to read, modify, and maintain.
I've tested the ALS and CARM down scripts, and the LSC offsets script, and they all work fine.
The SUS align/misalign scripts don't work after the new CDS utils upgrade.
I don't know if it's looking for the _SWSTAT channel to confirm that the offset has been turned on/off, or if it is trying to set that channel, to do the switching, but either way, the script is failing. Recall that our version of the RCG still has _SW1R and _SW2R, rather than the newer _SWSTAT for the filter banks.
ezca.ezca.EzcaConnectError: Could not connect to channel (timeout=2s): C1:SUS-PRM_OL_PIT_SWSTAT
Q, can you please (please, please, pretty please) undo this upgrade, and then hold off on any further changes to the system for a few weeks?
Q remotely reverted this change. Scripts seem to work again.
Q: please update this Wiki page with the go-back procedure:
Since the nodus upgrade, Eric/Diego changed the old csh restart procedures to be more UNIX standard. The instructions are in the wiki.
After doing some software updates on nodus today, apache and elogd didn't come back OK. Maybe because of some race condition, elog tried to start but didn't get apache. Apache couldn't start because it found that someone was already binding the ELOGD port. So I killed ELOGD several times (because it kept trying to respawn). Once it stopped trying to come back I could restart Apache using the Wiki instructions. But the instructions didn't work for ELOGD, so I had to restart that using the usual .csh script way that we used to use.
Installed libmotif3 and libmotif4 on nodus so that we can run dataviewer on there.
Also, the lscsoft stuff wasn't installed for apt-get, so I did so following the instructions on the DASWG website:
Then I installed libmetaio1 and libfftw3-3. Now, rather than complain about missing libraries, diaggui just silently dies.
Then I noticed that the awggui error message tells us to use 'ssh -Y' instead of 'ssh -X'. Using that I could run DTT on nodus from my office.
Same thing again today. So I renamed the /etc/init/elog.conf so that it doesn't keep respawning bootlessly. Until then restart elog using the start script in /cvs/cds/caltech/elog/ as usual.
I'll let EQ debug when he gets back - probably we need to pause the elog respawn so that it waits until nodus is up for a few minutes before starting.
ln -s /users/public_html/$MYPAGE /export/home/
Also, EQ gave us a better (and not pwd protected) URL for the summary pages. Please replace your previous links with this new one:
Like Steve pointed out, the summary pages show that the y-arm transmission drifts a lot when locked. The OL summary page shows that this is all due to ITMY yaw.
Could be either that the coil driver / DAC is bad or that the suspension is poorly built. We need to dig into ITMY OL trends over the long term to see if this is new or not.
Also, weather station needs a reboot. And does anyone know what the MC_F calibration is?
Still seems to be running without causing FB issues. One thought is that we could look through the FB status channel trends and see if there is some excess of FB problems at 10 min after the hour to see if it's causing problems.
I also looked into our minute trend situation. Looks like the files are compressed and have checksum enabled. The size changes sometimes, but it's roughly 35 MB per hour. So 840 MB per day.
According to the wiper.pl script, it's trying to keep the minute-trend directory below some fixed fraction of the total /frames disk. The comment in the script says 0.005%, but I'm dubious since that's only 13TB*5e-5 = 650 MB, and that would keep us for less than a day. Maybe the comment should read 0.5% instead...
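The retention arithmetic can be checked with a few lines, taking the 13 TB disk size and the 35 MB/hour minute-trend rate quoted above at face value. The two fractions compared are the 0.005% from the script comment and the 0.5% guess; the variable and function names are illustrative.

```python
TB = 1e12  # bytes
MB = 1e6   # bytes

disk_bytes = 13 * TB          # total /frames disk size quoted in the text
rate_mb_per_day = 35 * 24     # ~35 MB per hour of minute trends


def retention_days(fraction):
    """Days of minute trends that fit in the given fraction of the disk."""
    budget_mb = disk_bytes * fraction / MB
    return budget_mb / rate_mb_per_day


for label, fraction in [("0.005%", 5e-5), ("0.5%", 5e-3)]:
    budget_mb = disk_bytes * fraction / MB
    print(f"{label}: budget {budget_mb:.0f} MB, "
          f"retention {retention_days(fraction):.1f} days")
```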
Still seems to be running without causing FB issues.
I'm not so sure. I was just experiencing some severe network latency / EPICS channel freezes that were alleviated by killing the rsync job on nodus. It started a few minutes after ten past the hour, when the rsync job started.
Unrelated to this, for some odd reason, there is some weirdness going on with ssh'ing to martian machines from the control room computers. I.e. on pianosa, ssh nodus fails with a failure to resolve hostname message, but ssh nodus.martian succeeds.
Starting on the 14th (five days ago) the local chiara rsync backup of /cvs/cds to an external HDD has been failing:
2015-05-13 07:00:01,614 INFO Updating backup image of /cvs/cds
2015-05-13 07:49:46,266 INFO Backup rsync job ran successfully, transferred 6504 files.
2015-05-14 07:00:01,826 INFO Updating backup image of /cvs/cds
2015-05-14 07:50:18,709 ERROR Backup rysnc job failed with exit code 24!
2015-05-15 07:00:01,385 INFO Updating backup image of /cvs/cds
2015-05-15 08:09:18,527 ERROR Backup rysnc job failed with exit code 24!
Code 24 apparently means "Partial transfer due to vanished source files."
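One possible way for a backup wrapper to treat this case is to separate "vanished source files" from genuine failures when interpreting the rsync exit status. The sketch below is an assumption about how that could be done, not a description of the chiara backup script; the function and constant names are made up for illustration.

```python
RSYNC_OK = 0
RSYNC_VANISHED = 24  # "Partial transfer due to vanished source files"


def classify_rsync_exit(code):
    """Map an rsync exit code to a log severity for a backup report."""
    if code == RSYNC_OK:
        return "success"
    if code == RSYNC_VANISHED:
        # Files changed underneath a long-running job; the rest was copied.
        return "warning"
    return "error"


for code in (0, 24, 23):
    print(code, classify_rsync_exit(code))
```

Whether a vanished-file warning is acceptable depends on which files disappeared, so the log should still be inspected when it shows up repeatedly.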
Manually running the backup command on chiara worked fine, returning a code of 0 (success), so we are backed up. For completeness, the command is controls@chiara: sudo rsync -av --delete --stats /home/cds/ /media/40mBackup
controls@chiara: sudo rsync -av --delete --stats /home/cds/ /media/40mBackup
Are the summary page jobs moving files around at this time of day? If so, one of the two should be rescheduled to not conflict.
Given some of the things we've been facing lately, it occurs to me that we could be better served by having some sort of unified human-alerting scheme in place, for things like:
Currently, many of these things are just checked sporadically when it occurs to someone to do so, or when debugging random issues. Smoother IFO operation and peace of mind could be gained if we're confident that the relevant people are notified in a timely manner.
Thoughts? Suggestions on other things to monitor, like maybe frontend/model crashes?
I've started working on a general routine to measure noise couplings in our interferometers. Often this is done with swept sine measurements, but this misses the nonlinear part of the coupling, especially if the linear part is already reduced through some compensation or feedforward scheme. Rana suggested using a series of narrow band-limited noise injections.
The structure I'm working on is a python script that uses the AWG interface written by Chris W. to create the excitations. Afterwards, I calculate a series of PSD estimates from the data (i.e. a spectrogram), and apply a two-sample, unequal variance t-test to look for statistically significant increases in the noise spectra, to try and evaluate the nonlinear contributions to the noise. I've started a git repository at github.com/e-q/ifoCoupling with the code.
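To make the statistical step concrete, here is a minimal standard-library sketch of a two-sample, unequal-variance (Welch) t statistic applied bin by bin to two sets of PSD estimates. It is not the code in the repository: the function names, the fixed t threshold, and the toy data are assumptions. A full test would convert t and the degrees of freedom to a p-value, for example with scipy.stats.ttest_ind(..., equal_var=False).

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, dof) for a two-sample, unequal-variance t-test.

    Assumes each sample has at least two points and nonzero variance.
    """
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof


def excess_noise_bins(excited, quiet, t_threshold=3.0):
    """Flag frequency bins where the excited PSDs exceed the quiet ones.

    excited, quiet: lists of PSD estimates, each a list over frequency bins.
    """
    flagged = []
    for k in range(len(excited[0])):
        t, _ = welch_t([seg[k] for seg in excited], [seg[k] for seg in quiet])
        if t > t_threshold:
            flagged.append(k)
    return flagged


# Toy data: four PSD estimates, two bins; only bin 1 is noisier when excited.
quiet = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.0]]
excited = [[1.0, 5.0], [1.1, 5.2], [0.9, 4.8], [1.0, 5.1]]
print(excess_noise_bins(excited, quiet))
```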
So far, I've tested one such injection of noise coupling from the ETMX oplev error point to the single arm length error signal. It's completely missing the user interface and structure to do a general series of measurements, but this is just organizational; I'm trying to get the math/science down first.
Here's a result from today:
Median, instead of the usual mean, PSDs are used throughout, to reject outliers/glitches.
The linear part of the coupling can be estimated using the coherence / spectrum height in the excitation band, but I'm not sure what the best way to present/parameterize the nonlinear parts of each individual excitation band's result is.
Also, I anticipate being able to write an excitation auto-leveling routine, gradually increasing the excitation level until the excited spectrum is some amount noisier than the baseline spectrum, up to some maximum amount configurable by the user.
The excitation shaping could probably be improved, too. It's currently an elliptic + butterworth bandpass for a sharp edge and rolloff.
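One way such an auto-leveling loop could be structured is sketched below. The measurement itself is abstracted into a hypothetical callable, measure_ratio, that returns the excited-to-baseline spectrum ratio for a given amplitude; the step factor and target ratio are arbitrary illustrative defaults.

```python
def auto_level(measure_ratio, start_amp, max_amp, target_ratio=2.0, step=1.5):
    """Raise the excitation until the noise ratio reaches the target.

    measure_ratio: hypothetical callable, amplitude -> excited/baseline ratio.
    Returns (amplitude, ratio); the amplitude never exceeds max_amp.
    """
    amp = min(start_amp, max_amp)
    while True:
        ratio = measure_ratio(amp)
        if ratio >= target_ratio or amp >= max_amp:
            return amp, ratio
        amp = min(amp * step, max_amp)


# Toy coupling model: excess noise grows quadratically with amplitude.
amp, ratio = auto_level(lambda a: 1.0 + 0.1 * a * a,
                        start_amp=1.0, max_amp=100.0, step=2.0)
print(amp, ratio)
```

Capping at max_amp means the routine can return without reaching the target, so the caller should check the returned ratio.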
I'm open to any thoughts and/or suggestions anyone may have!
Looks like a very handy code, especially with the real statistical tests.
I would make sure to use much smaller excitation amplitudes. Since the coupling is nonlinear, we expect that it's only a good noise budget estimator when the excitation amplitude is less than a factor of 3 above the quiescent excitation.
The local chiara backups are still failing due to vanished source files. I've emailed Max about the summary page jobs, since I think they're running remotely.
I've changed the chiara local backup script to read a folder exclusion file, and excluded /users/public_html/detcharsummary, and things are working again.
This was necessary because the summary pages are being updated every half hour, which is faster than the time it takes for the backup script to run, so the file index that it builds at the start becomes invalid later on in the process.
Thinking about chiara's disk, it strikes me that when we went from the linux1 RAID to a single HDD on chiara, we may have tightened a bottleneck on our NFS latency, i.e. we are limited to that single hard drive's IO rates. This of course isn't the culprit for the more recent dramatic slowdowns, but in addition to fixing whatever has happened more recently, we may want to consider some kind of setup with higher IO capability for the NFS filesystem.
In fact, the file access is supposed to be WAY faster now than in the RAID case.
As noted in ELOG 9511, it was SCSI-2 (or 3?) that had ~6MB/s throughput. Previously the backup took ~2 hours.
This was improved to 30min by the SATA HDD on linux1.
I am looking at /opt/rtcds/caltech/c1/scripts/backup/rsync.backup.cumlog
In fact, this "30-min backup" was true until the end of March. After that the backup is taking 1h~1.5h.
This could be related to the recent NFS issue?
I have put the Wiener filter scripts into /opt/rtcds/caltech/c1/scripts/Wiener/ . They are under version control.
The idea is that you should copy ParameterFile_Example.m into your own directory, and modify parameters at the top of the file, and then when you run that script, it will output fitted filters ready to go into Foton. (Obviously you must check before actually implementing them that you're happy with the efficacy and fits of the filters).
Things to be edited in the ParameterFile include:
I think that's everything that is required.
Since Chiara's onboard ethernet card has a reputation to be flaky in Linux, Koji suggested we could just buy a new ethernet card and throw it in there, since they're cheap.
I've installed an Intel EXPI9301CT ethernet card in Chiara, which detected it without problems. I changed over the network settings in /etc/network/interfaces to use eth1 instead of eth0, restarted nfs and bind9, and everything looked fine.
Sadly, EPICS/network slowdowns are still happening. :(
I've tweaked the ELOG code to allow uploading of PDFs by drag-and-drop into the main editor window. Once again we can bask in the glory of
There seems to be something funny going on with MATLAB's license authentication on the control room workstations. Earlier today, I was able to start MATLAB on pianosa, but now attempting to run /cvs/cds/caltech/apps/linux64/matlab/bin/matlab -desktop results in the message:
License checkout failed.
License Manager Error -15
MATLAB is unable to connect to the license server.
Check that the license manager has been started, and that the MATLAB client machine can communicate
with the license server.
Troubleshoot this issue by visiting:
License path: /home/controls/.matlab/R2013a_licenses:/cvs/cds/caltech/apps/linux64/matlab/licenses/license.dat:/cv
Licensing error: -15,570. System Error: 115
Frustrated by the single pixel width of the windows and how hard that makes it to drag things around, I explored StackExchange:
which showed that there is a .xml file which can be edited to increase this. I've changed the border size to 4 pixels on Rossa - it's nice.
I made some changes to the c1tst model running on c1iscey in order to test my algorithm for frequency counting. I followed the steps listed in elog 8909 to make, install and start the model.
I need to debug a few things and run some more diagnostics so I am leaving the model in its edited version (Eric had committed it to the svn before I made any changes).
Trying to download some data using matlab today, I found that my ole mDV stuff doesn't work because its MEX files were built for AMD64...
Trying to rebuild the NDS1 MEX according to 7 year old instructions didn't work; our GCC is 'too' new.
From the Remote Data Access wiki (https://wiki.ligo.org/RemoteAccess/MatlabTools) I got the new 'get_data.m' and 'GWdata.m'. These didn't run, so I updated the nds2-client and matlab-nds2-client on Donatella.
Still doesn't run to get 40m data. It recognizes that we're C1, but throws some java exception error. Maybe it doesn't work on the NDS1 protocol of our framebuilder?
So then I noticed that our NDS2 server on megatron is no longer running...thought it was supposed to run via init.d. Found that the nds2 binary doesn't run because it can't find libframecpp.so.5; maybe this was blown away in some recent upgrade? We do have versions 3, 4, 6, 7, & 8 of this library installed.
So now, after an hour or two, I'm upgrading the nds2 server on megatron (plus a hundred dependencies) as well as getting a newer version of matlab to see if there's some kind of java version issue there.
Of course python still works to get data, but doesn't have any of the wiener filter calculating code that matlab has...
NDS2 restarted after hours long upgrade process; testing has begun. Let's try to get some long stretches of MC locked with MCL FF ON this weekend so's I can test out the angular FF idea.
I have been working on setting up a frequency counting module that can give us a readout of the beat frequency, divided by a factor of 2^14 using the Wenzel frequency dividers as described here. This is a summary of what I have thus far.
The algorithm, and simulink model
The basic idea is to pass the digitized signal through a Schmitt trigger (existing RCG module), which provides some noise immunity, and should in theory output a clean square wave with the same frequency as the input. The output of the Schmitt trigger module is either 0 (for input below the lower threshold value) or 1 (for input above the upper threshold value). By differencing this between successive samples, we can detect a "zero-crossing", and by measuring the time interval between successive zero crossings, we can take the reciprocal to get the frequency. The last bit of this operation (i.e. measuring the interval) is done using a piece of custom C code.
Initially, I was trying to use the part "GPS" from CDS_PARTS to get the current GPS time and hence measure intervals between successive zero-crossings, but this didn't work out because the output of GPS is in seconds, and that doesn't give me the required precision to count frequency. I tried implementing some more precise timing using the clock_gettime() function, which is capable of giving nanosecond precision, but this didn't work for me. So I am now using a cruder way of measuring the interval, by using a counter variable that is incremented each time a zero-crossing is NOT detected, and then converting this to time using the FE_RATE macro (=16384). In any case, the ADC sampling rate limits the resolution of frequency counting using zero-crossing detection (more on this later). Attachment 1 shows the SIMULINK block diagram for this entire procedure.
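The counting scheme can be prototyped offline in a few lines of Python. This is only a sketch of the idea, not the RCG model or its C code: it assumes that only rising edges of the Schmitt trigger output are used, so that the sample count between edges is one full period, and the function names and thresholds are illustrative.

```python
FE_RATE = 16384  # samples per second, as quoted in the text


def schmitt(samples, low, high):
    """Two-threshold comparator: the output holds its state between thresholds."""
    state, out = 0, []
    for x in samples:
        if x > high:
            state = 1
        elif x < low:
            state = 0
        out.append(state)
    return out


def count_frequency(samples, low, high, rate=FE_RATE):
    """Return one frequency estimate per rising-edge-to-rising-edge interval."""
    trig = schmitt(samples, low, high)
    freqs, counter, seen_edge = [], 0, False
    for prev, cur in zip(trig, trig[1:]):
        counter += 1
        if cur - prev == 1:  # rising edge detected
            if seen_edge:
                freqs.append(rate / counter)
            seen_edge, counter = True, 0
    return freqs


# Ideal square wave with a period of exactly 8 samples.
square = ([1.0] * 4 + [-1.0] * 4) * 10
freqs = count_frequency(square, low=-0.5, high=0.5)
print(freqs[:3])
```

With an input whose period is not an integer number of samples, the count alternates between neighbouring integers, which is the jitter discussed in the testing section.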
Testing the model
I implemented all of this on c1tst, and followed the steps listed here to get the model up and running. I then used one of the DB37 breakout boards to send a signal to the ADC using the DS345 function generator. Attachment 2 shows some diagnostic plots - input signal was a 2.5Vpp (chosen to match the output from the Wenzel dividers) square wave at 2kHz:
The right column pointed me to the limitations of frequency counting using this method - even though the input frequency was constant (2kHz), the counter variable, and hence the frequency readout, was neither accurate nor precise. But this was presumably to be expected given the limitations imposed by ADC sampling: we only get information about the state of the input signal once within each sampling interval, and hence, we cannot know if a zero crossing has occurred until the next sampling interval. Moreover, we can only count frequency in discrete steps. In attachments 3 and 4, I've plotted these discrete frequencies which can be measured - the error bars indicate the error in the frequency readout if the counter variable is 1 more or less than the "true" value - this can (and does) happen if the high and low times of the Schmitt trigger are not equal over time (see top left plot in Attachment 2; it's not very obvious, but the "low" times are not all equal, and so the interval between detected zero crossings is not equal). This becomes a problem for small values of the counter variable, i.e. at high input frequencies. I was having a look at the elogs Aidan wrote some years ago for a different digital frequency counting approach, and I guess the conclusion there was similar - for high input frequencies, the error is large.
I further did two frequency sweeps using the DS345, to see if I could recover this in the frequency readout. Attachments 5 and 6 show the results of these sweeps. For low frequencies, i.e. 100-500 Hz, the jitter in the readout is small (though this will be multiplied by a factor of 2^14), but by the time the input frequency gets up to 2kHz, the jitter in the readout is pretty bad (and gets worse for even higher frequencies).
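The size of the discrete steps can be estimated with a short calculation. The sketch assumes the readout is FE_RATE divided by an integer sample count n covering one full period; if the real model counts half periods, the numbers scale accordingly. The function name and the example counts are illustrative.

```python
FE_RATE = 16384      # samples per second
DIVIDER = 2 ** 14    # division factor of the Wenzel dividers


def readout_levels(n_samples, rate=FE_RATE):
    """Readout for counts of n+1, n and n-1 samples per period."""
    return rate / (n_samples + 1), rate / n_samples, rate / (n_samples - 1)


# Counts corresponding to roughly 2 kHz, 500 Hz and 100 Hz inputs.
for n in (8, 33, 164):
    lo, mid, hi = readout_levels(n)
    print(f"n={n:4d}: {mid:8.1f} Hz (+{hi - mid:.1f} / -{mid - lo:.1f} Hz), "
          f"upward step before division ~ {(hi - mid) * DIVIDER / 1e6:.2f} MHz")
```

Because the readout goes as 1/n, an error of one count costs roughly f/n in frequency, so the relative error grows linearly with the input frequency.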
Some refinements can be made to the algorithm, perhaps by introducing some averaging (i.e. not reading out frequency for every pair of zero crossings, but every 5) which may improve the jitter in the readout, but I would think that the current approach is not very useful above 2kHz (corresponding to ~30MHz of pre-divider frequency), because of the limitations shown in attachments 3 and 4.