I've changed the chiara local backup script to read a folder exclusion file, and excluded /users/public_html/detcharsummary, and things are working again.
This was neccesary because the summary pages are being updated every half hour, which is faster than the time it takes for the backup script to run, so the file index that it builds at the start becomes invalid later on in the process.
Thinking about chiara's disk, it strikes me that when we went from the linux1 RAID to a single HDD on chiara, we may have tightened a bottleneck on our NFS latency, i.e. we are limited to that single hard drive's IO rates. This of course isn't the culprit for the more recent dramatic slowdowns, but in addition to fixing whatever has happened more recently, we may want to consider some kind of setup with higher IO capability for the NFS filesystem.
In fact, the file access is supposed to be WAY faster now than in the RAID case.
As noted in ELOG 9511, it was SCSI-2(or 3?) that had ~6MB/s thruput. Previously the backup took ~2hours.
This was improved to 30min by SATA HDD on llinux1.
I am looking at /opt/rtcds/caltech/c1/scripts/backup/rsync.backup.cumlog
In fact, this "30-min backup" was true until the end of March. After that the backup is taking 1h~1.5h.
This could be related to the recent NFS issue?
I have put the Wiener filter scripts into /opt/rtcds/caltech/c1/scripts/Wiener/ . They are under version control.
The idea is that you should copy ParameterFile_Example.m into your own directory, and modify parameters at the top of the file, and then when you run that script, it will output fitted filters ready to go into Foton. (Obviously you must check before actually implementing them that you're happy with the efficacy and fits of the filters).
Things to be edited in the ParameterFile include:
I think that's everything that is required.
Since Chiara's onboard ethernet card has a reputation to be flaky in Linux, Koji suggested we could just buy a new ethernet card and throw it in there, since they're cheap.
I've installed a Intel EXPI9301CT ethernet card in Chiara, which detected it without problems. I changed over the network settings in /etc/networking/interfaces to use eth1 instead of eth0, restarted nfs and bind9, and everything looked fine.
Sadly, EPICS/network slowdowns are still happening. :(
I've tweaked the ELOG code to allow uploading of PDFs by drag-and-drop into the main editor window. Once again we can bask in the glory of
There seems to be something funny going on with MATLAB's license authentication on the control room workstations. Earlier today, I was able to start MATLAB on pianosa, but now attempting to run /cvs/cds/caltech/apps/linux64/matlab/bin/matlab -desktop results in the message:
License checkout failed.
License Manager Error -15
MATLAB is unable to connect to the license server.
Check that the license manager has been started, and that the MATLAB client machine can communicate
with the license server.
Troubleshoot this issue by visiting:
License path: /home/controls/.matlab/R2013a_licenses:/cvs/cds/caltech/apps/linux64/matlab/licenses/license.dat:/cv
Licensing error: -15,570. System Error: 115
Frustrated by the single pixel width of the windows and how hard that makes it to drag things around, I explored StackExchange:
which showed how there is a .xml file which can be edited to increase this. I've changed the border size to 4 pixels on Rossa - its nice.
I made some changes to the c1tst model running on c1iscey in order to test my algorithm for frequency counting. I followed the steps listed in elog 8909 to make, install and start the model.
I need to debug a few things and run some more diagnostics so I am leaving the model in its edited version (Eric had committed it to the svn before I made any changes).
Trying to download some data using matlab today, I found that my ole mDV stuff doesn't work because its MEX files were built for AMD64...
Tried to rebuild the NDS1 MEX according to 7 year old instructions didn't work; our GCC is 'too' new.
From the Remote Data Access wiki (https://wiki.ligo.org/RemoteAccess/MatlabTools) I got the new 'get_data.m' and 'GWdata.m'. These didn't run, so I updated the nds2-client and matlab-nds2-client on Donatella.
Still doesn't run to get 40m data. It recognizes that we're C1, but throws some java exception error. Maybe it doesn't work on the NDS1 protocol of our framebuilder?
So then I noticed that our NDS2 server on megatron is no longer running...thought it was supposed to run via init.d. Found that the nds2 binary doesn't run because it can't find libframecpp.so.5; maybe this was blown away in some recent upgrade? We do have versions 3, 4, 6, 7, & 8 of this library installed.
So now, after an hour or two, I'm upgrading the nds2 server on megatron (plus a hundred dependencies) as well as getting a newer version of matlab to see if there's some kind of java version issue there.
Of course python still works to get data, but doesn't have any of the wiener filter calculating code that matlab has...
NDS2 restarted after hours long upgrade process; testing has begun. Let's try to get some long stretches of MC locked with MCL FF ON this weekend so's I can test out the angular FF idea.
I have been working on setting up a frequency counting module that can give us a readout of the beat frequency, divided by a factor of 2^14 using the Wenzel frequency dividers as described here. This is a summary of what I have thus far.
The algorithm, and simulink model
The basic idea is to pass the digitized signal through a Schmitt trigger (existing RCG module), which provides some noise immunity, and should in theory output a clean square wave with the same frequency as the input. The output of the Schmitt trigger module is either 0 (for input < lower threshold value) and 1 (for input greater than the high threshold value). By differencing this between successive samples, we can detect a "zero-crossing", and by measuring the time interval between successive zero crossings, we can take the reciprocal to get the frequency. The last bit of this operation (i.e. measuring the interval) is done using a piece of custom C code. Initially, I was trying to use the part "GPS" from CDS_PARTS to get the current GPS time and hence measure intervals between successive zero-crossings, but this didn't work out because the output of GPS is in seconds, and that doesn't give me the required precision to count frequency. I tried implementing some more precision timing using the clock_gettime() function, which is capable of giving nanosecond precision, but this didn't work for me. So I am now using a more crude way of measuring the interval, by using a counter variable that is incremented each time a zero-crossing is NOT detected, and then converting this to time using the FE_RATE macro (=16384). In any case, the ADC sampling rate limits the resolution of frequency counting using zero-crossing detection (more on this later). Attachment 1 shows the SIMULINK block diagram for this entire procedure.
Testing the model
I implemented all of this on c1tst, and followed the steps listed here to get the model up and running. I then used one of the DB37 breakout boards to send a signal to the ADC using the DS345 function generator. Attachment 2 shows some diagnostic plots - input signal was a 2.5Vpp (chosen to match the output from the Wenzel dividers) square wave at 2kHz:
The right column pointed me to the limitations of frequency counting using this method - even though the input frequency was constant (2kHz), the counter variable, and hence the frequency readout, was neither accurate nor precise. But this was to be expected given the limitations imposed by ADC sampling? We only get information of the state of the input signal once within each sampling interval, and hence, we cannot know if a zero crossing has occurred until the next sampling interval. Moreover, we can only count frequency in discrete steps. In attachments 3 and 4, I've plotted these discrete frequencies which can be measured - the error bars indicate the error in the frequency readout if the counter variable is 1 more or less than the "true" value - this can (and does) happen if the high and low times of the Schmitt trigger are not equal over time (see top left plot in Attachment 2, its not very obvious, but all the "low" times are not equal, and so, the interval between detected zero crossings is not equal). This becomes a problem for small values of the counter variable, i.e. at high input frequencies. I was having a look at the elogs Aidan wrote some years ago for a different digital frequency counting approach, and I guess the conclusion there was similar - for high input frequencies, the error is large.
I further did two frequency sweeps using the DS345, to see if I could recover this in the frequency readout. Attachments 5 and 6 show the results of these sweeps. For low frequencies, i.e. 100-500 Hz, the jitter in the readout is small (though this will be multiplied by a factor of 2^14), but by the time the input frequency gets up to 2kHz, the jitter in the readout is pretty bad (and gets worse for even higher frequencies.
Some refinements can be made to the algorithm, perhaps by introducing some averaging (i.e. not reading out frequency for every pair of zero crossings, but every 5) which may improve the jitter in the readout, but I would think that the current approach is not very useful above 2kHz (corresponding to ~30MHz of pre-divider frequency), because of the limitations shown in attachments 3 and 4.
I definitely think lowpassing the output is the way to go. Since this frequency readback will be used for slow control of the beatnote frequency via auxillary laser temperature, even lowpassing at tens of Hz is fine. The jitter doesn't mean its useless, though.
If we lowpass at 16Hz, we're effectively averaging over 1024 samples, bringing, for example, a +-2kHz jitter of a 6kHz signal as you post down to 2kHz/sqrt(1024) ~ 60Hz, which is 1% of the carrier. This seems ok to me.
I was going to suggest using a software PLL, but perhaps averaging gives the same result. The same ADC signal can be fed to multiple blocks with different averaging times and we can just use whichever ones seems the most useful.
I noticed that Chiara's backup HD (which has a capacity of 1.8TB, vs the main drives 2TB) was near to getting full, meaning that we would soon be without a local backup.
I freed up ~200GB of space by compressing the autoburt snapshots from 2012, 2013, 2014. Nothing is deleted, I've just compressed text files into archives, so we can still dig out the data whenever we want.
None of the links here seem to work. I forgot what the story is with our special apache redirect
The story is: we currently don't expose the whole /users/public_html folder. Instead, we are symlinking the folders from public_html to /export/home/ on nodus, which is where apache looks for things
So, I fixed the links on the Core Optics page by running:
controls@nodus|~ > ln -sfn /users/public_html/40m_phasemap /export/home/
controls@nodus|~ > df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/nodus2--vg-root 355G 69G 269G 21% /
udev 5.9G 4.0K 5.9G 1% /dev
tmpfs 1.2G 308K 1.2G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 5.9G 0 5.9G 0% /run/shm
/dev/sda1 236M 210M 14M 94% /boot
chiara:/home/cds 2.0T 1.5T 459G 77% /cvs/cds
fb:/frames 13T 11T 1.6T 88% /frames
The /boot partition was filling up with old kernels. Nodus has automatic security updates turned on, so new kernels roll in and the old ones don't get removed.
I ran apt-get autoremove, which removed several old kernels. (apt is configured by default to keep two previous kernels around when autoremoving, so this isn't so risky)
Now: /dev/sda1 236M 94M 130M 42% /boot
/dev/sda1 236M 94M 130M 42% /boot
In principle, one should be able change a setting in /etc/apt/apt.conf.d/50unattended-upgrades that would do this cleanup automatically, but this mechanism has a bug whose fix hasn't propagated out yet (link). So, I've added a line to nodus' root crontab to autoremove once a week, Sunday morning.
COMSOL 5.1 has been installed at: /cvs/cds/caltech/apps/linux64/comsol51/bin/comsol
MATLAB 2015b has been installed at: /cvs/cds/caltech/apps/linux64/matlab15b/bin/matlab
This has not replaced the default matlab on the workstations, which remains at 2013a. If some testing reveals that the upgrade is ok, we can rename the folders to switch.
On VIDEO.adl, Image Capture and Video Capture did not seem to work and gave me some errors, so I fixed following two things:
1. just put one side of a USB cable to Pianosa the other side of which was connected to Sensoray; I don't know why but this was unconnected.
2. slightly fixed /users/sensoray/sdk_2253_1.2.2_linux/imsub/display-image.py as fpllows
L52: pix[j, i] = R, G, B -> pix[j, i] = int(R), int(G), int(B)
It seems to work, at least for some cameras including ETMYF and ITMYF.
Somehow, the controls user account on donatella lost its membership to the sudoers group, which meant doing anything that needs root authentication was impossible.
I fixed this by booting up from a Linux install USB drive, mounting the HD, and running useradd controls sudo
useradd controls sudo
I added 1 line to one of the ASS scripts, UNFREEZE_DITHER.py like this:
L29> ez.cawrite('C1:ASS-'+dof+'_GAIN', 0)
The reason why I added this is: without this line, C1:ASS-'+dof+'_GAIN become larger that 1.0, which is nomial value, if you UNFREEZE DITHER when the dither is already running or C1:ASS-'+dof+'_GAIN is not 0.0.
Here I explain usage of my scripts for loss map measurement. There are 7 script files in a same directory /opt/rtcds/caltech/c1/scripts/lossmap_scripts. With these scripts, round trip loss of an arm cavity with the beam spot on one mirror shifted to 5x5 (option: 3x3) points is measured. You can choose on which cavity you measure, the beam spot on which mirror you shift, and maximum shift of the beam spot in vertical and horizontal direction.
To start measurement from the beginning
Run the following command in an arbitrary directory and you will get several text files including the result of loss map measurement:
> python /opt/rtcds/caltech/c1/scripts/lossmap_scripts/lossmap.py [maximum shift in mm (PIT)] [maximum shift in mm (YAW)] [arm name (XorY)] [mirror name (E or I)]
Optionally, you can add "AUTO" at the end of the above command. Without "AUTO", you will be asked if the dithering has already settled down or not after each shift of the beam spot and you can let the scripts wait until the dithering settles down sufficiently. If you add "AUTO", it will be judged if the dithering has settled down or not according to some criteria, and the measurement will continue without your response to the terminal.
The files to be created in the current directory by the scripts are:
- lossmapETMX1-1.txt # [POX power (locked)] / [POX power (misaligned)]
- lossmapETMX1-2.txt # standard deviation of [POX power (locked)] / [POX power (misaligned)]
- lossmapETMX1-3.txt # TRX
- lossmapETMX1-1_converted.txt # round trip loss (ppm) calculated from lossmapETMX1-1.txt
- lossmapETMX1-1_converted_sigma.txt # standard deviation of round trip loss calculated from 1-1.txt and 1-2.txt
- lossmapETMX_result.txt # round trip loss and its error in a clear form.
The name of the files would be "lossmapITMY1-1.txt" etc. depending on which mirror you have chosen.
To restart measurement from a certain point
Run the following command in a directory containing "lossmap(mirror name)1-1.txt", "lossmap(mirror name)1-2.txt" and "lossmap(mirrorname)1-3.txt" which are created by previous not-completed measurement:
> python /opt/rtcds/caltech/c1/scripts/lossmap_scripts/lossmap.py [maximum shift in mm (PIT)] [maximum shift in mm (YAW)] [arm name (XorY)] [mirror name (E or I)] [restart point (PIT)] [restart point (YAW)]
You can also add "AUTO".
How to designate the restart point:
Matrix elements of output of this measurement procedure are characterized by a pair of two numbers as the following shows.
(-1,-1) -> (-1,-0.5) -> (-1,0) -> (-1,0.5) -> (-1,1)
(-0.5,1) <- (-0.5,0.5) <- (-0.5,0) <- (-0.5,-0.5) <- (0.5,-1)
(0,-1) -> (0,-0.5) -> (0,0) -> (0,0.5) -> (0,1)
(0.5,1) <- (0.5,0.5) <- (0.5,0) <- (0.5,-0.5) <- (0.5,-1)
(1,-1) -> (1,-0.5) -> (1,0) -> (1,0.5) -> (1,1)
Please write the numbers that correspond to the matrix element you want to restart at. Arrows show the order of sequence of measurement. About the correspondence between the matrix elements and real position on the ETMY and ETMX, see elog 11818 and 11857, respectively.
This script will overwrite the files (~1-1.txt etc.) so it is safer to make backup of the files before you run this script.
Some notes on the scripts and measurement
- Calibration has been done only for ETMs, i.e. for ITMs unit of [maximum shift] is not mm, but the values written in [maximum shift] equal to the maximum offsets added just after demodulation of ASS loop (ex. C1:ASS-YARM_ITM_PIT_L_DEMOD_I_OFFSET).
- It should be checked before doing measurement if the following parameters are correct or not.
POXzero (L47 in lossmapx.py and L52 in lossmapx_resume.py: the value of C1:LSC-POXDC_OUTPUT when no light injects into POXPD.)
POYzero (L45 in lossmapy.py and L50 in lossmapy_resume.py: the value of C1:LSC-POYDC_OUTPUT when no light injects into POYPD.)
mmr (L11 in lossmap_convert.py: (mode matching carrier power)/(total power))
Tf (L12 in lossmap_convert.py; transmittivity of ITM)
Tetm (L13 in lossmap_convert.py: transmittivity of ETM in ppm)
- Changing n (L50 in lossmap.py) from 5 to 3, the grid points will be 3x3 changed from the default value of 5x5. If 3x3, the matrix elements are characterized by
(-1,-1) -> (-1,0) -> (-1,1)
(0,1) <- (0,0) <- (0,-1)
(1,-1) -> (1,0) -> (1,1)
similarly to the case of 5x5.
- You can copy the directory lossmap_scripts anywhere in controls and use it. These scripts will work as long as all the 7 scripts exist in a same directory.
I've done a couple things to try and make nodus a little more secure. Some have worried that nodus may be susceptible to being drafted into a botnet, slowing down our operations.
1. I configured the ssh server settings to disallow logins as root. Ubuntu doesn't enable the root account by default anyways, but it doesn't hurt.
2. I installed fail2ban. Function: If some IP address fails to authenticate an ssh connection 3 times, it is banned from trying to connect for 10 minutes. This is mostly for thwarting mass brute force attacks. Looking at /var/log/auth.log doesn't indicate any of this kind of thing going on in the past week, at least.
3. I set up and enabled ufw (uncomplicated firewall) to only allow incoming traffic for:
I don't think there are any other ports we need open, but I could be wrong. Let me know if I broke something you need!
NDS2 and the usual ports so that we can use optimus as a comsol server.
This LHO log indicates that EPICS slow down could be due to NFS activity. Could we make some trend of NFS activity on Chiara and then see if it correlates with EPICS flatlines?
I wonder if our EPICS issues frequency is correlated to the Chiara install.
We changed the password for controls on nodus this afternoon. We also zeroed out the authorized_keys file and then added back in the couple that we want in there for automatic backups / detchar.
Also did the recommended Ubuntu updates on there. Everything seems to be going OK so far. We think nothing on the interferometer side cares about the nodus password.
We also decided to dis-allow personal laptops on the new Martian router (to be installed soon).
After hardware errors prevented me from using optimus, I switched my generation of summary pages back to the clusters. A day's worth of data is still too much to process using one computer, but I have successfully made summary pages for a timescales of a couple of hours on this site: https://ldas-jobs.ligo.caltech.edu/~praful.vasireddy/
Currently, I'm working on learning the current plot-generation code so that it can eventually be modified to include an interactive component (e.g., hovering over a point on a timeseries would display the GPS time). Also, the 40m summary pages have been down for the past 3 weeks but should be up and working soon as the clusters are now alive.
I've added a new tab for VMon under the SUS parent tab. I'm still working out the scale and units, but let me know if you think this is a useful addition. Here's a link to my summary page that has this tab: https://ldas-jobs.ligo.caltech.edu/~praful.vasireddy/1151193617-1151193917/sus/vmon/
I'll have another tab with VMon BLRMS up soon.
Also, the main summary pages should be back online soon after Max fixed a bug. I'll try to add the SUS/VMon tab to the main pages as well.
The main C1 summary pages are back online now thanks to Max and Duncan, with a gap in pages from June 8th to July 4th. Also, I've added my new VMon and Sensors tabs to the SUS parent tab on the main pages. These new tabs are now up and running on the July 7th summary page.
Here's a link to the main nodus pages with the new tabs: https://nodus.ligo.caltech.edu:30889/detcharsummary/day/20160707/sus/vmon/
And another to my ldas page with the tabs implemented: https://ldas-jobs.ligo.caltech.edu/~praful.vasireddy/1150848017-1150848317/sus/vmon/
Let me know if you have any suggestions or see anything wrong with these additions, I'm still working on getting the scales to be right for all graphs.
I started to receive emails from cron every 15min. Is the email related to this? And is it normal? I never received these cron emails before when the sum-page was running.
I don't know much about how the cron job runs, I'll forward this to Max.
Max says it should be fixed now. Have the emails stopped?
This should be fixed now—apologies for the spam.
It seemed something has been done. And I got cron emails.
Then, it seemed something has been done. And the emails stopped.
A new MEDM tab has been added to the summary pages (https://nodus.ligo.caltech.edu:30889/detcharsummary/day/20160708/medm/), although some of the screens are not updated when /cvs/cds/projects/statScreen/cronjob.sh is run. In /cvs/cds/projects/statScreen/log.txt, the following error is given for those files: import: unable to read X window image `0x20011f': Resource temporarily unavailable @ error/xwindow.c/XImportImage/5027. If anyone has seen this error before or knows how to fix it, please let me know.
In the meantime, I'll be working on creating an archive of MEDM screens for every hour to be displayed on the summary pages.
Some of the screens are up-to-date, and some are not. Are the errors associated with the screens that failed to get updated?
Thanks! Yes, only the screens that are not updated when the script is executed show this error. I'll try to keep debugging over the weekend.
The MEDM screen capture tab is now working and up on the summary pages: https://nodus.ligo.caltech.edu:30889/detcharsummary/day/20160725/medm/
Please let me know if you have any suggestions or notice any issues.
Looks pretty great. However, there's two problems:
1) Some of the MEDM screens don't show the time. You can fix this by editing the screens and copy/paste from screens which have working screens.
2) The snapshot script seems to not grab the full MEDM screen sometimes.
These are not a very big deal, so you can get the microphones working first and we can take care of this afterwards.
Usual Ubuntu apt-get upgrades; long delayed but now happening.
The nodus restart caused a bit of downtime. The apache configuration files were accidentally deleted the other day, so elog/svn/wikis were just holding on in memory; this fact was unfortunately not elogged.
Things should be up and running again, except for the 8080->8081 elog redirection which I haven't been able to figure out.
I will also set up the NFS backup to include nodus configuration files from now on
Nodus' /export and /etc directories are now being backed up at /cvs/cds/caltech/nodus_backup
They will be rsync'd over as part of the nightly tape backups (scripts/backup/rsync.backup)
Sorry I was writting the elog, but I had to dive into the chamber (@LHO) before completion.
The summary pages have been updated with the new naming seismometer channel naming conventions. Here's a link to them working on my own page: https://ldas-jobs.ligo.caltech.edu/~praful.vasireddy/1154908817-1154909717/pem/seismic/
Let me know if the actual pages aren't working when they come back online or if there's something that needs to be changed.
We found that someone had violated all rules of computer security decency and was storing our nodus password as a plain text file in their bash_profile.
After the flogging we have changed the pwd and put the new one in the usual secret place.
I tried to follow these instructions today to make the Simulink Webview accessible:
controls@nodus|public_html > ln -sfn /users/public_html/FE /export/home/
But...I got a "403 Forbidden" message. What is the secret handshake to get this to work? And why have we added this extra step of security?
This link works for me: https://nodus.ligo.caltech.edu:30889/FE/c1als_slwebview.html. The problem with just /FE/ is that there is no index.html, and we have turned off automatic directory listings.
IIRC, this arrangement was due to the fact that authentication of some of the folders (maybe the wikis) was broken during the nodus upgrade, so there was sensitive information being publicly displayed. This setup gives us discretion over what gets exposed.
I suppose before directory listings were turned off we should have fixed the script to make an index.html, but that's how it goes with "up"-grades. How about re-allow directory listing so that our old links for webview work again?
EQ: https://nodus.ligo.caltech.edu:30889/FE is live
This was done by adding "Options +Indexes" to /etc/apache/sites-available/nodus
I've added a little more info about the apache configuration on the wiki: ApacheOnNodus