Okay, more memory would definitely be good. I don't think I have been using MEDM (which Jamie tells me is the controls interface) so making a cronjob would probably be a good idea.
Upon investigation, the other methods besides process_data() take almost no time at all to run by comparison. The process_data() method takes roughly 2521 seconds to run using the multiprocessing module with eight workers. After it finishes, the rest of the program takes only about 120 seconds. So, since I still need to restructure the code, I won't bother adding multiprocessing to these other methods yet: it wouldn't significantly improve speed, and any improvements might not be compatible with how I restructure the code. For now, the code is about as efficient as it can be in terms of multiprocessing; further improvements may or may not be gained when the code is restructured.
Multiprocessing: In its current form, the code uses multiprocessing to the maximal extent possible. It takes roughly 2600 seconds to run (times may vary depending on what else megatron is running, etc.). Multiprocessing is only used on the process_data() function calls, because these by far take the longest; the function calls after the process_data() calls take a combined ~120 seconds. See http://nodus.ligo.caltech.edu:8080/40m/8218 for details on the use of multiprocessing to call process_data().
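For reference, a minimal sketch of the approach, assuming the process_data() calls are farmed out over a worker pool; the function body, the 'sections' list, and the worker count are placeholders for illustration, not the actual summary_pages.py code:
# Minimal sketch: distribute the per-section process_data() calls over worker processes.
import multiprocessing

def process_data(section):
    # placeholder for the expensive per-tab / per-channel processing
    return section

if __name__ == '__main__':
    sections = ['ifo', 'pem', 'weather']        # hypothetical list of things to process
    pool = multiprocessing.Pool(processes=8)    # eight workers, matching the timing quoted above
    results = pool.map(process_data, sections)  # blocks until every section has been processed
    pool.close()
    pool.join()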
Crontab: I also updated the crontab in an attempt to fix the problem where data is only displayed until 5PM. Recall that previously (http://nodus.ligo.caltech.edu:8080/40m/8098) I found that the crontab wasn't even calling the summary_pages.py script after 5PM. I changed it then to be called at 11:59PM, which also didn't work because of the day change after midnight.
I decided it would be easiest to just call the function on the previous day's data at 12:01AM the next day. So, I changed the crontab.
59 5,11,17,23 * * * /users/public_html/40m-summary/bin/c1_summary_page.sh 2>&1
0 6,12,18 * * * /users/public_html/40m-summary/bin/c1_summary_page.sh 2>&1
1 0 * * * /users/public_html/40m-summary/bin/c1_summary_page.sh $(date "+%Y/%m/%d" --date="1 days ago") 2>&1
For some reason, as of 9:00PM today (March 4, 2013) I still don't see any data up, even though the change to the crontab was made on February 28. Even more bizarre is the fact that data is present for March 1-3. Perhaps some error was introduced into the code somehow, or I don't understand how crontab does its job. I will look into this now.
Once I fix the above problem, I will begin refactoring the code into different well-documented classes.
Attempted Battery Replacement on Backup Power Supply in the Control Room:
I tried to replace the batteries in the Smart UPS 2200 with new batteries purchased by Steve. However, the power port wasn't compatible with the batteries. The battery cable's plug was too tall to fit properly into the Smart UPS port. New batteries must be acquired. Steve has pictures of the original battery (gray) and the new battery (blue) plugs, which look quite different (even though the company said the battery would fit).
The correct battery connector is GRAY: APC RBC55
Quick Note on Multiprocessing: The multiprocessing was plugged into the codebase on March 4. Since then, the various pages that appear when you click on certain tabs (such as the page found here: https://nodus.ligo.caltech.edu:30889/40m-summary/archive_daily/20130304/ifo/dc_mon/ from clicking the 'IFO' tab) don't display graphs. But the graphs are being generated: the two image files that are supposed to be displayed do exist. So, for some reason, the multiprocessing is preventing these graphs from appearing on the pages, even though they are being generated. I have temporarily rolled back the multiprocessing changes so that the newly generated pages look correct until I find the cause of this.
Fixing Plot Limits: The plots generated by the summary_pages.py script have a few problems, one of which is: the graphs don't choose their boundaries in a very useful way. For example, in these pressure plots, the dropout 0 values 'ruin' the graph in the sense that they cause the plot to be scaled from 0 to 760, instead of a more useful range like 740 to 760 (which would allow us to see details better).
The call to the plotting functions begins in process_data() of summary_pages.py, around line 972, with a call to plot_data(). This function takes in a data list (which represents the x-y data values, as well as a few other fields such as axes labels). The easiest way to fix the plots would be to "cleanse" the data list before calling plot_data(). In doing so, we would remove dropout values and obtain a more meaningful plot.
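Roughly, that cleansing step might look like the sketch below. The layout of 'data' as a list of (channel_name, times, values) tuples is an assumption based on the printout further down; the actual structure is inspected next.
# Sketch of the "cleansing" idea: drop the dropout (zero) samples before plotting.
import numpy

def remove_dropouts(data, bad_value=0.0):
    cleaned = []
    for name, times, values in data:             # assumed layout: (name, times, values) tuples
        times = numpy.asarray(times)
        values = numpy.asarray(values)
        keep = values != bad_value               # keep everything that is not a dropout
        cleaned.append((name, times[keep], values[keep]))
    return cleaned

# data = remove_dropouts(data)                   # would be called just before plot_data()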
To observe the data list that is passed to plot_data(), I added the following code:
# outfile is a string that represents the name of the .png file that will be generated by the code.
print_verbose("Saving data into a file.")
outfile_mch = open(outfile + '.dat', 'w')
# at this point in process_data(), data is an array that should contain the desired data values.
print >> outfile_mch, data
outfile_mch.close()
When I ran this in the code midday, it gave a human-readable array of values that appeared to match the plots of pressure (i.e. values between 740 and 760, with a few dropout 0 values). However, when I let the code run overnight, instead of observing a nice list in 'outfile.dat', I observed:
[('Pressure', array([ 1.04667840e+09, 1.04667846e+09, 1.04667852e+09, ...,
1.04674284e+09, 1.04674290e+09, 1.04674296e+09]), masked_array(data = [ 744.11076965 744.14254761 744.14889221 ..., 742.01931356 742.05930208
mask = False,
fill_value = 1e+20)
I.e. there was an ellipsis (...) instead of actual data, for some reason. Python does this when printing lists in a few specific situations, the most common of which is that the list is recursively defined. For example:
a = []; a.append(a)   # printing a now gives [[...]]
It doesn't seem possible that the definitions for the data array become recursive (especially since the test worked midday). More likely, the array is simply too long: numpy summarizes arrays above its print threshold with an ellipsis rather than printing every element.
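A quick demonstration of that numpy behavior (the threshold can be raised with numpy.set_printoptions to force the full printout):
import numpy

a = numpy.arange(2000)
print(a)                                  # summarized: [   0    1    2 ..., 1997 1998 1999]
numpy.set_printoptions(threshold=2001)    # raise the threshold above the array size
print(a)                                  # now the full array is printed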
Instead, I will use cPickle to save the data. The disadvantage is that the output is not human readable. But cPickle is very simple to use. I added the lines:
import cPickle
cPickle.dump(data, open(outfile + 'pickle.dat', 'w'))
This should save the 'data' array into a file, from which it can be later retrieved by cPickle.load().
There are other modules I can use that will produce human-readable output, but I'll stick with cPickle for now since it's well supported. Once I verify this works, I will be able to do two things:
1) Cut out the dropout data values to make better plots.
2) When the process_data() function is run in its current form, it reprocesses all the data every time. Instead, I will be able to draw the existing data out of the cPickle file I create. So, I can load the existing data, and only add new values. This will help the program run faster.
Jamie has informed me of numpy's numpy.savetxt() method, which is exactly what I want for this situation (human-readable text storage of an array). So, I will now be using:
# outfile is the name of the .png graph. data is the array with our desired data.
numpy.savetxt(outfile + '.dat', data)
to save the data. I can later retrieve it with numpy.loadtxt().
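A rough sketch of how this could be used per channel, and of the incremental idea from point 2 above; note that numpy.savetxt wants a 2-D numeric array, so the (times, values) column stacking and the function names here are assumptions for illustration:
# Sketch: save (time, value) columns for one channel, then reload on the next
# run and only process samples newer than what is already stored.
import os
import numpy

def save_channel(outfile, times, values):
    numpy.savetxt(outfile + '.dat', numpy.column_stack((times, values)))

def load_channel(outfile):
    if not os.path.exists(outfile + '.dat'):
        return numpy.empty(0), numpy.empty(0)
    stored = numpy.loadtxt(outfile + '.dat', ndmin=2)
    return stored[:, 0], stored[:, 1]

# old_t, old_v = load_channel(outfile)
# new = new_t > (old_t[-1] if old_t.size else -numpy.inf)   # keep only unseen samples
# save_channel(outfile, numpy.concatenate((old_t, new_t[new])),
#              numpy.concatenate((old_v, new_v[new])))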
Graph Limits: The limits on graphs have been problematic. They often reflect too large of a range of values, usually because of dropouts in data collection. Thus, they do not provide useful information because the important information is washed out by the large limits on the graph. For example, the graph below shows data over an unnecessarily large range, because of the dropout in the 300-1000Hz pressure values.
The limits on the graphs can be modified using the config file found in /40m-summary/share/c1_summary_page.ini. At the entry for the appropriate graph, change the amplitude-lim=y1,y2 line by setting y1 to the desired lower limit and y2 to the desired upper limit. For example, I changed the amplitude limits on the above graph to amplitude-lim=.001,1, and achieved the following graph.
The limits could be tightened further to improve clarity - this is easily done by modifying the config file. I modified the config file for all the 2D plots to improve the bounds. However, on some plots, I wasn't sure what bounds were appropriate or what range of values we were interested in, so I will have to ask someone to find out.
Next: I now want to fix all the funny little problems with the site, such as scroll bars appearing where they should not appear, and graphs only plotting until 6PM. In order to do this most effectively, I need to restructure the code and factor it into several files. Otherwise, the code will not only be much harder to edit, but will become more and more confusing as I add on to it, compounding the problems that we currently have (i.e. that this code isn't very well documented and nobody knows how it works). We need specific documentation of what exactly is happening before too many changes are made. Take the config files, for example: someone put a lot of work into them, but we need a README specifying which options are supported for which types of graphs, and so on. As it stands, progress is slow because I have to figure out what is going on before I can make even small changes.
To fix this, I will divide the code into three main sectors (a rough skeleton is sketched after the list). The division of labor will be:
- Sector 1: Figure out what the user wants (i.e. read config files, create a ConfigParser, etc...)
- Sector 2: Process the data and generate the plots based on what the user wants
- Sector 3: Generate the HTML
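A rough skeleton of what such a split might look like; all class and method names here are hypothetical placeholders, not the existing summary_pages.py structure:
import ConfigParser   # Python 2 module name, matching the current code base

class ConfigReader(object):
    """Sector 1: read the .ini files and figure out what the user wants."""
    def __init__(self, ini_files):
        self.parser = ConfigParser.ConfigParser()
        self.parser.read(ini_files)

    def plot_requests(self):
        return self.parser.sections()   # e.g. one section per requested plot

class DataProcessor(object):
    """Sector 2: fetch and process the data, then generate the plots."""
    def process(self, request):
        pass   # data access, averaging, dropout removal, plotting, ...

class HTMLBuilder(object):
    """Sector 3: assemble the summary-page HTML from the generated plots."""
    def build(self, plots):
        pass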
Duncan Macleod (original author of summary pages) has an updated version that I would like to import and work on. The code and installation instructions are found below.
I am not sure where we want to host this. I could put it in a new folder in /users/public_html/ on megatron, for example. Duncan appears to have just included the summary page code in the pylal repository. Should I reimport the whole repository? I'm not sure if this will mess up other things on megatron that use pylal. I am working on talking to Rana and Jamie to see what is best.
I am following the instructions here:
But there was an error when I ran the ./00boot command near the beginning. I have asked Duncan Macleod about this and am waiting to hear back.
For now, I am putting things into /home/controls on allegra. My understanding is that this is not shared, so I don't have a chance of messing up anyone else's work. I have been moving slowly and being extra cautious about what I do because I don't want to accidentally nuke anything.
I installed the new version of LAL on allegra. I don't think it has interfered with the existing version, but if anyone has problems, let me know. The old version on allegra was 6.9.1; the new code uses a newer release. To use it, add . /opt/lscsoft/lal/etc/lal-user-env.sh to the end of the .bashrc file (this is the simplest way, since it will automatically pull in the new version).
I am having a little trouble resolving some other unmet dependencies for the summary pages, such as the new lalframe, but I am working on it.
Once I get it working on allegra and know that I can get it without messing up current versions of lal, I will do this again on megatron so I can test and edit the new version of the summary pages.
LALFrame was successfully installed. Allegra had unmet dependencies with some of the library tools. I tried to install LALMetaIO, but there were unmet dependencies with other LSC software. After updating the LSC software, the problem has persisted. I will try some more, and ask Duncan if I'm not successful.
Installing these packages is rather time-consuming; it would be nice if there were a way to do it all at once.
I am now working on megatron, installing in /home/controls/lal. I am having some unmet dependency issues that I have asked Duncan about.
I have figured out all the issues, and successfully installed the new versions of the LAL software. I am now going to get the summary pages set up using the new code.
There was an issue with running the new summary pages, because laldetchar was not included (the website I used for instructions doesn't mention that it is needed for the summary pages). I figured out how to include it with help from Duncan. There appear to be other needed dependencies, though. I have emailed Duncan to ask how these are imported into the code base. I am making a list of all the packages / dependencies that I needed that weren't included on the website, so this will be easier if/when it has to be done again.
Most dependencies are met. The next issue is that matplotlib.basemap is not installed, because it is not available for our version of python. We need to update python on megatron to fix this.
Replaced the batteries successfully in the control room. We just had to switch the clips from the old batteries to the new ones, which we didn't know was possible until now.
The summary pages have been down due to incompatibilities with a software update and problems with the LDAS cluster. I'm working at the moment to fix the former and the LDAS admins are looking into the latter. Overall, we can expect the pages will be fully functional again by Monday.
The pages are live again. Please allow some time for the system to catch up and process missed days. If there are any further issues, please let me know.
URL reminder: https://nodus.ligo.caltech.edu:30889/detcharsummary/
It was brought to my attention that the "Code status" page (https://nodus.ligo.caltech.edu:30889/detcharsummary/status.html) had been stuck showing "Unknown status" for a while.
This was due to a sync error with LDAS and has now been fixed. Let me know if the issue returns.
The summary pages are currently unstable due to priority issues on the cluster*. The plots had been empty ever since the CDS update started anyway. This issue will (presumably) disappear once the jobs are moved to the new 40m shared LDAS account by the end of next week.
*namely, the jobs are put on hold (rather, status "idle") because we have low priority in the processing queue, making the usual 30min latency impossible.
Bad syntax errors in the c1sus.ini config file were causing the summary pages to crash: a plot type had not been indicated for plots 5 and 6, so I've made these "timeseries."
In the future, please remember to always specify a plot type, e.g.:
5 = C1:SUS-ETMX_SUSPIT_INMON.mean, timeseries
3 = C1:SUS-ETMX_SUSPIT_INMON.mean, timeseries
By the way, the pages will continue to be unavailable while I transfer them to the new shared account.
A shared LIGO Data Grid (LDG) account was created for use by the 40m lab. The purpose of this account is to provide access to the LSC computer cluster resources for 40m-specific projects that may benefit from increased computational power and are not linked to any user in particular (e.g. the summary pages).
For further information, please see https://wiki-40m.ligo.caltech.edu/40mLDASaccount
The summary pages are now generated from the new 40m LDAS account. The nodus URL (https://nodus.ligo.caltech.edu:30889/detcharsummary/) is the same and there are no changes to the way the configuration files work. However, the location on LDAS has changed to https://ldas-jobs.ligo.caltech.edu/~40m/summary/ and the config files are no longer version-controlled on the LDAS side (this was redundant, as they are under VCS in nodus).
I have posted a more detailed description of the summary page workflow, as well as instructions to run your own jobs and other technical minutiae, on the wiki: https://wiki-40m.ligo.caltech.edu/DailySummaryHelp
For the past couple of days, the summary pages have shown minute trend data disappear at 12:00 UTC (05:00 AM local time). This seems to be the case for all channels that we plot, see e.g. https://nodus.ligo.caltech.edu:30889/detcharsummary/day/20150724/ioo/. Using Dataviewer, Koji has checked that indeed the frames seem to have disappeared from disk. The data come back at 24 UTC (5pm local). Any ideas why this might be?
A new type of plot is now available for use in the summary pages, based on EricQ's 2D histogram plots (elog 11210). I have added an example of this to the SandBox tab (https://nodus.ligo.caltech.edu:30889/detcharsummary/day/20151119/sandbox/). The usage is straightforward: the name to be used in config files is histogram2d; the first channel corresponds to the x-axis and the second one to the y-axis; the options accepted are the same as numpy.histogram2d and pyplot.pcolormesh (besides plot limits, titles, etc.). The default colormap is inferno_r and the shading is flat.
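For illustration only, this is roughly what those underlying calls do (this is not the gwsumm plotting class itself, and the data here are random placeholders):
# Sketch of the plot this produces: a 2D histogram of channel y vs channel x,
# binned with numpy.histogram2d and rendered with pcolormesh.
import matplotlib
matplotlib.use('Agg')                         # render without a display
import numpy
import matplotlib.pyplot as plt

x = numpy.random.normal(size=10000)           # stand-in for the first (x-axis) channel
y = numpy.random.normal(size=10000)           # stand-in for the second (y-axis) channel

counts, xedges, yedges = numpy.histogram2d(x, y, bins=100)
plt.pcolormesh(xedges, yedges, counts.T, cmap='inferno_r', shading='flat')
plt.colorbar(label='counts')
plt.savefig('histogram2d_example.png')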
I have added a new cron job in pcdev1 at CIT using the 40m shared account. This will run the /home/40m/DetectorChar/bin/cleanarchive script one minute past midnight on the first of every month. The script removes GWsumm archive files older than 1 month old.
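For reference, the corresponding crontab entry would look something like the line below (the exact entry in the pcdev1 crontab is paraphrased here, not copied):
1 0 1 * * /home/40m/DetectorChar/bin/cleanarchive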
I have modified the c1summary.ini and c1lsc.ini configuration files slightly to avoid overloading the system and to remove the errors that were preventing plots from being updated after a certain time of day.
The changes made are the following:
1- all high-resolution spectra from the Summary and LSC tabs are now computed for each state (X-arm locked, Y-arm locked, IFO locked, all);
2- I've removed MICH, PRCL & SRCL from the summary spectrum (those can still be found in the LSC tab);
3- I've split LSC into two subtabs.
The reason for these changes is that having high resolution (raw channels, 16kHz) spectra for multiple (>3) channels on a single tab requires a *lot* of memory to process. As a result, those jobs were failing in a way that blocked the queue, so even other "healthy" tabs could not be updated.
My changes, reflected from May 25 on, should hopefully fix this. As always, feel free to reorganize the ini files to make the pages more useful to you, but keep in mind that we cannot support multiple high-resolution spectra on a single tab, as explained above.
This should be fixed now—apologies for the spam.
I don't know much about how the cron job runs, I'll forward this to Max.
I started to receive emails from cron every 15min. Is the email related to this? And is it normal? I never received these cron emails before when the sum-page was running.
Summary pages down today due to scheduled LDAS cluster maintenance. The pages will be back automatically once the servers are back (by tomorrow).
The system is back from maintenance and the pages for last couple of days will be filled retroactively by the end of the week.
I've re-submitted the Condor job; pages should be back within the hour.
Been non-functional for 3 weeks. Anyone else notice this? Images missing since ~Sep 21.
Summary pages will be unavailable today due to LDAS server maintenance. This is unrelated to the issue that Rana reported.
The summary pages were not successfully generated for a long period of time at the end of 2016 due to syntax errors in the PEM and Weather configuration files.
These errors caused the INI parser to crash and brought down the whole gwsumm system. It seems that changes in the configuration of the Condor daemon at the CIT clusters may have made our infrastructure less robust against these kinds of problems (which would explain why there wasn't a better error message/alert), but this requires further investigation.
In any case, the solution was as simple as correcting the typos in the config side (on the nodus side) and restarting the cron jobs (on the cluster side, by doing `condor_rm 40m && condor_submit DetectorChar/condor/gw_daily_summary.sub`). Producing pages for the missing days will take some time (how to do so for a particular day is explained in the wiki https://wiki-40m.ligo.caltech.edu/DailySummaryHelp).
RXA: later, Max sent us this secret note:
However, I realize it might not be clear from the page which are the key steps. These are just running:
1) Run ./DetectorChar/bin/gw_daily_summary --day YYYYMMDD --file-tag some_custom_tag to create pages for day YYYYMMDD (the file-tag option is not strictly necessary, but will prevent conflicts with other instances of the code running simultaneously).
2) Sync those days back to nodus by doing, e.g.: ./DetectorChar/bin/pushnodus 20160701 20160702
This must all be done from the cluster using the 40m shared account.
./DetectorChar/bin/gw_daily_summary --day YYYYMMDD --file-tag some_custom_tag
./DetectorChar/bin/pushnodus 20160701 20160702
System-wide CIT LDAS cluster maintenance may cause disruptions to summary pages today.
LDAS has not recovered from maintenance causing the pages to remain unavailable until further notice.
> System-wide CIT LDAS cluster maintenance may cause disruptions to summary pages today.
FYI this issue has still not been solved, but the pages are working because I got the software running on an
alternative headnode (pcdev2). This may cause unexpected behavior (or not).
> LDAS has not recovered from maintenance causing the pages to remain unavailable until further notice.
> > System-wide CIT LDAS cluster maintenance may cause disruptions to summary pages today.
There has been a change in the default format for the output of the condor_q command at CIT clusters. This could be problematic for the summary page status monitor, so I have disabled the default behavior in favor of the old one. Specifically, I ran the following commands from the 40m shared account:
mkdir -p ~/.condor
echo "CONDOR_Q_DASH_BATCH_IS_DEFAULT=False" >> ~/.condor/user_config
This should have no effect on the pages themselves.
Using the three Marconis in the 40m at 11.1 MHz, the Three Cornered Hat (TCH) technique was used to find the individual noise of each Marconi, for different offset ranges and for the direct vs. indirect frequency reference from the rubidium clock.
Rana explained the TCH technique earlier - by measuring the phase noise of each pair of Marconis, the individual phase noise can be calculated by:
S1 = sqrt( (S12^2 + S13^2 - S23^2) / 2)
S2 = sqrt( (S12^2 + S23^2 - S13^2) / 2)
S3 = sqrt( (S13^2 + S23^2 - S12^2) / 2)
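As a worked sketch of that arithmetic (the pairwise spectra S12, S13, S23 would be numpy arrays over frequency; the placeholder values in the usage line are just for illustration):
import numpy

def three_cornered_hat(S12, S13, S23):
    # recover the individual phase noise spectra from the pairwise measurements
    S1 = numpy.sqrt((S12**2 + S13**2 - S23**2) / 2.0)
    S2 = numpy.sqrt((S12**2 + S23**2 - S13**2) / 2.0)
    S3 = numpy.sqrt((S13**2 + S23**2 - S12**2) / 2.0)
    return S1, S2, S3

# example with made-up numbers:
# S1, S2, S3 = three_cornered_hat(numpy.array([1.4e-6]), numpy.array([1.5e-6]), numpy.array([1.3e-6]))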
I measured the phase noise for offset ranges of 1 Hz, 10 Hz, 1 kHz, and 100 kHz (the maximum allowed for a frequency of 11.1 MHz) and calculated the individual phase noise for each source (using 7 averages, which accounts for the spikes in the individual noise curves). The noise from each source is very similar, although not quite identical. The noise is greater at higher frequencies for higher offset ranges, so the lowest possible offset range should be used. The noise below a range of 10 Hz appears fairly constant, with a smoother curve at 10 Hz.
The phase noise for direct vs indirect frequency source was measured with an offset range of 10Hz. While very similar at high and low frequencies for all 3 Marconis, the indirect source was consistently noisier in the middle frequencies, indicating that any Marconis connected to the rubidium clock should use the rubidium clock as a direct frequency reference.
Since I can't adjust settings of the Marconis at the moment, I have yet to finish measurements of the phase noise at 160 MHz and 80 MHz (those used in the PSL lab), but using the data I have for only the first 2 Marconis (so I can't finish the TCH technique), the phase noise appears to be lowest using the 100kHz offset except at the higher frequencies. The 160 MHz signal so far is noisier than the 11.1 MHz signal with offset ranges of 1 kHz and 10 Hz, but less noisy with a 100 kHz offset.
I still haven't measured anything at 80 MHz, and I have to finish taking more data to be able to use the TCH technique at 160 MHz; then the individual phase noise data will be used to measure the noise of the function generators used in the PSL lab.