The summary pages failed to generate for an extended period at the end of 2016 because of syntax errors in the PEM and Weather configuration files.
These errors caused the INI parser to crash and brought down the whole gwsumm system. It seems that changes in the configuration of the Condor daemon at the CIT clusters may have made our infrastructure less robust against these kinds of problems (which would explain why there wasn't a better error message/alert), but this requires further investigation.
In any case, the solution was as simple as correcting the typos in the configuration files (on the nodus side) and restarting the cron jobs (on the cluster side, by doing `condor_rm 40m && condor_submit DetectorChar/condor/gw_daily_summary.sub`). Producing pages for the missing days will take some time; how to do so for a particular day is explained in the wiki: https://wiki-40m.ligo.caltech.edu/DailySummaryHelp
RXA: later, Max sent us this secret note:
However, I realize it might not be clear from the page which are the key steps. These are just running:
1) `./DetectorChar/bin/gw_daily_summary --day YYYYMMDD --file-tag some_custom_tag` to create pages for day YYYYMMDD (the file-tag option is not strictly necessary, but it prevents conflicts with other instances of the code running simultaneously).
2) sync those days back to nodus by doing, e.g.: `./DetectorChar/bin/pushnodus 20160701 20160702`
This must all be done from the cluster using the 40m shared account.
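
For a longer gap, the two steps can be wrapped in a small loop. What follows is only a sketch, not something from the wiki: the function names `days_between` and `backfill_commands` and the `backfill_YYYYMMDD` file tag are made up for illustration, and it assumes GNU `date` and that it is run from the directory containing `DetectorChar/`. It only prints the commands, so they can be inspected before anything is run.

```shell
#!/usr/bin/env bash
# Sketch: build the commands needed to regenerate a range of missing days.
# Assumptions: GNU date is available; the current directory contains DetectorChar/.
set -euo pipefail

# Print every day from $1 to $2 (inclusive) as YYYYMMDD, one per line.
days_between() {
    local day="$1" last="$2"
    while [ "$day" -le "$last" ]; do
        echo "$day"
        day=$(date -u -d "$day +1 day" +%Y%m%d)
    done
}

# Print one gw_daily_summary command per day, then a single pushnodus
# command covering all of them. Nothing is executed here.
backfill_commands() {
    local days day
    days=$(days_between "$1" "$2" | tr '\n' ' ')
    for day in $days; do
        # The file tag is a hypothetical choice; any unique tag should do.
        echo "./DetectorChar/bin/gw_daily_summary --day $day --file-tag backfill_$day"
    done
    echo "./DetectorChar/bin/pushnodus ${days% }"
}
```

Once the functions are defined, `backfill_commands 20161201 20161231` lists the commands for that range; piping the output to `bash` would then run them one after another. Since each day takes a while, it is probably worth trying a single day first.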