Message ID: 15812     Entry time: Wed Feb 17 13:59:35 2021
Author: gautam 
Type: Update 
Category: DetChar 
Subject: Summary pages 

The summary pages had failed because of a conda env change. We depend on detchar's conda environment to run the scripts on the cluster, and for some reason, when they upgraded to python3.9, they removed the python3.7 env - this was the cause of the original failure of the summary pages a couple of weeks ago. Here are some thoughts on how the pipeline can be made more robust.

  1. The status checking is pretty hacky at the moment.
    • I recommend not shelling out from python to check whether any condor jobs are "held".
    • It is better to use the dedicated htcondor python bindings. I have used these to plot the job durations, and they have worked well; a rough sketch is below.
    • One caveat is that there is sometimes a long delay between making a request via a python command and condor actually returning the status. So you may have to experiment with execution times and build some try/except logic to catch "failures" that are just the condor command timing out and not an actual failure of the summary jobs.
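A minimal sketch of what a bindings-based check could look like - the owner name "40m", the attribute list, and the error handling are my assumptions, so check them against how the summary jobs are actually submitted:

import htcondor

def get_held_jobs(owner="40m"):
    """Return the ClassAds of held condor jobs belonging to `owner`.

    JobStatus == 5 is condor's code for "Held". The owner name is an
    assumption - use whatever account the summary jobs run under.
    """
    schedd = htcondor.Schedd()
    try:
        # positional args (constraint, attribute list) work across old
        # and new versions of the python bindings
        return schedd.query(
            'Owner == "{}" && JobStatus == 5'.format(owner),
            ["ClusterId", "ProcId", "HoldReason", "HoldReasonCode"],
        )
    except Exception as exc:
        # condor can take a long time to answer; treat this as "status
        # unknown" rather than as a failed summary job
        print("condor query did not return cleanly: {}".format(exc))
        return None
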
  2. The status check should also include a mailer which emails the 40m list when a job is held.
    • With htcondor and python, I think it's easy to also get the "hold reason" for the job and include that in the mail (sketch below).
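A possible shape for the mailer, assuming the held-job ClassAds from the sketch above - the list address, sender address and SMTP host below are placeholders, not the real ones:

import smtplib
from email.message import EmailMessage

def mail_held_jobs(held_jobs, to_addr="40m-list@example.org", smtp_host="localhost"):
    """Email a summary of held condor jobs to the 40m list.

    `held_jobs` is a list of ClassAds (e.g. from get_held_jobs above);
    the addresses and the SMTP host are placeholders.
    """
    lines = [
        "Job {}.{} is held: {}".format(
            ad.get("ClusterId"), ad.get("ProcId"), ad.get("HoldReason", "unknown")
        )
        for ad in held_jobs
    ]
    msg = EmailMessage()
    msg["Subject"] = "[40m summary pages] condor job(s) held"
    msg["From"] = "summary-pages@example.org"
    msg["To"] = to_addr
    msg.set_content("\n".join(lines))
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
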
  3. The job execution time check is not working anymore - for whatever reason, the condor_history command can't seem to apply the constraint of selecting only jobs run by "40m", although running it without the constraint shows that such jobs certainly exist. This probably has to do with some recent upgrade of the condor version. It should be fixed; the python bindings offer an alternative route, sketched below.
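As a cross-check on the condor_history problem, the same constraint can be applied through the python bindings; the owner name, attribute names and job count here are assumptions:

import htcondor

def recent_40m_job_durations(max_jobs=20):
    """Print run durations of recently completed "40m" jobs, pulled
    from condor history via the python bindings.
    """
    schedd = htcondor.Schedd()
    # positional args: (constraint, attribute list, max number of ads)
    ads = schedd.history(
        'Owner == "40m"',
        ["ClusterId", "JobStartDate", "CompletionDate"],
        max_jobs,
    )
    for ad in ads:
        start, end = ad.get("JobStartDate"), ad.get("CompletionDate")
        if start and end:
            print("Job {}: ran for {} s".format(ad.get("ClusterId"), end - start))
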
  4. We should clear the archive files regularly. 
    • The 40m home directory on the cluster was getting full. 
    • The summary page jobs generate a .h5 archive of all the data used to generate the plots. Over ~1 year, this amounts to ~1TB.
    • I added the cleanArchive job to the crontab, but it should be checked.
    • Do we even need these archives beyond 1 day? I think they make the plotting faster by saving already-downloaded data locally, but maybe we should have the cron job delete all archives older than a day (a sketch of such a cleanup is below).
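Something along these lines could do the cleanup - the archive directory and the retention period are placeholders, and it should be compared against whatever the existing cleanArchive script on nodus actually does:

import os
import time

def clean_archives(archive_dir="/path/to/summary/archive", max_age_days=1):
    """Delete .h5 archive files older than `max_age_days` days.

    The directory and retention period are placeholders; adjust them to
    the real archive location and to however the cron job is set up.
    """
    cutoff = time.time() - max_age_days * 86400
    for name in os.listdir(archive_dir):
        path = os.path.join(archive_dir, name)
        if name.endswith(".h5") and os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
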
  5. Can we maintain our own copy of the conda env and not be dependent on the detchar conda env? The downside is that if something dramatic changes in gwsumm, we are responsible for debugging it ourselves.

Remember that all the files are to be edited on nodus and not on the cluster.
