Entry  Tue May 19 11:15:09 2015, ericq, Update, Computer Scripts / Programs, Chiara Backup Hiccup 
    Reply  Tue May 19 11:24:44 2015, ericq, Update, Computer Scripts / Programs, Notification Scheme 
    Reply  Wed May 27 15:20:54 2015, ericq, Update, Computer Scripts / Programs, Chiara Backup Hiccup 
       Reply  Fri May 29 11:28:42 2015, ericq, Update, Computer Scripts / Programs, Chiara Backup Hiccup 
          Reply  Fri May 29 12:49:53 2015, Koji, Update, Computer Scripts / Programs, Chiara Backup Hiccup 
             Reply  Fri May 29 15:12:39 2015, Koji, Update, Computer Scripts / Programs, Chiara Backup Hiccup backup_hours.pdf
Message ID: 11308     Entry time: Tue May 19 11:24:44 2015     In reply to: 11307
Author: ericq 
Type: Update 
Category: Computer Scripts / Programs 
Subject: Notification Scheme 

Given some of the things we've been facing lately, it occurs to me that we could be better served by having some sort of unified human-alerting scheme in place, for things like:

  • Local/offsite backup failures
  • Vacuum system problems
  • HDD status for things like /frames/ and /cvs/cds/: whether the disks are full, or whether their SMART status indicates imminent mechanical failure
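
As a rough illustration of the disk-full item, here is a minimal Python sketch of what such a check might look like. The function name `disk_alerts`, the 90% threshold, and the `usage_fn` hook are all hypothetical choices for illustration, not an existing script on any of our machines. How the resulting messages would reach people (email, text, etc.) is left open.

```python
import shutil


def disk_alerts(paths, threshold=0.90, usage_fn=shutil.disk_usage):
    """Return warning strings for paths that are at least `threshold` full.

    `threshold` (a fraction of total capacity) is an assumed value.
    `usage_fn` defaults to shutil.disk_usage and can be swapped for testing.
    """
    alerts = []
    for path in paths:
        try:
            usage = usage_fn(path)
        except OSError as err:
            # An unreachable mount is itself worth telling a human about.
            alerts.append(f"{path}: could not check ({err})")
            continue
        if usage.total == 0:
            # Avoid dividing by zero on pseudo-filesystems.
            continue
        frac = usage.used / usage.total
        if frac >= threshold:
            alerts.append(f"{path}: {frac:.0%} full")
    return alerts


if __name__ == "__main__":
    # Paths taken from the list above; adjust for the machine this runs on.
    for line in disk_alerts(["/frames", "/cvs/cds"]):
        print(line)
```

Something like this could be run periodically on each machine, with any non-empty output forwarded to whoever is on the notification list.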

Currently, many of these things are just checked sporadically when it occurs to someone to do so, or when debugging random issues. Smoother IFO operation and peace of mind could be gained if we're confident that the relevant people are notified in a timely manner. 

Thoughts? Suggestions on other things to monitor, like maybe frontend/model crashes?
