Given some of the things we've facing lately, it occurs to me that we could be better served by having some sort of unified human-alerting scheme in place, for things like:
- Local/offsite backup failures
- Vaccumm system problems
- HDD status for things like /frames/ and /cvs/cds/, whether the disks are full, or their SMART status indicates imminent mechanical failure
Currently, many of these things are just checked sporadically when it occurs to someone to do so, or when debugging random issues. Smoother IFO operation and peace of mind could be gained if we're confident that the relevant people are notified in a timely manner.
Thoughts? Suggestions on other things to monitor, like maybe frontend/model crashes? |