Given some of the things we've facing lately, it occurs to me that we could be better served by having some sort of unified human-alerting scheme in place, for things like:
Currently, many of these things are just checked sporadically when it occurs to someone to do so, or when debugging random issues. Smoother IFO operation and peace of mind could be gained if we're confident that the relevant people are notified in a timely manner.
Thoughts? Suggestions on other things to monitor, like maybe frontend/model crashes?