At around 2:30pm today something brought down most of the martian network. All control room workstations, nodus, etc. were unresponsive. After poking around for a bit I finally figured it had to be linux1, which serves the NFS filesystem for all the important CDS stuff. linux1 was indeed completely unresponsive.
Looking closer I noticed that the Fibrenetix FX-606-U4 SCSI hardware RAID device connected to linux1 (see #1901), which holds cds network filesystem, was showing "IDE Channel #4 Error Reading" on it's little LCD display. I assumed this was the cause of the linux1 crash.
I hard shutdown linux1, and powered off the Fibrenetix device. I pulled the disk from slot 4 and replaced it with one of the spares we had in the control room cabinets. I powered the device back up and it beeped for a while. Unfortunately the device was requiring a password to access it from the front panel, and I could find no manual for the device in the lab, nor does the manufacturer offer the manual on it's web site.
Eventually I was able to get linux1 fully rebooted (after some fscks) and it seemed to mount the hardware RAID (as /dev/sdc1) fine. The brought the NFS back. I had to reboot nodus to get it recovered, but all the control room and front-end linux machines seemed to recover on their own (although the front-ends did need an mxstream restart).
The remaining problem is that the linux1 hardware RAID device is still currently unaccessible, and it's not clear to me that it's actually synced the new disk that I put in it. In other words I have very little confidence that we actually have an operational RAID for /opt/rtcds. I've contacted the LDAS guys (ie. Dan Kozak) who are managing the 40m backup to confirm that the backup is legit. In the mean time I'm going to spec out some replacement disks onto which to copy /opt/rtcds, and also so that we can get rid of this old SCSI RAID thing. |