40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Entry  Fri Feb 22 20:28:17 2013, Jamie, Update, Computers, linux1 dead, then undead FX-606-U4_1205.pdf
    Reply  Sat Feb 23 14:04:07 2013, Koji, Update, Computers, apache retarted (Re: linux1 dead, then undead) 
Message ID: 8140     Entry time: Fri Feb 22 20:28:17 2013     Reply to this: 8144
Author: Jamie 
Type: Update 
Category: Computers 
Subject: linux1 dead, then undead 

At around 2:30pm today something brought down most of the martian network.  All control room workstations, nodus, etc. were unresponsive.  After poking around for a bit I finally figured it had to be linux1, which serves the NFS filesystem for all the important CDS stuff.  linux1 was indeed completely unresponsive.

Looking closer I noticed that the Fibrenetix FX-606-U4 SCSI hardware RAID device connected to linux1 (see #1901), which holds cds network filesystem, was showing "IDE Channel #4 Error Reading" on it's little LCD display.  I assumed this was the cause of the linux1 crash.

I hard shutdown linux1, and powered off the Fibrenetix device.  I pulled the disk from slot 4 and replaced it with one of the spares we had in the control room cabinets.  I powered the device back up and it beeped for a while.  Unfortunately the device was requiring a password to access it from the front panel, and I could find no manual for the device in the lab, nor does the manufacturer offer the manual on it's web site.

Eventually I was able to get linux1 fully rebooted (after some fscks) and it seemed to mount the hardware RAID (as /dev/sdc1) fine.  The brought the NFS back.  I had to reboot nodus to get it recovered, but all the control room and front-end linux machines seemed to recover on their own (although the front-ends did need an mxstream restart).

The remaining problem is that the linux1 hardware RAID device is still currently unaccessible, and it's not clear to me that it's actually synced the new disk that I put in it.  In other words I have very little confidence that we actually have an operational RAID for /opt/rtcds.  I've contacted the LDAS guys (ie. Dan Kozak) who are managing the 40m backup to confirm that the backup is legit.  In the mean time I'm going to spec out some replacement disks onto which to copy /opt/rtcds, and also so that we can get rid of this old SCSI RAID thing.

Attachment 1: FX-606-U4_1205.pdf  561 kB  | Hide | Hide all
ELOG V3.1.3-