Message ID: 9511     Entry time: Tue Dec 31 23:19:58 2013     Reply to this: 9513   9514   9520
Author: Koji 
Type: Summary 
Category: General 
Subject: linux1 RAID crash & recovery 

On Dec 22, between 6AM and 7AM, a physical or logical failure occurred on the 4th disk in the RAID array on linux1.
This caused the RAID to fall into read-only mode. All of the hosts that depend on linux1 via NFS were affected by the incident.
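
A read-only fallback of this kind can be seen on a Linux host in /proc/mounts, where the fourth field lists the mount options. The following is a minimal sketch, not the check that was actually used during the incident; the function name is_readonly and the sample mount table are hypothetical and only serve to try the check without touching a real system.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given mount point is mounted read-only.
# The mount table path is an argument (normally /proc/mounts) so that the
# check can be exercised against a sample file.
is_readonly() {
    mountpoint="$1"
    table="${2:-/proc/mounts}"
    # Field 2 is the mount point, field 4 the comma-separated option list.
    awk -v mp="$mountpoint" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "ro") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$table"
}

# Sample table with illustrative values only (not copied from linux1).
sample=$(mktemp)
printf '%s\n' \
    '/dev/sda1 /home ext3 ro 0 0' \
    '/dev/sdb1 /tmpraid ext3 rw,noatime 0 0' > "$sample"

if is_readonly /home "$sample"; then
    echo "/home is read-only"
fi
```

On a live host the second argument can be omitted so that /proc/mounts is read directly.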

Today the system was recovered. The failed filesystem was restored by copying all of the files (1.3TB total) from the RAID to a 2TB SATA disk.
The dependent hosts were restarted, and we recovered elog/wiki access as well as the interferometer control system.


Recovery process

o Recover the access to linux1

- Connect an LCD display to the host. The keyboard is already connected and sits on the machine.
- One can log in to linux1 from one of the virtual consoles, which can be switched with Alt+1/2/3 etc.
- The device file of the RAID is /dev/sda1
- The boot did not proceed normally because mounting the disks according to /etc/fstab failed.
- The 40m root password was used to log in to the filesystem recovery mode.
- Use the following command to make /etc/fstab editable

# mount -o rw,remount /

- To make the normal reboot succeed, the line for the RAID in /etc/fstab needed to be commented out.
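
Commenting out an fstab entry can also be done non-interactively. The following is a sketch under stated assumptions, not the procedure that was used here (the entry only says the line was commented out): the function name comment_out_fstab_entry is hypothetical, and the example runs on a scratch file rather than the live /etc/fstab.

```shell
#!/bin/sh
# Hypothetical helper: comment out the fstab entry for a given device.
# The fstab path is an argument so that the edit can be tried on a copy first.
comment_out_fstab_entry() {
    device="$1"   # e.g. /dev/sda1
    fstab="$2"    # e.g. a scratch copy of /etc/fstab
    # Prefix '#' to any uncommented line whose first field is the device.
    # A backup of the unmodified file is kept as "$fstab.bak".
    sed -i.bak "s|^${device}[[:space:]]|#&|" "$fstab"
}

# Try it on a scratch file rather than the live /etc/fstab.
scratch=$(mktemp)
printf '%s\n' \
    '/dev/VolGroup00/LogVol00 / ext3 defaults 1 1' \
    '/dev/sda1 /home ext3 defaults 0 1' > "$scratch"

comment_out_fstab_entry /dev/sda1 "$scratch"
cat "$scratch"
```

Only the line starting with the named device gets the leading '#'; other entries are left untouched.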

o Connect the external disk on linux1

- Brought a spare 2TB SATA disk from rossa.
- Connect the disk via a USB-SATA enclosure (/dev/sdd1)
- Mount the 2TB disk on /tmpdisk
- Run the following command for the duplication

# rsync -aHuv --progress /home/ /tmpdisk/ >/rsync_KA_20131229_0230.log

- Because of the slow SCSI I/F, the copy rate was limited to ~6MB/s. The copy started on the 27th and finished on the 31st.
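
As a rough consistency check on these numbers: at 6MB/s, 1.3TB needs at least about 60 hours (2.5 days) of uninterrupted transfer, so a copy spanning several days is expected. A sketch of the arithmetic, assuming decimal units (1.3TB = 1,300,000MB) and a constant rate:

```shell
#!/bin/sh
# Lower bound on the copy time at the quoted rate.
# Assumption: decimal units, i.e. 1.3 TB = 1,300,000 MB, and a constant rate.
total_mb=1300000
rate_mb_per_s=6

seconds=$(( total_mb / rate_mb_per_s ))   # integer division
hours=$(( seconds / 3600 ))

echo "about ${hours} hours"
```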

o Restart linux1

- It was found that linux1 couldn't boot while the USB drive was connected.
- The machine has two SATA ports. These two were used for another RAID array that is not actually in use (/oldhome).
- linux1 was pulled out from the shelf in order to remove the two SATA disks.
- The 2TB disk was installed on SATA port 0.
- linux1 was restarted, but it didn't boot because the new disk was recognized as the boot disk.
- The BIOS setting was changed so that the 80GB PATA disk is recognized as the boot disk.
- The boot process fell into the filesystem recovery mode again. /etc/fstab was modified as follows.

/dev/VolGroup00/LogVol00 /                ext3    defaults        1 1
LABEL=/boot              /boot            ext3    defaults        1 2
devpts                   /dev/pts         devpts  gid=5,mode=620  0 0
tmpfs                    /dev/shm         tmpfs   defaults        0 0
proc                     /proc            proc    defaults        0 0
sysfs                    /sys             sysfs   defaults        0 0
/dev/VolGroup00/LogVol01 swap             swap    defaults        0 0
#/dev/md0                 /oldhome         ext3    defaults        0 1
/dev/sda1                /home            ext3    defaults        0 1
#/dev/sdb1                /tmpraid         ext3    defaults        0 1

- Another reboot brought the operating system up as usual.

o What happened to the RAID?

- Hot removal of the disk #4.
- Hot plug of the disk #4.
- Disk #4 started to get rebuilt -> rebuilding done after ~3 hours
- After this the system was marked as "clean". Now the RAID (/dev/sdb1) can be mounted as usual.


o Nodus

- The root password of nodus is not known.
- Connect an LCD monitor and a Sun keyboard to nodus.
- Type Stop-A. This puts nodus into the monitor mode.
- Type sync.
- This makes the system reboot.
