On Dec 22 between 6AM and 7AM, a physical or logical failure occurred on the 4th disk in the RAID array on linux1.
This caused the RAID disk to fall into read-only mode. All of the hosts that depend on linux1 via NFS were affected by the incident.
Today the system has been recovered. The failed filesystem was restored by copying all of the files (1.3TB total) on the RAID to a 2TB SATA disk.
The dependent hosts were restarted, and we recovered elog/wiki access as well as the interferometer control system.
Recovery process
o Recover access to linux1
- Connect an LCD display to the host. A keyboard is already connected to the machine.
- One can log in to linux1 from one of the virtual consoles, which can be switched with Alt+F1/F2/F3, etc.
- The device file of the RAID is /dev/sda1
- The boot did not proceed normally because mounting the disks according to /etc/fstab failed.
- The 40m root password was used to log in to the filesystem recovery mode.
- Use the following command to remount the root filesystem read-write so that /etc/fstab can be edited
# mount -o rw,remount /
- For the normal reboot to succeed, the line for the RAID in /etc/fstab needed to be commented out.
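The fstab edit above can also be sketched as a small shell helper. This is only an illustration: the function name comment_out_fstab_entry is hypothetical, it assumes GNU sed (for the -i option), and it is meant to be tried on a copy of the file before touching /etc/fstab itself.

```shell
# Hypothetical helper: comment out the fstab line that starts with a given device.
# Usage: comment_out_fstab_entry DEVICE FSTAB_FILE
comment_out_fstab_entry() {
    device="$1"
    fstab="$2"
    # Keep a backup next to the file, then prefix the matching line with '#'.
    # '|' is used as the sed delimiter because device paths contain '/'.
    cp "$fstab" "$fstab.bak" &&
    sed -i "s|^${device}[[:space:]]|#&|" "$fstab"
}
```

For example, comment_out_fstab_entry /dev/sda1 /tmp/fstab.copy would disable the RAID entry in a copy of the file, leaving all other lines untouched.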
o Connect the external disk to linux1
- Brought a spare 2TB SATA disk from rossa.
- Connect the disk via a USB-SATA enclosure (/dev/sdd1)
- Mount the 2TB disk on /tmpdisk
- Run the following command for the duplication
# rsync -aHuv --progress /home/ /tmpdisk/ >/rsync_KA_20131229_0230.log
- Because of the slow SCSI I/F, the copy rate was limited to ~6MB/s. The copy started on the 27th and finished on the 31st.
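As a rough cross-check of the copy time, the best-case duration can be estimated from the total size and the rate limit. The sketch below assumes decimal units (1 GB = 1000 MB) and a sustained rate; the function name estimate_copy_hours is hypothetical, and the integer arithmetic rounds down.

```shell
# Hypothetical helper: estimate copy time in whole hours.
# Usage: estimate_copy_hours SIZE_IN_GB RATE_IN_MB_PER_S
estimate_copy_hours() {
    size_gb="$1"
    rate_mb_s="$2"
    # GB -> MB, divide by the rate to get seconds, then convert to hours.
    echo $(( size_gb * 1000 / rate_mb_s / 3600 ))
}

estimate_copy_hours 1300 6   # prints 60
```

About 60 hours (2.5 days) is the best case at the full ~6MB/s for 1.3TB, which is consistent with a copy that spanned several days.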
o Restart linux1
- It was found that linux1 could not boot while the USB drive was connected.
- The machine has two SATA ports. Both were occupied by the disks of another RAID array (/oldhome) that is no longer in use.
- linux1 was pulled out from the shelf in order to remove the two SATA disks.
- The 2TB disk was installed on SATA port 0.
- linux1 was restarted but did not boot because the new disk was recognized as the boot disk.
- The BIOS setting was changed so that the 80GB PATA disk is recognized as the boot disk.
- The boot process dropped into the filesystem recovery mode again. /etc/fstab was modified as follows.
/dev/VolGroup00/LogVol00 / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
devpts /dev/pts devpts gid=5,mode=620 0 0
tmpfs /dev/shm tmpfs defaults 0 0
proc /proc proc defaults 0 0
sysfs /sys sysfs defaults 0 0
/dev/VolGroup00/LogVol01 swap swap defaults 0 0
#/dev/md0 /oldhome ext3 defaults 0 1
/dev/sda1 /home ext3 defaults 0 1
#/dev/sdb1 /tmpraid ext3 defaults 0 1
- After another reboot, the operating system started as usual.
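Before rebooting with a modified fstab, it can help to list which entries are still active. The following is a minimal sketch, not part of the procedure actually used; the function name active_fstab_mounts is hypothetical, and it simply skips commented and empty lines.

```shell
# Hypothetical helper: print "device mountpoint" for every active fstab entry.
# Usage: active_fstab_mounts FSTAB_FILE
active_fstab_mounts() {
    # Skip lines whose first non-blank character is '#', and lines with
    # fewer than two fields (blank lines, stray characters).
    awk '!/^[[:space:]]*#/ && NF >= 2 { print $1, $2 }' "$1"
}
```

Run against the fstab shown above, this would list /dev/sda1 on /home but not the commented-out /dev/md0 and /dev/sdb1 entries.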
o What happened to the RAID?
- Hot removal of the disk #4.
- Hot plug of the disk #4.
- Disk #4 started rebuilding -> the rebuild finished after ~3 hours
- After this, the array was marked as "clean". The RAID (/dev/sdb1) can now be mounted as usual.
o Nodus
- Root password of nodus is not known.
- Connect an LCD monitor and a Sun keyboard to nodus.
- Type Stop-A. This puts nodus into the monitor mode.
- Type sync.
- This makes the system reboot.