[paco, koji, tega, ian]
Today in the morning the name server / network file system running in chiara failed. This resulted in donatella/pianosa/rossa shell prompts to hang forever. It also made sitemap crash and even dropping into a bash shell and just listing files from some directory in the file system froze the computer. Remote ssh sessions on nodus also had the same symptoms.
A little after 1 pm, we started debugging this issue with help from Koji. He suggested we hook a monitor, keyboard, and mouse onto chiara as it should still work locally even if something with the NFS (network file system) failed. We did this and then we tried for a while to unmount the /dev/sdc1/ from /home/cds/ (main file system) and mount /dev/sdb1/ from /media/40mBackup (backup copy) such that they swap places. We had no trouble unmounting the backup drive, but only succeeded in unmounting the main drive with the "lazy" unmount, or running "umount -l". Running "df" we could see that the disk space was 100% used, with only ~ 1 GB of free space which may have been the cause for the issue. After swapping these disks by editing the /etc/fstab file to implement the aforementioned swapping, we rebooted chiara and we recovered the shell prompts in all workstations, sitemap, etc... due to the backup drive mounting. We then started investigating what caused the main drive to fill up that quickly, and noted that weirdly now the capacity was at 85% or about 500GB less than before (after reboot and remount), so some large file was probably dumped into chiara that froze the NFS causing the issue.
At this point we tried opening the PSL shutter to recover the IMC. The shutter would not open and we suspected the vacuum interlock was still tripped... and indeed there was an uncleared error in the VAC screen. So with Koji's guidance we walked to the c1vac near the HV station and did the following at ~ 5:13 PM -->
We made sure that P1a (main vacuum pressure) was dropping and before continuing we decided to look back to see what the nominal vacuum state was that we should try to restore.
We are currently searching the two systems for diffrences to see if we can narrow down the culprit of the failure.