40m Log
Entry  Thu Sep 5 18:42:19 2019, aaron, HowTo, CDS, WFS discussion, restarting CDS 
    Reply  Thu Sep 5 20:30:43 2019, rana, HowTo, CDS, WFS discussion, restarting CDS 
       Reply  Fri Sep 6 09:40:56 2019, aaron, HowTo, CDS, WFS discussion, restarting CDS 
          Reply  Fri Sep 6 11:56:44 2019, aaron, HowTo, CDS, WFS discussion, restarting CDS B26CECF8-FC0D-4348-80DC-574B1E3A4514.jpeg
             Reply  Fri Sep 6 15:12:49 2019, Koji, HowTo, CDS, WFS discussion, restarting CDS 
             Reply  Fri Sep 6 21:22:06 2019, Koji, HowTo, CDS, How to save c1ioo 
                Reply  Fri Sep 6 22:03:30 2019, aaron, HowTo, CDS, How to save c1ioo 
                   Reply  Mon Sep 9 11:36:48 2019, aaron, HowTo, CDS, How to save c1ioo reboot.png
                      Reply  Thu Sep 19 15:59:29 2019, aaron, HowTo, CDS, How to save c1ioo 
Message ID: 14861     Entry time: Fri Sep 6 11:56:44 2019     In reply to: 14860     Reply to this: 14862   14865
Author: aaron 
Type: HowTo 
Category: CDS 
Subject: WFS discussion, restarting CDS 

Rebooting

I reset c1lsc, c1sus, and c1ioo.

I noticed that the script gives the command 'ssh c1XXX', but we have been getting no route to host using this command. Instead, the machines are currently only reachable as c1XXX.martian. I'm not sure why this is, so I just appended .martian in rebootC1LSC.sh
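As a rough illustration of that workaround (not the actual contents of rebootC1LSC.sh, which is not reproduced here), a small helper could append the domain only when it is missing. The function names `qualify` and `ssh_command` are made up for this sketch, and the default domain `martian` is taken from the hostnames above.

```python
def qualify(host, domain="martian"):
    """Return host with '.<domain>' appended, unless it already ends with it."""
    suffix = "." + domain
    return host if host.endswith(suffix) else host + suffix


def ssh_command(host, domain="martian"):
    """Build the argument list for an ssh call to the fully qualified host."""
    return ["ssh", qualify(host, domain)]


# Print the commands for the three front-ends that were reset.
for name in ("c1lsc", "c1sus", "c1ioo"):
    print(" ".join(ssh_command(name)))
```

This only changes the name handed to ssh; it does not explain why the short names stopped working.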

This time the script ran, but I got 'no route to host' on c1ioo, so I think I need to reset that machine again. After the reset, the script failed to log in to c1ioo and c1lsc.

Fri Sep 6 13:09:05 2019

After lunch, I reset the computers again and tried the script again. There is again no route to host for c1ioo. I'm going inside to shut off the power to c1ioo, since the reset button seems not to be working. I still can't log in from nodus, so I'm bringing a keyboard and monitor over to plug in directly.
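For next time, a minimal sketch of how one might wait for a front-end to come back after a reset before rerunning the script, instead of retrying by hand. It assumes the machine accepts SSH on TCP port 22 once it has booted; the names `port_open` and `wait_for_host`, and the attempt count and delay, are illustrative choices and not part of any existing 40m script.

```python
import socket
import time


def port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers 'no route to host', refused connections, timeouts, DNS failures.
        return False


def wait_for_host(host, attempts=20, delay=15.0, probe=port_open, sleep=time.sleep):
    """Poll probe(host) up to `attempts` times.

    Returns the 1-based attempt number on success, or None if the host
    never answered. `probe` and `sleep` are injectable for testing.
    """
    for attempt in range(1, attempts + 1):
        if probe(host):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling `wait_for_host("c1ioo.martian")` would then return the attempt number once port 22 answers, or None after about five minutes with the defaults above. A None result only says the port never opened; it cannot distinguish a hung boot from a network problem.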

On reset, c1ioo repeatedly reaches the screen in attachment 1 before going black. Holding down shift or ctrl+alt+f1 doesn't get me a command prompt. After waiting and searching the elog for >>3 min, we decided to follow these instructions to cycle the power of c1ioo. The same problem recurred after power up. I found some instructions online saying that the Sun Fire X4600 can hang during reboot if it has become too hot ("reboot during a thermal shutdown"); I did notice that the temperature light was on earlier in this procedure, so perhaps that is the problem. I followed the wiki instructions to shut down the computer again (pressed the power button, unplugged the 4 power supplies from the back of the machine), and left it unplugged for 10-30 min (Fri Sep 6 14:46:18 2019).

Fri Sep 6 15:03:31 2019

Rana plugged in the power supplies and reset the machine again.

Fri Sep 6 16:30:37 2019

c1ioo is still unreachable! I pressed reset once, and the reset button flashes white. The yellow warning light is still on.

Fri Sep 6 16:54:21 2019

The reset light has stopped flashing, but I still can't access c1ioo. I reset once more, this time watching c1ioo on a monitor directly. I'm still seeing the same boot screen repeatedly. I do see that CPU0 is not clocking, which seems weird.

Troubleshooting CPU module

Following gautam's elog here, I found the Sun Fire X4600 manual for locating faulty CPUs. After the white reset light stopped flashing, I held down the power button to turn off the system. Before shutdown, all of the CPUs displayed amber lights; after shutdown, only the leftmost CPU (as viewed from the back, presumably CPU0) displayed an amber light. The manual says this is evidence that the CPU or a DIMM is faulty. Following the manual, I removed the standby power, then used these Instructions for replacing the CPU to remove the CPU module; Gautam has also done this before.

Fri Sep 6 20:09:01 2019

I pulled the leftmost CPU module out, following the instructions above. The CPU module matches the physical layout and part number of the Sun Fire X4600 M2 8-DIMM CPU module; pressing the fault reminder light gives amber indicators at the DIMM ejectors, indicating faulty DIMMs (see). The other indicator LEDs did not illuminate.

I located several spare DIMMs in the digital cabinet along the Y arm (and a couple among the misc computer components in the control room), but none was the correct one for this CPU module. The DIMM is Sun PN 371-1764-01; I found it online and ordered eight. Please let me know if this is incorrect.

To protect the CPU module, I've put it in an ESD safe bag with some bubble wrap and a note. It's on the E shop bench.

Conclusion: c1ioo needs new DIMMs. I didn't find the correct part among our spares, so I ordered it.

Attachment 1: B26CECF8-FC0D-4348-80DC-574B1E3A4514.jpeg  2.790 MB  Uploaded Fri Sep 6 18:01:34 2019