I replaced the suspected faulty DIMM earlier today (actually I replaced a pair of them, as per the Sun Fire X4600 manual). I followed the sequence below, which is the procedure recommended by the maintenance manual and by the graphics on the top panel of the unit:
- Checked that Optimus was shut down
- Removed the power cables from the back to cut the standby power. Two of the fan units near the front of the chassis were displaying fault lights. Perhaps this has been the case since the most recent power outage, after which I did not reboot Optimus
- Took off the top cover and removed CPU 6 (labelled "G" in the unit). The manual recommends finding faulty DIMMs by looking for an LED that is supposed to indicate the location of the bad card, but I couldn't find any such LEDs in the unit we have. Perhaps this is an addition to the newer modules?
- Replaced the topmost (w.r.t. the orientation in which the CPU normally sits inside the chassis) DIMM card with one of the new ones Steve ordered
- Put everything back together and powered Optimus up again. The reboot went smoothly, and the fan unit fault lights I mentioned earlier did not come on after the reboot, so that doesn't look like an issue.
I then checked for memory errors using edac-utils and, over the last couple of hours, found no errors, corrected or otherwise (see Praful's earlier elog for the error messages that we were getting prior to the DIMM swap). I guess we will need to monitor this for a while longer before we can say that the issue has been resolved.
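For longer-term monitoring, the counters behind the edac-utils report can also be polled directly from sysfs. Below is a minimal sketch, assuming the EDAC driver is loaded and exposes per-controller ce_count and ue_count files under /sys/devices/system/edac/mc (the usual Linux location, not verified on Optimus). The function name read_edac_counts is hypothetical and only for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from an EDAC sysfs tree.

    Assumption: each memory controller directory (mc0, mc1, ...) contains
    plain-text ce_count and ue_count files, as in the standard EDAC layout.
    """
    base = Path(root)
    if not base.is_dir():
        # EDAC not loaded, or a different path on this machine
        return {}
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Running this periodically (e.g. from cron) and logging the output would show whether the corrected-error count starts climbing again after the DIMM swap.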
Looking at dmesg after the reboot, I noticed the following error messages (not related to the memory issue, I think):
[ 19.375865] k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.375996] k10temp 0000:00:19.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.376234] k10temp 0000:00:1a.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.376362] k10temp 0000:00:1b.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.376673] k10temp 0000:00:1c.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.376816] k10temp 0000:00:1d.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.376960] k10temp 0000:00:1e.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.377152] k10temp 0000:00:1f.3: unreliable CPU thermal sensor; monitoring disabled
I wonder if this could explain why the fans on Optimus often go into overdrive and make a racket? For the moment, the fan volume seems normal, comparable to the other Sun Fire X4600s we have running, like megatron and FB...
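To compare against the other machines, the k10temp messages above can be extracted from a saved dmesg dump. This is a rough sketch, assuming the line format shown in the log above; the regular expression and the function name k10temp_devices are illustrative, not part of any existing tool.

```python
import re

# Matches lines like:
# [ 19.375865] k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
K10TEMP_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+k10temp\s+"
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.\d):\s+(?P<msg>.*)"
)


def k10temp_devices(dmesg_text):
    """Return the PCI addresses whose k10temp sensor was flagged as unreliable."""
    devices = []
    for line in dmesg_text.splitlines():
        m = K10TEMP_RE.search(line)
        if m and "unreliable CPU thermal sensor" in m.group("msg"):
            devices.append(m.group("dev"))
    return devices
```

Feeding it the output of dmesg from Optimus and from megatron would show whether the same eight devices (0000:00:18.3 through 0000:00:1f.3) are flagged on both, which would suggest the messages are generic to this hardware rather than specific to Optimus.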