40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Entry  Fri Jul 1 17:51:28 2016, Praful, Summary, Electronics, Replacing DIMM on Optimus 
    Reply  Mon Nov 14 14:21:06 2016, gautam, Summary, CDS, Replacing DIMM on Optimus 
       Reply  Mon Nov 14 19:32:51 2016, rana, Summary, CDS, Replacing DIMM on Optimus 
Message ID: 12613     Entry time: Mon Nov 14 14:21:06 2016     In reply to: 12239     Reply to this: 12615
Author: gautam 
Type: Summary 
Category: CDS 
Subject: Replacing DIMM on Optimus 

I replaced the suspected faulty DIMM earlier today (actually I replaced a pair of them as per the Sun Fire X4600 manual). I did things in the following sequence, which was the recommended set of steps according to the maintenance manual and also the set of graphics on the top panel of the unit:

  1. Checked that Optimus was shut down
  2. Removed the power cables from the back to cut the standby power. Two of the fan units near the front of the chassis were displaying fault lights, perhaps this has been the case since the most recent power outage after which I did not reboot Optimus
  3. Took off the top cover, removed CPU 6 (labelled "G" in the unit). The manual recommends finding faulty DIMMs by looking for an LED that is supposed to indicate the location of the bad card, but I couldn't find any such LEDs in the unit we have, perhaps this is an addition to the newer modules?
  4. Replaced the topmost (w.r.t the orientation the CPU normally sits inside the chassis) DIMM card with one of the new ones Steve ordered
  5. Put everything back together, powered Optimus up again. Reboot went smoothly, fan unit fault lights which I mentioned earlier did not light up on the reboot so that doesn't look like an issue.

I then checked for memory errors using edac-utils, and over the last couple of hours, found no errors (corrected or otherwise, see Praful's earlier elog for the error messages that we were getting prior to the DIMM swap)- I guess we will need to monitor this for a while more before we can say that the issue has been resolved.

Looking at dmesg after the reboot, I noticed the following error messages (not related to the memory issue I think):

[   19.375865] k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
[   19.375996] k10temp 0000:00:19.3: unreliable CPU thermal sensor; monitoring disabled
[   19.376234] k10temp 0000:00:1a.3: unreliable CPU thermal sensor; monitoring disabled
[   19.376362] k10temp 0000:00:1b.3: unreliable CPU thermal sensor; monitoring disabled
[   19.376673] k10temp 0000:00:1c.3: unreliable CPU thermal sensor; monitoring disabled
[   19.376816] k10temp 0000:00:1d.3: unreliable CPU thermal sensor; monitoring disabled
[   19.376960] k10temp 0000:00:1e.3: unreliable CPU thermal sensor; monitoring disabled
[   19.377152] k10temp 0000:00:1f.3: unreliable CPU thermal sensor; monitoring disabled

I wonder if this could explain why the fans on Optimus often go into overdrive and make a racket? For the moment, the fan volume seems normal, comparable to the other SunFire X4600s we have running like megatron and FB...

ELOG V3.1.3-