Entry  Fri Jul 1 17:51:28 2016, Praful, Summary, Electronics, Replacing DIMM on Optimus 
    Reply  Mon Nov 14 14:21:06 2016, gautam, Summary, CDS, Replacing DIMM on Optimus 
       Reply  Mon Nov 14 19:32:51 2016, rana, Summary, CDS, Replacing DIMM on Optimus 
Message ID: 12239     Entry time: Fri Jul 1 17:51:28 2016     Reply to this: 12613
Author: Praful 
Type: Summary 
Category: Electronics 
Subject: Replacing DIMM on Optimus 

There has been an ongoing memory error on optimus, reported with the following messages:

controls@optimus|~ >
Message from syslogd@optimus at Jun 30 14:57:48 ...
 kernel:[1292439.705127] [Hardware Error]: Corrected error, no action required.

Message from syslogd@optimus at Jun 30 14:57:48 ...
 kernel:[1292439.705174] [Hardware Error]: CPU:24 (10:4:2) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc04410032080a13

Message from syslogd@optimus at Jun 30 14:57:48 ...
 kernel:[1292439.705237] [Hardware Error]: MC4_ADDR: 0x0000001ad2bd06d0

Message from syslogd@optimus at Jun 30 14:57:48 ...
 kernel:[1292439.705264] [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.

Message from syslogd@optimus at Jun 30 14:57:48 ...
 kernel:[1292439.705323] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
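
As a cross-check on the flags the kernel printed, the MC4_STATUS word above can be decoded by hand. This is only a minimal sketch: the bit positions are the standard x86 machine-check status bits plus the AMD ECC bits (46 = corrected, 45 = uncorrected), written from memory rather than taken from this entry, so treat the table as an assumption to verify against the AMD BKDG. The names `MCA_STATUS_BITS` and `decode_mca_status` are made up for illustration.

```python
# Assumed bit layout of an MCA status register (verify against the AMD BKDG).
MCA_STATUS_BITS = {
    63: "Val",    # register contents are valid
    62: "Over",   # an earlier error was overwritten
    61: "UC",     # uncorrected error (clear means corrected, "CE")
    60: "En",     # error reporting was enabled
    59: "MiscV",  # MISC register holds valid data
    58: "AddrV",  # ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
    46: "CECC",   # corrected ECC error
    45: "UECC",   # uncorrected ECC error
}


def decode_mca_status(status):
    """Return the names of the flag bits set in a 64-bit status word."""
    return [
        name
        for bit, name in sorted(MCA_STATUS_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]


flags = decode_mca_status(0xDC04410032080A13)
print(flags)  # UC absent and CECC present means a corrected ECC error
```

With UC clear and CECC set, this agrees with the "Corrected error, no action required" line in the log.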

Optimus is a Sun Fire X4600 M2 Split-Plane server. Based on this message, the issue seems to be in memory controller (MC) 6, chip-select row (csrow) 7, channel 0. I got the same result after installing edac-utils and running edac-util -v, which gave me:

mc6: csrow7: mc#6csrow#7channel#0: 287 Corrected Errors 

and said that all other DIMMs were working fine with 0 errors. Each MC has 4 csrows, numbered 4-7. I shut off optimus and checked inside: it has 8 CPU slots lined up horizontally, each with 4 DIMMs stacked vertically and 4 empty DIMM slots beneath them.

My guess is that each of the 8 CPU slots has its own memory controller (0-7) and that the csrow corresponds to the position in the vertical stack, with csrow 7 being the topmost DIMM. If so, MC 6, csrow 7 would be the topmost DIMM on the 7th memory controller. The channel would then indicate which DIMM of the pair is faulty, although if the DIMM were replaced, both channels 0 and 1 would be switched out. Here are some sources that I used:
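
For repeated checks, the edac-util -v report can be filtered down to the entries with a non-zero count. The following is a rough sketch, assuming every per-channel line has the same shape as the one quoted above. The function name `failing_dimms` is hypothetical, and the zero-count line in the sample is invented to stand in for the healthy DIMMs.

```python
import re

# Matches lines shaped like the one quoted in this entry, e.g.
#   mc6: csrow7: mc#6csrow#7channel#0: 287 Corrected Errors
EDAC_LINE = re.compile(
    r"mc#(?P<mc>\d+)csrow#(?P<csrow>\d+)channel#(?P<channel>\d+):\s*"
    r"(?P<count>\d+)\s+Corrected Errors"
)


def failing_dimms(report):
    """Return (mc, csrow, channel, count) for every entry with errors."""
    hits = []
    for line in report.splitlines():
        match = EDAC_LINE.search(line)
        if match and int(match["count"]) > 0:
            hits.append(
                (
                    int(match["mc"]),
                    int(match["csrow"]),
                    int(match["channel"]),
                    int(match["count"]),
                )
            )
    return hits


# The first line is a made-up healthy entry; the second is from this log.
sample = """\
mc6: csrow6: mc#6csrow#6channel#1: 0 Corrected Errors
mc6: csrow7: mc#6csrow#7channel#0: 287 Corrected Errors
"""

for mc, csrow, channel, count in failing_dimms(sample):
    print(f"MC {mc}, csrow {csrow}, channel {channel}: {count} corrected errors")
```

On the machine itself, the report text could come from capturing the output of edac-util -v instead of the sample string.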

http://docs.oracle.com/cd/E19121-01/sf.x4600/819-4342-18/html/z40007f01291423.html#i1287456

https://siliconmechanics.zendesk.com/hc/en-us/articles/208891966-Identify-Bad-DIMM-from-EDAC

http://martinstumpf.com/how-to-diagnose-memory-errors-on-amd-x86_64-using-edac/

I'll find the exact replacement part soon.
