There has been an ongoing memory error on optimus, which keeps producing the following messages:
controls@optimus|~ >
Message from syslogd@optimus at Jun 30 14:57:48 ...
kernel:[1292439.705127] [Hardware Error]: Corrected error, no action required.
Message from syslogd@optimus at Jun 30 14:57:48 ...
kernel:[1292439.705174] [Hardware Error]: CPU:24 (10:4:2) MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc04410032080a13
Message from syslogd@optimus at Jun 30 14:57:48 ...
kernel:[1292439.705237] [Hardware Error]: MC4_ADDR: 0x0000001ad2bd06d0
Message from syslogd@optimus at Jun 30 14:57:48 ...
kernel:[1292439.705264] [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Message from syslogd@optimus at Jun 30 14:57:48 ...
kernel:[1292439.705323] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
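As a cross-check on the bracketed flags in the MC4_STATUS line, here is a minimal Python sketch that decodes the raw status word. The bit positions are an assumption on my part: the upper flags follow the architectural MCi_STATUS layout, and the two ECC bits (46:45) follow how I understand the Linux AMD decoder to read them. The function name decode_status is made up for illustration.

```python
# Assumed bit positions (architectural MCi_STATUS layout).
FLAG_BITS = {
    "Val": 63,    # register contents are valid
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (0 means corrected, "CE")
    "En": 60,     # error reporting enabled
    "MiscV": 59,  # MC4_MISC holds valid data
    "AddrV": 58,  # MC4_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_status(status):
    """Return a dict of flag name -> bool, plus an 'ECC' entry."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    # Assumption: bits 46:45 encode the ECC type, with the value 2
    # meaning a correctable ECC error and any other non-zero value
    # meaning an uncorrectable one.
    ecc = (status >> 45) & 0x3
    flags["ECC"] = {0: None, 2: "CECC"}.get(ecc, "UECC")
    return flags


if __name__ == "__main__":
    for name, value in decode_status(0xDC04410032080A13).items():
        print(name, value)
```

If these assumptions hold, the value from the log decodes to Over, MiscV and AddrV set, UC and PCC clear, and a correctable ECC error, which agrees with the string the kernel printed.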
Optimus is a Sun Fire X4600 M2 Split-Plane server. Based on these messages, the problem appears to be in memory controller (MC) 6, chip-select row (csrow) 7, channel 0. I got the same result after installing edac-utils and running edac-util -v, which reported:
mc6: csrow7: mc#6csrow#7channel#0: 287 Corrected Errors
and showed 0 errors for all other DIMMs. Each MC has 4 csrows, numbered 4-7.

I shut down optimus and looked inside. It has 8 CPU slots lined up horizontally, each with 4 DIMMs stacked vertically and 4 empty DIMM slots beneath them. My guess is that each of the 8 CPU slots has its own memory controller (0-7) and that the csrow corresponds to the position in the vertical stack, with csrow 7 being the topmost DIMM. If that is right, MC 6, csrow 7 would be the topmost DIMM on the 7th memory controller. The channel would then indicate which DIMM of the pair is faulty, although when the DIMM is replaced, both channels 0 and 1 would be swapped out together.

Here are some sources that I used:
http://docs.oracle.com/cd/E19121-01/sf.x4600/819-4342-18/html/z40007f01291423.html#i1287456
https://siliconmechanics.zendesk.com/hc/en-us/articles/208891966-Identify-Bad-DIMM-from-EDAC
http://martinstumpf.com/how-to-diagnose-memory-errors-on-amd-x86_64-using-edac/
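To make the edac-util check repeatable, here is a rough Python sketch that picks the non-zero entries out of saved edac-util -v output and applies my guessed physical mapping. The regular expression only covers the per-channel line format shown above. The function names, the zero-count sample line, and the mapping itself (csrows 4-7, csrow 7 on top, MC n on CPU slot n+1) are assumptions and have not been confirmed against the hardware documentation.

```python
import re

# Matches per-channel lines in the form seen above, e.g.
# "mc6: csrow7: mc#6csrow#7channel#0: 287 Corrected Errors"
LINE_RE = re.compile(
    r"mc(?P<mc>\d+): csrow(?P<csrow>\d+): \S*channel#(?P<channel>\d+): "
    r"(?P<count>\d+) Corrected Errors"
)


def failing_channels(edac_output):
    """Return (mc, csrow, channel, count) for each line with a non-zero count."""
    hits = []
    for line in edac_output.splitlines():
        m = LINE_RE.search(line)
        if m and int(m["count"]) > 0:
            hits.append(
                (int(m["mc"]), int(m["csrow"]), int(m["channel"]), int(m["count"]))
            )
    return hits


def guessed_location(mc, csrow):
    """Hypothetical mapping: (CPU slot counted from 1, DIMM counted from the top)."""
    slot = mc + 1          # MC 6 -> 7th CPU slot
    from_top = 8 - csrow   # csrow 7 -> 1st DIMM from the top, csrow 4 -> 4th
    return slot, from_top


if __name__ == "__main__":
    # The second line is from the log above; the first is a made-up
    # zero-count line in the same format, for illustration only.
    sample = (
        "mc6: csrow6: mc#6csrow#6channel#0: 0 Corrected Errors\n"
        "mc6: csrow7: mc#6csrow#7channel#0: 287 Corrected Errors\n"
    )
    for mc, csrow, channel, count in failing_channels(sample):
        print(mc, csrow, channel, count, guessed_location(mc, csrow))
```

The parsing is kept separate from the mapping so that the mapping can be corrected later without touching the rest, once the DIMM layout is verified.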
I'll find the exact replacement part needed soon.