Anyone understand how to parse EDAC errors in dmesg?

Lonni J Friedman lfriedman at nvidia.com
Sun Sep 4 19:09:14 PDT 2011


I've got a new server just deployed that started spewing EDAC
"Corrected error" messages in dmesg (496 of them thus far) under load:
EDAC MC1: CE row 4, channel 0, label "": Corrected error (Socket=1 channel=1 dimm=1)

These are basically ECC errors, which means that I've got bad RAM.
Luckily it is being detected and corrected, and isn't yet causing any
obvious problems.  Thankfully it's always the same module, so only one
is bad (which is enough).
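
For anyone who wants to pull the fields out of such a line programmatically, here is a minimal Python sketch. It assumes the exact "Corrected error" format quoted above; other EDAC drivers and kernel versions print different fields, so the regular expression is an assumption, and parse_ce_line is a hypothetical helper name. Note that the line carries two sets of coordinates (row/channel, and the parenthesised Socket/channel/dimm), which the sketch keeps separate rather than trying to reconcile.

```python
import re

# Pattern for the "Corrected error" line format quoted in this post.
# Assumption: other EDAC drivers / kernel versions use different layouts.
CE_LINE = re.compile(
    r'EDAC MC(?P<mc>\d+): CE row (?P<row>\d+), channel (?P<channel>\d+), '
    r'label "(?P<label>[^"]*)": Corrected error'
    r'(?: \((?P<details>[^)]*)\))?'
)


def parse_ce_line(line):
    """Return a dict of fields from one EDAC CE line, or None if no match."""
    # Log lines pasted into mail are often wrapped, so collapse whitespace first.
    line = " ".join(line.split())
    m = CE_LINE.search(line)
    if m is None:
        return None
    fields = {
        "mc": int(m.group("mc")),
        "row": int(m.group("row")),
        "channel": int(m.group("channel")),
        "label": m.group("label"),
    }
    # The parenthesised part is key=value pairs, e.g. Socket=1 channel=1 dimm=1.
    # They are stored under a "detail_" prefix so they cannot clash with the
    # row/channel fields above.
    details = m.group("details") or ""
    for pair in details.split():
        key, _, value = pair.partition("=")
        fields["detail_" + key.lower()] = int(value) if value.isdigit() else value
    return fields
```

Feeding it the output of dmesg one line at a time and counting the distinct (mc, row, channel) tuples gives a quick check that every error really does come from the same place.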

This cryptic dmesg EDAC stuff is documented here (although I've got a
headache trying to parse it all):
http://www.kernel.org/doc/Documentation/edac.txt

There's also a somewhat useful 'edac-util' tool in Linux which can be
used to report on this stuff too:
mc1: csrow4: ch0|ch1: 0 Uncorrected Errors
mc1: csrow4: ch0: 496 Corrected Errors

And finally, in the /sys filesystem, the error counts are tracked:
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:496
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow7/ch0_ce_count:0
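
The same counters can be read with a short script that reports only the non-zero ones. This is a sketch, assuming the mc*/csrow*/ch*_ce_count layout shown in the grep output above (the sysfs layout can differ between kernels); nonzero_ce_counts is a hypothetical helper name, and the root argument exists only so the function can be pointed at a different directory.

```python
import glob
import re


def nonzero_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {(mc, csrow, channel): count} for every non-zero ch*_ce_count file."""
    # Assumption: counters live at <root>/mcN/csrowN/chN_ce_count, as in the
    # grep output quoted in this post.
    pattern = re.compile(r"mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$")
    result = {}
    for path in sorted(glob.glob(root + "/mc*/csrow*/ch*_ce_count")):
        m = pattern.search(path)
        if m is None:
            continue
        with open(path) as f:
            count = int(f.read().strip())
        if count:
            key = tuple(int(g) for g in m.groups())
            result[key] = count
    return result
```

On the machine described here, nonzero_ce_counts() should return a single entry, keyed (1, 4, 0), with the value 496. Running it periodically and comparing the results shows whether the count is still climbing.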

So if I'm understanding this stuff correctly, I think it means that
DIMM #5 (csrow4, counting from zero) associated with the 2nd CPU (mc1)
is the one having problems.  However, I don't have terribly high
confidence that I'm interpreting this correctly.  Does anyone else
understand how to parse these errors into something that corresponds
to a real physical memory slot on the motherboard?

thanks


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama at gmail.com
LlamaLand                       https://netllama.linux-sxs.org



