Anyone understand how to parse EDAC errors in dmesg?
Lonni J Friedman
lfriedman at nvidia.com
Sun Sep 4 19:09:14 PDT 2011
I've got a new server just deployed that started spewing EDAC
"Corrected error" messages in dmesg (496 of them thus far) under load:
EDAC MC1: CE row 4, channel 0, label "": Corrected error (Socket=1 channel=1 dimm=1)
These are basically ECC errors, which means that I've got bad RAM.
Luckily they are being detected and corrected, and aren't yet causing
any obvious problems. Thankfully it's always the same module, so only
one is bad (which is enough).
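For anyone who wants to pull the fields out of these lines with a script, here is a minimal Python sketch. The regex is modeled only on the single message quoted above, so it is an assumption that other EDAC drivers or kernel versions print the same layout. The names parse_edac_ce and hw_channel are made up for illustration and don't come from the kernel.

```python
import re

# Pattern modeled on the one message quoted above; other EDAC drivers
# and kernel versions may format these lines differently (assumption).
EDAC_CE = re.compile(
    r'EDAC MC(?P<mc>\d+): CE row (?P<row>\d+), channel (?P<channel>\d+), '
    r'label "(?P<label>[^"]*)": Corrected error '
    r'\(Socket=(?P<socket>\d+) channel=(?P<hw_channel>\d+) dimm=(?P<dimm>\d+)\)'
)


def parse_edac_ce(line):
    """Return the fields of one corrected-error line as a dict, or None."""
    m = EDAC_CE.search(line)
    if m is None:
        return None
    # Everything except the label is a small integer index.
    return {k: (v if k == "label" else int(v)) for k, v in m.groupdict().items()}


# The message from dmesg, rejoined onto one line.
line = ('EDAC MC1: CE row 4, channel 0, label "": Corrected error '
        '(Socket=1 channel=1 dimm=1)')
print(parse_edac_ce(line))
```

Note that this only splits the line into its fields; it says nothing about which physical slot those fields correspond to.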
This cryptic dmesg EDAC stuff is documented here (although I've got a
headache trying to parse it all):
http://www.kernel.org/doc/Documentation/edac.txt
There's also a somewhat useful 'edac-util' tool in Linux which can be
used to report on this stuff too:
mc1: csrow4: ch0|ch1: 0 Uncorrected Errors
mc1: csrow4: ch0: 496 Corrected Errors
And finally, in the /sys filesystem, the error counts are tracked:
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:496
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow7/ch0_ce_count:0
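The same check can be scripted so that only the non-zero counters are shown. This is a rough Python sketch that assumes the sysfs layout matches the paths in the grep output above (mcN/csrowN/chN_ce_count). The function names read_ce_counts and nonzero are hypothetical helpers, not part of any EDAC tooling.

```python
import glob
import os
import re

# Matches the per-channel counter files shown in the grep output above.
COUNTER = re.compile(r'mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$')


def read_ce_counts(root="/sys/devices/system/edac/mc"):
    """Map (mc, csrow, channel) -> corrected-error count for each counter file."""
    counts = {}
    for path in glob.glob(os.path.join(root, "mc*", "csrow*", "ch*_ce_count")):
        m = COUNTER.search(path)
        if m is None:
            continue
        with open(path) as f:
            counts[tuple(int(g) for g in m.groups())] = int(f.read().strip())
    return counts


def nonzero(counts):
    """Keep only the locations that have logged at least one corrected error."""
    return {loc: n for loc, n in sorted(counts.items()) if n > 0}


if __name__ == "__main__":
    for (mc, csrow, ch), n in nonzero(read_ce_counts()).items():
        print(f"mc{mc} csrow{csrow} ch{ch}: {n} corrected errors")
```

On a machine without the EDAC sysfs tree, the glob matches nothing and the script prints nothing.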
So if I'm understanding this stuff correctly, I think it means that
DIMM #5 associated with the 2nd CPU is the one having problems.
However, I don't have terribly high confidence that I'm interpreting
this correctly. Does anyone else understand how to parse these errors
into something that corresponds to a real physical memory slot on the
motherboard?
thanks
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman netllama at gmail.com
LlamaLand https://netllama.linux-sxs.org