Anyone understand how to parse EDAC errors in dmesg?

Mon Sep 5 09:40:17 PDT 2011

On Mon, Sep 5, 2011 at 6:07 AM, David A. Bandel <david.bandel at gmail.com> wrote:
> On Sun, Sep 4, 2011 at 21:09, Lonni J Friedman <lfriedman at nvidia.com> wrote:
>> I've got a new server just deployed that started spewing EDAC
>> "Corrected error" messages in dmesg (496 of them thus far) under load:
>> EDAC MC1: CE row 4, channel 0, label "": Corrected error (Socket=1 channel=1 dimm=1)
>>
>> These are basically ECC errors, which means that I've got bad RAM.
>> Luckily it is being detected and corrected, and isn't yet causing any
>> obvious problems.  Thankfully it's always the same module, so only one
>> is bad (which is enough).
>>
>> This cryptic dmesg EDAC stuff is documented here (although I've got a
>> headache trying to parse it all):
>> http://www.kernel.org/doc/Documentation/edac.txt
>>
>> There's also a somewhat useful 'edac-util' tool in Linux which can be
>> used to report on this stuff too:
>> mc1: csrow4: ch0|ch1: 0 Uncorrected Errors
>> mc1: csrow4: ch0: 496 Corrected Errors
>>
>> And finally, in the /sys filesystem, the error counts are tracked:
>> grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
>> /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:496
>
> I'd say your interpretation is good.  Have you tried reseating this DIMM?

It's a production server, and it started spewing these errors on Friday
night, so I haven't been physically near it since the problem started.
I didn't want to take it down until I had a bit more confidence that I
was touching the right DIMM.
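
In the meantime, a minimal sketch for cutting the counter listing quoted
above down to just the entries that are actually counting errors. The
function name nonzero_ce is made up for illustration (it is not part of
edac-utils), and it assumes input in the "path:count" form that the grep
above prints:

```shell
# nonzero_ce: hypothetical helper, reads "path:count" lines on stdin
# and prints "mcN csrowN chN count" for every counter greater than zero.
nonzero_ce() {
    awk -F: '($NF + 0) > 0 {
        n = split($1, p, "/")
        # p[n-2] = mcN, p[n-1] = csrowN, p[n] = chN_ce_count
        ch = p[n]
        sub(/_ce_count$/, "", ch)
        printf "%s %s %s %d\n", p[n-2], p[n-1], ch, $NF
    }'
}

# Intended use on the affected machine (not run here):
#   grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count | nonzero_ce
```

This only narrows things down to a memory controller, csrow and channel.
Mapping that to a physical slot still depends on the board.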

>
>> /sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:0
>> /sys/devices/system/edac/mc/mc1/csrow7/ch0_ce_count:0
>>
>> So if I'm understanding this stuff correctly, I think it means that
>> DIMM #5 associated with the 2nd CPU is the one having problems.
>> However, I don't have terribly high confidence that I'm interpreting
>> this correctly.  Does anyone else understand how to parse these errors
>> into something that corresponds to a real physical memory slot on the
>> motherboard?
>
> I've not had this problem, so haven't seen the error.  Do you really
> have that many DIMM modules in this motherboard?  The above suggests
> 15 DIMMs (unless you truncated one of the lists).

Yes, this system has 16 slots, currently with 128GB of RAM.
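
To double-check whether one of the lists got truncated, here is a rough
sketch that counts the distinct csrow entries per memory controller in a
listing like the one above. count_csrows is a made-up name, and this
assumes the same "path:count" input as before. Keep in mind that a csrow
is not necessarily the same thing as a physical DIMM, so the totals are
only a sanity check:

```shell
# count_csrows: hypothetical helper, reads sysfs counter paths on stdin
# and prints "mcN <number of distinct csrows>" per memory controller.
count_csrows() {
    awk -F/ '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^mc[0-9]+$/ && $(i+1) ~ /^csrow[0-9]+$/)
                seen[$i "/" $(i+1)] = $i   # dedupe across channels
    }
    END {
        for (k in seen) n[seen[k]]++
        for (m in n) print m, n[m]
    }' | sort
}

# Intended use on the affected machine (not run here):
#   grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count | count_csrows
```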

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama at gmail.com
LlamaLand                       https://netllama.linux-sxs.org