Anyone understand how to parse EDAC errors in dmesg?

Mon Sep 5 06:07:01 PDT 2011

On Sun, Sep 4, 2011 at 21:09, Lonni J Friedman <lfriedman at nvidia.com> wrote:
> I've got a new server just deployed that started spewing EDAC
> "Corrected error" messages in dmesg (496 of them thus far) under load:
> EDAC MC1: CE row 4, channel 0, label "": Corrected error (Socket=1
> channel=1 dimm=1)
>
> These are basically ECC errors, which means that I've got bad RAM.
> Luckily it is being detected and corrected, and isn't yet causing any
> obvious problems.  Thankfully it's always the same module, so only one
> is bad (which is enough).
>
> This cryptic dmesg EDAC stuff is documented here (although I've got a
> headache trying to parse it all):
> http://www.kernel.org/doc/Documentation/edac.txt
>
> There's also a somewhat useful 'edac-util' tool in Linux which can be
> used to report on this stuff too:
> mc1: csrow4: ch0|ch1: 0 Uncorrected Errors
> mc1: csrow4: ch0: 496 Corrected Errors
>
> And finally, in the /sys filesystem, the error counts are tracked:
> grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
> /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:496

I'd say your interpretation is good.  Have you tried reseating this DIMM?

> /sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:0
> /sys/devices/system/edac/mc/mc1/csrow7/ch0_ce_count:0
>
> So if I'm understanding this stuff correctly, I think it means that
> DIMM #5 associated with the 2nd CPU is the one having problems.
> However, I don't have terribly high confidence that I'm interpreting
> this correctly.  Does anyone else understand how to parse these errors
> into something that corresponds to a real physical memory slot on the
> motherboard?

I've not had this problem, so I haven't seen the error.  Do you really
have that many DIMMs in this motherboard?  The listing above suggests
15 (7 csrows on mc0 and 8 on mc1), unless you truncated one of the lists.
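
If it helps, the two outputs you quoted can be picked apart mechanically.
Below is a minimal Python sketch, not anything shipped with EDAC: the
regexes are written only against the sample dmesg line and the
ch*_ce_count paths shown above, and the names parse_ce_line and
nonzero_counts are made up for illustration.  Other EDAC drivers may word
their messages differently, so treat the patterns as assumptions.

```python
import re

# Pattern for the "Corrected error" line quoted above. This is an
# assumption based on that single sample, not a general EDAC format.
CE_LINE = re.compile(
    r'EDAC MC(?P<mc>\d+): CE row (?P<row>\d+), channel (?P<channel>\d+), '
    r'label "(?P<label>[^"]*)": Corrected error '
    r'\(Socket=(?P<socket>\d+) channel=(?P<mem_channel>\d+) dimm=(?P<dimm>\d+)\)'
)

# Pattern for one line of the grep output over ch*_ce_count files.
COUNT_LINE = re.compile(r'/mc(\d+)/csrow(\d+)/ch(\d+)_ce_count:(\d+)$')


def parse_ce_line(line):
    """Return the fields of one CE message as a dict, or None if no match."""
    # Collapse whitespace so a message wrapped by a mail client still matches.
    m = CE_LINE.search(" ".join(line.split()))
    if m is None:
        return None
    # Everything except the (possibly empty) label is numeric.
    return {k: (v if k == "label" else int(v)) for k, v in m.groupdict().items()}


def nonzero_counts(grep_output):
    """Yield (mc, csrow, channel, count) for every non-zero CE counter."""
    for line in grep_output.splitlines():
        m = COUNT_LINE.search(line.strip())
        if m and int(m.group(4)) > 0:
            yield tuple(int(g) for g in m.groups())


sample = ('EDAC MC1: CE row 4, channel 0, label "": Corrected error (Socket=1\n'
          'channel=1 dimm=1)')
print(parse_ce_line(sample))
```

Feeding nonzero_counts the grep output from your message should single out
mc1/csrow4/ch0.  None of this says which physical slot that is; the
csrow-to-slot mapping is still something the board documentation has to
answer.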

>
> thanks
>
>
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> L. Friedman                                    netllama at gmail.com
> LlamaLand                       https://netllama.linux-sxs.org

Ciao,

David A. Bandel
-- 
Focus on the dream, not the competition.
            - Nemesis Air Racing Team motto
Visit my web page at: http://david.bandel.us/