2010-11-10 10:32:33

by Jon Masters

Subject: PROBLEM: first read of /dev/mcelog fails

Hi Andi,

I recently took some time to look at the daemon side of mcelog. I found
a few issues with that[0] that I will detail in a separate mail, but I
am more concerned that in recent kernels, the first read() of the mcelog
device performed by mcelog after boot always fails. 100% reproducible on
several machines - mcelog client always bombs out in process() due to
getting a "no such device" error from the kernel read attempt.
The /dev/mcelog device exists, but the first read fails.

Can you do a fresh boot, run mcelog, and see what I mean?
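
A minimal sketch of a standalone reproducer, for checking the failure
without the full daemon. It does one open() and one read() and hands
back the errno, so a "no such device" failure would show up as ENODEV.
The helper name first_read_errno() is hypothetical and is not taken from
the mcelog source. The buffer length is left to the caller, on the
assumption that the driver may reject a read that is too small for the
whole log; pass something generous.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of mcelog): open `path`, attempt exactly
 * one read() of up to `len` bytes, and return 0 on success or the errno
 * of whichever call failed.
 */
int first_read_errno(const char *path, size_t len)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;           /* e.g. ENOENT if the node is missing */

    char *buf = malloc(len);
    if (buf == NULL) {
        close(fd);
        return ENOMEM;
    }

    int err = 0;
    if (read(fd, buf, len) < 0)
        err = errno;            /* the reported failure would be ENODEV */

    free(buf);
    close(fd);
    return err;
}
```

Calling first_read_errno("/dev/mcelog", 1 << 20) twice as root right
after a fresh boot and printing strerror() of each result should show
whether only the first read fails.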

Jon.

[0] ignoredev not documented, cache-error-trigger uses $CPUS_AFFECTED
when it should be $AFFECTED_CPUS, mcelog-client socket is always created
no matter what the config says (and not deleted on shutdown).


2010-11-10 10:40:08

by Jon Masters

Subject: Re: PROBLEM: first read of /dev/mcelog fails

On Wed, 2010-11-10 at 05:32 -0500, Jon Masters wrote:

> I recently took some time to look at the daemon side of mcelog. I found
> a few issues with that[0] that I will detail in a separate mail, but I
> am more concerned that in recent kernels, the first read() of the mcelog
> device performed by mcelog after boot always fails. 100% reproducible on
> several machines - mcelog client always bombs out in process() due to
> getting a "no such device" error from the kernel read attempt.
> The /dev/mcelog device exists, but the first read fails.
>
> Can you do a fresh boot, run mcelog, and see what I mean?

(most distros won't see this because they still run mcelog in a cron
script - as mentioned in your recent paper - and ignore the error)

Thanks,

Jon.

2010-11-11 09:29:11

by Jon Masters

Subject: Re: PROBLEM: first read of /dev/mcelog fails

On Wed, 2010-11-10 at 05:32 -0500, Jon Masters wrote:

> I recently took some time to look at the daemon side of mcelog. I found
> a few issues with that[0] that I will detail in a separate mail, but I
> am more concerned that in recent kernels, the first read() of the mcelog
> device performed by mcelog after boot always fails. 100% reproducible on
> several machines - mcelog client always bombs out in process() due to
> getting a "no such device" error from the kernel read attempt.
> The /dev/mcelog device exists, but the first read fails.

I am looking into this, time permitting. Can you give me an indication
of the current plans for mcelog support in the kernel? Given that we've
had various discussions recently about EDAC and other error reporting
mechanisms, I would like to know if you plan for mcelog to stick around.

> [0] ignoredev not documented, cache-error-trigger uses $CPUS_AFFECTED
> when it should be $AFFECTED_CPUS, mcelog-client socket is always created
> no matter what the config says (and not deleted on shutdown).

I sent you a separate mail with a pull request containing the first
batch of fixes to this tool (already in the Fedora package).

Jon.