Hi Lei,
On Wed, May 05, 2021 at 07:02:14PM +0000, Lei Wang (DPLAT) wrote:
> Hi Boris,
first of all, please do not top-post.
> We found a corner case in production environment that there are ~500
> CE errors per second. The SoC otherwise functions just fine. Making
> printk ratelimited reduced CE error logging to < 20 per second.
If you want to avoid CE logs flooding dmesg, there's a couple of things
you can do:
1. Use drivers/ras/cec.c
2. Do not load EDAC drivers at all since you don't care about the error
reports, apparently.
3. Fix the CE source: replace the DIMMs, etc.
> Though this is just one case so far, we think moving to
> printk_ratelimited could benefit broader use as well, by helping
> control the amount of kernel logging.
No, this will make EDAC driver loading output incomplete when some of
the messages are omitted due to the ratelimiting. And no, this is not
going to happen.
HTH.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 2021-05-05 21:45:01, Borislav Petkov wrote:
> Hi Lei,
>
> On Wed, May 05, 2021 at 07:02:14PM +0000, Lei Wang (DPLAT) wrote:
> > Hi Boris,
>
> first of all, please do not top-post.
>
> > We found a corner case in production environment that there are ~500
> > CE errors per second. The SoC otherwise functions just fine. Making
> > printk ratelimited reduced CE error logging to < 20 per second.
>
> If you want to avoid CE logs flooding dmesg, there's a couple of things
> you can do:
>
> 1. Use drivers/ras/cec.c
>
> 2. Do not load EDAC drivers at all since you don't care about the error
> reports, apparently.
Lei, if you don't care about the CE error messages at all, there's
also an edac_mc_log_ce module parameter that can be used to quiet the
message emitted from edac_ce_error():
https://www.kernel.org/doc/html/latest/admin-guide/ras.html#module-parameters
> 3. Fix the CE source: replace the DIMMs, etc.
>
> > Though this is just one case so far, we think moving to
> > printk_ratelimited could benefit broader use as well, by helping
> > control the amount of kernel logging.
>
> No, this will make EDAC driver loading output incomplete when some of
> the messages are omitted due to the ratelimiting. And no, this is not
> going to happen.
Boris, I agree that a more surgical approach is needed than this if Lei
still needs some traces of the CE error messages in the logs. Would it
be any more acceptable to add an edac_mc_printk_ratelimited() macro,
which uses printk_ratelimited(), and then call that new macro from
edac_ce_error()?
If you still don't want those CE errors ratelimited by default, perhaps
a new, non-default mode (2) could be added to the edac_mc_log_ce module
parameter that uses the ratelimited variant?
Tyler
>
> HTH.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>
On Wed, May 05, 2021 at 03:23:57PM -0500, Tyler Hicks wrote:
> Would it be any more acceptable to add an
> edac_mc_printk_ratelimited() macro, which uses printk_ratelimited(),
> and then call that new macro from edac_ce_error()?
You guys are way off here: the intent of EDAC drivers is to accurately
report errors for purposes of counting them and doing analysis on
that collected data as to whether components are going wrong - not to
ratelimit them as some nuisance output.
With breaking the EDAC reporting, you're barking up the wrong tree - if
you don't want to see those errors, do not load the drivers. It is that
simple.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 2021-05-05 23:04:44, Borislav Petkov wrote:
> On Wed, May 05, 2021 at 03:23:57PM -0500, Tyler Hicks wrote:
> > Would it be any more acceptable to add an
> > edac_mc_printk_ratelimited() macro, which uses printk_ratelimited(),
> > and then call that new macro from edac_ce_error()?
>
> You guys are way off here: the intent of EDAC drivers is to accurately
> report errors for purposes of counting them and doing analysis on
> that collected data as to whether components are going wrong - not to
> ratelimit them as some nuisance output.
>
> With breaking the EDAC reporting, you're barking up the wrong tree - if
> you don't want to see those errors, do not load the drivers. It is that
> simple.
As I understand it, the idea here wasn't to treat the log messages as a
nuisance that should be completely squelched. The counters are monitored
and provide the definitive way to detect large scale problems but the CE
log messages can be an easier-to-discover way for humans to identify
potential problems when, for example, centralized log aggregation and
indexing is in place.
The thought was that the full stream of log messages isn't necessary to
notice that there's a problem when they are being emitted at such a high
rate (500 per second). They're just filling up disk space and/or wasting
networking bandwidth at that point. Of course, the best course of action
here is to service the machine but there's still a period of time
between the CE errors popping up and the machine being serviced.
Tyler