On 2021-05-06 00:55:00, Borislav Petkov wrote:
> On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:
> > This is x86-specific
>
> That's because it is used by x86 currently. It shouldn't be hard to use
> it on another arch though as the machinery is pretty generic.
>
> > and not applicable in our situation.
>
> What is your situation? ARM?
Yes, though I'm not sure if those additional features are
important/useful enough for us to generalize that driver. The main
motivation here was just to prevent storage/network from being flooded
by obviously-bad nodes that haven't been offlined yet. :)
Lei and others on cc will need to evaluate porting cec.c and what it
will gain them. Thanks again.
Tyler
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>
>> What is your situation? ARM?
>
> Yes, though I'm not sure if those additional features are
> important/useful enough for us to generalize that driver. The main
> motivation here was just to prevent storage/network from being flooded
> by obviously-bad nodes that haven't been offlined yet. :)
>
> Lei and others on cc will need to evaluate porting cec.c and what it
> will gain them. Thanks again.
Tyler,
You might also look at the x86 "storm" detection code (tl;dr version
"If error interrupts are coming too fast, turn off the interrupts and poll").
-Tony
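The tl;dr above can be modelled in a few lines. This is only a toy sketch of the idea, not the kernel's actual x86 storm code: the class name, the method names, and the threshold, window and quiet-period values are all made up for illustration.

```python
from collections import deque


class StormDetector:
    """Toy model: fall back from interrupts to polling during an error storm."""

    def __init__(self, threshold=5, window=1.0, quiet_period=30.0):
        self.threshold = threshold        # interrupts tolerated per window (assumed value)
        self.window = window              # window length in seconds (assumed value)
        self.quiet_period = quiet_period  # error-free time before leaving poll mode (assumed)
        self.mode = "interrupt"
        self._recent = deque()            # timestamps of recent interrupts
        self._last_error = None

    def on_interrupt(self, now):
        """Record one error interrupt; switch to polling if they come too fast."""
        self._last_error = now
        self._recent.append(now)
        # Forget interrupts that fell out of the sliding window.
        while self._recent and now - self._recent[0] > self.window:
            self._recent.popleft()
        if len(self._recent) >= self.threshold:
            self.mode = "poll"            # "turn off the interrupts and poll"
            self._recent.clear()
        return self.mode

    def on_poll(self, now, errors_found):
        """Called from the periodic poll; re-arm interrupts once things are quiet."""
        if errors_found:
            self._last_error = now
        elif self._last_error is None or now - self._last_error >= self.quiet_period:
            self.mode = "interrupt"
        return self.mode
```

In this model a burst of `threshold` interrupts inside one `window` flips the mode to "poll", and only a full `quiet_period` without errors flips it back.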
On Wed, 5 May 2021 18:01:52 -0500, Tyler Hicks wrote:
> On 2021-05-06 00:55:00, Borislav Petkov wrote:
> > On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:
> > > This is x86-specific
> >
> > That's because it is used by x86 currently. It shouldn't be hard to use
> > it on another arch though as the machinery is pretty generic.
> >
> > > and not applicable in our situation.
> >
> > What is your situation? ARM?
>
> Yes, though I'm not sure if those additional features are
> important/useful enough for us to generalize that driver. The main
> motivation here was just to prevent storage/network from being flooded
> by obviously-bad nodes that haven't been offlined yet. :)
Well, if a machine starts to produce 500+ errors per second,
then it should be offlined as soon as possible, as otherwise bad results
will be produced ;-)
See, the CE reporting mechanism is meant to be used together
with some error correction code algorithm like the ones used on ECC
memories. Such algorithms are designed to detect a single errored bit
with a chance usually on the order of ~10^-4 to 10^-7 (the precision
depends on how many bits are used and what algorithm is used), but
if there are two wrong bits in the same word, the chance of detecting
them is a lot lower.
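To make the single-bit versus double-bit point concrete, here is a minimal sketch using a plain Hamming(7,4) code. That code is an assumption picked for brevity; it is not what any particular memory controller implements, and real ECC memory typically uses wider codes with an extra parity bit so that double-bit errors are at least flagged. The function names are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data bits, syndrome); a non-zero syndrome is 'corrected' blindly."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        # The decoder assumes exactly one bit is wrong and flips that position.
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]

one_flip = encode(word)
one_flip[4] ^= 1               # single-bit error: repaired correctly
print(decode(one_flip))

two_flips = encode(word)
two_flips[0] ^= 1
two_flips[1] ^= 1              # double-bit error: decoder "fixes" the wrong bit
print(decode(two_flips))
```

With one flipped bit the original data comes back. With two flipped bits the decoder still reports a correctable error, but the data it returns is wrong, which is the kind of miss described above.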
So, keeping the server enabled up to the point where it consumes
enough storage/network resources to bother someone sounds like a
terrible idea, as sooner or later it will miss an error or produce
an uncorrected event ;-)
Besides that, if you're running rasdaemon to capture the hardware errors,
the storage will also be flooded by them, even if you disable them
from going to syslog via
/sys/module/edac_core/parameters/edac_mc_log_ce.
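For reference, toggling that parameter from userspace could look roughly like the sketch below. The helper names are made up, and it assumes the parameter file is writable by root and holds a plain 0/1 integer. As noted above, this only affects what goes to syslog, not what rasdaemon records.

```python
from pathlib import Path

# Default location of the parameter on a system with edac_core loaded.
EDAC_LOG_CE = Path("/sys/module/edac_core/parameters/edac_mc_log_ce")


def set_ce_logging(enabled, param=EDAC_LOG_CE):
    """Write 1 or 0 to the parameter file (needs root on a real system)."""
    Path(param).write_text("1\n" if enabled else "0\n")


def ce_logging_enabled(param=EDAC_LOG_CE):
    """Return True unless the parameter currently reads as 0."""
    return Path(param).read_text().strip() != "0"
```

Calling `set_ce_logging(False)` as root would stop corrected errors from being logged by the EDAC core; the `param` argument exists only so the helpers can be pointed at another file for testing.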
Now, the question is: are those 500+ errors per second a real hardware
problem, or is it due to some broken error report mechanism?
In the latter case, the driver or the hardware that is producing
invalid errors should be fixed.
>
> Lei and others on cc will need to evaluate porting cec.c and what it
> will gain them. Thanks again.
Regards,
Mauro