LinuxLists.cc - Re: edac driver injection of uncorrected errors & utils

2018-12-05 18:00:40

Subject: Re: edac driver injection of uncorrected errors & utils

On 12/5/18 8:38 AM, Tracy Smith wrote:
> This is directed mainly at York, for Layerscape. I see some EDAC code
> that seems to do periodic scrubs based on an interval or scrub rate.
> Is that unnecessary for the Layerscape driver because errors are
> scrubbed when the EDAC scrub block finds them, or because the memory
> controller itself does the correction/scrubbing when an error is
> found?

Single-bit errors are corrected by the memory controller without
involving software.

York

2018-12-05 22:02:59

by Tracy Smith

Subject: Patrol scrub questions

> Single-bit errors are corrected by the memory controller without involving software.

Sorry for being verbose, but I need to explain the reason for the
questions below since I need to determine if a memory scrub is
required on layerscape and why. There are multiple layers to the
problem of ECC.

First layer, there is the immediate 'correction' of a flipped bit.

This does not 'fix' the source of the error but corrects the flipped
bit for use by the processor.

Most bit flips will be due to either a transitory noise problem on the
bus, which will not be associated with any given memory cell, OR it
will be due to a cosmic-ray induced bit flip in the memory cell which
will stay 'flipped' until the location has been written to again.

The safe action is to write the ECC corrected data back to the same
'error' location in memory. Does the layerscape memory controller
without software intervention do this?

Question 1) Does the layerscape memory controller automatically
perform a write of the corrected data back to the 'error' location to
make a correction? If not, is a patrol scrub required to do this?

Second layer, there is the risk of a double bit flip in memory.

Statistically this is very rare, but the odds significantly increase
that a double bit flip will occur in a single word when a single bit
flip goes uncorrected, giving more time for another cosmic ray induced
bit flip to occur in that word.

The layerscape memory controller can only detect a bit-flip when a
given location is read, correct? This is different from normal DRAM
refresh routines.

If a location is not normally read, it can go 'unserviced'
indefinitely, allowing multiple bit flips to accumulate.

By periodically (once a day should be more than sufficient) reading
each location in the DRAM and writing that same value (automatically
ECC corrected if correction was needed) back into the DRAM, we
drastically reduce the potential for an uncorrectable multiple-bit
error to accumulate in any given word in memory.
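
The accumulation argument can be made concrete with a rough Poisson
model. This is only a sketch: it assumes independent bit flips at a
constant per-bit rate and a 72-bit ECC word (64 data + 8 check bits).
Both the word size and any rate passed in are illustrative assumptions,
not measured figures for Layerscape or for any particular DRAM.

```python
import math

def p_multi_bit(flips_per_bit_hour, hours_between_rewrites, bits_per_word=72):
    """Probability that one ECC word collects two or more bit flips
    before it is next rewritten (by a scrub or by a normal write).

    Assumes flips are independent, so the number of flips in the word
    is Poisson with mean mu = rate * bits * time.
    """
    mu = flips_per_bit_hour * bits_per_word * hours_between_rewrites
    if mu < 1e-3:
        # Series expansion of 1 - exp(-mu) * (1 + mu); avoids
        # cancellation when mu is tiny.
        return mu * mu / 2.0 - mu ** 3 / 3.0
    # P(N >= 2) = 1 - P(N = 0) - P(N = 1)
    return 1.0 - math.exp(-mu) * (1.0 + mu)
```

For small rates the result grows with the square of the time between
rewrites, so under this model a word left untouched for a year is
roughly 365**2 times more likely to hold a multi-bit error than a word
rewritten daily, whatever placeholder rate is plugged in.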

Question 2) Again, this would require the EDAC Layerscape driver to do
a patrol scrub, correct? If not, how is this handled by the memory
controller to avoid the need for a patrol scrub?

Third layer, there is how the memory controller handles UEs. My
understanding is that the Layerscape memory controller can detect
whether an error is a single-bit (correctable) error or a multi-bit
error that is not correctable. Is this the case?
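
How an ECC scheme can tell a correctable single-bit error from an
uncorrectable double-bit error can be shown with a toy SECDED code. The
sketch below uses an extended Hamming (8,4) code purely for
illustration; it is not the Layerscape controller's actual code, and
the function names are hypothetical. Real DDR controllers typically
protect 64 data bits with 8 check bits, but the principle is the same.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    The codeword is a list of bits: index 0 holds the overall parity,
    indexes 1..7 are the Hamming(7,4) positions (parity at 1, 2, 4).
    """
    word = [0] * 8
    word[3], word[5], word[6], word[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word[1:]) % 2
    return word

def decode(word):
    """Return (status, nibble): status is "ok", "CE" or "UE"."""
    word = list(word)
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i              # XOR of the indexes of set bits
    odd_parity = sum(word) % 2         # 1 means an odd number of flips
    if odd_parity:
        # Treated as a single flip: the syndrome is its position
        # (0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "CE"
    elif syndrome:
        # Even number of flips but a nonzero syndrome: two bits flipped,
        # detectable but not correctable.
        return "UE", None
    else:
        status = "ok"
    nibble = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return status, nibble
```

Any single flipped bit decodes as a CE with the original data
recovered; any two flipped bits decode as a UE. Three or more flips are
outside what a SECDED code guarantees.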

An uncorrectable error in the data or the software will have
consequences ranging from negligible to critical. The hardware cannot
tell whether a given error is critical, so it must assume it is.

Question 3) Because the memory controller or Layerscape platform must
assume a UE is critical, will a single UE on Layerscape cause a WDT to
be triggered and a reset to occur?

Question 4) If so, will a panic ever be called if there is a hardware
uncorrectable memory failure?

2018-12-05 22:13:55

by York Sun

Subject: Re: Patrol scrub questions

On 12/5/18 2:00 PM, Tracy Smith wrote:
>> Single-bit errors are corrected by the memory controller without involving software.
>
> Sorry for being verbose, but I need to explain the reason for the
> questions below since I need to determine if a memory scrub is
> required on layerscape and why. There are multiple layers to the
> problem of ECC.
>
> First layer, there is the immediate 'correction' of a flipped bit.
>
> This does not 'fix' the source of the error but corrects the flipped
> bit for use by the processor.
>
> Most bit flips will be due to either a transitory noise problem on the
> bus, which will not be associated with any given memory cell, OR it
> will be due to a cosmic-ray induced bit flip in the memory cell which
> will stay 'flipped' until the location has been written to again.
>
> The safe action is to write the ECC corrected data back to the same
> 'error' location in memory. Does the layerscape memory controller
> without software intervention do this?
>
> Question 1) Does the layerscape memory controller automatically
> perform a write of the corrected data back to the 'error' location to
> make a correction? If not, is a patrol scrub required to do this?
>

Tracy,

Layerscape SoCs have a hardware feature that fixes any detected
single-bit error. It is not part of the EDAC driver. The error is still
counted, so the EDAC driver can "see" it. You can refer to the SoC
reference manual.

> Question 3) Because the memory controller or Layerscape platform must
> assume a UE is critical, will a single UE on Layerscape cause a WDT to
> be triggered and a reset to occur?

No.

>
> Question 4) If so, will a panic ever be called if there is a hardware
> uncorrectable memory failure?

No. It is up to the upper layer of the EDAC driver. The Layerscape
driver only reports CEs and UEs.

York
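
The CE and UE counts that a driver reports can be watched from user
space through the EDAC sysfs counters. A minimal sketch, assuming the
usual layout of /sys/devices/system/edac/mc/mcN/ with ce_count and
ue_count files; the function name and the edac_root parameter are
illustrative, and nothing here is specific to the Layerscape driver.

```python
from pathlib import Path

def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": n, "ue_count": m}, ...}.

    Controllers whose counter files are missing or unreadable are
    skipped rather than reported as zero.
    """
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            counts[mc.name] = {
                name: int((mc / name).read_text())
                for name in ("ce_count", "ue_count")
            }
        except (OSError, ValueError):
            continue
    return counts
```

Polling this before and after a test run shows whether a corrected
error was counted even though software took no part in correcting it.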

2018-12-05 22:54:54

by Tracy Smith

Subject: Layerscape behavior when a UE is detected

>> Question 4) If so, will a panic ever be called if there is a hardware
>> uncorrectable memory failure?

> No. It is up to the upper layer of the EDAC driver. The Layerscape driver only reports CEs and UEs.

Just to be clear, the upper layer of the EDAC driver will or will not
panic when a UE is detected on layerscape?

If there is no panic by the upper layer and no reset triggered by the
layerscape CPLD or memory controller, what happens on layerscape when
a UE is detected by the memory controller?

Forcing a UE by grounding a dataline caused a reset on layerscape
after a few seconds, but no panic. It is unclear why it reset, but it
appears as though a WDT was tripped. The UE was reported by EDAC and
seen in the log.

2018-12-05 22:58:13

by York Sun

Subject: Re: Layerscape behavior when a UE is detected

On 12/5/18 2:54 PM, Tracy Smith wrote:
>>> Question 4) If so, will a panic ever be called if there is a hardware
>>> uncorrectable memory failure?
>
>> No. It is up to the upper layer of the EDAC driver. The Layerscape driver only reports CEs and UEs.
>
> Just to be clear, the upper layer of the EDAC driver will or will not
> panic when a UE is detected on layerscape?
>
> If there is no panic by the upper layer and no reset triggered by the
> layerscape CPLD or memory controller, what happens on layerscape when
> a UE is detected by the memory controller?
>
> Forcing a UE by grounding a dataline caused a reset on layerscape
> after a few seconds, but no panic. It is unclear why it reset, but it
> appears as though a WDT was tripped. The UE was reported by EDAC and
> seen in the log.
>
I can't help you with that. I have never tried to force errors by
grounding the signals. You have read the driver. Do you see a panic?
The idea is to report the error and let the upper layer decide what to
do. Sometimes limping forward is better than a reset or panic. Again,
it is not the driver's responsibility.

York

2018-12-05 23:44:11

by Tracy Smith

Subject: Layerscape UE detected and no EDAC panic

> I can't help you with that. I have never tried to force errors by
> grounding the signals. You have read the driver. Do you see a panic?
> The idea is to report the error and let the upper layer decide what to
> do. Sometimes limping forward is better than a reset or panic. Again,
> it is not the driver's responsibility.

Thanks for the clarification, York. Yes, there is panic code in the
EDAC upper layer, but no panic occurred. A UE was printed on the serial
console, and the Layerscape board reset.

It did not panic because edac_mc_panic_on_ue has to be set at runtime.
I just validated that a UE causes a panic when it is set. A memory UE
by itself should not reset the board, so the reset was caused by
grounding the data line, that is, by how I was testing for a UE, not
by the UE itself.

To force a panic on a UE, set the parameter at runtime:

    echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue

MODULE_PARM_DESC(edac_mc_panic_on_ue, "Panic on uncorrected error: 0=off 1=on");
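
The same runtime switch can be checked or changed from a script. This
is a minimal sketch that uses the sysfs path quoted above; the helper
names are hypothetical, and writing the parameter requires root.

```python
from pathlib import Path

# Path taken from the message above; the helpers below are illustrative.
PANIC_PARAM = "/sys/module/edac_core/parameters/edac_mc_panic_on_ue"

def panic_on_ue_enabled(param_path=PANIC_PARAM):
    """True if the EDAC core is currently set to panic on a UE."""
    return int(Path(param_path).read_text()) != 0

def set_panic_on_ue(enabled, param_path=PANIC_PARAM):
    """Write 1 or 0 to the module parameter (needs root on a real system)."""
    Path(param_path).write_text("1" if enabled else "0")
```

Checking the value before a UE test makes it clear in advance whether
the upper layer is expected to panic or only to log the error.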

So, this is validated. I produced a UE without a panic, and I was able
to induce a panic on a UE. I'm satisfied with this. Thanks again!!