When running into situations like:
"Unhandled fault: synchronous external abort (0x210) at 0xXXX"
or
"Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
it is useful to know the content of ADFSR (Auxiliary Data Fault Status
Register) to indicate an ECC double-bit error in L1 or L2 cache.
Refer to:
Cortex-A15 Technical Reference Manual, Revision: r2p1
[6.4.8. Error Correction Code]
Signed-off-by: Wladislav Wiebe <[email protected]>
---
arch/arm/mm/fault.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 3232afb6fdc0..5e240deb6ed6 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
fsr_info[nr].name = name;
}
+/*
+ * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
+ */
+static void check_adfsr_for_ecc(void)
+{
+ u32 adfsr = 0;
+
+ asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
+
+ if (adfsr & (BIT(31) | BIT(23))) {
+ pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
+ "ECC double-bit error occurred at some time.\n",
+ adfsr);
+ }
+}
+
/*
* Dispatch a data abort to the relevant handler.
*/
@@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
return;
+ check_adfsr_for_ecc();
pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
inf->name, fsr, addr);
show_pte(current->mm, addr);
@@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
return;
+ check_adfsr_for_ecc();
pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
inf->name, ifsr, addr);
--
2.16.1
On 29/10/2018 14:20, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> When running into situations like:
> "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> or
> "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> Register) to indicate an ECC double-bit error in L1 or L2 cache.
>
> Refer to:
> Cortex-A15 Technical Reference Manual, Revision: r2p1
> [6.4.8. Error Correction Code]
The contents of ADFSR are implementation-defined, though, so this
interpretation is *only* valid on Cortex-A15. Other processors may use
those bit positions to report something else, at which point printing a
message about ECC errors would be totally misleading.
Robin.
> Signed-off-by: Wladislav Wiebe <[email protected]>
> ---
> arch/arm/mm/fault.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index 3232afb6fdc0..5e240deb6ed6 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
> fsr_info[nr].name = name;
> }
>
> +/*
> + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> + */
> +static void check_adfsr_for_ecc(void)
> +{
> + u32 adfsr = 0;
> +
> + asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> +
> + if (adfsr & (BIT(31) | BIT(23))) {
> + pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> + "ECC double-bit error occurred at some time.\n",
> + adfsr);
> + }
> +}
> +
> /*
> * Dispatch a data abort to the relevant handler.
> */
> @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
> return;
>
> + check_adfsr_for_ecc();
> pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
> inf->name, fsr, addr);
> show_pte(current->mm, addr);
> @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
> if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
> return;
>
> + check_adfsr_for_ecc();
> pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
> inf->name, ifsr, addr);
>
>
On Mon, Oct 29, 2018 at 02:20:51PM +0000, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> When running into situations like:
> "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> or
> "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> Register) to indicate an ECC double-bit error in L1 or L2 cache.
>
> Refer to:
> Cortex-A15 Technical Reference Manual, Revision: r2p1
> [6.4.8. Error Correction Code]
This is CPU independent code, and so must only access registers that are
present on all CPUs which may run that code.
Here's the extract from the ARM ARM for the ADFSR and AIFSR:
The position of these registers is architecturally-defined, but the
content and use of the registers is IMPLEMENTATION DEFINED. An
implementation can use these registers to return additional fault
status information. An example use of these registers is to return
more information for diagnosing parity errors.
So by testing bits in this register, you are making use of
implementation defined values.
It also goes on to say:
These registers are not implemented in architecture versions before
ARMv7.
So before ARMv7, we have to take note of the unimplemented CP15 rules:
2. In an allocated CP15 primary register, accesses to all unallocated
encodings are UNPREDICTABLE for accesses at PL1 or higher. This
means that any MCR or MRC access from PL1 or higher with a
combination of <CRn>, <opc1>, <CRm> and <opc2> values not shown in,
or referenced from, Full list of VMSA CP15 registers, by coprocessor
register number on page B3-1481, that would access an allocated
CP15 primary register, is UNPREDICTABLE. As indicated by rule 1, for
the ARMv7-A architecture, the allocated CP15 primary registers are:
• in any VMSA implementation, c0-c3, c5-c11, c13, and c15
...
So I'd prefer if we didn't attempt to read this register on CPUs where
this isn't explicitly implemented.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
Hi Robin, Russell,
> -----Original Message-----
> From: Robin Murphy <[email protected]>
> Sent: Monday, October 29, 2018 3:52 PM
[..]
> On 29/10/2018 14:20, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> > When running into situations like:
> > "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> > or
> > "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> > it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> > Register) to indicate an ECC double-bit error in L1 or L2 cache.
> >
> > Refer to:
> > Cortex-A15 Technical Reference Manual, Revision: r2p1 [6.4.8. Error
> > Correction Code]
>
> The contents of ADFSR are implementation-defined, though, so this
> interpretation is *only* valid on Cortex-A15. Other processors may use those
> bit positions to report something else, at which point printing a message
> about ECC errors would be totally misleading.
Good point; I initially thought it was valid for other cores as well.
Do you think we can go with this approach:
if (read_cpuid_part() == ARM_CPU_PART_CORTEX_A15) {
asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
xxxx
}
?
Thanks a lot for the fast feedback!
- Wladislav
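A minimal user-space sketch of the gating idea above, so the decision logic can be checked outside the kernel. The constants are assumed to mirror the kernel's ARM_CPU_PART_CORTEX_A15 and its implementer/part mask; the helper names and the idea of passing the main ID register and ADFSR values in as parameters are illustrative only and not part of the patch. In the kernel the ADFSR value would come from the MRC shown in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the kernel's ARM_CPU_PART_CORTEX_A15
 * (implementer 0x41, part number 0xc0f). */
#define PART_CORTEX_A15 0x4100c0f0u
/* Assumed mask keeping only implementer and part number of the main ID register. */
#define IMP_PART_MASK   0xff00fff0u
/* The two bits the patch tests in ADFSR. */
#define ADFSR_ECC_MASK  ((1u << 31) | (1u << 23))

/* Hypothetical helper: may ADFSR be interpreted as ECC status on this CPU? */
static bool adfsr_ecc_applicable(uint32_t midr)
{
	return (midr & IMP_PART_MASK) == PART_CORTEX_A15;
}

/* Hypothetical helper: report an ECC error only when the CPU is a
 * Cortex-A15 and one of the tested bits is set. */
static bool adfsr_reports_ecc(uint32_t midr, uint32_t adfsr)
{
	if (!adfsr_ecc_applicable(midr))
		return false;
	return (adfsr & ADFSR_ECC_MASK) != 0;
}
```

With this shape, a non-A15 part never reaches the bit test, so an implementation-defined value in those bit positions cannot produce a misleading ECC message.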
>
> Robin.
>
> > Signed-off-by: Wladislav Wiebe <[email protected]>
> > ---
> > arch/arm/mm/fault.c | 18 ++++++++++++++++++
> > 1 file changed, 18 insertions(+)
> >
> > diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> > index 3232afb6fdc0..5e240deb6ed6 100644
> > --- a/arch/arm/mm/fault.c
> > +++ b/arch/arm/mm/fault.c
> > @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
> > fsr_info[nr].name = name;
> > }
> >
> > +/*
> > + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> > + */
> > +static void check_adfsr_for_ecc(void)
> > +{
> > + u32 adfsr = 0;
> > +
> > + asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> > +
> > + if (adfsr & (BIT(31) | BIT(23))) {
> > + pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> > + "ECC double-bit error occurred at some time.\n",
> > + adfsr);
> > + }
> > +}
> > +
> > /*
> > * Dispatch a data abort to the relevant handler.
> > */
> > @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> > if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
> > return;
> >
> > + check_adfsr_for_ecc();
> > pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
> > inf->name, fsr, addr);
> > show_pte(current->mm, addr);
> > @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
> > if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
> > return;
> >
> > + check_adfsr_for_ecc();
> > pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
> > inf->name, ifsr, addr);
> >
> >
On Mon, Oct 29, 2018 at 02:20:51PM +0000, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> When running into situations like:
> "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> or
> "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> Register) to indicate an ECC double-bit error in L1 or L2 cache.
>
> Refer to:
> Cortex-A15 Technical Reference Manual, Revision: r2p1
> [6.4.8. Error Correction Code]
>
> Signed-off-by: Wladislav Wiebe <[email protected]>
> ---
> arch/arm/mm/fault.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index 3232afb6fdc0..5e240deb6ed6 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
> fsr_info[nr].name = name;
> }
>
> +/*
> + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> + */
> +static void check_adfsr_for_ecc(void)
> +{
> + u32 adfsr = 0;
> +
> + asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> +
> + if (adfsr & (BIT(31) | BIT(23))) {
> + pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> + "ECC double-bit error occurred at some time.\n",
> + adfsr);
> + }
> +}
> +
> /*
> * Dispatch a data abort to the relevant handler.
> */
> @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
> return;
>
> + check_adfsr_for_ecc();
> pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
> inf->name, fsr, addr);
> show_pte(current->mm, addr);
> @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
> if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
> return;
>
> + check_adfsr_for_ecc();
> pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
> inf->name, ifsr, addr);
IIUC at this point the task is preemptible (and interruptible), so I
believe this is too late to snapshot the ADFSR. The task could have been
migrated to a different core, with an irrelevant ADFSR, or a fault could
have occurred within an interrupt handler, etc.
Thanks,
Mark.
On Mon, Oct 29, 2018 at 03:54:36PM +0000, Mark Rutland wrote:
> On Mon, Oct 29, 2018 at 02:20:51PM +0000, Wiebe, Wladislav (Nokia - DE/Ulm) wrote:
> > When running into situations like:
> > "Unhandled fault: synchronous external abort (0x210) at 0xXXX"
> > or
> > "Unhandled prefetch abort: synchronous external abort (0x210) at 0xXXX"
> > it is useful to know the content of ADFSR (Auxiliary Data Fault Status
> > Register) to indicate an ECC double-bit error in L1 or L2 cache.
> >
> > Refer to:
> > Cortex-A15 Technical Reference Manual, Revision: r2p1
> > [6.4.8. Error Correction Code]
> >
> > Signed-off-by: Wladislav Wiebe <[email protected]>
> > ---
> > arch/arm/mm/fault.c | 18 ++++++++++++++++++
> > 1 file changed, 18 insertions(+)
> >
> > diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> > index 3232afb6fdc0..5e240deb6ed6 100644
> > --- a/arch/arm/mm/fault.c
> > +++ b/arch/arm/mm/fault.c
> > @@ -547,6 +547,22 @@ hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *)
> > fsr_info[nr].name = name;
> > }
> >
> > +/*
> > + * Check for ECC double-bit errors in Auxiliary Data Fault Status Register
> > + */
> > +static void check_adfsr_for_ecc(void)
> > +{
> > + u32 adfsr = 0;
> > +
> > + asm("mrc p15, 0, %0, c5, c1, 0" : "=r" (adfsr));
> > +
> > + if (adfsr & (BIT(31) | BIT(23))) {
> > + pr_alert("ADFSR status 0x%x indicates that an L1 or L2 cache\n"
> > + "ECC double-bit error occurred at some time.\n",
> > + adfsr);
> > + }
> > +}
> > +
> > /*
> > * Dispatch a data abort to the relevant handler.
> > */
> > @@ -559,6 +575,7 @@ do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> > if (!inf->fn(addr, fsr & ~FSR_LNX_PF, regs))
> > return;
> >
> > + check_adfsr_for_ecc();
> > pr_alert("Unhandled fault: %s (0x%03x) at 0x%08lx\n",
> > inf->name, fsr, addr);
> > show_pte(current->mm, addr);
> > @@ -593,6 +610,7 @@ do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
> > if (!inf->fn(addr, ifsr | FSR_LNX_PF, regs))
> > return;
> >
> > + check_adfsr_for_ecc();
> > pr_alert("Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
> > inf->name, ifsr, addr);
>
> IIUC at this point the task is preemptible (and interruptible),
It may be preemptible, but isn't necessarily so. It depends on whether the
called FSR specific function enabled interrupts or not.
So, it would be better to read the ADFSR before calling the FSR specific
function to guarantee that we read the values that correspond with _this_
fault.
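The ordering suggested above can be sketched in user space as follows. This is only an illustration under stated assumptions: fake_adfsr stands in for the per-CPU register (the kernel would read it with an MRC), handler_clobbers() is a made-up FSR specific function that lets the register change before it returns, and dispatch_abort() is a hypothetical name, not the kernel's do_DataAbort().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU ADFSR. */
static uint32_t fake_adfsr;

static uint32_t read_adfsr(void)
{
	return fake_adfsr;
}

/* Hypothetical FSR specific function: models a handler that enables
 * interrupts, after which the register no longer describes this fault.
 * Returns nonzero to mean "not handled". */
static int handler_clobbers(void)
{
	fake_adfsr = 0;
	return 1;
}

/* Snapshot the register before calling the handler, then report from
 * the snapshot rather than from a later read. */
static uint32_t dispatch_abort(int (*fn)(void), bool *unhandled)
{
	uint32_t adfsr = read_adfsr();	/* taken first, tied to this fault */

	*unhandled = fn() != 0;
	return adfsr;
}
```

Reading the register only after the handler returned would yield the clobbered value in this model, which is the failure mode discussed above.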