2019-01-29 12:32:00

by Zhang, Lei

[permalink] [raw]
Subject: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),
memory accesses may cause undefined fault (Data abort, DFSC=0b111111).
This problem will be fixed by next version of Fujitsu-A64FX.

This fault occurs under a specific hardware condition
when a load/store instruction perform an address translation using:
case-1 TTBR0_EL1 with TCR_EL1.NFD0 == 1.
case-2 TTBR0_EL2 with TCR_EL2.NFD0 == 1.
case-3 TTBR1_EL1 with TCR_EL1.NFD1 == 1.
case-4 TTBR1_EL2 with TCR_EL2.NFD1 == 1.
And this fault occurs completely spurious.

Since TCR_ELx.NFD1 is set to '1' at the kernel in versions
past 4.17, the case-3 or case-4 may happen.

This fault can be taken only at stage-1,
so this fault is taken from EL0 to EL1/EL2, from EL1 to EL1,
or from EL2 to EL2.

I would like to post a workaround to avoid this problem on
existing Fujitsu-A64FX version.

There are 2 points in this workaround.
Point1: trap from EL1 to EL1, EL2 to EL2
Set '0' to TCR_ELx.NFD1in kernel-entry,
and set '1' in kernel-exit.

From the view point of ARM specification, there is no problem to
reset TCR_ELx.{NFD0,NFD1} while in EL1/EL2, because
TCR_ELx.{NFD0,NFD1} controls whether to perform a translation
table walk in response to an access from EL0.

I confirmed that:
・There is no load/store instruction between
tramp_ventry and setting TCR_ELx.NFD1 to '0'.
・There is no load/store instruction between
setting TCR_ELx.NFD1 to '1' and tramp_exit.

Point2: trap from EL0 to EL1/EL2
Since this fault also occurs in EL0,
replace the fault handler for Data abort
DFSC=0b111111 with a new one to ignore this undefined fault.
I guarantee that a thread will stop delivering this fault code by ignore
this undefined fault.

The hardware condition which cause this fault is reset at exception entry,
therefore execution of at least one instruction is
guaranteed by this single retry.


This workaround is based on linux-5.0-rc2,
which TCR_ELx.NFD1 is set to '1'
only once at boot sequence,
and TCR_ELx.NFD0 is not set by kernel.
I will update my patch if new kernel makes some changes
about TCR_ELx.{NFD0,NFD1}.

Changes since [v1]
As Mark's review:

* Adopted errata framework.

Changes since [v2]
As Mark and James' review:

* Added framework to change TCR_ELx.NFD1.
- Change TCR_ELx.NFD1 to 0 when entry kernel.
- Change TCR_ELx.NFD1 to 1 when exit kernel.

I fully appreciate that if someone can test this patch on different chips
to verity no harmful effect on other chips.

If there is no problem on other chips, please merge this patch.

The patch based on linux-5.0-rc2.

Zhang Lei (1):
Arm64: Add workaround for Fujitsu A64FX erratum 010001

Documentation/arm64/silicon-errata.txt | 1 +
arch/arm64/Kconfig | 22 ++++++++++++++++++++++
arch/arm64/include/asm/cpucaps.h | 3 ++-
arch/arm64/include/asm/cputype.h | 4 ++++
arch/arm64/kernel/cpu_errata.c | 8 ++++++++
arch/arm64/kernel/entry.S | 16 ++++++++++++++++
arch/arm64/mm/fault.c | 16 +++++++++++++++-
arch/arm64/mm/proc.S | 20 ++++++++++++++++++++
8 files changed, 88 insertions(+), 2 deletions(-)

--
1.8.3.1


2019-01-29 18:11:26

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

Hi,

Could you please copy the whole description from the cover letter to the
actual patch and only send one email (full description as in here
together with the patch)? If we commit this to the kernel, it would be
useful to have the information in the log for reference later on.

More comments below:

On Tue, Jan 29, 2019 at 12:29:58PM +0000, Zhang, Lei wrote:
> On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),
> memory accesses may cause undefined fault (Data abort, DFSC=0b111111).
> This problem will be fixed by next version of Fujitsu-A64FX.
>
> This fault occurs under a specific hardware condition
> when a load/store instruction perform an address translation using:
> case-1 TTBR0_EL1 with TCR_EL1.NFD0 == 1.
> case-2 TTBR0_EL2 with TCR_EL2.NFD0 == 1.
> case-3 TTBR1_EL1 with TCR_EL1.NFD1 == 1.
> case-4 TTBR1_EL2 with TCR_EL2.NFD1 == 1.
> And this fault occurs completely spurious.

So this looks like new information on the hardware behaviour since the
v2 of the patch. Can this fault occur for any type of instruction
accessing the memory or only for SVE instructions?

> Since TCR_ELx.NFD1 is set to '1' at the kernel in versions
> past 4.17, the case-3 or case-4 may happen.
>
> This fault can be taken only at stage-1,
> so this fault is taken from EL0 to EL1/EL2, from EL1 to EL1,
> or from EL2 to EL2.
>
> I would like to post a workaround to avoid this problem on
> existing Fujitsu-A64FX version.

How likely is it to trigger this erratum? In other words, aren't we
better off with a spurious fault that we ignore rather than toggling the
TCR_ELx.NFD1 bit?

> There are 2 points in this workaround.
> Point1: trap from EL1 to EL1, EL2 to EL2
> Set '0' to TCR_ELx.NFD1in kernel-entry,
> and set '1' in kernel-exit.
>
> From the view point of ARM specification, there is no problem to
> reset TCR_ELx.{NFD0,NFD1} while in EL1/EL2, because
> TCR_ELx.{NFD0,NFD1} controls whether to perform a translation
> table walk in response to an access from EL0.

The problem is that this bit may be cached in the TLB (I haven't checked
the ARM ARM but that's usually the case with the TCR_ELx bits). If
that's the case, you can't guarantee a change unless you also perform a
TLBI VMALL. Arguably, if Fujitsu's microarchitecture doesn't cache the
NFD bits in the TLB, we could apply the workaround but I'd rather have
the spurious trap if it's not too often.

> I confirmed that:
> ・There is no load/store instruction between
> tramp_ventry and setting TCR_ELx.NFD1 to '0'.
> ・There is no load/store instruction between
> setting TCR_ELx.NFD1 to '1' and tramp_exit.

Could speculative loads also trigger this? Another option would be to
toggle it during kernel_neon_begin/end (with the caveat of TLBI as
mentioned above).

--
Catalin

2019-01-30 14:57:39

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

Hi guys,

On 01/29/2019 06:10 PM, Catalin Marinas wrote:
> Could you please copy the whole description from the cover letter to the
> actual patch and only send one email (full description as in here
> together with the patch)? If we commit this to the kernel, it would be
> useful to have the information in the log for reference later on.
>
> More comments below:
>
> On Tue, Jan 29, 2019 at 12:29:58PM +0000, Zhang, Lei wrote:
>> On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),
>> memory accesses may cause undefined fault (Data abort, DFSC=0b111111).
>> This problem will be fixed by next version of Fujitsu-A64FX.
>>
>> This fault occurs under a specific hardware condition
>> when a load/store instruction perform an address translation using:
>> case-1 TTBR0_EL1 with TCR_EL1.NFD0 == 1.
>> case-2 TTBR0_EL2 with TCR_EL2.NFD0 == 1.
>> case-3 TTBR1_EL1 with TCR_EL1.NFD1 == 1.
>> case-4 TTBR1_EL2 with TCR_EL2.NFD1 == 1.
>> And this fault occurs completely spurious.
>
> So this looks like new information on the hardware behaviour since the
> v2 of the patch. Can this fault occur for any type of instruction
> accessing the memory or only for SVE instructions?
>
>> Since TCR_ELx.NFD1 is set to '1' at the kernel in versions
>> past 4.17, the case-3 or case-4 may happen.
>>
>> This fault can be taken only at stage-1,
>> so this fault is taken from EL0 to EL1/EL2, from EL1 to EL1,
>> or from EL2 to EL2.
>>
>> I would like to post a workaround to avoid this problem on
>> existing Fujitsu-A64FX version.
>
> How likely is it to trigger this erratum? In other words, aren't we
> better off with a spurious fault that we ignore rather than toggling the
> TCR_ELx.NFD1 bit?

It sounds like the spurious fault can occur as a result of load/store.
('there is no load/store instruction between'...).

If this can happen in kernel_enter it will overwrite the exception
registers, and we lose the original ELR.

If load/store trigger it, I don't think we can ignore it.

Thanks,

James

2019-01-30 15:01:18

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

Hi!

On 01/29/2019 12:29 PM, Zhang, Lei wrote:
> On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),
> memory accesses may cause undefined fault (Data abort, DFSC=0b111111).
> This problem will be fixed by next version of Fujitsu-A64FX.
>
> This fault occurs under a specific hardware condition
> when a load/store instruction perform an address translation using:
> case-1 TTBR0_EL1 with TCR_EL1.NFD0 == 1.
> case-2 TTBR0_EL2 with TCR_EL2.NFD0 == 1.
> case-3 TTBR1_EL1 with TCR_EL1.NFD1 == 1.
> case-4 TTBR1_EL2 with TCR_EL2.NFD1 == 1.
> And this fault occurs completely spurious.
>
> Since TCR_ELx.NFD1 is set to '1' at the kernel in versions
> past 4.17, the case-3 or case-4 may happen.

e03e61c3173c ("arm64: kaslr: Set TCR_EL1.NFD1 when
CONFIG_RANDOMIZE_BASE=y") ?

So you'd never see it if you disabled CONFIG_RANDOMIZE_BASE?


Thanks,

James

2019-02-01 05:54:39

by Zhang, Lei

[permalink] [raw]
Subject: RE: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

Hi James,

> -----Original Message-----
> From: linux-arm-kernel
> [mailto:[email protected]] On Behalf Of
> James Morse
> Sent: Thursday, January 31, 2019 12:00 AM
> To: Zhang, Lei/$BD%(B $BMk(B
> Cc: 'Mark Rutland'; 'Catalin Marinas'; '[email protected]';
> '[email protected]';
> '[email protected]'
> Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001
>
>
> e03e61c3173c ("arm64: kaslr: Set TCR_EL1.NFD1 when
> CONFIG_RANDOMIZE_BASE=y") ?
>
> So you'd never see it if you disabled CONFIG_RANDOMIZE_BASE?
For security, it is necessary to set CONFIG_RANDOMIZE_BASE=y.

Thanks,
Zhang Lei


2019-02-01 10:53:54

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

On Fri, Feb 01, 2019 at 05:53:50AM +0000, Zhang, Lei wrote:
> > -----Original Message-----
> > From: linux-arm-kernel
> > [mailto:[email protected]] On Behalf Of
> > James Morse
> > Sent: Thursday, January 31, 2019 12:00 AM
> > To: Zhang, Lei/張 雷
> > Cc: 'Mark Rutland'; 'Catalin Marinas'; '[email protected]';
> > '[email protected]';
> > '[email protected]'
> > Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> > erratum 010001
> >
> >
> > e03e61c3173c ("arm64: kaslr: Set TCR_EL1.NFD1 when
> > CONFIG_RANDOMIZE_BASE=y") ?
> >
> > So you'd never see it if you disabled CONFIG_RANDOMIZE_BASE?
> For security, it is necessary to set CONFIG_RANDOMIZE_BASE=y.

So I guess we should boot with NFD1 clear, and then set it only when we
realise we're not on an A64FX?

Will

2019-02-05 12:50:36

by Zhang, Lei

[permalink] [raw]
Subject: RE: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

Hi Catalin,

> -----Original Message-----
> From: Catalin Marinas [mailto:[email protected]]
> Sent: Wednesday, January 30, 2019 3:11 AM
> To: Zhang, Lei
> Cc: '[email protected]'; 'Mark Rutland';
> '[email protected]'; '[email protected]';
> '[email protected]'
> Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001
>
> Could you please copy the whole description from the cover letter to the
> actual patch and only send one email (full description as in here
> together with the patch)? If we commit this to the kernel, it would be
> useful to have the information in the log for reference later on.

Thank you for your suggestion. I will send one email with whole description.

> So this looks like new information on the hardware behaviour since the
> v2 of the patch. Can this fault occur for any type of instruction
> accessing the memory or only for SVE instructions?

This erratum is that any load/store instruction, including Armv8 and SVE,
except non-fault access might occur a spurious fault.

> How likely is it to trigger this erratum? In other words, aren't we
> better off with a spurious fault that we ignore rather than toggling the
> TCR_ELx.NFD1 bit?

Although the erratum occurs exceptionally rare, this path is required
to handle the issue pointed out by James and Mark in:
https://lkml.org/lkml/2019/1/22/533,
https://lkml.org/lkml/2019/1/22/642.

As James and Mark pointed, if the erratum occurs at EL1/EL2 before
system registers, ELR and SPSR, are backed up, these registers will
be overwritten and we will lose that information.

So, we set the TCR_ELx.NFD1=0 during EL1/EL2.
Please see the supplemental explanation in the end of this mail.

> The problem is that this bit may be cached in the TLB (I haven't checked
> the ARM ARM but that's usually the case with the TCR_ELx bits). If
> that's the case, you can't guarantee a change unless you also perform
> a
> TLBI VMALL. Arguably, if Fujitsu's microarchitecture doesn't cache the
> NFD bits in the TLB, we could apply the workaround but I'd rather have
> the spurious trap if it's not too often.

It is not necessary to perform a TLBI VMALL in A64FX microarchitecture
to guarantee a change of TCR_ELx.{NFD0,NFD1}.

> Could speculative loads also trigger this? Another option would be to
> toggle it during kernel_neon_begin/end (with the caveat of TLBI as
> mentioned above).

No, a speculative load does not trigger this erratum.

Here are supplemental explanations:

Since this erratum occurs only when TCR_ELx.NFD1=1,
we keep TCR_ELx.NFD1=0 during EL1/EL2.
By doing so, the erratum occurs only in EL0 and the
spurious trap can be handled by the fault handler.

To keep TCR_ELx.NFD1=0 in EL1/EL2, there are two critical
sections to assure the completeness of the implementation.
One is the transition from EL0 to EL1/EL2 and the other
is from EL1/EL2 to EL0

For the former case, I set TCR_ELx.NFD1=0 at codes tramp_map_kernel.
And there is no load/store instruction before setting
TCR_ELx.NFD1=0 at EL1/EL2, so undefined fault will not be happened.

For the latter case, I set TCR_ELx.NFD1=1 at codes tramp_unmap_kernel.
And there is no load/store instruction after setting
TCR_ELx.NFD1=1 at EL1/EL2, so undefined fault will not be happened.

To handle the spurious fault in EL0,
I replace the fault handler for Data abort DFSC=0b111111 with
a new fault handler to ignore this spurious fault caused by the erratum.

Thanks,
Zhang Lei

2019-02-05 13:45:45

by Zhang, Lei

[permalink] [raw]
Subject: RE: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

Hi Will,

> -----Original Message-----
> From: linux-arm-kernel
> [mailto:[email protected]] On Behalf Of
> Will Deacon
> Sent: Friday, February 01, 2019 7:52 PM
> To: Zhang, Lei
> Cc: 'Mark Rutland'; 'Catalin Marinas'; 'James Morse';
> '[email protected]';
> '[email protected]'
> Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001

> So I guess we should boot with NFD1 clear, and then set it only when we
> realise we're not on an A64FX?

In my patch, I do similar things at __cpu_setup which we
set NFD1=1 on all processors except A64FX.

Do you mean we would better to change the place where we
set/clear NFD1?

Thanks,
Zhang Lei

2019-02-13 14:33:49

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

On Tue, Feb 05, 2019 at 01:32:07PM +0000, Zhang, Lei wrote:
> > -----Original Message-----
> > From: linux-arm-kernel
> > [mailto:[email protected]] On Behalf Of
> > Will Deacon
> > Sent: Friday, February 01, 2019 7:52 PM
> > To: Zhang, Lei
> > Cc: 'Mark Rutland'; 'Catalin Marinas'; 'James Morse';
> > '[email protected]';
> > '[email protected]'
> > Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> > erratum 010001
>
> > So I guess we should boot with NFD1 clear, and then set it only when we
> > realise we're not on an A64FX?
>
> In my patch, I do similar things at __cpu_setup which we
> set NFD1=1 on all processors except A64FX.
>
> Do you mean we would better to change the place where we
> set/clear NFD1?

Yes, I think we should keep the code as simple as we can:

- Don't set NFDx=1 in proc.S for any CPU
- Later, in C code, we can set the bit for any CPU that is not an affected
A64FX (this can be a simple MIDR check).

Does that work?

Will