On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash
dump just hangs. The machine has 4 threads per core. Each pair of cores
shares L1 and L2 caches, so 8 CPUs share those. All CPUs share the same
L3 cache.

It turned out that the TLB contained stale entries (or uninitialized
junk which just happened to look valid) from the first kernel before
the MMU was turned on in the second kernel, which caused this
instruction to hang,

	msr	sctlr_el1, x0

Signed-off-by: Qian Cai <[email protected]>
---
arch/arm64/kernel/head.S | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 4471f570a295..5196f3d729de 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -771,6 +771,10 @@ ENTRY(__enable_mmu)
 	msr	ttbr0_el1, x2			// load TTBR0
 	msr	ttbr1_el1, x1			// load TTBR1
 	isb
+	dsb	nshst
+	tlbi	vmalle1				// invalidate TLB
+	dsb	nsh
+	isb
 	msr	sctlr_el1, x0
 	isb
 	/*
--
2.17.2 (Apple Git-113)
Hi Qian Cai,
On Thu, Dec 13, 2018 at 10:53 AM Qian Cai <[email protected]> wrote:
>
> On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash
> dump just hung. It has 4 threads on each core. Each 2-core share a same
> L1 and L2 caches, so that is 8 CPUs shares those. All CPUs share a same
> L3 cache.
>
> It turned out that this was due to the TLB contained stale entries (or
> uninitialized junk which just happened to look valid) from the first
> kernel before turning the MMU on in the second kernel which caused this
> instruction hung,
>
> msr sctlr_el1, x0
>
> Signed-off-by: Qian Cai <[email protected]>
> ---
> arch/arm64/kernel/head.S | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 4471f570a295..5196f3d729de 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -771,6 +771,10 @@ ENTRY(__enable_mmu)
> msr ttbr0_el1, x2 // load TTBR0
> msr ttbr1_el1, x1 // load TTBR1
> isb
> + dsb nshst
> + tlbi vmalle1 // invalidate TLB
> + dsb nsh
> + isb
This will be executed for both the primary and the kdump kernel, right? I
don't think we really want to invalidate the TLB when booting the
primary kernel.
It would be too slow, and considering that we need to minimize boot
times on embedded arm64 devices, I don't think it would be a good
idea.
> msr sctlr_el1, x0
> isb
> /*
> --
> 2.17.2 (Apple Git-113)
>
Also, did you check the issue I reported some days back on the HPE
Apollo machines with the kdump kernel boot
<https://www.spinics.net/lists/kexec/msg21750.html>?
Can you please confirm that you are not facing the same issue (as I
suspect from reading your earlier bug report) on the HPE Apollo
machine? Also, by adding 'earlycon' to the bootargs passed to the
kdump kernel, you can see whether you are able to get at least some
console output from the kdump kernel.
Thanks,
Bhupesh
Hi Qian,
On 13/12/2018 05:22, Qian Cai wrote:
> On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash
> dump just hung. It has 4 threads on each core. Each 2-core share a same
> L1 and L2 caches, so that is 8 CPUs shares those. All CPUs share a same
> L3 cache.
>
> It turned out that this was due to the TLB contained stale entries (or
> uninitialized junk which just happened to look valid) from the first
> kernel before turning the MMU on in the second kernel which caused this
> instruction hung,
This is a great find, thanks for debugging this!
The kernel should already handle this, as we don't trust the bootloader to clean
up either.
In arch/arm64/mm/proc.S::__cpu_setup()
|/*
| * __cpu_setup
| *
| * Initialise the processor for turning the MMU on. Return in x0 the
| * value of the SCTLR_EL1 register.
| */
| .pushsection ".idmap.text", "awx"
| ENTRY(__cpu_setup)
| tlbi vmalle1 // Invalidate local TLB
| dsb nsh
This is called from stext, which then branches to __primary_switch(), which
calls __enable_mmu() where you see this problem. It shouldn't be possible to
allocate new TLB entries between these points...
Do you have CONFIG_RANDOMIZE_BASE disabled? That option causes __enable_mmu() to be
called twice; the extra TLB maintenance is in __primary_switch.
(If it works with this turned off, it points to the extra off/tlbi/on sequence.)
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 4471f570a295..5196f3d729de 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -771,6 +771,10 @@ ENTRY(__enable_mmu)
> msr ttbr0_el1, x2 // load TTBR0
> msr ttbr1_el1, x1 // load TTBR1
> isb
> + dsb nshst
> + tlbi vmalle1 // invalidate TLB
> + dsb nsh
> + isb
> msr sctlr_el1, x0
> isb
The overall change here is that we do extra maintenance later.
Can you move this around to bisect where the TLB entries are either coming from or
failing to be invalidated?
Do your first and kdump kernels have the same VA_BITS/PAGE_SIZE?
As a stab in the dark (totally untested):
------------------------------%<------------------------------
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 2c75b0b903ae..a5f3b7314bda 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -406,9 +406,6 @@ ENDPROC(idmap_kpti_install_ng_mappings)
  */
 	.pushsection ".idmap.text", "awx"
 ENTRY(__cpu_setup)
-	tlbi	vmalle1				// Invalidate local TLB
-	dsb	nsh
-
 	mov	x0, #3 << 20
 	msr	cpacr_el1, x0			// Enable FP/ASIMD
 	mov	x0, #1 << 12			// Reset mdscr_el1 and disable
@@ -465,5 +462,10 @@ ENTRY(__cpu_setup)
 1:
 #endif	/* CONFIG_ARM64_HW_AFDBM */
 	msr	tcr_el1, x10
+	isb
+
+	tlbi	vmalle1				// Invalidate local TLB
+	dsb	nsh
+
 	ret					// return to head.S
 ENDPROC(__cpu_setup)
------------------------------%<------------------------------
Thanks,
James
On Thu, 2018-12-13 at 11:10 +0530, Bhupesh Sharma wrote:
> Hi Qian Cai,
>
> On Thu, Dec 13, 2018 at 10:53 AM Qian Cai <[email protected]> wrote:
> >
> > On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash
> > dump just hung. It has 4 threads on each core. Each 2-core share a same
> > L1 and L2 caches, so that is 8 CPUs shares those. All CPUs share a same
> > L3 cache.
> >
> > It turned out that this was due to the TLB contained stale entries (or
> > uninitialized junk which just happened to look valid) from the first
> > kernel before turning the MMU on in the second kernel which caused this
> > instruction hung,
> >
> > msr sctlr_el1, x0
> >
> > Signed-off-by: Qian Cai <[email protected]>
> > ---
> > arch/arm64/kernel/head.S | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> > index 4471f570a295..5196f3d729de 100644
> > --- a/arch/arm64/kernel/head.S
> > +++ b/arch/arm64/kernel/head.S
> > @@ -771,6 +771,10 @@ ENTRY(__enable_mmu)
> > msr ttbr0_el1, x2 // load TTBR0
> > msr ttbr1_el1, x1 // load TTBR1
> > isb
> > + dsb nshst
> > + tlbi vmalle1 // invalidate TLB
> > + dsb nsh
> > + isb
>
> This will be executed both for the primary and kdump kernel, right? I
> don't think we really want to invalidate the TLB when booting the
> primary kernel.
> It would be too slow and considering that we need to minimize boot
> timings on embedded arm64 devices, I think it would not be a good
> idea.
Yes, it will be executed for the first kernel as well. As James mentioned, it
needs to be done anyway to invalidate TLB entries that the bootloader might
have left behind.
>
> > msr sctlr_el1, x0
> > isb
> > /*
> > --
> > 2.17.2 (Apple Git-113)
> >
>
> Also did you check this issue I reported on the HPE apollo machines
> some days back with the kdump kernel boot
> <https://www.spinics.net/lists/kexec/msg21750.html>.
> Can you please confirm that you are not facing the same issue (as I
> suspect from reading your earlier Bug Report) on the HPE apollo
> machine. Also adding 'earlycon' to the bootargs being passed to the
> kdump kernel you can see if you are able to atleast get some console
> output from the kdump kernel.
No, I did not encounter the problem you mentioned here.
On Thu, 2018-12-13 at 10:44 +0000, James Morse wrote:
> The kernel should already handle this, as we don't trust the bootloader to clean
> up either.
>
> In arch/arm64/mm/proc.S::__cpu_setup()
> > /*
> > * __cpu_setup
> > *
> > * Initialise the processor for turning the MMU on. Return in x0 the
> > * value of the SCTLR_EL1 register.
> > */
> > .pushsection ".idmap.text", "awx"
> > ENTRY(__cpu_setup)
> > tlbi vmalle1 // Invalidate local TLB
> > dsb nsh
>
> This is called from stext, which then branches to __primary_switch(), which
> calls __enable_mmu() where you see this problem. It shouldn't not be possible to
> allocate new tlb entries between these points...
>
> Do you have CONFIG_RANDOMIZE_BASE disabled? This causes enable_mmu() to be
> called twice, the extra tlb maintenance is in __primary_switch.
> (if it works with this turned off, it points to the extra off/tlbi/on sequence).
Yes, CONFIG_RANDOMIZE_BASE is NOT set.
>
>
> > diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> > index 4471f570a295..5196f3d729de 100644
> > --- a/arch/arm64/kernel/head.S
> > +++ b/arch/arm64/kernel/head.S
> > @@ -771,6 +771,10 @@ ENTRY(__enable_mmu)
> > msr ttbr0_el1, x2 // load TTBR0
> > msr ttbr1_el1, x1 // load TTBR1
> > isb
> > + dsb nshst
> > + tlbi vmalle1 // invalidate TLB
> > + dsb nsh
> > + isb
> > msr sctlr_el1, x0
> > isb
>
> The overall change here is that we do extra maintenance later.
>
> Can move this around to bisect where the TLB entries are either coming from, or
> failing-to-be invalidated?
> Do your first and kdump kernels have the same VA_BITS/PAGE_SIZE?
Yes,
CONFIG_ARM64_VA_BITS=48
CONFIG_ARM64_PAGE_SHIFT=16
# CONFIG_ARM64_4K_PAGES is not set
# CONFIG_ARM64_16K_PAGES is not set
CONFIG_ARM64_64K_PAGES=y
> As a stab in the dark, (totally untested):
> ------------------------------%<------------------------------
> diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
> index 2c75b0b903ae..a5f3b7314bda 100644
> --- a/arch/arm64/mm/proc.S
> +++ b/arch/arm64/mm/proc.S
> @@ -406,9 +406,6 @@ ENDPROC(idmap_kpti_install_ng_mappings)
> */
> .pushsection ".idmap.text", "awx"
> ENTRY(__cpu_setup)
> - tlbi vmalle1 // Invalidate local TLB
> - dsb nsh
> -
> mov x0, #3 << 20
> msr cpacr_el1, x0 // Enable FP/ASIMD
> mov x0, #1 << 12 // Reset mdscr_el1 and disable
> @@ -465,5 +462,10 @@ ENTRY(__cpu_setup)
> 1:
> #endif /* CONFIG_ARM64_HW_AFDBM */
> msr tcr_el1, x10
> + isb
> +
> + tlbi vmalle1 // Invalidate local TLB
> + dsb nsh
> +
> ret // return to head.S
> ENDPROC(__cpu_setup)
> ------------------------------%<------------------------------
>
This patch works well too.
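
For reference, the VA_BITS/PAGE_SIZE/KASLR comparison asked about above can be scripted. The following is only a minimal Python sketch and is not part of any patch in this thread: `parse_kconfig` and `diff_options` are hypothetical helper names, and the sketch assumes the .config text of both kernels is available as strings.

```python
import re

# Options discussed in this thread; extend the tuple as needed.
OPTIONS = (
    "CONFIG_ARM64_VA_BITS",
    "CONFIG_ARM64_PAGE_SHIFT",
    "CONFIG_ARM64_64K_PAGES",
    "CONFIG_RANDOMIZE_BASE",
)

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map option name to its value string; 'is not set' options map to None."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = None
    return options


def diff_options(first_cfg, kdump_cfg, names=OPTIONS):
    """Return {name: (first_value, kdump_value)} for the options that differ.

    An option that is absent is treated the same as one that is not set.
    """
    a, b = parse_kconfig(first_cfg), parse_kconfig(kdump_cfg)
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}
```

An empty result from `diff_options()` means the two kernels agree on the listed options, which is the situation reported above.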
On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash
dump just hangs. The machine has 4 threads per core. Each pair of cores
shares L1 and L2 caches, so 8 CPUs share those. All CPUs share the same
L3 cache.

It turned out that the TLB contained stale entries (or uninitialized
junk which just happened to look valid) before the MMU was turned on in
the second kernel, which caused this instruction to hang,

	msr	sctlr_el1, x0

Although there is a local TLB flush in the second kernel in
__cpu_setup(), it is called too early. By the time the MMU is turned on
later, the TLB is dirty again for some reason.

I also tried moving the local TLB flush around a bit inside
__cpu_setup(). Although kdump sometimes completed, it fairly often
triggered a "Synchronous Exception" in EFI after a cold reboot, and
there seems to be no way to recover from that remotely without
reinstalling the OS. For example, in these places,

ENTRY(__cpu_setup)
+	isb
	tlbi	vmalle1
	dsb	nsh

or

	mov	x0, #3 << 20
	msr	cpacr_el1, x0
+	tlbi	vmalle1
+	dsb	nsh

Since it is only necessary to flush the local TLB right before turning
the MMU on, just rearrange this part a bit, like the one in
__primary_switch() within the CONFIG_RANDOMIZE_BASE path, so it does not
depend on other instructions in between that could pollute the TLB, and
it no longer triggers the "Synchronous Exception" either.

Signed-off-by: Qian Cai <[email protected]>
---
v2: merge the similar part from __cpu_setup() pointed out by James.
arch/arm64/kernel/head.S | 4 ++++
arch/arm64/mm/proc.S | 3 ---
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 4471f570a295..7f555dd4577e 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -771,6 +771,10 @@ ENTRY(__enable_mmu)
 	msr	ttbr0_el1, x2			// load TTBR0
 	msr	ttbr1_el1, x1			// load TTBR1
 	isb
+
+	tlbi	vmalle1				// invalidate TLB
+	dsb	nsh
+
 	msr	sctlr_el1, x0
 	isb
 	/*
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 2c75b0b903ae..14f68afdd57f 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -406,9 +406,6 @@ ENDPROC(idmap_kpti_install_ng_mappings)
  */
 	.pushsection ".idmap.text", "awx"
 ENTRY(__cpu_setup)
-	tlbi	vmalle1				// Invalidate local TLB
-	dsb	nsh
-
 	mov	x0, #3 << 20
 	msr	cpacr_el1, x0			// Enable FP/ASIMD
 	mov	x0, #1 << 12			// Reset mdscr_el1 and disable
--
2.17.2 (Apple Git-113)
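
The ordering argument in the v2 commit message can be illustrated with a toy model. This is only a Python sketch under loud assumptions: the "TLB" is a plain set, `stale_at_mmu_enable` is a hypothetical name, and the "pollute" step stands in for whatever repopulates the TLB between __cpu_setup() and __enable_mmu(), which this thread has not identified. It says nothing about real hardware behaviour.

```python
def stale_at_mmu_enable(steps):
    """Replay boot steps against a toy TLB and report what is left at MMU enable.

    Each step is one of:
      ("pollute", tag): something allocates a TLB entry (cause is an assumption)
      ("tlbi",):        local invalidate, standing in for 'tlbi vmalle1; dsb nsh'
      ("mmu_on",):      stands in for 'msr sctlr_el1, x0'; replay stops here
    """
    tlb = {"stale-from-first-kernel"}  # state assumed to survive the kexec
    for step in steps:
        if step[0] == "pollute":
            tlb.add(step[1])
        elif step[0] == "tlbi":
            tlb.clear()
        elif step[0] == "mmu_on":
            return sorted(tlb)
        else:
            raise ValueError(f"unknown step: {step!r}")
    raise ValueError("sequence never enables the MMU")


# Before the patch: invalidate early in __cpu_setup(), enable the MMU later.
BEFORE = [("tlbi",), ("pollute", "unknown-source"), ("mmu_on",)]
# After the patch: invalidate immediately before the SCTLR_EL1 write.
AFTER = [("pollute", "unknown-source"), ("tlbi",), ("mmu_on",)]
```

In this model `stale_at_mmu_enable(BEFORE)` still reports the hypothetical entry while `stale_at_mmu_enable(AFTER)` reports none, which is the whole point of moving the invalidation next to the MMU enable.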
On Fri, Dec 14, 2018 at 9:39 AM Qian Cai <[email protected]> wrote:
>
> On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash
> dump just hung. It has 4 threads on each core. Each 2-core share a same
> L1 and L2 caches, so that is 8 CPUs shares those. All CPUs share a same
> L3 cache.
>
> It turned out that this was due to the TLB contained stale entries (or
> uninitialized junk which just happened to look valid) before turning the
> MMU on in the second kernel which caused this instruction hung,
>
> msr sctlr_el1, x0
>
> Although there is a local TLB flush in the second kernel in
> __cpu_setup(), it is called too early. When the time to turn the MMU on
> later, the TLB is dirty again from some reasons.
>
> Also tried to move the local TLB flush part around a bit inside
> __cpu_setup(), although it did complete kdump some times, it did trigger
> "Synchronous Exception" in EFI after a cold-reboot fairly often that
> seems no way to recover remotely without reinstalling the OS. For
> example, in those places,
>
> ENTRY(__cpu_setup)
> + isb
> tlbi vmalle1
> dsb nsh
>
> or
>
> mov x0, #3 << 20
> msr cpacr_el1, x0
> + tlbi vmalle1
> + dsb nsh
>
> Since it is only necessary to flush local TLB right before turning the
> MMU on, just re-arrage the part a bit like the one in __primary_switch()
> within CONFIG_RANDOMIZE_BASE path, so it does not depends on other
> instructions in between that could pollute the TLB, and it no longer
> trigger "Synchronous Exception" as well.
>
> Signed-off-by: Qian Cai <[email protected]>
> ---
>
> v2: merge the similar part from __cpu_setup() pointed out by James.
>
> arch/arm64/kernel/head.S | 4 ++++
> arch/arm64/mm/proc.S | 3 ---
> 2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 4471f570a295..7f555dd4577e 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -771,6 +771,10 @@ ENTRY(__enable_mmu)
> msr ttbr0_el1, x2 // load TTBR0
> msr ttbr1_el1, x1 // load TTBR1
> isb
> +
> + tlbi vmalle1 // invalidate TLB
> + dsb nsh
> +
> msr sctlr_el1, x0
> isb
> /*
> diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
> index 2c75b0b903ae..14f68afdd57f 100644
> --- a/arch/arm64/mm/proc.S
> +++ b/arch/arm64/mm/proc.S
> @@ -406,9 +406,6 @@ ENDPROC(idmap_kpti_install_ng_mappings)
> */
> .pushsection ".idmap.text", "awx"
> ENTRY(__cpu_setup)
> - tlbi vmalle1 // Invalidate local TLB
> - dsb nsh
> -
> mov x0, #3 << 20
> msr cpacr_el1, x0 // Enable FP/ASIMD
> mov x0, #1 << 12 // Reset mdscr_el1 and disable
> --
> 2.17.2 (Apple Git-113)
>
Not sure why I can't reproduce on my HPE Apollo machine, so a couple
of questions:
1. How many CPUs do you enable in the kdump kernel - do you pass
'nr_cpus=1' to the kdump kernel to limit the maximum number of cores
to 1 in the kdump kernel?
2. Which firmware version do you use on your board?
Thanks,
Bhupesh
On Fri, 14 Dec 2018 at 05:08, Qian Cai <[email protected]> wrote:
> Also tried to move the local TLB flush part around a bit inside
> __cpu_setup(), although it did complete kdump some times, it did trigger
> "Synchronous Exception" in EFI after a cold-reboot fairly often that
> seems no way to recover remotely without reinstalling the OS.
This doesn't make any sense to me. If the system gets into a weird
state out of cold reboot, how could this code be the culprit? Please
check your firmware, and try to reproduce the issue on a system that
doesn't have such defects.
On 12/14/18 12:01 AM, Bhupesh Sharma wrote:
> Not sure why I can't reproduce on my HPE Apollo machine, so a couple
> of questions:
> 1. How many CPUs do you enable in the kdump kernel - do you pass
> 'nr_cpus=1' to the kdump kernel to limit the maximum number of cores
> to 1 in the kdump kernel?
Yes
> 2. Which firmware version do you use on your board?
Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: L50_5.13_1.0.6
Release Date: 07/10/2018
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 64 MB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
ACPI is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 6.3
On 12/14/18 2:23 AM, Ard Biesheuvel wrote:
> On Fri, 14 Dec 2018 at 05:08, Qian Cai <[email protected]> wrote:
>> Also tried to move the local TLB flush part around a bit inside
>> __cpu_setup(), although it did complete kdump some times, it did trigger
>> "Synchronous Exception" in EFI after a cold-reboot fairly often that
>> seems no way to recover remotely without reinstalling the OS.
>
> This doesn't make any sense to me. If the system gets into a weird
> state out of cold reboot, how could this code be the culprit? Please
> check your firmware, and try to reproduce the issue on a system that
> doesn't have such defects.
>
I'll continue investigating those "Synchronous Exception" errors, although it is
kind of hard because I don't have any source code for the firmware to confirm
whether it is buggy or not.
I did manage to reproduce this kdump issue on around 5 of those servers running a
fairly recent version of the firmware (07/01/2018). I don't have access to other
machines with large CPU counts.
Hi Qian,
On Sat, Dec 15, 2018 at 7:24 AM Qian Cai <[email protected]> wrote:
>
> On 12/14/18 2:23 AM, Ard Biesheuvel wrote:
> > On Fri, 14 Dec 2018 at 05:08, Qian Cai <[email protected]> wrote:
> >> Also tried to move the local TLB flush part around a bit inside
> >> __cpu_setup(), although it did complete kdump some times, it did trigger
> >> "Synchronous Exception" in EFI after a cold-reboot fairly often that
> >> seems no way to recover remotely without reinstalling the OS.
> >
> > This doesn't make any sense to me. If the system gets into a weird
> > state out of cold reboot, how could this code be the culprit? Please
> > check your firmware, and try to reproduce the issue on a system that
> > doesn't have such defects.
> >
>
> I'll continue investigating those "Synchronous Exception" although it is kind of
> hard due to I don't have any source code of the firmware to confirm it is buggy
> or not.
>
> I did manage to reproduce this kdump issue on around 5 of those server running a
> fairly recent version of the firmware (07/01/2018). I don't have access to other
> large CPU machines.
Sorry, I got busy with some other stuff, but as I reported earlier, I
am not able to reproduce this on my HPE Apollo with the latest Linus
tree either.
Here are some details on my setup:
1. # uname -r
5.0.0-rc1+
with the following commit as the HEAD:
commit a88cc8da0279f8e481b0d90e51a0a1cffac55906 (HEAD -> master,
origin/master, origin/HEAD)
Merge: 9cb2feb4d21d 73444bc4d8f9
Author: Linus Torvalds <[email protected]>
Date: Tue Jan 8 18:58:29 2019 -0800
Merge branch 'akpm' (patches from Andrew)
2. I use the following kdump commandline:
Kernel command line: BOOT_IMAGE=(hd9,gpt2)/vmlinuz-5.0.0-rc1+ ro
irqpoll nr_cpus=1 swiotlb=noforce reset_devices
earlycon=pl011,mmio,0x402020000
3. I am able to run kdump successfully on the machine and also collect
the crash core properly:
.. snip..
kdump: saving to /sysroot//var/crash/127.0.0.1-2019-01-10-10:52:25/
kdump: saving vmcore-dmesg.txt
kdump: saving vmcore-dmesg.txt complete
kdump: saving vmcore
Copying data : [100.0 %] \
eta: 0s
kdump: saving vmcore complete
.. snip ..
4. I use the same firmware version on the board as you shared earlier:
# dmidecode | grep -A 20 -i "BIOS Information"
BIOS Information
Vendor: American Megatrends Inc.
Version: L50_5.13_1.0.6
Release Date: 07/10/2018
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 64 MB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
ACPI is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 6.3
So, I am guessing that it might be a kdump command line issue at your end.
Thanks,
Bhupesh
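
The firmware comparison in point 4 above can be done mechanically rather than by eye. The following is a minimal Python sketch, assuming the `dmidecode` output of each machine has been saved as text; `bios_info` and `same_firmware` are hypothetical helper names, and only the simple "Key: value" fields of the "BIOS Information" block are handled.

```python
def bios_info(dmidecode_text):
    """Extract 'Key: value' fields from the 'BIOS Information' block."""
    fields = {}
    in_block = False
    for raw in dmidecode_text.splitlines():
        line = raw.strip()
        if line == "BIOS Information":
            in_block = True
            continue
        if not in_block:
            continue
        if not line or line.startswith("Handle "):
            break  # a blank line or the next handle ends the block
        key, sep, value = line.partition(":")
        # Skip list headers such as "Characteristics:" and their items.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def same_firmware(a_text, b_text,
                  keys=("Vendor", "Version", "Release Date", "BIOS Revision")):
    """True when both dumps report identical values for every key in keys."""
    a, b = bios_info(a_text), bios_info(b_text)
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)
```

Matching firmware fields do not rule out other differences between the two machines (kernel configuration, kdump command line, CPU count), so this only narrows down one variable.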