2023-03-16 07:55:14

by Andrea Righi

Subject: kernel 6.2 stuck at boot (efi_call_rts) on arm64

Hello,

the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
gets stuck and never completes the boot. On the console I see this:

[ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
[ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
[ 72.064949] Task dump for CPU 22:
[ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
[ 72.078156] Workqueue: efi_rts_wq efi_call_rts
[ 72.082595] Call trace:
[ 72.085029] __switch_to+0xbc/0x100
[ 72.088508] 0xffff80000fe83d4c

After that, as a consequence, I start to get a lot of hung task timeout traces.

I tried to bisect the problem and I found that the offending commit is
this one:

e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")

I've reverted this commit for now and everything works just fine, but I
was wondering if the problem could be caused by a lack of entropy on
these arm64 boxes or something else.

Any suggestion? Let me know if you want me to do any specific test.

Thanks,
-Andrea


2023-03-16 07:58:56

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

Hello Andrea,

On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
>
> Hello,
>
> the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> gets stuck and never completes the boot. On the console I see this:
>
> [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> [ 72.064949] Task dump for CPU 22:
> [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> [ 72.082595] Call trace:
> [ 72.085029] __switch_to+0xbc/0x100
> [ 72.088508] 0xffff80000fe83d4c
>
> After that, as a consequence, I start to get a lot of hung task timeout traces.
>
> I tried to bisect the problem and I found that the offending commit is
> this one:
>
> e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
>
> I've reverted this commit for now and everything works just fine, but I
> was wondering if the problem could be caused by a lack of entropy on
> these arm64 boxes or something else.
>
> Any suggestion? Let me know if you want me to do any specific test.
>

Thanks for the report.

This is most likely the EFI SetVariable() call going off into the
weeds and never returning.

Is this an Ampere Altra system by any chance? Do you see it on
different types of hardware?

Could you check whether SetVariable works on this system? E.g. by
updating the EFI boot timeout (sudo efibootmgr -t <n>)?
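
A minimal sketch of how the check above could be run without tying up the terminal if the firmware call never returns. The helper name finishes_within and the 10-second limit are assumptions for illustration and are not part of efibootmgr or the kernel. The helper runs a command in the background and polls for a marker file instead of waiting on the process, because a task stuck in a runtime service call may not react to signals.

```shell
#!/bin/sh
# finishes_within SECS CMD [ARGS...]   (hypothetical helper, not a standard tool)
# Returns 0 if CMD completed within SECS seconds, 1 if it was still running,
# and 2 if the temporary directory could not be created.
finishes_within() {
    limit=$1
    shift
    dir=$(mktemp -d) || return 2
    marker="$dir/done"
    # The subshell writes the marker only after CMD has returned.
    ( "$@" >/dev/null 2>&1; echo "$?" >"$marker" ) 2>/dev/null &
    elapsed=0
    while [ "$elapsed" -lt "$limit" ]; do
        if [ -e "$marker" ]; then
            rm -rf "$dir"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # One last look after the final sleep.
    if [ -e "$marker" ]; then
        rm -rf "$dir"
        return 0
    fi
    # Still running: do not wait for it, just clean up and report.
    rm -rf "$dir"
    return 1
}
```

Run as root, it could be used like this (note that a successful call really does set the boot timeout to 10):

finishes_within 10 efibootmgr -t 10 || echo "SetVariable did not return within 10 s"

On a system where the call hangs, the background process is left behind and is not cleaned up by the helper.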

2023-03-16 09:45:17

by Thorsten Leemhuis

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See the link in the footer if these mails annoy you.]

On 16.03.23 08:54, Andrea Righi wrote:
> Hello,
>
> the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> gets stuck and never completes the boot. On the console I see this:
>
> [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> [ 72.064949] Task dump for CPU 22:
> [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> [ 72.082595] Call trace:
> [ 72.085029] __switch_to+0xbc/0x100
> [ 72.088508] 0xffff80000fe83d4c
>
> After that, as a consequence, I start to get a lot of hung task timeout traces.
>
> I tried to bisect the problem and I found that the offending commit is
> this one:
>
> e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
>
> I've reverted this commit for now and everything works just fine, but I
> was wondering if the problem could be caused by a lack of entropy on
> these arm64 boxes or something else.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced e7b813b32a42
#regzbot title efi: stuck at boot (efi_call_rts) on arm64
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-03-16 09:45:28

by Andrea Righi

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> Hello Andrea,
>
> On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> >
> > Hello,
> >
> > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > gets stuck and never completes the boot. On the console I see this:
> >
> > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > [ 72.064949] Task dump for CPU 22:
> > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > [ 72.082595] Call trace:
> > [ 72.085029] __switch_to+0xbc/0x100
> > [ 72.088508] 0xffff80000fe83d4c
> >
> > After that, as a consequence, I start to get a lot of hung task timeout traces.
> >
> > I tried to bisect the problem and I found that the offending commit is
> > this one:
> >
> > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> >
> > I've reverted this commit for now and everything works just fine, but I
> > was wondering if the problem could be caused by a lack of entropy on
> > these arm64 boxes or something else.
> >
> > Any suggestion? Let me know if you want me to do any specific test.
> >
>
> Thanks for the report.
>
> This is most likely the EFI SetVariable() call going off into the
> weeds and never returning.
>
> Is this an Ampere Altra system by any chance? Do you see it on
> different types of hardware?

This is: Ampere eMAG / Lenovo ThinkSystem HR330a.

>
> Could you check whether SetVariable works on this system? E.g. by
> updating the EFI boot timeout (sudo efibootmgr -t <n>)?

ubuntu@kuzzle:~$ sudo efibootmgr -t 10
^C^C^C^C

^ Stuck there, so it really looks like SetVariable is the problem.

Thanks,
-Andrea

2023-03-16 09:57:18

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

(cc Darren)

On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
>
> On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > Hello Andrea,
> >
> > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > gets stuck and never completes the boot. On the console I see this:
> > >
> > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > [ 72.064949] Task dump for CPU 22:
> > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > [ 72.082595] Call trace:
> > > [ 72.085029] __switch_to+0xbc/0x100
> > > [ 72.088508] 0xffff80000fe83d4c
> > >
> > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > >
> > > I tried to bisect the problem and I found that the offending commit is
> > > this one:
> > >
> > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > >
> > > I've reverted this commit for now and everything works just fine, but I
> > > was wondering if the problem could be caused by a lack of entropy on
> > > these arm64 boxes or something else.
> > >
> > > Any suggestion? Let me know if you want me to do any specific test.
> > >
> >
> > Thanks for the report.
> >
> > This is most likely the EFI SetVariable() call going off into the
> > weeds and never returning.
> >
> > Is this an Ampere Altra system by any chance? Do you see it on
> > different types of hardware?
>
> This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
>
> >
> > Could you check whether SetVariable works on this system? E.g. by
> > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
>
> ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> ^C^C^C^C
>
> ^ Stuck there, so it really looks like SetVariable is the problem.
>

Could you please share the output of

dmidecode -s bios
dmidecode -s system-family

Thanks,
Ard.

2023-03-16 10:03:12

by Andrea Righi

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> (cc Darren)
>
> On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
> >
> > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > Hello Andrea,
> > >
> > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > >
> > > > Hello,
> > > >
> > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > gets stuck and never completes the boot. On the console I see this:
> > > >
> > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > [ 72.064949] Task dump for CPU 22:
> > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > [ 72.082595] Call trace:
> > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > [ 72.088508] 0xffff80000fe83d4c
> > > >
> > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > >
> > > > I tried to bisect the problem and I found that the offending commit is
> > > > this one:
> > > >
> > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > >
> > > > I've reverted this commit for now and everything works just fine, but I
> > > > was wondering if the problem could be caused by a lack of entropy on
> > > > these arm64 boxes or something else.
> > > >
> > > > Any suggestion? Let me know if you want me to do any specific test.
> > > >
> > >
> > > Thanks for the report.
> > >
> > > This is most likely the EFI SetVariable() call going off into the
> > > weeds and never returning.
> > >
> > > Is this an Ampere Altra system by any chance? Do you see it on
> > > different types of hardware?
> >
> > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> >
> > >
> > > Could you check whether SetVariable works on this system? E.g. by
> > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> >
> > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > ^C^C^C^C
> >
> > ^ Stuck there, so it really looks like SetVariable is the problem.
> >
>
> Could you please share the output of
>
> dmidecode -s bios
> dmidecode -s system-family

$ sudo dmidecode -s bios-vendor
LENOVO
$ sudo dmidecode -s bios-version
hve104r-1.15
$ sudo dmidecode -s bios-release-date
02/26/2021
$ sudo dmidecode -s bios-revision
1.15
$ sudo dmidecode -s system-family
Lenovo ThinkSystem HR330A/HR350A

-Andrea

2023-03-16 10:18:40

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 11:03, Andrea Righi <[email protected]> wrote:
>
> On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> > (cc Darren)
> >
> > On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
> > >
> > > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > > Hello Andrea,
> > > >
> > > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > > gets stuck and never completes the boot. On the console I see this:
> > > > >
> > > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > > [ 72.064949] Task dump for CPU 22:
> > > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > > [ 72.082595] Call trace:
> > > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > > [ 72.088508] 0xffff80000fe83d4c
> > > > >
> > > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > > >
> > > > > I tried to bisect the problem and I found that the offending commit is
> > > > > this one:
> > > > >
> > > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > > >
> > > > > I've reverted this commit for now and everything works just fine, but I
> > > > > was wondering if the problem could be caused by a lack of entropy on
> > > > > these arm64 boxes or something else.
> > > > >
> > > > > Any suggestion? Let me know if you want me to do any specific test.
> > > > >
> > > >
> > > > Thanks for the report.
> > > >
> > > > This is most likely the EFI SetVariable() call going off into the
> > > > weeds and never returning.
> > > >
> > > > Is this an Ampere Altra system by any chance? Do you see it on
> > > > different types of hardware?
> > >
> > > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> > >
> > > >
> > > > Could you check whether SetVariable works on this system? E.g. by
> > > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> > >
> > > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > > ^C^C^C^C
> > >
> > > ^ Stuck there, so it really looks like SetVariable is the problem.
> > >
> >
> > Could you please share the output of
> >
> > dmidecode -s bios
> > dmidecode -s system-family
>
> $ sudo dmidecode -s bios-vendor
> LENOVO
> $ sudo dmidecode -s bios-version
> hve104r-1.15
> $ sudo dmidecode -s bios-release-date
> 02/26/2021
> $ sudo dmidecode -s bios-revision
> 1.15
> $ sudo dmidecode -s system-family
> Lenovo ThinkSystem HR330A/HR350A
>

Thanks

Mind checking if this patch fixes your issue as well?

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=altra-fix&id=77fa99dd4741456da85049c13ec31a148f5f5ac0

2023-03-16 11:34:48

by Andrea Righi

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 11:18:21AM +0100, Ard Biesheuvel wrote:
> On Thu, 16 Mar 2023 at 11:03, Andrea Righi <[email protected]> wrote:
> >
> > On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> > > (cc Darren)
> > >
> > > On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > > > Hello Andrea,
> > > > >
> > > > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > > > gets stuck and never completes the boot. On the console I see this:
> > > > > >
> > > > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > > > [ 72.064949] Task dump for CPU 22:
> > > > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > > > [ 72.082595] Call trace:
> > > > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > > > [ 72.088508] 0xffff80000fe83d4c
> > > > > >
> > > > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > > > >
> > > > > > I tried to bisect the problem and I found that the offending commit is
> > > > > > this one:
> > > > > >
> > > > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > > > >
> > > > > > I've reverted this commit for now and everything works just fine, but I
> > > > > > was wondering if the problem could be caused by a lack of entropy on
> > > > > > these arm64 boxes or something else.
> > > > > >
> > > > > > Any suggestion? Let me know if you want me to do any specific test.
> > > > > >
> > > > >
> > > > > Thanks for the report.
> > > > >
> > > > > This is most likely the EFI SetVariable() call going off into the
> > > > > weeds and never returning.
> > > > >
> > > > > Is this an Ampere Altra system by any chance? Do you see it on
> > > > > different types of hardware?
> > > >
> > > > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> > > >
> > > > >
> > > > > Could you check whether SetVariable works on this system? E.g. by
> > > > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> > > >
> > > > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > > > ^C^C^C^C
> > > >
> > > > ^ Stuck there, so it really looks like SetVariable is the problem.
> > > >
> > >
> > > Could you please share the output of
> > >
> > > dmidecode -s bios
> > > dmidecode -s system-family
> >
> > $ sudo dmidecode -s bios-vendor
> > LENOVO
> > $ sudo dmidecode -s bios-version
> > hve104r-1.15
> > $ sudo dmidecode -s bios-release-date
> > 02/26/2021
> > $ sudo dmidecode -s bios-revision
> > 1.15
> > $ sudo dmidecode -s system-family
> > Lenovo ThinkSystem HR330A/HR350A
> >
>
> Thanks
>
> Mind checking if this patch fixes your issue as well?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=altra-fix&id=77fa99dd4741456da85049c13ec31a148f5f5ac0

Unfortunately this doesn't seem to be enough, I'm still getting the same
problem also with this patch applied.

-Andrea

2023-03-16 12:21:55

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 12:34, Andrea Righi <[email protected]> wrote:
>
> On Thu, Mar 16, 2023 at 11:18:21AM +0100, Ard Biesheuvel wrote:
> > On Thu, 16 Mar 2023 at 11:03, Andrea Righi <[email protected]> wrote:
> > >
> > > On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> > > > (cc Darren)
> > > >
> > > > On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > > > > Hello Andrea,
> > > > > >
> > > > > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > > > > gets stuck and never completes the boot. On the console I see this:
> > > > > > >
> > > > > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > > > > [ 72.064949] Task dump for CPU 22:
> > > > > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > > > > [ 72.082595] Call trace:
> > > > > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > > > > [ 72.088508] 0xffff80000fe83d4c
> > > > > > >
> > > > > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > > > > >
> > > > > > > I tried to bisect the problem and I found that the offending commit is
> > > > > > > this one:
> > > > > > >
> > > > > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > > > > >
> > > > > > > I've reverted this commit for now and everything works just fine, but I
> > > > > > > was wondering if the problem could be caused by a lack of entropy on
> > > > > > > these arm64 boxes or something else.
> > > > > > >
> > > > > > > Any suggestion? Let me know if you want me to do any specific test.
> > > > > > >
> > > > > >
> > > > > > Thanks for the report.
> > > > > >
> > > > > > This is most likely the EFI SetVariable() call going off into the
> > > > > > weeds and never returning.
> > > > > >
> > > > > > Is this an Ampere Altra system by any chance? Do you see it on
> > > > > > different types of hardware?
> > > > >
> > > > > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> > > > >
> > > > > >
> > > > > > Could you check whether SetVariable works on this system? E.g. by
> > > > > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> > > > >
> > > > > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > > > > ^C^C^C^C
> > > > >
> > > > > ^ Stuck there, so it really looks like SetVariable is the problem.
> > > > >
> > > >
> > > > Could you please share the output of
> > > >
> > > > dmidecode -s bios
> > > > dmidecode -s system-family
> > >
> > > $ sudo dmidecode -s bios-vendor
> > > LENOVO
> > > $ sudo dmidecode -s bios-version
> > > hve104r-1.15
> > > $ sudo dmidecode -s bios-release-date
> > > 02/26/2021
> > > $ sudo dmidecode -s bios-revision
> > > 1.15
> > > $ sudo dmidecode -s system-family
> > > Lenovo ThinkSystem HR330A/HR350A
> > >
> >
> > Thanks
> >
> > Mind checking if this patch fixes your issue as well?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=altra-fix&id=77fa99dd4741456da85049c13ec31a148f5f5ac0
>
> Unfortunately this doesn't seem to be enough, I'm still getting the same
> problem also with this patch applied.
>

Thanks for trying.

How about the last 3 patches on this branch?

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=efi-smbios-altra-fix

2023-03-16 12:38:51

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 13:21, Ard Biesheuvel <[email protected]> wrote:
>
> On Thu, 16 Mar 2023 at 12:34, Andrea Righi <[email protected]> wrote:
> >
> > On Thu, Mar 16, 2023 at 11:18:21AM +0100, Ard Biesheuvel wrote:
> > > On Thu, 16 Mar 2023 at 11:03, Andrea Righi <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> > > > > (cc Darren)
> > > > >
> > > > > On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > > > > > Hello Andrea,
> > > > > > >
> > > > > > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > > > > > gets stuck and never completes the boot. On the console I see this:
> > > > > > > >
> > > > > > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > > > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > > > > > [ 72.064949] Task dump for CPU 22:
> > > > > > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > > > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > > > > > [ 72.082595] Call trace:
> > > > > > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > > > > > [ 72.088508] 0xffff80000fe83d4c
> > > > > > > >
> > > > > > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > > > > > >
> > > > > > > > I tried to bisect the problem and I found that the offending commit is
> > > > > > > > this one:
> > > > > > > >
> > > > > > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > > > > > >
> > > > > > > > I've reverted this commit for now and everything works just fine, but I
> > > > > > > > was wondering if the problem could be caused by a lack of entropy on
> > > > > > > > these arm64 boxes or something else.
> > > > > > > >
> > > > > > > > Any suggestion? Let me know if you want me to do any specific test.
> > > > > > > >
> > > > > > >
> > > > > > > Thanks for the report.
> > > > > > >
> > > > > > > This is most likely the EFI SetVariable() call going off into the
> > > > > > > weeds and never returning.
> > > > > > >
> > > > > > > Is this an Ampere Altra system by any chance? Do you see it on
> > > > > > > different types of hardware?
> > > > > >
> > > > > > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> > > > > >
> > > > > > >
> > > > > > > Could you check whether SetVariable works on this system? E.g. by
> > > > > > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> > > > > >
> > > > > > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > > > > > ^C^C^C^C
> > > > > >
> > > > > > ^ Stuck there, so it really looks like SetVariable is the problem.
> > > > > >
> > > > >
> > > > > Could you please share the output of
> > > > >
> > > > > dmidecode -s bios
> > > > > dmidecode -s system-family
> > > >
> > > > $ sudo dmidecode -s bios-vendor
> > > > LENOVO
> > > > $ sudo dmidecode -s bios-version
> > > > hve104r-1.15
> > > > $ sudo dmidecode -s bios-release-date
> > > > 02/26/2021
> > > > $ sudo dmidecode -s bios-revision
> > > > 1.15
> > > > $ sudo dmidecode -s system-family
> > > > Lenovo ThinkSystem HR330A/HR350A
> > > >
> > >
> > > Thanks
> > >
> > > Mind checking if this patch fixes your issue as well?
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=altra-fix&id=77fa99dd4741456da85049c13ec31a148f5f5ac0
> >
> > Unfortunately this doesn't seem to be enough, I'm still getting the same
> > problem also with this patch applied.
> >
>
> Thanks for trying.
>
> How about the last 3 patches on this branch?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=efi-smbios-altra-fix

Actually, that may not match your hardware.

Does your kernel log have a line like

SMCCC: SOC_ID: ID = jep106:036b:0019 Revision = 0x00000102

?

2023-03-16 12:41:55

by Andrea Righi

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 01:38:30PM +0100, Ard Biesheuvel wrote:
> On Thu, 16 Mar 2023 at 13:21, Ard Biesheuvel <[email protected]> wrote:
> >
> > On Thu, 16 Mar 2023 at 12:34, Andrea Righi <[email protected]> wrote:
> > >
> > > On Thu, Mar 16, 2023 at 11:18:21AM +0100, Ard Biesheuvel wrote:
> > > > On Thu, 16 Mar 2023 at 11:03, Andrea Righi <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> > > > > > (cc Darren)
> > > > > >
> > > > > > On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > > > > > > Hello Andrea,
> > > > > > > >
> > > > > > > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > > > > > > gets stuck and never completes the boot. On the console I see this:
> > > > > > > > >
> > > > > > > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > > > > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > > > > > > [ 72.064949] Task dump for CPU 22:
> > > > > > > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > > > > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > > > > > > [ 72.082595] Call trace:
> > > > > > > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > > > > > > [ 72.088508] 0xffff80000fe83d4c
> > > > > > > > >
> > > > > > > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > > > > > > >
> > > > > > > > > I tried to bisect the problem and I found that the offending commit is
> > > > > > > > > this one:
> > > > > > > > >
> > > > > > > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > > > > > > >
> > > > > > > > > I've reverted this commit for now and everything works just fine, but I
> > > > > > > > > was wondering if the problem could be caused by a lack of entropy on
> > > > > > > > > these arm64 boxes or something else.
> > > > > > > > >
> > > > > > > > > Any suggestion? Let me know if you want me to do any specific test.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Thanks for the report.
> > > > > > > >
> > > > > > > > This is most likely the EFI SetVariable() call going off into the
> > > > > > > > weeds and never returning.
> > > > > > > >
> > > > > > > > Is this an Ampere Altra system by any chance? Do you see it on
> > > > > > > > different types of hardware?
> > > > > > >
> > > > > > > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> > > > > > >
> > > > > > > >
> > > > > > > > Could you check whether SetVariable works on this system? E.g. by
> > > > > > > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> > > > > > >
> > > > > > > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > > > > > > ^C^C^C^C
> > > > > > >
> > > > > > > ^ Stuck there, so it really looks like SetVariable is the problem.
> > > > > > >
> > > > > >
> > > > > > Could you please share the output of
> > > > > >
> > > > > > dmidecode -s bios
> > > > > > dmidecode -s system-family
> > > > >
> > > > > $ sudo dmidecode -s bios-vendor
> > > > > LENOVO
> > > > > $ sudo dmidecode -s bios-version
> > > > > hve104r-1.15
> > > > > $ sudo dmidecode -s bios-release-date
> > > > > 02/26/2021
> > > > > $ sudo dmidecode -s bios-revision
> > > > > 1.15
> > > > > $ sudo dmidecode -s system-family
> > > > > Lenovo ThinkSystem HR330A/HR350A
> > > > >
> > > >
> > > > Thanks
> > > >
> > > > Mind checking if this patch fixes your issue as well?
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=altra-fix&id=77fa99dd4741456da85049c13ec31a148f5f5ac0
> > >
> > > Unfortunately this doesn't seem to be enough, I'm still getting the same
> > > problem also with this patch applied.
> > >
> >
> > Thanks for trying.
> >
> > How about the last 3 patches on this branch?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=efi-smbios-altra-fix
>
> Actually, that may not match your hardware.
>
> Does your kernel log have a line like
>
> SMCCC: SOC_ID: ID = jep106:036b:0019 Revision = 0x00000102
>
> ?

$ sudo dmesg | grep "SMCCC: SOC_ID"
[ 5.320782] SMCCC: SOC_ID: ARCH_SOC_ID not implemented, skipping ....

-Andrea

2023-03-16 12:43:54

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 13:41, Andrea Righi wrote:
>
> On Thu, Mar 16, 2023 at 01:38:30PM +0100, Ard Biesheuvel wrote:
> > On Thu, 16 Mar 2023 at 13:21, Ard Biesheuvel wrote:
> > >
> > > Thanks for trying.
> > >
> > > How about the last 3 patches on this branch?
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=efi-smbios-altra-fix
> >
> > Actually, that may not match your hardware.
> >
> > Does your kernel log have a line like
> >
> > SMCCC: SOC_ID: ID = jep106:036b:0019 Revision = 0x00000102
> >
> > ?
>
> $ sudo dmesg | grep "SMCCC: SOC_ID"
> [ 5.320782] SMCCC: SOC_ID: ARCH_SOC_ID not implemented, skipping ....
>

Thanks. Could you share the entire dmidecode output somewhere? Or at
least the type 4 record(s)?

2023-03-16 12:50:16

by Andrea Righi

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 01:43:32PM +0100, Ard Biesheuvel wrote:
> On Thu, 16 Mar 2023 at 13:41, Andrea Righi wrote:
> >
> > $ sudo dmesg | grep "SMCCC: SOC_ID"
> > [ 5.320782] SMCCC: SOC_ID: ARCH_SOC_ID not implemented, skipping ....
> >
>
> Thanks. Could you share the entire dmidecode output somewhere? Or at
> least the type 4 record(s)?

Sure, here's the full output of dmidecode:
https://pastebin.ubuntu.com/p/4ZmKmP2xTm/

-Andrea

2023-03-16 13:46:44

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 13:50, Andrea Righi wrote:
>
> On Thu, Mar 16, 2023 at 01:43:32PM +0100, Ard Biesheuvel wrote:
> > Thanks. Could you share the entire dmidecode output somewhere? Or at
> > least the type 4 record(s)?
>
> Sure, here's the full output of dmidecode:
> https://pastebin.ubuntu.com/p/4ZmKmP2xTm/
>

Thanks. I have updated my SMBIOS patches to take the processor version
'eMAG' into account, which appears to be what these boxes are using.

I have updated the efi/urgent branch here with the latest versions.
Mind giving them a spin?


In the meantime, just for the record, could you please run this as well?

hexdump -C /sys/firmware/dmi/entries/4-0/raw

(as root)

There seem to be eMAG boxes that put the type 4 ID in the wrong word
order, so I'd like to make sure we have a record of the binary
representation.
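
To illustrate what "wrong word order" would look like in that dump, here is a minimal Python sketch. It assumes the usual SMBIOS type 4 layout, with the 8-byte Processor ID at offset 0x08, read as two little-endian 32-bit words. The function names and the sample bytes are hypothetical and are not taken from this machine.

```python
import struct

PROCESSOR_ID_OFFSET = 0x08  # Processor ID field of an SMBIOS type 4 record


def processor_id_words(raw: bytes) -> tuple[int, int]:
    """Return the two little-endian 32-bit words of the type 4 Processor ID."""
    if len(raw) < PROCESSOR_ID_OFFSET + 8 or raw[0] != 4:
        raise ValueError("not a complete SMBIOS type 4 record")
    return struct.unpack_from("<II", raw, PROCESSOR_ID_OFFSET)


def describe(raw: bytes) -> str:
    """Format the ID in stored order and with the two words swapped."""
    first, second = processor_id_words(raw)
    return (f"as stored: {first:08x} {second:08x} / "
            f"swapped: {second:08x} {first:08x}")


# Hypothetical, truncated record: 8 leading bytes, then a made-up Processor ID.
sample = bytes([4, 0x10, 0, 0, 1, 3, 0, 2]) + struct.pack("<II", 0x11111111, 0x22222222)
print(describe(sample))
```

On a real system, `raw` would be the contents of the 4-0/raw file named above, and comparing the two renderings against the expected ID shows which order the firmware used.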

Thanks a lot for spending time on this.

2023-03-16 13:47:14

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 14:45, Ard Biesheuvel wrote:
>
> On Thu, 16 Mar 2023 at 13:50, Andrea Righi wrote:
> >
> > Sure, here's the full output of dmidecode:
> > https://pastebin.ubuntu.com/p/4ZmKmP2xTm/
> >
>
> Thanks. I have updated my SMBIOS patches to take the processor version
> 'eMAG' into account, which appears to be what these boxes are using.
>
> I have updated the efi/urgent branch here with the latest versions.
> Mind giving them a spin?
>

https://git.kernel.org/pub/scm/linux/kernel/git/efi/efi.git/log/?h=urgent

2023-03-16 13:50:57

by Andrea Righi

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 02:45:49PM +0100, Ard Biesheuvel wrote:
> On Thu, 16 Mar 2023 at 13:50, Andrea Righi wrote:
> >
> > Sure, here's the full output of dmidecode:
> > https://pastebin.ubuntu.com/p/4ZmKmP2xTm/
> >
>
> Thanks. I have updated my SMBIOS patches to take the processor version
> 'eMAG' into account, which appears to be what these boxes are using.
>
> I have updated the efi/urgent branch here with the latest versions.
> Mind giving them a spin?
>
>
> In the meantime, just for the record, could you please run this as well?
>
> hexdump -C /sys/firmware/dmi/entries/4-0/raw
>
> (as root)

hm.. I don't have that in /sys/firmware/, this is what I have:

# ls -l /sys/firmware/dmi/
total 0
drwxr-xr-x 2 root root 0 Mar 16 13:26 tables
# ls -l /sys/firmware/dmi/tables/
total 0
-r-------- 1 root root 5004 Mar 16 13:26 DMI
-r-------- 1 root root 24 Mar 16 13:26 smbios_entry_point
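
(The entries/ directory comes from the dmi-sysfs driver, which may simply not be loaded here.) The type 4 records should still be recoverable from tables/DMI, which holds the whole structure table. A minimal Python sketch, assuming the standard SMBIOS layout: a 4-byte header (type, length, handle), a formatted area of `length` bytes, then a string set terminated by a double NUL. The helper names are hypothetical.

```python
import struct


def iter_structures(table: bytes):
    """Yield (type, handle, formatted_area) for each SMBIOS structure."""
    offset = 0
    while offset + 4 <= len(table):
        stype, length, handle = struct.unpack_from("<BBH", table, offset)
        if length < 4 or offset + length > len(table):
            break  # malformed or truncated table
        yield stype, handle, table[offset:offset + length]
        # The string set after the formatted area ends with a double NUL.
        end = table.find(b"\x00\x00", offset + length)
        if end < 0 or stype == 127:  # type 127 marks end-of-table
            break
        offset = end + 2


def type4_records(table: bytes):
    """Return the formatted area of every type 4 (processor) record."""
    return [area for stype, _, area in iter_structures(table) if stype == 4]


def hexdump(data: bytes) -> str:
    """Render bytes as offset-prefixed rows of up to 16 hex values."""
    rows = []
    for i in range(0, len(data), 16):
        rows.append(f"{i:08x}  " + " ".join(f"{b:02x}" for b in data[i:i + 16]))
    return "\n".join(rows)


# Usage (as root), not executed here:
#   with open("/sys/firmware/dmi/tables/DMI", "rb") as f:
#       for record in type4_records(f.read()):
#           print(hexdump(record))
```

This only dumps the formatted area of each record, which is where the processor ID lives; the trailing strings are skipped.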

-Andrea

2023-03-16 13:53:56

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 14:50, Andrea Righi wrote:
>
> On Thu, Mar 16, 2023 at 02:45:49PM +0100, Ard Biesheuvel wrote:
> > On Thu, 16 Mar 2023 at 13:50, Andrea Righi wrote:
> > >
> > > On Thu, Mar 16, 2023 at 01:43:32PM +0100, Ard Biesheuvel wrote:
> > > > On Thu, 16 Mar 2023 at 13:41, Andrea Righi wrote:
> > > > >
> > > > > On Thu, Mar 16, 2023 at 01:38:30PM +0100, Ard Biesheuvel wrote:
> > > > > > On Thu, 16 Mar 2023 at 13:21, Ard Biesheuvel wrote:
> > > > > > >
> > > > > > > On Thu, 16 Mar 2023 at 12:34, Andrea Righi wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 16, 2023 at 11:18:21AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > On Thu, 16 Mar 2023 at 11:03, Andrea Righi wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > (cc Darren)
> > > > > > > > > > >
> > > > > > > > > > > > On Thu, 16 Mar 2023 at 10:45, Andrea Righi wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > > Hello Andrea,
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > > > > > > > > > > > gets stuck and never completes the boot. On the console I see this:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > > > > > > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > > > > > > > > > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > > > > > > > > > > > [ 72.064949] Task dump for CPU 22:
> > > > > > > > > > > > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > > > > > > > > > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > > > > > > > > > > > [ 72.082595] Call trace:
> > > > > > > > > > > > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > > > > > > > > > > > [ 72.088508] 0xffff80000fe83d4c
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I tried to bisect the problem and I found that the offending commit is
> > > > > > > > > > > > > > this one:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've reverted this commit for now and everything works just fine, but I
> > > > > > > > > > > > > > was wondering if the problem could be caused by a lack of entropy on
> > > > > > > > > > > > > > these arm64 boxes or something else.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Any suggestion? Let me know if you want me to do any specific test.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the report.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is most likely the EFI SetVariable() call going off into the
> > > > > > > > > > > > > weeds and never returning.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is this an Ampere Altra system by any chance? Do you see it on
> > > > > > > > > > > > > different types of hardware?
> > > > > > > > > > > >
> > > > > > > > > > > > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Could you check whether SetVariable works on this system? E.g. by
> > > > > > > > > > > > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> > > > > > > > > > > >
> > > > > > > > > > > > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > > > > > > > > > > > ^C^C^C^C
> > > > > > > > > > > >
> > > > > > > > > > > > ^ Stuck there, so it really looks like SetVariable is the problem.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Could you please share the output of
> > > > > > > > > > >
> > > > > > > > > > > dmidecode -s bios
> > > > > > > > > > > dmidecode -s system-family
> > > > > > > > > >
> > > > > > > > > > $ sudo dmidecode -s bios-vendor
> > > > > > > > > > LENOVO
> > > > > > > > > > $ sudo dmidecode -s bios-version
> > > > > > > > > > hve104r-1.15
> > > > > > > > > > $ sudo dmidecode -s bios-release-date
> > > > > > > > > > 02/26/2021
> > > > > > > > > > $ sudo dmidecode -s bios-revision
> > > > > > > > > > 1.15
> > > > > > > > > > $ sudo dmidecode -s system-family
> > > > > > > > > > Lenovo ThinkSystem HR330A/HR350A
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > Mind checking if this patch fixes your issue as well?
> > > > > > > > >
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=altra-fix&id=77fa99dd4741456da85049c13ec31a148f5f5ac0
> > > > > > > >
> > > > > > > > Unfortunately this doesn't seem to be enough, I'm still getting the same
> > > > > > > > problem also with this patch applied.
> > > > > > > >
> > > > > > >
> > > > > > > Thanks for trying.
> > > > > > >
> > > > > > > How about the last 3 patches on this branch?
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=efi-smbios-altra-fix
> > > > > >
> > > > > > Actually, that may not match your hardware.
> > > > > >
> > > > > > Does your kernel log have a line like
> > > > > >
> > > > > > SMCCC: SOC_ID: ID = jep106:036b:0019 Revision = 0x00000102
> > > > > >
> > > > > > ?
> > > > >
> > > > > $ sudo dmesg | grep "SMCCC: SOC_ID"
> > > > > [ 5.320782] SMCCC: SOC_ID: ARCH_SOC_ID not implemented, skipping ....
> > > > >
> > > >
> > > > Thanks. Could you share the entire dmidecode output somewhere? Or at
> > > > least the type 4 record(s)?
> > >
> > > Sure, here's the full output of dmidecode:
> > > https://pastebin.ubuntu.com/p/4ZmKmP2xTm/
> > >
> >
> > Thanks. I have updated my SMBIOS patches to take the processor version
> > 'eMAG' into account, which appears to be what these boxes are using.
> >
> > I have updated the efi/urgent branch here with the latest versions.
> > Mind giving them a spin?
> >
> >
> > In the mean time, just for the record - could you please run this as well?
> >
> > hexdump -C /sys/firmware/dmi/entries/4-0/raw
> >
> > (as root)
>
> hm.. I don't have that in /sys/firmware/, this is what I have:
>
> # ls -l /sys/firmware/dmi/
> total 0
> drwxr-xr-x 2 root root 0 Mar 16 13:26 tables
> # ls -l /sys/firmware/dmi/tables/
> total 0
> -r-------- 1 root root 5004 Mar 16 13:26 DMI
> -r-------- 1 root root 24 Mar 16 13:26 smbios_entry_point
>

You'll need to load the dmi_sysfs module for that. But no big deal
otherwise, I'm pretty sure the word order is the correct one on your
system in any case (it decodes the value correctly in the next line)

2023-03-16 14:00:00

by Andrea Righi

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 02:53:24PM +0100, Ard Biesheuvel wrote:
> On Thu, 16 Mar 2023 at 14:50, Andrea Righi <[email protected]> wrote:
> >
> > On Thu, Mar 16, 2023 at 02:45:49PM +0100, Ard Biesheuvel wrote:
> > > On Thu, 16 Mar 2023 at 13:50, Andrea Righi <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 16, 2023 at 01:43:32PM +0100, Ard Biesheuvel wrote:
> > > > > On Thu, 16 Mar 2023 at 13:41, Andrea Righi <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Mar 16, 2023 at 01:38:30PM +0100, Ard Biesheuvel wrote:
> > > > > > > On Thu, 16 Mar 2023 at 13:21, Ard Biesheuvel <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Thu, 16 Mar 2023 at 12:34, Andrea Righi <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Mar 16, 2023 at 11:18:21AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > On Thu, 16 Mar 2023 at 11:03, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > (cc Darren)
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > > > Hello Andrea,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > > > > > > > > > > > > gets stuck and never completes the boot. On the console I see this:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > > > > > > > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > > > > > > > > > > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > > > > > > > > > > > > [ 72.064949] Task dump for CPU 22:
> > > > > > > > > > > > > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > > > > > > > > > > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > > > > > > > > > > > > [ 72.082595] Call trace:
> > > > > > > > > > > > > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > > > > > > > > > > > > [ 72.088508] 0xffff80000fe83d4c
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I tried to bisect the problem and I found that the offending commit is
> > > > > > > > > > > > > > > this one:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I've reverted this commit for now and everything works just fine, but I
> > > > > > > > > > > > > > > was wondering if the problem could be caused by a lack of entropy on
> > > > > > > > > > > > > > > these arm64 boxes or something else.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Any suggestion? Let me know if you want me to do any specific test.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the report.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is most likely the EFI SetVariable() call going off into the
> > > > > > > > > > > > > > weeds and never returning.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is this an Ampere Altra system by any chance? Do you see it on
> > > > > > > > > > > > > > different types of hardware?
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Could you check whether SetVariable works on this system? E.g. by
> > > > > > > > > > > > > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> > > > > > > > > > > > >
> > > > > > > > > > > > > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > > > > > > > > > > > > ^C^C^C^C
> > > > > > > > > > > > >
> > > > > > > > > > > > > ^ Stuck there, so it really looks like SetVariable is the problem.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Could you please share the output of
> > > > > > > > > > > >
> > > > > > > > > > > > dmidecode -s bios
> > > > > > > > > > > > dmidecode -s system-family
> > > > > > > > > > >
> > > > > > > > > > > $ sudo dmidecode -s bios-vendor
> > > > > > > > > > > LENOVO
> > > > > > > > > > > $ sudo dmidecode -s bios-version
> > > > > > > > > > > hve104r-1.15
> > > > > > > > > > > $ sudo dmidecode -s bios-release-date
> > > > > > > > > > > 02/26/2021
> > > > > > > > > > > $ sudo dmidecode -s bios-revision
> > > > > > > > > > > 1.15
> > > > > > > > > > > $ sudo dmidecode -s system-family
> > > > > > > > > > > Lenovo ThinkSystem HR330A/HR350A
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > Mind checking if this patch fixes your issue as well?
> > > > > > > > > >
> > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=altra-fix&id=77fa99dd4741456da85049c13ec31a148f5f5ac0
> > > > > > > > >
> > > > > > > > > Unfortunately this doesn't seem to be enough, I'm still getting the same
> > > > > > > > > problem also with this patch applied.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Thanks for trying.
> > > > > > > >
> > > > > > > > How about the last 3 patches on this branch?
> > > > > > > >
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=efi-smbios-altra-fix
> > > > > > >
> > > > > > > Actually, that may not match your hardware.
> > > > > > >
> > > > > > > Does your kernel log have a line like
> > > > > > >
> > > > > > > SMCCC: SOC_ID: ID = jep106:036b:0019 Revision = 0x00000102
> > > > > > >
> > > > > > > ?
> > > > > >
> > > > > > $ sudo dmesg | grep "SMCCC: SOC_ID"
> > > > > > [ 5.320782] SMCCC: SOC_ID: ARCH_SOC_ID not implemented, skipping ....
> > > > > >
> > > > >
> > > > > Thanks. Could you share the entire dmidecode output somewhere? Or at
> > > > > least the type 4 record(s)?
> > > >
> > > > Sure, here's the full output of dmidecode:
> > > > https://pastebin.ubuntu.com/p/4ZmKmP2xTm/
> > > >
> > >
> > > Thanks. I have updated my SMBIOS patches to take the processor version
> > > 'eMAG' into account, which appears to be what these boxes are using.
> > >
> > > I have updated the efi/urgent branch here with the latest versions.
> > > Mind giving them a spin?
> > >
> > >
> > > In the mean time, just for the record - could you please run this as well?
> > >
> > > hexdump -C /sys/firmware/dmi/entries/4-0/raw
> > >
> > > (as root)
> >
> > hm.. I don't have that in /sys/firmware/, this is what I have:
> >
> > # ls -l /sys/firmware/dmi/
> > total 0
> > drwxr-xr-x 2 root root 0 Mar 16 13:26 tables
> > # ls -l /sys/firmware/dmi/tables/
> > total 0
> > -r-------- 1 root root 5004 Mar 16 13:26 DMI
> > -r-------- 1 root root 24 Mar 16 13:26 smbios_entry_point
> >
>
> You'll need to load the dmi_sysfs module for that. But no big deal
> otherwise, I'm pretty sure the word order is the correct on on your
> system in any case (it decodes the value correctly in the next line)

ok, much better after modprobe dmi_sysfs. :)

$ sudo hexdump -C /sys/firmware/dmi/entries/4-0/raw
00000000  04 30 04 00 01 03 fe 02  02 00 3f 50 00 00 00 00  |.0........?P....|
00000010  03 89 b8 0b e4 0c b8 0b  41 06 05 00 06 00 07 00  |........A.......|
00000020  04 00 00 20 20 20 7c 00  01 01 00 00 00 00 00 00  |...   |.........|
00000030  43 50 55 20 31 00 41 6d  70 65 72 65 28 54 4d 29  |CPU 1.Ampere(TM)|
00000040  00 65 4d 41 47 20 00 30  30 30 30 30 30 30 30 30  |.eMAG .000000000|
00000050  30 30 30 30 30 30 30 35  30 30 35 30 31 30 35 30  |0000000500501050|
00000060  32 46 42 30 39 38 38 00  55 6e 6b 6e 6f 77 6e 00  |2FB0988.Unknown.|
00000070  55 6e 6b 6e 6f 77 6e 00  00                       |Unknown..|
00000079

-Andrea

2023-03-16 14:07:18

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 14:59, Andrea Righi <[email protected]> wrote:
>
> On Thu, Mar 16, 2023 at 02:53:24PM +0100, Ard Biesheuvel wrote:
> > On Thu, 16 Mar 2023 at 14:50, Andrea Righi <[email protected]> wrote:
> > >
> > > On Thu, Mar 16, 2023 at 02:45:49PM +0100, Ard Biesheuvel wrote:
> > > > On Thu, 16 Mar 2023 at 13:50, Andrea Righi <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 16, 2023 at 01:43:32PM +0100, Ard Biesheuvel wrote:
> > > > > > On Thu, 16 Mar 2023 at 13:41, Andrea Righi <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 16, 2023 at 01:38:30PM +0100, Ard Biesheuvel wrote:
> > > > > > > > On Thu, 16 Mar 2023 at 13:21, Ard Biesheuvel <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, 16 Mar 2023 at 12:34, Andrea Righi <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 16, 2023 at 11:18:21AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > On Thu, 16 Mar 2023 at 11:03, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > > (cc Darren)
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > > > > Hello Andrea,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > > > > > > > > > > > > > gets stuck and never completes the boot. On the console I see this:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > > > > > > > > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > > > > > > > > > > > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > > > > > > > > > > > > > [ 72.064949] Task dump for CPU 22:
> > > > > > > > > > > > > > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > > > > > > > > > > > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > > > > > > > > > > > > > [ 72.082595] Call trace:
> > > > > > > > > > > > > > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > > > > > > > > > > > > > [ 72.088508] 0xffff80000fe83d4c
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I tried to bisect the problem and I found that the offending commit is
> > > > > > > > > > > > > > > > this one:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I've reverted this commit for now and everything works just fine, but I
> > > > > > > > > > > > > > > > was wondering if the problem could be caused by a lack of entropy on
> > > > > > > > > > > > > > > > these arm64 boxes or something else.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Any suggestion? Let me know if you want me to do any specific test.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the report.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This is most likely the EFI SetVariable() call going off into the
> > > > > > > > > > > > > > > weeds and never returning.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Is this an Ampere Altra system by any chance? Do you see it on
> > > > > > > > > > > > > > > different types of hardware?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Could you check whether SetVariable works on this system? E.g. by
> > > > > > > > > > > > > > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > > > > > > > > > > > > > ^C^C^C^C
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ^ Stuck there, so it really looks like SetVariable is the problem.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Could you please share the output of
> > > > > > > > > > > > >
> > > > > > > > > > > > > dmidecode -s bios
> > > > > > > > > > > > > dmidecode -s system-family
> > > > > > > > > > > >
> > > > > > > > > > > > $ sudo dmidecode -s bios-vendor
> > > > > > > > > > > > LENOVO
> > > > > > > > > > > > $ sudo dmidecode -s bios-version
> > > > > > > > > > > > hve104r-1.15
> > > > > > > > > > > > $ sudo dmidecode -s bios-release-date
> > > > > > > > > > > > 02/26/2021
> > > > > > > > > > > > $ sudo dmidecode -s bios-revision
> > > > > > > > > > > > 1.15
> > > > > > > > > > > > $ sudo dmidecode -s system-family
> > > > > > > > > > > > Lenovo ThinkSystem HR330A/HR350A
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > >
> > > > > > > > > > > Mind checking if this patch fixes your issue as well?
> > > > > > > > > > >
> > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=altra-fix&id=77fa99dd4741456da85049c13ec31a148f5f5ac0
> > > > > > > > > >
> > > > > > > > > > Unfortunately this doesn't seem to be enough, I'm still getting the same
> > > > > > > > > > problem also with this patch applied.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks for trying.
> > > > > > > > >
> > > > > > > > > How about the last 3 patches on this branch?
> > > > > > > > >
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=efi-smbios-altra-fix
> > > > > > > >
> > > > > > > > Actually, that may not match your hardware.
> > > > > > > >
> > > > > > > > Does your kernel log have a line like
> > > > > > > >
> > > > > > > > SMCCC: SOC_ID: ID = jep106:036b:0019 Revision = 0x00000102
> > > > > > > >
> > > > > > > > ?
> > > > > > >
> > > > > > > $ sudo dmesg | grep "SMCCC: SOC_ID"
> > > > > > > [ 5.320782] SMCCC: SOC_ID: ARCH_SOC_ID not implemented, skipping ....
> > > > > > >
> > > > > >
> > > > > > Thanks. Could you share the entire dmidecode output somewhere? Or at
> > > > > > least the type 4 record(s)?
> > > > >
> > > > > Sure, here's the full output of dmidecode:
> > > > > https://pastebin.ubuntu.com/p/4ZmKmP2xTm/
> > > > >
> > > >
> > > > Thanks. I have updated my SMBIOS patches to take the processor version
> > > > 'eMAG' into account, which appears to be what these boxes are using.
> > > >
> > > > I have updated the efi/urgent branch here with the latest versions.
> > > > Mind giving them a spin?
> > > >
> > > >
> > > > In the mean time, just for the record - could you please run this as well?
> > > >
> > > > hexdump -C /sys/firmware/dmi/entries/4-0/raw
> > > >
> > > > (as root)
> > >
> > > hm.. I don't have that in /sys/firmware/, this is what I have:
> > >
> > > # ls -l /sys/firmware/dmi/
> > > total 0
> > > drwxr-xr-x 2 root root 0 Mar 16 13:26 tables
> > > # ls -l /sys/firmware/dmi/tables/
> > > total 0
> > > -r-------- 1 root root 5004 Mar 16 13:26 DMI
> > > -r-------- 1 root root 24 Mar 16 13:26 smbios_entry_point
> > >
> >
> > You'll need to load the dmi_sysfs module for that. But no big deal
> > otherwise, I'm pretty sure the word order is the correct on on your
> > system in any case (it decodes the value correctly in the next line)
>
> ok, much better after modprobe dmi_sysfs. :)
>

Yeah better, thanks.

> $ sudo hexdump -C /sys/firmware/dmi/entries/4-0/raw
> 00000000 04 30 04 00 01 03 fe 02 02 00 3f 50 00 00 00 00 |.0........?P....|
> 00000010 03 89 b8 0b e4 0c b8 0b 41 06 05 00 06 00 07 00 |........A.......|
> 00000020 04 00 00 20 20 20 7c 00 01 01 00 00 00 00 00 00 |... |.........|
> 00000030 43 50 55 20 31 00 41 6d 70 65 72 65 28 54 4d 29 |CPU 1.Ampere(TM)|
> 00000040 00 65 4d 41 47 20 00 30 30 30 30 30 30 30 30 30 |.eMAG .000000000|

Darn, this means we have to match for "eMAG " (with the trailing
space) so the branch I just pushed needs to be updated for this.

> 00000050 30 30 30 30 30 30 30 35 30 30 35 30 31 30 35 30 |0000000500501050|
> 00000060 32 46 42 30 39 38 38 00 55 6e 6b 6e 6f 77 6e 00 |2FB0988.Unknown.|
> 00000070 55 6e 6b 6e 6f 77 6e 00 00 |Unknown..|
> 00000079
>
> -Andrea

2023-03-16 14:09:22

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 15:06, Ard Biesheuvel <[email protected]> wrote:
>
> On Thu, 16 Mar 2023 at 14:59, Andrea Righi <[email protected]> wrote:
> >
> > On Thu, Mar 16, 2023 at 02:53:24PM +0100, Ard Biesheuvel wrote:
> > > On Thu, 16 Mar 2023 at 14:50, Andrea Righi <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 16, 2023 at 02:45:49PM +0100, Ard Biesheuvel wrote:
> > > > > On Thu, 16 Mar 2023 at 13:50, Andrea Righi <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Mar 16, 2023 at 01:43:32PM +0100, Ard Biesheuvel wrote:
> > > > > > > On Thu, 16 Mar 2023 at 13:41, Andrea Righi <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 16, 2023 at 01:38:30PM +0100, Ard Biesheuvel wrote:
> > > > > > > > > On Thu, 16 Mar 2023 at 13:21, Ard Biesheuvel <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, 16 Mar 2023 at 12:34, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 16, 2023 at 11:18:21AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > On Thu, 16 Mar 2023 at 11:03, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > > > (cc Darren)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > > > > > Hello Andrea,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > > > > > > > > > > > > > > gets stuck and never completes the boot. On the console I see this:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > > > > > > > > > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > > > > > > > > > > > > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > > > > > > > > > > > > > > [ 72.064949] Task dump for CPU 22:
> > > > > > > > > > > > > > > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > > > > > > > > > > > > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > > > > > > > > > > > > > > [ 72.082595] Call trace:
> > > > > > > > > > > > > > > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > > > > > > > > > > > > > > [ 72.088508] 0xffff80000fe83d4c
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I tried to bisect the problem and I found that the offending commit is
> > > > > > > > > > > > > > > > > this one:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I've reverted this commit for now and everything works just fine, but I
> > > > > > > > > > > > > > > > > was wondering if the problem could be caused by a lack of entropy on
> > > > > > > > > > > > > > > > > these arm64 boxes or something else.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Any suggestion? Let me know if you want me to do any specific test.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the report.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This is most likely the EFI SetVariable() call going off into the
> > > > > > > > > > > > > > > > weeds and never returning.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Is this an Ampere Altra system by any chance? Do you see it on
> > > > > > > > > > > > > > > > different types of hardware?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Could you check whether SetVariable works on this system? E.g. by
> > > > > > > > > > > > > > > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > > > > > > > > > > > > > > ^C^C^C^C
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ^ Stuck there, so it really looks like SetVariable is the problem.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Could you please share the output of
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > dmidecode -s bios
> > > > > > > > > > > > > > dmidecode -s system-family
> > > > > > > > > > > > >
> > > > > > > > > > > > > $ sudo dmidecode -s bios-vendor
> > > > > > > > > > > > > LENOVO
> > > > > > > > > > > > > $ sudo dmidecode -s bios-version
> > > > > > > > > > > > > hve104r-1.15
> > > > > > > > > > > > > $ sudo dmidecode -s bios-release-date
> > > > > > > > > > > > > 02/26/2021
> > > > > > > > > > > > > $ sudo dmidecode -s bios-revision
> > > > > > > > > > > > > 1.15
> > > > > > > > > > > > > $ sudo dmidecode -s system-family
> > > > > > > > > > > > > Lenovo ThinkSystem HR330A/HR350A
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > >
> > > > > > > > > > > > Mind checking if this patch fixes your issue as well?
> > > > > > > > > > > >
> > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=altra-fix&id=77fa99dd4741456da85049c13ec31a148f5f5ac0
> > > > > > > > > > >
> > > > > > > > > > > Unfortunately this doesn't seem to be enough, I'm still getting the same
> > > > > > > > > > > problem also with this patch applied.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks for trying.
> > > > > > > > > >
> > > > > > > > > > How about the last 3 patches on this branch?
> > > > > > > > > >
> > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=efi-smbios-altra-fix
> > > > > > > > >
> > > > > > > > > Actually, that may not match your hardware.
> > > > > > > > >
> > > > > > > > > Does your kernel log have a line like
> > > > > > > > >
> > > > > > > > > SMCCC: SOC_ID: ID = jep106:036b:0019 Revision = 0x00000102
> > > > > > > > >
> > > > > > > > > ?
> > > > > > > >
> > > > > > > > $ sudo dmesg | grep "SMCCC: SOC_ID"
> > > > > > > > [ 5.320782] SMCCC: SOC_ID: ARCH_SOC_ID not implemented, skipping ....
> > > > > > > >
> > > > > > >
> > > > > > > Thanks. Could you share the entire dmidecode output somewhere? Or at
> > > > > > > least the type 4 record(s)?
> > > > > >
> > > > > > Sure, here's the full output of dmidecode:
> > > > > > https://pastebin.ubuntu.com/p/4ZmKmP2xTm/
> > > > > >
> > > > >
> > > > > Thanks. I have updated my SMBIOS patches to take the processor version
> > > > > 'eMAG' into account, which appears to be what these boxes are using.
> > > > >
> > > > > I have updated the efi/urgent branch here with the latest versions.
> > > > > Mind giving them a spin?
> > > > >
> > > > >
> > > > > In the mean time, just for the record - could you please run this as well?
> > > > >
> > > > > hexdump -C /sys/firmware/dmi/entries/4-0/raw
> > > > >
> > > > > (as root)
> > > >
> > > > hm.. I don't have that in /sys/firmware/, this is what I have:
> > > >
> > > > # ls -l /sys/firmware/dmi/
> > > > total 0
> > > > drwxr-xr-x 2 root root 0 Mar 16 13:26 tables
> > > > # ls -l /sys/firmware/dmi/tables/
> > > > total 0
> > > > -r-------- 1 root root 5004 Mar 16 13:26 DMI
> > > > -r-------- 1 root root 24 Mar 16 13:26 smbios_entry_point
> > > >
> > >
> > > You'll need to load the dmi_sysfs module for that. But no big deal
> > > otherwise, I'm pretty sure the word order is the correct on on your
> > > system in any case (it decodes the value correctly in the next line)
> >
> > ok, much better after modprobe dmi_sysfs. :)
> >
>
> Yeah better, thanks.
>
> > $ sudo hexdump -C /sys/firmware/dmi/entries/4-0/raw
> > 00000000  04 30 04 00 01 03 fe 02  02 00 3f 50 00 00 00 00  |.0........?P....|
> > 00000010  03 89 b8 0b e4 0c b8 0b  41 06 05 00 06 00 07 00  |........A.......|
> > 00000020  04 00 00 20 20 20 7c 00  01 01 00 00 00 00 00 00  |...   |.........|
> > 00000030  43 50 55 20 31 00 41 6d  70 65 72 65 28 54 4d 29  |CPU 1.Ampere(TM)|
> > 00000040  00 65 4d 41 47 20 00 30  30 30 30 30 30 30 30 30  |.eMAG .000000000|
>
> Darn, this means we have to match for "eMAG " (with the trailing
> space) so the branch I just pushed needs to be updated for this.
>

I.e.,

--- a/drivers/firmware/efi/libstub/arm64.c
+++ b/drivers/firmware/efi/libstub/arm64.c
@@ -36,7 +36,7 @@ static bool system_needs_vamap(void)
 	default:
 		version = efi_get_smbios_string(&record->header, 4,
 						processor_version);
-		if (!version || strcmp(version, "eMAG"))
+		if (!version || strncmp(version, "eMAG", 4))
 			break;

 		fallthrough;
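
For readers following along, the effect of that one-line change can be sketched in user space. This is only an illustration, not the libstub code: the two helper names below are made up for the example, and the only thing taken from the thread is the pair of comparisons in the diff and the "eMAG " string (with trailing space) seen in the hexdump.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helpers, named for this example only.
 *
 * Exact comparison, mirroring the line the patch removes:
 * "eMAG " (trailing space) is not equal to "eMAG".
 */
static bool version_is_emag_exact(const char *version)
{
	return version && strcmp(version, "eMAG") == 0;
}

/*
 * Prefix comparison over the first 4 bytes, mirroring the line the
 * patch adds: anything that starts with "eMAG" matches.
 */
static bool version_is_emag_prefix(const char *version)
{
	return version && strncmp(version, "eMAG", 4) == 0;
}
```

With the firmware string "eMAG ", version_is_emag_exact() returns false and version_is_emag_prefix() returns true. The prefix form is also looser than matching "eMAG " literally: it would accept any processor version that merely begins with those four characters, which is presumably acceptable here but worth keeping in mind.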

2023-03-16 14:25:28

by Andrea Righi

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 03:08:53PM +0100, Ard Biesheuvel wrote:
...
> I.e.,
>
> --- a/drivers/firmware/efi/libstub/arm64.c
> +++ b/drivers/firmware/efi/libstub/arm64.c
> @@ -36,7 +36,7 @@ static bool system_needs_vamap(void)
> default:
> version = efi_get_smbios_string(&record->header, 4,
> processor_version);
> - if (!version || strcmp(version, "eMAG"))
> + if (!version || strncmp(version, "eMAG", 4))
> break;
>
> fallthrough;

OK, I can add that and test it.

-Andrea
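
As a side note on how the trailing space was spotted: the type 4 record dumped earlier in the thread can be decoded by hand. The sketch below is a user-space illustration under stated assumptions, not the kernel's efi_get_smbios_string(). The function names are invented for the example, the byte array is copied from the hexdump (which is cut off at 0x50, so the fourth string is incomplete), and the Processor Version string index is assumed to live at offset 0x10 of the formatted area, as laid out in the SMBIOS specification.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* First 0x50 bytes of /sys/firmware/dmi/entries/4-0/raw from the thread. */
static const uint8_t emag_type4[] = {
	0x04, 0x30, 0x04, 0x00, 0x01, 0x03, 0xfe, 0x02,
	0x02, 0x00, 0x3f, 0x50, 0x00, 0x00, 0x00, 0x00,
	0x03, 0x89, 0xb8, 0x0b, 0xe4, 0x0c, 0xb8, 0x0b,
	0x41, 0x06, 0x05, 0x00, 0x06, 0x00, 0x07, 0x00,
	0x04, 0x00, 0x00, 0x20, 0x20, 0x20, 0x7c, 0x00,
	0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x43, 0x50, 0x55, 0x20, 0x31, 0x00, 0x41, 0x6d,
	0x70, 0x65, 0x72, 0x65, 0x28, 0x54, 0x4d, 0x29,
	0x00, 0x65, 0x4d, 0x41, 0x47, 0x20, 0x00, 0x30,
	0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30,
};

/*
 * Return the idx-th (1-based) string that follows the formatted area
 * of an SMBIOS record, or NULL if it is absent or not NUL-terminated
 * within the len bytes available. rec[1] holds the formatted length.
 */
static const char *smbios_string(const uint8_t *rec, size_t len, uint8_t idx)
{
	size_t off;

	if (len < 2 || idx == 0)
		return NULL;

	off = rec[1];
	while (off < len && rec[off] != 0) {
		size_t end = off;

		while (end < len && rec[end] != 0)
			end++;
		if (end == len)
			return NULL;	/* string runs past the buffer */
		if (--idx == 0)
			return (const char *)&rec[off];
		off = end + 1;
	}
	return NULL;
}

/* Processor Version of a type 4 record (string index at offset 0x10). */
static const char *processor_version(const uint8_t *rec, size_t len)
{
	if (len <= 0x10 || rec[0] != 4 || rec[1] <= 0x10)
		return NULL;
	return smbios_string(rec, len, rec[0x10]);
}
```

Calling processor_version(emag_type4, sizeof(emag_type4)) walks past "CPU 1" and "Ampere(TM)" and lands on the third string, whose bytes are 65 4d 41 47 20 00, i.e. "eMAG" followed by a space.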

2023-03-16 17:53:03

by Andrea Righi

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 03:08:53PM +0100, Ard Biesheuvel wrote:
...
> I.e.,
>
> --- a/drivers/firmware/efi/libstub/arm64.c
> +++ b/drivers/firmware/efi/libstub/arm64.c
> @@ -36,7 +36,7 @@ static bool system_needs_vamap(void)
> default:
> version = efi_get_smbios_string(&record->header, 4,
> processor_version);
> - if (!version || strcmp(version, "eMAG"))
> + if (!version || strncmp(version, "eMAG", 4))
> break;
>
> fallthrough;

Yay! Success! I just tested your latest efi/urgent (with the fixup) and
system completed the boot without any soft lockups.

Thanks!
-Andrea

2023-03-16 18:55:57

by Ard Biesheuvel

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 18:52, Andrea Righi <[email protected]> wrote:
>
> ...
>
> Yay! Success! I just tested your latest efi/urgent (with the fixup) and
> system completed the boot without any soft lockups.
>

Thanks for confirming. I'll take that as a tested-by

2023-03-16 18:57:45

by Andrea Righi

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 07:55:36PM +0100, Ard Biesheuvel wrote:
...
> >
> > Yay! Success! I just tested your latest efi/urgent (with the fixup) and
> > system completed the boot without any soft lockups.
> >
>
> Thanks for confirming. I'll take that as a tested-by

Sure, thanks!

Tested-by: Andrea Righi <[email protected]>

2023-03-16 22:30:13

by Darren Hart

Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Mar 16, 2023 at 07:55:36PM +0100, Ard Biesheuvel wrote:
> On Thu, 16 Mar 2023 at 18:52, Andrea Righi <[email protected]> wrote:
> >
> > On Thu, Mar 16, 2023 at 03:08:53PM +0100, Ard Biesheuvel wrote:
> > > On Thu, 16 Mar 2023 at 15:06, Ard Biesheuvel <[email protected]> wrote:
> > > >
> > > > On Thu, 16 Mar 2023 at 14:59, Andrea Righi <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 16, 2023 at 02:53:24PM +0100, Ard Biesheuvel wrote:
> > > > > > On Thu, 16 Mar 2023 at 14:50, Andrea Righi <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 16, 2023 at 02:45:49PM +0100, Ard Biesheuvel wrote:
> > > > > > > > On Thu, 16 Mar 2023 at 13:50, Andrea Righi <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Mar 16, 2023 at 01:43:32PM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > On Thu, 16 Mar 2023 at 13:41, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 16, 2023 at 01:38:30PM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > On Thu, 16 Mar 2023 at 13:21, Ard Biesheuvel <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 16 Mar 2023 at 12:34, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Mar 16, 2023 at 11:18:21AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > > > > On Thu, 16 Mar 2023 at 11:03, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Thu, Mar 16, 2023 at 10:55:58AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > > > > > > (cc Darren)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Thu, 16 Mar 2023 at 10:45, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Thu, Mar 16, 2023 at 08:58:20AM +0100, Ard Biesheuvel wrote:
> > > > > > > > > > > > > > > > > > > Hello Andrea,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Thu, 16 Mar 2023 at 08:54, Andrea Righi <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
> > > > > > > > > > > > > > > > > > > > gets stuck and never completes the boot. On the console I see this:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > > > > > > > > > > > > > > [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
> > > > > > > > > > > > > > > > > > > > [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
> > > > > > > > > > > > > > > > > > > > [ 72.064949] Task dump for CPU 22:
> > > > > > > > > > > > > > > > > > > > [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
> > > > > > > > > > > > > > > > > > > > [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
> > > > > > > > > > > > > > > > > > > > [ 72.082595] Call trace:
> > > > > > > > > > > > > > > > > > > > [ 72.085029] __switch_to+0xbc/0x100
> > > > > > > > > > > > > > > > > > > > [ 72.088508] 0xffff80000fe83d4c
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > After that, as a consequence, I start to get a lot of hung task timeout traces.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I tried to bisect the problem and I found that the offending commit is
> > > > > > > > > > > > > > > > > > > > this one:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I've reverted this commit for now and everything works just fine, but I
> > > > > > > > > > > > > > > > > > > > was wondering if the problem could be caused by a lack of entropy on
> > > > > > > > > > > > > > > > > > > > these arm64 boxes or something else.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Any suggestion? Let me know if you want me to do any specific test.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for the report.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > This is most likely the EFI SetVariable() call going off into the
> > > > > > > > > > > > > > > > > > > weeds and never returning.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Is this an Ampere Altra system by any chance? Do you see it on
> > > > > > > > > > > > > > > > > > > different types of hardware?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This is: Ampere eMAG / Lenovo ThinkSystem HR330a.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Could you check whether SetVariable works on this system? E.g. by
> > > > > > > > > > > > > > > > > > > updating the EFI boot timeout (sudo efibootmgr -t <n>)?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > ubuntu@kuzzle:~$ sudo efibootmgr -t 10
> > > > > > > > > > > > > > > > > > ^C^C^C^C
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > ^ Stuck there, so it really looks like SetVariable is the problem.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Could you please share the output of
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > dmidecode -s bios
> > > > > > > > > > > > > > > > > dmidecode -s system-family
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > $ sudo dmidecode -s bios-vendor
> > > > > > > > > > > > > > > > LENOVO
> > > > > > > > > > > > > > > > $ sudo dmidecode -s bios-version
> > > > > > > > > > > > > > > > hve104r-1.15
> > > > > > > > > > > > > > > > $ sudo dmidecode -s bios-release-date
> > > > > > > > > > > > > > > > 02/26/2021
> > > > > > > > > > > > > > > > $ sudo dmidecode -s bios-revision
> > > > > > > > > > > > > > > > 1.15
> > > > > > > > > > > > > > > > $ sudo dmidecode -s system-family
> > > > > > > > > > > > > > > > Lenovo ThinkSystem HR330A/HR350A
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Mind checking if this patch fixes your issue as well?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=altra-fix&id=77fa99dd4741456da85049c13ec31a148f5f5ac0
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Unfortunately this doesn't seem to be enough, I'm still getting the same
> > > > > > > > > > > > > > problem also with this patch applied.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for trying.
> > > > > > > > > > > > >
> > > > > > > > > > > > > How about the last 3 patches on this branch?
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=efi-smbios-altra-fix
> > > > > > > > > > > >
> > > > > > > > > > > > Actually, that may not match your hardware.
> > > > > > > > > > > >
> > > > > > > > > > > > Does your kernel log have a line like
> > > > > > > > > > > >
> > > > > > > > > > > > SMCCC: SOC_ID: ID = jep106:036b:0019 Revision = 0x00000102
> > > > > > > > > > > >
> > > > > > > > > > > > ?
> > > > > > > > > > >
> > > > > > > > > > > $ sudo dmesg | grep "SMCCC: SOC_ID"
> > > > > > > > > > > [ 5.320782] SMCCC: SOC_ID: ARCH_SOC_ID not implemented, skipping ....
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks. Could you share the entire dmidecode output somewhere? Or at
> > > > > > > > > > least the type 4 record(s)?
> > > > > > > > >
> > > > > > > > > Sure, here's the full output of dmidecode:
> > > > > > > > > https://pastebin.ubuntu.com/p/4ZmKmP2xTm/
> > > > > > > > >
> > > > > > > >
> > > > > > > > Thanks. I have updated my SMBIOS patches to take the processor version
> > > > > > > > 'eMAG' into account, which appears to be what these boxes are using.
> > > > > > > >
> > > > > > > > I have updated the efi/urgent branch here with the latest versions.
> > > > > > > > Mind giving them a spin?
> > > > > > > >
> > > > > > > >
> > > > > > > > In the mean time, just for the record - could you please run this as well?
> > > > > > > >
> > > > > > > > hexdump -C /sys/firmware/dmi/entries/4-0/raw
> > > > > > > >
> > > > > > > > (as root)
> > > > > > >
> > > > > > > hm.. I don't have that in /sys/firmware/, this is what I have:
> > > > > > >
> > > > > > > # ls -l /sys/firmware/dmi/
> > > > > > > total 0
> > > > > > > drwxr-xr-x 2 root root 0 Mar 16 13:26 tables
> > > > > > > # ls -l /sys/firmware/dmi/tables/
> > > > > > > total 0
> > > > > > > -r-------- 1 root root 5004 Mar 16 13:26 DMI
> > > > > > > -r-------- 1 root root 24 Mar 16 13:26 smbios_entry_point
> > > > > > >
> > > > > >
> > > > > > You'll need to load the dmi_sysfs module for that. But no big deal
> > > > > > > otherwise, I'm pretty sure the word order is the correct one on your
> > > > > > system in any case (it decodes the value correctly in the next line)
> > > > >
> > > > > ok, much better after modprobe dmi_sysfs. :)
> > > > >
> > > >
> > > > Yeah better, thanks.
> > > >
> > > > > $ sudo hexdump -C /sys/firmware/dmi/entries/4-0/raw
> > > > > 00000000 04 30 04 00 01 03 fe 02 02 00 3f 50 00 00 00 00 |.0........?P....|
> > > > > 00000010 03 89 b8 0b e4 0c b8 0b 41 06 05 00 06 00 07 00 |........A.......|
> > > > > 00000020 04 00 00 20 20 20 7c 00 01 01 00 00 00 00 00 00 |... |.........|
> > > > > 00000030 43 50 55 20 31 00 41 6d 70 65 72 65 28 54 4d 29 |CPU 1.Ampere(TM)|
> > > > > 00000040 00 65 4d 41 47 20 00 30 30 30 30 30 30 30 30 30 |.eMAG .000000000|
> > > >
> > > > Darn, this means we have to match for "eMAG " (with the trailing
> > > > > space) so the branch I just pushed needs to be updated for this.
> > > >
> > >
> > > I.e.,
> > >
> > > --- a/drivers/firmware/efi/libstub/arm64.c
> > > +++ b/drivers/firmware/efi/libstub/arm64.c
> > > @@ -36,7 +36,7 @@ static bool system_needs_vamap(void)
> > > default:
> > > version = efi_get_smbios_string(&record->header, 4,
> > > processor_version);
> > > - if (!version || strcmp(version, "eMAG"))
> > > + if (!version || strncmp(version, "eMAG", 4))
> > > break;
> > >
> > > fallthrough;
> >
> > Yay! Success! I just tested your latest efi/urgent (with the fixup) and
> > system completed the boot without any soft lockups.
> >
>
> Thanks for confirming. I'll take that as a tested-by

The solution in the current branch looks like the best approach we have to date
to address the broadest set of affected systems. I believe we could switch the
eMAG test to a MIDR test (but this won't work for Altra, as that would capture
all the Neoverse N1 cores beyond Altra). I can look into the MIDR test if you
think it's worthwhile, but since I don't think we can eliminate the SMBIOS
string test, it doesn't buy us much: we don't need a greedier eMAG test (there
aren't more of them to match).

Given that some OEM Altra platforms change the processor ID, I unfortunately
don't see a better solution currently than adding their "product name" to the
SMBIOS string tests.

--
Darren Hart
Ampere Computing / OS and Kernel
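
To make the MIDR idea more concrete, here is a rough userspace sketch of
splitting a MIDR_EL1 value into its architectural fields, which is what such a
test would compare. The sample value is the first four processor ID bytes from
the eMAG hexdump quoted above (02 00 3f 50), read as little-endian, on the
assumption that this firmware stores MIDR_EL1 in that field. The names
midr_fields, midr_decode and le32 are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural MIDR_EL1 field layout. */
struct midr_fields {
	uint8_t  implementer;	/* bits [31:24] */
	uint8_t  variant;	/* bits [23:20] */
	uint8_t  architecture;	/* bits [19:16] */
	uint16_t partnum;	/* bits [15:4]  */
	uint8_t  revision;	/* bits [3:0]   */
};

static struct midr_fields midr_decode(uint32_t midr)
{
	struct midr_fields f = {
		.implementer  = (midr >> 24) & 0xff,
		.variant      = (midr >> 20) & 0xf,
		.architecture = (midr >> 16) & 0xf,
		.partnum      = (midr >> 4) & 0xfff,
		.revision     = midr & 0xf,
	};
	return f;
}

/* Assemble four bytes into a little-endian 32-bit value. */
static uint32_t le32(const uint8_t b[4])
{
	return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	       (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* Processor ID bytes 0x08..0x0b of the eMAG type 4 record. */
static void check_emag_midr(void)
{
	const uint8_t id[4] = { 0x02, 0x00, 0x3f, 0x50 };
	struct midr_fields f = midr_decode(le32(id));

	assert(f.implementer == 0x50);
	assert(f.partnum == 0x000);
	assert(f.revision == 0x2);
}
```

A real check would presumably compare only the implementer and part number and
ignore variant and revision, but that choice is an assumption here, not
something settled in this thread.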

2023-03-18 10:36:03

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, 16 Mar 2023 at 23:28, Darren Hart <[email protected]> wrote:
>
> On Thu, Mar 16, 2023 at 07:55:36PM +0100, Ard Biesheuvel wrote:
> > On Thu, 16 Mar 2023 at 18:52, Andrea Righi <[email protected]> wrote:
...
> > >
> > > Yay! Success! I just tested your latest efi/urgent (with the fixup) and
> > > system completed the boot without any soft lockups.
> > >
> >
> > Thanks for confirming. I'll take that as a tested-by
>
> The solution in the current branch looks like the best approach we have to date
> to address the broadest set of affected systems. I believe we could switch the
> eMAG test to a MIDR test (but this won't work for Altra, as that would capture
> all the Neoverse N1 cores beyond Altra). I can look into the MIDR test if you
> think it's worthwhile, but since I don't think we can eliminate the SMBIOS
> string test, it doesn't buy us much: we don't need a greedier eMAG test (there
> aren't more of them to match).
>
> Given that some OEM Altra platforms change the processor ID, I unfortunately
> don't see a better solution currently than adding their "product name" to the
> SMBIOS string tests.
>

Indeed. I spotted a Gigabyte system [0] with a different processor ID,
but with a version we can test for.

So for now, I'll go with

    socid = (u32 *)record->processor_id;
    switch (*socid & 0xffff000f) {
        static char const altra[] = "Ampere(TM) Altra(TM) Processor";
        static char const emag[] = "eMAG";
    default:
        version = efi_get_smbios_string(&record->header, 4,
                                        processor_version);
        if (!version || (strncmp(version, altra, sizeof(altra) - 1) &&
                         strncmp(version, emag, sizeof(emag) - 1)))
            break;

        fallthrough;

    case 0x0a160001: // Altra
    case 0x0a160002: // Altra Max
        efi_warn("Working around broken SetVirtualAddressMap()\n");
        ...

which should cover all the affected systems we encountered so far.

I'll push this to linux-next to let it soak for a little bit, and then
send it to Linus sometime during the week.

Thanks,
Ard.


[0] https://pastebin.com/HQLE1yYv
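
For anyone who wants to try the matching rules outside the EFI stub, the check
above can be approximated in plain userspace C. This is only a sketch: the
function name needs_vamap_quirk() and its signature are hypothetical, and the
caller is assumed to have already extracted the first 32 bits of the SMBIOS
processor ID and the processor version string (NULL if absent).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-in for the check quoted above. */
static bool needs_vamap_quirk(uint32_t socid, const char *version)
{
	static const char altra[] = "Ampere(TM) Altra(TM) Processor";
	static const char emag[] = "eMAG";

	/* Known processor ID values match directly. */
	switch (socid & 0xffff000f) {
	case 0x0a160001:	/* Altra */
	case 0x0a160002:	/* Altra Max */
		return true;
	default:
		break;
	}

	if (version == NULL)
		return false;

	/* sizeof() - 1 drops the NUL, so both tests are prefix matches. */
	return strncmp(version, altra, sizeof(altra) - 1) == 0 ||
	       strncmp(version, emag, sizeof(emag) - 1) == 0;
}

static void check_quirk_examples(void)
{
	/* eMAG box from this thread: ID 0x503f0002, version "eMAG ". */
	assert(needs_vamap_quirk(0x503f0002, "eMAG "));
	/* A matching processor ID needs no version string at all. */
	assert(needs_vamap_quirk(0x0a160001, NULL));
	/* Unknown ID and unknown version: no workaround. */
	assert(!needs_vamap_quirk(0, "Some Other CPU"));
}
```

The string tests only run when the processor ID does not match, which mirrors
the default/fallthrough ordering of the switch statement in the snippet.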

2023-03-20 18:08:33

by Darren Hart

[permalink] [raw]
Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Sat, Mar 18, 2023 at 11:35:44AM +0100, Ard Biesheuvel wrote:
> On Thu, 16 Mar 2023 at 23:28, Darren Hart <[email protected]> wrote:
> >
> > On Thu, Mar 16, 2023 at 07:55:36PM +0100, Ard Biesheuvel wrote:
> > > On Thu, 16 Mar 2023 at 18:52, Andrea Righi <[email protected]> wrote:
> ...
> > > >
> > > > Yay! Success! I just tested your latest efi/urgent (with the fixup) and
> > > > system completed the boot without any soft lockups.
> > > >
> > >
> > > Thanks for confirming. I'll take that as a tested-by
> >
> > The solution in the current branch looks like the best approach we have to date
> > to address the broadest set of affected systems. I believe we could switch the
> > eMAG test to a MIDR test (but this won't work for Altra, as that would capture
> > all the Neoverse N1 cores beyond Altra). I can look into the MIDR test if you
> > think it's worthwhile, but since I don't think we can eliminate the SMBIOS
> > string test, it doesn't buy us much: we don't need a greedier eMAG test (there
> > aren't more of them to match).
> >
> > Given that some OEM Altra platforms change the processor ID, I unfortunately
> > don't see a better solution currently than adding their "product name" to the
> > SMBIOS string tests.
> >
>
> Indeed. I spotted a Gigabyte system [0] with a different processor ID,
> but with a version we can test for.
>
> So for now, I'll go with
>
>     socid = (u32 *)record->processor_id;
>     switch (*socid & 0xffff000f) {
>         static char const altra[] = "Ampere(TM) Altra(TM) Processor";
>         static char const emag[] = "eMAG";
>     default:
>         version = efi_get_smbios_string(&record->header, 4,
>                                         processor_version);
>         if (!version || (strncmp(version, altra, sizeof(altra) - 1) &&
>                          strncmp(version, emag, sizeof(emag) - 1)))
>             break;
>
>         fallthrough;
>
>     case 0x0a160001: // Altra
>     case 0x0a160002: // Altra Max
>         efi_warn("Working around broken SetVirtualAddressMap()\n");
>         ...
>
> which should cover all the affected systems we encountered so far.
>
> I'll push this to linux-next to let it soak for a little bit, and then
> send it to Linus sometime during the week.

Thank you Ard, I think this is our best option.

--
Darren Hart
Ampere Computing / OS and Kernel

2023-04-05 12:57:49

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

[TLDR: This mail is primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]

On 16.03.23 10:45, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:
> On 16.03.23 08:54, Andrea Righi wrote:
>> Hello,
>>
>> the latest v6.2.6 kernel fails to boot on some arm64 systems, the kernel
>> gets stuck and never completes the boot. On the console I see this:
>>
>> [ 72.043484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [ 72.049571] rcu: 22-...0: (30 GPs behind) idle=b10c/1/0x4000000000000000 softirq=164/164 fqs=6443
>> [ 72.058520] (detected by 28, t=15005 jiffies, g=449, q=174 ncpus=32)
>> [ 72.064949] Task dump for CPU 22:
>> [ 72.068251] task:kworker/u64:5 state:R running task stack:0 pid:447 ppid:2 flags:0x0000000a
>> [ 72.078156] Workqueue: efi_rts_wq efi_call_rts
>> [ 72.082595] Call trace:
>> [ 72.085029] __switch_to+0xbc/0x100
>> [ 72.088508] 0xffff80000fe83d4c
>>
>> After that, as a consequence, I start to get a lot of hung task timeout traces.
>>
>> I tried to bisect the problem and I found that the offending commit is
>> this one:
>>
>> e7b813b32a42 ("efi: random: refresh non-volatile random seed when RNG is initialized")
>>
>> I've reverted this commit for now and everything works just fine, but I
>> was wondering if the problem could be caused by a lack of entropy on
>> these arm64 boxes or something else.
>
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
>
> #regzbot ^introduced e7b813b32a42
> #regzbot title efi: stuck at boot (efi_call_rts) on arm64
> #regzbot ignore-activity

#regzbot fix: eb684408f3ea4856
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


2023-04-13 20:28:12

by Andrea Righi

[permalink] [raw]
Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Sat, Mar 18, 2023 at 11:35:44AM +0100, Ard Biesheuvel wrote:
> On Thu, 16 Mar 2023 at 23:28, Darren Hart <[email protected]> wrote:
> >
> > On Thu, Mar 16, 2023 at 07:55:36PM +0100, Ard Biesheuvel wrote:
> > > On Thu, 16 Mar 2023 at 18:52, Andrea Righi <[email protected]> wrote:
> ...
> > > >
> > > > Yay! Success! I just tested your latest efi/urgent (with the fixup) and
> > > > system completed the boot without any soft lockups.
> > > >
> > >
> > > Thanks for confirming. I'll take that as a tested-by
> >
> > The solution in the current branch looks like the best approach we have to date
> > to address the broadest set of affected systems. I believe we could switch the
> > eMAG test to a MIDR test (but this won't work for Altra, as that would capture
> > all the Neoverse N1 cores beyond Altra). I can look into the MIDR test if you
> > think it's worthwhile, but since I don't think we can eliminate the SMBIOS
> > string test, it doesn't buy us much: we don't need a greedier eMAG test (there
> > aren't more of them to match).
> >
> > Given that some OEM Altra platforms change the processor ID, I unfortunately
> > don't see a better solution currently than adding their "product name" to the
> > SMBIOS string tests.
> >
>
> Indeed. I spotted a Gigabyte system [0] with a different processor ID,
> but with a version we can test for.
>
> So for now, I'll go with
>
>     socid = (u32 *)record->processor_id;
>     switch (*socid & 0xffff000f) {
>         static char const altra[] = "Ampere(TM) Altra(TM) Processor";
>         static char const emag[] = "eMAG";
>     default:
>         version = efi_get_smbios_string(&record->header, 4,
>                                         processor_version);
>         if (!version || (strncmp(version, altra, sizeof(altra) - 1) &&
>                          strncmp(version, emag, sizeof(emag) - 1)))
>             break;
>
>         fallthrough;
>
>     case 0x0a160001: // Altra
>     case 0x0a160002: // Altra Max
>         efi_warn("Working around broken SetVirtualAddressMap()\n");
>         ...
>
> which should cover all the affected systems we encountered so far.
>
> I'll push this to linux-next to let it soak for a little bit, and then
> send it to Linus sometime during the week.
>
> Thanks,
> Ard.
>
>
> [0] https://pastebin.com/HQLE1yYv

Not sure if it's a similar issue, but I have found another Ampere box
that is booting fine with your fixes, but the efivarfs.sh kselftest is
failing with some I/O errors, specifically:

$ sudo ./efivarfs.sh
--------------------
running test_create
--------------------
./efivarfs.sh: line 58: printf: write error: Input/output error
/sys/firmware/efi/efivars/test_create-210be57c-9849-4fc7-a635-e6382d1aec27 has invalid size
[FAIL]
--------------------
running test_create_empty
--------------------
[PASS]
--------------------
running test_create_read
--------------------
[PASS]
--------------------
running test_delete
--------------------
./efivarfs.sh: line 103: printf: write error: Input/output error
[PASS]
--------------------
running test_zero_size_delete
--------------------
./efivarfs.sh: line 126: printf: write error: Input/output error
./efivarfs.sh: line 134: printf: write error: Input/output error
/sys/firmware/efi/efivars/test_zero_size_delete-210be57c-9849-4fc7-a635-e6382d1aec27 should have been deleted
[FAIL]
--------------------
running test_open_unlink
--------------------
open(O_WRONLY): Operation not permitted
[FAIL]
--------------------
running test_valid_filenames
--------------------
./efivarfs.sh: line 158: printf: write error: Input/output error
./efivarfs.sh: line 158: printf: write error: Input/output error
./efivarfs.sh: line 158: printf: write error: Input/output error
./efivarfs.sh: line 158: printf: write error: Input/output error
[PASS]
--------------------
running test_invalid_filenames
--------------------
[PASS]

If it helps:

$ sudo hexdump -C /sys/firmware/dmi/entries/4-0/raw
00000000 04 30 04 00 01 03 fe 02 c1 d0 3f 41 00 00 00 00 |.0........?A....|
00000010 03 8a 72 06 b8 0b f0 0a 41 06 05 00 06 00 07 00 |..r.....A.......|
00000020 04 05 06 50 50 50 04 00 01 01 01 00 01 00 01 00 |...PPP..........|
00000030 43 50 55 20 31 00 41 6d 70 65 72 65 28 52 29 00 |CPU 1.Ampere(R).|
00000040 41 6d 70 65 72 65 28 52 29 20 41 6c 74 72 61 28 |Ampere(R) Altra(|
00000050 52 29 20 50 72 6f 63 65 73 73 6f 72 00 30 30 30 |R) Processor.000|
00000060 30 30 30 30 30 30 30 30 30 30 30 30 30 30 32 35 |0000000000000025|
00000070 35 30 32 30 39 30 33 33 38 36 35 42 34 00 30 30 |50209033865B4.00|
00000080 30 30 30 30 30 31 00 51 38 30 2d 33 30 00 00 |000001.Q80-30..|
0000008f

I guess EFI is not very reliable here...

-Andrea
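
In case it helps anyone reading the raw dump, below is a sketch of how the
processor version string can be pulled out of such a record. It assumes the
standard SMBIOS structure layout: byte 1 holds the length of the formatted
area, the NUL-terminated strings follow it, and in a type 4 record the
processor version string index sits at offset 0x10. The names altra_record,
smbios_string and check_altra_record are made up for this example; the bytes
are copied from the hexdump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Raw type 4 record from the hexdump above (0x8f bytes). */
static const uint8_t altra_record[] = {
	0x04,0x30,0x04,0x00,0x01,0x03,0xfe,0x02,0xc1,0xd0,0x3f,0x41,0x00,0x00,0x00,0x00,
	0x03,0x8a,0x72,0x06,0xb8,0x0b,0xf0,0x0a,0x41,0x06,0x05,0x00,0x06,0x00,0x07,0x00,
	0x04,0x05,0x06,0x50,0x50,0x50,0x04,0x00,0x01,0x01,0x01,0x00,0x01,0x00,0x01,0x00,
	0x43,0x50,0x55,0x20,0x31,0x00,0x41,0x6d,0x70,0x65,0x72,0x65,0x28,0x52,0x29,0x00,
	0x41,0x6d,0x70,0x65,0x72,0x65,0x28,0x52,0x29,0x20,0x41,0x6c,0x74,0x72,0x61,0x28,
	0x52,0x29,0x20,0x50,0x72,0x6f,0x63,0x65,0x73,0x73,0x6f,0x72,0x00,0x30,0x30,0x30,
	0x30,0x30,0x30,0x30,0x30,0x30,0x30,0x30,0x30,0x30,0x30,0x30,0x30,0x30,0x32,0x35,
	0x35,0x30,0x32,0x30,0x39,0x30,0x33,0x33,0x38,0x36,0x35,0x42,0x34,0x00,0x30,0x30,
	0x30,0x30,0x30,0x30,0x30,0x31,0x00,0x51,0x38,0x30,0x2d,0x33,0x30,0x00,0x00,
};

/* Return the idx-th (1-based) string of an SMBIOS record, or NULL. */
static const char *smbios_string(const uint8_t *rec, size_t len, uint8_t idx)
{
	size_t pos;

	if (len < 2 || idx == 0)
		return NULL;

	pos = rec[1];	/* the string set starts after the formatted area */
	while (pos < len && rec[pos] != 0) {
		const uint8_t *nul = memchr(rec + pos, 0, len - pos);

		if (nul == NULL)
			return NULL;	/* unterminated string */
		if (--idx == 0)
			return (const char *)(rec + pos);
		pos = (size_t)(nul - rec) + 1;
	}
	return NULL;	/* an empty string ends the set */
}

static void check_altra_record(void)
{
	/* Offset 0x10 of a type 4 record is the processor version index. */
	const char *version = smbios_string(altra_record,
					    sizeof(altra_record),
					    altra_record[0x10]);

	assert(version != NULL);
	assert(strcmp(version, "Ampere(R) Altra(R) Processor") == 0);
}
```

On a live system the same bytes could be read from
/sys/firmware/dmi/entries/4-0/raw instead of being hard-coded.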

2023-04-17 22:06:51

by Darren Hart

[permalink] [raw]
Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Thu, Apr 13, 2023 at 10:24:38PM +0200, Andrea Righi wrote:
>
> Not sure if it's a similar issue, but I have found another Ampere box
> that is booting fine with your fixes, but the efivarfs.sh kselftest is
> failing with some I/O errors, specifically:

Thanks for reporting. Can you confirm this worked reliably for you prior
to v6.1?

--
Darren

>
> $ sudo ./efivarfs.sh
> --------------------
> running test_create
> --------------------
> ./efivarfs.sh: line 58: printf: write error: Input/output error
> /sys/firmware/efi/efivars/test_create-210be57c-9849-4fc7-a635-e6382d1aec27 has invalid size
> [FAIL]
> --------------------
> running test_create_empty
> --------------------
> [PASS]
> --------------------
> running test_create_read
> --------------------
> [PASS]
> --------------------
> running test_delete
> --------------------
> ./efivarfs.sh: line 103: printf: write error: Input/output error
> [PASS]
> --------------------
> running test_zero_size_delete
> --------------------
> ./efivarfs.sh: line 126: printf: write error: Input/output error
> ./efivarfs.sh: line 134: printf: write error: Input/output error
> /sys/firmware/efi/efivars/test_zero_size_delete-210be57c-9849-4fc7-a635-e6382d1aec27 should have been deleted
> [FAIL]
> --------------------
> running test_open_unlink
> --------------------
> open(O_WRONLY): Operation not permitted
> [FAIL]
> --------------------
> running test_valid_filenames
> --------------------
> ./efivarfs.sh: line 158: printf: write error: Input/output error
> ./efivarfs.sh: line 158: printf: write error: Input/output error
> ./efivarfs.sh: line 158: printf: write error: Input/output error
> ./efivarfs.sh: line 158: printf: write error: Input/output error
> [PASS]
> --------------------
> running test_invalid_filenames
> --------------------
> [PASS]
>
> If it helps:
>
> $ sudo hexdump -C /sys/firmware/dmi/entries/4-0/raw
> 00000000 04 30 04 00 01 03 fe 02 c1 d0 3f 41 00 00 00 00 |.0........?A....|
> 00000010 03 8a 72 06 b8 0b f0 0a 41 06 05 00 06 00 07 00 |..r.....A.......|
> 00000020 04 05 06 50 50 50 04 00 01 01 01 00 01 00 01 00 |...PPP..........|
> 00000030 43 50 55 20 31 00 41 6d 70 65 72 65 28 52 29 00 |CPU 1.Ampere(R).|
> 00000040 41 6d 70 65 72 65 28 52 29 20 41 6c 74 72 61 28 |Ampere(R) Altra(|
> 00000050 52 29 20 50 72 6f 63 65 73 73 6f 72 00 30 30 30 |R) Processor.000|
> 00000060 30 30 30 30 30 30 30 30 30 30 30 30 30 30 32 35 |0000000000000025|
> 00000070 35 30 32 30 39 30 33 33 38 36 35 42 34 00 30 30 |50209033865B4.00|
> 00000080 30 30 30 30 30 31 00 51 38 30 2d 33 30 00 00 |000001.Q80-30..|
> 0000008f
>
> I guess EFI is not very reliable here...
>
> -Andrea

--
Darren Hart
Ampere Computing / OS and Kernel

2023-04-18 05:45:08

by Andrea Righi

[permalink] [raw]
Subject: Re: kernel 6.2 stuck at boot (efi_call_rts) on arm64

On Mon, Apr 17, 2023 at 03:05:18PM -0700, Darren Hart wrote:
> On Thu, Apr 13, 2023 at 10:24:38PM +0200, Andrea Righi wrote:
> >
> > Not sure if it's a similar issue, but I have found another Ampere box
> > that is booting fine with your fixes, but the efivarfs.sh kselftest is
> > failing with some I/O errors, specifically:
>
> Thanks for reporting. Can you confirm this worked reliably for you prior
> to v6.1?
>
> --
> Darren

I tested again and I can confirm that after a reboot everything looks fine.
Maybe EFI was messed up by a previous test and the latest kernel fixes
everything. Anyway, this issue seems resolved for me.

Thanks,
-Andrea

>
> >
> > $ sudo ./efivarfs.sh
> > --------------------
> > running test_create
> > --------------------
> > ./efivarfs.sh: line 58: printf: write error: Input/output error
> > /sys/firmware/efi/efivars/test_create-210be57c-9849-4fc7-a635-e6382d1aec27 has invalid size
> > [FAIL]
> > --------------------
> > running test_create_empty
> > --------------------
> > [PASS]
> > --------------------
> > running test_create_read
> > --------------------
> > [PASS]
> > --------------------
> > running test_delete
> > --------------------
> > ./efivarfs.sh: line 103: printf: write error: Input/output error
> > [PASS]
> > --------------------
> > running test_zero_size_delete
> > --------------------
> > ./efivarfs.sh: line 126: printf: write error: Input/output error
> > ./efivarfs.sh: line 134: printf: write error: Input/output error
> > /sys/firmware/efi/efivars/test_zero_size_delete-210be57c-9849-4fc7-a635-e6382d1aec27 should have been deleted
> > [FAIL]
> > --------------------
> > running test_open_unlink
> > --------------------
> > open(O_WRONLY): Operation not permitted
> > [FAIL]
> > --------------------
> > running test_valid_filenames
> > --------------------
> > ./efivarfs.sh: line 158: printf: write error: Input/output error
> > ./efivarfs.sh: line 158: printf: write error: Input/output error
> > ./efivarfs.sh: line 158: printf: write error: Input/output error
> > ./efivarfs.sh: line 158: printf: write error: Input/output error
> > [PASS]
> > --------------------
> > running test_invalid_filenames
> > --------------------
> > [PASS]
> >
> > If it helps:
> >
> > $ sudo hexdump -C /sys/firmware/dmi/entries/4-0/raw
> > 00000000 04 30 04 00 01 03 fe 02 c1 d0 3f 41 00 00 00 00 |.0........?A....|
> > 00000010 03 8a 72 06 b8 0b f0 0a 41 06 05 00 06 00 07 00 |..r.....A.......|
> > 00000020 04 05 06 50 50 50 04 00 01 01 01 00 01 00 01 00 |...PPP..........|
> > 00000030 43 50 55 20 31 00 41 6d 70 65 72 65 28 52 29 00 |CPU 1.Ampere(R).|
> > 00000040 41 6d 70 65 72 65 28 52 29 20 41 6c 74 72 61 28 |Ampere(R) Altra(|
> > 00000050 52 29 20 50 72 6f 63 65 73 73 6f 72 00 30 30 30 |R) Processor.000|
> > 00000060 30 30 30 30 30 30 30 30 30 30 30 30 30 30 32 35 |0000000000000025|
> > 00000070 35 30 32 30 39 30 33 33 38 36 35 42 34 00 30 30 |50209033865B4.00|
> > 00000080 30 30 30 30 30 31 00 51 38 30 2d 33 30 00 00 |000001.Q80-30..|
> > 0000008f
> >
> > I guess EFI is not very reliable here...
> >
> > -Andrea
>
> --
> Darren Hart
> Ampere Computing / OS and Kernel