LinuxLists.cc - printk.time causes rare kernel boot hangs

2023-06-13 13:50:48

Subject: printk.time causes rare kernel boot hangs

[Being tracked in this bug which contains much more detail:
https://gitlab.com/qemu-project/qemu/-/issues/1696 ]

Recent kernels hang rarely when booted on qemu. Usually you need to
boot 100s or 1,000s of times to see the hang, compared to 292,612 [sic]
successful boots which I was able to do before the problematic commit.

A reproducer (you'll probably need to use Fedora) is:

$ while guestfish -a /dev/null -v run >& /tmp/log; do echo -n . ; done

You will need to leave it running for probably several hours, and
examine the /tmp/log file at the end.
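
A minimal sketch of a bounded version of this loop, for anyone who wants
a run count and a hard per-boot time limit. The boot_loop helper, its
arguments and the 300 second limit used below are hypothetical and not
part of the original report.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it fails or a
# cap on the number of runs is reached.
# Usage: boot_loop MAX_RUNS LOGFILE COMMAND [ARGS...]
boot_loop() {
    max=$1; log=$2; shift 2
    n=0
    while [ "$n" -lt "$max" ]; do
        # Overwrite the log on every run so it only holds the latest attempt.
        if ! "$@" > "$log" 2>&1; then
            echo "failed on run $((n + 1)); see $log"
            return 1
        fi
        n=$((n + 1))
    done
    echo "completed $n runs without failure"
    return 0
}
```

It could be called with the command from the reproducer, wrapped in
coreutils timeout so that a boot which never returns also counts as a
failure:

$ boot_loop 10000 /tmp/log timeout 300 guestfish -a /dev/null -v run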

I tracked this down to the following commit:

commit f31dcb152a3d0816e2f1deab4e64572336da197d
Author: Aaron Thompson <[email protected]>
Date: Thu Apr 13 17:50:12 2023 +0000

sched/clock: Fix local_clock() before sched_clock_init()

Have local_clock() return sched_clock() if sched_clock_init() has not
yet run. sched_clock_cpu() has this check but it was not included in the
new noinstr implementation of local_clock().

(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f31dcb152a3d0816e2f1deab4e64572336da197d)

Reverting this commit fixes the problem.

I don't know _why_ this commit is wrong, but can we revert it as it
causes serious problems with libguestfs hanging randomly.

Or if there's anything you want me to try out then let me know,
because I can reproduce the problem locally quite easily.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

2023-06-13 14:38:50

by Thorsten Leemhuis

Subject: Re: printk.time causes rare kernel boot hangs

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 13.06.23 15:41, Richard W.M. Jones wrote:
> [Being tracked in this bug which contains much more detail:
> https://gitlab.com/qemu-project/qemu/-/issues/1696 ]
>
> Recent kernels hang rarely when booted on qemu. Usually you need to
> boot 100s or 1,000s of times to see the hang, compared to 292,612 [sic]
> successful boots which I was able to do before the problematic commit.
>
> A reproducer (you'll probably need to use Fedora) is:
>
> $ while guestfish -a /dev/null -v run >& /tmp/log; do echo -n . ; done
>
> You will need to leave it running for probably several hours, and
> examine the /tmp/log file at the end.
>
> I tracked this down to the following commit:
>
> commit f31dcb152a3d0816e2f1deab4e64572336da197d
> Author: Aaron Thompson <[email protected]>
> Date: Thu Apr 13 17:50:12 2023 +0000
>
> sched/clock: Fix local_clock() before sched_clock_init()
>
> Have local_clock() return sched_clock() if sched_clock_init() has not
> yet run. sched_clock_cpu() has this check but it was not included in the
> new noinstr implementation of local_clock().
>
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f31dcb152a3d0816e2f1deab4e64572336da197d)
>
> Reverting this commit fixes the problem.
>
> I don't know _why_ this commit is wrong, but can we revert it as it
> causes serious problems with libguestfs hanging randomly.
>
> Or if there's anything you want me to try out then let me know,
> because I can reproduce the problem locally quite easily.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced f31dcb152a3d0816e2f1deab4e64572336da197d
#regzbot title sched/clock: printk.time causes rare kernel boot hangs
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-06-14 09:31:15

by Peter Zijlstra

Subject: Re: printk.time causes rare kernel boot hangs

On Tue, Jun 13, 2023 at 02:41:05PM +0100, Richard W.M. Jones wrote:
> [Being tracked in this bug which contains much more detail:
> https://gitlab.com/qemu-project/qemu/-/issues/1696 ]

Can I please just get the detail in mail instead of having to go look at
random websites?

> Recent kernels hang rarely when booted on qemu. Usually you need to
> boot 100s or 1,000s of times to see the hang, compared to 292,612 [sic]
> successful boots which I was able to do before the problematic commit.
>
> A reproducer (you'll probably need to use Fedora) is:

Debian only shop here... in fact, I still have machines without systemd.

> $ while guestfish -a /dev/null -v run >& /tmp/log; do echo -n . ; done
>
> You will need to leave it running for probably several hours, and
> examine the /tmp/log file at the end.
>
> I tracked this down to the following commit:
>
> commit f31dcb152a3d0816e2f1deab4e64572336da197d
> Author: Aaron Thompson <[email protected]>
> Date: Thu Apr 13 17:50:12 2023 +0000
>
> sched/clock: Fix local_clock() before sched_clock_init()
>
> Have local_clock() return sched_clock() if sched_clock_init() has not
> yet run. sched_clock_cpu() has this check but it was not included in the
> new noinstr implementation of local_clock().
>
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f31dcb152a3d0816e2f1deab4e64572336da197d)
>
> Reverting this commit fixes the problem.
>
> I don't know _why_ this commit is wrong, but can we revert it as it
> causes serious problems with libguestfs hanging randomly.
>
> Or if there's anything you want me to try out then let me know,
> because I can reproduce the problem locally quite easily.

Well, since it's virt and all, can you attach gdb to the gdb-stub and
see where it's at? Any clue is better than no clue.
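
A minimal sketch of what attaching to the stub might look like, assuming
QEMU was started with -s (shorthand for -gdb tcp::1234) and that a
vmlinux with debug symbols matching the guest kernel is available. The
gdb_attach_cmd helper is hypothetical; it only prints the command so it
can be inspected first. How -s gets onto the QEMU command line that
libguestfs builds is left open here.

```shell
#!/bin/sh
# Hypothetical helper: print a gdb command line that connects to QEMU's
# gdb stub and dumps a backtrace for every vCPU, then exits.
# Usage: gdb_attach_cmd VMLINUX [PORT]
gdb_attach_cmd() {
    vmlinux=$1
    port=${2:-1234}   # 1234 is the port that QEMU's -s option listens on
    printf 'gdb -batch -ex "target remote localhost:%s" -ex "info threads" -ex "thread apply all bt" %s\n' "$port" "$vmlinux"
}
```

Running gdb_attach_cmd ./vmlinux prints the command to paste into a
shell once the guest has hung. If the guest kernel uses KASLR, the
symbols will probably not line up unless it is booted with nokaslr.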
