2024-04-02 08:17:26

by Thorsten Leemhuis

Subject: Bug 218665 - nohz_full=0 prevents kernel from booting

Hi, Thorsten here, the Linux kernel's regression tracker.

I noticed a regression report in bugzilla.kernel.org. As many (most?)
kernel developers don't keep an eye on it, I decided to forward it by mail.

Tejun, apparently it's caused by a change of yours.

Note, you have to use bugzilla to reach the reporter, as I sadly[1] cannot
CC them in mails like this.

Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218665 :

> booting the current kernel (6.9.0-rc1, master/712e1425) on x86_64
> with nohz_full=0 causes a page fault and prevents the kernel from
> booting.
>
> Steps to reproduce:
> - make defconfig
> - set CONFIG_NO_HZ_FULL=y
> - set CONFIG_SUSPEND=n and CONFIG_HIBERNATION=n (to get CONFIG_PM_SLEEP_SMP=n)
> - make
> - qemu-system-x86_64 -nographic -cpu qemu64 -smp cores=2 -m 1024 -kernel arch/x86/boot/bzImage -append "earlyprintk=ttyS0 console=ttyS0 root=/dev/dummy rootwait nohz_full=0"
>
> I have attached the output of a failed nohz_full=0 boot as
> nohz_full_0.txt and - for reference - the output of a nohz_full=1
> boot as nohz_full_1.txt.
>
> Interestingly enough, using the deprecated isolcpus parameter to
> enable NO_HZ for cpu0 works. I've attached the output as
> isolcpus_nohz_0.txt.
>
> Bisecting showed 5797b1c18919cd9c289ded7954383e499f729ce0 as first bad commit.
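
For convenience, the quoted steps could be scripted roughly as follows. This
is only a sketch under assumptions: the function names are made up for
illustration, and using scripts/config plus "make olddefconfig" (with
NO_HZ_IDLE disabled so the NO_HZ_FULL choice sticks) is my substitute for
editing .config by hand; the report does not say how the options were set.

```shell
#!/bin/sh
# Sketch of the reproduction steps quoted above; run from the top of a
# kernel source tree. Function names are illustrative, not from the report.

# Build the kernel command line for a given nohz_full value.
kernel_append() {
    printf '%s nohz_full=%s\n' \
        "earlyprintk=ttyS0 console=ttyS0 root=/dev/dummy rootwait" "$1"
}

# Configure and build. Assumption: scripts/config plus olddefconfig is an
# acceptable substitute for editing .config by hand.
build_kernel() {
    make defconfig &&
    ./scripts/config --disable NO_HZ_IDLE --enable NO_HZ_FULL \
        --disable SUSPEND --disable HIBERNATION &&
    make olddefconfig &&
    make -j"$(nproc)"
}

# Boot the result in QEMU with the options from the report.
boot_kernel() {
    qemu-system-x86_64 -nographic -cpu qemu64 -smp cores=2 -m 1024 \
        -kernel arch/x86/boot/bzImage -append "$(kernel_append "$1")"
}
```

With these definitions, "build_kernel && boot_kernel 0" should correspond to
the failing case and "boot_kernel 1" to the reference boot. It is worth
checking .config after the build to confirm that CONFIG_NO_HZ_FULL=y is set
and CONFIG_PM_SLEEP_SMP is not.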

See the ticket for more details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

[1] because bugzilla.kernel.org tells users upon registration their
"email address will never be displayed to logged out users"

#regzbot introduced: 5797b1c18919cd9c289ded7954383e499f729ce0
#regzbot from: Friedrich Oslage
#regzbot duplicate https://bugzilla.kernel.org/show_bug.cgi?id=218665
#regzbot title: workqueue: nohz_full=0 prevents booting
#regzbot ignore-activity


2024-04-03 19:14:53

by Tejun Heo

Subject: Re: Bug 218665 - nohz_full=0 prevents kernel from booting

Hello, Thorsten.

On Tue, Apr 02, 2024 at 10:17:16AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi, Thorsten here, the Linux kernel's regression tracker.
>
> I noticed a regression report in bugzilla.kernel.org. As many (most?)
> kernel developers don't keep an eye on it, I decided to forward it by mail.
>
> Tejun, apparently it's caused by a change of yours.
>
> Note, you have to use bugzilla to reach the reporter, as I sadly[1] cannot
> CC them in mails like this.
>
> Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218665 :

This looks like the same problem that's being discussed in the following
thread.

http://lkml.kernel.org/r/[email protected]

Hopefully, we'll soon reach a resolution.

Thanks.

--
tejun

2024-04-07 22:48:06

by Bjorn Andersson

Subject: Re: Bug 218665 - nohz_full=0 prevents kernel from booting

On Tue, Apr 02, 2024 at 10:17:16AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi, Thorsten here, the Linux kernel's regression tracker.
>
> I noticed a regression report in bugzilla.kernel.org. As many (most?)
> kernel developers don't keep an eye on it, I decided to forward it by mail.
>
> Tejun, apparently it's caused by a change of yours.
>
> Note, you have to use bugzilla to reach the reporter, as I sadly[1] cannot
> CC them in mails like this.
>
> Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218665 :
>
> > booting the current kernel (6.9.0-rc1, master/712e1425) on x86_64
> > with nohz_full=0 causes a page fault and prevents the kernel from
> > booting.
> >
> > Steps to reproduce:
> > - make defconfig
> > - set CONFIG_NO_HZ_FULL=y
> > - set CONFIG_SUSPEND=n and CONFIG_HIBERNATION=n (to get CONFIG_PM_SLEEP_SMP=n)
> > - make
> > - qemu-system-x86_64 -nographic -cpu qemu64 -smp cores=2 -m 1024 -kernel arch/x86/boot/bzImage -append "earlyprintk=ttyS0 console=ttyS0 root=/dev/dummy rootwait nohz_full=0"
> >
> > I have attached the output of a failed nohz_full=0 boot as
> > nohz_full_0.txt and - for reference - the output of a nohz_full=1
> > boot as nohz_full_1.txt.
> >
> > Interestingly enough, using the deprecated isolcpus parameter to
> > enable NO_HZ for cpu0 works. I've attached the output as
> > isolcpus_nohz_0.txt.
> >
> > Bisecting showed 5797b1c18919cd9c289ded7954383e499f729ce0 as first bad commit.
>
> See the ticket for more details.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> [1] because bugzilla.kernel.org tells users upon registration their
> "email address will never be displayed to logged out users"
>
> #regzbot introduced: 5797b1c18919cd9c289ded7954383e499f729ce0
> #regzbot from: Friedrich Oslage
> #regzbot duplicate https://bugzilla.kernel.org/show_bug.cgi?id=218665
> #regzbot title: workqueue: nohz_full=0 prevents booting
> #regzbot ignore-activity

In addition to this report, I have finally bisected another regression
to the same commit:

I start neovim, send SIGSTOP (i.e. ^Z) to it, start another neovim
instance and upon sending SIGSTOP to that instance all of userspace
locks up - 100% reproducible.

The kernel seems to continue to operate, and tapping the power button
dislodges the lockup and I get a clean shutdown.

This is seen on multiple Arm64 (Qualcomm) machines with upstream
defconfig since commit '5797b1c18919 ("workqueue: Implement system-wide
nr_active enforcement for unbound workqueues")'.

Regards,
Bjorn

2024-04-10 09:21:13

by Thorsten Leemhuis

Subject: Re: Bug 218665 - nohz_full=0 prevents kernel from booting

On 08.04.24 00:52, Bjorn Andersson wrote:
> On Tue, Apr 02, 2024 at 10:17:16AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>>
>> I noticed a regression report in bugzilla.kernel.org. As many (most?)
>> kernel developers don't keep an eye on it, I decided to forward it by mail.
>>
>> Tejun, apparently it's caused by a change of yours.
>>
>> Note, you have to use bugzilla to reach the reporter, as I sadly[1] cannot
>> CC them in mails like this.
>>
>> Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218665 :
>>
>>> booting the current kernel (6.9.0-rc1, master/712e1425) on x86_64
>>> with nohz_full=0 causes a page fault and prevents the kernel from
>>> booting.
> [...]
> In addition to this report, I have finally bisected another regression
> to the same commit:
>
> I start neovim, send SIGSTOP (i.e. ^Z) to it, start another neovim
> instance and upon sending SIGSTOP to that instance all of userspace
> locks up - 100% reproducible.
>
> The kernel seems to continue to operate, and tapping the power button
> dislodges the lockup and I get a clean shutdown.
>
> This is seen on multiple Arm64 (Qualcomm) machines with upstream
> defconfig since commit '5797b1c18919 ("workqueue: Implement system-wide
> nr_active enforcement for unbound workqueues")'.

Hmmm, I had hoped Tejun would reply and share an opinion if these
problems are related. But that didn't happen. :-/ So let me at least ask
one question that might help to answer that question: is the machine
using CPU isolation, like the two other reports about problems caused by
this commit do (see the
https://bugzilla.kernel.org/show_bug.cgi?id=218665 and
https://lore.kernel.org/all/[email protected]/ for
details) ?

Ciao, Thorsten

2024-04-12 03:01:08

by Bjorn Andersson

Subject: Re: Bug 218665 - nohz_full=0 prevents kernel from booting

On Wed, Apr 10, 2024 at 11:18:04AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 08.04.24 00:52, Bjorn Andersson wrote:
> > On Tue, Apr 02, 2024 at 10:17:16AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>
> >> I noticed a regression report in bugzilla.kernel.org. As many (most?)
> >> kernel developers don't keep an eye on it, I decided to forward it by mail.
> >>
> >> Tejun, apparently it's caused by a change of yours.
> >>
> >> Note, you have to use bugzilla to reach the reporter, as I sadly[1] cannot
> >> CC them in mails like this.
> >>
> >> Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218665 :
> >>
> >>> booting the current kernel (6.9.0-rc1, master/712e1425) on x86_64
> >>> with nohz_full=0 causes a page fault and prevents the kernel from
> >>> booting.
> > [...]
> > In addition to this report, I have finally bisected another regression
> > to the same commit:
> >
> > I start neovim, send SIGSTOP (i.e. ^Z) to it, start another neovim
> > instance and upon sending SIGSTOP to that instance all of userspace
> > locks up - 100% reproducible.
> >
> > The kernel seems to continue to operate, and tapping the power button
> > dislodges the lockup and I get a clean shutdown.
> >
> > This is seen on multiple Arm64 (Qualcomm) machines with upstream
> > defconfig since commit '5797b1c18919 ("workqueue: Implement system-wide
> > nr_active enforcement for unbound workqueues")'.
>
> Hmmm, I had hoped Tejun would reply and share an opinion if these
> problems are related. But that didn't happen. :-/ So let me at least ask
> one question that might help to answer that question: is the machine
> using CPU isolation, like the two other reports about problems caused by
> this commit do (see the
> https://bugzilla.kernel.org/show_bug.cgi?id=218665 and
> https://lore.kernel.org/all/[email protected]/ for
> details) ?
>

No, this is a clean SMP system running stock arch/arm64/defconfig,
booted with "clk_ignore_unused pd_ignore_unused audit=0" as the command
line.

Regards,
Bjorn

2024-04-16 06:08:26

by Thorsten Leemhuis

Subject: Re: Bug 218665 - nohz_full=0 prevents kernel from booting

On 12.04.24 04:57, Bjorn Andersson wrote:
> On Wed, Apr 10, 2024 at 11:18:04AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 08.04.24 00:52, Bjorn Andersson wrote:
>>> On Tue, Apr 02, 2024 at 10:17:16AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>
>>>> Tejun, apparently it's caused by a change of yours.
>>>> Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218665 :
>>>>
>>>>> booting the current kernel (6.9.0-rc1, master/712e1425) on x86_64
>>>>> with nohz_full=0 causes a page fault and prevents the kernel from
>>>>> booting.
>>> [...]

Tejun, I got a bit lost here. Can you help me out please?

I'm currently assuming that these two reports have the same cause:
https://lore.kernel.org/all/[email protected]/T/#u
https://bugzilla.kernel.org/show_bug.cgi?id=218665

And that both will be fixed by this patch from Oleg Nesterov:
https://lore.kernel.org/lkml/[email protected]/

But well, to me it looks like below issue from Bjorn is different, even
if it is caused by the same change -- nevertheless it looks like nobody
has looked into this since it was reported about two weeks ago. Or was
progress made and I just missed it?

>>> In addition to this report, I have finally bisected another regression
>>> to the same commit:
>>>
>>> I start neovim, send SIGSTOP (i.e. ^Z) to it, start another neovim
>>> instance and upon sending SIGSTOP to that instance all of userspace
>>> locks up - 100% reproducible.
>>>
>>> The kernel seems to continue to operate, and tapping the power button
>>> dislodges the lockup and I get a clean shutdown.
>>>
>>> This is seen on multiple Arm64 (Qualcomm) machines with upstream
>>> defconfig since commit '5797b1c18919 ("workqueue: Implement system-wide
>>> nr_active enforcement for unbound workqueues")'.
>>
>> Hmmm, I had hoped Tejun would reply and share an opinion if these
>> problems are related. But that didn't happen. :-/ So let me at least ask
>> one question that might help to answer that question: is the machine
>> using CPU isolation, like the two other reports about problems caused by
>> this commit do (see the
>> https://bugzilla.kernel.org/show_bug.cgi?id=218665 and
>> https://lore.kernel.org/all/[email protected]/ for
>> details) ?
>
> No, this is a clean SMP system running stock arch/arm64/defconfig,
> booted with "clk_ignore_unused pd_ignore_unused audit=0" as the command
> line.
>
> Regards,
> Bjorn

Ciao, Thorsten

2024-04-16 23:21:53

by Tejun Heo

Subject: Re: Bug 218665 - nohz_full=0 prevents kernel from booting

Hello,

On Tue, Apr 16, 2024 at 08:08:07AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 12.04.24 04:57, Bjorn Andersson wrote:
> > On Wed, Apr 10, 2024 at 11:18:04AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> >> On 08.04.24 00:52, Bjorn Andersson wrote:
> >>> On Tue, Apr 02, 2024 at 10:17:16AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>>>
> >>>> Tejun, apparently it's caused by a change of yours.
> >>>> Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218665 :
> >>>>
> >>>>> booting the current kernel (6.9.0-rc1, master/712e1425) on x86_64
> >>>>> with nohz_full=0 causes a page fault and prevents the kernel from
> >>>>> booting.
> >>> [...]
>
> Tejun, I got a bit lost here. Can you help me out please?
>
> I'm currently assuming that these two reports have the same cause:
> https://lore.kernel.org/all/[email protected]/T/#u
> https://bugzilla.kernel.org/show_bug.cgi?id=218665
>
> And that both will be fixed by this patch from Oleg Nesterov:
> https://lore.kernel.org/lkml/[email protected]/
>
> But well, to me it looks like below issue from Bjorn is different, even
> if it is caused by the same change -- nevertheless it looks like nobody
> has looked into this since it was reported about two weeks ago. Or was
> progress made and I just missed it?

Can you elaborate why Bjorn's case is different? I was assuming it was the
same problem and that Oleg's fixes would address the issue.

Thanks.

--
tejun

2024-04-17 05:48:48

by Thorsten Leemhuis

Subject: Re: Bug 218665 - nohz_full=0 prevents kernel from booting

On 17.04.24 01:21, Tejun Heo wrote:
> On Tue, Apr 16, 2024 at 08:08:07AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 12.04.24 04:57, Bjorn Andersson wrote:
>>> On Wed, Apr 10, 2024 at 11:18:04AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>> On 08.04.24 00:52, Bjorn Andersson wrote:
>>>>> On Tue, Apr 02, 2024 at 10:17:16AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>>>
>>>>>> Tejun, apparently it's caused by a change of yours.
>>>>>> Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218665 :
>>>>>>
>>>>>>> booting the current kernel (6.9.0-rc1, master/712e1425) on x86_64
>>>>>>> with nohz_full=0 causes a page fault and prevents the kernel from
>>>>>>> booting.
>>>>> [...]
>>
>> Tejun, I got a bit lost here. Can you help me out please?
>>
>> I'm currently assuming that these two reports have the same cause:
>> https://lore.kernel.org/all/[email protected]/T/#u
>> https://bugzilla.kernel.org/show_bug.cgi?id=218665
>>
>> And that both will be fixed by this patch from Oleg Nesterov:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> But well, to me it looks like below issue from Bjorn is different, even
>> if it is caused by the same change -- nevertheless it looks like nobody
>> has looked into this since it was reported about two weeks ago. Or was
>> progress made and I just missed it?
>
> Can you elaborate why Bjorn's case is different?

Well "not booting at all when using 'nohz_full=0'"[as reported two
times] and "I start neovim, send SIGSTOP (i.e. ^Z) to it, start another
neovim instance and upon sending SIGSTOP to that instance all of
userspace locks up - 100% reproducible."[while no 'nohz_full=0' in use]
at least at first sight, to an outsider, sound a lot like different
problems to me -- but of course that impression might be wrong and you
know better about these things.

> I was assuming it was the
> same problem and that Oleg's fixes would address the issue.

Bjorn, could you give it a try?

Ciao, Thorsten

2024-04-18 02:07:37

by Tejun Heo

Subject: Re: Bug 218665 - nohz_full=0 prevents kernel from booting

Hello, Thorsten.

On Wed, Apr 17, 2024 at 07:48:33AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> > Can you elaborate why Bjorn's case is different?
>
> Well "not booting at all when using 'nohz_full=0'"[as reported two
> times] and "I start neovim, send SIGSTOP (i.e. ^Z) to it, start another
> neovim instance and upon sending SIGSTOP to that instance all of
> userspace locks up - 100% reproducible."[while no 'nohz_full=0' in use]
> at least at first sight, to an outsider, sound a lot like different
> problems to me -- but of course that impression might be wrong and you
> know better about these things.

You are right. That is very different.

> > I was assuming it was the
> > same problem and that Oleg's fixes would address the issue.
>
> Bjorn, could you give it a try?

Yeah, I'm curious whether it's just a different symptom of the same problem.

Thanks.

--
tejun

2024-04-22 21:23:24

by Bjorn Andersson

Subject: Re: Bug 218665 - nohz_full=0 prevents kernel from booting

On Wed, Apr 17, 2024 at 04:07:26PM -1000, Tejun Heo wrote:
> Hello, Thorsten.
>
> On Wed, Apr 17, 2024 at 07:48:33AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> > > Can you elaborate why Bjorn's case is different?
> >
> > Well "not booting at all when using 'nohz_full=0'"[as reported two
> > times] and "I start neovim, send SIGSTOP (i.e. ^Z) to it, start another
> > neovim instance and upon sending SIGSTOP to that instance all of
> > userspace locks up - 100% reproducible."[while no 'nohz_full=0' in use]
> > at least at first sight, to an outsider, sound a lot like different
> > problems to me -- but of course that impression might be wrong and you
> > know better about these things.
>
> You are right. That is very different.
>
> > > I was assuming it was the
> > > same problem and that Oleg's fixes would address the issue.
> >
> > Bjorn, could you give it a try?
>
> Yeah, I'm curious whether it's just a different symptom of the same problem.
>

Sorry for the late reply, had to step back to v6.8 on my work machine
and didn't retry this in a timely manner.

I've now confirmed that my problem with Neovim was resolved in v6.9-rc3,
through the introduction of:
73eaa2b58349 ("io_uring: use private workqueue for exit work")

Regards,
Bjorn