2024-02-21 16:32:34

by Thorsten Leemhuis

Subject: Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu

[adding Al, Christian and a few lists to the list of recipients to
ensure all affected parties are aware of this new report about a bug for
which a fix is committed, but not yet mainlined]

Thread starts here:
https://lore.kernel.org/all/[email protected]/

On 21.02.24 16:56, Paul Holzinger wrote:
> Hi Thorsten,
>
> On 21/02/2024 15:42, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 21.02.24 15:31, Paul Holzinger wrote:
>>> On 21/02/2024 15:20, Paul Holzinger wrote:
>>>> we are seeing problems with the 6.8-rc kernels[1] in our CI systems,
>>>> we see random process timeouts across our test suite. It appears that
>>>> sometimes a process is unable to exit, nothing happens even if we send
>>>> SIGKILL and instead the process consumes a lot of cpu.
>>> [...]
>> Thx for the report.
>>
>> Warning, this is not my area of expertise, so this might send you in the
>> totally wrong direction.
>>
>> I briefly checked lore for similar reports and noticed this one when I
>> searched for shrink_dcache_parent:
>>
>> https://lore.kernel.org/all/[email protected]/
>
>> Do you think that might be related? A fix for this is pending in vfs.git.
>>
> yes that does seem very relevant. Running the sysrq command I get the
> same backtrace as the reporter there so I think it is fair to assume
> this is the same bug. Looking forward to getting the fix into mainline.

FWIW, "the fix" afaics is 7e4a205fe56b90 ("Revert "get rid of
DCACHE_GENOCIDE"") sitting in the 'fixes' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for more than
a week now.

I assume Al or Christian will send this to Linus soon. Christian in fact
already mentioned that he plans to send another vfs fix to Linus, but
that one iirc was sitting in another repo (but I might be mistaken there!).

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

P.S.: let me update regzbot while at it:

#regzbot introduced 57851607326a2beef21e67f83f4f53a90df8445a.
#regzbot fix: Revert "get rid of DCACHE_GENOCIDE"


2024-02-24 07:00:53

by Thorsten Leemhuis

Subject: Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu

On 21.02.24 17:32, Linux regression tracking (Thorsten Leemhuis) wrote:
> [adding Al, Christian and a few lists to the list of recipients to
> ensure all affected parties are aware of this new report about a bug for
> which a fix is committed, but not yet mainlined]
>
> Thread starts here:
> https://lore.kernel.org/all/[email protected]/

[adding Linus now as well]

TWIMC, the quoted mail apparently did not get delivered to Al (I got a
"48 hours on the queue" warning from my hoster's MTA ~10 hours ago).

Ohh, and there is some suspicion that the problem Calvin[1] and Paul
(this thread, see quote below for the gist) encountered also causes
problems for bwrap (used by Flatpak)[2].
[1] https://lore.kernel.org/all/[email protected]/
[2] https://github.com/containers/bubblewrap/issues/620

Christian, Linus, all that makes me wonder if it might be wise to pick
up the revert[1] Al queued directly in case Al does not submit a PR
today or tomorrow for -rc6.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git/commit/?h=fixes&id=7e4a205fe56b9092f0143dad6aa5fee081139b09

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

> On 21.02.24 16:56, Paul Holzinger wrote:
>> Hi Thorsten,
>>
>> On 21/02/2024 15:42, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> On 21.02.24 15:31, Paul Holzinger wrote:
>>>> On 21/02/2024 15:20, Paul Holzinger wrote:
>>>>> we are seeing problems with the 6.8-rc kernels[1] in our CI systems,
>>>>> we see random process timeouts across our test suite. It appears that
>>>>> sometimes a process is unable to exit, nothing happens even if we send
>>>>> SIGKILL and instead the process consumes a lot of cpu.
>>>> [...]
>>> Thx for the report.
>>>
>>> Warning, this is not my area of expertise, so this might send you in the
>>> totally wrong direction.
>>>
>>> I briefly checked lore for similar reports and noticed this one when I
>>> searched for shrink_dcache_parent:
>>>
>>> https://lore.kernel.org/all/[email protected]/
>>
>>> Do you think that might be related? A fix for this is pending in vfs.git.
>>>
>> yes that does seem very relevant. Running the sysrq command I get the
>> same backtrace as the reporter there so I think it is fair to assume
>> this is the same bug. Looking forward to getting the fix into mainline.
>
> FWIW, "the fix" afaics is 7e4a205fe56b90 ("Revert "get rid of
> DCACHE_GENOCIDE"") sitting in the 'fixes' branch of
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for more than
> a week now.
>
> I assume Al or Christian will send this to Linus soon. Christian in fact
> already mentioned that he plans to send another vfs fix to Linus, but
> that one iirc was sitting in another repo (but I might be mistaken there!).
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> P.S.: let me update regzbot while at it:
>
> #regzbot introduced 57851607326a2beef21e67f83f4f53a90df8445a.
> #regzbot fix: Revert "get rid of DCACHE_GENOCIDE"

2024-02-24 23:44:15

by Linus Torvalds

Subject: Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu

On Fri, 23 Feb 2024 at 23:00, Thorsten Leemhuis
<[email protected]> wrote:
>
> TWIMC, the quoted mail apparently did not get delivered to Al (I got a
> "48 hours on the queue" warning from my hoster's MTA ~10 hours ago).

Al's email has been broken for the last almost two weeks - the machine
went belly-up in a major way.

I bounced the email to his kernel.org email that seems to work, but I
also think Al ends up being busy trying to get through everything else
he missed, in addition to trying to get the machine working again...

Linus

2024-02-25 01:16:57

by Al Viro

Subject: Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu

On Sat, Feb 24, 2024 at 08:00:27AM +0100, Thorsten Leemhuis wrote:
> On 21.02.24 17:32, Linux regression tracking (Thorsten Leemhuis) wrote:
> > [adding Al, Christian and a few lists to the list of recipients to
> > ensure all affected parties are aware of this new report about a bug for
> > which a fix is committed, but not yet mainlined]
> >
> > Thread starts here:
> > https://lore.kernel.org/all/[email protected]/
>
> [adding Linus now as well]
>
> TWIMC, the quoted mail apparently did not get delivered to Al (I got a
> "48 hours on the queue" warning from my hoster's MTA ~10 hours ago).
>
> Ohh, and there is some suspicion that the problem Calvin[1] and Paul
> (this thread, see quote below for the gist) encountered also causes
> problems for bwrap (used by Flatpak)[2].
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://github.com/containers/bubblewrap/issues/620
>
> Christian, Linus, all that makes me wonder if it might be wise to pick
> up the revert[1] Al queued directly in case Al does not submit a PR
> today or tomorrow for -rc6.

See #fixes in my tree.

2024-02-25 01:21:49

by Al Viro

Subject: Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu

On Sat, Feb 24, 2024 at 03:43:43PM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 23:00, Thorsten Leemhuis
> <[email protected]> wrote:
> >
> > TWIMC, the quoted mail apparently did not get delivered to Al (I got a
> > "48 hours on the queue" warning from my hoster's MTA ~10 hours ago).
>
> Al's email has been broken for the last almost two weeks - the machine
> went belly-up in a major way.
>
> I bounced the email to his kernel.org email that seems to work, but I
> also think Al ends up being busy trying to get through everything else
> he missed, in addition to trying to get the machine working again...

FWIW, I'm pretty sure that it's fixed by #fixes^ (7e4a205fe56b) in
my tree; I'll send a pull request, both for #fixes and #fixes.pathwalk.rcu

2024-02-25 05:57:41

by Thorsten Leemhuis

Subject: Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu

On 25.02.24 02:22, Al Viro wrote:
> On Sat, Feb 24, 2024 at 03:43:43PM -0800, Linus Torvalds wrote:
>> On Fri, 23 Feb 2024 at 23:00, Thorsten Leemhuis
>> <[email protected]> wrote:
>>>
>>> TWIMC, the quoted mail apparently did not get delivered to Al (I got a
>>> "48 hours on the queue" warning from my hoster's MTA ~10 hours ago).
>>
>> Al's email has been broken for the last almost two weeks - the machine
>> went belly-up in a major way.
>>
>> I bounced the email to his kernel.org email that seems to work,

Thx!

>> but I
>> also think Al ends up being busy trying to get through everything else
>> he missed, in addition to trying to get the machine working again...
>
> FWIW, I'm pretty sure that it's fixed by #fixes^ (7e4a205fe56b) in
> my tree; I'll send a pull request, both for #fixes and #fixes.pathwalk.rcu

Great, thank you, too!

Ciao, Thorsten