2023-07-02 04:16:52

by Bagas Sanjaya

Subject: Fwd: RCU stalls with wireguard over bonding over igb on Linux 6.3.0+

Hi,

I noticed a regression report on Bugzilla [1]. Quoting from it:

> I've spent the last week debugging a problem with my attempt to upgrade my kernel from 6.2.8 to 6.3.8 (now also with 6.4.0).
>
> The lengthy and detailed bug reports with all aspects of git bisect are at
> https://bugs.gentoo.org/909066
>
> A summary:
> - if I do not configure wg0, the kernel does not hang
> - if I use a kernel older than commit fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c, it does not hang
>
> To my naive eye, the commit touches code that seems unrelated to the problem.
>
> The hardware is a Dell PowerEdge R620 running Gentoo ~amd64.
>
> I have so far excluded:
> - dracut for generating the initramfs is the same version over all kernels
> - linux-firmware has been the same
> - CPU microcode has been the same
>
> It's been a long time since I was seriously involved with software development, and I have been even less involved with kernel development.
>
> The Gentoo maintainers recommended that I open a bug upstream, so here I am.
>
> I currently have no idea how to make progress, but I'm willing to try things.

See Bugzilla for the full thread.

Anyway, I'm adding it to regzbot to make sure it doesn't fall through the cracks
unnoticed:

#regzbot introduced: fed8d8773b8ea6 https://bugzilla.kernel.org/show_bug.cgi?id=217620
#regzbot title: correcting acpi_is_processor_usable() check causes RCU stalls with wireguard over bonding+igb
#regzbot link: https://bugs.gentoo.org/909066

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217620

--
An old man doll... just what I always wanted! - Clara


2023-07-02 12:42:06

by Bagas Sanjaya

Subject: Re: Fwd: RCU stalls with wireguard over bonding over igb on Linux 6.3.0+

[also Cc: original reporter]

On 7/2/23 10:31, Bagas Sanjaya wrote:
> [...]

satmd: Can you repeat the bisection to confirm that fed8d8773b8ea6 is
really the culprit?

Thorsten: It seems like the reporter's bisection concluded with a
(possibly) incorrect culprit. What can I do in this case besides
asking them to repeat the bisection?



by Linux regression tracking (Thorsten Leemhuis)

Subject: Re: Fwd: RCU stalls with wireguard over bonding over igb on Linux 6.3.0+

On 02.07.23 13:57, Bagas Sanjaya wrote:
> [also Cc: original reporter]

BTW: I think you CCed too many developers here. There are situations
where this can make sense, but it's rare. And if you do this too often,
people might stop looking closely at your mails or might even
ignore them completely.

Normally it's enough to write the mail to (1) the people in the
signed-off-by-chain, (2) the maintainers of the subsystem that merged a
commit, and (3) the lists for all affected subsystems; leave it up to
developers from the first two groups to CC the maintainers of the third
group.

> On 7/2/23 10:31, Bagas Sanjaya wrote:
>> I noticed a regression report on Bugzilla [1]. Quoting from it:
>>
>>> I've spent the last week debugging a problem with my attempt to upgrade my kernel from 6.2.8 to 6.3.8 (now also with
> [...]
>> See Bugzilla for the full thread.
>>
>> Anyway, I'm adding it to regzbot to make sure it doesn't fall through the cracks
>> unnoticed:
>>
>> #regzbot introduced: fed8d8773b8ea6 https://bugzilla.kernel.org/show_bug.cgi?id=217620
>> #regzbot title: correcting acpi_is_processor_usable() check causes RCU stalls with wireguard over bonding+igb
>> #regzbot link: https://bugs.gentoo.org/909066

> satmd: Can you repeat the bisection to confirm that fed8d8773b8ea6 is
> really the culprit?

I'd be careful about asking people that, as it might mean a lot of work for
them. Best to leave things like that to developers, unless it's pretty
obvious that something went sideways.

> Thorsten: It seems like the reporter's bisection concluded with a
> (possibly) incorrect culprit.

What makes you think so? I just looked at bugzilla, and (for now) it
seems that reverting fed8d8773b8ea6 on top of 6.4 fixed things for the
reporter, which is a pretty strong indicator that this change really
causes the trouble somehow.

/me really wonders what he's missing

> What can I do in this case besides
> asking them to repeat the bisection?

Not much, apart from updating the regzbot state (e.g. something like
"regzbot introduced v6.3..v6.4") and replying to your initial report
(ideally with a quick apology) to let everyone know it was a false alarm.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

2023-07-02 14:04:43

by Jason A. Donenfeld

Subject: Re: Fwd: RCU stalls with wireguard over bonding over igb on Linux 6.3.0+

I've got an overdue patch that I still need to submit to netdev, which
I suspect might actually fix this.

Can you let me know if
https://git.zx2c4.com/wireguard-linux/patch/?id=54d5e4329efe0d1dba8b4a58720d29493926bed0
solves the problem?

Jason

2023-07-02 14:32:21

by Bagas Sanjaya

Subject: Re: Fwd: RCU stalls with wireguard over bonding over igb on Linux 6.3.0+

On 7/2/23 19:37, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 02.07.23 13:57, Bagas Sanjaya wrote:
>> [also Cc: original reporter]
>
> BTW: I think you CCed too many developers here. There are situations
> where this can make sense, but it's rare. And if you do this too often,
> people might stop looking closely at your mails or might even
> ignore them completely.
>
> Normally it's enough to write the mail to (1) the people in the
> signed-off-by-chain, (2) the maintainers of the subsystem that merged a
> commit, and (3) the lists for all affected subsystems; leave it up to
> developers from the first two groups to CC the maintainers of the third
> group.
>

Hi,

In this case I also had to Cc: the wireguard, bonding, RCU, and x86 people,
since (I naively thought) this issue spans those subsystems. Anyway,
thanks for the detailed tip (honestly, /me wonders if I'll forget it later, as
is often the case).

>> On 7/2/23 10:31, Bagas Sanjaya wrote:
>>> I noticed a regression report on Bugzilla [1]. Quoting from it:
>>>
>>>> I've spent the last week debugging a problem with my attempt to upgrade my kernel from 6.2.8 to 6.3.8 (now also with
>> [...]
>>> See Bugzilla for the full thread.
>>>
>>> Anyway, I'm adding it to regzbot to make sure it doesn't fall through the cracks
>>> unnoticed:
>>>
>>> #regzbot introduced: fed8d8773b8ea6 https://bugzilla.kernel.org/show_bug.cgi?id=217620
>>> #regzbot title: correcting acpi_is_processor_usable() check causes RCU stalls with wireguard over bonding+igb
>>> #regzbot link: https://bugs.gentoo.org/909066
>
>> satmd: Can you repeat the bisection to confirm that fed8d8773b8ea6 is
>> really the culprit?
>
> I'd be careful about asking people that, as it might mean a lot of work for
> them. Best to leave things like that to developers, unless it's pretty
> obvious that something went sideways.
>

OK.

>> Thorsten: It seems like the reporter's bisection concluded with a
>> (possibly) incorrect culprit.
>
> What makes you think so? I just looked at bugzilla, and (for now) it
> seems that reverting fed8d8773b8ea6 on top of 6.4 fixed things for the
> reporter, which is a pretty strong indicator that this change really
> causes the trouble somehow.
>

OK too.

> /me really wonders what he's missing
>
>> What can I do in this case besides
>> asking them to repeat the bisection?
>
> Not much, apart from updating the regzbot state (e.g. something like
> "regzbot introduced v6.3..v6.4") and replying to your initial report
> (ideally with a quick apology) to let everyone know it was a false alarm.
>

OK.



2023-07-03 01:45:23

by Jason A. Donenfeld

Subject: Re: Fwd: RCU stalls with wireguard over bonding over igb on Linux 6.3.0+

On Sun, Jul 02, 2023 at 03:46:38PM +0200, Jason A. Donenfeld wrote:
> I've got an overdue patch that I still need to submit to netdev, which
> I suspect might actually fix this.
>
> Can you let me know if
> https://git.zx2c4.com/wireguard-linux/patch/?id=54d5e4329efe0d1dba8b4a58720d29493926bed0
> solves the problem?

satmd, the original reporter, confirmed over on the Gentoo bug report -
https://bugs.gentoo.org/909066 - that this patch fixes the issue.

This patch has been sent to netdev and will presumably land in the various
trees and in stable in due time.

Jason

2023-07-03 02:53:47

by Bagas Sanjaya

Subject: Re: Fwd: RCU stalls with wireguard over bonding over igb on Linux 6.3.0+

On Sun, Jul 02, 2023 at 03:46:38PM +0200, Jason A. Donenfeld wrote:
> I've got an overdue patch that I still need to submit to netdev, which
> I suspect might actually fix this.
>
> Can you let me know if
> https://git.zx2c4.com/wireguard-linux/patch/?id=54d5e4329efe0d1dba8b4a58720d29493926bed0
> solves the problem?

The reporter on Bugzilla [1] said it fixes the regression, so I'm telling
regzbot:

#regzbot fix: 54d5e4329efe0d

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217620#c6


