2023-07-07 08:34:47

by Zhang, Rui

Subject: [Regression][BISECTED] kernel boot hang after 19898ce9cf8a ("wifi: iwlwifi: split 22000.c into multiple files")

Hi, all,

I ran into a NULL pointer dereference and a kernel boot hang after
switching to the latest upstream kernel. git bisect shows that the
commit below is the first offending commit, and I have confirmed that
commit 19898ce9cf8a has the issue while 19898ce9cf8a~1 does not.

commit 19898ce9cf8a33e0ac35cb4c7f68de297cc93cb2 (refs/bisect/bad)
Author: Johannes Berg <[email protected]>
AuthorDate: Wed Jun 21 13:12:07 2023 +0300
Commit: Johannes Berg <[email protected]>
CommitDate: Wed Jun 21 14:07:00 2023 +0200

wifi: iwlwifi: split 22000.c into multiple files

Split the configuration list in 22000.c into four new files,
per new device family, so we don't have this huge unusable
file. Yes, this duplicates a few small things, but that's
still much better than what we have now.

Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Gregory Greenman <[email protected]>
Link:
https://lore.kernel.org/r/20230621130443.7543603b2ee7.Ia8dd54216d341ef1ddc0531f2c9aa30d30536a5d@changeid
Signed-off-by: Johannes Berg <[email protected]>

I have some screenshots which show that RIP points to iwl_mem_free_skb.
I can create a kernel bugzilla entry and attach the screenshots there if
needed.

BTW, the lspci output of the wifi device and the git bisect log are
attached.

If any other information is needed, please let me know.

thanks,
rui


Attachments:
lspci-iwlwifi (2.57 kB)
git-bisect-log (2.82 kB)

2023-07-07 10:55:53

by Thorsten Leemhuis

Subject: Re: [Regression][BISECTED] kernel boot hang after 19898ce9cf8a ("wifi: iwlwifi: split 22000.c into multiple files")

[CCing the regression list, netdev, the net maintainers, and Linus;
Johannes and Kalle as well, but just for the record, they afaik are
unavailable]

Hi, Thorsten here, the Linux kernel's regression tracker.

On 07.07.23 10:25, Zhang, Rui wrote:
>
> I ran into a NULL pointer dereference and a kernel boot hang after
> switching to the latest upstream kernel. git bisect shows that the
> commit below is the first offending commit, and I have confirmed that
> commit 19898ce9cf8a has the issue while 19898ce9cf8a~1 does not.

FWIW, this is the fourth such report about this that I'm aware of.

The first is this one (with two affected users afaics):
https://bugzilla.kernel.org/show_bug.cgi?id=217622

The second is this one:
https://lore.kernel.org/all/CAAJw_Zug6VCS5ZqTWaFSr9sd85k%3DtyPm9DEE%2BmV%3DAKoECZM%[email protected]/

The third:
https://lore.kernel.org/all/[email protected]/

And in the past few days two people from Fedora land talked to me on IRC
with problems that in retrospect might be caused by this as well.

This many reports about a problem at this stage of the cycle makes me
suspect we'll see a lot more once -rc1 is out. That's why I'm raising
awareness of this. Sadly a simple revert of just this commit is not
possible. :-/

Ciao, Thorsten

> commit 19898ce9cf8a33e0ac35cb4c7f68de297cc93cb2 (refs/bisect/bad)
> Author: Johannes Berg <[email protected]>
> AuthorDate: Wed Jun 21 13:12:07 2023 +0300
> Commit: Johannes Berg <[email protected]>
> CommitDate: Wed Jun 21 14:07:00 2023 +0200
>
> wifi: iwlwifi: split 22000.c into multiple files
>
> Split the configuration list in 22000.c into four new files,
> per new device family, so we don't have this huge unusable
> file. Yes, this duplicates a few small things, but that's
> still much better than what we have now.
>
> Signed-off-by: Johannes Berg <[email protected]>
> Signed-off-by: Gregory Greenman <[email protected]>
> Link:
> https://lore.kernel.org/r/20230621130443.7543603b2ee7.Ia8dd54216d341ef1ddc0531f2c9aa30d30536a5d@changeid
> Signed-off-by: Johannes Berg <[email protected]>
>
> I have some screenshots which show that RIP points to iwl_mem_free_skb.
> I can create a kernel bugzilla entry and attach the screenshots there if
> needed.
>
> BTW, the lspci output of the wifi device and the git bisect log are
> attached.
>
> If any other information is needed, please let me know.

--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

P.S.: for regzbot

#regzbot ^introduced 19898ce9cf8a
#regzbot dup-of:
https://lore.kernel.org/all/[email protected]/

2023-07-08 14:30:23

by Thorsten Leemhuis

Subject: Re: [Regression][BISECTED] kernel boot hang after 19898ce9cf8a ("wifi: iwlwifi: split 22000.c into multiple files")

On 07.07.23 12:55, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 07.07.23 10:25, Zhang, Rui wrote:
>>
>> I ran into a NULL pointer dereference and a kernel boot hang after
>> switching to the latest upstream kernel. git bisect shows that the
>> commit below is the first offending commit, and I have confirmed that
>> commit 19898ce9cf8a has the issue while 19898ce9cf8a~1 does not.
>
> FWIW, this is the fourth such report about this that I'm aware of.
>
> The first is this one (with two affected users afaics):
> https://bugzilla.kernel.org/show_bug.cgi?id=217622
>
> The second is this one:
> https://lore.kernel.org/all/CAAJw_Zug6VCS5ZqTWaFSr9sd85k%3DtyPm9DEE%2BmV%3DAKoECZM%[email protected]/
>
> The third:
> https://lore.kernel.org/all/[email protected]/
>
> And in the past few days two people from Fedora land talked to me on IRC
> with problems that in retrospect might be caused by this as well.

I got confirmation: one of those cases is also caused by 19898ce9cf8a.
But I am writing for a different reason:

Larry (now CCed) looked at the culprit and spotted something that looked
suspicious to him; he posted a patch and is looking for testers:
https://lore.kernel.org/all/[email protected]/

Ciao, Thorsten

> This many reports about a problem at this stage of the cycle makes me
> suspect we'll see a lot more once -rc1 is out. That's why I'm raising
> awareness of this. Sadly a simple revert of just this commit is not
> possible. :-/
>
> Ciao, Thorsten
>
>> commit 19898ce9cf8a33e0ac35cb4c7f68de297cc93cb2 (refs/bisect/bad)
>> Author: Johannes Berg <[email protected]>
>> AuthorDate: Wed Jun 21 13:12:07 2023 +0300
>> Commit: Johannes Berg <[email protected]>
>> CommitDate: Wed Jun 21 14:07:00 2023 +0200
>>
>> wifi: iwlwifi: split 22000.c into multiple files
>>
>> Split the configuration list in 22000.c into four new files,
>> per new device family, so we don't have this huge unusable
>> file. Yes, this duplicates a few small things, but that's
>> still much better than what we have now.
>>
>> Signed-off-by: Johannes Berg <[email protected]>
>> Signed-off-by: Gregory Greenman <[email protected]>
>> Link:
>> https://lore.kernel.org/r/20230621130443.7543603b2ee7.Ia8dd54216d341ef1ddc0531f2c9aa30d30536a5d@changeid
>> Signed-off-by: Johannes Berg <[email protected]>
>>
>> I have some screenshots which show that RIP points to iwl_mem_free_skb.
>> I can create a kernel bugzilla entry and attach the screenshots there if
>> needed.
>>
>> BTW, the lspci output of the wifi device and the git bisect log are
>> attached.
>>
>> If any other information is needed, please let me know.
>
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> That page also explains what to do if mails like this annoy you.
>
> P.S.: for regzbot
>
> #regzbot ^introduced 19898ce9cf8a
> #regzbot dup-of:
> https://lore.kernel.org/all/[email protected]/

2023-07-09 13:42:39

by Zhang, Rui

Subject: Re: [Regression][BISECTED] kernel boot hang after 19898ce9cf8a ("wifi: iwlwifi: split 22000.c into multiple files")

On Sat, 2023-07-08 at 16:17 +0200, Thorsten Leemhuis wrote:
> On 07.07.23 12:55, Linux regression tracking (Thorsten Leemhuis)
> wrote:
> > On 07.07.23 10:25, Zhang, Rui wrote:
> > >
> > > I ran into a NULL pointer dereference and a kernel boot hang after
> > > switching to the latest upstream kernel. git bisect shows that the
> > > commit below is the first offending commit, and I have confirmed
> > > that commit 19898ce9cf8a has the issue while 19898ce9cf8a~1 does
> > > not.
> >
> > FWIW, this is the fourth such report about this that I'm aware of.
> >
> > The first is this one (with two affected users afaics):
> > https://bugzilla.kernel.org/show_bug.cgi?id=217622
> >
> > The second is this one:
> > https://lore.kernel.org/all/CAAJw_Zug6VCS5ZqTWaFSr9sd85k%3DtyPm9DEE%2BmV%3DAKoECZM%[email protected]/
> >
> > The third:
> > https://lore.kernel.org/all/[email protected]/
> >
> > And in the past few days two people from Fedora land talked to me
> > on IRC
> > > with problems that in retrospect might be caused by this as
> > well.
>
> I got confirmation: one of those cases is also caused by 19898ce9cf8a.
> But I am writing for a different reason:
>
> Larry (now CCed) looked at the culprit and spotted something that
> looked
> suspicious to him; he posted a patch and is looking for testers:
> https://lore.kernel.org/all/[email protected]/

I applied this patch but the problem still exists.

thanks,
rui
>
> Ciao, Thorsten
>
> > This many reports about a problem at this stage of the cycle makes
> > me
> > suspect we'll see a lot more once -rc1 is out. That's why I'm raising
> > awareness of this. Sadly a simple revert of just this commit is not
> > possible. :-/
> >
> > Ciao, Thorsten
> >
> > > commit 19898ce9cf8a33e0ac35cb4c7f68de297cc93cb2 (refs/bisect/bad)
> > > Author:     Johannes Berg <[email protected]>
> > > AuthorDate: Wed Jun 21 13:12:07 2023 +0300
> > > Commit:     Johannes Berg <[email protected]>
> > > CommitDate: Wed Jun 21 14:07:00 2023 +0200
> > >
> > >     wifi: iwlwifi: split 22000.c into multiple files
> > >    
> > >     Split the configuration list in 22000.c into four new files,
> > >     per new device family, so we don't have this huge unusable
> > >     file. Yes, this duplicates a few small things, but that's
> > >     still much better than what we have now.
> > >    
> > >     Signed-off-by: Johannes Berg <[email protected]>
> > >     Signed-off-by: Gregory Greenman <[email protected]>
> > >     Link:
> > > https://lore.kernel.org/r/20230621130443.7543603b2ee7.Ia8dd54216d341ef1ddc0531f2c9aa30d30536a5d@changeid
> > >     Signed-off-by: Johannes Berg <[email protected]>
> > >
> > > I have some screenshots which show that RIP points to
> > > iwl_mem_free_skb.
> > > I can create a kernel bugzilla entry and attach the screenshots
> > > there if needed.
> > >
> > > BTW, the lspci output of the wifi device and the git bisect log are
> > > attached.
> > >
> > > If any other information is needed, please let me know.
> >
> > --
> > Everything you wanna know about Linux kernel regression tracking:
> > https://linux-regtracking.leemhuis.info/about/#tldr
> > That page also explains what to do if mails like this annoy you.
> >
> > P.S.: for regzbot
> >
> > #regzbot ^introduced 19898ce9cf8a
> > #regzbot dup-of:
> > https://lore.kernel.org/all/[email protected]/

2023-07-09 17:14:10

by Larry Finger

Subject: Re: [Regression][BISECTED] kernel boot hang after 19898ce9cf8a ("wifi: iwlwifi: split 22000.c into multiple files")

On 7/9/23 08:27, Zhang, Rui wrote:
> On Sat, 2023-07-08 at 16:17 +0200, Thorsten Leemhuis wrote:
>> On 07.07.23 12:55, Linux regression tracking (Thorsten Leemhuis)
>> wrote:
>>> On 07.07.23 10:25, Zhang, Rui wrote:
>>>>
>>>> I ran into a NULL pointer dereference and a kernel boot hang after
>>>> switching to the latest upstream kernel. git bisect shows that the
>>>> commit below is the first offending commit, and I have confirmed
>>>> that commit 19898ce9cf8a has the issue while 19898ce9cf8a~1 does
>>>> not.
>>>
>>> FWIW, this is the fourth such report about this that I'm aware of.
>>>
>>> The first is this one (with two affected users afaics):
>>> https://bugzilla.kernel.org/show_bug.cgi?id=217622
>>>
>>> The second is this one:
>>> https://lore.kernel.org/all/CAAJw_Zug6VCS5ZqTWaFSr9sd85k%3DtyPm9DEE%2BmV%3DAKoECZM%[email protected]/
>>>
>>> The third:
>>> https://lore.kernel.org/all/[email protected]/
>>>
>>> And in the past few days two people from Fedora land talked to me
>>> on IRC
>>> with problems that in retrospect might be caused by this as
>>> well.
>>
>> I got confirmation: one of those cases is also caused by 19898ce9cf8a.
>> But I am writing for a different reason:
>>
>> Larry (now CCed) looked at the culprit and spotted something that
>> looked
>> suspicious to him; he posted a patch and is looking for testers:
>> https://lore.kernel.org/all/[email protected]/
>
> I applied this patch but the problem still exists.
>
> thanks,
> rui

Rui,

I am not surprised that the patch did not help. I guess you will need to stay
with kernel 6.3.X until the Intel developers return from their summer break.

Larry



2023-07-09 18:29:07

by Johannes Berg

Subject: Re: [Regression][BISECTED] kernel boot hang after 19898ce9cf8a ("wifi: iwlwifi: split 22000.c into multiple files")

On Sun, 2023-07-09 at 09:31 -0700, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 03:55, Linux regression tracking (Thorsten
> Leemhuis) <[email protected]> wrote:
> >
> > [CCing the regression list, netdev, the net maintainers, and Linus;
> > Johannes and Kalle as well, but just for the record, they afaik are
> > unavailable]
>
> So I will release rc1 with this issue, but remind me - if it hasn't
> had any traction next week and the radio silence continues, I'll just
> revert it all.

Sorry. I got back home a few hours ago (for a few days anyway) and I think
I already know what the issue is. I'll send a fix to try in a few
minutes; I was just trying to collect all the reported-by etc.

There's clearly a separate bug in the init failure path, but the reason
it fails in the first place is a mismatch between different changes.

johannes



2023-07-10 02:41:33

by Zhang, Rui

Subject: Re: [Regression][BISECTED] kernel boot hang after 19898ce9cf8a ("wifi: iwlwifi: split 22000.c into multiple files")

On Sun, 2023-07-09 at 20:07 +0200, Johannes Berg wrote:
> On Sun, 2023-07-09 at 09:31 -0700, Linus Torvalds wrote:
> > On Fri, 7 Jul 2023 at 03:55, Linux regression tracking (Thorsten
> > Leemhuis) <[email protected]> wrote:
> > >
> > > [CCing the regression list, netdev, the net maintainers, and
> > > Linus;
> > > Johannes and Kalle as well, but just for the record, they afaik
> > > are
> > > unavailable]
> >
> > So I will release rc1 with this issue, but remind me - if it hasn't
> > had any traction next week and the radio silence continues, I'll
> > just
> > revert it all.
>
> Sorry. I got back home a few hours ago (for a few days anyway) and I
> think
> I already know what the issue is. I'll send a fix to try in a few
> minutes; I was just trying to collect all the reported-by etc.
>
> There's clearly a separate bug in the init failure path, but the
> reason
> it fails in the first place is a mismatch between different changes.
>
>
I have tested Johannes' patch and it fixes the problem on my side.

Thanks,
rui