2024-04-23 09:07:27

by Johannes Berg

Subject: Re: 6.9.0-rc2+ kernel hangs on boot (bisected, maybe LED related)

On Tue, 2024-04-23 at 11:00 +0200, Linux regression tracking (Thorsten
Leemhuis) wrote:
> On 16.04.24 08:17, Johannes Berg wrote:
> > On Mon, 2024-04-15 at 13:37 -0700, Ben Greear wrote:
> > >
> > > Johannes, you had another suggestion: changing iwlwifi's request_module() to request_module_nowait() in
> > > iwl_req_fw_callback()
> > >
> > > Is that still the best thing to try in your opinion?
> >
> > I guess so, I don't have any better ideas so far anyway ...
>
> [adding the iwlwifi maintainer; thread starts here:
> https://lore.kernel.org/lkml/[email protected]/
>
> ]
>
> Johannes, Miri, what's the status wrt this regression? From here
> things look somewhat stalled -- but maybe there was progress and I just
> missed it.

What do you want? It got bisected to an LED merge, but you ping _us_?
Way to go ...

johannes


Subject: Re: 6.9.0-rc2+ kernel hangs on boot (bisected, maybe LED related)

On 23.04.24 11:06, Johannes Berg wrote:
> What do you want? It got bisected to an LED merge, but you ping _us_?
> Way to go ...

Sorry, to me it sounded a bit like you had an idea for a fix and were
going to give it a try -- similar to how the maintainers of the r8169
and igc drivers provided fixes for bugs that recent LED changes
exposed.

But sure, you are right: in the end some LED change seems to have caused
this, so the duty to fix it lies in that field. Therefore:

Lee, what's the status here to get this fixed before the final?

Ciao, Thorsten



2024-04-24 17:29:01

by Ben Greear

Subject: Re: 6.9.0-rc2+ kernel hangs on boot (bisected, maybe LED related)

Hello Johannes,

This patch makes the problem go away in my testbed with 24 Intel
iwlwifi radios. My guess is that it is just papering over the problem, but
maybe good enough? Would you like me to submit it as an official
patch to linux-wireless?

$ git diff
diff --git a/drivers/net/wireless/intel/iwlwifi/iwl-drv.c b/drivers/net/wireless/intel/iwlwifi/iwl-drv.c
index 4696d73c8971..993177e1de27 100644
--- a/drivers/net/wireless/intel/iwlwifi/iwl-drv.c
+++ b/drivers/net/wireless/intel/iwlwifi/iwl-drv.c
@@ -1744,7 +1744,7 @@ static void iwl_req_fw_callback(const struct firmware *ucode_raw, void *context)
 	 * or hangs loading.
 	 */
 	if (load_module)
-		request_module("%s", op->name);
+		request_module_nowait("%s", op->name);
 	failure = false;
 	goto free;


Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com



2024-04-24 17:31:42

by Johannes Berg

Subject: Re: 6.9.0-rc2+ kernel hangs on boot (bisected, maybe LED related)

Hi Ben,

On Wed, 2024-04-24 at 10:26 -0700, Ben Greear wrote:
>
> This patch makes the problem go away in my testbed with 24 Intel
> iwlwifi radios. My guess is that it is just papering over the problem,

Agreed, there seems to be some locking issue with LED stuff, but I'm not
sure where, and the driver doesn't even hold any locks here.

> but maybe good enough?

For all I care, yes. We explicitly do this last, from a worker that
holds no locks in the driver ... so it's odd. Looking at its history,
it seems that it was originally _nowait(), but then in 3.6 I changed
it because of some backport concerns, though back then I also moved it
out of the locked section.

> Would you like me to submit it as an official
> patch to linux-wireless?

Sure.

johannes
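
For readers less familiar with the two calls: request_module() waits for the userspace modprobe helper to finish loading the module, while request_module_nowait() starts modprobe and returns without waiting for the load to complete. The sketch below is a user-space analogy in C with POSIX threads, not kernel code, showing why such a switch can make a hang disappear without identifying its cause. The names shared_lock, helper() and request_helper() are hypothetical, and the single mutex is only an assumed placeholder for some dependency on the waiting side; the thread had not identified the real one, and Johannes notes the driver itself holds no locks at that point.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for an unknown dependency shared by the caller and
 * the helper; NOT the actual lock involved in the iwlwifi/LED hang. */
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

/* Set by the helper; read by the caller only after pthread_join(). */
static bool helper_finished;

/* Plays the role of the modprobe helper: it must take shared_lock to finish.
 * A 1 s timeout replaces what would otherwise be an indefinite hang. */
static void *helper(void *arg)
{
    struct timespec deadline;

    (void)arg;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;
    if (pthread_mutex_timedlock(&shared_lock, &deadline) == 0) {
        helper_finished = true;
        pthread_mutex_unlock(&shared_lock);
    } else {
        helper_finished = false; /* timed out: stands in for the hang */
    }
    return NULL;
}

/* Plays the role of the caller. wait=true mimics request_module() (block
 * until the helper is done); wait=false mimics request_module_nowait()
 * (start the helper and carry on). Returns whether the helper finished. */
static bool request_helper(bool wait)
{
    pthread_t tid;

    pthread_mutex_lock(&shared_lock);
    if (pthread_create(&tid, NULL, helper, NULL) != 0) {
        pthread_mutex_unlock(&shared_lock);
        return false;
    }
    if (wait)
        pthread_join(tid, NULL); /* waits while still holding shared_lock */
    pthread_mutex_unlock(&shared_lock);
    if (!wait)
        pthread_join(tid, NULL); /* joined here only so the demo can report */
    return helper_finished;
}
```

In this sketch request_helper(false) returns true because the helper gets the lock as soon as the caller drops it, while request_helper(true) returns false after the roughly one-second timeout that stands in for the hang. That is why the switch can remove the symptom, as Ben and Johannes say, without explaining which dependency caused it.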

2024-05-02 07:19:21

by Lee Jones

Subject: Re: 6.9.0-rc2+ kernel hangs on boot (bisected, maybe LED related)

On Tue, 23 Apr 2024, Linux regression tracking (Thorsten Leemhuis) wrote:

> Lee, what's the status here to get this fixed before the final?

No idea. Did you send a fix?

--
Lee Jones [李琼斯]

Subject: Re: 6.9.0-rc2+ kernel hangs on boot (bisected, maybe LED related)

On 02.05.24 09:19, Lee Jones wrote:
> On Tue, 23 Apr 2024, Linux regression tracking (Thorsten Leemhuis) wrote:
>
>> Lee, what's the status here to get this fixed before the final?
>
> No idea. Did you send a fix?

I'm just here to help Linus keep an eye on regressions and ensure they
are handled the way he wants them to be handled.

But Ben Greear sent a patch to work around the problem:

https://lore.kernel.org/all/[email protected]/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

2024-05-05 05:48:22

by Ben Greear

Subject: Re: 6.9.0-rc2+ kernel hangs on boot (bisected, maybe LED related)

On 5/2/24 00:19, Lee Jones wrote:
> On Tue, 23 Apr 2024, Linux regression tracking (Thorsten Leemhuis) wrote:
>
>> Lee, what's the status here to get this fixed before the final?
>
> No idea. Did you send a fix?

I sent what is probably just a work-around. I also spent time bisecting and testing.
The problem appears to have come in with the LED related merge. I think it is fair
to ask the LED folks to at least take a look at the lockdep debugging I posted. It is
not fair to expect anyone that manages to find or track a bug to also fix it.

If someone has a different suggested fix than the hack I posted, I will be happy to
test. On my system with lots of radios, it is 100% reproducible.
Maybe email me directly as I don't keep close watch on LKML.

Thanks,
Ben



2024-05-05 10:55:43

by Tetsuo Handa

Subject: Re: 6.9.0-rc2+ kernel hangs on boot (bisected, maybe LED related)

On 2024/05/05 14:48, Ben Greear wrote:
> If someone has a different suggested fix than the hack I posted, I will be happy to
> test.  On my system with lots of radios, it is 100% reproducible.
> Maybe email me directly as I don't keep close watch on LKML.

Please collect stacktraces of all lock holders using
https://lkml.kernel.org/r/[email protected] .

Depending on the output, I might ask you to decode addresses using ./scripts/faddr2line .


2024-05-07 04:21:43

by Ben Greear

Subject: Re: 6.9.0-rc2+ kernel hangs on boot (bisected, maybe LED related)

On 5/5/24 03:55, Tetsuo Handa wrote:
> Please collect stacktraces of all lock holders using
> https://lkml.kernel.org/r/[email protected] .
>
> Depending on the output, I might ask you to decode addresses using ./scripts/faddr2line .

I am travelling for the next few weeks, but will work on this when I return.

Thanks,
Ben


2024-05-07 08:23:32

by Lee Jones

Subject: Re: 6.9.0-rc2+ kernel hangs on boot (bisected, maybe LED related)

On Sat, 04 May 2024, Ben Greear wrote:

> On 5/2/24 00:19, Lee Jones wrote:
> > No idea. Did you send a fix?
>
> I sent what is probably just a work-around. I also spent time
> bisecting and testing. The problem appears to have come in with the
> LED related merge. I think it is fair to ask the LED folks to at
> least take a look at the lockdep debugging I posted.

I can't speak for Pavel, but I personally have no way of debugging or
reproducing this. The only usefulness I can provide is to review and
apply fixes as and when they appear.

> It is not fair to expect anyone that manages to find or track a bug to
> also fix it.

No such expectation has been felt or communicated.
