On Tue, Aug 22, 2023 at 03:56:24PM +0300, Kalle Valo wrote:
> Johan Hovold <[email protected]> writes:
> > On Wed, Aug 09, 2023 at 09:34:32AM +0200, Johan Hovold wrote:
> >
> >> Disabling threaded NAPI caused a severe regression in 6.5-rc5 by making
> >> the X13s completely unusable (e.g. no keyboard input, I've seen an RCU
> >> splat once).
> > Any chance we can get the offending commit reverted before 6.5 is
> > released?
>
> The problem here is that would break QCN9074 again so there is no good
> solution. I suspect we have a fundamental issue in ath11k which we just
> haven't discovered yet. I would prefer to get to the bottom of this
> before reverting anything.

Sure, ideally we can find and fix the underlying issues these next few
days, but since this regression was introduced in rc5 in an attempt to
address the QCN9074 issue which has been there since 6.1 I think we
need to revert otherwise.

> > I'll take a closer look at this meanwhile.
>
> Thanks, much appreciated. Did you try enabling all kernel debug
> features, maybe they would give some hints?

Yes, I have a bunch of those enabled. Lockdep does not complain, but the
hard lockup detector triggers and it looks like CPU0 (which handles most
interrupts on this machine currently) has got stuck while processing an
interrupt.

RCU also detects the stall on CPU0 and provides a task dump for
ksoftirqd with the following call trace:

        __switch_to
        run_ksoftirqd
        smpboot_thread_fn
        kthread
        ret_from_fork

I just tried the out-of-tree pseudo NMI series [0] to get a stack trace,
but CPU0 does not respond to those either when I hit this.

Note that it takes a bit of RX to trigger this, but I hit it as soon as
I try to download something substantial (e.g. after a couple of MB).

Johan

[0] https://lore.kernel.org/lkml/[email protected]/

On Tue, Aug 22, 2023 at 03:44:45PM +0200, Johan Hovold wrote:
> On Tue, Aug 22, 2023 at 03:56:24PM +0300, Kalle Valo wrote:
> > Johan Hovold <[email protected]> writes:
> > > On Wed, Aug 09, 2023 at 09:34:32AM +0200, Johan Hovold wrote:
> > >
> > >> Disabling threaded NAPI caused a severe regression in 6.5-rc5 by making
> > >> the X13s completely unusable (e.g. no keyboard input, I've seen an RCU
> > >> splat once).
>
> > > Any chance we can get the offending commit reverted before 6.5 is
> > > released?
> >
> > The problem here is that would break QCN9074 again so there is no good
> > solution. I suspect we have a fundamental issue in ath11k which we just
> > haven't discovered yet. I would prefer to get to the bottom of this
> > before reverting anything.
>
> Sure, ideally we can find and fix the underlying issues these next few
> days, but since this regression was introduced in rc5 in an attempt to
> address the QCN9074 issue which has been there since 6.1 I think we
> need to revert otherwise.

I've managed to track down what causes the hang on the X13s after
disabling threaded NAPI. Turns out to be a severe regression in the
genirq code that causes the software resend tasklet to loop
indefinitely.

I've just sent a fix here:

https://lore.kernel.org/lkml/[email protected]/

I've also made some progress on the QCN9074 hang, but keeping the
threaded NAPI revert for now is indeed the right thing to do.

Johan

Johan Hovold <[email protected]> writes:
> On Tue, Aug 22, 2023 at 03:44:45PM +0200, Johan Hovold wrote:
>> On Tue, Aug 22, 2023 at 03:56:24PM +0300, Kalle Valo wrote:
>> > Johan Hovold <[email protected]> writes:
>> > > On Wed, Aug 09, 2023 at 09:34:32AM +0200, Johan Hovold wrote:
>> > >
>> > >> Disabling threaded NAPI caused a severe regression in 6.5-rc5 by making
>> > >> the X13s completely unusable (e.g. no keyboard input, I've seen an RCU
>> > >> splat once).
>>
>> > > Any chance we can get the offending commit reverted before 6.5 is
>> > > released?
>> >
>> > The problem here is that would break QCN9074 again so there is no good
>> > solution. I suspect we have a fundamental issue in ath11k which we just
>> > haven't discovered yet. I would prefer to get to the bottom of this
>> > before reverting anything.
>>
>> Sure, ideally we can find and fix the underlying issues these next few
>> days, but since this regression was introduced in rc5 in an attempt to
>> address the QCN9074 issue which has been there since 6.1 I think we
>> need to revert otherwise.
>
> I've managed to track down what causes the hang on the X13s after
> disabling threaded NAPI. Turns out to be a severe regression in the
> genirq code that causes the software resend tasklet to loop
> indefinitely.
>
> I've just sent a fix here:
>
> https://lore.kernel.org/lkml/[email protected]/

Oh wow, that's a tricky bug :o I'm sure it was not easy to find.

> I've also made some progress on the QCN9074 hang, but keeping the
> threaded NAPI revert for now is indeed the right thing to do.

Ok, thanks for the update and for also looking at this problem. Very much
appreciated! I'm sure we have a major bug lurking somewhere in ath11k;
it would be so good to fix that.

--
https://patchwork.kernel.org/project/linux-wireless/list/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches