2023-03-04 16:24:16

by Thorsten Leemhuis

Subject: [Regression] rt2800usb - Wifi performance issues and connection drops

Hi, this is your Linux kernel regression tracker.

I noticed a regression report in bugzilla.kernel.org. As many (most?)
kernel developers don't keep an eye on it, I decided to forward it by
mail. Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=217119 :

> Thomas Mann 2023-03-03 15:12:03 UTC
>
> After the update of linux to 6.2.x, i get connection drops and bandwidth problems.
>
> 6.2.1 was completely unusable and 6.2.2 still has bandwidth problems but works a bit better
>
> The device in use is:
>
> 13d3:3273 IMC Networks 802.11 n/g/b Wireless LAN USB Mini-Card
>
> Downgrading the kernel to 6.1.[14,15] fixes the problem and the wifi gets stable again and the available bandwidth increases.
>
> dmesg shows no errors
>
> Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-03-04 05:45:33 UTC
>
> Please attach dmesg [without it most people won't even know which driver is in use for your card]
>
> Comment 2 Thomas Mann 2023-03-04 12:36:45 UTC
>
> driver in use is rt2800usb
>
> Comment 3 Thomas Mann 2023-03-04 12:38:01 UTC
>
> Created attachment 303840 [details]
> dmesg output
>


See the ticket for more details.


[TLDR for the rest of this mail: I'm adding this report to the list of
tracked Linux kernel regressions; the text you find below is based on a
few template paragraphs you might have encountered already in similar
form.]

BTW, let me use this mail to also add the report to the list of tracked
regressions to ensure it doesn't fall through the cracks:

#regzbot introduced: v6.1..v6.2
https://bugzilla.kernel.org/show_bug.cgi?id=217119
#regzbot title: net: wireless: rt2800usb: wifi performance issues and
connection drops
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (e.g. the bugzilla ticket and maybe this mail as well, if
this thread sees some discussion). See page linked in footer for details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.


2023-03-05 17:26:10

by Thorsten Leemhuis

Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 04.03.23 17:24, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi, this is your Linux kernel regression tracker.
>
> I noticed a regression report in bugzilla.kernel.org. As many (most?)
> kernel developer don't keep an eye on it, I decided to forward it by
> mail. Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=217119 :
>
>> Thomas Mann 2023-03-03 15:12:03 UTC
>>
>> After the update of linux to 6.2.x, i get connection drops and bandwidth problems.
>>
>> 6.2.1 was completely unusable and 6.2.2 still has bandwidth problems but works a bit better
>>
>> The device in use is:
>>
>> 13d3:3273 IMC Networks 802.11 n/g/b Wireless LAN USB Mini-Card
>>
>> Downgrading the kernel to 6.1.[14,15] fixes the problem and the wifi gets stable again and the available bandwidth increases.

Quick update from bugzilla:

```
--- Comment #4 from Thomas Mann ([email protected]) ---
i bisected and found the commit that introduced the regression:

# first bad commit: [4444bc2116aecdcde87dce80373540adc8bd478b] wifi:
mac80211: Proper mark iTXQs for resumption
```

That's a commit from Alexander, applied by Johannes.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

> BTW, let me use this mail to also add the report to the list of tracked
> regressions to ensure it's doesn't fall through the cracks:
>
> #regzbot introduced: v6.1..v6.2
> https://bugzilla.kernel.org/show_bug.cgi?id=217119

P.S.:

#regzbot introduced: 4444bc2116aecdcde8
#regzbot ignore-activity


2023-03-05 22:29:01

by Alexander Wetzel

Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

> Quick update from bugzilla:
>
> ```
> --- Comment #4 from Thomas Mann ([email protected]) ---
> i bisected and found the commit that introduced the regression:
>
> # first bad commit: [4444bc2116aecdcde87dce80373540adc8bd478b] wifi:
> mac80211: Proper mark iTXQs for resumption
> ```
>
> That's a commit from Alexander, applied by Johannes.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>

I just uploaded a test patch to bugzilla.
Please check whether it fixes the issue.

If not, I would be interested in the output of your iTXQ status.
Enable CONFIG_MAC80211_DEBUGFS, run this command while the connection
is bad, and send/share/upload the resulting debug.out to bugzilla:

k=1; while [ $k -lt 10 ]; do \
cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
k=$(($k+1)); done >> debug.out

Alexander

2023-03-07 20:55:00

by Alexander Wetzel

Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

>>
>
> I just uploaded a test patch to bugzilla.
> Please have a look if that fixes the issue.
>
> If not I would be interested in the output of your iTXQ status.
> Enable CONFIG_MAC80211_DEBUGFS and run this command when the connection
> is bad and send/share/upload to bugzilla the resulting debug.out:
>
> k=1; while [ $k -lt 10 ]; do \
> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
> k=$(($k+1)); done >> debug.out

Thomas and I continued with some debugging in
https://bugzilla.kernel.org/show_bug.cgi?id=217119

But the results so far are unexpected, so we decided to continue the
debugging with the wider audience here, hoping someone sees something I
missed.

A very short summary of where we are:
I can't reproduce the bug with a very similar card and kernel config so
far. Thomas' card stops the iTXQs for intervals >30s. Mine operates
normally.

A more useful but longer summary:

Thomas updated to a 6.2 kernel and reported "connection drops and
bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.) Asked for
more details, he reported:
"...slow bandwidth stuff works better, but the main problem/test case is
to start a 8-16 mbit video stream, which sometimes runs for a few
seconds and then stops or it doesn't start at all"

He bisected the issue and identified my commit 4444bc2116ae ("wifi:
mac80211: Proper mark iTXQs for resumption") as the culprit.

Checking the internal iTXQ status while the issue is ongoing shows that
TID 0 is flagged as dirty and thus is not transmitting queued packets.
Interesting line from
/sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets flags
0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)

--> The "normal" iTXQ handling IEEE80211_AC_BE has queued packets and is
flagged as DIRTY. There is even a potential race when setting the DIRTY
flag, but the fix for that does not help.
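
For anyone who wants to scan a captured debug.out for this pattern, here
is a minimal sketch. It assumes the column layout of the aqm header
quoted above (TID in column 1, backlog-bytes in column 3,
backlog-packets in column 4, flags last). The function name stalled_tids
is made up for this example and is not part of any existing tool.

```shell
# Hypothetical helper: read aqm text on stdin and report every TID that
# still has queued packets while its flags contain DIRTY.
# Assumed columns: tid ac backlog-bytes backlog-packets ... flags
stalled_tids() {
    awk '
        # Only data rows start with a numeric TID; header lines are skipped.
        $1 ~ /^[0-9]+$/ && $4 > 0 && /DIRTY/ {
            print "TID " $1 ": " $4 " packets (" $3 " bytes) queued, DIRTY"
        }
    '
}
```

It could be run as `stalled_tids < debug.out`; rows without DIRTY or
without a backlog produce no output.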

Thus Thomas applied two debug patches to better understand why the
DIRTY flag is not cleared.

Looking at the output from those, we see that the driver stops TX by
calling ieee80211_stop_queue(). When ieee80211_wake_queue() is called,
mac80211 correctly resumes TX but gets stopped by the driver again after
a single packet. (The start of the relevant log is missing, so initially
it may be more.)
I assume TX is still OK at that stage. But after some single TX
operations the driver stops the queues again. Here is the relevant part
of the log:
[ 179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
[ 179.585022] XXXX drv_tx: TX
[ 179.585027] XXXX ieee80211_stop_queue: called
[ 179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[ 179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
[ 179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
[ 179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
[ 179.585034] XXXX __ieee80211_wake_txqs: EXIT
[ 179.585035] XXXX __ieee80211_wake_txqs: ENTRY
[ 179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
[ 179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
[ 179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
[ 179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
[ 179.585041] XXXX __ieee80211_wake_txqs: EXIT
[ 179.585047] XXXX drv_tx: TX
[ 179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[ 179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[ 179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[ 179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[ 179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[ 179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[ 179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[ 179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[ 179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
....
[ 214.307617] XXXX ieee80211_wake_queue: called


--> So the driver blocked TX for more than 30s, which is a good
explanation of what Thomas observes.
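
The length of such a stall can also be read from the log mechanically.
A rough sketch, assuming the "XXXX" lines printed by the ad-hoc debug
patches above (they are not mainline messages) and a hypothetical helper
name stall_durations:

```shell
# Hypothetical helper: read the debug log on stdin and print how long TX
# stayed stopped between ieee80211_stop_queue and ieee80211_wake_queue.
stall_durations() {
    awk '
        # Remember the most recent kernel timestamp seen (e.g. 179.585027).
        match($0, /[0-9]+\.[0-9]+/) { ts = substr($0, RSTART, RLENGTH) + 0 }
        # The first stop while not already stopped opens a stall.
        /ieee80211_stop_queue: called/ && !stopped { stop_ts = ts; stopped = 1 }
        # The next wake closes it: print start, end and duration.
        /ieee80211_wake_queue: called/ && stopped {
            printf "stopped at %.6f, woken at %.6f: %.3f s\n", stop_ts, ts, ts - stop_ts
            stopped = 0
        }
    '
}
```

Fed with the excerpt above, this should report about 34.7 s between the
stop at 179.585027 and the wake at 214.307617.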

But there is nothing mac80211 can do differently here. Whatever the
real cause of the issue is, it's nothing obvious that I can see.

Luckily I found a nearly identical card that uses the same driver.
Thomas' system:
Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo Hardened
12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 p5)
2.39.0) #2 SMP Fri Mar 3 16:59:02 CET 2023
ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'

My system, using the kernel config from Thomas with only minor
modifications (different filesystems and initramfs settings and enabled
mac80211 debug and developer options):
Linux version 6.2.2-gentoo ([email protected]) (gcc (Gentoo
12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2)
2.40.0) #2 SMP Tue Mar 7 18:18:47 CET 2023
ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware file 'rt2870.bin'
ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware detected - version: 0.36
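
To compare two adapters like this, the identification lines can be
filtered out of dmesg. A small sketch, assuming the message prefixes
shown in the excerpts above; the helper name rt2x00_ident is made up for
this example:

```shell
# Hypothetical helper: keep only the rt2x00 chipset, RF and firmware
# identification lines from kernel log text read on stdin.
rt2x00_ident() {
    grep -E 'rt2x00_set_(rt|rf):|rt2x00lib_request_firmware:'
}
```

Running `dmesg | rt2x00_ident` on both systems and diffing the results
makes differences such as the chipset revision (0201 vs. 0200 above)
easy to spot.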

But there is one big difference on my system: I can't reproduce the bug
so far. It's working as it should... (I have not applied the debug
patches myself so far.)

I'm now planning to look a bit more into the rt2800usb driver and
provide another debug patch for interesting looking code pieces in it.

@Thomas:
I've also uploaded the binary kernel I'm currently running here:
https://www.awhome.eu/s/5FjqMS73rtCtSBM

That kernel should also be able to boot and operate your system. Can you
try it and tell me whether it makes any difference?

I'm also planning to provide some more debug patches to figure out
which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
for resumption") makes the difference for you. Assuming my understanding
above is correct, the patch should not really fix/break anything for
you... With the findings above I would have expected your git bisect to
identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
callback to drivers") as the first broken commit...

Alexander

2023-03-07 22:31:56

by Thomas Mann

Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

Hi Alexander,

I can't boot the binary kernel here, as the initramfs is included in my
kernel. If you send me a patch, I can apply it and test it.

Thomas



2023-03-08 07:13:46

by Alexander Wetzel

Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 07.03.23 23:31, Thomas Mann wrote:
> Hi Alexander,

Since I suspect we'll exchange quite a few mails here:
Top-posting is frowned upon on the mailing lists in CC.
Details here: https://www.infradead.org/~dwmw2/email.html

I've moved your post to the correct position and replied there.

>
>> @Thomas:
>> I've also uploaded you my binary kernel I'm running at the moment
>> here: https://www.awhome.eu/s/5FjqMS73rtCtSBM
>>
>> That kernel should also be able to boot and operate your system. Can
>> you try that and tell me, if that makes any difference?

>
> i can't boot the binary kernel here, as the initramfs is included in
> my kernel, if you send me a patch, i can apply it and test it.

That was an unpatched kernel. The idea was to verify that it's not a
compiler issue. (You seem to be using a hardened Gentoo profile.)

Can you share your initrd, so I can include it? (Mail it to me directly,
upload it to the bug in bugzilla or send a link to some cloud storage.)





2023-03-08 07:52:24

by Felix Fietkau

Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 07.03.23 21:54, Alexander Wetzel wrote:
>>>
>>
>> I just uploaded a test patch to bugzilla.
>> Please have a look if that fixes the issue.
>>
>> If not I would be interested in the output of your iTXQ status.
>> Enable CONFIG_MAC80211_DEBUGFS and run this command when the connection
>> is bad and send/share/upload to bugzilla the resulting debug.out:
>>
>> k=1; while [ $k -lt 10 ]; do \
>> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
>> k=$(($k+1)); done >> debug.out
>
> Thomas and I continued with some debugging in
> https://bugzilla.kernel.org/show_bug.cgi?id=217119
>
> But the results so far are unexpected and we decided to continue the
> debugging with the round here. Hoping someone sees something I miss.
>
> A very summary where we are:
> I can't reproduce the bug with a very similar card and kernel config so
> far. Thomas card stops the iTXQs for intervalls >30s. Mine operates
> normally.
>
> A more useful but longer summary:
>
> Thomas updated to a 6.2 kernel and reported "connection drops and
> bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.) Asked for
> some more details he reported:
> "...slow bandwidth stuff works better, but the main problem/test case is
> to start a 8-16 mbit video stream, which sometimes runs for a few
> seconds and then stops or it doesn't start at all"
>
> He bisected the issue and identified my commit 4444bc2116ae ("wifi:
> mac80211: Proper mark iTXQs for resumption") as culprit.
>
> Checking the internal iTXQ status when the issue is ongoing shows, that
> TID zero is flagged as dirty and thus is not transmitting queued
> packets. Interesting line from
> /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
> tid ac backlog-bytes backlog-packets new-flows drops marks overlimit
> collisions tx-bytes tx-packets flags
> 0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)
>
> --> The "normal" iTXQ handling IEEE80211_AC_BE has queued packets and is
> flagged as DIRTY. There even is a potential race setting the DIRTY flag,
> but the fix for that is not helping.
>
> Thus Thomas applied two debug patches, to better understand why the
> DIRTY flag is not cleared.
>
> And looking at the output from those we see that the driver stops Tx by
> calling ieee80211_stop_queue(). When ieee80211_wake_queue() mac80211
> correctly resumes TX but is getting stopped by the driver after a single
> packet again. (The start of the relevant log is missing, so that may be
> initially more).
> I assume TX is still ok at that stage. But after some singe Tx
> operations the driver stops the queues again. Here the relevant part of
> the log:
> [ 179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
> [ 179.585022] XXXX drv_tx: TX
> [ 179.585027] XXXX ieee80211_stop_queue: called
> [ 179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [ 179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
> [ 179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
> [ 179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
> [ 179.585034] XXXX __ieee80211_wake_txqs: EXIT
> [ 179.585035] XXXX __ieee80211_wake_txqs: ENTRY
> [ 179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
> [ 179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
> [ 179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
> [ 179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
> [ 179.585041] XXXX __ieee80211_wake_txqs: EXIT
> [ 179.585047] XXXX drv_tx: TX
> [ 179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [ 179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [ 179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [ 179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [ 179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [ 179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [ 179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [ 179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [ 179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> ....
> [ 214.307617] XXXX ieee80211_wake_queue: called
>
>
> --> So the driver blocked TX for more than 30s. Which is a good
> explanation of what Thomas observes.
>
> But there is nothing mac80211 can do differently here. Whatever is the
> real reason for the issue, it's nothing obvious I see.
>
> Luckily I found a card using the same driver and nearly the same card:
> Thomas systems:
> Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo Hardened
> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 p5)
> 2.39.0) #2 SMP Fri Mar 3 16:59:02 CET 2023
> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
>
> My system, using the kernel config from Thomas with only minor
> modifications (different filesystems and initramfs settings and enabled
> mac80211 debug and developer options):
> Linux version 6.2.2-gentoo ([email protected]) (gcc (Gentoo
> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2)
> 2.40.0) #2 SMP Tue Mar 7 18:18:47 CET 2023
> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware file 'rt2870.bin'
> ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware detected - version: 0.36
>
> But there is one big difference on my system: I can't reproduce the bug
> so far. It's working as it should... (I did not apply the debug patches
> myself so far)
>
> I'm now planning to look a bit more into the rt2800usb driver and
> provide another debug patch for interesting looking code pieces in it.
>
> @Thomas:
> I've also uploaded you my binary kernel I'm running at the moment here:
> https://www.awhome.eu/s/5FjqMS73rtCtSBM
>
> That kernel should also be able to boot and operate your system. Can you
> try that and tell me, if that makes any difference?
>
> I'm also planning to provide some more debug patches, to figuring out
> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
> for resumption") fixes the issue for you. Assuming my understanding
> above is correct the patch should not really fix/break anything for
> you...With the findings above I would have expected your git bisec to
> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
> callback to drivers") as the first broken commit...
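
For anyone who wants to check their own debug.out against the aqm line quoted above, here is a minimal user-space sketch in Python that splits such a line into its columns and decodes the flags value. The bit positions are an assumption inferred from the sample itself (0xe printed as "RUN AMPDU NO-AMSDU DIRTY", with RUN shown when the lowest bit is clear); they should be verified against the kernel source in use. The helper names are hypothetical and not part of any kernel tooling.

```python
# Column names follow the header line of the debugfs aqm file quoted above.
COLUMNS = [
    "tid", "ac", "backlog_bytes", "backlog_packets", "new_flows",
    "drops", "marks", "overlimit", "collisions", "tx_bytes", "tx_packets",
]

# ASSUMPTION: bit positions inferred from "0xe(RUN AMPDU NO-AMSDU DIRTY)".
# Bit 0 is taken to be the stop bit (RUN is printed when it is clear).
FLAG_BITS = {1: "AMPDU", 2: "NO-AMSDU", 3: "DIRTY"}


def decode_flags(value):
    """Return the flag names for a numeric flags value."""
    names = ["STOP" if value & 1 else "RUN"]
    for bit, name in FLAG_BITS.items():
        if value & (1 << bit):
            names.append(name)
    return names


def parse_aqm_line(line):
    """Parse one per-TID data line of the aqm file into a dict."""
    fields = line.split()
    row = {name: int(text) for name, text in zip(COLUMNS, fields)}
    # The flags field looks like "0xe(RUN"; keep only the hex number.
    flags = int(fields[len(COLUMNS)].split("(")[0], 16)
    row["flags"] = flags
    row["flag_names"] = decode_flags(flags)
    return row


def dirty_with_backlog(row):
    """True if packets are queued while the TID is flagged DIRTY."""
    return row["backlog_packets"] > 0 and "DIRTY" in row["flag_names"]


sample = "0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)"
row = parse_aqm_line(sample)
print(row["tid"], row["backlog_packets"], row["flag_names"], dirty_with_backlog(row))
```

A single sample with DIRTY set and a backlog is not proof of a stall by itself; it becomes interesting when the same TID shows the same state across several consecutive samples.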
I can't point to any specific series of events where it would go wrong,
but I suspect that the problem might be the fact that you're doing tx
scheduling from within ieee80211_handle_wake_tx_queue. I don't see how
it's properly protected from potentially being called on different CPUs
concurrently.

Back when I was debugging some iTXQ issues in mt76, I also had problems
when tx scheduling could happen from multiple places. My solution was to
have a single worker thread that handles tx, which is scheduled from the
wake_tx_queue op.
Maybe you could do something similar in mac80211 for non-iTXQ drivers.
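
To make the single-worker idea concrete, here is a minimal user-space sketch in Python. It is not kernel code and not the mt76 implementation; the class and method names are hypothetical. It only illustrates the pattern: any number of callers may signal "wake", but only one thread ever dequeues and transmits.

```python
import queue
import threading


class SingleWorkerTx:
    """Hypothetical illustration: callers only signal, one thread transmits."""

    def __init__(self, send):
        self._send = send                # callback that "transmits" one frame
        self._pending = queue.Queue()    # thread-safe FIFO of queued frames
        self._wake = threading.Event()   # set by any caller, consumed by worker
        self._stop = False
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, frame):
        self._pending.put(frame)
        self.wake_tx_queue()

    def wake_tx_queue(self):
        # Safe to call from many threads at once: it never transmits itself.
        self._wake.set()

    def _run(self):
        while True:
            self._wake.wait()
            self._wake.clear()
            # Drain everything queued so far; only this thread calls send().
            while True:
                try:
                    frame = self._pending.get_nowait()
                except queue.Empty:
                    break
                self._send(frame)
            if self._stop:
                return

    def close(self):
        self._stop = True
        self._wake.set()
        self._worker.join()


sent = []
senders = set()


def send(frame):
    senders.add(threading.current_thread().name)
    sent.append(frame)


def producer(tx, base):
    for n in range(100):
        tx.enqueue((base, n))


tx = SingleWorkerTx(send)
threads = [threading.Thread(target=producer, args=(tx, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
tx.close()
print(len(sent), len(senders))
```

The design choice is that the wake path does nothing except set a flag, so concurrent callers cannot interleave inside the scheduling loop; the cost is one extra context switch before transmission starts.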

- Felix

2023-03-08 10:26:33

by Thomas Mann

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On Wed, 8 Mar 2023 08:13:32 +0100
Alexander Wetzel <[email protected]> wrote:

> On 07.03.23 23:31, Thomas Mann wrote:
> > Hi Alexander,
>
> Since I suspect we'll exchange quite some mails here:
> Top posting is being frowned on the mailing lists on copy.
> Details here: https://www.infradead.org/~dwmw2/email.html
>
> I've moved your post to the correct position and replied there.
>
> >
> >>>>
> >>>
> >>> I just uploaded a test patch to bugzilla.
> >>> Please have a look if that fixes the issue.
> >>>
> >>> If not I would be interested in the output of your iTXQ status.
> >>> Enable CONFIG_MAC80211_DEBUGFS and run this command when the
> >>> connection is bad and send/share/upload to bugzilla the resulting
> >>> debug.out:
> >>>
> >>> k=1; while [ $k -lt 10 ]; do \
> >>> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
> >>> k=$(($k+1)); done >> debug.out
> >>
> >> Thomas and I continued with some debugging in
> >> https://bugzilla.kernel.org/show_bug.cgi?id=217119
> >>
> >> But the results so far are unexpected and we decided to continue
> >> the debugging with the round here. Hoping someone sees something I
> >> miss.
> >>
> >> A very summary where we are:
> >> I can't reproduce the bug with a very similar card and kernel
> >> config so far. Thomas card stops the iTXQs for intervalls >30s.
> >> Mine operates normally.
> >>
> >> A more useful but longer summary:
> >>
> >> Thomas updated to a 6.2 kernel and reported "connection drops and
> >> bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.)
> >> Asked for some more details he reported:
> >> "...slow bandwidth stuff works better, but the main problem/test
> >> case is to start a 8-16 mbit video stream, which sometimes runs
> >> for a few seconds and then stops or it doesn't start at all"
> >>
> >> He bisected the issue and identified my commit 4444bc2116ae ("wifi:
> >> mac80211: Proper mark iTXQs for resumption") as culprit.
> >>
> >> Checking the internal iTXQ status when the issue is ongoing shows,
> >> that TID zero is flagged as dirty and thus is not transmitting
> >> queued packets. Interesting line from
> >> /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
> >> tid ac backlog-bytes backlog-packets new-flows drops marks
> >> overlimit collisions tx-bytes tx-packets flags
> >> 0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU
> >> DIRTY)
> >> --> The "normal" iTXQ handling IEEE80211_AC_BE has queued packets
> >> and is flagged as DIRTY. There even is a potential race setting
> >> the DIRTY flag, but the fix for that is not helping.
> >>
> >> Thus Thomas applied two debug patches, to better understand why the
> >> DIRTY flag is not cleared.
> >>
> >> And looking at the output from those we see that the driver stops
> >> Tx by calling ieee80211_stop_queue(). When ieee80211_wake_queue()
> >> mac80211 correctly resumes TX but is getting stopped by the driver
> >> after a single packet again. (The start of the relevant log is
> >> missing, so that may be initially more).
> >> I assume TX is still ok at that stage. But after some singe Tx
> >> operations the driver stops the queues again. Here the relevant
> >> part of the log:
> >> [ 179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
> >> [ 179.585022] XXXX drv_tx: TX
> >> [ 179.585027] XXXX ieee80211_stop_queue: called
> >> [ 179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [ 179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
> >> [ 179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
> >> [ 179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
> >> [ 179.585034] XXXX __ieee80211_wake_txqs: EXIT
> >> [ 179.585035] XXXX __ieee80211_wake_txqs: ENTRY
> >> [ 179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
> >> [ 179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
> >> [ 179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
> >> [ 179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
> >> [ 179.585041] XXXX __ieee80211_wake_txqs: EXIT
> >> [ 179.585047] XXXX drv_tx: TX
> >> [ 179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [ 179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [ 179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [ 179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [ 179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [ 179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [ 179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [ 179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [ 179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> ....
> >> [ 214.307617] XXXX ieee80211_wake_queue: called
> >>
> >>
> >> --> So the driver blocked TX for more than 30s. Which is a good
> >> explanation of what Thomas observes.
> >>
> >> But there is nothing mac80211 can do differently here. Whatever is
> >> the real reason for the issue, it's nothing obvious I see.
> >>
> >> Luckily I found a card using the same driver and nearly the same
> >> card:
> >> Thomas systems:
> >> Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo Hardened
> >> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 p5)
> >> 2.39.0) #2 SMP Fri Mar 3 16:59:02 CET 2023
> >> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
> >> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> >> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> >>
> >> My system, using the kernel config from Thomas with only minor
> >> modifications (different filesystems and initramfs settings and
> >> enabled mac80211 debug and developer options):
> >> Linux version 6.2.2-gentoo ([email protected]) (gcc (Gentoo
> >> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2)
> >> 2.40.0) #2 SMP Tue Mar 7 18:18:47 CET 2023
> >> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
> >> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> >> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> >> ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware file 'rt2870.bin'
> >> ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware detected - version: 0.36
> >>
> >> But there is one big difference on my system: I can't reproduce the
> >> bug so far. It's working as it should... (I did not apply the debug
> >> patches myself so far)
> >>
> >> I'm now planning to look a bit more into the rt2800usb driver and
> >> provide another debug patch for interesting looking code pieces in
> >> it.
> >>
> >> @Thomas:
> >> I've also uploaded you my binary kernel I'm running at the moment
> >> here: https://www.awhome.eu/s/5FjqMS73rtCtSBM
> >>
> >> That kernel should also be able to boot and operate your system.
> >> Can you try that and tell me, if that makes any difference?
>
> >
> > i can't boot the binary kernel here, as the initramfs is included
> > in my kernel, if you send me a patch, i can apply it and test it.
>
> That was an unpatched kernel. Idea was to verify that it's not a
> compiler issue. (You seem to be using a hardened Gentoo profile.)
>
> Can you share your initrd, so I can include it? (Mail it to me
> directly, upload it to bug in buguilla or send a link to some cloud
> storage.)
>
I can't share this config, as it's a production system, and I'm not
allowed to run arbitrary binary code on the system. As 6.1.x works
without a problem, I don't think it's a compiler problem. I will try to
get a non-hardened compiler and recompile the kernel.

>
>
> >>
> >> I'm also planning to provide some more debug patches, to figuring
> >> out which part of commit 4444bc2116ae ("wifi: mac80211: Proper
> >> mark iTXQs for resumption") fixes the issue for you. Assuming my
> >> understanding above is correct the patch should not really
> >> fix/break anything for you...With the findings above I would have
> >> expected your git bisec to identify commit a790cc3a4fad ("wifi:
> >> mac80211: add wake_tx_queue callback to drivers") as the first
> >> broken commit...
> >>
> >> Alexander
> >
>


2023-03-08 11:41:41

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 08.03.23 08:52, Felix Fietkau wrote:

>> I'm also planning to provide some more debug patches, to figuring out
>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>> for resumption") fixes the issue for you. Assuming my understanding
>> above is correct the patch should not really fix/break anything for
>> you...With the findings above I would have expected your git bisec to
>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>> callback to drivers") as the first broken commit...
> I can't point to any specific series of events where it would go wrong,
> but I suspect that the problem might be the fact that you're doing tx
> scheduling from within ieee80211_handle_wake_tx_queue. I don't see how
> it's properly protected from potentially being called on different CPUs
> concurrently.
>
> Back when I was debugging some iTXQ issues in mt76, I also had problems
> when tx scheduling could happen from multiple places. My solution was to
> have a single worker thread that handles tx, which is scheduled from the
> wake_tx_queue op.
> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>

I think it's already doing all of that:
ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
wake_tx_queue op. The drivers without native iTXQ support simply link it
to this handler.

Alexander



2023-03-08 11:57:40

by Felix Fietkau

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 08.03.23 12:41, Alexander Wetzel wrote:
> On 08.03.23 08:52, Felix Fietkau wrote:
>
>>> I'm also planning to provide some more debug patches, to figuring out
>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>> for resumption") fixes the issue for you. Assuming my understanding
>>> above is correct the patch should not really fix/break anything for
>>> you...With the findings above I would have expected your git bisec to
>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>> callback to drivers") as the first broken commit...
>> I can't point to any specific series of events where it would go wrong,
>> but I suspect that the problem might be the fact that you're doing tx
>> scheduling from within ieee80211_handle_wake_tx_queue. I don't see how
>> it's properly protected from potentially being called on different CPUs
>> concurrently.
>>
>> Back when I was debugging some iTXQ issues in mt76, I also had problems
>> when tx scheduling could happen from multiple places. My solution was to
>> have a single worker thread that handles tx, which is scheduled from the
>> wake_tx_queue op.
>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>>
>
> I think it's already doing all of that:
> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
> wake_tx_queue op. The drivers without native iTXQ support simply link it
> to this handler.
I know. The problem I see is that I can't find anything that guarantees
that .wake_tx_queue_op is not being called concurrently from multiple
different places. ieee80211_handle_wake_tx_queue is doing the scheduling
directly, instead of deferring it to a single workqueue/tasklet/thread,
and multiple concurrent calls to it could potentially cause issues.

- Felix

2023-03-08 12:10:23

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

>>>> @Thomas:
>>>> I've also uploaded you my binary kernel I'm running at the moment
>>>> here: https://www.awhome.eu/s/5FjqMS73rtCtSBM
>>>>
>>>> That kernel should also be able to boot and operate your system.
>>>> Can you try that and tell me, if that makes any difference?
>>
>> >
>> > i can't boot the binary kernel here, as the initramfs is included
>> > in my kernel, if you send me a patch, i can apply it and test it.
>>
>> That was an unpatched kernel. Idea was to verify that it's not a
>> compiler issue. (You seem to be using a hardened Gentoo profile.)
>>
>> Can you share your initrd, so I can include it? (Mail it to me
>> directly, upload it to bug in buguilla or send a link to some cloud
>> storage.)
>>
> I can't share this config, as it's a production system, and i'm not
> allowed to run abitrary binary code on the system. As 6.1.x works
> without a problem, i don't think it's a compiler problem. I will try to
> get a none hardened compiler and recompile the kernel.
>

I was suspecting something like that. I may try the same in reverse. But
so far that is quite some way down the list. There are more promising
ways to spend the debug time I have.

But one remark:
As far as TX is concerned, 6.1 and 6.2 kernels handle TX in drastically
different ways for many cards, including yours.

The patch you identified as the culprit came well after the move to the
new TX mode of operation and only fixes a comparatively minor issue.

Your setup seems to require both the move to iTXQ AND this minor fix to
trigger the problem.

>>
>>>>
>>>> I'm also planning to provide some more debug patches, to figuring
>>>> out which part of commit 4444bc2116ae ("wifi: mac80211: Proper
>>>> mark iTXQs for resumption") fixes the issue for you. Assuming my
>>>> understanding above is correct the patch should not really
>>>> fix/break anything for you...With the findings above I would have
>>>> expected your git bisec to identify commit a790cc3a4fad ("wifi:
>>>> mac80211: add wake_tx_queue callback to drivers") as the first
>>>> broken commit...
>>>>
>>>> Alexander
>>>
>>
>


2023-03-08 12:23:16

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 08.03.23 12:57, Felix Fietkau wrote:
> On 08.03.23 12:41, Alexander Wetzel wrote:
>> On 08.03.23 08:52, Felix Fietkau wrote:
>>>> I'm also planning to provide some more debug patches, to figuring out
>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>> above is correct the patch should not really fix/break anything for
>>>> you...With the findings above I would have expected your git bisec to
>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>> callback to drivers") as the first broken commit...
>>> I can't point to any specific series of events where it would go
>>> wrong, but I suspect that the problem might be the fact that you're
>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I
>>> don't see how it's properly protected from potentially being called
>>> on different CPUs concurrently.
>>> Back when I was debugging some iTXQ issues in mt76, I also had
>>> problems when tx scheduling could happen from multiple places. My
>>> solution was to have a single worker thread that handles tx, which is
>>> scheduled from the wake_tx_queue op.
>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>> I think it's already doing all of that:
>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>> wake_tx_queue op. The drivers without native iTXQ support simply link it
>> to this handler.
> I know. The problem I see is that I can't find anything that guarantees
> that .wake_tx_queue_op is not being called concurrently from multiple
> different places. ieee80211_handle_wake_tx_queue is doing the scheduling
> directly, instead of deferring it to a single workqueue/tasklet/thread,
> and multiple concurrent calls to it could potentially cause issues.

Alexander, Felix, many thx for looking into this.

This sounds more and more like something that might take a while to
fix, which makes it harder to resolve within the time frames that
Documentation/process/handling-regressions.rst outlines. So please allow
me to ask:

Is reverting the culprit (and reapplying it later once the real cause is
found and fixed) an option, or would that cause other regressions?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

2023-03-08 12:28:53

by Thomas Mann

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On Wed, 8 Mar 2023 11:26:50 +0100
Thomas Mann <[email protected]> wrote:

> On Wed, 8 Mar 2023 08:13:32 +0100
> Alexander Wetzel <[email protected]> wrote:
>
> > On 07.03.23 23:31, Thomas Mann wrote:
> > > Hi Alexander,
> >
> > Since I suspect we'll exchange quite some mails here:
> > Top posting is being frowned on the mailing lists on copy.
> > Details here: https://www.infradead.org/~dwmw2/email.html
> >
> > I've moved your post to the correct position and replied there.
> >
> > >
> > >>>>
> > >>>
> > >>> I just uploaded a test patch to bugzilla.
> > >>> Please have a look if that fixes the issue.
> > >>>
> > >>> If not I would be interested in the output of your iTXQ status.
> > >>> Enable CONFIG_MAC80211_DEBUGFS and run this command when the
> > >>> connection is bad and send/share/upload to bugzilla the
> > >>> resulting debug.out:
> > >>>
> > >>> k=1; while [ $k -lt 10 ]; do \
> > >>> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
> > >>> k=$(($k+1)); done >> debug.out
> > >>
> > >> Thomas and I continued with some debugging in
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=217119
> > >>
> > >> But the results so far are unexpected and we decided to continue
> > >> the debugging with the round here. Hoping someone sees something
> > >> I miss.
> > >>
> > >> A very summary where we are:
> > >> I can't reproduce the bug with a very similar card and kernel
> > >> config so far. Thomas card stops the iTXQs for intervalls >30s.
> > >> Mine operates normally.
> > >>
> > >> A more useful but longer summary:
> > >>
> > >> Thomas updated to a 6.2 kernel and reported "connection drops and
> > >> bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.)
> > >> Asked for some more details he reported:
> > >> "...slow bandwidth stuff works better, but the main problem/test
> > >> case is to start a 8-16 mbit video stream, which sometimes runs
> > >> for a few seconds and then stops or it doesn't start at all"
> > >>
> > >> He bisected the issue and identified my commit 4444bc2116ae
> > >> ("wifi: mac80211: Proper mark iTXQs for resumption") as culprit.
> > >>
> > >> Checking the internal iTXQ status when the issue is ongoing
> > >> shows, that TID zero is flagged as dirty and thus is not
> > >> transmitting queued packets. Interesting line from
> > >> /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
> > >> tid ac backlog-bytes backlog-packets new-flows drops marks
> > >> overlimit collisions tx-bytes tx-packets flags
> > >> 0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU
> > >> DIRTY)
> > >> --> The "normal" iTXQ handling IEEE80211_AC_BE has queued
> > >> packets and is flagged as DIRTY. There even is a potential race
> > >> setting the DIRTY flag, but the fix for that is not helping.
> > >>
> > >> Thus Thomas applied two debug patches, to better understand why
> > >> the DIRTY flag is not cleared.
> > >>
> > >> And looking at the output from those we see that the driver stops
> > >> Tx by calling ieee80211_stop_queue(). When ieee80211_wake_queue()
> > >> mac80211 correctly resumes TX but is getting stopped by the
> > >> driver after a single packet again. (The start of the relevant
> > >> log is missing, so that may be initially more).
> > >> I assume TX is still ok at that stage. But after some singe Tx
> > >> operations the driver stops the queues again. Here the relevant
> > >> part of the log:
> > >> [ 179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
> > >> [ 179.585022] XXXX drv_tx: TX
> > >> [ 179.585027] XXXX ieee80211_stop_queue: called
> > >> [ 179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [ 179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
> > >> [ 179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
> > >> [ 179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
> > >> [ 179.585034] XXXX __ieee80211_wake_txqs: EXIT
> > >> [ 179.585035] XXXX __ieee80211_wake_txqs: ENTRY
> > >> [ 179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
> > >> [ 179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
> > >> [ 179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
> > >> [ 179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
> > >> [ 179.585041] XXXX __ieee80211_wake_txqs: EXIT
> > >> [ 179.585047] XXXX drv_tx: TX
> > >> [ 179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [ 179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [ 179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [ 179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [ 179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [ 179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [ 179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [ 179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [ 179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> ....
> > >> [ 214.307617] XXXX ieee80211_wake_queue: called
> > >>
> > >>
> > >> --> So the driver blocked TX for more than 30s. Which is a good
> > >> explanation of what Thomas observes.
> > >>
> > >> But there is nothing mac80211 can do differently here. Whatever
> > >> is the real reason for the issue, it's nothing obvious I see.
> > >>
> > >> Luckily I found a card using the same driver and nearly the same
> > >> card:
> > >> Thomas systems:
> > >> Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo Hardened
> > >> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 p5)
> > >> 2.39.0) #2 SMP Fri Mar 3 16:59:02 CET 2023
> > >> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
> > >> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> > >> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> > >>
> > >> My system, using the kernel config from Thomas with only minor
> > >> modifications (different filesystems and initramfs settings and
> > >> enabled mac80211 debug and developer options):
> > >> Linux version 6.2.2-gentoo ([email protected]) (gcc (Gentoo
> > >> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2)
> > >> 2.40.0) #2 SMP Tue Mar 7 18:18:47 CET 2023
> > >> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
> > >> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> > >> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> > >> ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware file 'rt2870.bin'
> > >> ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware detected - version: 0.36
> > >>
> > >> But there is one big difference on my system: I can't reproduce
> > >> the bug so far. It's working as it should... (I did not apply
> > >> the debug patches myself so far)
> > >>
> > >> I'm now planning to look a bit more into the rt2800usb driver and
> > >> provide another debug patch for interesting looking code pieces
> > >> in it.
> > >>
> > >> @Thomas:
> > >> I've also uploaded you my binary kernel I'm running at the moment
> > >> here: https://www.awhome.eu/s/5FjqMS73rtCtSBM
> > >>
> > >> That kernel should also be able to boot and operate your system.
> > >> Can you try that and tell me, if that makes any difference?
> >
> > >
> > > i can't boot the binary kernel here, as the initramfs is included
> > > in my kernel, if you send me a patch, i can apply it and test
> > > it.
> >
> > That was an unpatched kernel. Idea was to verify that it's not a
> > compiler issue. (You seem to be using a hardened Gentoo profile.)
> >
> > Can you share your initrd, so I can include it? (Mail it to me
> > directly, upload it to bug in buguilla or send a link to some cloud
> > storage.)
> >
> I can't share this config, as it's a production system, and i'm not
> allowed to run abitrary binary code on the system. As 6.1.x works
> without a problem, i don't think it's a compiler problem. I will try
> to get a none hardened compiler and recompile the kernel.

I have now compiled the kernel with a non-hardened toolchain/compiler
(gcc (Gentoo 12.2.1_p20230121-r1 p10) 12.2.1 20230121) and the kernel
still has the same bug/behaviour.

>
> >
> >
> > >>
> > >> I'm also planning to provide some more debug patches, to figuring
> > >> out which part of commit 4444bc2116ae ("wifi: mac80211: Proper
> > >> mark iTXQs for resumption") fixes the issue for you. Assuming my
> > >> understanding above is correct the patch should not really
> > >> fix/break anything for you...With the findings above I would have
> > >> expected your git bisec to identify commit a790cc3a4fad ("wifi:
> > >> mac80211: add wake_tx_queue callback to drivers") as the first
> > >> broken commit...
> > >>
> > >> Alexander
> > >
> >
>


2023-03-08 16:50:17

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 08.03.23 13:21, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 08.03.23 12:57, Felix Fietkau wrote:
>> On 08.03.23 12:41, Alexander Wetzel wrote:
>>> On 08.03.23 08:52, Felix Fietkau wrote:
>>>>> I'm also planning to provide some more debug patches, to figuring out
>>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>>> above is correct the patch should not really fix/break anything for
>>>>> you...With the findings above I would have expected your git bisec to
>>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>>> callback to drivers") as the first broken commit...
>>>> I can't point to any specific series of events where it would go
>>>> wrong, but I suspect that the problem might be the fact that you're
>>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I
>>>> don't see how it's properly protected from potentially being called
>>>> on different CPUs concurrently.
>>>> Back when I was debugging some iTXQ issues in mt76, I also had
>>>> problems when tx scheduling could happen from multiple places. My
>>>> solution was to have a single worker thread that handles tx, which is
>>>> scheduled from the wake_tx_queue op.
>>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>>> I think it's already doing all of that:
>>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>>> wake_tx_queue op. The drivers without native iTXQ support simply link it
>>> to this handler.
>> I know. The problem I see is that I can't find anything that guarantees
>> that .wake_tx_queue_op is not being called concurrently from multiple
>> different places. ieee80211_handle_wake_tx_queue is doing the scheduling
>> directly, instead of deferring it to a single workqueue/tasklet/thread,
>> and multiple concurrent calls to it could potentially cause issues.
>
> Alexander, Felix, many thx for looking into this.
>
> This more and more sounds like something that might take a while to get
> fixed, which makes it harder to get this fixed within those time-frames
> Documentation/process/handling-regressions.rst outlines. So please allow
> me to ask:
>
> Is reverting the culprit (and reapplying it later once the real cause is
> found and fixed) an option, or would that cause other regressions?

This patch turned out to fix a (much worse) pre-release regression. See e.g.
https://lore.kernel.org/linux-wireless/[email protected]/T/#t

Fixing both regressions with reverts would force us to revert more
commits that other patches depend on...

>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.


2023-03-09 08:00:58

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 08.03.23 17:50, Alexander Wetzel wrote:
> On 08.03.23 13:21, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 08.03.23 12:57, Felix Fietkau wrote:
>>> On 08.03.23 12:41, Alexander Wetzel wrote:
>>>> On 08.03.23 08:52, Felix Fietkau wrote:
>>>>>> I'm also planning to provide some more debug patches, to figuring out
>>>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>>>> above is correct the patch should not really fix/break anything for
>>>>>> you...With the findings above I would have expected your git bisec to
>>>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>>>> callback to drivers") as the first broken commit...
>>>>> I can't point to any specific series of events where it would go
>>>>> wrong, but I suspect that the problem might be the fact that you're
>>>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I
>>>>> don't see how it's properly protected from potentially being called
>>>>> on different CPUs concurrently.
>>>>> Back when I was debugging some iTXQ issues in mt76, I also had
>>>>> problems when tx scheduling could happen from multiple places. My
>>>>> solution was to have a single worker thread that handles tx, which is
>>>>> scheduled from the wake_tx_queue op.
>>>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>>>> I think it's already doing all of that:
>>>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>>>> wake_tx_queue op. The drivers without native iTXQ support simply
>>>> link it
>>>> to this handler.
>>> I know. The problem I see is that I can't find anything that guarantees
>>> that .wake_tx_queue_op is not being called concurrently from multiple
>>> different places. ieee80211_handle_wake_tx_queue is doing the scheduling
>>> directly, instead of deferring it to a single workqueue/tasklet/thread,
>>> and multiple concurrent calls to it could potentially cause issues.
>>
>> Alexander, Felix, many thx for looking into this.
>>
>> This more and more sounds like something that might take a while to get
>> fixed, which makes it harder to get this fixed within those time-frames
>> Documentation/process/handling-regressions.rst outlines. So please allow
>> me to ask:
>>
>> Is reverting the culprit (and reapplying it later once the real cause is
>> found and fixed) an option, or would that cause other regressions?
>
> This patch turned out to fix a (much worse) pre-release regression. See
> e.g.
> https://lore.kernel.org/linux-wireless/[email protected]/T/#t

Uggh, thx for the update. That's unfortunate, but that's how it is
sometimes. I just asked because the culprit didn't have a Reported-by:
tag together with a Link: to the backstory, so it looked like it might
be fine to revert. But then it's not an option.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

2023-03-09 17:05:24

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 07.03.23 21:54, Alexander Wetzel wrote:
>>>
>>
>> I just uploaded a test patch to bugzilla.
>> Please have a look if that fixes the issue.
>>
>> If not I would be interested in the output of your iTXQ status.
>> Enable CONFIG_MAC80211_DEBUGFS and run this command when the
>> connection is bad and send/share/upload to bugzilla the resulting
>> debug.out:
>>
>> k=1; while [ $k -lt 10 ]; do \
>> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
>> k=$(($k+1)); done >> debug.out
>
> Thomas and I continued with some debugging in
> https://bugzilla.kernel.org/show_bug.cgi?id=217119
>
> But the results so far are unexpected and we decided to continue the
> debugging with the round here. Hoping someone sees something I miss.
>
> A very summary where we are:
> I can't reproduce the bug with a very similar card and kernel config so
> far. Thomas card stops the iTXQs for intervalls >30s. Mine operates
> normally.
>
> A more useful but longer summary:
>
> Thomas updated to a 6.2 kernel and reported "connection drops and
> bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.) Asked for
> some more details he reported:
> "...slow bandwidth stuff works better, but the main problem/test case is
> to start a 8-16 mbit video stream, which sometimes runs for a few
> seconds and then stops or it doesn't start at all"
>
> He bisected the issue and identified my commit 4444bc2116ae ("wifi:
> mac80211: Proper mark iTXQs for resumption") as culprit.
>
> Checking the internal iTXQ status when the issue is ongoing shows, that
> TID zero is flagged as dirty and thus is not transmitting queued
> packets. Interesting line from
> /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
> tid ac backlog-bytes backlog-packets new-flows drops marks overlimit
> collisions tx-bytes tx-packets flags
> 0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)
>
> --> The "normal" iTXQ handling IEEE80211_AC_BE has queued packets and is
> flagged as DIRTY. There even is a potential race setting the DIRTY flag,
> but the fix for that is not helping.
>
> Thus Thomas applied two debug patches, to better understand why the
> DIRTY flag is not cleared.
>
> And looking at the output from those we see that the driver stops Tx by
> calling ieee80211_stop_queue(). When ieee80211_wake_queue() mac80211
> correctly resumes TX but is getting stopped by the driver after a single
> packet again. (The start of the relevant log is missing, so that may be
> initially more).
> I assume TX is still ok at that stage. But after some singe Tx
> operations the driver stops the queues again. Here the relevant part of
> the log:
> [  179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
> [  179.585022] XXXX drv_tx: TX
> [  179.585027] XXXX ieee80211_stop_queue: called
> [  179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
> [  179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
> [  179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
> [  179.585034] XXXX __ieee80211_wake_txqs: EXIT
> [  179.585035] XXXX __ieee80211_wake_txqs: ENTRY
> [  179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
> [  179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
> [  179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
> [  179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
> [  179.585041] XXXX __ieee80211_wake_txqs: EXIT
> [  179.585047] XXXX drv_tx: TX
> [  179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> ....
> [  214.307617] XXXX ieee80211_wake_queue: called
>
>
> --> So the driver blocked TX for more than 30s. Which is a good
> explanation of what Thomas observes.
>
> But there is nothing mac80211 can do differently here. Whatever is the
> real reason for the issue, it's nothing obvious I see.

My best guess so far is a driver bug/issue that is now exposed by the
changed traffic pattern from mac80211. While digging into the rt2800usb
driver I found a watchdog introduced here:
https://lore.kernel.org/[email protected]

From the mac80211 debugging it looks like it may be just that: a random
hang of the driver/card.

What is certain is that rt2800usb tells mac80211 to stop TX and then
needs ages (>30s in the one known sample) to unblock the queue. And this
watchdog is disabled by default.

So now I'm wondering whether the changed traffic pattern due to the
mac80211 patch is just triggering the random hangs...

I've also uploaded more test patches to bugzilla.

@Thomas
Can you also try with this watchdog enabled? It must be enabled for the
rt2800lib module. Since you have the driver compiled in, the following
boot parameter should enable it:
rt2800lib.watchdog=1

@Stanislaw
Can you maybe also have a look at the issue and tell us how it compares
to the known random hangs you introduced the watchdog for?

>
> Luckily I found a card using the same driver and nearly the same card:
> Thomas' system: Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo
> Hardened 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39
> p5) 2.39.0) #2 SMP Fri Mar  3 16:59:02 CET 2023
> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
>
> My system, using the kernel config from Thomas with only minor
> modifications (different filesystems and initramfs settings and enabled
> mac80211 debug and developer options):
> Linux version 6.2.2-gentoo ([email protected]) (gcc (Gentoo
> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2)
> 2.40.0) #2 SMP Tue Mar  7 18:18:47 CET 2023
> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware file
> 'rt2870.bin'
> ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware detected -
> version: 0.36
>
> But there is one big difference on my system: I can't reproduce the bug
> so far. It's working as it should... (I did not apply the debug patches
> myself so far)
>
> I'm now planning to look a bit more into the rt2800usb driver and
> provide another debug patch for interesting looking code pieces in it.
>
> @Thomas:
> I've also uploaded you my binary kernel I'm running at the moment here:
> https://www.awhome.eu/s/5FjqMS73rtCtSBM
>
> That kernel should also be able to boot and operate your system. Can you
> try that and tell me, if that makes any difference?
>
> I'm also planning to provide some more debug patches, to figuring out
> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
> for resumption") fixes the issue for you. Assuming my understanding
> above is correct the patch should not really fix/break anything for
> you...With the findings above I would have expected your git bisec to
> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
> callback to drivers") as the first broken commit...
>
> Alexander


2023-03-09 17:29:14

by Thomas Mann

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On Thu, 9 Mar 2023 18:00:04 +0100
Alexander Wetzel <[email protected]> wrote:

> On 07.03.23 21:54, Alexander Wetzel wrote:
> >>>
> >>
> >> I just uploaded a test patch to bugzilla.
> >> Please have a look if that fixes the issue.
> >>
> >> If not I would be interested in the output of your iTXQ status.
> >> Enable CONFIG_MAC80211_DEBUGFS and run this command when the
> >> connection is bad and send/share/upload to bugzilla the resulting
> >> debug.out:
> >>
> >> k=1; while [ $k -lt 10 ]; do \
> >> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
> >> k=$(($k+1)); done >> debug.out
> >
> > Thomas and I continued with some debugging in
> > https://bugzilla.kernel.org/show_bug.cgi?id=217119
> >
> > But the results so far are unexpected and we decided to continue
> > the debugging with the round here. Hoping someone sees something I
> > miss.
> >
> > A very summary where we are:
> > I can't reproduce the bug with a very similar card and kernel
> > config so far. Thomas card stops the iTXQs for intervalls >30s.
> > Mine operates normally.
> >
> > A more useful but longer summary:
> >
> > Thomas updated to a 6.2 kernel and reported "connection drops and
> > bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.)
> > Asked for some more details he reported:
> > "...slow bandwidth stuff works better, but the main problem/test
> > case is to start a 8-16 mbit video stream, which sometimes runs for
> > a few seconds and then stops or it doesn't start at all"
> >
> > He bisected the issue and identified my commit 4444bc2116ae ("wifi:
> > mac80211: Proper mark iTXQs for resumption") as culprit.
> >
> > Checking the internal iTXQ status when the issue is ongoing shows,
> > that TID zero is flagged as dirty and thus is not transmitting
> > queued packets. Interesting line from
> > /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
> > tid ac backlog-bytes backlog-packets new-flows drops marks
> > overlimit collisions tx-bytes tx-packets flags
> > 0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU
> > DIRTY)
> > --> The "normal" iTXQ handling IEEE80211_AC_BE has queued packets
> > and is flagged as DIRTY. There even is a potential race setting the
> > DIRTY flag, but the fix for that is not helping.
> >
> > Thus Thomas applied two debug patches, to better understand why the
> > DIRTY flag is not cleared.
> >
> > And looking at the output from those we see that the driver stops
> > Tx by calling ieee80211_stop_queue(). When ieee80211_wake_queue()
> > mac80211 correctly resumes TX but is getting stopped by the driver
> > after a single packet again. (The start of the relevant log is
> > missing, so that may be initially more).
> > I assume TX is still ok at that stage. But after some singe Tx
> > operations the driver stops the queues again. Here the relevant
> > part of the log:
> > [  179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
> > [  179.585022] XXXX drv_tx: TX
> > [  179.585027] XXXX ieee80211_stop_queue: called
> > [  179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
> > [  179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
> > [  179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
> > [  179.585034] XXXX __ieee80211_wake_txqs: EXIT
> > [  179.585035] XXXX __ieee80211_wake_txqs: ENTRY
> > [  179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
> > [  179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
> > [  179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
> > [  179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
> > [  179.585041] XXXX __ieee80211_wake_txqs: EXIT
> > [  179.585047] XXXX drv_tx: TX
> > [  179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > ....
> > [  214.307617] XXXX ieee80211_wake_queue: called
> >
> >
> > --> So the driver blocked TX for more than 30s. Which is a good
> > explanation of what Thomas observes.
> >
> > But there is nothing mac80211 can do differently here. Whatever is
> > the real reason for the issue, it's nothing obvious I see.
>
> Best shot I have so far is a driver bug/issue now exposed by the
> changed traffic pattern from mac80211. And while digging into the
> rt2800usb driver I found a watchdog introduced here:
> https://lore.kernel.org/[email protected]
>
> From mac80211 debugging it looks like it may just be that: A random
> hang of the driver/card.
>
> For sure rt2800usb tells mac80211 to stop TXing and needs ages (>30s
> in known sample) to unblock the queue. And this watchdog is disabled
> by default.
>
> Now I'm clearly wondering, if the changed traffic pattern due to the
> mac80211 patch is just triggering the random hangs...
>
> I've also uploaded more test patches to bugzilla.
>
> @Thomas
> Can you also try with this watchdog enabled? It must be enabled for
> rt2800lib. Since you have compiled in the driver the following boot
> parameter should enable it:
> rt2800lib.watchdog=1

I responded on the bug tracker. Enabling the watchdog doesn't solve the
problem.

>
> @Stanislaw
> Can you maybe also have a look at the issue and how that looks
> compared to the known random hangs you introduced the watchdog for?
>
> >
> > Luckily I found a card using the same driver and nearly the same
> > card:
> > Thomas' system: Linux version 6.2.2-gentoo (root@foo) (gcc
> > (Gentoo Hardened 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld
> > (Gentoo 2.39 p5) 2.39.0) #2 SMP Fri Mar  3 16:59:02 CET 2023
> > ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
> > ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> > ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> >
> > My system, using the kernel config from Thomas with only minor
> > modifications (different filesystems and initramfs settings and
> > enabled mac80211 debug and developer options):
> > Linux version 6.2.2-gentoo ([email protected]) (gcc (Gentoo
> > 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2)
> > 2.40.0) #2 SMP Tue Mar  7 18:18:47 CET 2023
> > ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
> > ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> > ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> > ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware
> > file 'rt2870.bin'
> > ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware
> > detected - version: 0.36
> >
> > But there is one big difference on my system: I can't reproduce the
> > bug so far. It's working as it should... (I did not apply the debug
> > patches myself so far)
> >
> > I'm now planning to look a bit more into the rt2800usb driver and
> > provide another debug patch for interesting looking code pieces in
> > it.
> >
> > @Thomas:
> > I've also uploaded you my binary kernel I'm running at the moment
> > here: https://www.awhome.eu/s/5FjqMS73rtCtSBM
> >
> > That kernel should also be able to boot and operate your system.
> > Can you try that and tell me, if that makes any difference?
> >
> > I'm also planning to provide some more debug patches, to figuring
> > out which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark
> > iTXQs for resumption") fixes the issue for you. Assuming my
> > understanding above is correct the patch should not really
> > fix/break anything for you...With the findings above I would have
> > expected your git bisec to identify commit a790cc3a4fad ("wifi:
> > mac80211: add wake_tx_queue callback to drivers") as the first
> > broken commit...
> >
> > Alexander
>


2023-03-09 22:13:34

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 08.03.23 12:57, Felix Fietkau wrote:
> On 08.03.23 12:41, Alexander Wetzel wrote:
>> On 08.03.23 08:52, Felix Fietkau wrote:
>>
>>>> I'm also planning to provide some more debug patches, to figuring out
>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>> above is correct the patch should not really fix/break anything for
>>>> you...With the findings above I would have expected your git bisec to
>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>> callback to drivers") as the first broken commit...
>>> I can't point to any specific series of events where it would go
>>> wrong, but I suspect that the problem might be the fact that you're
>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I
>>> don't see how it's properly protected from potentially being called
>>> on different CPUs concurrently.
>>>
>>> Back when I was debugging some iTXQ issues in mt76, I also had
>>> problems when tx scheduling could happen from multiple places. My
>>> solution was to have a single worker thread that handles tx, which is
>>> scheduled from the wake_tx_queue op.
>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>>>
>>
>> I think it's already doing all of that:
>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>> wake_tx_queue op. The drivers without native iTXQ support simply link it
>> to this handler.
> I know. The problem I see is that I can't find anything that guarantees
> that .wake_tx_queue_op is not being called concurrently from multiple
> different places. ieee80211_handle_wake_tx_queue is doing the scheduling
> directly, instead of deferring it to a single workqueue/tasklet/thread,
> and multiple concurrent calls to it could potentially cause issues.

Good hint, thanks.
According to the latest debug log, exactly that seems to be happening:

ieee80211_wake_queue() is called by the driver and the wake_txqs_tasklet
tasklet is started. But while the tasklet is executing
drv_wake_tx_queue(), userspace queues a new skb and also calls into
drv_wake_tx_queue(), so the two calls overlap...

I'm not sure yet how that could cause the problem. But it breaks the
assumption that calls to drv_wake_tx_queue() do not overlap. And TX fails
directly after such an overlapping TX...

I'll probably just serialize the calls and then we can verify whether
that helps...

Alexander

2023-03-11 21:26:47

by Alexander Wetzel

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 09.03.23 23:13, Alexander Wetzel wrote:
> On 08.03.23 12:57, Felix Fietkau wrote:
>> On 08.03.23 12:41, Alexander Wetzel wrote:
>>> On 08.03.23 08:52, Felix Fietkau wrote:
>>>
>>>>> I'm also planning to provide some more debug patches, to figuring out
>>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>>> above is correct the patch should not really fix/break anything for
>>>>> you...With the findings above I would have expected your git bisec to
>>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>>> callback to drivers") as the first broken commit...
>>>> I can't point to any specific series of events where it would go
>>>> wrong, but I suspect that the problem might be the fact that you're
>>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I
>>>> don't see how it's properly protected from potentially being called
>>>> on different CPUs concurrently.
>>>>
>>>> Back when I was debugging some iTXQ issues in mt76, I also had
>>>> problems when tx scheduling could happen from multiple places. My
>>>> solution was to have a single worker thread that handles tx, which
>>>> is scheduled from the wake_tx_queue op.
>>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>>>>
>>>
>>> I think it's already doing all of that:
>>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>>> wake_tx_queue op. The drivers without native iTXQ support simply link it
>>> to this handler.
>> I know. The problem I see is that I can't find anything that
>> guarantees that .wake_tx_queue_op is not being called concurrently
>> from multiple different places. ieee80211_handle_wake_tx_queue is
>> doing the scheduling directly, instead of deferring it to a single
>> workqueue/tasklet/thread, and multiple concurrent calls to it could
>> potentially cause issues.
>
> Good hint, thanks.
> According to the latest debug log exactly that seems to be happening:
>
> ieee80211_wake_queue() is called by the driver and wake_txqs_tasklet
> tasklet is started. But during execution of the drv_wake_tx_queue() from
> the tasklet userspace queues a new skb and also calls into
> drv_wake_tx_queue(), which is then run overlapping...
>
> Not sure yet how that could cause the problem. But this breaks the
> assumption that drv_wake_tx_queue() are not overlapping. And TX fails
> directly after such an overlapping TX...
>
> I'll probably just serialize the calls and then we verify if that helps...

Serialization helps. A (crude and in multiple ways incorrect) patch
preventing two drv_wake_tx_queue() calls from running for the same AC
fixed the issue for Thomas:
https://bugzilla.kernel.org/show_bug.cgi?id=217119#c20

So it looks like we'll soon have a fix for the issue.

The driver often wakes the queue for IEEE80211_AC_BE for only a single
skb and then stops it again.
That short run time is insufficient for wake_txqs_tasklet to properly
wake all queues itself, and from time to time a new TX operation squeezes
in after IEEE80211_AC_BE has been unblocked but before
drv_wake_tx_queue is called from the wake_txqs_tasklet. When this happens
drv_wake_tx_queue is called twice: once from the tasklet and once from
userspace.

ieee80211_handle_wake_tx_queue is using ieee80211_txq_schedule_start,
which has this documented requirement:
"The driver must not call multiple TXQ scheduling rounds concurrently."
Now I don't think that is causing the reported regression. Nevertheless
we should prevent concurrent calls of ieee80211_handle_wake_tx_queue for
that reason alone.
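
As a rough illustration of what "preventing concurrent calls" can mean,
here is a user-space C sketch in which only one scheduling round runs at
a time and the running caller repeats the round if another wake request
arrived meanwhile. This is not the patch from bugzilla and not mac80211
code: struct ac_sched, wake_serialized() and demo_nested_wake() are
hypothetical names, and the overlapping caller is simulated by a nested
call made from inside the round.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one access category (AC); not a mac80211 structure. */
struct ac_sched {
	atomic_flag running;  /* set while a scheduling round is in progress */
	atomic_bool rerun;    /* a wake request arrived and is not handled yet */
	int in_flight;        /* rounds currently executing (must stay <= 1) */
	int rounds;           /* rounds executed so far */
	void (*round)(struct ac_sched *s);  /* body of one scheduling round */
};

/* Wake entry point: may be called from several contexts at once. */
static void wake_serialized(struct ac_sched *s)
{
	/* Record the request first so a running round cannot miss it. */
	atomic_store(&s->rerun, true);

	/* Somebody else is scheduling: they will see rerun and go again. */
	if (atomic_flag_test_and_set(&s->running))
		return;

	do {
		while (atomic_exchange(&s->rerun, false)) {
			s->in_flight++;
			assert(s->in_flight == 1);  /* never two rounds at once */
			s->rounds++;
			s->round(s);
			s->in_flight--;
		}
		atomic_flag_clear(&s->running);
		/* Close the window between the last check and the unlock. */
	} while (atomic_load(&s->rerun) &&
		 !atomic_flag_test_and_set(&s->running));
}

/* Demo round: during the first round a second wake "squeezes in". */
static void demo_round(struct ac_sched *s)
{
	if (s->rounds == 1)
		wake_serialized(s);  /* simulated overlapping caller */
}

/* Returns the number of rounds run for one wake plus one overlapping wake. */
static int demo_nested_wake(void)
{
	struct ac_sched s = { .running = ATOMIC_FLAG_INIT, .round = demo_round };

	wake_serialized(&s);
	return s.rounds;
}
```

In this model the overlapping wake returns immediately and its work is
done by the caller that already holds the flag, so the rounds run back
to back instead of overlapping. The price is that the second caller
returns before its request has been handled.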

The real reason for the hangs is probably in the rt2800usb driver or
hardware. I don't see anything in the driver code, so probably the HW
itself has a problem with the two near-concurrent TX operations.

The real culprit of the regression should be commit a790cc3a4fad ("wifi:
mac80211: add wake_tx_queue callback to drivers"), which switched
rt2800usb over to iTXQs. But without the fix from commit 4444bc2116ae
("wifi: mac80211: Proper mark iTXQs for resumption") mac80211 failed to
schedule the required run of the wake_txqs_tasklet. Thus instead of
two concurrent drv_wake_tx_queue calls we only got one and the driver
continued to work.

I asked Thomas on bugzilla to test the "best" solution I came up with.

There seem to be multiple ways to fix this. But I can't find a simple,
low-risk and complete fix. So I compromised...

Once Thomas confirms the fix we can discuss it on linux-wireless.


Alexander


2023-03-12 08:58:44

by Felix Fietkau

[permalink] [raw]
Subject: Re: [Regression] rt2800usb - Wifi performance issues and connection drops

On 11.03.23 22:26, Alexander Wetzel wrote:
> Serialization helps. A (crude and in multiple ways incorrect) patch
> preventing two drv_wake_tx_queue() running for the same ac fixed the
> issue for Thomas:
> https://bugzilla.kernel.org/show_bug.cgi?id=217119#c20
>
> So it looks like we'll now have soon a fix for the issue.
>
> The driver wakes the queue for IEEE80211_AC_BE often for only a single
> skb and then stops it again.
> The short run time is insufficient for wake_txqs_tasklet to proper wake
> all queues itself and from time to time a new TX operation squeezes in
> after IEEE80211_AC_BE has been unblocked but prior of drv_wake_tx_queue
> being called from the wake_txqs_tasklet. When this happens
> drv_wake_tx_queue is called two times: Once from the tasklet, once from
> the userspace.
>
> ieee80211_handle_wake_tx_queue is using ieee80211_txq_schedule_start,
> which has this documented requirement:
> "The driver must not call multiple TXQ scheduling rounds concurrently."
> Now I don't think that is causing the reported regression. Nevertheless
> we should prevent concurrent calls of ieee80211_handle_wake_tx_queue for
> that reason alone.
>
> The real reason of the hangs is probably in the rt2800usb driver or
> hardware. I don't see anything in the driver code, so probably the HW
> itself has a problem with the two near-concurrent TX operations.
>
> The real culprit of the regression should be commit a790cc3a4fad ("wifi:
> mac80211: add wake_tx_queue callback to drivers"), which switched
> rt2800usb over to iTXQs. But without the fix from commit 4444bc2116ae
> ("wifi: mac80211: Proper mark iTXQs for resumption") mac80211 omitted to
> schedule the required run of the wake_txqs_tasklet. Thus thus instead of
> two concurrent drv_wake_tx_queue we only got one and the driver
> continued to work.
>
> I asked Thomas on bugzilla to test the "best" solution I came up with.
>
> There seems to be multiple ways. But I can't find a simple, low risk and
> complete fix. So I compromised...
>
> When Thomas can confirm the fix we can soon discuss the fix on
> linux-wireless.

I would recommend the following approach for properly fixing this issue:

On init if the .wake_tx_queue op is set to
ieee80211_handle_wake_tx_queue, create a single kthread that iterates
over all hw queues and schedules each one of them like
ieee80211_handle_wake_tx_queue does now.
Change ieee80211_handle_wake_tx_queue to simply schedule the kthread
without doing anything else.
This is how mt76 handles tx scheduling in the driver, and it works quite
well.
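
A user-space model of this approach might look like the sketch below. It
uses pthreads instead of a kthread, and struct tx_worker,
wake_tx_queue(), schedule_hw_queue(), tx_worker_start() and
tx_worker_stop() are hypothetical names chosen for illustration, not
mac80211 or mt76 APIs. The number of hw queues is an assumption as well.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define NUM_HW_QUEUES 4  /* assumption: one hw queue per access category */

/* Hypothetical TX worker state; nothing here exists in mac80211. */
struct tx_worker {
	pthread_t thread;
	pthread_mutex_t lock;
	pthread_cond_t kick;
	bool pending;  /* a wake request arrived since the last pass */
	bool stop;     /* ask the worker to exit once it is idle */
	int passes;    /* completed passes over all hw queues */
	int scheduled[NUM_HW_QUEUES];
};

/* Placeholder for one scheduling round; only the worker may call it. */
static void schedule_hw_queue(struct tx_worker *w, int q)
{
	assert(pthread_equal(pthread_self(), w->thread));
	w->scheduled[q]++;
}

/* The single thread that does all TX scheduling. */
static void *tx_worker_fn(void *arg)
{
	struct tx_worker *w = arg;

	pthread_mutex_lock(&w->lock);
	for (;;) {
		while (!w->pending && !w->stop)
			pthread_cond_wait(&w->kick, &w->lock);
		if (!w->pending)
			break;  /* stop requested and nothing left to do */
		w->pending = false;
		pthread_mutex_unlock(&w->lock);

		/* Iterate over all hw queues without holding the lock. */
		for (int q = 0; q < NUM_HW_QUEUES; q++)
			schedule_hw_queue(w, q);

		pthread_mutex_lock(&w->lock);
		w->passes++;
	}
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

/* The wake op: callable from any context, it only kicks the worker. */
static void wake_tx_queue(struct tx_worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->pending = true;
	pthread_cond_signal(&w->kick);
	pthread_mutex_unlock(&w->lock);
}

/* Returns 0 on success, like pthread_create(). */
static int tx_worker_start(struct tx_worker *w)
{
	memset(w, 0, sizeof(*w));
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->kick, NULL);
	return pthread_create(&w->thread, NULL, tx_worker_fn, w);
}

/* Lets the worker finish any pending pass, then joins it. */
static void tx_worker_stop(struct tx_worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->stop = true;
	pthread_cond_signal(&w->kick);
	pthread_mutex_unlock(&w->lock);
	pthread_join(w->thread, NULL);
}
```

Several wake_tx_queue() calls that arrive while the worker is busy are
coalesced into a single later pass, so the number of passes can be lower
than the number of wake calls. Because scheduling happens in one thread
only, two rounds can never overlap in this model.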

- Felix