2018-08-08 12:59:38

by Wen Gong

Subject: [PATCH v2 0/2] Change sk_pacing_shift in ieee80211_hw for best tx throughput

Add a field for ath10k to adjust sk_pacing_shift. mac80211 sets the
default value to 8, and ath10k will change it to 6. mac80211 will then
use the changed value of 6 as sk_pacing_shift, since test results show
that 6 is the best value for tx throughput.

Wen Gong (2):
mac80211: Change sk_pacing_shift saved to ieee80211_hw
ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

drivers/net/wireless/ath/ath10k/core.c | 6 ++++++
drivers/net/wireless/ath/ath10k/hw.h | 3 +++
drivers/net/wireless/ath/ath10k/mac.c | 4 ++++
include/net/mac80211.h | 5 +++++
net/mac80211/main.c | 2 ++
net/mac80211/tx.c | 2 +-
6 files changed, 21 insertions(+), 1 deletion(-)

--
1.9.1


2018-08-13 13:59:47

by Toke Høiland-Jørgensen

Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Wen Gong <[email protected]> writes:

> Hi Toke,
>
> I have tested with your method for shift 6/8/10/7 all twice time in
> open air environment.

Great, thanks!

> Shift 6:
> [email protected]:~/flent$ flent -H 192.168.1.7 -t
> "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1

Just to be sure: You are running this on your laptop? And that laptop
has the ath10k and the modified kernel you are testing? Is 192.168.1.7
the AP or another device?

I'd like to see what happens when the link is fully saturated by
multiple streams as well. Could you repeat the tests with
upload_streams=5, please? And if you could share the .flent.gz files
(upload them somewhere, or email me off-list), that would be useful as
well :)

Thanks,

-Toke

2018-08-08 12:59:40

by Wen Gong

Subject: [PATCH v2 1/2] mac80211: Change sk_pacing_shift saved to ieee80211_hw

Add a tx_sk_pacing_shift field to make the sk_pacing_shift value
configurable by drivers. If no driver sets it, mac80211 will use the
default value.

Signed-off-by: Wen Gong <[email protected]>
---
V2:
- add the description for tx_sk_pacing_shift
include/net/mac80211.h | 5 +++++
net/mac80211/main.c | 2 ++
net/mac80211/tx.c | 2 +-
3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/net/mac80211.h b/include/net/mac80211.h
index 5790f55..9804d65 100644
--- a/include/net/mac80211.h
+++ b/include/net/mac80211.h
@@ -2290,6 +2290,10 @@ enum ieee80211_hw_flags {
* supported by HW.
* @max_nan_de_entries: maximum number of NAN DE functions supported by the
* device.
+ *
+ * @tx_sk_pacing_shift: The TCP stack allows more than a single ms of data
+ * to be queued in the stack. The value is a bit-shift of 1 second, e.g. 8
+ * is ~4ms of queued data. It only affects local TCP sockets.
*/
struct ieee80211_hw {
struct ieee80211_conf conf;
@@ -2325,6 +2329,7 @@ struct ieee80211_hw {
u8 n_cipher_schemes;
const struct ieee80211_cipher_scheme *cipher_schemes;
u8 max_nan_de_entries;
+ u8 tx_sk_pacing_shift;
};

static inline bool _ieee80211_hw_check(struct ieee80211_hw *hw,
diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index 4fb2709..23490fa 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -594,6 +594,8 @@ struct ieee80211_hw *ieee80211_alloc_hw_nm(size_t priv_data_len,
local->ops = ops;
local->use_chanctx = use_chanctx;

+ local->hw.tx_sk_pacing_shift = 8;
+
/* set up some defaults */
local->hw.queues = 1;
local->hw.max_rates = 1;
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 6a79d56..80855a8 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -3601,7 +3601,7 @@ void __ieee80211_subif_start_xmit(struct sk_buff *skb,
* second, so 8 is ~4ms of queued data. Only affects local TCP
* sockets.
*/
- sk_pacing_shift_update(skb->sk, 8);
+ sk_pacing_shift_update(skb->sk, sdata->local->hw.tx_sk_pacing_shift);

fast_tx = rcu_dereference(sta->fast_tx);

--
1.9.1

2018-08-14 08:41:15

by Wen Gong

Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

> -----Original Message-----
> From: Toke Høiland-Jørgensen <[email protected]>
> Sent: Monday, August 13, 2018 7:18 PM
> To: Wen Gong <[email protected]>; Wen Gong
> <[email protected]>; [email protected];
> [email protected]
> Cc: [email protected]
> Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi
> chips
>
> Wen Gong <[email protected]> writes:
>
> > Hi Toke,
> >
> > I have tested with your method for shift 6/8/10/7 all twice time in
> > open air environment.
>
> Great, thanks!
>
> > Shift 6:
> > [email protected]:~/flent$ flent -H 192.168.1.7 -t
> > "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
>
> Just to be sure: You are running this on your laptop? And that laptop has the
> ath10k and the modified kernel you are testing? Is 192.168.1.7 the AP or
> another device?
Hi Toke,
Yes, I run it on my laptop (Dell LATITUDE E5440 + QCA6174 PCI wireless card) in an open air environment.
The kernel is 4.18.0-rc2-wt-ath+ #2 SMP, built by myself.
mac80211.ko/ath10k are built with the 2 patches.
192.168.1.7 is another device (a PC running 4.4.0-116-generic #140~14.04.1-Ubuntu).
>
> I'd like to see what happens when the link is fully saturated by multiple
> streams as well. Could you repeat the tests with upload_streams=5, please?
> And if you could share the .flent.gz files (upload them somewhere, or email
> me off-list), that would be useful as well :)

Test result with upload_streams=5:

sk_pacing_shift6:
[email protected]:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105332.356811.sk_pacing_shift6.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-14 02:53:32.356811):

                      avg     median           # data pts
 Ping (ms) ICMP :   20.46      13.85 ms               350
 TCP upload avg :   66.30      68.71 Mbits/s          301
 TCP upload sum :  331.49     343.55 Mbits/s          301
 TCP upload::1  :   60.80      64.65 Mbits/s          202
 TCP upload::2  :   77.72      82.89 Mbits/s          211
 TCP upload::3  :   60.52      56.09 Mbits/s          202
 TCP upload::4  :   67.39      73.56 Mbits/s          204
 TCP upload::5  :   65.06      71.97 Mbits/s          201

[email protected]:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105554.583603.sk_pacing_shift6.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-14 02:55:54.583603):

                      avg     median           # data pts
 Ping (ms) ICMP :   20.86      13.80 ms               350
 TCP upload avg :   75.88      83.17 Mbits/s          301
 TCP upload sum :  379.42     415.84 Mbits/s          301
 TCP upload::1  :   82.07      90.73 Mbits/s          225
 TCP upload::2  :   74.08      78.29 Mbits/s          204
 TCP upload::3  :   70.44      75.65 Mbits/s          217
 TCP upload::4  :   82.70      92.86 Mbits/s          223
 TCP upload::5  :   70.13      76.87 Mbits/s          210

sk_pacing_shift7:
[email protected]:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105759.169367.sk_pacing_shift7.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-14 02:57:59.169367):

                      avg     median           # data pts
 Ping (ms) ICMP :   24.66      15.55 ms               350
 TCP upload avg :   65.33      72.83 Mbits/s          301
 TCP upload sum :  326.67     363.10 Mbits/s          301
 TCP upload::1  :   77.60      92.93 Mbits/s          214
 TCP upload::2  :   67.22      75.95 Mbits/s          213
 TCP upload::3  :   65.81      74.54 Mbits/s          205
 TCP upload::4  :   63.06      70.37 Mbits/s          207
 TCP upload::5  :   52.98      55.78 Mbits/s          198

[email protected]:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T105923.620572.sk_pacing_shift7.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-14 02:59:23.620572):

                      avg     median           # data pts
 Ping (ms) ICMP :   25.03      14.25 ms               350
 TCP upload avg :   74.35      85.64 Mbits/s          297
 TCP upload sum :  371.77     428.19 Mbits/s          297
 TCP upload::1  :   74.12      82.55 Mbits/s          216
 TCP upload::2  :   67.78      71.87 Mbits/s          208
 TCP upload::3  :   82.40      94.72 Mbits/s          210
 TCP upload::4  :   70.77      82.43 Mbits/s          206
 TCP upload::5  :   76.70      88.62 Mbits/s          210

sk_pacing_shift8:
[email protected]:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110334.670845.sk_pacing_shift8.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-14 03:03:34.670845):

                      avg     median           # data pts
 Ping (ms) ICMP :   25.03      19.50 ms               350
 TCP upload avg :   58.11      59.70 Mbits/s          301
 TCP upload sum :  290.53     298.51 Mbits/s          301
 TCP upload::1  :   51.37      51.74 Mbits/s          197
 TCP upload::2  :   42.02      43.66 Mbits/s          192
 TCP upload::3  :   61.17      62.33 Mbits/s          200
 TCP upload::4  :   66.11      70.31 Mbits/s          198
 TCP upload::5  :   69.86      76.31 Mbits/s          202

[email protected]:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110557.587769.sk_pacing_shift8.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-14 03:05:57.587769):

                      avg     median           # data pts
 Ping (ms) ICMP :   21.50      13.05 ms               350
 TCP upload avg :   61.59      62.00 Mbits/s          301
 TCP upload sum :  307.93     310.01 Mbits/s          301
 TCP upload::1  :   69.70      76.22 Mbits/s          210
 TCP upload::2  :   68.51      74.88 Mbits/s          207
 TCP upload::3  :   71.06      77.57 Mbits/s          200
 TCP upload::4  :   47.08      51.42 Mbits/s          196
 TCP upload::5  :   51.58      51.98 Mbits/s          203

sk_pacing_shift10:
[email protected]:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110814.434543.sk_pacing_shift10.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-14 03:08:14.434543):

                      avg     median           # data pts
 Ping (ms) ICMP :   31.57      19.35 ms               350
 TCP upload avg :   56.61      62.61 Mbits/s          301
 TCP upload sum :  283.07     313.07 Mbits/s          301
 TCP upload::1  :   39.39      42.96 Mbits/s          187
 TCP upload::2  :   62.20      72.39 Mbits/s          193
 TCP upload::3  :   61.72      74.31 Mbits/s          191
 TCP upload::4  :   61.86      73.74 Mbits/s          190
 TCP upload::5  :   57.90      65.11 Mbits/s          193

[email protected]:~/flent/5stream$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=5
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-14T110931.986159.sk_pacing_shift10.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-14 03:09:31.986159):

                      avg     median           # data pts
 Ping (ms) ICMP :   19.23      13.20 ms               350
 TCP upload avg :   76.36      81.37 Mbits/s          301
 TCP upload sum :  381.80     406.85 Mbits/s          301
 TCP upload::1  :   64.95      67.91 Mbits/s          212
 TCP upload::2  :   82.16      92.35 Mbits/s          215
 TCP upload::3  :   67.51      70.18 Mbits/s          213
 TCP upload::4  :   77.42      82.11 Mbits/s          232
 TCP upload::5  :   89.76      99.96 Mbits/s          226

>
> Thanks,
>
> -Toke

2018-08-17 14:35:52

by Toke Høiland-Jørgensen

Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Wen Gong <[email protected]> writes:

>> -----Original Message-----
>> From: Toke Høiland-Jørgensen <[email protected]>
>> Sent: Monday, August 13, 2018 7:18 PM
>> To: Wen Gong <[email protected]>; Wen Gong
>> <[email protected]>; [email protected];
>> [email protected]
>> Cc: [email protected]
>> Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi
>> chips
>>
>> Wen Gong <[email protected]> writes:
>>
>> > Hi Toke,
>> >
>> > I have tested with your method for shift 6/8/10/7 all twice time in
>> > open air environment.
>>
>> Great, thanks!
>>
>> > Shift 6:
>> > [email protected]:~/flent$ flent -H 192.168.1.7 -t
>> > "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
>>
>> Just to be sure: You are running this on your laptop? And that laptop has the
>> ath10k and the modified kernel you are testing? Is 192.168.1.7 the AP or
>> another device?
> Hi Toke,
> Yes, I run it on my laptop (Dell LATITUDE E5440 + QCA6174 PCI wireless card) in an open air environment.
> The kernel is 4.18.0-rc2-wt-ath+ #2 SMP, built by myself.
> mac80211.ko/ath10k are built with the 2 patches.
> 192.168.1.7 is another device (a PC running 4.4.0-116-generic #140~14.04.1-Ubuntu).
>>
>> I'd like to see what happens when the link is fully saturated by multiple
>> streams as well. Could you repeat the tests with upload_streams=5, please?
>> And if you could share the .flent.gz files (upload them somewhere, or email
>> me off-list), that would be useful as well :)
>
> Test result with upload_streams=5:

Sorry for the delay in looking at this, I've been terribly busy this
week; will get back to it in more detail next week.

One question for now: In the results you quote below, the total
throughput is way lower than the up to 600 Mbps you quoted in the patch
description. How did you run your original tests to achieve that high a
throughput?

-Toke

2018-08-10 21:59:41

by Arend Van Spriel

Subject: Re: [PATCH v2 0/2] Change sk_pacing_shift in ieee80211_hw for best tx throughput

On 8/10/2018 3:20 PM, Toke Høiland-Jørgensen wrote:
> Arend van Spriel <[email protected]> writes:
>
>> On 8/8/2018 9:00 PM, Peter Oh wrote:
>>>
>>>
>>> On 08/08/2018 03:40 AM, Wen Gong wrote:
>>>> Add a field for ath10k to adjust the sk_pacing_shift, mac80211 set
>>>> the default value to 8, and ath10k will change it to 6. Then mac80211
>>>> will use the changed value 6 as sk_pacing_shift since 6 is the best
>>>> value for tx throughput by test result.
>>> I don't think you can convince people with the numbers unless you
>>> provide latency along with the numbers and also measurement result on
>>> different chipsets as Michal addressed (QCA4019, QCA9984, etc.) From
>>> users view point, I also agree on Toke that we cannot sacrifice latency
>>> for the small throughput improvement.
>>
>> Yeah. The wireless industry (admittedly that is me too :-p ) has been
>> focused on just throughput long enough.
>
> Tell me about it ;)
>
>> All the preaching about bufferbloat from Dave and others is (just)
>> starting to sink in here and there.
>
> Yeah, I've noticed; this is good!
>
>> Now as for the value of the sk_pacing_shift I think we agree it
>> depends on the specific device so in that sense the api makes sense,
>> but I think there are a lot of variables so I was wondering if we
>> could introduce a sysctl parameter for it. Does that make sense?
>
> I'm not sure a sysctl parameter would make sense; for one thing, it
> would be global for the host, while different network interfaces will
> probably need different values. And for another, I don't think it's
> something a user can reasonably be expected to set correctly, and I
> think it *is* actually possible to pick a value that works well at the
> driver level.

I'm not sure either. Do you think a user could come up with something like
this (found here [1]):

sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
sysctl -w net.core.rmem_default=65536
sysctl -w net.core.wmem_default=65536
sysctl -w net.ipv4.tcp_rmem='4096 87380 8388608'
sysctl -w net.ipv4.tcp_wmem='4096 65536 8388608'
sysctl -w net.ipv4.tcp_mem='8388608 8388608 8388608'
sysctl -w net.ipv4.route.flush=1

Now the page listing this config claims it is for use "on Linux 2.4+
for high-bandwidth applications". Beats me whether it is still correct in 4.17.

Anyway, sysctl is nice for parameterizing code that is built into the
kernel so you don't need to rebuild it. mac80211 tends to be a module in
most distros, so maybe sysctl is not a good fit. So let's agree on that.

Picking a value at driver level may be possible, but a driver tends to
support a number of different devices. So how do you see the picking
working? Some static table with entries for the different devices?

Regards,
Arend

[1] https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php

2018-08-08 12:59:41

by Wen Gong

Subject: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

The upstream kernel has an interface to help adjust sk_pacing_shift to
improve TCP UL throughput.
The sk_pacing_shift default is 8 in mac80211; this is based on tests
with 11N WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips,
11AC VHT80 TCP UL throughput testing shows that 6 is optimal.
Override sk_pacing_shift to 6 in the ath10k driver for QCA6174/9377 PCI.

Tested with QCA6174 PCI with firmware
WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 PCI.
It is not a regression with new firmware releases.

There are 2 test results with different settings:

ARM CPU based device with QCA6174A PCI with different
sk_pacing_shift values:

 sk_pacing_shift  throughput (Mbps)  CPU utilization
 6                500 (-P5)          ~75% idle, Focus on CPU1: ~14% idle
 7                454 (-P5)          ~80% idle, Focus on CPU1: ~4% idle
 8                288                ~90% idle, Focus on CPU1: ~35% idle
 9                ~200               ~92% idle, Focus on CPU1: ~50% idle

5G TCP UL VHT80 on an X86 platform with QCA6174A PCI with
sk_pacing_shift set to 6:

 tcp_limit_output_bytes         throughput (Mbps)
 default (262144) + 1 stream    336
 default (262144) + 2 streams   558
 default (262144) + 3 streams   584
 default (262144) + 4 streams   602
 default (262144) + 5 streams   598
 changed (2621440) + 1 stream   598
 changed (2621440) + 2 streams  601

Signed-off-by: Wen Gong <[email protected]>
---
V2:
- make the optimal value configurable per hardware type
drivers/net/wireless/ath/ath10k/core.c | 6 ++++++
drivers/net/wireless/ath/ath10k/hw.h | 6 ++++++
drivers/net/wireless/ath/ath10k/mac.c | 3 +++
3 files changed, 15 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index 85c58eb..fbd13ec 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -189,6 +189,7 @@
.per_ce_irq = false,
.shadow_reg_support = false,
.rri_on_ddr = false,
+ .tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
},
{
.id = QCA6174_HW_2_1_VERSION,
@@ -221,6 +222,7 @@
.per_ce_irq = false,
.shadow_reg_support = false,
.rri_on_ddr = false,
+ .tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
},
{
.id = QCA6174_HW_3_0_VERSION,
@@ -253,6 +255,7 @@
.per_ce_irq = false,
.shadow_reg_support = false,
.rri_on_ddr = false,
+ .tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
},
{
.id = QCA6174_HW_3_2_VERSION,
@@ -288,6 +291,7 @@
.per_ce_irq = false,
.shadow_reg_support = false,
.rri_on_ddr = false,
+ .tx_sk_pacing_shift = SK_PACING_SHIFT_6174,
},
{
.id = QCA99X0_HW_2_0_DEV_VERSION,
@@ -443,6 +447,7 @@
.per_ce_irq = false,
.shadow_reg_support = false,
.rri_on_ddr = false,
+ .tx_sk_pacing_shift = SK_PACING_SHIFT_9377,
},
{
.id = QCA9377_HW_1_1_DEV_VERSION,
@@ -477,6 +482,7 @@
.per_ce_irq = false,
.shadow_reg_support = false,
.rri_on_ddr = false,
+ .tx_sk_pacing_shift = SK_PACING_SHIFT_9377,
},
{
.id = QCA4019_HW_1_0_DEV_VERSION,
diff --git a/drivers/net/wireless/ath/ath10k/hw.h b/drivers/net/wireless/ath/ath10k/hw.h
index a274bd8..1f956d6 100644
--- a/drivers/net/wireless/ath/ath10k/hw.h
+++ b/drivers/net/wireless/ath/ath10k/hw.h
@@ -161,6 +161,9 @@ enum qca9377_chip_id_rev {

#define REG_DUMP_COUNT_QCA988X 60

+#define SK_PACING_SHIFT_6174 6
+#define SK_PACING_SHIFT_9377 6
+
struct ath10k_fw_ie {
__le32 id;
__le32 len;
@@ -589,6 +592,9 @@ struct ath10k_hw_params {

/* Number of bytes to be the offset for each FFT sample */
int spectral_bin_offset;
+
+ /* Shift value to override the default value of ieee80211_hw */
+ u8 tx_sk_pacing_shift;
};

struct htt_rx_desc;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 95243b4..4f2c07f 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -8361,6 +8361,9 @@ int ath10k_mac_register(struct ath10k *ar)
ar->hw->wiphy->max_scan_ssids = WLAN_SCAN_PARAMS_MAX_SSID;
ar->hw->wiphy->max_scan_ie_len = WLAN_SCAN_PARAMS_MAX_IE_LEN;

+ if (ar->hw_params.tx_sk_pacing_shift != 0)
+ ar->hw->tx_sk_pacing_shift = ar->hw_params.tx_sk_pacing_shift;
+
ar->hw->vif_data_size = sizeof(struct ath10k_vif);
ar->hw->sta_data_size = sizeof(struct ath10k_sta);
ar->hw->txq_data_size = sizeof(struct ath10k_txq);
--
1.9.1

2018-08-31 03:30:00

by Peter Oh

Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Hi Toke,


On 08/17/2018 04:32 AM, Toke Høiland-Jørgensen wrote:
>
>>>> Shift 6:
>>>> [email protected]:~/flent$ flent -H 192.168.1.7 -t
>>>> "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
>>>
Does flent have options to set the TOS or QoS queue for TCP or UDP tests?
If I add "--test-parameter burst_tos=0xe0", will it use the corresponding
mac80211 QoS queue?
flent -H xxxx -t "xxxx" tcp_nup --test-parameter upload_streams=1
--test-parameter burst_tos=0xe0

Thanks,
Peter

2018-08-10 10:40:47

by Wen Gong

Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

> -----Original Message-----
> From: ath10k <[email protected]> On Behalf Of Toke
> Høiland-Jørgensen
> Sent: Wednesday, August 8, 2018 6:44 PM
> To: Wen Gong <[email protected]>; [email protected];
> [email protected]
> Cc: [email protected]
> Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi
> chips
>
> Wen Gong <[email protected]> writes:
>
> > Upstream kernel has an interface to help adjust sk_pacing_shift to
> > help improve TCP UL throughput.
> > The sk_pacing_shift is 8 in mac80211, this is based on test with 11N
> > WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC
> > VHT80 TCP UL throughput testing result shows 6 is the optimal.
> > Overwrite the sk_pacing_shift to 6 in ath10k driver for QCA6174/9377 PCI.
> >
> > Tested with QCA6174 PCI with firmware
> > WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377
> PCI.
> > It's not a regression with new firmware releases.
> >
> > There are 2 test results with different settings:
> >
> > ARM CPU based device with QCA6174A PCI with different
> > sk_pacing_shift values:
> >
> > sk_pacing_shift  throughput (Mbps)  CPU utilization
> > 6                500 (-P5)          ~75% idle, Focus on CPU1: ~14% idle
> > 7                454 (-P5)          ~80% idle, Focus on CPU1: ~4% idle
> > 8                288                ~90% idle, Focus on CPU1: ~35% idle
> > 9                ~200               ~92% idle, Focus on CPU1: ~50% idle
> >
> > 5G TCP UL VHT80 on X86 platform with QCA6174A PCI with
> > sk_pacing_shift set to 6:
> >
> > tcp_limit_output_bytes         throughput (Mbps)
> > default (262144) + 1 stream    336
> > default (262144) + 2 streams   558
> > default (262144) + 3 streams   584
> > default (262144) + 4 streams   602
> > default (262144) + 5 streams   598
> > changed (2621440) + 1 stream   598
> > changed (2621440) + 2 streams  601
>
> You still haven't provided any latency numbers for these tests, which makes
> it impossible to verify that setting sk_pacing_shift to 6 is the right tradeoff.
>
> As I said before, from your numbers I suspect the right setting is actually 7,
> which would be 10-20ms less latency under load; way more important than
> ~50 Mbps...
>
Hi Toke,
Could you give the command line for the latency test?
https://flent.org/intro.html#quick-start
I used the command but the test failed:
flent tcp_download -p 1 -l 60 -H 192.168.1.5 -t text-to-be-included-in-plot -o file1.png
error loading plotter: unable to find plot configuration "1"

> -Toke
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k

2018-08-10 15:47:35

by Toke Høiland-Jørgensen

Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Wen Gong <[email protected]> writes:

>> -----Original Message-----
>> From: ath10k <[email protected]> On Behalf Of Toke
>> Høiland-Jørgensen
>> Sent: Wednesday, August 8, 2018 6:44 PM
>> To: Wen Gong <[email protected]>; [email protected];
>> [email protected]
>> Cc: [email protected]
>> Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi
>> chips
>>
>> Wen Gong <[email protected]> writes:
>>
>> > Upstream kernel has an interface to help adjust sk_pacing_shift to
>> > help improve TCP UL throughput.
>> > The sk_pacing_shift is 8 in mac80211, this is based on test with 11N
>> > WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC
>> > VHT80 TCP UL throughput testing result shows 6 is the optimal.
>> > Overwrite the sk_pacing_shift to 6 in ath10k driver for QCA6174/9377 PCI.
>> >
>> > Tested with QCA6174 PCI with firmware
>> > WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377
>> PCI.
>> > It's not a regression with new firmware releases.
>> >
>> > There are 2 test results with different settings:
>> >
>> > ARM CPU based device with QCA6174A PCI with different
>> > sk_pacing_shift values:
>> >
>> > sk_pacing_shift  throughput (Mbps)  CPU utilization
>> > 6                500 (-P5)          ~75% idle, Focus on CPU1: ~14% idle
>> > 7                454 (-P5)          ~80% idle, Focus on CPU1: ~4% idle
>> > 8                288                ~90% idle, Focus on CPU1: ~35% idle
>> > 9                ~200               ~92% idle, Focus on CPU1: ~50% idle
>> >
>> > 5G TCP UL VHT80 on X86 platform with QCA6174A PCI with
>> > sk_pacing_shift set to 6:
>> >
>> > tcp_limit_output_bytes         throughput (Mbps)
>> > default (262144) + 1 stream    336
>> > default (262144) + 2 streams   558
>> > default (262144) + 3 streams   584
>> > default (262144) + 4 streams   602
>> > default (262144) + 5 streams   598
>> > changed (2621440) + 1 stream   598
>> > changed (2621440) + 2 streams  601
>>
>> You still haven't provided any latency numbers for these tests, which makes
>> it impossible to verify that setting sk_pacing_shift to 6 is the right tradeoff.
>>
>> As I said before, from your numbers I suspect the right setting is actually 7,
>> which would be 10-20ms less latency under load; way more important than
>> ~50 Mbps...
>>
> Hi Toke,
> Could you give the command line for the latency test?
> https://flent.org/intro.html#quick-start
> I used the command but the test failed:
> flent tcp_download -p 1 -l 60 -H 192.168.1.5 -t text-to-be-included-in-plot -o file1.png
> error loading plotter: unable to find plot configuration "1"

Try something like:


flent -H 192.168.1.5 -t "sk_pacing_shift 7" tcp_nup --test-parameter upload_streams=1


This will note the value of sk_pacing_shift you are testing in the data
file (so change that as appropriate), and you can vary the number of TCP
streams by changing the upload_streams parameter.

Note that in this case I'm assuming you are running Flent on the device
with the kernel you are trying to test, so you want a TCP transfer going
*from* the device. If not, change "tcp_nup" to "tcp_ndown" and
"upload_streams" to "download_streams". Upload is netperf TCP_STREAM
test, and download is TCP_MAERTS.

When running the above command you'll get a summary output on the
terminal that you can paste on the list; and also a data file to plot
things from. For instance, you can do something like 'flent -p ping_cdf
*.flent.gz' to get a CDF plot of all your test results afterwards. You
are also very welcome to send me the .flent.gz data files and I'll take
a look to make sure everything looks reasonable :)

-Toke

2018-08-11 22:27:12

by Arend Van Spriel

Subject: Re: [PATCH v2 0/2] Change sk_pacing_shift in ieee80211_hw for best tx throughput

+ Eric

On 8/10/2018 9:52 PM, Ben Greear wrote:
> On 08/10/2018 12:28 PM, Arend van Spriel wrote:
>> On 8/10/2018 3:20 PM, Toke Høiland-Jørgensen wrote:
>>> Arend van Spriel <[email protected]> writes:
>>>
>>>> On 8/8/2018 9:00 PM, Peter Oh wrote:
>>>>>
>>>>>
>>>>> On 08/08/2018 03:40 AM, Wen Gong wrote:
>>>>>> Add a field for ath10k to adjust the sk_pacing_shift, mac80211 set
>>>>>> the default value to 8, and ath10k will change it to 6. Then mac80211
>>>>>> will use the changed value 6 as sk_pacing_shift since 6 is the best
>>>>>> value for tx throughput by test result.
>>>>> I don't think you can convince people with the numbers unless you
>>>>> provide latency along with the numbers and also measurement result on
>>>>> different chipsets as Michal addressed (QCA4019, QCA9984, etc.) From
>>>>> users view point, I also agree on Toke that we cannot sacrifice latency
>>>>> for the small throughput improvement.
>>>>
>>>> Yeah. The wireless industry (admittedly that is me too :-p ) has been
>>>> focused on just throughput long enough.
>>>
>>> Tell me about it ;)
>>>
>>>> All the preaching about bufferbloat from Dave and others is (just)
>>>> starting to sink in here and there.
>>>
>>> Yeah, I've noticed; this is good!
>>>
>>>> Now as for the value of the sk_pacing_shift I think we agree it
>>>> depends on the specific device so in that sense the api makes sense,
>>>> but I think there are a lot of variables so I was wondering if we
>>>> could introduce a sysctl parameter for it. Does that make sense?
>>>
>>> I'm not sure a sysctl parameter would make sense; for one thing, it
>>> would be global for the host, while different network interfaces will
>>> probably need different values. And for another, I don't think it's
>>> something a user can reasonably be expected to set correctly, and I
>>> think it *is* actually possible to pick a value that works well at the
>>> driver level.
>>
>> I'm not sure either. Do you think a user could come up with something
>> like this (found here [1]):
>>
>> sysctl -w net.core.rmem_max=8388608
>> sysctl -w net.core.wmem_max=8388608
>> sysctl -w net.core.rmem_default=65536
>> sysctl -w net.core.wmem_default=65536
>> sysctl -w net.ipv4.tcp_rmem='4096 87380 8388608'
>> sysctl -w net.ipv4.tcp_wmem='4096 65536 8388608'
>> sysctl -w net.ipv4.tcp_mem='8388608 8388608 8388608'
>> sysctl -w net.ipv4.route.flush=1
>>
>> Now the page listing this config claims this is for use "on Linux 2.4+
>> for high-bandwidth applications". Beats me if it still is correct in
>> 4.17.
>>
>> Anyway, sysctl is nice for parameterizing code that is built into the
>> kernel so you don't need to rebuild it. mac80211 tends to be a module
>> in most distros, so maybe sysctl is not a good fit. So let's agree on that.
>>
>> Picking a value at driver level may be possible, but a driver tends to
>> support a number of different devices. So how do you see the picking
>> working? Some static table with entries for the different devices?
>
> Some users are not going to care about latency, and for others, latency may
> be absolutely important and they don't care about bandwidth.
>
> So, it should be tunable. sysctl can support per network-device settings,
> right? Or, probably could use ethtool API to set a per-netdev value as
> well.
> That might be nice for other network devices as well, not just wifi.

I was under the impression that the parameters are all global, but your
statement made me look. I came across some references here [2] so I
checked the kernel sources under net/ and found net/ipv4/devinet.c [3].
So that confirms it supports per-netdev settings.

The sk_pacing_shift is actually a socket setting although I did not find
a user-space api to change it. For instance it is not a socket option.
There might be a good reason for not exposing it to user-space (added Eric).

Regards,
Arend

[2]
https://askubuntu.com/questions/316492/disabling-ipv6-on-a-single-interface
[3] https://elixir.bootlin.com/linux/latest/source/net/ipv4/devinet.c#L2309

2018-08-20 16:02:04

by Toke Høiland-Jørgensen

Subject: Re: [PATCH v2 0/2] Change sk_pacing_shift in ieee80211_hw for best tx throughput

Arend van Spriel <[email protected]> writes:

> + Eric
>
> On 8/10/2018 9:52 PM, Ben Greear wrote:
>> On 08/10/2018 12:28 PM, Arend van Spriel wrote:
>>> On 8/10/2018 3:20 PM, Toke Høiland-Jørgensen wrote:
>>>> Arend van Spriel <[email protected]> writes:
>>>>
>>>>> On 8/8/2018 9:00 PM, Peter Oh wrote:
>>>>>>
>>>>>>
>>>>>> On 08/08/2018 03:40 AM, Wen Gong wrote:
>>>>>>> Add a field for ath10k to adjust the sk_pacing_shift, mac80211 set
>>>>>>> the default value to 8, and ath10k will change it to 6. Then mac80211
>>>>>>> will use the changed value 6 as sk_pacing_shift since 6 is the best
>>>>>>> value for tx throughput by test result.
>>>>>> I don't think you can convince people with the numbers unless you
>>>>>> provide latency along with the numbers and also measurement result on
>>>>>> different chipsets as Michal addressed (QCA4019, QCA9984, etc.) From
>>>>>> users' viewpoint, I also agree with Toke that we cannot sacrifice latency
>>>>>> for the small throughput improvement.
>>>>>
>>>>> Yeah. The wireless industry (admittedly that is me too :-p ) has been
>>>>> focused on just throughput long enough.
>>>>
>>>> Tell me about it ;)
>>>>
>>>>> All the preaching about bufferbloat from Dave and others is (just)
>>>>> starting to sink in here and there.
>>>>
>>>> Yeah, I've noticed; this is good!
>>>>
>>>>> Now as for the value of the sk_pacing_shift I think we agree it
>>>>> depends on the specific device so in that sense the api makes sense,
>>>>> but I think there are a lot of variables so I was wondering if we
>>>>> could introduce a sysctl parameter for it. Does that make sense?
>>>>
>>>> I'm not sure a sysctl parameter would make sense; for one thing, it
>>>> would be global for the host, while different network interfaces will
>>>> probably need different values. And for another, I don't think it's
>>>> something a user can reasonably be expected to set correctly, and I
>>>> think it *is* actually possible to pick a value that works well at the
>>>> driver level.
>>>
>>> I'm not sure either. Do you think a user could come up with something
>>> like this (found here [1]):
>>>
>>> sysctl -w net.core.rmem_max=8388608
>>> sysctl -w net.core.wmem_max=8388608
>>> sysctl -w net.core.rmem_default=65536
>>> sysctl -w net.core.wmem_default=65536
>>> sysctl -w net.ipv4.tcp_rmem='4096 87380 8388608'
>>> sysctl -w net.ipv4.tcp_wmem='4096 65536 8388608'
>>> sysctl -w net.ipv4.tcp_mem='8388608 8388608 8388608'
>>> sysctl -w net.ipv4.route.flush=1
>>>
>>> Now the page listing this config claims this is for use "on Linux 2.4+
>>> for high-bandwidth applications". Beats me if it still is correct in
>>> 4.17.
>>>
>>> Anyway, sysctl is nice for parameterizing code that is built-in the
>>> kernel so you don't need to rebuild it. mac80211 tends to be a module
>>> in most distros so
>>> maybe sysctl is not a good fit. So let's agree on that.
>>>
>>> Picking a value at driver level may be possible, but a driver tends to
>>> support a number of different devices. So how do you see the picking
>>> working? Some static
>>> table with entries for the different devices?
>>
>> Some users are not going to care about latency, and for others, latency may
>> be absolutely important and they don't care about bandwidth.
>>
>> So, it should be tunable. sysctl can support per network-device settings,
>> right? Or, probably could use ethtool API to set a per-netdev value as
>> well.
>> That might be nice for other network devices as well, not just wifi.
>
> I was under the impression that the parameters are all global, but your
> statement made me look. I came across some references here [2] so I
> checked the kernel sources under net/ and found net/ipv4/devinet.c [3].
> So that confirms it supports per-netdev settings.

Yeah, I think that *if* this is to be made configurable, a per-netdev
sysctl would be the way to go, with the driver being able to set the
default.

However, the reason I think it may not be worth it to expose this as a
setting is that it is very much a case of diminishing returns. Once the
buffer size is large enough that full aggregates can be built,
increasing it further just adds latency with very little effect on
throughput. Which means that fiddling with the parameter is not going to
have a lot of effect, so it is not very useful to expose, which makes it
not worth the added configuration complexity...
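The buffer-sizing argument above can be made concrete with a quick calculation: the TCP small-queues budget is roughly the amount of data sent in 1/2^sk_pacing_shift seconds at the current pacing rate, so each decrement of the shift doubles the buffering interval. A back-of-the-envelope sketch (the 600 Mbit/s rate is an assumed example, not a figure from the thread):

```python
# Back-of-the-envelope sketch (not kernel code): the TCP small-queues
# budget is roughly pacing_rate / 2^sk_pacing_shift bytes, i.e. the
# amount of data sent in 1/2^shift seconds at the current pacing rate.

def tsq_budget_bytes(pacing_rate_bps, sk_pacing_shift):
    """Approximate queued-byte budget at a given pacing rate (bits/s)."""
    return (pacing_rate_bps / 8) / (1 << sk_pacing_shift)

def buffer_interval_ms(sk_pacing_shift):
    """Wall-clock worth of buffering the budget represents."""
    return 1000.0 / (1 << sk_pacing_shift)

for shift in (6, 7, 8, 10):
    print("shift %2d: %6.2f ms, %8d bytes at 600 Mbit/s"
          % (shift, buffer_interval_ms(shift), tsq_budget_bytes(600e6, shift)))
```

This is why the shift-6-versus-8 debate amounts to roughly 16 ms versus 4 ms of host-side buffering: past the point where full aggregates can be built, the extra milliseconds only add queueing delay.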

-Toke

2018-08-20 18:31:07

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Change sk_pacing_shift in ieee80211_hw for best tx throughput



On 08/20/2018 05:46 AM, Toke Høiland-Jørgensen wrote:
> Arend van Spriel <[email protected]> writes:
>
>> + Eric
>>
>> On 8/10/2018 9:52 PM, Ben Greear wrote:
>>> On 08/10/2018 12:28 PM, Arend van Spriel wrote:
>>>> On 8/10/2018 3:20 PM, Toke Høiland-Jørgensen wrote:
>>>>> Arend van Spriel <[email protected]> writes:
>>>>>
>>>>>> On 8/8/2018 9:00 PM, Peter Oh wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 08/08/2018 03:40 AM, Wen Gong wrote:
>>>>>>>> Add a field for ath10k to adjust the sk_pacing_shift, mac80211 set
>>>>>>>> the default value to 8, and ath10k will change it to 6. Then mac80211
>>>>>>>> will use the changed value 6 as sk_pacing_shift since 6 is the best
>>>>>>>> value for tx throughput by test result.
>>>>>>> I don't think you can convince people with the numbers unless you
>>>>>>> provide latency along with the numbers and also measurement result on
>>>>>>> different chipsets as Michal addressed (QCA4019, QCA9984, etc.) From
>>>>>>> users' viewpoint, I also agree with Toke that we cannot sacrifice latency
>>>>>>> for the small throughput improvement.
>>>>>>
>>>>>> Yeah. The wireless industry (admittedly that is me too :-p ) has been
>>>>>> focused on just throughput long enough.
>>>>>
>>>>> Tell me about it ;)
>>>>>
>>>>>> All the preaching about bufferbloat from Dave and others is (just)
>>>>>> starting to sink in here and there.
>>>>>
>>>>> Yeah, I've noticed; this is good!
>>>>>
>>>>>> Now as for the value of the sk_pacing_shift I think we agree it
>>>>>> depends on the specific device so in that sense the api makes sense,
>>>>>> but I think there are a lot of variables so I was wondering if we
>>>>>> could introduce a sysctl parameter for it. Does that make sense?
>>>>>
>>>>> I'm not sure a sysctl parameter would make sense; for one thing, it
>>>>> would be global for the host, while different network interfaces will
>>>>> probably need different values. And for another, I don't think it's
>>>>> something a user can reasonably be expected to set correctly, and I
>>>>> think it *is* actually possible to pick a value that works well at the
>>>>> driver level.
>>>>
>>>> I'm not sure either. Do you think a user could come up with something
>>>> like this (found here [1]):
>>>>
>>>> sysctl -w net.core.rmem_max=8388608
>>>> sysctl -w net.core.wmem_max=8388608
>>>> sysctl -w net.core.rmem_default=65536
>>>> sysctl -w net.core.wmem_default=65536
>>>> sysctl -w net.ipv4.tcp_rmem='4096 87380 8388608'
>>>> sysctl -w net.ipv4.tcp_wmem='4096 65536 8388608'
>>>> sysctl -w net.ipv4.tcp_mem='8388608 8388608 8388608'
>>>> sysctl -w net.ipv4.route.flush=1
>>>>
>>>> Now the page listing this config claims this is for use "on Linux 2.4+
>>>> for high-bandwidth applications". Beats me if it still is correct in
>>>> 4.17.
>>>>
>>>> Anyway, sysctl is nice for parameterizing code that is built-in the
>>>> kernel so you don't need to rebuild it. mac80211 tends to be a module
>>>> in most distros so
>>>> maybe sysctl is not a good fit. So let's agree on that.
>>>>
>>>> Picking a value at driver level may be possible, but a driver tends to
>>>> support a number of different devices. So how do you see the picking
>>>> working? Some static
>>>> table with entries for the different devices?
>>>
>>> Some users are not going to care about latency, and for others, latency may
>>> be absolutely important and they don't care about bandwidth.
>>>
>>> So, it should be tunable. sysctl can support per network-device settings,
>>> right? Or, probably could use ethtool API to set a per-netdev value as
>>> well.
>>> That might be nice for other network devices as well, not just wifi.
>>
>> I was under the impression that the parameters are all global, but your
>> statement made me look. I came across some references here [2] so I
>> checked the kernel sources under net/ and found net/ipv4/devinet.c [3].
>> So that confirms it supports per-netdev settings.
>
> Yeah, I think that *if* this is to be made configurable, a per-netdev
> sysctl would be the way to go, with the driver being able to set the
> default.
>
> However, the reason I think it may not be worth it to expose this as a
> setting is that it is very much a case of diminishing returns. Once the
> buffer size is large enough that full aggregates can be built,
> increasing it further just adds latency with very little effect on
> throughput. Which means that fiddling with the parameter is not going to
> have a lot of effect, so it is not very useful to expose, which makes it
> not worth the added configuration complexity...

If it were easy, it would already be correct. I think adding a tuning knob
and some documentation will allow users to more easily try different things
and use what is best for them (and let the community at large know what works
so maybe the defaults can be tweaked over time).

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2018-08-13 08:18:10

by Wen Gong

[permalink] [raw]
Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

> -----Original Message-----
> From: Toke Høiland-Jørgensen <toke@toke.dk>
> Sent: Friday, August 10, 2018 9:18 PM
> To: Wen Gong <wgong@qti.qualcomm.com>; Wen Gong
> <wgong@codeaurora.org>; ath10k@lists.infradead.org;
> johannes@sipsolutions.net
> Cc: linux-wireless@vger.kernel.org
> Subject: RE: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi
> chips
>
> Wen Gong <wgong@qti.qualcomm.com> writes:
>
> >>
> > Hi Toke,
> > Could you give the command line for the latency test?
> > https://flent.org/intro.html#quick-start
> > I used the command but test failed:
> > flent tcp_download -p 1 -l 60 -H 192.168.1.5 -t
> > text-to-be-included-in-plot -o file1.png error loading plotter: unable to find
> plot configuration "1"
>
> Try something like:
>
>
> flent -H 192.168.1.5 -t "sk_pacing_shift 7" tcp_nup --test-parameter
> upload_streams=1
>
>
> This will note the value of sk_pacing_shift you are testing in the data file (so
> change that as appropriate), and you can vary the number of TCP streams by
> changing the upload_streams parameter.
>
> Note that in this case I'm assuming you are running Flent on the device with
> the kernel you are trying to test, so you want a TCP transfer going
> *from* the device. If not, change "tcp_nup" to "tcp_ndown" and
> "upload_streams" to "download_streams". Upload is netperf TCP_STREAM
> test, and download is TCP_MAERTS.
>
> When running the above command you'll get a summary output on the
> terminal that you can paste on the list; and also a data file to plot things form.
> For instance, you can do something like 'flent -p ping_cdf *.flent.gz' to get a
> CDF plot of all your test results afterwards. You are also very welcome to
> send me the .flent.gz data files and I'll take a look to make sure everything
> looks reasonable :)
>
> -Toke

Hi Toke,

I have tested with your method for shift 6/8/10/7 all twice time in open air environment.
Shift 6:
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T110414.699512.sk_pacing_shift6.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-13 03:04:14.699512):

                           avg       median          # data pts
 Ping (ms) ICMP :         9.91         4.99 ms              350
 TCP upload avg :       242.48       262.43 Mbits/s         301
 TCP upload sum :       242.48       262.43 Mbits/s         301
 TCP upload::1  :       242.48       263.34 Mbits/s         271
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T113317.074077.sk_pacing_shift6.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift6' (at 2018-08-13 03:33:17.074077):

                           avg       median          # data pts
 Ping (ms) ICMP :         7.75         5.30 ms              350
 TCP upload avg :       239.17       250.84 Mbits/s         301
 TCP upload sum :       239.17       250.84 Mbits/s         301
 TCP upload::1  :       239.17       255.03 Mbits/s         266

Shift 8:
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T121433.187781.sk_pacing_shift8.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-13 04:14:33.187781):

                           avg       median          # data pts
 Ping (ms) ICMP :        17.12         7.07 ms              350
 TCP upload avg :       180.05       185.82 Mbits/s         301
 TCP upload sum :       180.05       185.82 Mbits/s         301
 TCP upload::1  :       180.05       189.41 Mbits/s         253
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift8" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T121602.382575.sk_pacing_shift8.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift8' (at 2018-08-13 04:16:02.382575):

                           avg       median          # data pts
 Ping (ms) ICMP :        13.90         5.89 ms              350
 TCP upload avg :       207.56       228.16 Mbits/s         301
 TCP upload sum :       207.56       228.16 Mbits/s         301
 TCP upload::1  :       207.56       228.11 Mbits/s         259

Shift 10:
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T121844.493498.sk_pacing_shift10.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-13 04:18:44.493498):

                           avg       median          # data pts
 Ping (ms) ICMP :        15.11         7.41 ms              350
 TCP upload avg :       162.38       164.10 Mbits/s         301
 TCP upload sum :       162.38       164.10 Mbits/s         301
 TCP upload::1  :       162.38       165.47 Mbits/s         252
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift10" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T122108.347163.sk_pacing_shift10.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift10' (at 2018-08-13 04:21:08.347163):

                           avg       median          # data pts
 Ping (ms) ICMP :        13.69         7.48 ms              350
 TCP upload avg :       171.11       170.52 Mbits/s         301
 TCP upload sum :       171.11       170.52 Mbits/s         301
 TCP upload::1  :       171.11       171.36 Mbits/s         258

Shift 7:
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T122948.020974.sk_pacing_shift7.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-13 04:29:48.020974):

                           avg       median          # data pts
 Ping (ms) ICMP :        14.12         6.61 ms              350
 TCP upload avg :       188.19       188.04 Mbits/s         301
 TCP upload sum :       188.19       188.04 Mbits/s         301
 TCP upload::1  :       188.19       190.88 Mbits/s         258
wgong@wgong-Latitude-E5440-1:~/flent$ flent -H 192.168.1.7 -t "sk_pacing_shift7" tcp_nup --test-parameter upload_streams=1
Started Flent 1.2.2 using Python 2.7.12.
Starting tcp_nup test. Expected run time: 70 seconds.
Data file written to ./tcp_nup-2018-08-13T123129.526514.sk_pacing_shift7.flent.gz.
Summary of tcp_nup test run 'sk_pacing_shift7' (at 2018-08-13 04:31:29.526514):

                           avg       median          # data pts
 Ping (ms) ICMP :        10.31         6.32 ms              350
 TCP upload avg :       212.70       233.69 Mbits/s         301
 TCP upload sum :       212.70       233.69 Mbits/s         301
 TCP upload::1  :       212.70       237.65 Mbits/s         262
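The two runs per shift value in this message can be condensed into one line per shift. A quick sketch (values transcribed from the flent summaries above; this averaging is not part of the thread):

```python
# Averaging the two runs per shift value transcribed from the flent
# summaries above (avg ICMP ping in ms, avg TCP upload in Mbit/s).
runs = {
    6:  [(9.91, 242.48), (7.75, 239.17)],
    7:  [(14.12, 188.19), (10.31, 212.70)],
    8:  [(17.12, 180.05), (13.90, 207.56)],
    10: [(15.11, 162.38), (13.69, 171.11)],
}

for shift in sorted(runs):
    ping = sum(p for p, _ in runs[shift]) / len(runs[shift])
    tput = sum(t for _, t in runs[shift]) / len(runs[shift])
    print("shift %2d: %5.2f ms avg ping, %6.2f Mbit/s avg upload"
          % (shift, ping, tput))
```

The condensed view shows the pattern being debated in the thread: shift 6 has the highest upload throughput, and 7 sits between 6 and 8 on both metrics.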

2018-08-08 13:02:46

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Wen Gong <[email protected]> writes:

> Upstream kernel has an interface to help adjust sk_pacing_shift to help
> improve TCP UL throughput.
> The sk_pacing_shift is 8 in mac80211, this is based on test with 11N
> WiFi chips with ath9k. For QCA6174/QCA9377 PCI 11AC chips, the 11AC
> VHT80 TCP UL throughput testing result shows 6 is the optimal.
> Overwrite the sk_pacing_shift to 6 in ath10k driver for QCA6174/9377 PCI.
>
> Tested with QCA6174 PCI with firmware
> WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 PCI.
> It's not a regression with new firmware releases.
>
> There are 2 test results with different settings:
>
> ARM CPU based device with QCA6174A PCI with different
> sk_pacing_shift:
>
> sk_pacing_shift throughput(Mbps) CPU utilization
> 6 500(-P5) ~75% idle, Focus on CPU1: ~14%idle
> 7 454(-P5) ~80% idle, Focus on CPU1: ~4%idle
> 8 288 ~90% idle, Focus on CPU1: ~35%idle
> 9 ~200 ~92% idle, Focus on CPU1: ~50%idle
>
> 5G TCP UL VHT80 on X86 platform with QCA6174A PCI with sk_pacing_shift
> set to 6:
>
> tcp_limit_output_bytes throughput(Mbps)
> default(262144)+1 Stream 336
> default(262144)+2 Streams 558
> default(262144)+3 Streams 584
> default(262144)+4 Streams 602
> default(262144)+5 Streams 598
> changed(2621440)+1 Stream 598
> changed(2621440)+2 Streams 601

You still haven't provided any latency numbers for these tests, which
makes it impossible to verify that setting sk_pacing_shift to 6 is the
right tradeoff.

As I said before, from your numbers I suspect the right setting is
actually 7, which would be 10-20ms less latency under load; way more
important than ~50 Mbps...

-Toke

2018-08-10 22:24:08

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Change sk_pacing_shift in ieee80211_hw for best tx throughput

On 08/10/2018 12:28 PM, Arend van Spriel wrote:
> On 8/10/2018 3:20 PM, Toke Høiland-Jørgensen wrote:
>> Arend van Spriel <[email protected]> writes:
>>
>>> On 8/8/2018 9:00 PM, Peter Oh wrote:
>>>>
>>>>
>>>> On 08/08/2018 03:40 AM, Wen Gong wrote:
>>>>> Add a field for ath10k to adjust the sk_pacing_shift, mac80211 set
>>>>> the default value to 8, and ath10k will change it to 6. Then mac80211
>>>>> will use the changed value 6 as sk_pacing_shift since 6 is the best
>>>>> value for tx throughput by test result.
>>>> I don't think you can convince people with the numbers unless you
>>>> provide latency along with the numbers and also measurement result on
>>>> different chipsets as Michal addressed (QCA4019, QCA9984, etc.) From
>>>> users' viewpoint, I also agree with Toke that we cannot sacrifice latency
>>>> for the small throughput improvement.
>>>
>>> Yeah. The wireless industry (admittedly that is me too :-p ) has been
>>> focused on just throughput long enough.
>>
>> Tell me about it ;)
>>
>>> All the preaching about bufferbloat from Dave and others is (just)
>>> starting to sink in here and there.
>>
>> Yeah, I've noticed; this is good!
>>
>>> Now as for the value of the sk_pacing_shift I think we agree it
>>> depends on the specific device so in that sense the api makes sense,
>>> but I think there are a lot of variables so I was wondering if we
>>> could introduce a sysctl parameter for it. Does that make sense?
>>
>> I'm not sure a sysctl parameter would make sense; for one thing, it
>> would be global for the host, while different network interfaces will
>> probably need different values. And for another, I don't think it's
>> something a user can reasonably be expected to set correctly, and I
>> think it *is* actually possible to pick a value that works well at the
>> driver level.
>
> I'm not sure either. Do you think a user could come up with something like this (found here [1]):
>
> sysctl -w net.core.rmem_max=8388608
> sysctl -w net.core.wmem_max=8388608
> sysctl -w net.core.rmem_default=65536
> sysctl -w net.core.wmem_default=65536
> sysctl -w net.ipv4.tcp_rmem='4096 87380 8388608'
> sysctl -w net.ipv4.tcp_wmem='4096 65536 8388608'
> sysctl -w net.ipv4.tcp_mem='8388608 8388608 8388608'
> sysctl -w net.ipv4.route.flush=1
>
> Now the page listing this config claims this is for use "on Linux 2.4+ for high-bandwidth applications". Beats me if it still is correct in 4.17.
>
> Anyway, sysctl is nice for parameterizing code that is built-in the kernel so you don't need to rebuild it. mac80211 tends to be a module in most distros so
> maybe sysctl is not a good fit. So let's agree on that.
>
> Picking a value at driver level may be possible, but a driver tends to support a number of different devices. So how do you see the picking working? Some static
> table with entries for the different devices?

Some users are not going to care about latency, and for others, latency may
be absolutely important and they don't care about bandwidth.

So, it should be tunable. sysctl can support per network-device settings,
right? Or, probably could use ethtool API to set a per-netdev value as well.
That might be nice for other network devices as well, not just wifi.

If the driver is configuring the defaults, it can know the hardware type, firmware
revision, and lots of other info to make the best decision it can when registering
the radio with the upper stacks.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2018-08-09 11:56:46

by Arend Van Spriel

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Change sk_pacing_shift in ieee80211_hw for best tx throughput

On 8/8/2018 9:00 PM, Peter Oh wrote:
>
>
> On 08/08/2018 03:40 AM, Wen Gong wrote:
>> Add a field for ath10k to adjust the sk_pacing_shift, mac80211 set
>> the default value to 8, and ath10k will change it to 6. Then mac80211
>> will use the changed value 6 as sk_pacing_shift since 6 is the best
>> value for tx throughput by test result.
> I don't think you can convince people with the numbers unless you
> provide latency along with the numbers and also measurement result on
> different chipsets as Michal addressed (QCA4019, QCA9984, etc.) From
> users' viewpoint, I also agree with Toke that we cannot sacrifice latency
> for the small throughput improvement.

Yeah. The wireless industry (admittedly that is me too :-p ) has been
focused on just throughput long enough. All the preaching about
bufferbloat from Dave and others is (just) starting to sink in here and
there.

Now as for the value of the sk_pacing_shift I think we agree it depends
on the specific device so in that sense the api makes sense, but I think
there are a lot of variables so I was wondering if we could introduce a
sysctl parameter for it. Does that make sense?

Regards,
Arend

2018-08-10 15:49:59

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Change sk_pacing_shift in ieee80211_hw for best tx throughput

Arend van Spriel <[email protected]> writes:

> On 8/8/2018 9:00 PM, Peter Oh wrote:
>>
>>
>> On 08/08/2018 03:40 AM, Wen Gong wrote:
>>> Add a field for ath10k to adjust the sk_pacing_shift, mac80211 set
>>> the default value to 8, and ath10k will change it to 6. Then mac80211
>>> will use the changed value 6 as sk_pacing_shift since 6 is the best
>>> value for tx throughput by test result.
>> I don't think you can convince people with the numbers unless you
>> provide latency along with the numbers and also measurement result on
>> different chipsets as Michal addressed (QCA4019, QCA9984, etc.) From
>> users' viewpoint, I also agree with Toke that we cannot sacrifice latency
>> for the small throughput improvement.
>
> Yeah. The wireless industry (admittedly that is me too :-p ) has been
> focused on just throughput long enough.

Tell me about it ;)

> All the preaching about bufferbloat from Dave and others is (just)
> starting to sink in here and there.

Yeah, I've noticed; this is good!

> Now as for the value of the sk_pacing_shift I think we agree it
> depends on the specific device so in that sense the api makes sense,
> but I think there are a lot of variables so I was wondering if we
> could introduce a sysctl parameter for it. Does that make sense?

I'm not sure a sysctl parameter would make sense; for one thing, it
would be global for the host, while different network interfaces will
probably need different values. And for another, I don't think it's
something a user can reasonably be expected to set correctly, and I
think it *is* actually possible to pick a value that works well at the
driver level.

-Toke

2018-08-31 19:45:00

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Peter Oh <[email protected]> writes:

> Hi Toke,
>
>
> On 08/17/2018 04:32 AM, Toke Høiland-Jørgensen wrote:
>>
>>>>> Shift 6:
>>>>> [email protected]:~/flent$ flent -H 192.168.1.7 -t
>>>>> "sk_pacing_shift6" tcp_nup --test-parameter upload_streams=1
>>>>
> Does flent have options to set TOS or QoS queue for TCP or UDP test?
> If I add "--test-parameter burst_tos=0xe0", will it use corresponding
> mac80211 QoS queue?

Yes, for some tests:

- The RRUL test will run four TCP flows in each direction, one marked for
each 802.11 QoS queue.

- For the rtt_fair* tests you can set them with a test parameter:
--test-parameter markings=BE,CS1 to set flows 1 and 2 to BE and CS1
respectively. You can use the rtt_fair_var* tests to get a variable
number of flows (the same host can be specified multiple times by having
different hostnames resolve to the same IP).

-Toke

2018-08-08 21:22:07

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Change sk_pacing_shift in ieee80211_hw for best tx throughput



On 08/08/2018 03:40 AM, Wen Gong wrote:
> Add a field for ath10k to adjust the sk_pacing_shift, mac80211 set
> the default value to 8, and ath10k will change it to 6. Then mac80211
> will use the changed value 6 as sk_pacing_shift since 6 is the best
> value for tx throughput by test result.
I don't think you can convince people with the numbers unless you
provide latency along with the numbers and also measurement result on
different chipsets as Michal addressed (QCA4019, QCA9984, etc.) From
users' viewpoint, I also agree with Toke that we cannot sacrifice latency
for the small throughput improvement.

Thanks,
Peter

2018-09-05 04:10:41

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

On Mon, Sep 3, 2018 at 6:35 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Johannes Berg <[email protected]> writes:
....
> > Grant's data shows a significant difference between 6 and 7 for both
> > latency and throughput:

Minor nit: this is wgong's data more thoughtfully processed.

> >
> > * median tpt
> > - ~241 vs ~201 (both 1 and 5 streams)
> > * median latency
> > - 7.5 vs 6 (1 stream)
> > - 17.3 vs. 16.6 (5 streams)
> >
> > A 20% throughput improvement at <= 1.5ms latency cost seems like a
> > pretty reasonable trade-off?
>
> Yeah, on its face. What I'm bothered about is that it is the exact
> opposite of the results I got from my ath10k tests (there, throughput
> *dropped* and latency doubled when going from 4 to 16 ms of
> buffering).

Hrm, yeah...that would bother me too. I think even if we don't
understand why/how that happened, at some level we need to allow
subsystems or drivers to adjust the sk_pacing_shift value. Changing
sk_pacing_shift clearly has an effect that can be optimized.

If smaller values of sk_pacing_shift increase the interval (and
allow more buffering), I'm wondering why CPU utilization gets higher.
More buffering is usually more efficient. :/

wgong: can you confirm (a) I've entered the data correctly in the
spreadsheet and (b) you've labeled the data sets correctly when you
generated the data?
If either of us made a mistake, it would be good to know. :)

I'm going to speculate that "other processing" (e.g. device level
interrupt mitigation or possibly CPU scheduling behaviors which
handles TX/RX completion) could cause a "bathtub" effect similar to
the performance graphs that originally got NAPI accepted into the
kernel ~15 years ago. So the "optimal" value could be different for
different architectures and different IO devices (which have different
max link rates and different host bus interconnects). But honestly, I
don't understand all the details of sk_pacing_shift value nearly as
well as just about anyone else reading this thread.

> And, well, Grant's data is from a single test in a noisy
> environment where the time series graph shows that throughput is all over
> the place for the duration of the test; so it's hard to draw solid
> conclusions from (for instance, for the 5-stream test, the average
> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
> used in this test, so I can't go verify it myself; so the only thing I
> can do is grumble about it here... :)

It's a fair complaint and I agree with it. My counter argument is the
opposite is true too: most ideal benchmarks don't measure what most
users see. While the data wgong provided are way more noisy than I
like, my overall "confidence" in the "conclusion" I offered is still
positive.

cheers,
grant

2018-09-03 13:58:02

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

On Thu, 2018-08-30 at 16:32 -0700, Grant Grundler wrote:
> On Sun, Aug 12, 2018 at 10:37 PM Wen Gong <[email protected]> wrote:
> ...
> > I have tested with your method for shift 6/8/10/7, each twice, in an open air environment.
>
> I've attached the graphed data which Wen provided me and I made one
> "cleanup": the median and average "ICMP" are computed only using
> values where flent has an "antagonist" load running. The first 28 and
> last 26 seconds of the test are "idle" - it makes no sense to include
> the latency of those. This drops about 1/7th of the data points.
>
> My observations:
> o the delta between average and median is evidence this is not
> particularly reliable data (but expected result for "open air" wifi
> testing in a corp environment).
> o sk_pacing_shift=8 and/or =7 appears to have suffered from
> significant open air interference - I've attached a graph of the raw
> data.
> o I'm comfortable asserting throughput is higher and latency is lower
> for sk_pacing_shift=6 (vs 7) at the cost of additional CPU
> utilization.
>
> My questions:
> 1) Is the current conversation just quibbling about 6 vs 7 for the
> ath10k default?

6 vs. 8, I think? But I didn't follow the full discussion.

I also think that we shouldn't necessarily restrict this to "for the
ath10k". Is there any reason to think that this would be different for
different chips?

I could see a reason for a driver to set the default *higher* than the
mac80211 default (needing more buffers for whatever internal reason),
but I can't really see a reason it should be *lower*. Also, the
networking-stack wide default is 0, after all, so it seems that each new
layer might have some knowledge about why to *increase* it (like
mac80211 knows that A-MPDUs/A-MSDUs need to be built), but it seems not
so reasonable that it should be reduced again. Unless you have some
specific knowledge that ath10k would have a smaller A-MSDU/A-MPDU size
limit or something?

> 2) If yes, can both patches be accepted and then later add code to
> select a default based on CPU available?

I've applied the first patch now (with some comment cleanups, you may
want to pick those up for your internal usage) because I can see an
argument for increasing it.

However, I think that the "quibbling" over the default should most
likely result in a changed default value for mac80211, rather than
ath10k using the driver setting as a way out of that discussion.

johannes

2018-09-03 17:55:25

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Johannes Berg <[email protected]> writes:

> On Mon, 2018-09-03 at 13:11 +0200, Toke Høiland-Jørgensen wrote:
>
>> > 6 vs. 8, I think? But I didn't follow the full discussion.
>
> Err, I just realized that I was completely wrong - the default, of
> course, is 10. So smaller values mean more buffering.
>
> Most of my argumentation was thus utter garbage :-)

Well, I got the gist, even if the sign bit was wrong ;)

>> > I also think that we shouldn't necessarily restrict this to "for the
>> > ath10k". Is there any reason to think that this would be different for
>> > different chips?
>>
>> No, I also think it should be possible to select a value that will work
>> for all drivers. There's a strong diminishing returns effect here after
>> a certain point, and I believe we should strongly push back against just
>> adding more buffering to chase (single-flow) throughput numbers; and I'm
>> not convinced that having different values for different drivers is
>> worth the complexity.
>
> I think I can see some point in it if the driver requires some
> buffering for some specific reason? But you'd have to be able to state
> that reason, I guess, I could imagine having a firmware limitation to
> need to build two aggregates, or something around MU-MIMO, etc.

Right, I'm not ruling out that there can be legitimate reasons to add
extra buffering; but a lot of times it's just used to paper over other
issues, so a good explanation is definitely needed...

>> As far as the actual value, I *think* it may be that the default shift
>> should be 7 (8 ms) rather than 8 (4 ms) as it currently is. Going back
>> and looking at my data from when I submitted the original patch, it
>> looks like the point of diminishing returns is somewhere between those
>> two with ath9k (where I did most of my testing), and it seems reasonable
>> that it could be slightly higher (i.e., more buffering) for ath10k.
>
> Grant's data shows a significant difference between 6 and 7 for both
> latency and throughput:
>
> * median tpt
> - ~241 vs ~201 (both 1 and 5 streams)
> * median latency
> - 7.5 vs 6 (1 stream)
> - 17.3 vs. 16.6 (5 streams)
>
> A 20% throughput improvement at <= 1.5ms latency cost seems like a
> pretty reasonable trade-off?

Yeah, on its face. What I'm bothered about is that it is the exact
opposite of the results that I got from my ath10k tests (there, throughput
*dropped* and latency doubled when going from 4 to 16 ms of
buffering). And, well, Grant's data is from a single test in a noisy
environment where the time series graph shows that throughput is all over
the place for the duration of the test; so it's hard to draw solid
conclusions from (for instance, for the 5-stream test, the average
throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
it's 326 and 371 Mbps). Unfortunately I don't have the same hardware
used in this test, so I can't go verify it myself; so the only thing I
can do is grumble about it here... :)

-Toke

2018-09-05 11:52:00

by Wen Gong

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

On 2018-09-05 07:43, Grant Grundler wrote:
> On Mon, Sep 3, 2018 at 6:35 AM Toke Høiland-Jørgensen <[email protected]>
> wrote:
>>
>> Johannes Berg <[email protected]> writes:
> ....
>> > Grant's data shows a significant difference between 6 and 7 for both
>> > latency and throughput:
>
> Minor nit: this is wgong's data more thoughtfully processed.

>
> wgong: can you confirm (a) I've entered the data correctly in the
> spreadsheet and (b) you've labeled the data sets correctly when you
> generated the data?
> If either of us made a mistake, it would be good to know. :)
>
Yes, the data is the test result from the flent tool.

2018-09-03 15:31:03

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Johannes Berg <[email protected]> writes:

> On Thu, 2018-08-30 at 16:32 -0700, Grant Grundler wrote:
>> On Sun, Aug 12, 2018 at 10:37 PM Wen Gong <[email protected]> wrote:
>> ...
>> > I have tested with your method for shift 6/8/10/7, each twice, in an open air environment.
>>
>> I've attached the graphed data which Wen provided me and I made one
>> "cleanup": the median and average "ICMP" are computed only using
>> values where flent has an "antagonist" load running. The first 28 and
>> last 26 seconds of the test are "idle" - it makes no sense to include
>> the latency of those. This drops about 1/7th of the data points.
>>
>> My observations:
>> o the delta between average and median is evidence this is not
>> particularly reliable data (but expected result for "open air" wifi
>> testing in a corp environment).
>> o sk_pacing_shift=8 and/or =7 appears to have suffered from
>> significant open air interference - I've attached a graph of the raw
>> data.
>> o I'm comfortable asserting throughput is higher and latency is lower
>> for sk_pacing_shift=6 (vs 7) at the cost of additional CPU
>> utilization.
>>
>> My questions:
>> 1) Is the current conversation just quibbling about 6 vs 7 for the
>> ath10k default?
>
> 6 vs. 8, I think? But I didn't follow the full discussion.
>
> I also think that we shouldn't necessarily restrict this to "for the
> ath10k". Is there any reason to think that this would be different for
> different chips?

No, I also think it should be possible to select a value that will work
for all drivers. There's a strong diminishing returns effect here after
a certain point, and I believe we should strongly push back against just
adding more buffering to chase (single-flow) throughput numbers; and I'm
not convinced that having different values for different drivers is
worth the complexity.

As far as I'm concerned, just showing an increase in maximum throughput
under ideal conditions (as this initial patch submission did) is not
sufficient argument for adding more buffering. At a minimum, the
evaluation should also include:

- Impact on latency and throughput for competing flows in the same test.
- Impact on latency for the TCP flow itself.
- A full evaluation at low rates and in scenarios with more than one
client.

As far as the actual value, I *think* it may be that the default shift
should be 7 (8 ms) rather than 8 (4 ms) as it currently is. Going back
and looking at my data from when I submitted the original patch, it
looks like the point of diminishing returns is somewhere between those
two with ath9k (where I did most of my testing), and it seems reasonable
that it could be slightly higher (i.e., more buffering) for ath10k.

-Toke

2018-09-03 16:07:20

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

On Mon, 2018-09-03 at 13:11 +0200, Toke Høiland-Jørgensen wrote:

> > 6 vs. 8, I think? But I didn't follow the full discussion.

Err, I just realized that I was completely wrong - the default, of
course, is 10. So smaller values mean more buffering.

Most of my argumentation was thus utter garbage :-)
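[For readers following along: sk_pacing_shift expresses the TCP Small
Queues budget as a power-of-two fraction of one second's worth of data at
the pacing rate, so each decrement of the shift doubles the buffering
interval. A quick sketch of that relationship (in-kernel rounding and
minimum-quantum details are ignored here):

```python
def pacing_interval_ms(sk_pacing_shift):
    """Approximate wall-clock worth of data TCP Small Queues keeps in
    flight below the socket: 2^-shift seconds, expressed in ms."""
    return 1000.0 / (1 << sk_pacing_shift)

# The values discussed in this thread:
for shift in (10, 8, 7, 6):
    print(shift, pacing_interval_ms(shift))
```

This reproduces the figures quoted in the thread: shift 10 is ~1 ms,
shift 8 is ~4 ms, shift 7 is ~8 ms, and shift 6 is ~16 ms of buffering.]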

> > I also think that we shouldn't necessarily restrict this to "for the
> > ath10k". Is there any reason to think that this would be different for
> > different chips?
>
> No, I also think it should be possible to select a value that will work
> for all drivers. There's a strong diminishing returns effect here after
> a certain point, and I believe we should strongly push back against just
> adding more buffering to chase (single-flow) throughput numbers; and I'm
> not convinced that having different values for different drivers is
> worth the complexity.

I think I can see some point in it if the driver requires some buffering
for some specific reason? But you'd have to be able to state that
reason, I guess, I could imagine having a firmware limitation to need to
build two aggregates, or something around MU-MIMO, etc.

> As far as I'm concerned, just showing an increase in maximum throughput
> under ideal conditions (as this initial patch submission did) is not
> sufficient argument for adding more buffering. At a minimum, the
> evaluation should also include:
>
> - Impact on latency and throughput for competing flows in the same test.
> - Impact on latency for the TCP flow itself.
> - A full evaluation at low rates and in scenarios with more than one
> client.

That seems reasonable.

> As far as the actual value, I *think* it may be that the default shift
> should be 7 (8 ms) rather than 8 (4 ms) as it currently is. Going back
> and looking at my data from when I submitted the original patch, it
> looks like the point of diminishing returns is somewhere between those
> two with ath9k (where I did most of my testing), and it seems reasonable
> that it could be slightly higher (i.e., more buffering) for ath10k.

Grant's data shows a significant difference between 6 and 7 for both
latency and throughput:

* median tpt
- ~241 vs ~201 (both 1 and 5 streams)
* median latency
- 7.5 vs 6 (1 stream)
- 17.3 vs. 16.6 (5 streams)

A 20% throughput improvement at <= 1.5ms latency cost seems like a
pretty reasonable trade-off?

johannes

2018-09-03 19:56:12

by Dave Taht

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

While I'm being overly vague and general...

* Doing some form of ack compression (whether in the tcp stack or in
mac80211) seems useful. It's really
hard to fill 4Mbytes one way and the needed number of acks the other.
I'd also pointed out (on some thread on the make-wifi-fast list that I
can't find right now) that drivers had a tendency to eat an entire
txop for that first single ack and ping pong between empty and full on
a one way tcp benchmark.

* Similarly, grabbing air during the contention window. If the sender
is grabbing 3x as much air as the receiver:

(A1 A2 A3 B1 A4 B2)

* I know of one org that upped the codel target value for 802.11ac
with decent results but I can't talk about it...
(sort of why I asked for a cap). My problem with decreasing the pacing
shift is that it makes for less interleaving (I think! have to look),
where increasing the target compensates for more jittery networks.
('course my overall theme of my prior post was to reduce jittery
networks with less buffering and smaller txops overall)

* Is this a two device test? or a test through an AP?

Looking over this thread, I see a suggestion is to expose the pacing
rate via a sysctl. I don't think there's
any one answer, either, but I'm not sure this is the right variable to
expose. Perhaps a tunable for
latency that might (one day) look at multiple variables and decide?

Getting this more right is really a job for an algorithm of
minstrel-like complexity.

2018-09-06 14:53:32

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Grant Grundler <[email protected]> writes:

>> And, well, Grant's data is from a single test in a noisy
>> environment where the time series graph shows that throughput is all over
>> the place for the duration of the test; so it's hard to draw solid
>> conclusions from (for instance, for the 5-stream test, the average
>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>> used in this test, so I can't go verify it myself; so the only thing I
>> can do is grumble about it here... :)
>
> It's a fair complaint and I agree with it. My counter argument is the
> opposite is true too: most ideal benchmarks don't measure what most
> users see. While the data wgong provided are way more noisy than I
> like, my overall "confidence" in the "conclusion" I offered is still
> positive.

Right. I guess I would just prefer a slightly more comprehensive
evaluation to base a 4x increase in buffer size on...

-Toke

2018-09-03 19:18:32

by Dave Taht

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

I have not been on this thread (I have had to shut down my wifi lab
and am not planning on doing any more wifi work in the future). But
for what it's worth:

* tcp_pacing_shift affects the performance of the tcp stack only. I
think 16ms is an outrageous amount, and I'm thinking we actually have
a problem on the other side of the link in getting acks out as I
write. That said,
I don't care all that much compared to the bad old days (
http://blog.cerowrt.org/tags/wifi ) and 16ms
is far better than the seconds it used to be.

That said, I'd rather appreciate a test on HT20 with the same chipset
at a lower rate. The hope was that
we'd achieve flat latency at every rate (see
http://blog.cerowrt.org/post/real_results/ for some lovely pics)

Flent can sample minstrel stats but I think we're SOL on the ath10k.

* The irtt test can measure one way delays pretty accurately so you
can see if the other side is actually the source of this issue.

* Not clear to me if the AP is also running the fq_codel for wifi
code? That, um, er, helps an AP enormously....

* Having an aircap and regular capture of what's going on during these
tests would be useful. Flent is a great test driver, but a tcpdump
taken during a flent test tells the truth via tcptrace and xplot.org.
We can inspect cwnd, rwnd, and look at drops. (actually I usually turn
ECN on so I can see when the codel algorithm is kicking in). Being
able to zoom in on the early ramp would be helpful. Etc.

I'll take a look at them if you'll supply them.

* At some point the ath10k driver also gained NAPI support. (?) I
don't think NAPI is a good fit for wifi. Delays in TCP RX processing
lead to less throughput.

* The firmware is perpetually trying to do more stuff with less
interrupts. Me, I dreamed of a register you could poll for "when will
the air be clear", and a "tx has one ms left to go, give me some more
data" interrupt. The fact we still have a minimum of ~12ms of
uncontrolled buffering bugs me. If we could just hold off submittal
until just before it was needed, we could cut it in half (and further,
by 10s of ms, when the media is contended).

* My general recommendation for contended APs was that they advertise a
reduced TXOP as the number of active stations go up. (this also
applies to the VI/VO queues). Never got around to algorizing it... I
gave up on my links and, disable VO/VI/BK entirely and just tell
hostapd to advertise 2ms. Yep, kills conventional measures of
throughput. Cuts inter-station service time by 2/3s though, and *that*
you can notice and measure with PLT benchmarks and overall feel.

This implies increasing the pacing shift to suit the advertised size
of the txop, not decreasing it - and I have no idea how to get that
from one place to another. Same goes for stuff destined for the VI/VO
queues.

* (also it would be good to see in the capture what the beacon says)

* Powersave has always been a problem...

* Cubic is a terrible tcp for wifi. BBR, however, is possibly worse...
or better. It would be educational to try running BBR on this link at
either shift setting. (seriously!) TCP is not TCP is not TCP....

Lastly... tcp_pacing_shift does not affect the number of other packets
that can get into a given txop given the other buffering - it's one in
the hardware, one ready to go, one being formed; try setting
pacing_shift to 4 to see what happens....

2019-02-20 19:15:18

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Grant Grundler <[email protected]> writes:
>
> >> And, well, Grant's data is from a single test in a noisy
> >> environment where the time series graph shows that throughput is all over
> >> the place for the duration of the test; so it's hard to draw solid
> >> conclusions from (for instance, for the 5-stream test, the average
> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
> >> used in this test, so I can't go verify it myself; so the only thing I
> >> can do is grumble about it here... :)
> >
> > It's a fair complaint and I agree with it. My counter argument is the
> > opposite is true too: most ideal benchmarks don't measure what most
> > users see. While the data wgong provided are way more noisy than I
> > like, my overall "confidence" in the "conclusion" I offered is still
> > positive.
>
> Right. I guess I would just prefer a slightly more comprehensive
> evaluation to base a 4x increase in buffer size on...

Kalle, is this why you didn't accept this patch? Other reasons?

Toke, what else would you like to see evaluated?

I generally want to see three things measured when "benchmarking"
technologies: throughput, latency, cpu utilization
We've covered those three I think "reasonably".

What does a "4x increase in memory" mean here? Wen, how much more
memory does this cause ath10k to use?

If a "4x increase in memory" means I'm using 1MB instead of 256KB, I'm
not going worry about that on a system with 2GB-16GB of RAM if it
doubles the throughput of the WIFI for a given workload. I expect
routers with 128-256MB RAM would make that tradeoff as well assuming
they don't have other RAM-demanding workload.
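[To put rough numbers on the buffer-size question: the TSQ in-flight
budget is pacing_rate >> sk_pacing_shift bytes. At an assumed ~500 Mbit/s
pacing rate (a plausible 11ac figure for illustration, not a measured
one), the shift 8 -> 6 change grows the budget from roughly 240 KB to
roughly 1 MB, which matches the 256KB/1MB ballpark:

```python
def tsq_budget_bytes(pacing_rate_bits_per_s, sk_pacing_shift):
    """In-flight byte budget: one 2^-shift-second slice of the pacing
    rate, converted from bits/s to bytes/s before shifting."""
    return (pacing_rate_bits_per_s // 8) >> sk_pacing_shift

assumed_rate = 500_000_000  # bits/s; hypothetical 11ac pacing rate
print(tsq_budget_bytes(assumed_rate, 8))  # ~244 KB
print(tsq_budget_bytes(assumed_rate, 6))  # ~977 KB, roughly a 4x increase
```
]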

cheers,
grant

2019-02-21 04:39:53

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Grant Grundler <[email protected]> writes:

> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>
>> Grant Grundler <[email protected]> writes:
>>
>> >> And, well, Grant's data is from a single test in a noisy
>> >> environment where the time series graph shows that throughput is all over
>> >> the place for the duration of the test; so it's hard to draw solid
>> >> conclusions from (for instance, for the 5-stream test, the average
>> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>> >> used in this test, so I can't go verify it myself; so the only thing I
>> >> can do is grumble about it here... :)
>> >
>> > It's a fair complaint and I agree with it. My counter argument is the
>> > opposite is true too: most ideal benchmarks don't measure what most
>> > users see. While the data wgong provided are way more noisy than I
>> > like, my overall "confidence" in the "conclusion" I offered is still
>> > positive.
>>
>> Right. I guess I would just prefer a slightly more comprehensive
>> evaluation to base a 4x increase in buffer size on...
>
> Kalle, is this why you didn't accept this patch? Other reasons?

Just for reference, here are patchwork links:

https://patchwork.kernel.org/cover/10559729/
https://patchwork.kernel.org/patch/10559731/
https://patchwork.kernel.org/patch/10559733/

The mac80211 patch is accepted and the ath10k patch is deferred. IIRC I
put it deferred as I didn't see a real consensus about the patch and was
supposed to look at it later, but haven't done yet. I don't have any
issues with the patch, except maybe removing the duplicate define for
9377 (which I can easily fix in the pending branch).

--
Kalle Valo

2019-02-21 15:43:01

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Grant Grundler <[email protected]> writes:

> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>
>> Grant Grundler <[email protected]> writes:
>>
>> >> And, well, Grant's data is from a single test in a noisy
>> >> environment where the time series graph shows that throughput is all over
>> >> the place for the duration of the test; so it's hard to draw solid
>> >> conclusions from (for instance, for the 5-stream test, the average
>> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>> >> used in this test, so I can't go verify it myself; so the only thing I
>> >> can do is grumble about it here... :)
>> >
>> > It's a fair complaint and I agree with it. My counter argument is the
>> > opposite is true too: most ideal benchmarks don't measure what most
>> > users see. While the data wgong provided are way more noisy than I
>> > like, my overall "confidence" in the "conclusion" I offered is still
>> > positive.
>>
>> Right. I guess I would just prefer a slightly more comprehensive
>> evaluation to base a 4x increase in buffer size on...
>
> Kalle, is this why you didn't accept this patch? Other reasons?
>
> Toke, what else would you like to see evaluated?
>
> I generally want to see three things measured when "benchmarking"
> technologies: throughput, latency, cpu utilization
> We've covered those three I think "reasonably".

Hmm, going back and looking at this (I'd completely forgotten about this
patch), I think I had two main concerns:

1. What happens in a degraded signal situation, where the throughput is
limited by the signal conditions, or by contention with other devices.
Both of these happen regularly, and I worry that latency will be
badly affected under those conditions.

2. What happens with old hardware that has worse buffer management in
the driver->firmware path (especially drivers without push/pull mode
support)? For these, the lower-level queueing structure is less
effective at controlling queueing latency.

Getting the queue size limit patches from ChromeOS ported would
alleviate point 2. I do believe Kan said he'd look into that once the
airtime patches were merged. So Kan, any progress on that front? :)

> What does a "4x increase in memory" mean here? Wen, how much more
> memory does this cause ath10k to use?

I didn't say "memory", I said "buffer size"... :)
I.e., it's the latency impact of the increased buffering I'm worried
about (see above), not the system memory usage.

-Toke

2019-02-21 16:10:20

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Toke Høiland-Jørgensen <[email protected]> writes:

> Grant Grundler <[email protected]> writes:
>
>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>>
>>> Grant Grundler <[email protected]> writes:
>>>
>>> >> And, well, Grant's data is from a single test in a noisy
>>> >> environment where the time series graph shows that throughput is all over
>>> >> the place for the duration of the test; so it's hard to draw solid
>>> >> conclusions from (for instance, for the 5-stream test, the average
>>> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>> >> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>>> >> used in this test, so I can't go verify it myself; so the only thing I
>>> >> can do is grumble about it here... :)
>>> >
>>> > It's a fair complaint and I agree with it. My counter argument is the
>>> > opposite is true too: most ideal benchmarks don't measure what most
>>> > users see. While the data wgong provided are way more noisy than I
>>> > like, my overall "confidence" in the "conclusion" I offered is still
>>> > positive.
>>>
>>> Right. I guess I would just prefer a slightly more comprehensive
>>> evaluation to base a 4x increase in buffer size on...
>>
>> Kalle, is this why you didn't accept this patch? Other reasons?
>>
>> Toke, what else would you like to see evaluated?
>>
>> I generally want to see three things measured when "benchmarking"
>> technologies: throughput, latency, cpu utilization
>> We've covered those three I think "reasonably".
>
> Hmm, going back and looking at this (I'd completely forgotten about this
> patch), I think I had two main concerns:
>
> 1. What happens in a degraded signal situation, where the throughput is
> limited by the signal conditions, or by contention with other devices.
> Both of these happen regularly, and I worry that latency will be
> badly affected under those conditions.
>
> 2. What happens with old hardware that has worse buffer management in
> the driver->firmware path (especially drivers without push/pull mode
> support)? For these, the lower-level queueing structure is less
> effective at controlling queueing latency.

Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
PCI devices, which IIRC do not even support push/pull mode. All the
rest, including QCA988X and QCA9984 are unaffected.

--
Kalle Valo

2019-02-21 16:22:19

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

On 2/21/19 8:10 AM, Kalle Valo wrote:
> Toke Høiland-Jørgensen <[email protected]> writes:
>
>> Grant Grundler <[email protected]> writes:
>>
>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>>>
>>>> Grant Grundler <[email protected]> writes:
>>>>
>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>> environment where the time series graph shows that throughput is all over
>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>> it's 326 and 371 Mbps) . Unfortunately I don't have the same hardware
>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>> can do is grumble about it here... :)
>>>>>
>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>> users see. While the data wgong provided are way more noisy than I
>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>> positive.
>>>>
>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>> evaluation to base a 4x increase in buffer size on...
>>>
>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>
>>> Toke, what else would you like to see evaluated?
>>>
>>> I generally want to see three things measured when "benchmarking"
>>> technologies: throughput, latency, cpu utilization
>>> We've covered those three I think "reasonably".
>>
>> Hmm, going back and looking at this (I'd completely forgotten about this
>> patch), I think I had two main concerns:
>>
>> 1. What happens in a degraded signal situation, where the throughput is
>> limited by the signal conditions, or by contention with other devices.
>> Both of these happen regularly, and I worry that latency will be
>> badly affected under those conditions.
>>
>> 2. What happens with old hardware that has worse buffer management in
>> the driver->firmware path (especially drivers without push/pull mode
>> support)? For these, the lower-level queueing structure is less
>> effective at controlling queueing latency.
>
> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
> PCI devices, which IIRC do not even support push/pull mode. All the
> rest, including QCA988X and QCA9984 are unaffected.

Just as a note, at least kernels such as 4.14.whatever perform poorly when
running ath10k on 9984 when acting as TCP endpoints. This makes them not
really usable for stuff like serving video to lots of clients.

Tweaking TCP (I do it a bit differently, but either way) can significantly
improve performance.

Recently I helped a user that could get barely 70 stations streaming at 1Mbps
on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
and we got 110 working with a tweaked TCP stack. These were /n stations too.

I think it is lame that it _still_ requires out of tree patches to make TCP work
well on ath10k...even if you want to default to current behaviour, you should
allow users to tweak it to work with their use case.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2019-02-21 16:28:17

by Toke Høiland-Jørgensen

Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Kalle Valo <[email protected]> writes:

> Toke Høiland-Jørgensen <[email protected]> writes:
>
>> Grant Grundler <[email protected]> writes:
>>
>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>>>
>>>> Grant Grundler <[email protected]> writes:
>>>>
>>>> >> And, well, Grant's data is from a single test in a noisy
>>>> >> environment where the time series graph shows that throughput is all over
>>>> >> the place for the duration of the test; so it's hard to draw solid
>>>> >> conclusions from (for instance, for the 5-stream test, the average
>>>> >> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>> >> it's 326 and 371 Mbps). Unfortunately I don't have the same hardware
>>>> >> used in this test, so I can't go verify it myself; so the only thing I
>>>> >> can do is grumble about it here... :)
>>>> >
>>>> > It's a fair complaint and I agree with it. My counter argument is the
>>>> > opposite is true too: most ideal benchmarks don't measure what most
>>>> > users see. While the data wgong provided are way more noisy than I
>>>> > like, my overall "confidence" in the "conclusion" I offered is still
>>>> > positive.
>>>>
>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>> evaluation to base a 4x increase in buffer size on...
>>>
>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>
>>> Toke, what else would you like to see evaluated?
>>>
>>> I generally want to see three things measured when "benchmarking"
>>> technologies: throughput, latency, cpu utilization
>>> We've covered those three I think "reasonably".
>>
>> Hmm, going back and looking at this (I'd completely forgotten about this
>> patch), I think I had two main concerns:
>>
>> 1. What happens in a degraded signal situation, where the throughput is
>> limited by the signal conditions, or by contention with other devices.
>> Both of these happen regularly, and I worry that latency will be
>> badly affected under those conditions.
>>
>> 2. What happens with old hardware that has worse buffer management in
>> the driver->firmware path (especially drivers without push/pull mode
>> support)? For these, the lower-level queueing structure is less
>> effective at controlling queueing latency.
>
> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
> PCI devices, which IIRC do not even support push/pull mode. All the
> rest, including QCA988X and QCA9984 are unaffected.

Ah, right; I did not go all the way back and look at the actual patch,
so missed that :)

But in that case, why are the latency results that low? Were these tests
done with the ChromeOS queue limit patches?

-Toke

2019-02-21 16:37:56

by Toke Høiland-Jørgensen

Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Ben Greear <[email protected]> writes:

> On 2/21/19 8:10 AM, Kalle Valo wrote:
>> Toke Høiland-Jørgensen <[email protected]> writes:
>>
>>> Grant Grundler <[email protected]> writes:
>>>
>>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>>>>
>>>>> Grant Grundler <[email protected]> writes:
>>>>>
>>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>>> environment where the time series graph shows that throughput is all over
>>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>>> it's 326 and 371 Mbps). Unfortunately I don't have the same hardware
>>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>>> can do is grumble about it here... :)
>>>>>>
>>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>>> users see. While the data wgong provided are way more noisy than I
>>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>>> positive.
>>>>>
>>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>>> evaluation to base a 4x increase in buffer size on...
>>>>
>>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>>
>>>> Toke, what else would you like to see evaluated?
>>>>
>>>> I generally want to see three things measured when "benchmarking"
>>>> technologies: throughput, latency, cpu utilization
>>>> We've covered those three I think "reasonably".
>>>
>>> Hmm, going back and looking at this (I'd completely forgotten about this
>>> patch), I think I had two main concerns:
>>>
>>> 1. What happens in a degraded signal situation, where the throughput is
>>> limited by the signal conditions, or by contention with other devices.
>>> Both of these happen regularly, and I worry that latency will be
>>> badly affected under those conditions.
>>>
>>> 2. What happens with old hardware that has worse buffer management in
>>> the driver->firmware path (especially drivers without push/pull mode
>>> support)? For these, the lower-level queueing structure is less
>>> effective at controlling queueing latency.
>>
>> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
>> PCI devices, which IIRC do not even support push/pull mode. All the
>> rest, including QCA988X and QCA9984 are unaffected.
>
> Just as a note, at least kernels such as 4.14.whatever perform poorly when
> running ath10k on 9984 when acting as TCP endpoints. This makes them not
> really usable for stuff like serving video to lots of clients.
>
> Tweaking TCP (I do it a bit differently, but either way) can significantly
> improve performance.

Differently how? Did you have to do more than fiddle with the pacing_shift?

> Recently I helped a user that could get barely 70 stations streaming
> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
> and we got 110 working with a tweaked TCP stack. These were /n
> stations too.
>
> I think it is lame that it _still_ requires out of tree patches to
> make TCP work well on ath10k...even if you want to default to current
> behaviour, you should allow users to tweak it to work with their use
> case.

Well, if TCP is broken to the point of being unusable I do think we
should fix it; but I think "just provide a configuration knob" should be
the last resort...

-Toke

2019-02-21 16:57:55

by Ben Greear

Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

On 2/21/19 8:37 AM, Toke Høiland-Jørgensen wrote:
> Ben Greear <[email protected]> writes:
>
>> On 2/21/19 8:10 AM, Kalle Valo wrote:
>>> Toke Høiland-Jørgensen <[email protected]> writes:
>>>
>>>> Grant Grundler <[email protected]> writes:
>>>>
>>>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>>>>>
>>>>>> Grant Grundler <[email protected]> writes:
>>>>>>
>>>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>>>> environment where the time series graph shows that throughput is all over
>>>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>>>> it's 326 and 371 Mbps). Unfortunately I don't have the same hardware
>>>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>>>> can do is grumble about it here... :)
>>>>>>>
>>>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>>>> users see. While the data wgong provided are way more noisy than I
>>>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>>>> positive.
>>>>>>
>>>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>>>> evaluation to base a 4x increase in buffer size on...
>>>>>
>>>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>>>
>>>>> Toke, what else would you like to see evaluated?
>>>>>
>>>>> I generally want to see three things measured when "benchmarking"
>>>>> technologies: throughput, latency, cpu utilization
>>>>> We've covered those three I think "reasonably".
>>>>
>>>> Hmm, going back and looking at this (I'd completely forgotten about this
>>>> patch), I think I had two main concerns:
>>>>
>>>> 1. What happens in a degraded signal situation, where the throughput is
>>>> limited by the signal conditions, or by contention with other devices.
>>>> Both of these happen regularly, and I worry that latency will be
>>>> badly affected under those conditions.
>>>>
>>>> 2. What happens with old hardware that has worse buffer management in
>>>> the driver->firmware path (especially drivers without push/pull mode
>>>> support)? For these, the lower-level queueing structure is less
>>>> effective at controlling queueing latency.
>>>
>>> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
>>> PCI devices, which IIRC do not even support push/pull mode. All the
>>> rest, including QCA988X and QCA9984 are unaffected.
>>
>> Just as a note, at least kernels such as 4.14.whatever perform poorly when
>> running ath10k on 9984 when acting as TCP endpoints. This makes them not
>> really usable for stuff like serving video to lots of clients.
>>
>> Tweaking TCP (I do it a bit differently, but either way) can significantly
>> improve performance.
>
> Differently how? Did you have to do more than fiddle with the pacing_shift?

This one, or a slightly tweaked version that applies to different kernels:

https://github.com/greearb/linux-ct-4.16/commit/3e14e8491a5b31ce994fb2752347145e6ab7eaf5

>> Recently I helped a user that could get barely 70 stations streaming
>> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
>> and we got 110 working with a tweaked TCP stack. These were /n
>> stations too.
>>
>> I think it is lame that it _still_ requires out of tree patches to
>> make TCP work well on ath10k...even if you want to default to current
>> behaviour, you should allow users to tweak it to work with their use
>> case.
>
> Well if TCP is broken to the point of being unusable I do think we
> should fix it; but I think "just provide a configuration knob" should be
> the last resort...

So, it has been broken for years, and waiting for a perfect solution has not
gotten the problem fixed.

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2019-02-21 17:15:34

by Toke Høiland-Jørgensen

Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Ben Greear <[email protected]> writes:

> On 2/21/19 8:37 AM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <[email protected]> writes:
>>
>>> On 2/21/19 8:10 AM, Kalle Valo wrote:
>>>> Toke Høiland-Jørgensen <[email protected]> writes:
>>>>
>>>>> Grant Grundler <[email protected]> writes:
>>>>>
>>>>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>>>>>>
>>>>>>> Grant Grundler <[email protected]> writes:
>>>>>>>
>>>>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>>>>> environment where the time series graph shows that throughput is all over
>>>>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>>>>> it's 326 and 371 Mbps). Unfortunately I don't have the same hardware
>>>>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>>>>> can do is grumble about it here... :)
>>>>>>>>
>>>>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>>>>> users see. While the data wgong provided are way more noisy than I
>>>>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>>>>> positive.
>>>>>>>
>>>>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>>>>> evaluation to base a 4x increase in buffer size on...
>>>>>>
>>>>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>>>>
>>>>>> Toke, what else would you like to see evaluated?
>>>>>>
>>>>>> I generally want to see three things measured when "benchmarking"
>>>>>> technologies: throughput, latency, cpu utilization
>>>>>> We've covered those three I think "reasonably".
>>>>>
>>>>> Hmm, going back and looking at this (I'd completely forgotten about this
>>>>> patch), I think I had two main concerns:
>>>>>
>>>>> 1. What happens in a degraded signal situation, where the throughput is
>>>>> limited by the signal conditions, or by contention with other devices.
>>>>> Both of these happen regularly, and I worry that latency will be
>>>>> badly affected under those conditions.
>>>>>
>>>>> 2. What happens with old hardware that has worse buffer management in
>>>>> the driver->firmware path (especially drivers without push/pull mode
>>>>> support)? For these, the lower-level queueing structure is less
>>>>> effective at controlling queueing latency.
>>>>
>>>> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
>>>> PCI devices, which IIRC do not even support push/pull mode. All the
>>>> rest, including QCA988X and QCA9984 are unaffected.
>>>
>>> Just as a note, at least kernels such as 4.14.whatever perform poorly when
>>> running ath10k on 9984 when acting as TCP endpoints. This makes them not
>>> really usable for stuff like serving video to lots of clients.
>>>
>>> Tweaking TCP (I do it a bit differently, but either way) can significantly
>>> improve performance.
>>
>> Differently how? Did you have to do more than fiddle with the pacing_shift?
>
> This one, or a slightly tweaked version that applies to different kernels:
>
> https://github.com/greearb/linux-ct-4.16/commit/3e14e8491a5b31ce994fb2752347145e6ab7eaf5

Right; but the current mac80211 default (pacing shift 8) corresponds to
setting your sysctl to 4...

>>> Recently I helped a user that could get barely 70 stations streaming
>>> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
>>> and we got 110 working with a tweaked TCP stack. These were /n
>>> stations too.
>>>
>>> I think it is lame that it _still_ requires out of tree patches to
>>> make TCP work well on ath10k...even if you want to default to current
>>> behaviour, you should allow users to tweak it to work with their use
>>> case.
>>
>> Well if TCP is broken to the point of being unusable I do think we
>> should fix it; but I think "just provide a configuration knob" should be
>> the last resort...
>
> So, it has been broken for years, and waiting for a perfect solution
> has not gotten the problem fixed.

Well, the current default should at least be closer to something that
works well.

I do think I may have erred on the wrong side of the optimum when I
submitted the original patch to set the default to 8; that should
probably have been 7 (i.e., 8 ms; the optimum in the evaluation we did
was around 6 ms, which is sadly not a power of two). Maybe changing that
default is actually better than having to redo the testing for all the
different devices as we're discussing in the context of this patch.
Maybe I should just send a patch to do that...

-Toke

2019-02-21 17:29:59

by Ben Greear

Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

On 2/21/19 9:15 AM, Toke Høiland-Jørgensen wrote:
> Ben Greear <[email protected]> writes:
>
>> On 2/21/19 8:37 AM, Toke Høiland-Jørgensen wrote:
>>> Ben Greear <[email protected]> writes:
>>>
>>>> On 2/21/19 8:10 AM, Kalle Valo wrote:
>>>>> Toke Høiland-Jørgensen <[email protected]> writes:
>>>>>
>>>>>> Grant Grundler <[email protected]> writes:
>>>>>>
>>>>>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Grant Grundler <[email protected]> writes:
>>>>>>>>
>>>>>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>>>>>> environment where the time series graph shows that throughput is all over
>>>>>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>>>>>> it's 326 and 371 Mbps). Unfortunately I don't have the same hardware
>>>>>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>>>>>> can do is grumble about it here... :)
>>>>>>>>>
>>>>>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>>>>>> users see. While the data wgong provided are way more noisy than I
>>>>>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>>>>>> positive.
>>>>>>>>
>>>>>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>>>>>> evaluation to base a 4x increase in buffer size on...
>>>>>>>
>>>>>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>>>>>
>>>>>>> Toke, what else would you like to see evaluated?
>>>>>>>
>>>>>>> I generally want to see three things measured when "benchmarking"
>>>>>>> technologies: throughput, latency, cpu utilization
>>>>>>> We've covered those three I think "reasonably".
>>>>>>
>>>>>> Hmm, going back and looking at this (I'd completely forgotten about this
>>>>>> patch), I think I had two main concerns:
>>>>>>
>>>>>> 1. What happens in a degraded signal situation, where the throughput is
>>>>>> limited by the signal conditions, or by contention with other devices.
>>>>>> Both of these happen regularly, and I worry that latency will be
>>>>>> badly affected under those conditions.
>>>>>>
>>>>>> 2. What happens with old hardware that has worse buffer management in
>>>>>> the driver->firmware path (especially drivers without push/pull mode
>>>>>> support)? For these, the lower-level queueing structure is less
>>>>>> effective at controlling queueing latency.
>>>>>
>>>>> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
>>>>> PCI devices, which IIRC do not even support push/pull mode. All the
>>>>> rest, including QCA988X and QCA9984 are unaffected.
>>>>
>>>> Just as a note, at least kernels such as 4.14.whatever perform poorly when
>>>> running ath10k on 9984 when acting as TCP endpoints. This makes them not
>>>> really usable for stuff like serving video to lots of clients.
>>>>
>>>> Tweaking TCP (I do it a bit differently, but either way) can significantly
>>>> improve performance.
>>>
>>> Differently how? Did you have to do more than fiddle with the pacing_shift?
>>
>> This one, or a slightly tweaked version that applies to different kernels:
>>
>> https://github.com/greearb/linux-ct-4.16/commit/3e14e8491a5b31ce994fb2752347145e6ab7eaf5
>
> Right; but the current mac80211 default (pacing shift 8) corresponds to
> setting your sysctl to 4...
>
>>>> Recently I helped a user that could get barely 70 stations streaming
>>>> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
>>>> and we got 110 working with a tweaked TCP stack. These were /n
>>>> stations too.
>>>>
>>>> I think it is lame that it _still_ requires out of tree patches to
>>>> make TCP work well on ath10k...even if you want to default to current
>>>> behaviour, you should allow users to tweak it to work with their use
>>>> case.
>>>
>>> Well if TCP is broken to the point of being unusable I do think we
>>> should fix it; but I think "just provide a configuration knob" should be
>>> the last resort...
>>
>> So, it has been broken for years, and waiting for a perfect solution
>> has not gotten the problem fixed.
>
> Well, the current default should at least be closer to something that
> works well.
>
> I do think I may have erred on the wrong side of the optimum when I
> submitted the original patch to set the default to 8; that should
> probably have been 7 (i.e., 8 ms; the optimum in the evaluation we did
> was around 6 ms, which is sadly not a power of two). Maybe changing that
> default is actually better than having to redo the testing for all the
> different devices as we're discussing in the context of this patch.
> Maybe I should just send a patch to do that...

It is hubris to think one setting works well for everyone. Sure, set a good
default, but also let people tune the value.

And send the patches to stable so that users on older kernels can have good
performance.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2019-02-21 17:30:23

by Toke Høiland-Jørgensen

Subject: [PATCH] mac80211: Change default tx_sk_pacing_shift to 7

When we did the original tests for the optimal value of sk_pacing_shift, we
came up with 6 ms of buffering as the default. Sadly, 6 is not a power of
two, so when picking the shift value I erred on the side of less buffering
and picked 4 ms instead of 8. This was probably wrong; those 2 ms of extra
buffering makes a larger difference than I thought.

So, change the default pacing shift to 7, which corresponds to 8 ms of
buffering. The point of diminishing returns really kicks in after 8 ms, and
so having this as a default should cut down on the need for extensive
per-device testing and overrides needed in the drivers.

Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
---
net/mac80211/main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index 5055aeba5c5a..800e67615e2a 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -617,13 +617,13 @@ struct ieee80211_hw *ieee80211_alloc_hw_nm(size_t priv_data_len,
* We need a bit of data queued to build aggregates properly, so
* instruct the TCP stack to allow more than a single ms of data
* to be queued in the stack. The value is a bit-shift of 1
- * second, so 8 is ~4ms of queued data. Only affects local TCP
+ * second, so 7 is ~8ms of queued data. Only affects local TCP
* sockets.
* This is the default, anyhow - drivers may need to override it
* for local reasons (longer buffers, longer completion time, or
* similar).
*/
- local->hw.tx_sk_pacing_shift = 8;
+ local->hw.tx_sk_pacing_shift = 7;

/* set up some defaults */
local->hw.queues = 1;
--
2.20.1


2019-02-21 22:50:57

by Toke Høiland-Jørgensen

Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Ben Greear <[email protected]> writes:

> On 2/21/19 9:15 AM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <[email protected]> writes:
>>
>>> On 2/21/19 8:37 AM, Toke Høiland-Jørgensen wrote:
>>>> Ben Greear <[email protected]> writes:
>>>>
>>>>> On 2/21/19 8:10 AM, Kalle Valo wrote:
>>>>>> Toke Høiland-Jørgensen <[email protected]> writes:
>>>>>>
>>>>>>> Grant Grundler <[email protected]> writes:
>>>>>>>
>>>>>>>> On Thu, Sep 6, 2018 at 3:18 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Grant Grundler <[email protected]> writes:
>>>>>>>>>
>>>>>>>>>>> And, well, Grant's data is from a single test in a noisy
>>>>>>>>>>> environment where the time series graph shows that throughput is all over
>>>>>>>>>>> the place for the duration of the test; so it's hard to draw solid
>>>>>>>>>>> conclusions from (for instance, for the 5-stream test, the average
>>>>>>>>>>> throughput for 6 is 331 and 379 Mbps for the two repetitions, and for 7
>>>>>>>>>>> it's 326 and 371 Mbps). Unfortunately I don't have the same hardware
>>>>>>>>>>> used in this test, so I can't go verify it myself; so the only thing I
>>>>>>>>>>> can do is grumble about it here... :)
>>>>>>>>>>
>>>>>>>>>> It's a fair complaint and I agree with it. My counter argument is the
>>>>>>>>>> opposite is true too: most ideal benchmarks don't measure what most
>>>>>>>>>> users see. While the data wgong provided are way more noisy than I
>>>>>>>>>> like, my overall "confidence" in the "conclusion" I offered is still
>>>>>>>>>> positive.
>>>>>>>>>
>>>>>>>>> Right. I guess I would just prefer a slightly more comprehensive
>>>>>>>>> evaluation to base a 4x increase in buffer size on...
>>>>>>>>
>>>>>>>> Kalle, is this why you didn't accept this patch? Other reasons?
>>>>>>>>
>>>>>>>> Toke, what else would you like to see evaluated?
>>>>>>>>
>>>>>>>> I generally want to see three things measured when "benchmarking"
>>>>>>>> technologies: throughput, latency, cpu utilization
>>>>>>>> We've covered those three I think "reasonably".
>>>>>>>
>>>>>>> Hmm, going back and looking at this (I'd completely forgotten about this
>>>>>>> patch), I think I had two main concerns:
>>>>>>>
>>>>>>> 1. What happens in a degraded signal situation, where the throughput is
>>>>>>> limited by the signal conditions, or by contention with other devices.
>>>>>>> Both of these happen regularly, and I worry that latency will be
>>>>>>> badly affected under those conditions.
>>>>>>>
>>>>>>> 2. What happens with old hardware that has worse buffer management in
>>>>>>> the driver->firmware path (especially drivers without push/pull mode
>>>>>>> support)? For these, the lower-level queueing structure is less
>>>>>>> effective at controlling queueing latency.
>>>>>>
>>>>>> Do note that this patch changes behaviour _only_ for QCA6174 and QCA9377
>>>>>> PCI devices, which IIRC do not even support push/pull mode. All the
>>>>>> rest, including QCA988X and QCA9984 are unaffected.
>>>>>
>>>>> Just as a note, at least kernels such as 4.14.whatever perform poorly when
>>>>> running ath10k on 9984 when acting as TCP endpoints. This makes them not
>>>>> really usable for stuff like serving video to lots of clients.
>>>>>
>>>>> Tweaking TCP (I do it a bit differently, but either way) can significantly
>>>>> improve performance.
>>>>
>>>> Differently how? Did you have to do more than fiddle with the pacing_shift?
>>>
>>> This one, or a slightly tweaked version that applies to different kernels:
>>>
>>> https://github.com/greearb/linux-ct-4.16/commit/3e14e8491a5b31ce994fb2752347145e6ab7eaf5
>>
>> Right; but the current mac80211 default (pacing shift 8) corresponds to
>> setting your sysctl to 4...
>>
>>>>> Recently I helped a user that could get barely 70 stations streaming
>>>>> at 1Mbps on stock kernel (using one wave1 on 2.4, one wave-2 on 5Ghz),
>>>>> and we got 110 working with a tweaked TCP stack. These were /n
>>>>> stations too.
>>>>>
>>>>> I think it is lame that it _still_ requires out of tree patches to
>>>>> make TCP work well on ath10k...even if you want to default to current
>>>>> behaviour, you should allow users to tweak it to work with their use
>>>>> case.
>>>>
>>>> Well if TCP is broken to the point of being unusable I do think we
>>>> should fix it; but I think "just provide a configuration knob" should be
>>>> the last resort...
>>>
>>> So, it has been broken for years, and waiting for a perfect solution
>>> has not gotten the problem fixed.
>>
>> Well, the current default should at least be closer to something that
>> works well.
>>
>> I do think I may have erred on the wrong side of the optimum when I
>> submitted the original patch to set the default to 8; that should
>> probably have been 7 (i.e., 8 ms; the optimum in the evaluation we did
>> was around 6 ms, which is sadly not a power of two). Maybe changing that
>> default is actually better than having to redo the testing for all the
>> different devices as we're discussing in the context of this patch.
>> Maybe I should just send a patch to do that...
>
> It is hubris to think one setting works well for everyone. Sure, set a
> good default, but also let people tune the value.

Well I certainly won't object to a knob; I just don't think that most
people are going to use it, so we better make the default reasonable.

> And send the patches to stable so that users on older kernels can have
> good performance.

Sure, I can submit stable backports if needed :)

-Toke

2019-02-22 12:29:16

by Johannes Berg

Subject: Re: [PATCH] mac80211: Change default tx_sk_pacing_shift to 7

Toke Høiland-Jørgensen wrote:

> When we did the original tests for the optimal value of sk_pacing_shift, we
> came up with 6 ms of buffering as the default. Sadly, 6 is not a power of
> two, so when picking the shift value I erred on the side of less buffering
> and picked 4 ms instead of 8. This was probably wrong; those 2 ms of extra
> buffering makes a larger difference than I thought.
>
> So, change the default pacing shift to 7, which corresponds to 8 ms of
> buffering. The point of diminishing returns really kicks in after 8 ms, and
> so having this as a default should cut down on the need for extensive
> per-device testing and overrides needed in the drivers.
>
> Signed-off-by: Toke Høiland-Jørgensen <[email protected]>

Patch applied to wireless.git, thanks.

a41f56b9a17a (HEAD -> mac80211) mac80211: Change default tx_sk_pacing_shift to 7

--
https://patchwork.kernel.org/patch/10824427/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


2019-02-22 13:06:38

by Toke Høiland-Jørgensen

Subject: Re: [PATCH] mac80211: Change default tx_sk_pacing_shift to 7

Johannes Berg <[email protected]> writes:

> Toke Høiland-Jørgensen wrote:
>
>> When we did the original tests for the optimal value of sk_pacing_shift, we
>> came up with 6 ms of buffering as the default. Sadly, 6 is not a power of
>> two, so when picking the shift value I erred on the side of less buffering
>> and picked 4 ms instead of 8. This was probably wrong; those 2 ms of extra
>> buffering makes a larger difference than I thought.
>>
>> So, change the default pacing shift to 7, which corresponds to 8 ms of
>> buffering. The point of diminishing returns really kicks in after 8 ms, and
>> so having this as a default should cut down on the need for extensive
>> per-device testing and overrides needed in the drivers.
>>
>> Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
>
> Patch applied to wireless.git, thanks.
>
> a41f56b9a17a (HEAD -> mac80211) mac80211: Change default
> tx_sk_pacing_shift to 7

Cool, thanks! What's the easiest way to backport this? I figure it's
easier to just update sk_pacing_shift_update() in tx.c for 4.19 (which
predates the addition of the driver override hook); shall I just send a
separate patch to stable for that? Or do we need to backport the driver
override hook as well?

-Toke

2019-02-22 13:07:41

by Johannes Berg

Subject: Re: [PATCH] mac80211: Change default tx_sk_pacing_shift to 7

On Fri, 2019-02-22 at 14:06 +0100, Toke Høiland-Jørgensen wrote:
> Johannes Berg <[email protected]> writes:
>
> > Toke Høiland-Jørgensen wrote:
> >
> > > When we did the original tests for the optimal value of sk_pacing_shift, we
> > > came up with 6 ms of buffering as the default. Sadly, 6 is not a power of
> > > two, so when picking the shift value I erred on the side of less buffering
> > > and picked 4 ms instead of 8. This was probably wrong; those 2 ms of extra
> > > buffering makes a larger difference than I thought.
> > >
> > > So, change the default pacing shift to 7, which corresponds to 8 ms of
> > > buffering. The point of diminishing returns really kicks in after 8 ms, and
> > > so having this as a default should cut down on the need for extensive
> > > per-device testing and overrides needed in the drivers.
> > >
> > > Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
> >
> > Patch applied to wireless.git, thanks.
> >
> > a41f56b9a17a (HEAD -> mac80211) mac80211: Change default
> > tx_sk_pacing_shift to 7

This mess came from Kalle's tool btw, so I can't really use it yet :-)

> Cool, thanks! What's the easiest way to backport this? I figure it's
> easier to just update sk_pacing_shift_update() in tx.c for 4.19 (which
> predates the addition of the driver override hook); shall I just send a
> separate patch to stable for that? Or do we need to backport the driver
> override hook as well?

Just update the value, no need to backport the hook.

johannes


2019-02-22 13:40:16

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Change default tx_sk_pacing_shift to 7

Johannes Berg <[email protected]> writes:

> On Fri, 2019-02-22 at 14:06 +0100, Toke Høiland-Jørgensen wrote:
>> Johannes Berg <[email protected]> writes:
>>
>> > Toke Høiland-Jørgensen wrote:
>> >
>> > > When we did the original tests for the optimal value of sk_pacing_shift, we
>> > > came up with 6 ms of buffering as the default. Sadly, 6 is not a power of
>> > > two, so when picking the shift value I erred on the side of less buffering
>> > > and picked 4 ms instead of 8. This was probably wrong; those 2 ms of extra
>> > > buffering make a larger difference than I thought.
>> > >
>> > > So, change the default pacing shift to 7, which corresponds to 8 ms of
>> > > buffering. The point of diminishing returns really kicks in after 8 ms, and
>> > > so having this as a default should cut down on the need for extensive
>> > > per-device testing and overrides needed in the drivers.
>> > >
>> > > Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
>> >
>> > Patch applied to wireless.git, thanks.
>> >
>> > a41f56b9a17a (HEAD -> mac80211) mac80211: Change default
>> > tx_sk_pacing_shift to 7
>
> This mess came from Kalle's tool btw, so I can't really use it yet :-)
>
>> Cool, thanks! What's the easiest way to backport this? I figure it's
>> easier to just update sk_pacing_shift_update() in tx.c for 4.19 (which
>> predates the addition of the driver override hook); shall I just send a
>> separate patch to stable for that? Or do we need to backport the driver
>> override hook as well?
>
> Just update the value, no need to backport the hook.

And send it directly to stable, or does it need to go through you?

-Toke

2019-02-22 19:10:11

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Change default tx_sk_pacing_shift to 7

On Fri, 2019-02-22 at 14:40 +0100, Toke Høiland-Jørgensen wrote:
>
> And send it directly to stable, or does it need to go through you?

I added a Cc: stable tag. So all you really need to do is wait for the
emails telling you it failed to apply, and handle accordingly :)

johannes


2019-02-23 11:50:03

by Toke Høiland-Jørgensen

[permalink] [raw]
Subject: Re: [PATCH] mac80211: Change default tx_sk_pacing_shift to 7

Johannes Berg <[email protected]> writes:

> On Fri, 2019-02-22 at 14:40 +0100, Toke Høiland-Jørgensen wrote:
>>
>> And send it directly to stable, or does it need to go through you?
>
> I added a Cc: stable tag. So all you really need to do is wait for the
> emails telling you it failed to apply, and handle accordingly :)

Cool, that I can do; thanks! :)

-Toke

2020-04-23 06:35:11

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] ath10k: Set sk_pacing_shift to 6 for 11AC WiFi chips

Wen Gong <[email protected]> wrote:

> The upstream kernel has an interface to adjust sk_pacing_shift to
> improve TCP UL throughput.
> The default sk_pacing_shift in mac80211 is 8, based on tests with 11N
> WiFi chips using ath9k. For QCA6174/QCA9377 PCI 11AC chips, 11AC
> VHT80 TCP UL throughput testing shows that 6 is the optimal value.
> Override sk_pacing_shift to 6 in the ath10k driver for QCA6174/9377 PCI.
>
> Tested with QCA6174 PCI with firmware
> WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 PCI.
> It's not a regression with new firmware releases.
>
> There are 2 test results with different settings:
>
> ARM CPU based device with QCA6174A PCI with different
> sk_pacing_shift:
>
> sk_pacing_shift  throughput(Mbps)  CPU utilization
> 6                500(-P5)          ~75% idle, Focus on CPU1: ~14% idle
> 7                454(-P5)          ~80% idle, Focus on CPU1: ~4% idle
> 8                288               ~90% idle, Focus on CPU1: ~35% idle
> 9                ~200              ~92% idle, Focus on CPU1: ~50% idle
>
> 5G TCP UL VHT80 on X86 platform with QCA6174A PCI with sk_pacing_shift
> set to 6:
>
> tcp_limit_output_bytes       throughput(Mbps)
> default(262144)+1 Stream     336
> default(262144)+2 Streams    558
> default(262144)+3 Streams    584
> default(262144)+4 Streams    602
> default(262144)+5 Streams    598
> changed(2621440)+1 Stream    598
> changed(2621440)+2 Streams   601
>
> Signed-off-by: Wen Gong <[email protected]>

The final result of this patch is unclear so I'm dropping this. Please
resend if the issue still exists.

Patch set to Changes Requested.

--
https://patchwork.kernel.org/patch/10559733/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches