2019-01-28 07:01:54

by Yibo Zhao

Subject: Re: FW: [PATCH] ath10k: fix return value check in wake_tx_q op

Hi Erik,

We have run into a performance issue on our two-core system after applying
your patch. In WDS mode, we found that the peak throughput in TCP-DL and
UDP-DL dropped by more than 10% compared with before. And in some cases,
though throughput stays the same, usage on one CPU rises by about 20%,
which adds roughly 10% to the total CPU usage. With your change, I think
the driver will try its best to push as many packets as it can. During
this time, the driver's queue lock is held for too long on one CPU and, as
a result, the other CPU is blocked if it wants to acquire the same lock.
Working this way does not seem efficient.
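
For reference, the push loop in the wake_tx_q op runs entirely under the
driver's queue lock, so a long burst keeps the lock busy. Roughly
(paraphrasing the current mac.c from memory, so symbol names such as
ar->txqs_lock may differ slightly in your tree):

	spin_lock_bh(&ar->txqs_lock);
	...
	/* with the patch applied this keeps pushing until the txq is
	 * drained (-ENOENT) or max iterations are used up, all while the
	 * lock is held and the other CPU spins waiting for it
	 */
	while (ath10k_mac_tx_can_push(hw, f_txq) && max--) {
		ret = ath10k_mac_tx_push_txq(hw, f_txq);
		if (ret < 0)
			break;
	}
	if (ret != -ENOENT)
		list_add_tail(&f_artxq->list, &ar->txqs);
	spin_unlock_bh(&ar->txqs_lock);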

So I think it is better to revert the change till we come up with a new
solution.

>> Subject: [PATCH] ath10k: fix return value check in wake_tx_q op
>>
>> ath10k_mac_tx_push_txq returns either a positive integer (length) on
>> success or a negative error code on error.
>>
>> The "if (ret) break;" statement will thus always break out of the loop
>> immediately after ath10k_mac_tx_push_txq has returned (making the loop
>> pointless).
>>
>> A side effect of this fix is that we will iterate the queue until
>> ath10k_mac_tx_push_txq returns -ENOENT. This will make sure the queue is
>> not added back to ar->txqs when it is empty. This could potentially
>> improve performance somewhat (I have seen a small improvement with SDIO
>> devices).
>>
>> Signed-off-by: Erik Stromdahl <[email protected]>
>> ---
>>  drivers/net/wireless/ath/ath10k/mac.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
>> index 3d7119ad7c7a..487a7a7380fd 100644
>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>> @@ -4290,7 +4290,7 @@ static void ath10k_mac_op_wake_tx_queue(struct ieee80211_hw *hw,
>>
>>  	while (ath10k_mac_tx_can_push(hw, f_txq) && max--) {
>>  		ret = ath10k_mac_tx_push_txq(hw, f_txq);
>> -		if (ret)
>> +		if (ret < 0)
>>  			break;
>>  	}
>>  	if (ret != -ENOENT)
>> --
>> 2.17.0

--
yibo


2019-02-07 14:25:21

by Kalle Valo

Subject: Re: FW: [PATCH] ath10k: fix return value check in wake_tx_q op

(please don't top post)

yiboz <[email protected]> writes:

> We have run into a performance issue on our two-core system after
> applying your patch. In WDS mode, we found that the peak throughput in
> TCP-DL and UDP-DL dropped by more than 10% compared with before. And in
> some cases, though throughput stays the same, usage on one CPU rises by
> about 20%, which adds roughly 10% to the total CPU usage. With your
> change, I think the driver will try its best to push as many packets as
> it can. During this time, the driver's queue lock is held for too long
> on one CPU and, as a result, the other CPU is blocked if it wants to
> acquire the same lock. Working this way does not seem efficient.
>
> So I think it is better to revert the change till we come up with a
> new solution.

I don't think reverting is a clear option at this stage because that
again creates problems for SDIO. IIRC without this patch SDIO was
sending one packet at a time (or something like that, can't remember all
the details right now).

Why does this happen only in WDS mode? Did you test other modes, like AP
or client mode?

--
Kalle Valo

2019-02-25 04:40:11

by Yibo Zhao

Subject: Re: FW: [PATCH] ath10k: fix return value check in wake_tx_q op

On 2019-02-07 22:25, Kalle Valo wrote:
> Yibo Zhao <[email protected]> writes:
>
>> We have run into a performance issue on our two-core system after
>> applying your patch. In WDS mode, we found that the peak throughput in
>> TCP-DL and UDP-DL dropped by more than 10% compared with before. And in
>> some cases, though throughput stays the same, usage on one CPU rises by
>> about 20%, which adds roughly 10% to the total CPU usage. With your
>> change, I think the driver will try its best to push as many packets as
>> it can. During this time, the driver's queue lock is held for too long
>> on one CPU and, as a result, the other CPU is blocked if it wants to
>> acquire the same lock. Working this way does not seem efficient.
>>
>> So I think it is better to revert the change till we come up with a
>> new solution.
>
> I don't think reverting is a clear option at this stage because that
> again creates problems for SDIO. IIRC without this patch SDIO was
> sending one packet at a time (or something like that, can't remember all
> the details right now).
>

Below is the aqm result after a 1-minute test with Erik's patch.

target 19999us interval 99999us ecn yes
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit
collisions tx-bytes tx-packets flags
0 2 0 0 5342229 0 0 0 0 3867657029 5342229 0x0(RUN)
1 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
2 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
3 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
4 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
5 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
6 0 0 0 2 0 0 0 0 144 2 0x0(RUN)
7 0 0 0 2 0 0 0 0 282 2 0x0(RUN)
8 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
9 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
10 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
11 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
12 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
13 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
14 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
15 0 0 0 0 0 0 0 0 0 0 0x0(RUN)

We see no difference between tx-packets and new-flows.
Whereas if we revert the patch, we get:

target 19999us interval 99999us ecn yes
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit
collisions tx-bytes tx-packets flags
0 2 0 0 2233059 3 0 9236 12 1159661783 6380867 0x0(RUN)
1 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
2 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
3 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
4 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
5 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
6 0 0 0 1 0 0 0 0 144 2 0x0(RUN)
7 0 0 0 1 0 0 0 0 282 2 0x0(RUN)
8 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
9 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
10 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
11 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
12 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
13 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
14 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
15 0 0 0 0 0 0 0 0 0 0 0x0(RUN)

The new-flows count is roughly one-third of the total tx-packets.

IMHO, Erik's change has altered fq's scheduling behavior: it looks like
there are no other packets left in the fq after a packet has been
dequeued. As a result, the flow's deficit is refilled and the flow is
removed from the fq list right away on the same CPU, and during this time
the other CPU can be blocked. When a new packet comes in, the same thing
happens. So we end up with new-flows equal to tx-packets.

Things are different without Erik's change. After a packet has been
dequeued, the flow's deficit is not refilled immediately on CPU0. The
deficit can be refilled on CPU1 while, at the same time, CPU0 fetches data
from the ethernet side. So we see more tx-packets delivered to the FW in
the aqm stats.
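
For reference, this is roughly how the fq layer picks packets (simplified
from include/net/fq_impl.h and quoted from memory, so details may differ
slightly in your tree). The point is that a flow that gets drained is
dropped from the flow lists, so its next packet is counted as a new flow
again:

static struct sk_buff *fq_tin_dequeue(struct fq *fq, struct fq_tin *tin,
				      fq_tin_dequeue_t dequeue_func)
{
	struct fq_flow *flow;
	struct list_head *head;
	struct sk_buff *skb;

begin:
	head = &tin->new_flows;
	if (list_empty(head)) {
		head = &tin->old_flows;
		if (list_empty(head))
			return NULL;
	}

	flow = list_first_entry(head, struct fq_flow, flowchain);

	if (flow->deficit <= 0) {
		/* deficit refill: flow goes to the back of old_flows */
		flow->deficit += fq->quantum;
		list_move_tail(&flow->flowchain, &tin->old_flows);
		goto begin;
	}

	skb = dequeue_func(fq, tin, flow);
	if (!skb) {
		/* flow drained: it leaves the flow lists, so the next
		 * packet enqueued for it takes the
		 * list_empty(&flow->flowchain) path in fq_tin_enqueue()
		 * and bumps the new-flows counter again
		 */
		list_del_init(&flow->flowchain);
		goto begin;
	}

	flow->deficit -= skb->len;
	return skb;
}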


> Why does this happen only in WDS mode? Did you test other modes, like
> AP or client mode?
AP mode has the same issue. UDP throughput drops by more than 10%. As for
TCP, CPU usage rises a lot although throughput stays similar.
So I'd say Erik's change does not work for us.


--
Yibo

2019-03-04 01:56:44

by Yibo Zhao

Subject: Re: FW: [PATCH] ath10k: fix return value check in wake_tx_q op

On 2019-02-25 12:40, Yibo Zhao wrote:
> On 2019-02-07 22:25, Kalle Valo wrote:
>> Yibo Zhao <[email protected]> writes:
>>
>>> We have run into a performance issue on our two-core system after
>>> applying your patch. In WDS mode, we found that the peak throughput in
>>> TCP-DL and UDP-DL dropped by more than 10% compared with before. And in
>>> some cases, though throughput stays the same, usage on one CPU rises by
>>> about 20%, which adds roughly 10% to the total CPU usage. With your
>>> change, I think the driver will try its best to push as many packets as
>>> it can. During this time, the driver's queue lock is held for too long
>>> on one CPU and, as a result, the other CPU is blocked if it wants to
>>> acquire the same lock. Working this way does not seem efficient.
>>>
>>> So I think it is better to revert the change till we come up with a
>>> new solution.
>>
>> I don't think reverting is a clear option at this stage because that
>> again creates problems for SDIO. IIRC without this patch SDIO was
>> sending one packet at a time (or something like that, can't remember all
>> the details right now).
>>
>
> Below is the aqm result after a 1-minute test with Erik's patch.
>
> target 19999us interval 99999us ecn yes
> tid ac backlog-bytes backlog-packets new-flows drops marks overlimit
> collisions tx-bytes tx-packets flags
> 0 2 0 0 5342229 0 0 0 0 3867657029 5342229 0x0(RUN)
> 1 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 2 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 3 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 4 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 5 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 6 0 0 0 2 0 0 0 0 144 2 0x0(RUN)
> 7 0 0 0 2 0 0 0 0 282 2 0x0(RUN)
> 8 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 9 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 10 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 11 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 12 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 13 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 14 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 15 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
>
> We see no difference between tx-packets and new-flows.
> Whereas if we revert the patch, we get:
>
> target 19999us interval 99999us ecn yes
> tid ac backlog-bytes backlog-packets new-flows drops marks overlimit
> collisions tx-bytes tx-packets flags
> 0 2 0 0 2233059 3 0 9236 12 1159661783 6380867 0x0(RUN)
> 1 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 2 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 3 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 4 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 5 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 6 0 0 0 1 0 0 0 0 144 2 0x0(RUN)
> 7 0 0 0 1 0 0 0 0 282 2 0x0(RUN)
> 8 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 9 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 10 3 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 11 2 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 12 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 13 1 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 14 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
> 15 0 0 0 0 0 0 0 0 0 0 0x0(RUN)
>
> The new-flows count is roughly one-third of the total tx-packets.
>
> IMHO, Erik's change has altered fq's scheduling behavior: it looks like
> there are no other packets left in the fq after a packet has been
> dequeued. As a result, the flow's deficit is refilled and the flow is
> removed from the fq list right away on the same CPU, and during this
> time the other CPU can be blocked. When a new packet comes in, the same
> thing happens. So we end up with new-flows equal to tx-packets.
>
> Things are different without Erik's change. After a packet has been
> dequeued, the flow's deficit is not refilled immediately on CPU0. The
> deficit can be refilled on CPU1 while, at the same time, CPU0 fetches
> data from the ethernet side. So we see more tx-packets delivered to the
> FW in the aqm stats.
>
>
>> Why does this happen only in WDS mode? Did you test other modes, like
>> AP or client mode?
> AP mode has the same issue. UDP throughput drops by more than 10%. As
> for TCP, CPU usage rises a lot although throughput stays similar.
> So I'd say Erik's change does not work for us.

Hi Kalle,

May I have your comments?

--
Yibo

2019-03-11 06:44:13

by Erik Stromdahl

Subject: Re: FW: [PATCH] ath10k: fix return value check in wake_tx_q op

Hi Yibo,

Sorry for a late reply, but I have been busy with other projects lately.
I have added my comments below

On 3/4/19 2:56 AM, Yibo Zhao wrote:
> On 2019-02-25 12:40, Yibo Zhao wrote:
>> On 2019-02-07 22:25, Kalle Valo wrote:
>>> Yibo Zhao <[email protected]> writes:
>>>
>>>> We have run into a performance issue on our two-core system after
>>>> applying your patch. In WDS mode, we found that the peak throughput in
>>>> TCP-DL and UDP-DL dropped by more than 10% compared with before. And
>>>> in some cases, though throughput stays the same, usage on one CPU
>>>> rises by about 20%, which adds roughly 10% to the total CPU usage.
>>>> With your change, I think the driver will try its best to push as many
>>>> packets as it can. During this time, the driver's queue lock is held
>>>> for too long on one CPU and, as a result, the other CPU is blocked if
>>>> it wants to acquire the same lock. Working this way does not seem
>>>> efficient.
>>>>
>>>> So I think it is better to revert the change till we come up with a
>>>> new solution.
>>>
>>> I don't think reverting is a clear option at this stage because that
>>> again creates problems for SDIO. IIRC without this patch SDIO was
>>> sending one packet at a time (or something like that, can't remember all
>>> the details right now).

I have a few patches related to bundling of TX packets on my private repo.
I have not yet had the time to prepare them for submission.
This patch is related to that work, but I decided to submit it separately
since I considered it a bugfix.

<snip>

>>
>> IMHO, Erik's change has altered fq's scheduling behavior: it looks like
>> there are no other packets left in the fq after a packet has been
>> dequeued. As a result, the flow's deficit is refilled and the flow is
>> removed from the fq list right away on the same CPU, and during this
>> time the other CPU can be blocked. When a new packet comes in, the same
>> thing happens. So we end up with new-flows equal to tx-packets.
>>
>> Things are different without Erik's change. After a packet has been
>> dequeued, the flow's deficit is not refilled immediately on CPU0. The
>> deficit can be refilled on CPU1 while, at the same time, CPU0 fetches
>> data from the ethernet side. So we see more tx-packets delivered to the
>> FW in the aqm stats.
>>
>>
>>> Why does this happen only in WDS mode? Did you test other modes, like
>>> AP or client mode?
>> AP mode has the same issue. UDP throughput drops by more than 10%. As
>> for TCP, CPU usage rises a lot although throughput stays similar.
>> So I'd say Erik's change does not work for us.
>
> Hi Kalle,
>
> May I have your comments?
>

As I wrote in the commit message, the original code will always break out of
the loop after just one iteration.

It is OK with me to bring back the old logic, but I think we should skip the
loop entirely then.

Something like this:

	if (ath10k_mac_tx_can_push(hw, txq)) {
		ath10k_mac_tx_push_txq(hw, txq);
	}


Btw, I noticed that the "fair scheduling" mechanism (derived from ath9k)
from Toke has been integrated.

I haven't rebased my tree for a while, so I will most likely have to rewrite
my patches anyway in order to make them work with the new TX scheduling.

2019-03-12 02:23:37

by Yibo Zhao

Subject: Re: FW: [PATCH] ath10k: fix return value check in wake_tx_q op

On 2019-03-11 14:44, Erik Stromdahl wrote:
> Hi Yibo,
>
> Sorry for a late reply, but I have been busy with other projects
> lately.
> I have added my comments below
>
> On 3/4/19 2:56 AM, Yibo Zhao wrote:
>> On 2019-02-25 12:40, Yibo Zhao wrote:
>>> On 2019-02-07 22:25, Kalle Valo wrote:
>>>> Yibo Zhao <[email protected]> writes:
>
> I have a few patches related to bundling of TX packets on my private
> repo.
> I have not yet had the time to prepare them for submission.
> This patch is related to that work, but I decided to submit it
> separately
> since I considered it a bugfix.

Great! Really looking forward to your new patch.

>
> <snip>
>
>>>
>>> IMHO, Erik's change has altered fq's scheduling behavior: it looks
>>> like there are no other packets left in the fq after a packet has been
>>> dequeued. As a result, the flow's deficit is refilled and the flow is
>>> removed from the fq list right away on the same CPU, and during this
>>> time the other CPU can be blocked. When a new packet comes in, the
>>> same thing happens. So we end up with new-flows equal to tx-packets.
>>>
>>> Things are different without Erik's change. After a packet has been
>>> dequeued, the flow's deficit is not refilled immediately on CPU0. The
>>> deficit can be refilled on CPU1 while, at the same time, CPU0 fetches
>>> data from the ethernet side. So we see more tx-packets delivered to
>>> the FW in the aqm stats.
>>>
>>>
>>>> Why does this happen only in WDS mode? Did you test other modes,
>>>> like AP or client mode?
>>> AP mode has the same issue. UDP throughput drops by more than 10%. As
>>> for TCP, CPU usage rises a lot although throughput stays similar.
>>> So I'd say Erik's change does not work for us.
>>
>> Hi Kalle,
>>
>> May I have your comments?
>>
>
> As I wrote in the commit message, the original code will always break
> out of
> the loop after just one iteration.
>
> It is OK with me to bring back the old logic, but I think we should
> skip the
> loop entirely then.
>
> Something like this:
>
> 	if (ath10k_mac_tx_can_push(hw, txq)) {
> 		ath10k_mac_tx_push_txq(hw, txq);
> 	}
Yes, that is exactly the way we tried it in our private repo, and it works
fine in our setup so far. I am not sure whether it is OK for other
situations, though. Have you tested this change on your SDIO setup? Any
issues observed?
>
>
> Btw, I noticed that the "fair scheduling" mechanism (derived from
> ath9k) from Toke has been integrated.
>
> I haven't rebased my tree for a while, so I will most likely have to
> rewrite
> my patches anyway in order to make them work with the new TX
> scheduling.

--
Yibo