The following problem was observed when running iperf:
[ 3] 0.0- 1.0 sec 2.00 MBytes 16.8 Mbits/sec
[ 3] 1.0- 2.0 sec 3.12 MBytes 26.2 Mbits/sec
[ 3] 2.0- 3.0 sec 3.25 MBytes 27.3 Mbits/sec
[ 3] 3.0- 4.0 sec 655 KBytes 5.36 Mbits/sec
[ 3] 4.0- 5.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 5.0- 6.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 6.0- 7.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 7.0- 8.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 8.0- 9.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 9.0-10.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 0.0-10.3 sec 9.01 MBytes 7.32 Mbits/sec
There are frames in the ieee80211_txq and there are frames that have
been removed from this queue, but haven't yet been sent on the wire
(num_pending_tx).
When num_pending_tx reaches max_num_pending_tx, we will stop the queues
by calling ieee80211_stop_queues().
As frames that have previously been sent for transmission
(num_pending_tx) are completed, we will decrease num_pending_tx and wake
the queues by calling ieee80211_wake_queue(). ieee80211_wake_queue()
does not call wake_tx_queue, so we might still have frames in the
queue at this point.
While the queues were stopped, the socket buffer might have filled up,
and in order for user space to write more, we need to free the frames
in the queue, since they are accounted to the socket. In order to free
them, we first need to transmit them.
In order to avoid trying to flush the queue every time we free a frame,
only do this when there are 3 or fewer frames pending, and while we
actually have frames in the queue. This logic was copied from
mt76_txq_schedule (mt76), one of the few other drivers that are actually
using wake_tx_queue.
Suggested-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Niklas Cassel <[email protected]>
---
Changes since V1: use READ_ONCE() to disallow the compiler
optimizing things in undesirable ways.
drivers/net/wireless/ath/ath10k/txrx.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
index cda164f6e9f6..264cf0bd5c00 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
 		wake_up(&htt->empty_tx_wq);
 	spin_unlock_bh(&htt->tx_lock);
+	if (READ_ONCE(htt->num_pending_tx) <= 3 && !list_empty(&ar->txqs))
+		ath10k_mac_tx_push_pending(ar);
+
 	dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);
 	ath10k_report_offchan_tx(htt->ar, msdu);
--
2.17.0
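The stall described in the commit message can be sketched with a small user-space model. This is not driver code: struct tx_model and the model_* functions are hypothetical names, and the sketch assumes that nothing other than a TX completion refills the hardware once the queues have been stopped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not ath10k structures. */
struct tx_model {
	int num_pending_tx;     /* frames handed to hardware, not yet completed */
	int max_num_pending_tx; /* limit on frames in flight */
	int queued;             /* frames still waiting in the software queue */
	bool push_on_unref;     /* true models the patched completion path */
};

/* Move frames from the software queue to the hardware until it is full. */
static void model_push_pending(struct tx_model *m)
{
	while (m->queued > 0 && m->num_pending_tx < m->max_num_pending_tx) {
		m->queued--;
		m->num_pending_tx++;
	}
}

/* Complete one in-flight frame, as a TX completion would. */
static void model_tx_unref(struct tx_model *m)
{
	if (m->num_pending_tx == 0)
		return;
	m->num_pending_tx--;

	/* The check added by the patch: refill only when nearly drained. */
	if (m->push_on_unref && m->num_pending_tx <= 3 && m->queued > 0)
		model_push_pending(m);
}

/* Fill once, then complete frames until nothing is in flight.
 * Returns the number of frames left stranded in the software queue. */
static int model_run_to_idle(struct tx_model *m)
{
	assert(m->max_num_pending_tx > 0);

	model_push_pending(m);
	while (m->num_pending_tx > 0)
		model_tx_unref(m);
	return m->queued;
}
```

With 10 queued frames and a limit of 4 in flight, the unpatched model ends with 6 frames stranded, while the patched model drains the queue completely.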
On 2018-05-21 13:43, Niklas Cassel wrote:
> The following problem was observed when running iperf:
[...]
>
> In order to avoid trying to flush the queue every time we free a frame,
> only do this when there are 3 or fewer frames pending, and while we
> actually have frames in the queue. This logic was copied from
> mt76_txq_schedule (mt76), one of the few other drivers that are actually
> using wake_tx_queue.
>
> Suggested-by: Toke Høiland-Jørgensen <[email protected]>
> Signed-off-by: Niklas Cassel <[email protected]>
> ---
> Changes since V1: use READ_ONCE() to disallow the compiler
> optimizing things in undesirable ways.
>
> drivers/net/wireless/ath/ath10k/txrx.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
> index cda164f6e9f6..264cf0bd5c00 100644
> --- a/drivers/net/wireless/ath/ath10k/txrx.c
> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> @@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
> wake_up(&htt->empty_tx_wq);
> spin_unlock_bh(&htt->tx_lock);
>
> + if (READ_ONCE(htt->num_pending_tx) <= 3 && !list_empty(&ar->txqs))
> + ath10k_mac_tx_push_pending(ar);
> +
Niklas,
Sorry for the late response. ath10k_mac_tx_push_pending is already called
at the end of NAPI handler. Isn't that enough to process pending frames?
Earlier we observed performance issues in calling push_pending from each
tx completion. IMHO this change may introduce the same problem again.
-Rajkumar
On Mon, May 21, 2018 at 04:11:38PM -0700, Rajkumar Manoharan wrote:
> On 2018-05-21 13:43, Niklas Cassel wrote:
> > The following problem was observed when running iperf:
> [...]
> >
> > In order to avoid trying to flush the queue every time we free a frame,
> > only do this when there are 3 or fewer frames pending, and while we
> > actually have frames in the queue. This logic was copied from
> > mt76_txq_schedule (mt76), one of the few other drivers that are actually
> > using wake_tx_queue.
> >
> > Suggested-by: Toke Høiland-Jørgensen <[email protected]>
> > Signed-off-by: Niklas Cassel <[email protected]>
> > ---
> > Changes since V1: use READ_ONCE() to disallow the compiler
> > optimizing things in undesirable ways.
> >
> > drivers/net/wireless/ath/ath10k/txrx.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
> > index cda164f6e9f6..264cf0bd5c00 100644
> > --- a/drivers/net/wireless/ath/ath10k/txrx.c
> > +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> > @@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
> > wake_up(&htt->empty_tx_wq);
> > spin_unlock_bh(&htt->tx_lock);
> >
> > + if (READ_ONCE(htt->num_pending_tx) <= 3 && !list_empty(&ar->txqs))
> > + ath10k_mac_tx_push_pending(ar);
> > +
> Niklas,
Hello Rajkumar
>
> Sorry for the late response. ath10k_mac_tx_push_pending is already called
> at the end of NAPI handler. Isn't that enough to process pending frames?
This is true for e.g. ATH10K_BUS_PCI and ATH10K_BUS_SNOC,
but not for e.g. ATH10K_BUS_SDIO and ATH10K_BUS_USB.
While there is some SDIO code merged in Kalle's tree already,
this problem was found when merging
https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/?h=ath10k-pending-sdio-usb
with Kalle's ath-next branch.
>
> Earlier we observed performance issues in calling push_pending from each
> tx completion. IMHO this change may introduce the same problem again.
I prefer functional TX over performance issues,
but I agree that it is unfortunate that SDIO doesn't use
ath10k_htt_txrx_compl_task().
Erik, is there a reason for this?
Perhaps it would be possible to call ath10k_mac_tx_push_pending()
from the equivalent to ath10k_htt_txrx_compl_task(),
but from SDIO's point of view.
Another solution might be to change so that we only call
ath10k_mac_tx_push_pending() from ath10k_txrx_tx_unref()
if (htt->num_pending_tx == 0). That should decrease the number
of calls to ath10k_mac_tx_push_pending(), while still avoiding
a "TX deadlock" scenario for SDIO.
Regards,
Niklas
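To illustrate the trade-off between the two thresholds discussed above (num_pending_tx <= 3 versus num_pending_tx == 0), here is a rough user-space sketch that counts how often a refill would be attempted. The names refill_stats and model_drain are hypothetical, and the model assumes completions arrive one at a time with no other source of refills.

```c
#include <assert.h>

/* Hypothetical model: count refill attempts for a given threshold on
 * the number of in-flight frames. Not ath10k code. */
struct refill_stats {
	int push_calls;  /* times the refill path was invoked */
	int left_queued; /* frames stranded in the software queue */
};

static struct refill_stats model_drain(int queued, int max_pending,
				       int threshold)
{
	struct refill_stats st = { 0, 0 };
	int pending = 0;

	assert(max_pending > 0 && threshold >= 0);

	/* Initial fill, as if the stack had just handed us the frames. */
	while (queued > 0 && pending < max_pending) {
		queued--;
		pending++;
	}

	while (pending > 0) {
		pending--; /* one TX completion */

		/* Refill only when at or below the threshold. */
		if (pending <= threshold && queued > 0) {
			st.push_calls++;
			while (queued > 0 && pending < max_pending) {
				queued--;
				pending++;
			}
		}
	}

	st.left_queued = queued;
	return st;
}
```

In this model, with 10 queued frames and a limit of 4 in flight, a threshold of 3 triggers 6 refills and a threshold of 0 triggers 2, and in both cases no frames are left stranded.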
On 2018-05-22 14:15, Niklas Cassel wrote:
> On Mon, May 21, 2018 at 04:11:38PM -0700, Rajkumar Manoharan wrote:
>> On 2018-05-21 13:43, Niklas Cassel wrote:
>> > The following problem was observed when running iperf:
[...]
>>
>> Sorry for the late response. ath10k_mac_tx_push_pending is already called
>> at the end of NAPI handler. Isn't that enough to process pending frames?
>
> This is true for e.g. ATH10K_BUS_PCI and ATH10K_BUS_SNOC,
> but not for e.g. ATH10K_BUS_SDIO and ATH10K_BUS_USB.
>
> While there is some SDIO code merged in Kalle's tree already,
> this problem was found when merging
> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/?h=ath10k-pending-sdio-usb
> with Kalle's ath-next branch.
>
>>
>> Earlier we observed performance issues in calling push_pending from each
>> tx completion. IMHO this change may introduce the same problem again.
>
> I prefer functional TX over performance issues,
> but I agree that it is unfortunate that SDIO doesn't use
> ath10k_htt_txrx_compl_task().
> Erik, is there a reason for this?
>
Thanks for the details. Now I see your problem. In the case of low-latency
devices (PCI/SNOC/AHB), all CE ring interrupts are serviced first, the
consolidated data processing is then done from ath10k_htt_txrx_compl_task,
and finally ath10k_mac_tx_push_pending is called from NAPI poll.
In the case of high-latency devices (USB/SDIO), each endpoint is serviced
and all tx/rx jobs are completed from the same context. Hence there is no
need for consolidated processing.
> Perhaps it would be possible to call ath10k_mac_tx_push_pending()
> from the equivalent to ath10k_htt_txrx_compl_task(),
> but from SDIO's point of view.
>
> Another solution might be to change so that we only call
> ath10k_mac_tx_push_pending() from ath10k_txrx_tx_unref()
> if (htt->num_pending_tx == 0). That should decrease the number
> of calls to ath10k_mac_tx_push_pending(), while still avoiding
> a "TX deadlock" scenario for SDIO.
>
This issue is specific to HL devices, but your change is in the common path
and will impact LL devices as well. I would prefer to call
ath10k_mac_tx_push_pending after processing all received mbox/urb messages:
export the ath10k_mac_tx_push_pending API and call it from the USB/SDIO irq
handler. Any thoughts?
-Rajkumar
On 05/22/2018 11:15 PM, Niklas Cassel wrote:
<snip>
>>
>> Earlier we observed performance issues in calling push_pending from each
>> tx completion. IMHO this change may introduce the same problem again.
>
> I prefer functional TX over performance issues,
> but I agree that it is unfortunate that SDIO doesn't use
> ath10k_htt_txrx_compl_task().
> Erik, is there a reason for this?
The reason is that the SDIO code has been derived mainly from qcacld and ath6kl
and they don't implement napi.
ath10k_htt_txrx_compl_task is currently only called from the napi poll function,
and the sdio bus driver doesn't have such a function.
>
> Perhaps it would be possible to call ath10k_mac_tx_push_pending()
> from the equivalent to ath10k_htt_txrx_compl_task(),
> but from SDIO's point of view.
An equivalent for SDIO would most likely be *ath10k_htt_htc_t2h_msg_handler*
or any of the other functions called from this function.
*ath10k_txrx_tx_unref* is actually called from *ath10k_htt_htc_t2h_msg_handler*,
so that function could be viewed as an equivalent.
If the call should be added in the bus driver (sdio.c) it should most likely be
placed in *ath10k_sdio_mbox_rx_process_packets*
	if (!pkt->trailer_only) {
		ep->ep_ops.ep_rx_complete(ar_sdio->ar, pkt->skb);
		ath10k_mac_tx_push_pending(ar_sdio->ar);
	} else {
		kfree_skb(pkt->skb);
	}
The above call would of course result in lots of calls to *ath10k_mac_tx_push_pending*.
Adding a htt_num_pending check here wouldn't look nice.
The HL RX path differs from the LL path in that the t2h_msg_handler returns
false indicating that it has consumed the skb.
This is because it is the HL RX indication handler that delivers the skb's
to mac80211.
Another solution could be to add an *else-statement* as a part of the *if (release)*
in *ath10k_htt_htc_t2h_msg_handler*, where *ath10k_mac_tx_push_pending* could be called.
Something like this perhaps:
	/* Free the indication buffer */
	if (release)
		dev_kfree_skb_any(skb);
	else if (!ar->htt.num_pending_tx)
		ath10k_mac_tx_push_pending(ar);
I think I prefer your original patch though.
>
> Another solution might be to change so that we only call
> ath10k_mac_tx_push_pending() from ath10k_txrx_tx_unref()
> if (htt->num_pending_tx == 0). That should decrease the number
> of calls to ath10k_mac_tx_push_pending(), while still avoiding
> a "TX deadlock" scenario for SDIO.
Just out of curiosity, where did the limit of 3 come from?
If it works with a limit of 0, I think it should be used instead.
Another interesting thing that I stumbled upon when looking into the
code (while writing this email) is the *wake_up(&htt->empty_tx_wq);*
For some reason I have considered it not to be applicable for HL devices.
The queue is waited for in the flush op (*ath10k_flush*).
I am unsure what it is used for, but I don't think it affects the TX
deadlock scenario.
--
Erik
On 2018-05-23 09:25, Erik Stromdahl wrote:
> On 05/22/2018 11:15 PM, Niklas Cassel wrote:
>
[...]
>>
>> Perhaps it would be possible to call ath10k_mac_tx_push_pending()
>> from the equivalent to ath10k_htt_txrx_compl_task(),
>> but from SDIO's point of view.
> An equivalent for SDIO would most likely be *ath10k_htt_htc_t2h_msg_handler*
> or any of the other functions called from this function.
>
> *ath10k_txrx_tx_unref* is actually called from *ath10k_htt_htc_t2h_msg_handler*,
> so that function could be viewed as an equivalent.
>
> If the call should be added in the bus driver (sdio.c) it should most likely be
> placed in *ath10k_sdio_mbox_rx_process_packets*
>
> 	if (!pkt->trailer_only) {
> 		ep->ep_ops.ep_rx_complete(ar_sdio->ar, pkt->skb);
> 		ath10k_mac_tx_push_pending(ar_sdio->ar);
> 	} else {
> 		kfree_skb(pkt->skb);
> 	}
>
> The above call would of course result in lots of calls to *ath10k_mac_tx_push_pending*.
> Adding a htt_num_pending check here wouldn't look nice.
>
> The HL RX path differs from the LL path in that the t2h_msg_handler returns
> false indicating that it has consumed the skb.
>
> This is because it is the HL RX indication handler that delivers the skb's
> to mac80211.
>
I also don't prefer calling *_push_pending for every HTC packet. Similar to
the LL approach, call ath10k_mac_tx_push_pending after processing all pending
rx messages, e.g. from ath10k_sdio_mbox_rxmsg_pending_handler.
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -807,6 +807,8 @@ static int ath10k_sdio_mbox_rxmsg_pending_handler(struct ath10k *ar,
 		ath10k_warn(ar, "failed to get pending recv messages: %d\n",
 			    ret);
+	ath10k_mac_tx_push_pending(ar);
+
 	return ret;
 }
> Another solution could be to add an *else-statement* as a part of the *if (release)*
> in *ath10k_htt_htc_t2h_msg_handler*, where *ath10k_mac_tx_push_pending* could be called.
>
> Something like this perhaps:
>
> /* Free the indication buffer */
> if (release)
> dev_kfree_skb_any(skb);
> else if (!ar->htt.num_pending_tx)
> ath10k_mac_tx_push_pending(ar);
>
> I think I prefer your original patch though.
>>
It is better to make the change in the HL-specific path rather than in the
common path. The above change will impact QCA6174 based devices.
-Rajkumar
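The difference between pushing after every received packet and pushing once after the whole batch of pending RX messages can be sketched as follows. This is a user-space illustration only: struct rx_model and the model_* helpers are hypothetical stand-ins, and the sketch only counts calls rather than modelling the real SDIO mailbox handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters standing in for driver state. */
struct rx_model {
	size_t packets_handled;
	size_t push_calls;
};

/* Stand-in for the refill call; here it only counts invocations. */
static void model_push_pending(struct rx_model *m)
{
	m->push_calls++;
}

/* Stand-in for per-packet RX completion handling. */
static void model_rx_complete(struct rx_model *m)
{
	m->packets_handled++;
}

/* Variant 1: push after every received packet. */
static void model_rx_push_per_packet(struct rx_model *m, size_t npackets)
{
	assert(m != NULL);

	for (size_t i = 0; i < npackets; i++) {
		model_rx_complete(m);
		model_push_pending(m);
	}
}

/* Variant 2: push once, after the whole batch has been handled. */
static void model_rx_push_per_batch(struct rx_model *m, size_t npackets)
{
	assert(m != NULL);

	for (size_t i = 0; i < npackets; i++)
		model_rx_complete(m);
	model_push_pending(m);
}
```

For a batch of 16 packets, the per-packet variant invokes the push path 16 times and the per-batch variant invokes it once, which is the motivation for placing the call at the end of the pending-message handler.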
On Wed, May 23, 2018 at 06:25:49PM +0200, Erik Stromdahl wrote:
>
>
> On 05/22/2018 11:15 PM, Niklas Cassel wrote:
>
> <snip>
> > >
> > > Earlier we observed performance issues in calling push_pending from each
> > > tx completion. IMHO this change may introduce the same problem again.
> >
> > I prefer functional TX over performance issues,
> > but I agree that it is unfortunate that SDIO doesn't use
> > ath10k_htt_txrx_compl_task().
> > Erik, is there a reason for this?
> The reason is that the SDIO code has been derived mainly from qcacld and ath6kl
> and they don't implement napi.
>
> ath10k_htt_txrx_compl_task is currently only called from the napi poll function,
> and the sdio bus driver doesn't have such a function.
Ok, thanks for the explanation. Perhaps we can change the SDIO code so that it
uses NAPI in the future.
<snip>
> > Another solution might be to change so that we only call
> > ath10k_mac_tx_push_pending() from ath10k_txrx_tx_unref()
> > if (htt->num_pending_tx == 0). That should decrease the number
> > of calls to ath10k_mac_tx_push_pending(), while still avoiding
> > a "TX deadlock" scenario for SDIO.
> Just out of curiosity, where did the limit of 3 come from?
> If it works with a limit of 0, I think it should be used instead.
It came from mt76_txq_schedule():
if (hwq->swq_queued >= 4 || list_empty(&hwq->swq))
break;
len = mt76_txq_schedule_list(dev, hwq);
Since this used a break, I simply inverted the logic,
and called ath10k_mac_tx_push_pending() rather than
mt76_txq_schedule_list().
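The inversion described above is an application of De Morgan's law: the negation of (queued >= 4 || list_empty) is (queued <= 3 && !list_empty). A small user-space check, with hypothetical helper names that are not part of either driver, might look like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Condition under which the quoted mt76 loop stops (hypothetical helper). */
static bool mt76_style_should_stop(int swq_queued, bool list_is_empty)
{
	return swq_queued >= 4 || list_is_empty;
}

/* Condition under which the V2 ath10k patch pushes (hypothetical helper). */
static bool ath10k_style_should_push(int num_pending_tx, bool list_is_empty)
{
	return num_pending_tx <= 3 && !list_is_empty;
}

/* Return true if the two predicates are exact complements for every
 * counter value in [0, limit] and for both list states. */
static bool conditions_are_complementary(int limit)
{
	assert(limit >= 0);

	for (int n = 0; n <= limit; n++) {
		for (int e = 0; e <= 1; e++) {
			bool empty = (e != 0);

			if (mt76_style_should_stop(n, empty) ==
			    ath10k_style_should_push(n, empty))
				return false;
		}
	}
	return true;
}
```

The check confirms that "push" is taken exactly when the mt76 loop would not break, so the threshold of 3 is simply the mt76 limit of 4 seen from the other side.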
However, I've submitted a V4 now that mimics the behavior
in ath10k_htt_txrx_compl_task() instead, so now I call
ath10k_mac_tx_push_pending() regardless of num_pending_tx.
In most cases, ath10k_mac_tx_push_pending() will not dequeue
any frames, since the ar->txqs list will be empty, so this
shouldn't be so bad after all.
>
> Another interesting thing that I stumbled upon when looking into the
> code (while writing this email) is the *wake_up(&htt->empty_tx_wq);*
>
> For some reason I have considered it not to be applicable for HL devices.
>
> The queue is waited for in the flush op (*ath10k_flush*).
> I am unsure what it is used for, but I don't think it affects the TX
> deadlock scenario.
It seems to be called by mac80211 in certain scenarios, but like you said,
it doesn't help with this problem.
Regards,
Niklas