2018-05-17 23:16:41

by Niklas Cassel

[permalink] [raw]
Subject: [PATCH] ath10k: transmit queued frames after waking queues

The following problem was observed when running iperf:

[ 3] 0.0- 1.0 sec 2.00 MBytes 16.8 Mbits/sec
[ 3] 1.0- 2.0 sec 3.12 MBytes 26.2 Mbits/sec
[ 3] 2.0- 3.0 sec 3.25 MBytes 27.3 Mbits/sec
[ 3] 3.0- 4.0 sec 655 KBytes 5.36 Mbits/sec
[ 3] 4.0- 5.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 5.0- 6.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 6.0- 7.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 7.0- 8.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 8.0- 9.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 9.0-10.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 0.0-10.3 sec 9.01 MBytes 7.32 Mbits/sec

There are frames in the ieee80211_txq and there are frames that have
been removed from from this queue, but haven't yet been sent on the wire
(num_pending_tx).

When num_pending_tx reaches max_num_pending_tx, we will stop the queues
by calling ieee80211_stop_queues().

As frames that have previously been sent for transmission
(num_pending_tx) are completed, we will decrease num_pending_tx and wake
the queues by calling ieee80211_wake_queue(). ieee80211_wake_queue()
does not call wake_tx_queue, so we might still have frames in the
queue at this point.

While the queues were stopped, the socket buffer might have filled up,
and in order for user space to write more, we need to free the frames
in the queue, since they are accounted to the socket. In order to free
them, we first need to transmit them.

In order to avoid trying to flush the queue every time we free a frame,
only do this when there are 3 or less frames pending, and while we
actually have frames in the queue. This logic was copied from
mt76_txq_schedule (mt76), one of few other drivers that are actually
using wake_tx_queue.

Suggested-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Niklas Cassel <[email protected]>
---
drivers/net/wireless/ath/ath10k/txrx.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
index cda164f6e9f6..1d3b2d2c3fee 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
wake_up(&htt->empty_tx_wq);
spin_unlock_bh(&htt->tx_lock);

+ if (htt->num_pending_tx <= 3 && !list_empty(&ar->txqs))
+ ath10k_mac_tx_push_pending(ar);
+
dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);

ath10k_report_offchan_tx(htt->ar, msdu);
--
2.17.0


2018-05-24 15:50:36

by Bob Copeland

[permalink] [raw]
Subject: Re: [PATCH] ath10k: transmit queued frames after waking queues

On Mon, May 21, 2018 at 10:37:01PM +0200, Niklas Cassel wrote:
> On Thu, May 17, 2018 at 03:26:25PM -0700, Adrian Chadd wrote:
> > On Thu, 17 May 2018 at 16:16, Niklas Cassel <[email protected]>
> > wrote:
> >
> > > diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
> > b/drivers/net/wireless/ath/ath10k/txrx.c
> > > index cda164f6e9f6..1d3b2d2c3fee 100644
> > > --- a/drivers/net/wireless/ath/ath10k/txrx.c
> > > +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> > > @@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
> > > wake_up(&htt->empty_tx_wq);
> > > spin_unlock_bh(&htt->tx_lock);
> >
> > > + if (htt->num_pending_tx <= 3 && !list_empty(&ar->txqs))
> > > + ath10k_mac_tx_push_pending(ar);
> > > +
> >
> > Just sanity checking - what's protecting htt->num_pending_tx? or is it
> > serialised some other way?
[...]
> I can't see that any of the examples applies, but let's add READ_ONCE(),
> to make sure that the compiler doesn't try to optimize this.

Couldn't you just move the num_pending_tx read inside tx_lock which is 2 lines
above? I think all the other manipulations are protected by tx_lock.

--
Bob Copeland %% https://bobcopeland.com/

2018-05-25 14:21:05

by Niklas Cassel

[permalink] [raw]
Subject: Re: [PATCH] ath10k: transmit queued frames after waking queues

On Fri, May 25, 2018 at 08:50:23AM -0400, Bob Copeland wrote:
> On Fri, May 25, 2018 at 02:36:56PM +0200, Niklas Cassel wrote:
> > A spin lock does have the advantage of ordering: memory operations issued
> > before the spin_unlock_bh() will be completed before the spin_unlock_bh()
> > operation has completed.
> >
> > However, ath10k_htt_tx_dec_pending() was called earlier in the same function,
> > which decreases htt->num_pending_tx, so that write will be completed before
> > our read. That is the only ordering we care about here (if we should call
> > ath10k_mac_tx_push_pending() or not).
>
> Sure. I also understand that reading inside a lock and operating on the
> value outside the lock isn't really the definition of synchronization
> (doesn't really matter in this case though).
>
> I was just suggesting that the implicit memory barrier in the spin unlock
> that we are already paying for would be sufficient here too, and it matches
> the semantic of "tx fields under tx_lock." On the other hand, maybe it's
> just me, but I tend to look askance at just-in-case READ_ONCEs sprinkled
> about.

I agree, because of the implicit memory barrier from spin_unlock_bh(),
READ_ONCE shouldn't really be needed in this case.

I think that it's a good thing to be critical of all "just-in-case" things,
however, it's not always that obvious if you actually need READ_ONCE or not.

E.g. you might need to use it even when you are holding a spin_lock.

Some people recommend to use it for all concurrent non-read-only shared memory
accesses: https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE

Is there a better guideline somewhere..?


Kind regards,
Niklas

2018-05-25 12:50:25

by Bob Copeland

[permalink] [raw]
Subject: Re: [PATCH] ath10k: transmit queued frames after waking queues

On Fri, May 25, 2018 at 02:36:56PM +0200, Niklas Cassel wrote:
> A spin lock does have the advantage of ordering: memory operations issued
> before the spin_unlock_bh() will be completed before the spin_unlock_bh()
> operation has completed.
>
> However, ath10k_htt_tx_dec_pending() was called earlier in the same function,
> which decreases htt->num_pending_tx, so that write will be completed before
> our read. That is the only ordering we care about here (if we should call
> ath10k_mac_tx_push_pending() or not).

Sure. I also understand that reading inside a lock and operating on the
value outside the lock isn't really the definition of synchronization
(doesn't really matter in this case though).

I was just suggesting that the implicit memory barrier in the spin unlock
that we are already paying for would be sufficient here too, and it matches
the semantic of "tx fields under tx_lock." On the other hand, maybe it's
just me, but I tend to look askance at just-in-case READ_ONCEs sprinkled
about.

--
Bob Copeland %% https://bobcopeland.com/

2018-05-25 12:37:00

by Niklas Cassel

[permalink] [raw]
Subject: Re: [PATCH] ath10k: transmit queued frames after waking queues

On Thu, May 24, 2018 at 11:50:34AM -0400, Bob Copeland wrote:
> On Mon, May 21, 2018 at 10:37:01PM +0200, Niklas Cassel wrote:
> > On Thu, May 17, 2018 at 03:26:25PM -0700, Adrian Chadd wrote:
> > > On Thu, 17 May 2018 at 16:16, Niklas Cassel <[email protected]>
> > > wrote:
> > >
> > > > diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
> > > b/drivers/net/wireless/ath/ath10k/txrx.c
> > > > index cda164f6e9f6..1d3b2d2c3fee 100644
> > > > --- a/drivers/net/wireless/ath/ath10k/txrx.c
> > > > +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> > > > @@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
> > > > wake_up(&htt->empty_tx_wq);
> > > > spin_unlock_bh(&htt->tx_lock);
> > >
> > > > + if (htt->num_pending_tx <= 3 && !list_empty(&ar->txqs))
> > > > + ath10k_mac_tx_push_pending(ar);
> > > > +
> > >
> > > Just sanity checking - what's protecting htt->num_pending_tx? or is it
> > > serialised some other way?
> [...]
> > I can't see that any of the examples applies, but let's add READ_ONCE(),
> > to make sure that the compiler doesn't try to optimize this.
>
> Couldn't you just move the num_pending_tx read inside tx_lock which is 2 lines
> above? I think all the other manipulations are protected by tx_lock.

Hello Bob,

There is both a V2 and V3 of this patchset, V3 moves this to a sdio.c and
calls ath10k_mac_tx_push_pending() unconditionally.


But to answer your question,
it shouldn't matter if the read is done while holding the lock or not.

Sure, the compiler could do things so that the code path is always or never
executed, but that is why I added READ_ONCE() in V2.

A spin lock does have the advantage of ordering: memory operations issued
before the spin_unlock_bh() will be completed before the spin_unlock_bh()
operation has completed.

However, ath10k_htt_tx_dec_pending() was called earlier in the same function,
which decreases htt->num_pending_tx, so that write will be completed before
our read. That is the only ordering we care about here (if we should call
ath10k_mac_tx_push_pending() or not).


Kind regards,
Niklas

2018-05-17 23:26:57

by Adrian Chadd

[permalink] [raw]
Subject: Re: [PATCH] ath10k: transmit queued frames after waking queues

On Thu, 17 May 2018 at 16:16, Niklas Cassel <[email protected]>
wrote:

> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
b/drivers/net/wireless/ath/ath10k/txrx.c
> index cda164f6e9f6..1d3b2d2c3fee 100644
> --- a/drivers/net/wireless/ath/ath10k/txrx.c
> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> @@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
> wake_up(&htt->empty_tx_wq);
> spin_unlock_bh(&htt->tx_lock);

> + if (htt->num_pending_tx <= 3 && !list_empty(&ar->txqs))
> + ath10k_mac_tx_push_pending(ar);
> +

Just sanity checking - what's protecting htt->num_pending_tx? or is it
serialised some other way?

> dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);

> ath10k_report_offchan_tx(htt->ar, msdu);
> --
> 2.17.0

2018-05-21 20:37:05

by Niklas Cassel

[permalink] [raw]
Subject: Re: [PATCH] ath10k: transmit queued frames after waking queues

On Thu, May 17, 2018 at 03:26:25PM -0700, Adrian Chadd wrote:
> On Thu, 17 May 2018 at 16:16, Niklas Cassel <[email protected]>
> wrote:
>
> > diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
> b/drivers/net/wireless/ath/ath10k/txrx.c
> > index cda164f6e9f6..1d3b2d2c3fee 100644
> > --- a/drivers/net/wireless/ath/ath10k/txrx.c
> > +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> > @@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
> > wake_up(&htt->empty_tx_wq);
> > spin_unlock_bh(&htt->tx_lock);
>
> > + if (htt->num_pending_tx <= 3 && !list_empty(&ar->txqs))
> > + ath10k_mac_tx_push_pending(ar);
> > +
>
> Just sanity checking - what's protecting htt->num_pending_tx? or is it
> serialised some other way?

Hello Adrian,

I did not take the htt->tx_lock lock for this since it should not be
needed.

There are however some compiler optimizations that you have to look out
for:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/memory-barriers.txt?h=v4.17-rc6#n1547

I can't see that any of the examples applies, but let's add READ_ONCE(),
to make sure that the compiler doesn't try to optimize this.

Will send a V2 for this.

Regards,
Niklas

>
> > dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);
>
> > ath10k_report_offchan_tx(htt->ar, msdu);
> > --
> > 2.17.0