2018-11-20 21:22:23

by Daniel Santos

Subject: rt2800 tx frame dropping issue.

Hello,

I'm coming to this issue rather late and I realize there has been much
ado about it.  I want to follow up on a thread from 3 months ago
https://marc.info/?l=linux-wireless&m=153511575812945

> I get testing results from T-Bone user in bugzilla:
> https://bugzilla.kernel.org/show_bug.cgi?id=82751
>
> And get this:
>
> [ 781.644185] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due to full tx queue 2
> [ 781.662620] 2 d1 s1 p1 ms1
>
> Looks like rt2x00 correctly stops the queue in mac80211, but still gets skb's.
>
> So we can do 3 things: increase the queue size to 128, increase the
> threshold to 16, and make the message a debug one instead of an error
> (I checked some other drivers and it looks like most of them silently
> drop the frame when the queue is full). Removing the message can be
> especially useful since printk can somehow make the mt7620 router
> unresponsive for some time, resulting in connection drops.
>
> Thoughts ?
>
> Regards
> Stanislaw

I believe I have the answer as to why we're getting frames after we stop
the queue.  I had a little chat about this in #kernelnewbies and some
other devs believe it is intentional.

There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
releasing local->queue_stop_reason_lock and calling the driver's tx
until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
these two points the thread can be preempted.  So while we stop the
queue in one thread, there can be 20 other threads in between that have
already checked that the queue was running and have a frame buffer
sitting on their stacks.  I think it could be fixed with the below
patch, but I'm being told that it doesn't need it and that the driver
should just *quietly* drop the frame:

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index d43dab4..29009e0 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1117,6 +1117,7 @@ struct ieee80211_local {
 	int q_stop_reasons[IEEE80211_MAX_QUEUES][IEEE80211_QUEUE_STOP_REASONS];
 	/* also used to protect ampdu_ac_queue and amdpu_ac_stop_refcnt */
 	spinlock_t queue_stop_reason_lock;
+	spinlock_t tx_path_lock;

 	int open_count;
 	int monitors, cooked_mntrs;
diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index 4f8eceb..2312ac9 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -609,6 +609,7 @@ struct ieee80211_hw *ieee80211_alloc_hw_nm(size_t priv_data_len,
 	spin_lock_init(&local->filter_lock);
 	spin_lock_init(&local->rx_path_lock);
 	spin_lock_init(&local->queue_stop_reason_lock);
+	spin_lock_init(&local->tx_path_lock);

 	INIT_LIST_HEAD(&local->chanctx_list);
 	mutex_init(&local->chanctx_mtx);
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 87d807f..8d43e2a 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1311,6 +1311,7 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
 		}
 #endif

+		spin_lock_bh(&local->tx_path_lock);
 		spin_lock_irqsave(&local->queue_stop_reason_lock, flags);
 		if (local->queue_stop_reasons[q] ||
 		    (!txpending && !skb_queue_empty(&local->pending[q]))) {
@@ -1327,6 +1328,7 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
 					spin_unlock_irqrestore(
 						&local->queue_stop_reason_lock,
 						flags);
+					spin_unlock_bh(&local->tx_path_lock);
 					ieee80211_purge_tx_queue(&local->hw,
 								 skbs);
 					return true;
@@ -1347,6 +1349,7 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,

 			spin_unlock_irqrestore(&local->queue_stop_reason_lock,
 					       flags);
+			spin_unlock_bh(&local->tx_path_lock);
 			return false;
 		}
 	}
@@ -1356,6 +1359,7 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,

 		__skb_unlink(skb, skbs);
 		ieee80211_drv_tx(local, vif, sta, skb);
+		spin_unlock_bh(&local->tx_path_lock);
 	}

 	return true;


Could anybody illuminate for me why this isn't done?  I know that there
has to be a point where we realize there are too many items in the queue
and we can't keep up, but this would seem to be a terrible way to do
that.  Is it one of those things where it isn't worth the performance
degradation?  Any insights would be most appreciated!! :)

Like Stanislaw, I have not been able to reproduce the problem for
testing, but somebody who hits the problem (in production) is going to
get a build where the only change is to remove the rt2x00_err message.
It makes sense that the message itself could be the culprit, as you
mentioned.  I suppose the proper thing to do is to just quietly drop the
frame and increment the struct net_device stats.tx_dropped?
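
Roughly what I have in mind for the quiet drop -- paraphrasing the drop
path in rt2x00queue_write_tx_frame from memory, so treat this as a
sketch rather than a verbatim diff -- is just demoting the message and
keeping the existing -ENOBUFS return, since rt2x00mac_tx already frees
the skb when that fails:

	if (unlikely(rt2x00queue_full(queue))) {
		/* queue is full: drop quietly, caller frees the skb */
		rt2x00_dbg(queue->rt2x00dev,
			   "Dropping frame due to full tx queue %d\n",
			   queue->qid);
		ret = -ENOBUFS;
		goto out;
	}

The tx_dropped accounting would then have to happen wherever we can get
at the right net_device, which is part of what I'm asking about.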

Thanks,
Daniel



2018-11-23 19:46:00

by Johannes Berg

Subject: Re: rt2800 tx frame dropping issue.

On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:

> I believe I have the answer as to why we're getting frames after we stop
> the queue. I had a little chat about this in #kernelnewbies and some
> other devs believe it is intentional.
>
> There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
> releasing local->queue_stop_reason_lock and calling the driver's tx
> until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
> these two points the thread can be preempted. So while we stop the
> queue in one thread, there can be 20 other threads in between that have
> already checked that the queue was running and have a frame buffer
> sitting on their stacks.

Not 20, only 1 per netdev queue. I suspect that means just 1 per
hardware queue, but I don't know how rt2x00 maps netdev queues to
hardware queues. If you have lots of netdevs, that might actually be 20,
but I suspect that's not what's going on here.

> I think it could be fixed with the below
> patch, but I'm being told that it doesn't need it and that the driver
> should just *quietly* drop the frame:

[snip patch]

> Could anybody illuminate for me why this isn't done? I know that there
> has to be a point where we realize there are too many items in the queue
> and we can't keep up, but this would seem to be a terrible way to do
> that. Is it one of those things where it isn't worth the performance
> degradation? Any insights would be most appreciated!! :)

There's just not much point, and doing the locking here will mean it's
across _all_ queues, whereas if you have multiple hardware queues you
wouldn't really need it.

Note that with internal TXQs with fq/codel support etc. we actually have
the fq->lock and it's global, and it turns out that's still a better
deal than actually doing parallel TX. So this may not matter so much.

In any case, while this might solve the problem for the specific case
you're thinking about, as soon as you have something else starting or
stopping queues from outside the TX path it still wouldn't actually
help.

By the way, one thing that can likely exacerbate the problem is
fragmentation, once you have that you're more likely to trigger lots of
frames in close succession.

What I would suggest doing in the driver is actually stop the queues
once a *threshold* is reached, rather than being full. Say if you have
128 entries in the HW queue, you can stop once you reach 120 (you
probably don't really need the other 8 places). If you get a 121st
frame, then you can still put it on the queue (until you filled 128),
but you can also stop the queue again (stopping queues is idempotent).
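
To make that concrete (the names here are made up, this is not rt2x00
code, just an illustration of the idea):

#define HWQ_LIMIT	128
#define HWQ_STOP_AT	120

static bool hwq_add_frame(struct ieee80211_hw *hw, struct hwq *q,
			  struct sk_buff *skb)
{
	if (q->len >= HWQ_LIMIT)
		return false;	/* really full: caller drops the frame */

	q->entries[q->len++] = skb;

	/* over the high-water mark: stop the queue; stopping is
	 * idempotent, so doing it again on the next frame is harmless */
	if (q->len >= HWQ_STOP_AT)
		ieee80211_stop_queue(hw, q->qid);

	return true;
}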

johannes


2018-11-26 09:38:35

by Stanislaw Gruszka

Subject: Re: rt2800 tx frame dropping issue.

On Fri, Nov 23, 2018 at 08:45:54PM +0100, Johannes Berg wrote:
> On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:
>
> > I believe I have the answer as to why we're getting frames after we stop
> > the queue. I had a little chat about this in #kernelnewbies and some
> > other devs believe it is intentional.
> >
> > There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
> > releasing local->queue_stop_reason_lock and calling the driver's tx
> > until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
> > these two points the thread can be preempted. So while we stop the
> > queue in one thread, there can be 20 other threads in between that have
> > already checked that the queue was running and have a frame buffer
> > sitting on their stacks.
>
> Not 20, only 1 per netdev queue. I suspect that means just 1 per
> hardware queue, but I don't know how rt2x00 maps netdev queues to
> hardware queues. If you have lots of netdevs, that might actually be 20,
> but I suspect that's not what's going on here.
>
> > I think it could be fixed with the below
> > patch, but I'm being told that it doesn't need it and that the driver
> > should just *quietly* drop the frame:
>
> [snip patch]
>
> > Could anybody illuminate for me why this isn't done? I know that there
> > has to be a point where we realize there are too many items in the queue
> > and we can't keep up, but this would seem to be a terrible way to do
> > that. Is it one of those things where it isn't worth the performance
> > degradation? Any insights would be most appreciated!! :)
>
> There's just not much point, and doing the locking here will mean it's
> across _all_ queues, whereas if you have multiple hardware queues you
> wouldn't really need it.
>
> Note that with internal TXQs with fq/codel support etc. we actually have
> the fq->lock and it's global, and it turns out that's still a better
> deal than actually doing parallel TX. So this may not matter so much.
>
> In any case, while this might solve the problem for the specific case
> you're thinking about, as soon as you have something else starting or
> stopping queues from outside the TX path it still wouldn't actually
> help.
>
> By the way, one thing that can likely exacerbate the problem is
> fragmentation, once you have that you're more likely to trigger lots of
> frames in close succession.
>
> What I would suggest doing in the driver is actually stop the queues
> once a *threshold* is reached, rather than being full. Say if you have
> 128 entries in the HW queue, you can stop once you reach 120 (you
> probably don't really need the other 8 places). If you get a 121st
> frame, then you can still put it on the queue (until you filled 128),
> but you can also stop the queue again (stopping queues is idempotent).

That is what rt2x00 does, but we have 64 entries on the tx queues and
because of that the threshold is small. In:

https://bugzilla.kernel.org/show_bug.cgi?id=82751

I proposed increasing the queue size to 256 and hence making the
threshold bigger. However, I was concerned about bufferbloat and asked
users to test how this affects latency. I never got testing
results :-(

Thanks
Stanislaw

2018-12-03 21:30:23

by Daniel Santos

Subject: Re: rt2800 tx frame dropping issue.

Hello!

Sorry for my delay, I got pretty sick last week.

On 11/23/18 1:45 PM, Johannes Berg wrote:
> On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:
>
>> I believe I have the answer as to why we're getting frames after we stop
>> the queue. I had a little chat about this in #kernelnewbies and some
>> other devs believe it is intentional.
>>
>> There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
>> releasing local->queue_stop_reason_lock and calling the driver's tx
>> until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
>> these two points the thread can be preempted. So while we stop the
>> queue in one thread, there can be 20 other threads in between that have
>> already checked that the queue was running and have a frame buffer
>> sitting on their stacks.
> Not 20, only 1 per netdev queue. I suspect that means just 1 per
> hardware queue, but I don't know how rt2x00 maps netdev queues to
> hardware queues. If you have lots of netdevs, that might actually be 20,
> but I suspect that's not what's going on here.

First of all, thank you for the response!

This code path is reached in at least two different ways that I have
discovered (thus far): from the tasklet and from userspace calling
send(msg,to).  coova-chilli, a popular captive portal package, can have
100 such processes.  While I haven't verified beyond all doubt that
they *can* be preempted here, it most certainly seems to be what is
happening.  I'm attaching the backtraces for reference.


>> I think it could be fixed with the below
>> patch, but I'm being told that it doesn't need it and that the driver
>> should just *quietly* drop the frame:
> [snip patch]
>
>> Could anybody illuminate for me why this isn't done? I know that there
>> has to be a point where we realize there are too many items in the queue
>> and we can't keep up, but this would seem to be a terrible way to do
>> that. Is it one of those things where it isn't worth the performance
>> degradation? Any insights would be most appreciated!! :)
> There's just not much point, and doing the locking here will mean it's
> across _all_ queues, whereas if you have multiple hardware queues you
> wouldn't really need it.

So perhaps a better solution would be to either:

a.) define a mechanism for the driver to expose per-queue spinlocks to
the ieee80211 subsystem, or
b.) define a new driver function (or change the existing one) to return
a status and ask ieee80211 to kindly stick the frame back in its own
queue (a rough sketch of what I mean by (b) follows below).
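
For (b) I'm imagining something along these lines -- purely a sketch,
none of these names exist today and ieee80211_ops::tx currently returns
void, so take it as the shape of the idea rather than a proposal:

static void ieee80211_tx_one_or_requeue(struct ieee80211_local *local,
					struct ieee80211_tx_control *control,
					struct sk_buff *skb, int q)
{
	/* hypothetical driver op that reports "queue full" instead of
	 * forcing the driver to drop the frame */
	if (drv_tx_status(local, control, skb) == -ENOBUFS)
		/* park it on mac80211's pending queue so the tx-pending
		 * tasklet retries it when the queue is woken again */
		skb_queue_tail(&local->pending[q], skb);
}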

I'm still new to this arena, so please be patient if I've said something
stupid.


> Note that with internal TXQs with fq/codel support etc. we actually have
> the fq->lock and it's global, and it turns out that's still a better
> deal than actually doing parallel TX. So this may not matter so much.
>
> In any case, while this might solve the problem for the specific case
> you're thinking about, as soon as you have something else starting or
> stopping queues from outside the TX path it still wouldn't actually
> help.

I don't think that's quite true.  Even if something else is starting and
stopping the queue, the patch could still *help* if it reduces the
instances of having to drop frames, even if it doesn't eliminate them.


>
> By the way, one thing that can likely exacerbate the problem is
> fragmentation, once you have that you're more likely to trigger lots of
> frames in close succession.


Unfortunately I'm new to this driver so I can't comment on that yet, but
I'm catching up!  I presume that's what RTS and CTS were added to the
standard for -- so we can send large frames w/o starving other stations?


> What I would suggest doing in the driver is actually stop the queues
> once a *threshold* is reached, rather than being full. Say if you have
> 128 entries in the HW queue, you can stop once you reach 120 (you
> probably don't really need the other 8 places). If you get a 121st
> frame, then you can still put it on the queue (until you filled 128),
> but you can also stop the queue again (stopping queues is idempotent).
>
> johannes
>
>
Yes, hopefully we can get test data using Stanislaw's patch soon!

Thanks,
Daniel


Attachments:
rt2x00-backtraces.txt (16.63 kB)

2018-12-03 21:46:38

by Daniel Santos

Subject: Re: rt2800 tx frame dropping issue.

On 11/26/18 3:38 AM, Stanislaw Gruszka wrote:
> On Fri, Nov 23, 2018 at 08:45:54PM +0100, Johannes Berg wrote:
>> On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:
>>
>>> I believe I have the answer as to why we're getting frames after we stop
>>> the queue. I had a little chat about this in #kernelnewbies and some
>>> other devs believe it is intentional.
>>>
>>> There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
>>> releasing local->queue_stop_reason_lock and calling the driver's tx
>>> until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
>>> these two points the thread can be preempted. So while we stop the
>>> queue in one thread, there can be 20 other threads in between that have
>>> already checked that the queue was running and have a frame buffer
>>> sitting on their stacks.
>> Not 20, only 1 per netdev queue. I suspect that means just 1 per
>> hardware queue, but I don't know how rt2x00 maps netdev queues to
>> hardware queues. If you have lots of netdevs, that might actually be 20,
>> but I suspect that's not what's going on here.
>>
>>> I think it could be fixed with the below
>>> patch, but I'm being told that it doesn't need it and that the driver
>>> should just *quietly* drop the frame:
>> [snip patch]
>>
>>> Could anybody illuminate for me why this isn't done? I know that there
>>> has to be a point where we realize there are too many items in the queue
>>> and we can't keep up, but this would seem to be a terrible way to do
>>> that. Is it one of those things where it isn't worth the performance
>>> degradation? Any insights would be most appreciated!! :)
>> There's just not much point, and doing the locking here will mean it's
>> across _all_ queues, whereas if you have multiple hardware queues you
>> wouldn't really need it.
>>
>> Note that with internal TXQs with fq/codel support etc. we actually have
>> the fq->lock and it's global, and it turns out that's still a better
>> deal than actually doing parallel TX. So this may not matter so much.
>>
>> In any case, while this might solve the problem for the specific case
>> you're thinking about, as soon as you have something else starting or
>> stopping queues from outside the TX path it still wouldn't actually
>> help.
>>
>> By the way, one thing that can likely exacerbate the problem is
>> fragmentation, once you have that you're more likely to trigger lots of
>> frames in close succession.
>>
>> What I would suggest doing in the driver is actually stop the queues
>> once a *threshold* is reached, rather than being full. Say if you have
>> 128 entries in the HW queue, you can stop once you reach 120 (you
>> probably don't really need the other 8 places). If you get a 121st
>> frame, then you can still put it on the queue (until you filled 128),
>> but you can also stop the queue again (stopping queues is idempotent).
> That is what rt2x00 does, but we have 64 entries on the tx queues and
> because of that the threshold is small. In:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=82751
>
> I proposed increasing the queue size to 256 and hence making the
> threshold bigger. However, I was concerned about bufferbloat and asked
> users to test how this affects latency. I never got testing
> results :-(
>
> Thanks
> Stanislaw
>
Hello Stanislaw,

I almost managed to get that patch in a build to send to somebody who
can reproduce the error in abundance, but they have 15 different people
hammer the router to do it and we ended up sending them a few other
experimental builds instead.

I'm still learning this driver, but I don't see where it creates a
struct net_device -- was that something that came out after the driver
was originally written? (And maybe gets implicitly created somewhere
else?)  iiuc, the best way to do this is by setting tx_queue_len while
the interface is down (via "ip link") and then allocating the queue when
it's brought up.

Unless you know of a problem with this approach, I'm planning on making
a patch just for that.  It will then be easier to tune for an end user's
particular application.  For instance, if there is a small number of
users who just use a very large amount of bandwidth, then buffer bloat
could become a problem if it's too high.  But for a larger number of
users I don't think the buffer bloat will matter as much as the
performance lost from dropping frames and needing to re-request many
lost packets at the TCP layer.  This would also result in more uplink
bandwidth being consumed.

BTW, I guess we need that struct net_device object in order to increment
tx_dropped properly as well.
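
If it helps, this is what I had in mind -- an untested sketch, the
helper name is made up, and it assumes info->control.vif is still valid
at the point where we drop; ieee80211_vif_to_wdev() is the existing
mac80211 export as far as I can tell:

static void rt2x00_count_tx_drop(struct sk_buff *skb)
{
	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
	struct wireless_dev *wdev;

	if (!info->control.vif)
		return;

	/* map the mac80211 vif back to its wireless_dev / net_device */
	wdev = ieee80211_vif_to_wdev(info->control.vif);
	if (wdev && wdev->netdev)
		wdev->netdev->stats.tx_dropped++;
}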

Thanks,
Daniel

2018-12-04 10:17:04

by Stanislaw Gruszka

Subject: Re: rt2800 tx frame dropping issue.

Hi Daniel

On Mon, Dec 03, 2018 at 03:44:46PM -0600, Daniel Santos wrote:
> I almost managed to get that patch in a build to send to somebody who
> can reproduce the error in abundance, but they have 15 different people
> hammer the router to do it and we ended up sending them a few other
> experimental builds instead.
>
> I'm still learning this driver, but I don't see where it creates a
> struct net_device -- was that something that came out after the driver
> was originally written? (And maybe gets implicitly created somewhere
> else?)

It is done in ieee80211_if_add(), one netdev per vif.

> iiuc, the best way to do this is by setting tx_queue_len while
> the interface is down (via "ip link") and then allocating the queue when
> it's brought up.

We have different queues at various levels of the network stack.
The queue sizes I plan to increase are the ones referenced as HW queues.
> Unless you know of a problem with this approach, I'm planning on making
> a patch just for that. It will then be easier to tune for an end user's
> particular application.

I don't think that's the correct approach.  Maybe a module parameter
would be better just for testing.  But since this is for routers/APs,
just making a build with a different tx queue size seems to be the
better approach.
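
If you want to experiment anyway, a throwaway module parameter could
look like this (sketch only -- rt2x00 has no such parameter today, and
where the value gets applied to the per-queue entry count at allocation
time is left open):

/* testing-only knob; default matches the current 64-entry HW queues */
static unsigned int tx_entries = 64;
module_param(tx_entries, uint, 0444);
MODULE_PARM_DESC(tx_entries, "Entries per HW tx queue (testing only)");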

> For instance, if there is a small number of
> users who just use a very large amount of bandwidth, then buffer bloat
> could become a problem if it's too high.  But for a larger number of
> users I don't think the buffer bloat will matter as much as the
> performance lost from dropping frames and needing to re-request many
> lost packets at the TCP layer.  This would also result in more uplink
> bandwidth being consumed.

Well, I guess that's correct, but I still want to see that increasing
the queue size does not cause a big negative effect.

Thanks
Stanislaw