DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0117F2080F
DomainKey-Signature: a=rsa-sha1; c=nofws; d=pobox.com; h=to:from:subject
        :message-id:date:mime-version:content-type
        :content-transfer-encoding; q=dns; s=sasl; b=upyZ3NV28EVpHFNy+Fo
        yIouyQql6pwETTpB2LXz3tvjEEQA4zFh/ZOo/jrMcKHhrCC+CUeaHrgJOuAXzHcP
        V69Dq7XTQVuxwhfJ7hwHMUZdgc3RbnLSYssDUP8A6aRQ6CCKYR2o2mf9RWnybo2w
        nUoNNBB1JbMQr4qqcW5YmO/A=
To:     Stanislaw Gruszka <sgruszka@redhat.com>,
        linux-wireless@vger.kernel.org
From:   Daniel Santos <daniel.santos@pobox.com>
Subject: rt2800 tx frame dropping issue.
Message-ID: <bc0bc5ea-dd10-a5a5-87ff-00a93bcf814c@pobox.com>
Date:   Tue, 20 Nov 2018 15:20:01 -0600
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.9.1
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: quoted-printable
Sender: linux-wireless-owner@vger.kernel.org
Precedence: bulk

Hello,

I'm coming to this issue rather late and I realize there has been much
ado about it.=C2=A0 I want to follow up on a thread from 3 months ago
https://marc.info/?l=3Dlinux-wireless&m=3D153511575812945

> I get testing results from T-Bone user in bugzilla:
> https://bugzilla.kernel.org/show_bug.cgi?id=3D82751
>
> And get this:
>
> [  781.644185] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Drop=
ping frame due \
> to full tx queue 2 [  781.662620] 2 d1 s1 p1 ms1
>
> Looks like rt2x00 correctly stop queue in mac80211, but sill get skb's.=
=20
>
> So we can do 3 things: increase queue size to 128, increase threshold
> to 16 and make the massage debug one instead of error (I checked
> some other drivers and looks most of them silently drop the frame
> in case queue is full). Especially removing the message can be
> useful since printk can somehow make mt7620 router unresponsive
> for some time resulting in connection drops.
>
> Thoughts ?
>
> Regards
> Stanislaw

I believe I have the answer as to why we're getting frames after we stop
the queue.=C2=A0 I had a little chat about this in #kernelnewbies and som=
e
other devs believe it is intentional.

There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
releasing local->queue_stop_reason_lock and calling the driver's tx
until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
these two points the thread can be preempted.=C2=A0 So while we stop the
queue in one thread, there can be 20 other threads in between that have
already checked that the queue was running and have a frame buffer
sitting on their stacks.=C2=A0 I think it could be fixed with the below
patch, but I'm being told that it doesn't need it and that the driver
should just *quietly* drop the frame:

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index d43dab4..29009e0 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1117,6 +1117,7 @@ struct ieee80211_local {
 	int q_stop_reasons[IEEE80211_MAX_QUEUES][IEEE80211_QUEUE_STOP_REASONS];
 	/* also used to protect ampdu_ac_queue and amdpu_ac_stop_refcnt */
 	spinlock_t queue_stop_reason_lock;
+	spinlock_t tx_path_lock;
=20
 	int open_count;
 	int monitors, cooked_mntrs;
diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index 4f8eceb..2312ac9 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -609,6 +609,7 @@ struct ieee80211_hw *ieee80211_alloc_hw_nm(size_t pri=
v_data_len,
 	spin_lock_init(&local->filter_lock);
 	spin_lock_init(&local->rx_path_lock);
 	spin_lock_init(&local->queue_stop_reason_lock);
+	spin_lock_init(&local->tx_path_lock);
=20
 	INIT_LIST_HEAD(&local->chanctx_list);
 	mutex_init(&local->chanctx_mtx);
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 87d807f..8d43e2a 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1311,6 +1311,7 @@ static bool ieee80211_tx_frags(struct ieee80211_loc=
al *local,
 		}
 #endif
=20
+		spin_lock_bh(&local->tx_path_lock);
 		spin_lock_irqsave(&local->queue_stop_reason_lock, flags);
 		if (local->queue_stop_reasons[q] ||
 		    (!txpending && !skb_queue_empty(&local->pending[q]))) {
@@ -1327,6 +1328,7 @@ static bool ieee80211_tx_frags(struct ieee80211_loc=
al *local,
 					spin_unlock_irqrestore(
 						&local->queue_stop_reason_lock,
 						flags);
+					spin_unlock_bh(&local->tx_path_lock);
 					ieee80211_purge_tx_queue(&local->hw,
 								 skbs);
 					return true;
@@ -1347,6 +1349,7 @@ static bool ieee80211_tx_frags(struct ieee80211_loc=
al *local,
=20
 				spin_unlock_irqrestore(&local->queue_stop_reason_lock,
 						       flags);
+				spin_unlock_bh(&local->tx_path_lock);
 				return false;
 			}
 		}
@@ -1356,6 +1359,7 @@ static bool ieee80211_tx_frags(struct ieee80211_loc=
al *local,
=20
 		__skb_unlink(skb, skbs);
 		ieee80211_drv_tx(local, vif, sta, skb);
+		spin_unlock_bh(&local->tx_path_lock);
 	}
=20
 	return true;


Could anybody illuminate for me why this isn't done?=C2=A0 I know that th=
ere
has to be a point where we realize there are too many items in the queue
and we can't keep up, but this would seem to be a terrible way to do
that.=C2=A0 Is it one of those things where it isn't worth the performanc=
e
degradation?=C2=A0 Any insights would be most appreciated!! :)

Like Stanislaw, I have not been able to reproduce the problem for
testing, but somebody who has the problem (in production) is going to
get a build where the only change is to remove the rt2x00_err message.=C2=
=A0
It makes sense that it would be the problem as you mentioned.=C2=A0 I sup=
pose
the proper thing to do is to just quietly drop it and increment the
struct net_device stats.tx_dropped?

Thanks,
Daniel