Subject: Re: rt2800 tx frame dropping issue.
To: Stanislaw Gruszka, Johannes Berg
Cc: linux-wireless@vger.kernel.org
References: <3e8017792ea5eaa63f7795d9d37e7dcf85cdc612.camel@sipsolutions.net> <20181126093826.GB2047@redhat.com>
From: Daniel Santos
Date: Mon, 3 Dec 2018 15:44:46 -0600
In-Reply-To: <20181126093826.GB2047@redhat.com>

On 11/26/18 3:38 AM, Stanislaw Gruszka wrote:
> On Fri, Nov 23, 2018 at 08:45:54PM +0100, Johannes Berg wrote:
>> On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:
>>
>>> I believe I have the answer as to why we're getting frames after we stop
>>> the queue.  I had a little chat about this in #kernelnewbies and some
>>> other devs believe it is intentional.
>>>
>>> There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
>>> releasing local->queue_stop_reason_lock and calling the driver's tx
>>> until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
>>> these two points the thread can be preempted.  So while we stop the
>>> queue in one thread, there can be 20 other threads in between that have
>>> already checked that the queue was running and have a frame buffer
>>> sitting on their stacks.
>>
>> Not 20, only 1 per netdev queue. I suspect that means just 1 per
>> hardware queue, but I don't know how rt2x00 maps netdev queues to
>> hardware queues. If you have lots of netdevs, that might actually be 20,
>> but I suspect that's not what's going on here.
>>
>>> I think it could be fixed with the below
>>> patch, but I'm being told that it doesn't need it and that the driver
>>> should just *quietly* drop the frame:
>>
>> [snip patch]
>>
>>> Could anybody illuminate for me why this isn't done?  I know that there
>>> has to be a point where we realize there are too many items in the queue
>>> and we can't keep up, but this would seem to be a terrible way to do
>>> that.  Is it one of those things where it isn't worth the performance
>>> degradation?  Any insights would be most appreciated!! :)
>>
>> There's just not much point, and doing the locking here will mean it's
>> across _all_ queues, whereas if you have multiple hardware queues you
>> wouldn't really need it.
>>
>> Note that with internal TXQs with fq/codel support etc. we actually have
>> the fq->lock and it's global, and it turns out that's still a better
>> deal than actually doing parallel TX. So this may not matter so much.
>>
>> In any case, while this might solve the problem for the specific case
>> you're thinking about, as soon as you have something else starting or
>> stopping queues from outside the TX path it still wouldn't actually
>> help.
>>
>> By the way, one thing that can likely exacerbate the problem is
>> fragmentation; once you have that, you're more likely to trigger lots of
>> frames in close succession.
>>
>> What I would suggest doing in the driver is actually stop the queues
>> once a *threshold* is reached, rather than being full. Say if you have
>> 128 entries in the HW queue, you can stop once you reach 120 (you
>> probably don't really need the other 8 places). If you get a 121st
>> frame, then you can still put it on the queue (until you filled 128),
>> but you can also stop the queue again (stopping queues is idempotent).
>
> That is what rt2x00 does, but we have 64 entries on tx queues and
> because of that the threshold is small. In:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=82751
>
> I proposed an increase of the queue size to 256 and hence a bigger
> threshold. However I was concerned about bufferbloat and requested
> testing from users on how this affects latency. Never got testing
> results :-(
>
> Thanks
> Stanislaw

Hello Stanislaw,

I almost managed to get that patch into a build to send to somebody who
can reproduce the error in abundance, but they have 15 different people
hammer the router to do it and we ended up sending them a few other
experimental builds instead.

I'm still learning this driver, but I don't see where it creates a
struct net_device -- was that something that came out after the driver
was originally written?  (And maybe it gets implicitly created somewhere
else?)  IIUC, the best way to do this is by setting tx_queue_len while
the interface is down (via "ip link") and then allocating the queue when
it's brought up.
Unless you know of a problem with resizing the queue this way, I'm
planning on making a patch just for that.  It will then be easier to
tune for an end user's particular application.  For instance, if there
is a small number of users who each use a very large amount of
bandwidth, then bufferbloat could become a problem if the queue is too
large.  But for a larger number of users I don't think bufferbloat will
matter as much as the performance lost from dropping frames and having
to retransmit many lost packets at the TCP layer.  That would also
consume more uplink bandwidth.

By the way, I guess we need that struct net_device object in order to
increment tx_dropped properly as well.

Thanks,
Daniel