2019-12-12 14:52:42

by Johannes Berg

Subject: debugging TCP stalls on high-speed wifi

Hi Eric, all,

I've been debugging (much thanks to bpftrace) TCP stalls on wifi, in
particular on iwlwifi.

What happens, essentially, is that we transmit large aggregates (63
packets of 7.5k A-MSDU size each, for something on the order of 500kB
per PPDU). Theoretically we can have ~240 A-MSDUs on our hardware
queues, and the hardware aggregates them into up to 63 to send as a
single PPDU.

At HE rates (160 MHz, high rates) such a large PPDU takes less than 2ms
to transmit.
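
(Back-of-the-envelope, just from those numbers: 63 * 7.5 kB is ~470 kB
per PPDU, and ~470 kB every 2 ms is roughly 1.9 Gbit/s of raw airtime
throughput - which lines up with the UDP numbers below.)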

I'm seeing around 1400 Mbps TCP throughput (a bit more than 1800 UDP),
but I'm expecting more. A bit more than 1800 for UDP is about the max I
can expect on this AP (it only does 8k A-MSDU size), but I'd think TCP
then shouldn't be so much less (and our Windows driver gets >1600).


What I see is that occasionally - and this doesn't happen all the time
but probably enough to matter - we reclaim a few of those large
aggregates and free the transmit SKBs, and then we try to pull from
mac80211's TXQs but they're empty.

At this point - we've just freed 400+k of data, I'd expect TCP to
immediately push more, but it doesn't happen. I sometimes see another
set of reclaims emptying the queue entirely (literally down to 0 packets
on the queue) and it then takes another millisecond or two for TCP to
start pushing packets again.

Once that happens, I also observe that TCP stops pushing large TSO
packets and goes down to sometimes less than a single A-MSDU (i.e.
~7.5k) in a TSO, perhaps even an MTU-sized frame - didn't check this,
only the # of frames we make out of this.


If you have any thoughts on this, I'd appreciate it.


Something I've been wondering is if our TSO implementation causes
issues, but apart from higher CPU usage I see no real difference if I
turn it off. I thought it might because it splits the SKBs into those
A-MSDU-sized chunks using skb_gso_segment(), and then each chunk is
split down into MTU-sized frames all packed together into an A-MSDU by
the hardware engine. But that means we only release a bunch of
A-MSDU-sized SKBs back to the TCP stack once they've been transmitted.
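
Roughly what I mean, as a sketch - this is *not* the actual iwlwifi
code, the helper name is made up and error handling / corner cases are
glossed over. The idea is to lower gso_size to the A-MSDU payload
budget, let skb_gso_segment() do the splitting, and queue the resulting
A-MSDU-sized segments for the hardware to pack:

#include <linux/err.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* sketch only - hypothetical helper, not the real driver code */
static int tx_tso_to_amsdu_sketch(struct sk_buff *skb,
                                  unsigned int amsdu_payload_len,
                                  netdev_features_t netdev_flags,
                                  struct sk_buff_head *out)
{
        unsigned int mss = skb_shinfo(skb)->gso_size;
        struct sk_buff *segs, *next;

        /*
         * Make the GSO layer cut A-MSDU-sized "super-segments" instead
         * of MTU-sized ones; the hardware later splits each of them
         * back into MTU-sized subframes packed into one A-MSDU.
         */
        skb_shinfo(skb)->gso_size = rounddown(amsdu_payload_len, mss);
        segs = skb_gso_segment(skb, netdev_flags);
        if (IS_ERR_OR_NULL(segs))
                return -EINVAL;

        while (segs) {
                next = segs->next;
                segs->next = NULL;
                __skb_queue_tail(out, segs);
                segs = next;
        }

        /* the original TSO skb is no longer needed */
        consume_skb(skb);
        return 0;
}

(The skb ownership/accounting details are glossed over here, and that
part is exactly where the interaction with the TCP stack matters.)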

Another thought I had was our broken NAPI, but this is TX traffic so the
only RX thing is sync, and I'm currently still using kernel 5.4 anyway.

Thanks,
johannes


2019-12-12 15:13:21

by Neal Cardwell

Subject: Re: debugging TCP stalls on high-speed wifi

On Thu, Dec 12, 2019 at 9:50 AM Johannes Berg <[email protected]> wrote:
> If you have any thoughts on this, I'd appreciate it.

Thanks for the detailed report!

I was curious:

o What's the sender's qdisc configuration?

o Would you be able to log periodic dumps (e.g. every 50ms or 100ms)
of the test connection using a recent "ss" binary and "ss -tinm", to
hopefully get a sense of buffer parameters, and whether the flow in
these cases is being cwnd-limited, pacing-limited,
send-buffer-limited, or receive-window-limited?

o Would you be able to share a headers-only tcpdump pcap trace?

thanks,
neal

2019-12-12 15:48:05

by Johannes Berg

Subject: Re: debugging TCP stalls on high-speed wifi

Hi Neal,

On Thu, 2019-12-12 at 10:11 -0500, Neal Cardwell wrote:
> On Thu, Dec 12, 2019 at 9:50 AM Johannes Berg <[email protected]> wrote:
> > If you have any thoughts on this, I'd appreciate it.
>
> Thanks for the detailed report!

Well, it wasn't. For example, I neglected to mention that I have to
actually use at least 2 TCP streams (but have tried up to 20) in order
to not run into the gbit link limit on the AP :-)

> I was curious:
>
> o What's the sender's qdisc configuration?

There's none, mac80211 bypasses qdiscs in favour of its internal TXQs
with FQ/codel.

> o Would you be able to log periodic dumps (e.g. every 50ms or 100ms)
> of the test connection using a recent "ss" binary and "ss -tinm", to
> hopefully get a sense of buffer parameters, and whether the flow in
> these cases is being cwnd-limited, pacing-limited,
> send-buffer-limited, or receive-window-limited?

Sure, though I'm not sure my ss is recent enough for what you had in
mind - if not I'll have to rebuild it (it was iproute2-ss190708).

https://p.sipsolutions.net/3e515625bf13fa69.txt

Note there are 4 connections (iperf is being used) but two are control
and two are data. Easy to see the difference really :-)

> o Would you be able to share a headers-only tcpdump pcap trace?

I'm not sure how to do headers-only, but I guess -s100 will work.

https://johannes.sipsolutions.net/files/he-tcp.pcap.xz

Note that this is from a different run than the ss stats.

johannes

2019-12-12 18:12:09

by Eric Dumazet

Subject: Re: debugging TCP stalls on high-speed wifi



On 12/12/19 7:47 AM, Johannes Berg wrote:
> Hi Neal,
>
> On Thu, 2019-12-12 at 10:11 -0500, Neal Cardwell wrote:
>> On Thu, Dec 12, 2019 at 9:50 AM Johannes Berg <[email protected]> wrote:
>>> If you have any thoughts on this, I'd appreciate it.
>>
>> Thanks for the detailed report!
>
> Well, it wasn't. For example, I neglected to mention that I have to
> actually use at least 2 TCP streams (but have tried up to 20) in order
> to not run into the gbit link limit on the AP :-)
>
>> I was curious:
>>
>> o What's the sender's qdisc configuration?
>
> There's none, mac80211 bypasses qdiscs in favour of its internal TXQs
> with FQ/codel.
>
>> o Would you be able to log periodic dumps (e.g. every 50ms or 100ms)
>> of the test connection using a recent "ss" binary and "ss -tinm", to
>> hopefully get a sense of buffer parameters, and whether the flow in
>> these cases is being cwnd-limited, pacing-limited,
>> send-buffer-limited, or receive-window-limited?
>
> Sure, though I'm not sure my ss is recent enough for what you had in
> mind - if not I'll have to rebuild it (it was iproute2-ss190708).
>
> https://p.sipsolutions.net/3e515625bf13fa69.txt
>
> Note there are 4 connections (iperf is being used) but two are control
> and two are data. Easy to see the difference really :-)
>
>> o Would you be able to share a headers-only tcpdump pcap trace?
>
> I'm not sure how to do headers-only, but I guess -s100 will work.
>
> https://johannes.sipsolutions.net/files/he-tcp.pcap.xz
>

Lack of GRO on receiver is probably what is killing performance,
both for receiver (generating gazillions of acks) and sender (to process all these acks)

I had a plan about enabling compressing ACK as I did for SACK
in commit
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e

But I have not done it yet.
It is a pity because this would tremendously help wifi I am sure.

2019-12-12 21:12:18

by Johannes Berg

Subject: Re: debugging TCP stalls on high-speed wifi

Hi Eric,

Thanks for looking :)

> > I'm not sure how to do headers-only, but I guess -s100 will work.
> >
> > https://johannes.sipsolutions.net/files/he-tcp.pcap.xz
> >
>
> Lack of GRO on receiver is probably what is killing performance,
> both for receiver (generating gazillions of acks) and sender
> (to process all these acks)
Yes, I'm aware of this, to some extent. And I'm not saying we should see
even close to 1800 Mbps like we have with UDP...

Mind you, the biggest thing that kills performance with many ACKs isn't
the load on the system - the sender system is only moderately loaded at
~20-25% of a single core with TSO, and around double that without TSO.
The thing that kills performance is eating up all the medium time with
small non-aggregated packets, due to the half-duplex nature of WiFi.
I know you know, but in case somebody else is reading along :-)

But unless somehow you think processing the (many) ACKs on the sender
will cause it to stop transmitting, or something like that, I don't
think I should be seeing what I described earlier: we sometimes (have
to?) reclaim the entire transmit queue before TCP starts pushing data
again. That's less than 2MB split across at least two TCP streams, I
don't see why we should have to get to 0 (which takes about 7ms) until
more packets come in from TCP?

Or put another way - if I free say 400kB worth of SKBs, what could be
the reason we don't see more packets being sent out of the TCP stack
within the next few ms or so? I guess I have to correlate this somehow
with the ACKs
so I know how much data is outstanding for ACKs. (*)

The sk_pacing_shift is set to 7, btw, which should give us 8ms of
outstanding data. For now in this setup that's enough(**), and indeed
bumping the limit up (setting sk_pacing_shift to say 5) doesn't change
anything. So I think this part we actually solved - I get basically the
same performance and behaviour with two streams (needed due to GBit LAN
on the other side) as with 20 streams.
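
For anyone following along, my understanding of the TSQ side of this -
paraphrased from memory of the ~5.4 tcp_output.c and simplified, so
don't take it as a verbatim copy - is roughly:

#include <linux/skbuff.h>
#include <net/sock.h>

/* rough paraphrase of tcp_small_queue_check() */
static bool tsq_throttled_sketch(struct sock *sk, const struct sk_buff *skb)
{
        unsigned long limit;

        /*
         * Allow pacing_rate >> sk_pacing_shift bytes to sit in the
         * qdisc/driver queues: shift 10 is ~1ms worth of data, shift 7
         * is ~8ms - but always at least two full skbs.  (The real code
         * also clamps this against tcp_limit_output_bytes and scales
         * it in a few cases.)
         */
        limit = max_t(unsigned long, 2 * skb->truesize,
                      sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift));

        /*
         * If more than that is still not freed by TX completions, TCP
         * stops sending until the skb destructors have run.
         */
        return refcount_read(&sk->sk_wmem_alloc) > limit;
}

which is why the timing of when we free the transmitted SKBs matters so
much here.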


> I had a plan about enabling compressing ACK as I did for SACK
> in commit
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e
>
> But I have not done it yet.
> It is a pity because this would tremendously help wifi I am sure.

Nice :-)

But that is something the *receiver* would have to do.

The dirty secret here is that we're getting close to 1700 Mbps TCP with
Windows in place of Linux in the setup, with the same receiver on the
other end (which is actually a single Linux machine with two GBit
network connections to the AP). So if we had this I'm sure it'd increase
performance, but it still wouldn't explain why we're so much slower than
Windows :-)

Now, I'm certainly not saying that TCP behaviour is the only reason for
the difference, we already found an issue for example where due to a
small Windows driver bug some packet extension was always used, and the
AP is also buggy in that it needs the extension but didn't request it
... so the two bugs cancelled each other out and things worked well, but
our Linux driver believed the AP ... :) Certainly there can be more
things like that still, I just started on the TCP side and ran into the
queueing behaviour that I cannot explain.


In any case, I'll try to dig deeper into the TCP stack to understand the
reason for this transmit behaviour.

Thanks,
johannes


(*) Hmm. Now I have another idea. Maybe we have some kind of problem
with the medium access configuration, and we transmit all this data
without the AP having a chance to send back all the ACKs? Too bad I
can't put an air sniffer into the setup - it's a conductive setup.


(**) As another aside to this, the next generation HW after this will
have 256 frames in a block-ack, so that means instead of up to 64 (we
only use 63 for internal reasons) frames aggregated together we'll be
able to aggregate 256 (or maybe again only 255?). Each one of those
frames may be an A-MSDU with ~11k content though (only 8k in the setup I
have here right now), which means we can get a LOT of data into a single
PPDU ... we'll probably have to bump the sk_pacing_shift to be able to
fill that with a single TCP stream, though since we run all our
performance numbers with many streams, maybe we should just leave it :)
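
(Rough numbers for that: 256 * 11 kB is ~2.8 MB in a single PPDU, and
with sk_pacing_shift 7, i.e. ~8ms worth of data at the pacing rate, a
single stream would need to be pacing at around 2.8 MB / 8 ms - close
to 3 Gbps - before TSQ even lets a full PPDU's worth sit on the
queues.)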


2019-12-12 21:31:02

by Ben Greear

Subject: Re: debugging TCP stalls on high-speed wifi

On 12/12/19 1:11 PM, Johannes Berg wrote:
> Hi Eric,
>
> Thanks for looking :)
>
>>> I'm not sure how to do headers-only, but I guess -s100 will work.
>>>
>>> https://johannes.sipsolutions.net/files/he-tcp.pcap.xz
>>>
>>
>> Lack of GRO on receiver is probably what is killing performance,
>> both for receiver (generating gazillions of acks) and sender
>> (to process all these acks)
> Yes, I'm aware of this, to some extent. And I'm not saying we should see
> even close to 1800 Mbps like we have with UDP...
>
> Mind you, the biggest thing that kills performance with many ACKs isn't
> the load on the system - the sender system is only moderately loaded at
> ~20-25% of a single core with TSO, and around double that without TSO.
> The thing that kills performance is eating up all the medium time with
> small non-aggregated packets, due to the half-duplex nature of WiFi.
> I know you know, but in case somebody else is reading along :-)
>
> But unless somehow you think processing the (many) ACKs on the sender
> will cause it to stop transmitting, or something like that, I don't
> think I should be seeing what I described earlier: we sometimes (have
> to?) reclaim the entire transmit queue before TCP starts pushing data
> again. That's less than 2MB split across at least two TCP streams, I
> don't see why we should have to get to 0 (which takes about 7ms) until
> more packets come in from TCP?
>
> Or put another way - if I free say 400kB worth of SKBs, what could be
> the reason we don't see more packets be sent out of the TCP stack within
> the few ms or so? I guess I have to correlate this somehow with the ACKs
> so I know how much data is outstanding for ACKs. (*)
>
> The sk_pacing_shift is set to 7, btw, which should give us 8ms of
> outstanding data. For now in this setup that's enough(**), and indeed
> bumping the limit up (setting sk_pacing_shift to say 5) doesn't change
> anything. So I think this part we actually solved - I get basically the
> same performance and behaviour with two streams (needed due to GBit LAN
> on the other side) as with 20 streams.
>
>
>> I had a plan about enabling compressing ACK as I did for SACK
>> in commit
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e
>>
>> But I have not done it yet.
>> It is a pity because this would tremendously help wifi I am sure.
>
> Nice :-)
>
> But that is something the *receiver* would have to do.
>
> The dirty secret here is that we're getting close to 1700 Mbps TCP with
> Windows in place of Linux in the setup, with the same receiver on the
> other end (which is actually a single Linux machine with two GBit
> network connections to the AP). So if we had this I'm sure it'd increase
> performance, but it still wouldn't explain why we're so much slower than
> Windows :-)
>
> Now, I'm certainly not saying that TCP behaviour is the only reason for
> the difference, we already found an issue for example where due to a
> small Windows driver bug some packet extension was always used, and the
> AP is also buggy in that it needs the extension but didn't request it
> ... so the two bugs cancelled each other out and things worked well, but
> our Linux driver believed the AP ... :) Certainly there can be more
> things like that still, I just started on the TCP side and ran into the
> queueing behaviour that I cannot explain.
>
>
> In any case, I'll try to dig deeper into the TCP stack to understand the
> reason for this transmit behaviour.
>
> Thanks,
> johannes
>
>
> (*) Hmm. Now I have another idea. Maybe we have some kind of problem
> with the medium access configuration, and we transmit all this data
> without the AP having a chance to send back all the ACKs? Too bad I
> can't put an air sniffer into the setup - it's a conductive setup.

splitter/combiner?

If it is just delayed acks coming back, which would slow down a stream, then
multiple streams would tend to work around that problem?

I would actually expect similar speedup with multiple streams if some TCP socket
was blocked on waiting for ACKs too.

Even if you can't sniff the air, you could sniff the wire or just look at packet
in/out counts. If you have a huge number of ACKs, that would show up in raw pkt
counters.

I'm not sure it matters these days, but this patch greatly helped TCP throughput on
ath10k for a while, and we are still using it. Maybe your sk_pacing change already
tweaked the same logic:

https://github.com/greearb/linux-ct-5.4/commit/65651d4269eb2b0d4b4952483c56316a7fbe2f48

if [ -w /proc/sys/net/ipv4/tcp_tsq_limit_output_interval ]
then
    # This helps TCP tx throughput when using ath10k. Setting > 1 likely
    # increases latency in some cases, but on average, seems a win for us.
    TCP_TSQ=200
    echo -n "Setting TCP-tsq limit to $TCP_TSQ....................................... "
    echo $TCP_TSQ > /proc/sys/net/ipv4/tcp_tsq_limit_output_interval
    echo "DONE"
fi


Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-12 21:58:46

by Johannes Berg

Subject: Re: debugging TCP stalls on high-speed wifi

On Thu, 2019-12-12 at 13:29 -0800, Ben Greear wrote:
>
> > (*) Hmm. Now I have another idea. Maybe we have some kind of problem
> > with the medium access configuration, and we transmit all this data
> > without the AP having a chance to send back all the ACKs? Too bad I
> > can't put an air sniffer into the setup - it's a conductive setup.
>
> splitter/combiner?

I guess. I haven't looked at it, it's halfway around the world or
something :)

> If it is just delayed acks coming back, which would slow down a stream, then
> multiple streams would tend to work around that problem?

Only a bit, because it allows somewhat more outstanding data. But each
stream estimates the throughput lower in its congestion control
algorithm, so it would have a smaller window size?

What I was thinking is that if we have some kind of skew in the system
and always/frequently/sometimes make our transmissions have priority
over the AP transmissions, then we'd not get ACKs back, and that might
cause what I see - the queue drains entirely and *then* we get an ACK
back...

That's not a _bad_ theory and I'll have to find a good way to test it,
but I'm not entirely convinced that's the problem.

Oh, actually, I guess I know it's *not* the problem because otherwise
the ss output would show we're blocked on congestion window far more
than it looks like now? I think?


> I would actually expect similar speedup with multiple streams if some TCP socket
> was blocked on waiting for ACKs too.
>
> Even if you can't sniff the air, you could sniff the wire or just look at packet
> in/out counts. If you have a huge number of ACKs, that would show up in raw pkt
> counters.

I know I have a huge number of ACKs, but I also know that's not the
(only) problem. My question/observation was related to the timing of
them.

> I'm not sure it matters these days, but this patch greatly helped TCP throughput on
> ath10k for a while, and we are still using it. Maybe your sk_pacing change already
> tweaked the same logic:
>
> https://github.com/greearb/linux-ct-5.4/commit/65651d4269eb2b0d4b4952483c56316a7fbe2f48

Yes, you should be able to drop that patch - look at it, it just
multiplies the thing there that you have with "sk->sk_pacing_shift".
Instead, we currently set sk->sk_pacing_shift to 7 by default rather
than 10 or something, so that'd be equivalent to setting your sysctl
to 8.

> TCP_TSQ=200

Setting it to 200 is way excessive. In particular since you already get
the *8 from the default mac80211 behaviour, so now you effectively have
*1600, which means instead of 1ms you can have 1.6s worth of TCP data on
the queues ... way too much :)
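
To spell out the arithmetic: the TCP default of sk_pacing_shift 10 is
~1ms of data below the stack, mac80211's default of 7 is ~8ms (that's
the *8), and your sysctl then multiplies that by another 200:

        1 ms * 8 * 200 = 1.6 s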

johannes

2019-12-12 22:00:09

by Eric Dumazet

Subject: Re: debugging TCP stalls on high-speed wifi



On 12/12/19 1:11 PM, Johannes Berg wrote:
> Hi Eric,
>
> Thanks for looking :)
>
>>> I'm not sure how to do headers-only, but I guess -s100 will work.
>>>
>>> https://johannes.sipsolutions.net/files/he-tcp.pcap.xz
>>>
>>
>> Lack of GRO on receiver is probably what is killing performance,
>> both for receiver (generating gazillions of acks) and sender
>> (to process all these acks)
> Yes, I'm aware of this, to some extent. And I'm not saying we should see
> even close to 1800 Mbps like we have with UDP...
>
> Mind you, the biggest thing that kills performance with many ACKs isn't
> the load on the system - the sender system is only moderately loaded at
> ~20-25% of a single core with TSO, and around double that without TSO.
> The thing that kills performance is eating up all the medium time with
> small non-aggregated packets, due to the half-duplex nature of WiFi.
> I know you know, but in case somebody else is reading along :-)
>
> But unless somehow you think processing the (many) ACKs on the sender
> will cause it to stop transmitting, or something like that, I don't
> think I should be seeing what I described earlier: we sometimes (have
> to?) reclaim the entire transmit queue before TCP starts pushing data
> again. That's less than 2MB split across at least two TCP streams, I
> don't see why we should have to get to 0 (which takes about 7ms) until
> more packets come in from TCP?
>
> Or put another way - if I free say 400kB worth of SKBs, what could be
> the reason we don't see more packets be sent out of the TCP stack within
> the few ms or so? I guess I have to correlate this somehow with the ACKs
> so I know how much data is outstanding for ACKs. (*)
>
> The sk_pacing_shift is set to 7, btw, which should give us 8ms of
> outstanding data. For now in this setup that's enough(**), and indeed
> bumping the limit up (setting sk_pacing_shift to say 5) doesn't change
> anything. So I think this part we actually solved - I get basically the
> same performance and behaviour with two streams (needed due to GBit LAN
> on the other side) as with 20 streams.
>
>
>> I had a plan about enabling compressing ACK as I did for SACK
>> in commit
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e
>>
>> But I have not done it yet.
>> It is a pity because this would tremendously help wifi I am sure.
>
> Nice :-)
>
> But that is something the *receiver* would have to do.

Yes, this is the plan. Eventually the receiver gets smarter.

>
> The dirty secret here is that we're getting close to 1700 Mbps TCP with
> Windows in place of Linux in the setup, with the same receiver on the
> other end (which is actually a single Linux machine with two GBit
> network connections to the AP). So if we had this I'm sure it'd increase
> performance, but it still wouldn't explain why we're so much slower than
> Windows :-)
>

I presume you could hack TCP to no longer care about bufferbloat and you'll
probably match Windows 'performance' on a single flow and a lossless network.

Ie always send ~64KB TSO packets and fill the queues, inflating RTT.

Then, in presence of losses, you get a problem because the retransmit packets
can only be sent _after_ the huge queue that has been put on the sender.

If only TCP could predict the future ;)

2019-12-12 22:00:18

by Johannes Berg

Subject: Re: debugging TCP stalls on high-speed wifi

On Thu, 2019-12-12 at 13:50 -0800, Eric Dumazet wrote:

> I presume you could hack TCP to no longer care about bufferbloat and you'll
> probably match Windows 'performance' on a single flow and a lossless network.
>
> Ie always send ~64KB TSO packets and fill the queues, inflating RTT.
>
> Then, in presence of losses, you get a problem because the retransmit packets
> can only be sent _after_ the huge queue that has been put on the sender.

Sure, I'll do it as an experiment. Got any suggestions on how to do
that?

I tried to decrease sk_pacing_shift even further with no effect, I've
also tried many more TCP streams with no effect.

I don't even care about a single flow that much, honestly. The default
test is like 10 or 20 flows anyway.

> If only TCP could predict the future ;)

:)

johannes

2019-12-12 22:11:43

by Ben Greear

Subject: Re: debugging TCP stalls on high-speed wifi

On 12/12/19 1:46 PM, Johannes Berg wrote:
> On Thu, 2019-12-12 at 13:29 -0800, Ben Greear wrote:
>>
>>> (*) Hmm. Now I have another idea. Maybe we have some kind of problem
>>> with the medium access configuration, and we transmit all this data
>>> without the AP having a chance to send back all the ACKs? Too bad I
>>> can't put an air sniffer into the setup - it's a conductive setup.
>>
>> splitter/combiner?
>
> I guess. I haven't looked at it, it's halfway around the world or
> something :)
>
>> If it is just delayed acks coming back, which would slow down a stream, then
>> multiple streams would tend to work around that problem?
>
> Only a bit, because it allows somewhat more outstanding data. But each
> stream estimates the throughput lower in its congestion control
> algorithm, so it would have a smaller window size?
>
> What I was thinking is that if we have some kind of skew in the system
> and always/frequently/sometimes make our transmissions have priority
> over the AP transmissions, then we'd not get ACKs back, and that might
> cause what I see - the queue drains entirely and *then* we get an ACK
> back...
>
> That's not a _bad_ theory and I'll have to find a good way to test it,
> but I'm not entirely convinced that's the problem.
>
> Oh, actually, I guess I know it's *not* the problem because otherwise
> the ss output would show we're blocked on congestion window far more
> than it looks like now? I think?

If you get the rough packet counters or characteristics, you could set up UDP flows to mimic
download and upload packet behaviour and run them concurrently. If you can still push a good bit more UDP up even
with small UDP packets emulating TCP acks coming down, then I think you can be
confident that it is not ACKs clogging up the RF or AP being starved for airtime.

Since the Windows driver works better, it's probably not much to do with ACKs or
downstream traffic anyway.

>> TCP_TSQ=200
>
> Setting it to 200 is way excessive. In particular since you already get
> the *8 from the default mac80211 behaviour, so now you effectively have
> *1600, which means instead of 1ms you can have 1.6s worth of TCP data on
> the queues ... way too much :)

Yes, this was hacked in back when the pacing did not work well with ath10k.
I'll do some tests to see how much this matters on modern kernels when I get
a chance.

This will allow huge congestion control windows....

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2019-12-12 23:46:46

by Dave Taht

Subject: Re: debugging TCP stalls on high-speed wifi

On Thu, Dec 12, 2019 at 1:12 PM Johannes Berg <[email protected]> wrote:
>
> Hi Eric,
>
> Thanks for looking :)
>
> > > I'm not sure how to do headers-only, but I guess -s100 will work.
> > >
> > > https://johannes.sipsolutions.net/files/he-tcp.pcap.xz
> > >
> >
> > Lack of GRO on receiver is probably what is killing performance,
> > both for receiver (generating gazillions of acks) and sender
> > (to process all these acks)
> Yes, I'm aware of this, to some extent. And I'm not saying we should see
> even close to 1800 Mbps like we have with UDP...
>
> Mind you, the biggest thing that kills performance with many ACKs isn't
> the load on the system - the sender system is only moderately loaded at
> ~20-25% of a single core with TSO, and around double that without TSO.
> The thing that kills performance is eating up all the medium time with
> small non-aggregated packets, due to the half-duplex nature of WiFi.
> I know you know, but in case somebody else is reading along :-)

I'm paying attention but pay attention faster if you cc make-wifi-fast.

If you captured the air you'd probably see the sender winning the
election for airtime 2 or more times in a row,
it's random and oft dependent on a variety of factors.

Most Wifi is *not* "half" duplex, which implies it ping pongs between
send and receive.

>
> But unless somehow you think processing the (many) ACKs on the sender
> will cause it to stop transmitting, or something like that, I don't
> think I should be seeing what I described earlier: we sometimes (have
> to?) reclaim the entire transmit queue before TCP starts pushing data
> again. That's less than 2MB split across at least two TCP streams, I
> don't see why we should have to get to 0 (which takes about 7ms) until
> more packets come in from TCP?

Perhaps having a budget for ack processing within a 1ms window?

> Or put another way - if I free say 400kB worth of SKBs, what could be
> the reason we don't see more packets be sent out of the TCP stack within
> the few ms or so? I guess I have to correlate this somehow with the ACKs
> so I know how much data is outstanding for ACKs. (*)

yes.

It would be interesting to repeat this test in ht20 mode, and/or using

flent --socket-stats --step-size=.04 --te=upload_streams=2 -t
whatever_variant_of_test tcp_nup

That will capture some of the tcp stats for you.

>
> The sk_pacing_shift is set to 7, btw, which should give us 8ms of
> outstanding data. For now in this setup that's enough(**), and indeed
> bumping the limit up (setting sk_pacing_shift to say 5) doesn't change
> anything. So I think this part we actually solved - I get basically the
> same performance and behaviour with two streams (needed due to GBit LAN
> on the other side) as with 20 streams.
>
>
> > I had a plan about enabling compressing ACK as I did for SACK
> > in commit
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e
> >
> > But I have not done it yet.
> > It is a pity because this would tremendously help wifi I am sure.
>
> Nice :-)
>
> But that is something the *receiver* would have to do.

Well it is certainly feasible to thin acks on the driver as we did in
cake. More general. More cpu intensive. I'm happily just awaiting
eric's work instead.

One thing comcast inadvertently does to most flows is remark them cs1,
which tosses big data into the bk queue and acks into the be queue. It
actually helps sometimes.

>
> The dirty secret here is that we're getting close to 1700 Mbps TCP with
> Windows in place of Linux in the setup, with the same receiver on the
> other end (which is actually a single Linux machine with two GBit
> network connections to the AP). So if we had this I'm sure it'd increase
> performance, but it still wouldn't explain why we're so much slower than
> Windows :-)
>
> Now, I'm certainly not saying that TCP behaviour is the only reason for
> the difference, we already found an issue for example where due to a
> small Windows driver bug some packet extension was always used, and the
> AP is also buggy in that it needs the extension but didn't request it
> ... so the two bugs cancelled each other out and things worked well, but
> our Linux driver believed the AP ... :) Certainly there can be more
> things like that still, I just started on the TCP side and ran into the
> queueing behaviour that I cannot explain.
>
>
> In any case, I'll try to dig deeper into the TCP stack to understand the
> reason for this transmit behaviour.
>
> Thanks,
> johannes
>
>
> (*) Hmm. Now I have another idea. Maybe we have some kind of problem
> with the medium access configuration, and we transmit all this data
> without the AP having a chance to send back all the ACKs? Too bad I
> can't put an air sniffer into the setup - it's a conductive setup.

see above
>
>
> (**) As another aside to this, the next generation HW after this will
> have 256 frames in a block-ack, so that means instead of up to 64 (we
> only use 63 for internal reasons) frames aggregated together we'll be
> able to aggregate 256 (or maybe we again only 255?).

My fervent wish is to somehow be able to mark every frame we can as not
needing a retransmit in future standards. I've lost track of what ax
can do. ? And for block ack retries
to give up far sooner.

you can safely drop all but the last three acks in a flow, and the
txop itself provides
a suitable clock.

And, ya know, releasing packets ooo doesn't hurt as much as it used
to, with rack.
> Each one of those
> frames may be an A-MSDU with ~11k content though (only 8k in the setup I
> have here right now), which means we can get a LOT of data into a single
> PPDU ...

Just wearing my usual hat, I would prefer to optimize for service
time, not bandwidth, in the future,
using smaller txops with more data in them, rather than the biggest
txops possible.

If you constrain your max txop to 2ms in this test, you will see tcp
in slow start ramp up faster,
and the ap scale to way more devices, with way less jitter and
retries. Most flows never get out of slowstart.

> . we'll probably have to bump the sk_pacing_shift to be able to
> fill that with a single TCP stream, though since we run all our
> performance numbers with many streams, maybe we should just leave it :)

Please. Optimizing for single flow performance is an academic's game.

>
>


--
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729

2019-12-13 01:04:19

by Simon Barber

Subject: Re: [Make-wifi-fast] debugging TCP stalls on high-speed wifi

I’m currently adding ACK thinning to Linux’s GRO code. Quite a simple addition given the way that code works.

Simon


> On Dec 12, 2019, at 3:42 PM, Dave Taht <[email protected]> wrote:
>
> On Thu, Dec 12, 2019 at 1:12 PM Johannes Berg <[email protected]> wrote:
>>
>> Hi Eric,
>>
>> Thanks for looking :)
>>
>>>> I'm not sure how to do headers-only, but I guess -s100 will work.
>>>>
>>>> https://johannes.sipsolutions.net/files/he-tcp.pcap.xz
>>>>
>>>
>>> Lack of GRO on receiver is probably what is killing performance,
>>> both for receiver (generating gazillions of acks) and sender
>>> (to process all these acks)
>> Yes, I'm aware of this, to some extent. And I'm not saying we should see
>> even close to 1800 Mbps like we have with UDP...
>>
>> Mind you, the biggest thing that kills performance with many ACKs isn't
>> the load on the system - the sender system is only moderately loaded at
>> ~20-25% of a single core with TSO, and around double that without TSO.
>> The thing that kills performance is eating up all the medium time with
>> small non-aggregated packets, due to the half-duplex nature of WiFi.
>> I know you know, but in case somebody else is reading along :-)
>
> I'm paying attention but pay attention faster if you cc make-wifi-fast.
>
> If you captured the air you'd probably see the sender winning the
> election for airtime 2 or more times in a row,
> it's random and oft dependent on a variety of factors.
>
> Most Wifi is *not* "half" duplex, which implies it ping pongs between
> send and receive.
>
>>
>> But unless somehow you think processing the (many) ACKs on the sender
>> will cause it to stop transmitting, or something like that, I don't
>> think I should be seeing what I described earlier: we sometimes (have
>> to?) reclaim the entire transmit queue before TCP starts pushing data
>> again. That's less than 2MB split across at least two TCP streams, I
>> don't see why we should have to get to 0 (which takes about 7ms) until
>> more packets come in from TCP?
>
> Perhaps having a budget for ack processing within a 1ms window?
>
>> Or put another way - if I free say 400kB worth of SKBs, what could be
>> the reason we don't see more packets be sent out of the TCP stack within
>> the few ms or so? I guess I have to correlate this somehow with the ACKs
>> so I know how much data is outstanding for ACKs. (*)
>
> yes.
>
> It would be interesting to repeat this test in ht20 mode, and/or using
>
> flent --socket-stats --step-size=.04 --te=upload_streams=2 -t
> whatever_variant_of_test tcp_nup
>
> That will capture some of the tcp stats for you.
>
>>
>> The sk_pacing_shift is set to 7, btw, which should give us 8ms of
>> outstanding data. For now in this setup that's enough(**), and indeed
>> bumping the limit up (setting sk_pacing_shift to say 5) doesn't change
>> anything. So I think this part we actually solved - I get basically the
>> same performance and behaviour with two streams (needed due to GBit LAN
>> on the other side) as with 20 streams.
>>
>>
>>> I had a plan about enabling compressing ACK as I did for SACK
>>> in commit
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e
>>>
>>> But I have not done it yet.
>>> It is a pity because this would tremendously help wifi I am sure.
>>
>> Nice :-)
>>
>> But that is something the *receiver* would have to do.
>
> Well it is certainly feasible to thin acks on the driver as we did in
> cake. More general. More cpu intensive. I'm happily just awaiting
> eric's work instead.
>
> One thing comcast inadvertently does to most flows is remark them cs1,
> which tosses big data into the bk queue and acks into the be queue. It
> actually helps sometimes.
>
>>
>> The dirty secret here is that we're getting close to 1700 Mbps TCP with
>> Windows in place of Linux in the setup, with the same receiver on the
>> other end (which is actually a single Linux machine with two GBit
>> network connections to the AP). So if we had this I'm sure it'd increase
>> performance, but it still wouldn't explain why we're so much slower than
>> Windows :-)
>>
>> Now, I'm certainly not saying that TCP behaviour is the only reason for
>> the difference, we already found an issue for example where due to a
>> small Windows driver bug some packet extension was always used, and the
>> AP is also buggy in that it needs the extension but didn't request it
>> ... so the two bugs cancelled each other out and things worked well, but
>> our Linux driver believed the AP ... :) Certainly there can be more
>> things like that still, I just started on the TCP side and ran into the
>> queueing behaviour that I cannot explain.
>>
>>
>> In any case, I'll try to dig deeper into the TCP stack to understand the
>> reason for this transmit behaviour.
>>
>> Thanks,
>> johannes
>>
>>
>> (*) Hmm. Now I have another idea. Maybe we have some kind of problem
>> with the medium access configuration, and we transmit all this data
>> without the AP having a chance to send back all the ACKs? Too bad I
>> can't put an air sniffer into the setup - it's a conductive setup.
>
> see above
>>
>>
>> (**) As another aside to this, the next generation HW after this will
>> have 256 frames in a block-ack, so that means instead of up to 64 (we
>> only use 63 for internal reasons) frames aggregated together we'll be
>> able to aggregate 256 (or maybe we again only 255?).
>
> My fervent wish is to somehow be able to mark every frame we can as not
> needing a retransmit in future standards. I've lost track of what ax
> can do. ? And for block ack retries
> to give up far sooner.
>
> you can safely drop all but the last three acks in a flow, and the
> txop itself provides
> a suitable clock.
>
> And, ya know, releasing packets ooo doesn't hurt as much as it used
> to, with rack.
>> Each one of those
>> frames may be an A-MSDU with ~11k content though (only 8k in the setup I
>> have here right now), which means we can get a LOT of data into a single
>> PPDU ...
>
> Just wearing my usual hat, I would prefer to optimize for service
> time, not bandwidth, in the future,
> using smaller txops with this more data in them, than the biggest
> txops possible.
>
> If you constrain your max txop to 2ms in this test, you will see tcp
> in slow start ramp up faster,
> and the ap scale to way more devices, with way less jitter and
> retries. Most flows never get out of slowstart.
>
>> . we'll probably have to bump the sk_pacing_shift to be able to
>> fill that with a single TCP stream, though since we run all our
>> performance numbers with many streams, maybe we should just leave it :)
>
> Please. Optimizing for single flow performance is an academic's game.
>
>>
>>
>
>
> --
> Make Music, Not War
>
> Dave Täht
> CTO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-831-435-0729
> _______________________________________________
> Make-wifi-fast mailing list
> [email protected]
> https://lists.bufferbloat.net/listinfo/make-wifi-fast

2019-12-13 01:54:46

by Eric Dumazet

Subject: Re: [Make-wifi-fast] debugging TCP stalls on high-speed wifi



On 12/12/19 4:59 PM, Simon Barber wrote:
> I’m currently adding ACK thinning to Linux’s GRO code. Quite a simple addition given the way that code works.
>
> Simon
>
>

Please don't.

1) It will not help since many NICs do not use GRO.

2) This does not help if you receive one ACK per NIC interrupt, which is quite common.

3) This breaks GRO transparency.

4) TCP can implement this in a more effective/controlled way,
since the peer knows a lot more flow characteristics.

Middle-boxes should not try to make TCP better; they usually break things.

2019-12-13 01:59:16

by Simon Barber

Subject: Re: [Make-wifi-fast] debugging TCP stalls on high-speed wifi

In my application this is a bridge or router (not TCP endpoint), and the driver is doing GRO and NAPI polling. Also looking at using the skb->fraglist to make the GRO code more effective and more transparent by passing flags, short segments, etc through for perfect reconstruction by TSO.

Simon

> On Dec 12, 2019, at 5:46 PM, Eric Dumazet <[email protected]> wrote:
>
>
>
> On 12/12/19 4:59 PM, Simon Barber wrote:
>> I’m currently adding ACK thinning to Linux’s GRO code. Quite a simple addition given the way that code works.
>>
>> Simon
>>
>>
>
> Please don't.
>
> 1) It will not help since many NIC do not use GRO.
>
> 2) This does not help if you receive one ACK per NIC interrupt, which is quite common.
>
> 3) This breaks GRO transparency.
>
> 4) TCP can implement this in a more effective/controlled way,
> since the peer know a lot more flow characteristics.
>
> Middle-box should not try to make TCP better, they usually break things.

2019-12-13 04:16:12

by Justin Capella

Subject: Re: debugging TCP stalls on high-speed wifi

Could TCP window size (waiting for ACKnowledgements) be a contributing factor?

On Thu, Dec 12, 2019 at 6:52 AM Johannes Berg <[email protected]> wrote:
>
> Hi Eric, all,
>
> I've been debugging (much thanks to bpftrace) TCP stalls on wifi, in
> particular on iwlwifi.
>
> What happens, essentially, is that we transmit large aggregates (63
> packets of 7.5k A-MSDU size each, for something on the order of 500kB
> per PPDU). Theoretically we can have ~240 A-MSDUs on our hardware
> queues, and the hardware aggregates them into up to 63 to send as a
> single PPDU.
>
> At HE rates (160 MHz, high rates) such a large PPDU takes less than 2ms
> to transmit.
>
> I'm seeing around 1400 Mbps TCP throughput (a bit more than 1800 UDP),
> but I'm expecting more. A bit more than 1800 for UDP is about the max I
> can expect on this AP (it only does 8k A-MSDU size), but I'd think TCP
> then shouldn't be so much less (and our Windows drivers gets >1600).
>
>
> What I see is that occasionally - and this doesn't happen all the time
> but probably enough to matter - we reclaim a few of those large
> aggregates and free the transmit SKBs, and then we try to pull from
> mac80211's TXQs but they're empty.
>
> At this point - we've just freed 400+k of data, I'd expect TCP to
> immediately push more, but it doesn't happen. I sometimes see another
> set of reclaims emptying the queue entirely (literally down to 0 packets
> on the queue) and it then takes another millisecond or two for TCP to
> start pushing packets again.
>
> Once that happens, I also observe that TCP stops pushing large TSO
> packets and goes down to sometimes less than a single A-MSDU (i.e.
> ~7.5k) in a TSO, perhaps even an MTU-sized frame - didn't check this,
> only the # of frames we make out of this.
>
>
> If you have any thoughts on this, I'd appreciate it.
>
>
> Something I've been wondering is if our TSO implementation causes
> issues, but apart from higher CPU usage I see no real difference if I
> turn it off. I thought it might because it splits the SKBs into those
> A-MSDU-sized chunks using skb_gso_segment(), and then each chunk is
> split down into MTU-sized frames all packed together into an A-MSDU by
> the hardware engine. But that means we only release a bunch of
> A-MSDU-sized SKBs back to the TCP stack once they've been transmitted.
>
> Another thought I had was our broken NAPI, but this is TX traffic so the
> only RX thing is sync, and I'm currently still using kernel 5.4 anyway.
>
> Thanks,
> johannes
>

2019-12-13 04:43:18

by Dave Taht

Subject: Re: [Make-wifi-fast] debugging TCP stalls on high-speed wifi

On Thu, Dec 12, 2019 at 5:46 PM Eric Dumazet <[email protected]> wrote:
>
>
>
> On 12/12/19 4:59 PM, Simon Barber wrote:
> > I’m currently adding ACK thinning to Linux’s GRO code. Quite a simple addition given the way that code works.
> >
> > Simon
> >
> >
>
> Please don't.
>
> 1) It will not help since many NIC do not use GRO.
>
> 2) This does not help if you receive one ACK per NIC interrupt, which is quite common.

Packets accumulate in the wifi device and driver, if that's the bottleneck.

>
> 3) This breaks GRO transparency.
>
> 4) TCP can implement this in a more effective/controlled way,
> since the peer know a lot more flow characteristics.
>
> Middle-box should not try to make TCP better, they usually break things.

I generally have more hope for open source attempts at this than other
means. And there isn't much left
in TCP that will change in the future; it is an ossified protocol.

802.11n, at least, has a problem fitting many packets into an
aggregate. Sending fewer packets is a win
in multiple ways:

A) Improves bi-directional throughput
B) Reduces the size of the receiver's txop (and retries) - the client
is also often running at a lower rate than
the ap.
C) Delivers the most current ack, sooner

When further transiting an aqm that uses random numbers, it hits the
right packet sooner, also.

I welcome experimentation in this area.



--
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729

2019-12-13 07:44:08

by Johannes Berg

Subject: Re: debugging TCP stalls on high-speed wifi

On Thu, 2019-12-12 at 20:15 -0800, Justin Capella wrote:
> Could TCP window size (waiting for ACKnowledgements) be a contributing factor?

Quite possibly, although even with a large number of flows?

I thought we had a TCP_CHRONO_ for it but we don't, maybe I'll try to
add one for debug.
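
Something like the following is what I have in mind - purely a sketch,
TCP_CHRONO_CWND_LIMITED doesn't exist today and I haven't checked where
exactly in tcp_write_xmit() the hooks would have to go:

/*
 * include/net/tcp.h - the existing values are real, the new one is
 * made up for this debugging idea
 */
enum tcp_chrono {
        TCP_CHRONO_UNSPEC,
        TCP_CHRONO_BUSY,
        TCP_CHRONO_RWND_LIMITED,
        TCP_CHRONO_SNDBUF_LIMITED,
        TCP_CHRONO_CWND_LIMITED,        /* new, hypothetical */
        __TCP_CHRONO_MAX,
};

/* and then, roughly, in the tcp_write_xmit() send loop: */
        cwnd_quota = tcp_cwnd_test(tp, skb);
        if (!cwnd_quota) {
                tcp_chrono_start(sk, TCP_CHRONO_CWND_LIMITED);
                break;
        }
        tcp_chrono_stop(sk, TCP_CHRONO_CWND_LIMITED);

plus exporting the accumulated time via tcp_get_info() so it shows up
in ss next to rwnd_limited/sndbuf_limited.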

johannes

2019-12-13 08:11:22

by Johannes Berg

Subject: Re: debugging TCP stalls on high-speed wifi

On Thu, 2019-12-12 at 15:42 -0800, Dave Taht wrote:

> If you captured the air you'd probably see the sender winning the
> election for airtime 2 or more times in a row,
> it's random and oft dependent on on a variety of factors.

I'm going to try to capture more details - I can probably extract this
out of the firmware but it's more effort.

> Most Wifi is *not* "half" duplex, which implies it ping pongs between
> send and receive.

That's an interesting definition of "half duplex" which doesn't really
match anything that I've seen used or in the literature? What you're
describing sounds more like some sort of "half duplex with token-based
flow control" or something like that to me ...

> > But unless somehow you think processing the (many) ACKs on the sender
> > will cause it to stop transmitting, or something like that, I don't
> > think I should be seeing what I described earlier: we sometimes (have
> > to?) reclaim the entire transmit queue before TCP starts pushing data
> > again. That's less than 2MB split across at least two TCP streams, I
> > don't see why we should have to get to 0 (which takes about 7ms) until
> > more packets come in from TCP?
>
> Perhaps having a budget for ack processing within a 1ms window?

What do you mean? There's such a budget? What kind of budget? I have
plenty of CPU time left, as far as I can tell.

> It would be interesting to repeat this test in ht20 mode,

Why? HT20 is far slower - what would be the advantage? In my experience I
don't hit this until I get to HE80.

> flent --socket-stats --step-size=.04 --te=upload_streams=2 -t
> whatever_variant_of_test tcp_nup
>
> That will capture some of the tcp stats for you.

I guess I can try, but the upload_streams=2 won't actually help - I need
to run towards two different IP addresses - remember that I'm otherwise
limited by a GBit LAN link on the other side right now.

> > But that is something the *receiver* would have to do.
>
> Well it is certainly feasible to thin acks on the driver as we did in
> cake.

I really don't think it would help in my case: either the ACKs are the
problem (which I doubt) and then they're the problem on the air, or
they're not the problem since I have plenty of CPU time to waste on them
...

> One thing comcast inadvertently does to most flows is remark them cs1,
> which tosses big data into the bk queue and acks into the be queue. It
> actually helps sometimes.

I thought about doing this but if I make my flows BK it halves my
throughput (perhaps due to the more than double AIFSN?)

> > (**) As another aside to this, the next generation HW after this will
> > have 256 frames in a block-ack, so that means instead of up to 64 (we
> > only use 63 for internal reasons) frames aggregated together we'll be
> > able to aggregate 256 (or maybe we again only 255?).
>
> My fervent wish is to somehow be able to mark every frame we can as not
> needing a retransmit in future standards.

This can be done since ... oh I don't know, probably 2005 with the
802.11e amendment? Not sure off the top of my head how it interacts with
A-MPDUs though, and probably has bugs if you do that.

> I've lost track of what ax
> can do. ? And for block ack retries
> to give up far sooner.

You can do that too, it's just a local configuration how much you try
each packet. If you give up you leave a hole in the reorder window, but
if you start sending packets that are further ahead than the window, the
old ones will (have to be) released regardless.

> you can safely drop all but the last three acks in a flow, and the
> txop itself provides
> a suitable clock.

Now that's more tricky because once you stick the packets into the
hardware queue you likely have to decide whether or not they're
important.

I can probably think of ways of working around that (similar to the
table-based rate scaling we use), but it's tricky.

> And, ya know, releasing packets ooo doesn't hurt as much as it used
> to, with rack.

:)
That I think is not currently possible with A-MPDUs. It'd also still
have to be opt-in per frame since you can't really do that for anything
but TCP (and probably QUIC? Maybe SCTP?)

> Just wearing my usual hat, I would prefer to optimize for service
> time, not bandwidth, in the future,
> using smaller txops with this more data in them, than the biggest
> txops possible.

Patience. We're getting there now. HE will allow the AP to schedule
everything, and then you don't need TXOPs anymore. The problem is that
winning a TXOP is costly, so you *need* to put as much as possible into
it for good performance.

With HE and the AP scheduling, you win some, you lose some. The client
will lose the ability to actually make any decisions about its transmit
rate and things like that, but the AP can schedule & poll the clients
better without all the overhead.

> If you constrain your max txop to 2ms in this test, you will see tcp
> in slow start ramp up faster,
> and the ap scale to way more devices, with way less jitter and
> retries. Most flows never get out of slowstart.

I'm running a client ... you're forgetting that there's something else
that's actually talking to the AP you're thinking of :-)

> > . we'll probably have to bump the sk_pacing_shift to be able to
> > fill that with a single TCP stream, though since we run all our
> > performance numbers with many streams, maybe we should just leave it :)
>
> Please. Optimizing for single flow performance is an academic's game.

Same here, kinda.

johannes

2019-12-13 09:10:32

by Krishna Chaitanya

Subject: Re: debugging TCP stalls on high-speed wifi

On Fri, Dec 13, 2019 at 2:43 AM Johannes Berg <[email protected]> wrote:
>
> Hi Eric,
>
> Thanks for looking :)
>
> > > I'm not sure how to do headers-only, but I guess -s100 will work.
> > >
> > > https://johannes.sipsolutions.net/files/he-tcp.pcap.xz
> > >
> >
> > Lack of GRO on receiver is probably what is killing performance,
> > both for receiver (generating gazillions of acks) and sender
> > (to process all these acks)
> Yes, I'm aware of this, to some extent. And I'm not saying we should see
> even close to 1800 Mbps like we have with UDP...
>
> Mind you, the biggest thing that kills performance with many ACKs isn't
> the load on the system - the sender system is only moderately loaded at
> ~20-25% of a single core with TSO, and around double that without TSO.
> The thing that kills performance is eating up all the medium time with
> small non-aggregated packets, due to the half-duplex nature of WiFi.
> I know you know, but in case somebody else is reading along :-)
>
> But unless somehow you think processing the (many) ACKs on the sender
> will cause it to stop transmitting, or something like that, I don't
> think I should be seeing what I described earlier: we sometimes (have
> to?) reclaim the entire transmit queue before TCP starts pushing data
> again. That's less than 2MB split across at least two TCP streams, I
> don't see why we should have to get to 0 (which takes about 7ms) until
> more packets come in from TCP?
>
> Or put another way - if I free say 400kB worth of SKBs, what could be
> the reason we don't see more packets be sent out of the TCP stack within
> the few ms or so? I guess I have to correlate this somehow with the ACKs
> so I know how much data is outstanding for ACKs. (*)
Maybe try 'reno' instead of 'cubic' to see if congestion control is
being too careful? In my experiments a while ago reno was a bit more
aggressive, esp. in less lossy environments.
>
>
> The sk_pacing_shift is set to 7, btw, which should give us 8ms of
> outstanding data. For now in this setup that's enough(**), and indeed
> bumping the limit up (setting sk_pacing_shift to say 5) doesn't change
> anything. So I think this part we actually solved - I get basically the
> same performance and behaviour with two streams (needed due to GBit LAN
> on the other side) as with 20 streams.
As you have said CPU util is low, maybe try disabling RSS (as we are
using 2 streams) and see if that is causing any concurrency issues?
>
>
> > I had a plan about enabling compressing ACK as I did for SACK
> > in commit
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e
> >
> > But I have not done it yet.
> > It is a pity because this would tremendously help wifi I am sure.
>
> Nice :-)
>
> But that is something the *receiver* would have to do.
>
> The dirty secret here is that we're getting close to 1700 Mbps TCP with
> Windows in place of Linux in the setup, with the same receiver on the
> other end (which is actually a single Linux machine with two GBit
> network connections to the AP). So if we had this I'm sure it'd increase
> performance, but it still wouldn't explain why we're so much slower than
> Windows :-)
>
> Now, I'm certainly not saying that TCP behaviour is the only reason for
> the difference, we already found an issue for example where due to a
> small Windows driver bug some packet extension was always used, and the
> AP is also buggy in that it needs the extension but didn't request it
> ... so the two bugs cancelled each other out and things worked well, but
> our Linux driver believed the AP ... :) Certainly there can be more
> things like that still, I just started on the TCP side and ran into the
> queueing behaviour that I cannot explain.
>
>
> In any case, I'll try to dig deeper into the TCP stack to understand the
> reason for this transmit behaviour.
>
> Thanks,
> johannes
>
>
> (*) Hmm. Now I have another idea. Maybe we have some kind of problem
> with the medium access configuration, and we transmit all this data
> without the AP having a chance to send back all the ACKs? Too bad I
> can't put an air sniffer into the setup - it's a conductive setup.
>
>
> (**) As another aside to this, the next generation HW after this will
> have 256 frames in a block-ack, so that means instead of up to 64 (we
> only use 63 for internal reasons) frames aggregated together we'll be
> able to aggregate 256 (or maybe we again only 255?). Each one of those
> frames may be an A-MSDU with ~11k content though (only 8k in the setup I
> have here right now), which means we can get a LOT of data into a single
> PPDU ... we'll probably have to bump the sk_pacing_shift to be able to
> fill that with a single TCP stream, though since we run all our
> performance numbers with many streams, maybe we should just leave it :)
>
>


--
Thanks,
Regards,
Chaitanya T K.

2020-01-24 11:08:06

by Johannes Berg

Subject: Re: debugging TCP stalls on high-speed wifi

On Fri, 2019-12-13 at 14:40 +0530, Krishna Chaitanya wrote:
>
> Maybe try 'reno' instead of 'cubic' to see if congestion control is
> being too careful?

I played around with this a bit now, but apart from a few outliers, the
congestion control algorithm doesn't have much effect. The outliers are

* vegas with ~120 Mbps
* nv with ~300 Mbps
* cdg with ~600 Mbps

All the others from my list (reno cubic bbr bic cdg dctcp highspeed htcp
hybla illinois lp nv scalable vegas veno westwood yeah) are within 50
Mbps or so from each other (around 1.45Gbps).

johannes

2020-01-24 23:51:56

by Eric Dumazet

Subject: Re: debugging TCP stalls on high-speed wifi



On 1/24/20 2:34 AM, Johannes Berg wrote:
> On Fri, 2019-12-13 at 14:40 +0530, Krishna Chaitanya wrote:
>>
>> Maybe try 'reno' instead of 'cubic' to see if congestion control is
>> being too careful?
>
> I played around with this a bit now, but apart from a few outliers, the
> congestion control algorithm doesn't have much effect. The outliers are
>
> * vegas with ~120 Mbps
> * nv with ~300 Mbps
> * cdg with ~600 Mbps
>
> All the others from my list (reno cubic bbr bic cdg dctcp highspeed htcp
> hybla illinois lp nv scalable vegas veno westwood yeah) are within 50
> Mbps or so from each other (around 1.45Gbps).
>

When the stalls happen, what is causing TCP to resume the xmit?

Some tcpdump traces could help.

(-s 100 to only capture headers)