2018-02-16 17:03:30

by Oleksandr Natalenko

Subject: TCP and BBR: reproducibly low cwnd and bandwidth

Hello.

I've faced an issue with limited TCP bandwidth between my laptop and a
server on my 1 Gbps LAN while using BBR as the congestion control mechanism.
To verify my observations, I've set up 2 KVM VMs with the following
parameters:

1) Linux v4.15.3
2) virtio NICs
3) 128 MiB of RAM
4) 2 vCPUs
5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz

The VMs are interconnected via a host bridge (-netdev bridge). I was running
iperf3 in both the default and the reverse mode. Here are the results:

1) BBR on both VMs

upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
download: 3.39 Gbits/sec, cwnd ~ 320 KBytes

2) Reno on both VMs

upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)

3) Reno on client, BBR on server

upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
download: 3.45 Gbits/sec, cwnd ~ 320 KBytes

4) BBR on client, Reno on server

upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)

So, as you can see, whenever BBR is in use, the upload rate is poor and the
cwnd stays low. On real HW (laptop and server on a 1 Gbps LAN), BBR limits the
throughput to ~100 Mbps (verifiable not only with iperf3, but also with scp
while transferring files between the hosts).
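
For reproducibility, each run was driven roughly like this (a sketch; the
exact iperf3 flags, test duration and server address are from memory and may
differ slightly):

# on both hosts: select the congestion control for the test
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr    # or reno / yeah

# server side
iperf3 -s

# client side: default direction ("upload"), then reverse ("download")
iperf3 -c <server> -t 30
iperf3 -c <server> -t 30 -R

# cwnd taken from the iperf3 output and cross-checked with `ss -tin`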

Also, I've tried to use YeAH instead of Reno, and it gives me the same results
as Reno (IOW, YeAH works fine too).

Questions:

1) is this expected?
2) or am I missing some extra BBR tuneable?
3) if it is not a regression (I don't have any previous data to compare with),
how can I fix this?
4) if it is a bug in BBR, what else should I provide or check for a proper
investigation?

Thanks.

Regards,
Oleksandr




2018-02-16 19:14:44

by Oleksandr Natalenko

Subject: Re: TCP and BBR: reproducibly low cwnd and bandwidth

Hi, David, Eric, Neal et al.

On Thursday 15 February 2018 21:42:26 CET Oleksandr Natalenko wrote:
> I've faced an issue with a limited TCP bandwidth between my laptop and a
> server in my 1 Gbps LAN while using BBR as a congestion control mechanism.
> To verify my observations, I've set up 2 KVM VMs with the following
> parameters:
>
> 1) Linux v4.15.3
> 2) virtio NICs
> 3) 128 MiB of RAM
> 4) 2 vCPUs
> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz
>
> The VMs are interconnected via host bridge (-netdev bridge). I was running
> iperf3 in the default and reverse mode. Here are the results:
>
> 1) BBR on both VMs
>
> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
>
> 2) Reno on both VMs
>
> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
>
> 3) Reno on client, BBR on server
>
> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
>
> 4) BBR on client, Reno on server
>
> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
>
> So, as you may see, when BBR is in use, upload rate is bad and cwnd is low.
> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
> transferring some files between hosts).
>
> Also, I've tried to use YeAH instead of Reno, and it gives me the same
> results as Reno (IOW, YeAH works fine too).
>
> Questions:
>
> 1) is this expected?
> 2) or am I missing some extra BBR tuneable?
> 3) if it is not a regression (I don't have any previous data to compare
> with), how can I fix this?
> 4) if it is a bug in BBR, what else should I provide or check for a proper
> investigation?

I've played with BBR a little bit more and managed to narrow the issue down to
the changes between v4.12 and v4.13. Here are my observations:

v4.12 + BBR + fq_codel == OK
v4.12 + BBR + fq == OK
v4.13 + BBR + fq_codel == Not OK
v4.13 + BBR + fq == OK

I think this has something to do with the internal TCP pacing implementation
that was introduced in v4.13 (commit 218af599fa63) specifically to allow using
BBR together with non-fq qdiscs. When BBR relies on fq, the throughput is high
and saturates the link, but if another qdisc is in use, for instance fq_codel,
the throughput drops. Just to be sure, I've also tried pfifo_fast instead of
fq_codel, with the same outcome: low throughput.
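
For completeness, the qdisc was switched between runs roughly like this (a
sketch; eth0 stands for the actual interface, and the qdisc was always changed
before starting a new iperf3 run, never while a flow was alive):

tc qdisc replace dev eth0 root fq
# ...run the iperf3 test, note the throughput...
tc qdisc replace dev eth0 root fq_codel
# ...run the iperf3 test again...
tc qdisc show dev eth0    # verify which root qdisc is actually in effect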

Unfortunately, I do not know whether this is expected behaviour or should be
considered a regression, hence I am asking for advice.

Ideas?

Thanks.

Regards,
Oleksandr



2018-02-16 19:18:04

by Eric Dumazet

Subject: Re: TCP and BBR: reproducibly low cwnd and bandwidth

On Fri, Feb 16, 2018 at 7:15 AM, Oleksandr Natalenko
<[email protected]> wrote:
> Hi, David, Eric, Neal et al.
>
> On Thursday 15 February 2018 21:42:26 CET Oleksandr Natalenko wrote:
>> I've faced an issue with a limited TCP bandwidth between my laptop and a
>> server in my 1 Gbps LAN while using BBR as a congestion control mechanism.
>> To verify my observations, I've set up 2 KVM VMs with the following
>> parameters:
>>
>> 1) Linux v4.15.3
>> 2) virtio NICs
>> 3) 128 MiB of RAM
>> 4) 2 vCPUs
>> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz
>>
>> The VMs are interconnected via host bridge (-netdev bridge). I was running
>> iperf3 in the default and reverse mode. Here are the results:
>>
>> 1) BBR on both VMs
>>
>> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
>> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 2) Reno on both VMs
>>
>> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
>> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
>>
>> 3) Reno on client, BBR on server
>>
>> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
>> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 4) BBR on client, Reno on server
>>
>> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
>> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
>>
>> So, as you may see, when BBR is in use, upload rate is bad and cwnd is low.
>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
>> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
>> transferring some files between hosts).
>>
>> Also, I've tried to use YeAH instead of Reno, and it gives me the same
>> results as Reno (IOW, YeAH works fine too).
>>
>> Questions:
>>
>> 1) is this expected?
>> 2) or am I missing some extra BBR tuneable?
>> 3) if it is not a regression (I don't have any previous data to compare
>> with), how can I fix this?
>> 4) if it is a bug in BBR, what else should I provide or check for a proper
>> investigation?
>
> I've played with BBR a little bit more and managed to narrow the issue down to
> the changes between v4.12 and v4.13. Here are my observations:
>
> v4.12 + BBR + fq_codel == OK
> v4.12 + BBR + fq == OK
> v4.13 + BBR + fq_codel == Not OK
> v4.13 + BBR + fq == OK
>
> I think this has something to do with an internal TCP implementation for
> pacing, that was introduced in v4.13 (commit 218af599fa63) specifically to
> allow using BBR together with non-fq qdiscs. Once BBR relies on fq, the
> throughput is high and saturates the link, but if another qdisc is in use, for
> instance, fq_codel, the throughput drops. Just to be sure, I've also tried
> pfifo_fast instead of fq_codel with the same outcome resulting in the low
> throughput.
>
> Unfortunately, I do not know if this is something expected or should be
> considered as a regression. Thus, asking for an advice.
>
> Ideas?

The way TCP pacing works, it defaults to internal pacing, using a hint
stored in the socket.

If you change the qdisc while a flow is alive, the result could be unexpected.

(The TCP socket remembers that FQ was supposed to handle the pacing.)

What results do you get if you use the standard pfifo_fast?

I am asking because TCP pacing relies on high-resolution timers, and those
might be weak on your VM.
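
A quick way to check both points from inside the guest (rough sketch; field
names as printed by recent iproute2/kernels):

# is the test socket actually paced, and what are cwnd/pacing_rate?
ss -tin 'dport = :5201'

# are high-resolution timers active?
grep -E 'hres_active|\.resolution' /proc/timer_list | head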

2018-02-16 19:19:21

by Holger Hoffstätte

Subject: Re: TCP and BBR: reproducibly low cwnd and bandwidth

On 02/16/18 16:15, Oleksandr Natalenko wrote:
> Hi, David, Eric, Neal et al.
>
> On Thursday 15 February 2018 21:42:26 CET Oleksandr Natalenko wrote:
>> I've faced an issue with a limited TCP bandwidth between my laptop and a
>> server in my 1 Gbps LAN while using BBR as a congestion control mechanism.
>> To verify my observations, I've set up 2 KVM VMs with the following
>> parameters:
>>
>> 1) Linux v4.15.3
>> 2) virtio NICs
>> 3) 128 MiB of RAM
>> 4) 2 vCPUs
>> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz

These are very odd configurations. :)
Non-preempt/100 might well be too slow, whereas PREEMPT/1000 might simply
have too much overhead.

>> The VMs are interconnected via host bridge (-netdev bridge). I was running
>> iperf3 in the default and reverse mode. Here are the results:
>>
>> 1) BBR on both VMs
>>
>> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
>> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 2) Reno on both VMs
>>
>> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
>> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
>>
>> 3) Reno on client, BBR on server
>>
>> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
>> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 4) BBR on client, Reno on server
>>
>> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
>> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
>>
>> So, as you may see, when BBR is in use, upload rate is bad and cwnd is low.

BBR in general will run with lower cwnd than e.g. Cubic or others.
That's a feature and necessary for WAN transfers.

>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
>> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
>> transferring some files between hosts).

Something seems really wrong with your setup. I get completely
expected throughput on wired 1Gb between two hosts:

Connecting to host tux, port 5201
[ 5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 113 MBytes 948 Mbits/sec 0 204 KBytes
[ 5] 1.00-2.00 sec 112 MBytes 941 Mbits/sec 0 204 KBytes
[ 5] 2.00-3.00 sec 112 MBytes 941 Mbits/sec 0 204 KBytes
[...]

Running it locally gives the more or less expected results as well:

Connecting to host ragnarok, port 5201
[ 5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 8.09 GBytes 69.5 Gbits/sec 0 512 KBytes
[ 5] 1.00-2.00 sec 8.14 GBytes 69.9 Gbits/sec 0 512 KBytes
[ 5] 2.00-3.00 sec 8.43 GBytes 72.4 Gbits/sec 0 512 KBytes
[...]

Both hosts running 4.14.x with bbr and fq_codel (default qdisc everywhere).
In the past I only used BBR briefly for testing, since at 1 Gb speeds on my
LAN it was actually slightly slower than Cubic (some of those bugs were
recently addressed) and made no difference otherwise, even for uploads, which
are capped by my 50/10 DSL anyway.

Please note that BBR was developed to address the case of WAN transfers
(or, more precisely, high-BDP paths), which often suffer from TCP throughput
collapse due to single packet loss events. While it might "work" in other
scenarios as well, strictly speaking, delay-based anything is increasingly
less likely to work when there is no meaningful notion of delay, such as
on a LAN (yes, this is very simplified).

The BBR mailing list has several nice reports on why the current BBR
implementation (dubbed v1) has a few, sometimes severe, problems. These are
being addressed as we speak.

(let me know if you want some of those tech reports by email. :)

>> Also, I've tried to use YeAH instead of Reno, and it gives me the same
>> results as Reno (IOW, YeAH works fine too).
>>
>> Questions:
>>
>> 1) is this expected?
>> 2) or am I missing some extra BBR tuneable?

No, it should work out of the box.

>> 3) if it is not a regression (I don't have any previous data to compare
>> with), how can I fix this?
>> 4) if it is a bug in BBR, what else should I provide or check for a proper
>> investigation?
>
> I've played with BBR a little bit more and managed to narrow the issue down to
> the changes between v4.12 and v4.13. Here are my observations:
>
> v4.12 + BBR + fq_codel == OK
> v4.12 + BBR + fq == OK
> v4.13 + BBR + fq_codel == Not OK
> v4.13 + BBR + fq == OK
>
> I think this has something to do with an internal TCP implementation for
> pacing, that was introduced in v4.13 (commit 218af599fa63) specifically to
> allow using BBR together with non-fq qdiscs. Once BBR relies on fq, the
> throughput is high and saturates the link, but if another qdisc is in use, for
> instance, fq_codel, the throughput drops. Just to be sure, I've also tried
> pfifo_fast instead of fq_codel with the same outcome resulting in the low
> throughput.

I'm not sure testing the old version without built-in pacing is going to help
in finding the actual problem. :)
Several people have reported severe performance regressions with 4.15.x;
maybe that's related. Can you test the latest 4.14.x?

Out of curiosity, what is the expected use case for BBR here?

cheers
Holger

2018-02-16 19:19:38

by Neal Cardwell

Subject: Re: TCP and BBR: reproducibly low cwnd and bandwidth

On Fri, Feb 16, 2018 at 11:26 AM, Holger Hoffstätte
<[email protected]> wrote:
>
> BBR in general will run with lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Please note that there's no general rule about whether BBR will run
with a lower or higher cwnd than CUBIC, Reno, or other loss-based
congestion control algorithms. Whether BBR's cwnd will be lower or
higher depends on the BDP of the path, the amount of buffering in the
bottleneck, and the number of flows. BBR tries to match the amount of
in-flight data to the BDP based on the available bandwidth and the
two-way propagation delay. This will usually produce an amount of data
in flight that is smaller than CUBIC/Reno (yielding lower latency) if
the path has deep buffers (bufferbloat), but can be larger than
CUBIC/Reno (yielding higher throughput) if the buffers are shallow and
the traffic is suffering burst losses.
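
As a rough illustration (numbers picked only for the example): on a 1 Gbit/s
path with an RTT of about 1 ms, the BDP is roughly (10^9 / 8) * 0.001 ≈ 125
KBytes, so BBR aims to keep about that much in flight (its cwnd cap is around
2x the estimated BDP), whereas loss-based algorithms keep growing cwnd until
the bottleneck buffer overflows, which with a deep buffer can easily be a
megabyte or more.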

>
>>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the throughput
>>> to ~100 Mbps (verifiable not only by iperf3, but also by scp while
>>> transferring some files between hosts).
>
> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
>
> Connecting to host tux, port 5201
> [ 5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
> [ ID] Interval Transfer Bitrate Retr Cwnd
> [ 5] 0.00-1.00 sec 113 MBytes 948 Mbits/sec 0 204 KBytes
> [ 5] 1.00-2.00 sec 112 MBytes 941 Mbits/sec 0 204 KBytes
> [ 5] 2.00-3.00 sec 112 MBytes 941 Mbits/sec 0 204 KBytes
> [...]
>
> Running it locally gives the more or less expected results as well:
>
> Connecting to host ragnarok, port 5201
> [ 5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
> [ ID] Interval Transfer Bitrate Retr Cwnd
> [ 5] 0.00-1.00 sec 8.09 GBytes 69.5 Gbits/sec 0 512 KBytes
> [ 5] 1.00-2.00 sec 8.14 GBytes 69.9 Gbits/sec 0 512 KBytes
> [ 5] 2.00-3.00 sec 8.43 GBytes 72.4 Gbits/sec 0 512 KBytes
> [...]
>
> Both hosts running 4.14.x with bbr and fq_codel (default qdisc everywhere).

Can you please clarify if this is over bare metal or between VM
guests? It sounds like Oleksandr's initial report was between KVM VMs,
so the virtualization may be an ingredient here.

thanks,
neal

2018-02-16 19:21:11

by Oleksandr Natalenko

Subject: Re: TCP and BBR: reproducibly low cwnd and bandwidth

Hi.

On Friday 16 February 2018 17:25:58 CET Eric Dumazet wrote:
> The way TCP pacing works, it defaults to internal pacing using a hint
> stored in the socket.
>
> If you change the qdisc while flow is alive, result could be unexpected.

I don't change the qdisc while a flow is alive. Either the VM is completely
restarted, or iperf3 is restarted on both sides.

> (TCP socket remembers that one FQ was supposed to handle the pacing)
>
> What results do you have if you use standard pfifo_fast ?

Almost the same as with fq_codel (see my previous email with numbers).

> I am asking because TCP pacing relies on High resolution timers, and
> that might be weak on your VM.

Also, I've switched to measuring things on real HW only (see also the
previous email with numbers).

Thanks.

Regards,
Oleksandr



2018-02-16 19:21:46

by Holger Hoffstätte

Subject: Re: TCP and BBR: reproducibly low cwnd and bandwidth

On 02/16/18 17:56, Neal Cardwell wrote:
> On Fri, Feb 16, 2018 at 11:26 AM, Holger Hoffstätte
> <[email protected]> wrote:
>>
>> BBR in general will run with lower cwnd than e.g. Cubic or others.
>> That's a feature and necessary for WAN transfers.
>
> Please note that there's no general rule about whether BBR will run
> with a lower or higher cwnd than CUBIC, Reno, or other loss-based
> congestion control algorithms. Whether BBR's cwnd will be lower or
> higher depends on the BDP of the path, the amount of buffering in the
> bottleneck, and the number of flows. BBR tries to match the amount of
> in-flight data to the BDP based on the available bandwidth and the
> two-way propagation delay. This will usually produce an amount of data
> in flight that is smaller than CUBIC/Reno (yielding lower latency) if
> the path has deep buffers (bufferbloat), but can be larger than
> CUBIC/Reno (yielding higher throughput) if the buffers are shallow and
> the traffic is suffering burst losses.

In all my tests I've never seen it larger, but OK. Thanks for the
explanation. :)
On second reading, the "necessary for WAN transfers" part was phrased a bit
unfortunately, but it likely doesn't matter for Oleksandr's case anyway.

(snip)

>> Something seems really wrong with your setup. I get completely
>> expected throughput on wired 1Gb between two hosts:
>>
>> Connecting to host tux, port 5201
>> [ 5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
>> [ ID] Interval Transfer Bitrate Retr Cwnd
>> [ 5] 0.00-1.00 sec 113 MBytes 948 Mbits/sec 0 204 KBytes
>> [ 5] 1.00-2.00 sec 112 MBytes 941 Mbits/sec 0 204 KBytes
>> [ 5] 2.00-3.00 sec 112 MBytes 941 Mbits/sec 0 204 KBytes
>> [...]
>>
>> Running it locally gives the more or less expected results as well:
>>
>> Connecting to host ragnarok, port 5201
>> [ 5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
>> [ ID] Interval Transfer Bitrate Retr Cwnd
>> [ 5] 0.00-1.00 sec 8.09 GBytes 69.5 Gbits/sec 0 512 KBytes
>> [ 5] 1.00-2.00 sec 8.14 GBytes 69.9 Gbits/sec 0 512 KBytes
>> [ 5] 2.00-3.00 sec 8.43 GBytes 72.4 Gbits/sec 0 512 KBytes
>> [...]
>>
>> Both hosts running 4.14.x with bbr and fq_codel (default qdisc everywhere).
>
> Can you please clarify if this is over bare metal or between VM
> guests? It sounds like Oleksandr's initial report was between KVM VMs,
> so the virtualization may be an ingredient here.

These are real hosts, not VMs, wired with 1 Gbit Ethernet (home office).
Like Eric said, it's probably weird HZ, a slow host, an iffy high-resolution
timer (bad for both fq and fq_codel), the overhead of retpolines in a VM, or
whatnot.

cheers
Holger

2018-02-16 19:36:25

by Oleksandr Natalenko

Subject: Re: TCP and BBR: reproducibly low cwnd and bandwidth

Hi.

On Friday 16 February 2018 17:26:11 CET Holger Hoffstätte wrote:
> These are very odd configurations. :)
> Non-preempt/100 might well be too slow, whereas PREEMPT/1000 might simply
> have too much overhead.

Since the pacing is based on hrtimers, should HZ matter at all? Even if it
does, a 1 Gbps link surely shouldn't drop below 100 Mbps.

> BBR in general will run with lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Okay, got it.

> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
> /* snip */

Yes, and that's strange :/. That's why I'm wondering what I'm missing, since
things cannot be *that* bad.

> /* snip */
> Please note that BBR was developed to address the case of WAN transfers
> (or more precisely high BDP paths) which often suffer from TCP throughput
> collapse due to single packet loss events. While it might "work" in other
> scenarios as well, strictly speaking delay-based anything is increasingly
> less likely to work when there is no meaningful notion of delay - such
> as on a LAN. (yes, this is very simplified..)
>
> The BBR mailing list has several nice reports why the current BBR
> implementation (dubbed v1) has a few - sometimes severe - problems.
> These are being addressed as we speak.
>
> (let me know if you want some of those tech reports by email. :)

Well, yes, please, why not :).

> /* snip */
> I'm not sure testing the old version without builtin pacing is going to help
> matters in finding the actual problem. :)
> Several people have reported severe performance regressions with 4.15.x,
> maybe that's related. Can you test latest 4.14.x?

I observed this on v4.14 too, but didn't pay much attention until I realised
that things look definitely wrong.

> Out of curiosity, what is the expected use case for BBR here?

Nothing special; I just assumed it could be set as the default for both WAN
and LAN usage.
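
For reference, the whole "configuration" on my side is just a couple of
sysctls, roughly (sketch):

sysctl -w net.core.default_qdisc=fq           # or fq_codel, which is where it breaks
sysctl -w net.ipv4.tcp_congestion_control=bbr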

Regards,
Oleksandr