2022-01-11 19:29:59

by Ivan Babrou

Subject: [PATCH bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt

This patch adds bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) to allow setting
the rcv_ssthresh value for TCP connections. Combined with an increased
window_clamp via tcp_rmem[1], it makes it possible to advertise an
initial scaled TCP window larger than 64k. This is useful for high-BDP
connections, where it allows data to be pushed with fewer round trips,
reducing latency.

For active connections the larger window is advertised in the first
non-SYN ACK packet as part of the 3-way handshake.

For passive connections the larger window is advertised whenever
there's any packet to send after the 3-way handshake.

See: https://lkml.org/lkml/2021/12/22/652

Signed-off-by: Ivan Babrou <[email protected]>
---
 include/uapi/linux/bpf.h       | 1 +
 net/core/filter.c              | 6 ++++++
 tools/include/uapi/linux/bpf.h | 1 +
 3 files changed, 8 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 791f31dd0abe..36ebf87278bd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5978,6 +5978,7 @@ enum {
 	TCP_BPF_SYN		= 1005, /* Copy the TCP header */
 	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
 	TCP_BPF_SYN_MAC		= 1007, /* Copy the MAC, IP[46], and TCP header */
+	TCP_BPF_RCV_SSTHRESH	= 1008, /* Set rcv_ssthresh */
 };
 
 enum {
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e32cee2c469..aafb6066b1a6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4904,6 +4904,12 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 				return -EINVAL;
 			inet_csk(sk)->icsk_rto_min = timeout;
 			break;
+		case TCP_BPF_RCV_SSTHRESH:
+			if (val <= 0)
+				ret = -EINVAL;
+			else
+				tp->rcv_ssthresh = min_t(u32, val, tp->window_clamp);
+			break;
 		case TCP_SAVE_SYN:
 			if (val < 0 || val > 1)
 				ret = -EINVAL;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 791f31dd0abe..36ebf87278bd 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5978,6 +5978,7 @@ enum {
 	TCP_BPF_SYN		= 1005, /* Copy the TCP header */
 	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
 	TCP_BPF_SYN_MAC		= 1007, /* Copy the MAC, IP[46], and TCP header */
+	TCP_BPF_RCV_SSTHRESH	= 1008, /* Set rcv_ssthresh */
 };
 
 enum {
--
2.34.1
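
For context, a minimal sketch of how a sock_ops BPF program might
consume the new option. The callback choice, the program name, and the
1 MiB value are illustrative assumptions, not taken from the patch:

/* Hedged sketch, not part of the patch: a sock_ops program that raises
 * rcv_ssthresh on newly established connections. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#ifndef SOL_TCP
#define SOL_TCP 6
#endif
#ifndef TCP_BPF_RCV_SSTHRESH
#define TCP_BPF_RCV_SSTHRESH 1008	/* value added by the patch above */
#endif

SEC("sockops")
int set_rcv_ssthresh(struct bpf_sock_ops *skops)
{
	int val = 1 << 20;	/* 1 MiB; still clamped to tp->window_clamp */

	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		/* Error handling omitted for brevity. */
		bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RCV_SSTHRESH,
			       &val, sizeof(val));
		break;
	}
	return 1;
}

char _license[] SEC("license") = "GPL";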



2022-01-11 21:48:06

by Song Liu

Subject: Re: [PATCH bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt

On Tue, Jan 11, 2022 at 11:29 AM Ivan Babrou <[email protected]> wrote:
>
> This patch adds bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) to allow setting
> the rcv_ssthresh value for TCP connections. Combined with an increased
> window_clamp via tcp_rmem[1], it makes it possible to advertise an
> initial scaled TCP window larger than 64k. This is useful for high-BDP
> connections, where it allows data to be pushed with fewer round trips,
> reducing latency.
>
> For active connections the larger window is advertised in the first
> non-SYN ACK packet as part of the 3-way handshake.
>
> For passive connections the larger window is advertised whenever
> there's any packet to send after the 3-way handshake.
>
> See: https://lkml.org/lkml/2021/12/22/652

I guess the link mentioned above is this one [1]. Please use a lore link instead, e.g.

[1] https://lore.kernel.org/all/CABWYdi0qBQ57OHt4ZbRxMtdSzhubzkPaPKkYzdNfu4+cgPyXCA@mail.gmail.com/

Can we add a selftest for this? Something similar to

tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c

Thanks,
Song

[...]
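
A rough sketch of what the userspace half of such a selftest could look
like, loosely modeled on the existing cgroup/sock_ops selftests; the
object file, program name, and cgroup path are hypothetical, and error
handling is abbreviated:

/* Hedged sketch only, not the selftest that was eventually submitted.
 * Loads a sock_ops object (like the sketch earlier in the thread) and
 * attaches it to a cgroup so it runs for sockets created there. */
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int cg_fd, err;

	obj = bpf_object__open_file("rcv_ssthresh_sockops.bpf.o", NULL);
	if (!obj || bpf_object__load(obj))
		return 1;

	prog = bpf_object__find_program_by_name(obj, "set_rcv_ssthresh");
	cg_fd = open("/sys/fs/cgroup/test", O_RDONLY); /* hypothetical cgroup */
	if (!prog || cg_fd < 0)
		return 1;

	/* New TCP sockets created by tasks in this cgroup will now run the
	 * sock_ops program and get the enlarged rcv_ssthresh. */
	err = bpf_prog_attach(bpf_program__fd(prog), cg_fd,
			      BPF_CGROUP_SOCK_OPS, 0);
	if (err) {
		fprintf(stderr, "attach failed: %d\n", err);
		return 1;
	}
	return 0;
}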

2022-01-12 21:02:15

by Dave Taht

Subject: Re: [PATCH bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt

On Wed, Jan 12, 2022 at 11:59 AM Ivan Babrou <[email protected]> wrote:
>
> This patch adds bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) to allow setting
> the rcv_ssthresh value for TCP connections. Combined with an increased
> window_clamp via tcp_rmem[1], it makes it possible to advertise an
> initial scaled TCP window larger than 64k. This is useful for high-BDP
> connections, where it allows data to be pushed with fewer round trips,
> reducing latency.

I would not use the word "latency" in this way, I would just say
potentially reducing roundtrips...

and potentially massively increasing packet loss, oversaturating links,
and otherwise hurting latency for other applications sharing the link,
including the application that advertised an extreme window like this.

This overall focus tends to freak me out somewhat, especially when
faced with further statements that Cloudflare is using an initcwnd
of 250!???

The kind of damage just IW10 can do to much slower bandwidth
connections has to be experienced to be believed.

https://tools.ietf.org/id/draft-gettys-iw10-considered-harmful-00.html

[...]


--
I tried to build a better future, a few times:
https://wayforward.archive.org/?site=https%3A%2F%2Fwww.icei.org

Dave Täht CEO, TekLibre, LLC

2022-01-13 22:56:59

by Ivan Babrou

Subject: Re: [PATCH bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt

On Wed, Jan 12, 2022 at 1:02 PM Dave Taht <[email protected]> wrote:
> I would not use the word "latency" in this way, I would just say
> potentially reducing
> roundtrips...

Roundtrips translate directly into latency on high latency links.

> and potentially massively increasing packet loss, oversaturating
> links, and otherwise
> hurting latency for other applications sharing the link, including the
> application
> that advertised an extreme window like this.

The receive window is going to scale up to tcp_rmem[2] with traffic,
and packet loss won't stop it. That's around 3MiB on anything that's
not embedded these days.

My understanding is that congestion control on the sender side deals
with packet loss, bottleneck saturation, and packet pacing. This patch
only touches the receiving side, letting the client scale up faster if
they choose to do so. I don't think any out of the box sender will
make use of this, even if we enable it on the receiver, just because
the sender's congestion control constraints are lower (like
initcwnd=10).

Let me know if any of this doesn't look right to you.

> This overall focus tends to freak me out somewhat, especially when
> faced with further statements that cloudflare is using an initcwnd of 250!???

Congestion window is a learned property, not a static number. You
won't get a large initcwnd towards a poor connection.

We have a dedicated backbone with different properties.

2022-01-13 22:57:07

by Ivan Babrou

Subject: Re: [PATCH bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt

On Tue, Jan 11, 2022 at 1:48 PM Song Liu <[email protected]> wrote:
>
> I guess this is [1] mentioned above. Please use lore link instead, e.g.
>
> [1] https://lore.kernel.org/all/CABWYdi0qBQ57OHt4ZbRxMtdSzhubzkPaPKkYzdNfu4+cgPyXCA@mail.gmail.com/

Will do in the next iteration, thanks.

> Can we add a selftests for this? Something similar to
>
> tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c

I have a test based on the tcp_rtt selftest. Do you want me to amend
my commit with the test and resend it as v2, or make it a series of
two commits?

2022-01-14 05:44:16

by Dave Taht

Subject: Re: [PATCH bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt

On Thu, Jan 13, 2022 at 2:56 PM Ivan Babrou <[email protected]> wrote:
>
> On Wed, Jan 12, 2022 at 1:02 PM Dave Taht <[email protected]> wrote:
> > I would not use the word "latency" in this way, I would just say
> > potentially reducing
> > roundtrips...
>
> Roundtrips translate directly into latency on high latency links.

Yes, but with the caveats below. I'm fine with you just saying round trips,
and making this API possible.

It would comfort me further if you could provide an actual scenario.

See also:

https://datatracker.ietf.org/doc/html/rfc6928

which predates packet pacing (are you using sch_fq?)

>
> > and potentially massively increasing packet loss, oversaturating
> > links, and otherwise
> > hurting latency for other applications sharing the link, including the
> > application
> > that advertised an extreme window like this.
>
> The receive window is going to scale up to tcp_rmem[2] with traffic,
> and packet loss won't stop it. That's around 3MiB on anything that's
> not embedded these days.
>
> My understanding is that congestion control on the sender side deals
> with packet loss, bottleneck saturation, and packet pacing. This patch
> only touches the receiving side, letting the client scale up faster if
> they choose to do so. I don't think any out of the box sender will
> make use of this, even if we enable it on the receiver, just because
> the sender's congestion control constraints are lower (like
> initcwnd=10).

I've always kind of not liked the sender/receiver "language" in TCP.

They are peers.

> Let me know if any of this doesn't look right to you.
>
> > This overall focus tends to freak me out somewhat, especially when
> > faced with further statements that cloudflare is using an initcwnd of 250!???
>
> Congestion window is a learned property, not a static number. You
> won't get a large initcwnd towards a poor connection.

initcwnd is set globally or on a per-route basis.

> We have a dedicated backbone with different properties.

It's not so much that I don't think your backbone can handle this...

... it's the prospect of handing whiskey, car keys, and an excessive
initcwnd to teenage boys on a Saturday night.

--
I tried to build a better future, a few times:
https://wayforward.archive.org/?site=https%3A%2F%2Fwww.icei.org

Dave Täht CEO, TekLibre, LLC

2022-01-14 23:16:39

by Ivan Babrou

Subject: Re: [PATCH bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt

On Thu, Jan 13, 2022 at 9:44 PM Dave Taht <[email protected]> wrote:
> Yes, but with the caveats below. I'm fine with you just saying round trips,
> and making this api possible.
>
> It would comfort me further if you could provide an actual scenario.

The actual scenario is getting a response as quickly as possible on a
fresh connection across long distances (200ms+ RTT). If an RPC
response doesn't fit into the initial 64k of rcv_ssthresh, we end up
requiring more round trips to receive the response. Some customers are
very picky about the latency they measure, and cutting the extra
round trips made a very visible difference in the tests.
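
To put illustrative numbers on that (mine, not from the thread): at a
200 ms RTT, a 64 KB advertised window caps a single flow at roughly
64 KB / 0.2 s ≈ 320 KB/s (about 2.6 Mbit/s), and if the window only
roughly doubles per round trip, a 1 MB response needs on the order of
four extra round trips (~0.8 s) before the window catches up to the
response size.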

> See also:
>
> https://datatracker.ietf.org/doc/html/rfc6928
>
> which predates packet pacing (are you using sch_fq?)

We are using fq and bbr.

> > Congestion window is a learned property, not a static number. You
> > won't get a large initcwnd towards a poor connection.
>
> initcwnd is set globally or on a per route basis.

With TCP_BPF_IW the world is your oyster.
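
(For anyone following along: TCP_BPF_IW is the existing bpf_setsockopt
knob for the initial congestion window on the sending side; a sock_ops
program can set it the same way as the rcv_ssthresh sketch earlier in
the thread, e.g. bpf_setsockopt(skops, SOL_TCP, TCP_BPF_IW, &iw,
sizeof(iw)), as long as no data has been sent on the connection yet.)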

2022-01-16 06:47:35

by Dave Taht

Subject: Re: [PATCH bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt

On Fri, Jan 14, 2022 at 2:21 PM Ivan Babrou <[email protected]> wrote:
>
> On Thu, Jan 13, 2022 at 9:44 PM Dave Taht <[email protected]> wrote:
> > Yes, but with the caveats below. I'm fine with you just saying round trips,
> > and making this api possible.
> >
> > It would comfort me further if you could provide an actual scenario.
>
> The actual scenario is getting a response as quickly as possible on a
> fresh connection across long distances (200ms+ RTT). If an RPC
> response doesn't fit into the initial 64k of rcv_ssthresh, we end up
> requiring more roundrips to receive the response. Some customers are
> very picky about the latency they measure and cutting the extra
> roundtrips made a very visible difference in the tests.
>
> > See also:
> >
> > https://datatracker.ietf.org/doc/html/rfc6928
> >
> > which predates packet pacing (are you using sch_fq?)
>
> We are using fq and bbr.
>
> > > Congestion window is a learned property, not a static number. You
> > > won't get a large initcwnd towards a poor connection.
> >
> > initcwnd is set globally or on a per route basis.

Like I said, retaining state from an existing connection as to the
window is OK. I think arbitrarily declaring a window like this for a
new connection is not.

> With TCP_BPF_IW the world is your oyster.

The oyster has to co-habit in this ocean with all the other life
there, and I would be comforted if your customer also tracked various
other TCP_INFO statistics, like RTT growth, loss, marks, and
retransmits, and was aware not just of the self-harm inflicted but of
the collateral damage. In fact I really wish more people were
instrumenting everything with that; of late we've seen a lot of need
for TCP_NOTSENT_LOWAT in things like Apache Traffic Server in
containers. A simple one-line patch for a widely used app I can't talk
about did wonders for actual perceived throughput and responsiveness
as seen by the end user. Measuring from the receiver is far, far more
important than measuring from the sender, as is collecting long-term
statistics over many connections from the real world. I hope y'all
have been instrumenting your work as well as Google has, on these
fronts.
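
For reference, TCP_NOTSENT_LOWAT mentioned above is an ordinary
setsockopt option; a minimal sketch, where the 128 KB threshold is an
arbitrary illustrative value:

/* Cap how much not-yet-sent data the kernel queues for this socket, so
 * application writes stay closer to what the network can actually move. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int set_notsent_lowat(int fd, int bytes)
{
	/* e.g. bytes = 128 * 1024; returns 0 on success, -1 on error */
	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &bytes, sizeof(bytes));
}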

I know that I'm getting old and crunchy and scarred by seeing so many
(corporate wifi mostly) networks over the last decade essentially in
congestion collapse!

https://blog.apnic.net/2020/01/22/bufferbloat-may-be-solved-but-its-not-over-yet/

I'm very happy with how well sch_fq + packet pacing works to mitigate
impulses like this, as well as with so many other things like BBR and
BQL, but pacing out != pacing in, and despite my fervent wish for more
FQ+AQM techniques on more bottleneck links, we're not there yet.

I like very much that BPF is allowing rapid innovation, but with great
power comes great responsibility.
--
I tried to build a better future, a few times:
https://wayforward.archive.org/?site=https%3A%2F%2Fwww.icei.org

Dave Täht CEO, TekLibre, LLC