2021-12-22 17:46:56

by Ivan Babrou

Subject: Initial TCP receive window is clamped to 64k by rcv_ssthresh

Hello,

I noticed that the advertised TCP receive window in the first ACK from
the client is clamped at 64k. I'm wondering if this is intentional.

We have an environment with many pairs of distant servers connected by
high BDP links. For reasons that aren't relevant here, we need to
re-establish connections between those often and expect to have as few
round trips as possible to get a response after a handshake.

We have made BBR cooperate on the initcwnd front with TCP_BPF_IW and
some code that remembers cwnd and lets new connections start with a
high value. It's safe to assume that we set initcwnd to 250 from the
server side. I have no issues with the congestion control side of
things.

We also have high rmem and wmem values and plenty of memory.

The problem lies in the fact that no matter how high we crank up the
initcwnd, the connection will hit the 64k wall of the receive window
and will have to stall waiting on ACKs from the other side, which take
a long while to arrive on high latency links. A realistic scenario:

1. TCP connection established, receive window = 64k.
2. Client sends a request.
3. Server userspace program generates a 120k response and writes it to
the socket. That's T0.
4. Server sends 64k worth of data in TCP packets to the client.
5. Client sees the first 64k worth of data T0 + RTT/2 later.
6. Client sends ACKs to cover for the data it just received.
7. Server sees the ACKs T0 + 1 RTT later.
8. Server sends the remaining data.
9. Client sees the remaining data T0 + RTT + RTT/2 later.

In my mind, on a good network (guarded by the initcwnd) I expect the
whole response to be sent immediately at T0 and received RTT/2 later.
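
The timeline above can be checked with a small back-of-the-envelope
model. This is only a sketch under simplifying assumptions: the sender
may have at most min(cwnd, rwnd) bytes in flight, neither window grows
during the transfer, and ACKs for a flight come back all at once one RTT
after it was sent. The function names and the 1460-byte MSS used in the
usage note are illustrative, not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sender flights needed to deliver `bytes` when every flight
 * is capped by min(cwnd, rwnd) and neither window grows (assumption). */
uint32_t flights_needed(uint64_t bytes, uint64_t cwnd_bytes,
                        uint64_t rwnd_bytes)
{
    uint64_t per_flight = cwnd_bytes < rwnd_bytes ? cwnd_bytes : rwnd_bytes;

    assert(per_flight > 0);
    /* Round up: a partial last flight still costs a flight. */
    return (uint32_t)((bytes + per_flight - 1) / per_flight);
}

/* Time from T0 until the client has the last byte, in units of RTT/2.
 * The first flight arrives after RTT/2; every additional flight has to
 * wait one full RTT for the ACKs that open the window again. */
uint32_t half_rtts_until_delivered(uint64_t bytes, uint64_t cwnd_bytes,
                                   uint64_t rwnd_bytes)
{
    uint32_t flights = flights_needed(bytes, cwnd_bytes, rwnd_bytes);

    return flights == 0 ? 0 : 1 + 2 * (flights - 1);
}
```

With a 120k response, initcwnd of 250 segments (assuming a 1460-byte
MSS) and a 64k receive window, this model gives 2 flights and 3 half
RTTs, i.e. T0 + RTT + RTT/2 as in step 9. With a 512k receive window it
gives 1 flight and RTT/2.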

The current TCP connection establishment code picks two window sizes
in tcp_select_initial_window() during the SYN packet generation:

* rcv_wnd to advertise (cannot be higher than 64k during SYN, as we
don't know whether wscale is supported yet)
* window_clamp (current max memory allowed for the socket, can be large)

You can find these in code here:

* https://elixir.bootlin.com/linux/v5.15.10/source/include/linux/tcp.h#L209

The call into tcp_select_initial_window() is here:

* https://elixir.bootlin.com/linux/v5.15.10/source/net/ipv4/tcp_output.c#L3682

Immediately after that, rcv_ssthresh is set to rcv_wnd. This is the
part that gives me pause.

During the generation of the first ACK after the SYN ACK is received
on the client, assuming the window scaling is supported, I expect the
client to advertise the whole buffer as available and let congestion
control handle whether it can be filled from the sender side. What
happens in reality is that rcv_ssthresh is sent as the window value.
Unfortunately, rcv_ssthresh is limited to 64k from rcv_wnd as
described above.

My question is whether it should be limited to window_clamp in
tcp_connect_init() instead.

I looked through the git history, and the following line has been
there since the Git import in 2005:

tp->rcv_ssthresh = tp->rcv_wnd;

I made a small patch that toggles rcv_ssthresh between rcv_wnd and
window_clamp and I'm doing some testing to see if it solves my issue.
I can see it advertise 512k receive buffer in the first ACK from the
client, which seems to address my problem. I'm not sure if there's
some drawback here.
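
To make the difference between the two choices concrete, here is a
userspace model of the behaviour described above. It is not the actual
patch and not kernel code: the struct, the function, and the
`ssthresh_from_clamp` flag are hypothetical, and the real
tcp_select_initial_window() does more work (MSS rounding, wscale
selection) than this sketch, which only keeps the 64k cap on the
unscaled SYN window.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest window that fits in the unscaled 16-bit field of a SYN. */
#define MAX_UNSCALED_WND 65535U

/* Hypothetical stand-in for the three tcp_sock fields discussed. */
struct tcp_sock_model {
    uint32_t rcv_wnd;       /* window advertised in the SYN */
    uint32_t window_clamp;  /* max window the socket memory allows */
    uint32_t rcv_ssthresh;  /* limit on the advertised window */
};

/* Simplified model of connection setup on the client side.
 * buffer_space: receive memory available to the socket, in bytes.
 * ssthresh_from_clamp: false keeps the current behaviour
 * (rcv_ssthresh = rcv_wnd), true models the proposed alternative. */
void connect_init_model(struct tcp_sock_model *tp, uint32_t buffer_space,
                        bool ssthresh_from_clamp)
{
    tp->window_clamp = buffer_space;
    tp->rcv_wnd = buffer_space < MAX_UNSCALED_WND ? buffer_space
                                                  : MAX_UNSCALED_WND;
    tp->rcv_ssthresh = ssthresh_from_clamp ? tp->window_clamp
                                           : tp->rcv_wnd;
}
```

In this model, a socket with 512k of buffer space ends up with
rcv_ssthresh of 65535 with the flag off and 524288 with the flag on.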


2021-12-22 18:10:07

by Eric Dumazet

Subject: Re: Initial TCP receive window is clamped to 64k by rcv_ssthresh

On Wed, Dec 22, 2021 at 9:46 AM Ivan Babrou <[email protected]> wrote:
>
> Hello,
>
> I noticed that the advertised TCP receive window in the first ACK from
> the client is clamped at 64k. I'm wondering if this is intentional.
>
> We have an environment with many pairs of distant servers connected by
> high BDP links. For the reasons that aren't relevant, we need to
> re-establish connections between those often and expect to have as few
> round trips as possible to get a response after a handshake.
>
> We have made BBR cooperate on the initcwnd front with TCP_BPF_IW and
> some code that remembers cwnd and lets new connections start with a
> high value. It's safe to assume that we set initcwnd to 250 from the
> server side. I have no issues with the congestion control side of
> things.
>
> We also have high rmem and wmem values and plenty of memory.
>
> The problem lies in the fact that no matter how high we crank up the
> initcwnd, the connection will hit the 64k wall of the receive window
> and will have to stall waiting on ACKs from the other side, which take
> a long while to arrive on high latency links. A realistic scenario:
>
> 1. TCP connection established, receive window = 64k.
> 2. Client sends a request.
> 3. Server userspace program generates a 120k response and writes it to
> the socket. That's T0.
> 4. Server sends 64k worth of data in TCP packets to the client.
> 5. Client sees the first 64k worth of data T0 + RTT/2 later.
> 6. Client sends ACKs to cover for the data it just received.
> 7. Server sees the ACKs T0 + 1 RTT later.
> 8. Server sends the remaining data.
> 9. Client sees the remaining data T0 + RTT + RTT/2 later.
>
> In my mind, on a good network (guarded by the initcwnd) I expect to
> have the whole response to be sent immediately at T0 and received
> RTT/2 later.
>
> The current TCP connection establishment code picks two window sizes
> in tcp_select_initial_window() during the SYN packet generation:
>
> * rcv_wnd to advertise (cannot be higher than 64k during SYN, as we
> don't know whether wscale is supported yet)
> * window_clamp (current max memory allowed for the socket, can be large)
>
> You can find these in code here:
>
> * https://elixir.bootlin.com/linux/v5.15.10/source/include/linux/tcp.h#L209
>
> The call into tcp_select_initial_window() is here:
>
> * https://elixir.bootlin.com/linux/v5.15.10/source/net/ipv4/tcp_output.c#L3682
>
> Then immediately after rcv_ssthresh is set to rcv_wnd. This is the
> part that gives me pause.
>
> During the generation of the first ACK after the SYN ACK is received
> on the client, assuming the window scaling is supported, I expect the
> client to advertise the whole buffer as available and let congestion
> control handle whether it can be filled from the sender side. What
> happens in reality is that rcv_ssthresh is sent as the window value.
> Unfortunately, rcv_ssthresh is limited to 64k from rcv_wnd as
> described above.
>
> My question is whether it should be limited to window_clamp in
> tcp_connect_init() instead.
>
> I tried looking through git history and the following line was there
> since Git import in 2005:
>
> tp->rcv_ssthresh = tp->rcv_wnd;
>
> I made a small patch that toggles rcv_ssthresh between rcv_wnd and
> window_clamp and I'm doing some testing to see if it solves my issue.
> I can see it advertise 512k receive buffer in the first ACK from the
> client, which seems to address my problem. I'm not sure if there's
> some drawback here.

The stack is conservative about RWIN increase: it wants to receive
packets to have an idea of the skb->len/skb->truesize ratio to convert
a memory budget to RWIN.

Some drivers have to allocate 16K buffers (or even 32K buffers) just
to hold one segment (of less than 1500 bytes of payload), while others
are able to pack memory more efficiently.

I guess that you could use eBPF code to precisely tweak stack behavior
to your needs.
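
The arithmetic behind the skb->len/skb->truesize argument can be
sketched as follows. This is only an illustration of the ratio, not the
kernel's actual window-growing logic; the function name and the example
buffer sizes in the usage note are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a receive memory budget into a window of payload bytes,
 * given the payload length (skb_len) and the memory actually consumed
 * (skb_truesize) of a representative received packet. */
uint64_t window_from_budget(uint64_t budget_bytes, uint32_t skb_len,
                            uint32_t skb_truesize)
{
    assert(skb_truesize > 0 && skb_len <= skb_truesize);
    /* Only len/truesize of the memory ends up holding payload. */
    return budget_bytes * skb_len / skb_truesize;
}
```

With a 512k budget and 1500 bytes of payload per 16K buffer, only 48000
bytes of window are backed by memory. With the same payload in a
hypothetical 2K buffer, the same budget backs 384000 bytes.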

2021-12-23 22:53:11

by Ivan Babrou

Subject: Re: Initial TCP receive window is clamped to 64k by rcv_ssthresh

On Wed, Dec 22, 2021 at 10:10 AM Eric Dumazet <[email protected]> wrote:
> Stack is conservative about RWIN increase, it wants to receive packets
> to have an idea
> of the skb->len/skb->truesize ratio to convert a memory budget to RWIN.
>
> Some drivers have to allocate 16K buffers (or even 32K buffers) just
> to hold one segment
> (of less than 1500 bytes of payload), while others are able to pack
> memory more efficiently.
>
> I guess that you could use eBPF code to precisely tweak stack behavior
> to your needs.

Adding eBPF for this is certainly an option, and it seems similar to
TCP_BPF_SNDCWND_CLAMP. I can look into crafting a patch for this.

Is it not possible to do anything automatically to pick a bigger
window without eBPF? When the scaled window is first advertised in the
very first ACK, the kernel already has the SYN ACK skb from the other
end of the connection. Could the skb->len / skb->truesize ratio be
looked up there?

Increasing tcp_rmem (the middle part specifically) is a lower barrier
to entry than involving eBPF, and it can really help with latency
even for human use cases like opening a website across the ocean.