by Menglong Dong

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Fri, Jul 28, 2023 at 10:25 PM Neal Cardwell <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <[email protected]> wrote:
[...]
>
> In that packetdrill case AFAICT that is the ZWP timer firing, and the
> sender sends a ZWP.
>
> I think maybe Menglong is looking more at something like the following
> scenario, where at the time the RTO timer fires the data sender finds
> the tp->snd_wnd is zero, so it sends a retransmit of the
> lowest-sequence data packet.
>
> Here is a packetdrill case and the tcpdump trace on an upstream
> net-next kernel... I have not worked out all the details at the end,
> but perhaps it can help move the discussion forward:
>
>
> ~/packetdrill/gtests/net/tcp/receiver_window# cat rwin-rto-zero-window.pkt
> // Test how sender reacts to unexpected arrival rwin of 0.
>
> `../common/defaults.sh`
>
> // Create a socket.
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> // Establish a connection.
> +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
> +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 14>
> +.1 < . 1:1(0) ack 1 win 457
> +0 accept(3, ..., ...) = 4
>
> +0 write(4, ..., 20000) = 20000
> +0 > P. 1:10001(10000) ack 1
>
> // TLP
> +.2 > . 10001:11001(1000) ack 1
> // Receiver has retracted rwin to 0
> // (perhaps from the 2023 proposed OOM code?).
> +.1 < . 1:1(0) ack 1 win 0
>
> // RTO, and in tcp_retransmit_timer() we see the receiver window is zero,
> // so we take the special f (!tp->snd_wnd...) code path.
> +.2 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +.5 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +1.2 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +2.4 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +4.8 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +9.6 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +19.2 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +38.4 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +76.8 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +120 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +120 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1001 win 1000
>
> // Received non-zero window update. Send more data.
> +0 > P. 1001:3001(2000) ack 1
> +.1 < . 1:1(0) ack 3001 win 1000
>
> ----------
> When I run that script on a net-next kernel I see the rounding up of
> the RTO to 122 secs rather than 120 secs, but for whatever reason the
> script does not cause the socket to die early...
>

I think I know the reason now. Without the 2nd patches that I send
in this series, the ACK can't update the rwin to 0, as it will be ignored
in tcp_may_update_window().

However, you can send an ACK that acknowledges the new data
to update the rwin to 0. I modified your script, and it can die as we
excepted:

// Test how sender reacts to unexpected arrival rwin of 0.

`../common/defaults.sh`

// Create a socket.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

// Establish a connection.
+.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
+0 > S. 0:0(0) ack 1 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8>
+.1 < . 1:1(0) ack 1 win 457
+0 accept(3, ..., ...) = 4

+0 write(4, ..., 20000) = 20000
+0 > P. 1:10001(10000) ack 1

// Update the window to 0. "ack 0 win 0" won't update the window, as it
// will be ignored by tcp_may_update_window()
+.1 < . 1:1(0) ack 1001 win 0

// RTO, and in tcp_retransmit_timer() we see the receiver window is zero,
// so we take the special f (!tp->snd_wnd...) code path.
+.2 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+.5 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+1.2 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+2.4 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+4.8 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+9.6 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+19.2 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+38.4 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+76.8 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

// socket will die in tcp_retransmit_timer() in the
// "tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX" code path.
// Following retransmit won't happen.
+120 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0
------------------------------------------------------------------------------

I don't know how to check the die of socket with
packetdrill, so I checked it by:
ss -nitme | grep 8080 | grep on
And I can see the socket die after timeout of the 120seconds
timer.

$ packetdrill ./rwin-rto-zero-window.pkt
./rwin-rto-zero-window.pkt:55: error handling packet: Timed out
waiting for packet

> The tcpdump trace:
>
> tcpdump -ttt -n -i any port 8080 &
>
> ->
>
> ~/packetdrill/gtests/net/tcp/receiver_window#
> ../../packetdrill/packetdrill rwin-rto-zero-window.pkt
> 00:01:01.370344 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [S], seq 0, win 65535, options [mss
> 1000,nop,nop,sackOK,nop,wscale 6], length 0
> 00:00:00.000096 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [S.], seq 3847169154, ack 1, win 65535, options [mss
> 1460,nop,nop,sackOK,nop,wscale 14], length 0
> 00:00:00.100277 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 457, length 0
> 00:00:00.000090 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 1:2001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000006 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 2001:4001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000003 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 4001:6001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000002 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 6001:8001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000001 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 8001:10001, ack 1, win 4, length 2000: HTTP
> 00:00:00.209131 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 10001:11001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100190 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:00.203824 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100175 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:00.507835 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100192 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:01.115858 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100182 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:02.331747 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100198 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:04.955980 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100197 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:09.627985 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100179 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:19.355725 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100203 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:42.395633 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100202 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:01:17.724059 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100201 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:02:02.779516 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100229 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:02:02.779828 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100230 ? In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1001, win 1000, length 0
> 00:00:00.000034 ? Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 11001:12001, ack 1, win 4, length 1000: HTTP
> 00:00:00.000005 ? Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 12001:13001, ack 1, win 4, length 1000: HTTP
>
> rwin-rto-zero-window.pkt:62: error handling packet: live packet field
> tcp_psh: expected: 1 (0x1) vs actual: 0 (0x0)
> script packet: 405.390244 P. 1001:3001(2000) ack 1
> actual packet: 405.390237 . 11001:13001(2000) ack 1 win 4

2023-07-31 11:23:35

by Menglong Dong

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Sat, Jul 29, 2023 at 1:15 AM Neal Cardwell <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 7:25 AM Neal Cardwell <[email protected]> wrote:
> >
> > On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <[email protected]> wrote:
> > > >
> > > > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <[email protected]> wrote:
> > > > >
> > > > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
> > > > > > > >
> > > > > > > > From: Menglong Dong <[email protected]>
> > > > > > > >
> > > > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > > > right all the time.
> > > > > > > >
> > > > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > > > TCP_RTO_MAX sooner or later.
> > > > > > > >
> > > > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > > > >
> > > > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > > > >
> > > > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > > > which is exact the timestamp of the timeout.
> > > > > > > >
> > > > > > > > Signed-off-by: Menglong Dong <[email protected]>
> > > > > > > > ---
> > > > > > > > net/ipv4/tcp_timer.c | 6 +++++-
> > > > > > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > > > > tp->snd_una, tp->snd_nxt);
> > > > > > > > }
> > > > > > > > #endif
> > > > > > > > - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > > + /* It's a little rough here, we regard any valid packet that
> > > > > > > > + * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > > > + * packet.
> > > > > > > > + */
> > > > > > > > + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > > tcp_write_err(sk);
> > > > > > > > goto out;
> > > > > > > > }
>
> One potential pre-existing issue with this logic: if the connection is
> restarting from idle, then tp->rcv_tstamp could already be a long time
> (minutes or hours) in the past even on the first RTO, in which case
> the very first RTO that found a zero tp->snd_wnd would find this
> check returns true, and would destroy the connection immediately. This
> seems extremely brittle.
>
> AFAICT it would be safer to replace this logic with a call to the
> standard tcp_write_timeout() logic that has a more robust check to see
> if the connection should be destroyed.

How about we check it with:

icsk->icsk_timeout - max(tp->retrans_stamp, tp->rcv_tstamp) > TCP_RTO_MAX

?

Therefore, we don't need to do a lot of modification. I have to
say that the way we check here is rough, and a simple loss
of the retransmitted packet or the ACK can cause the die
of the socket.

Maybe we need a more perfect way here? which may introduce
a certain amount of modification.

BTW, should I split out this patch from this series?

>
> neal