2023-07-27 14:00:12

by Menglong Dong

Subject: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

From: Menglong Dong <[email protected]>

In tcp_retransmit_timer(), a window-shrunk connection is regarded as
timed out if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
always right.

The retransmits become zero-window probes in tcp_retransmit_timer()
once 'snd_wnd == 0'. Therefore, icsk->icsk_rto will grow to
TCP_RTO_MAX sooner or later.

However, the timer is not precise, as it is based on a timer wheel:
the longer the timer, the coarser its granularity. So the timeout does
not fire after exactly TCP_RTO_MAX, but after 122877ms in my testing.

Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
once the RTO reaches TCP_RTO_MAX.

Fix this by replacing 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
which is exactly the timestamp at which the timer was scheduled to expire.

Signed-off-by: Menglong Dong <[email protected]>
---
net/ipv4/tcp_timer.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 470f581eedd4..3a20db15a186 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
tp->snd_una, tp->snd_nxt);
}
#endif
- if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
+ /* It's a little rough here: we regard any valid packet that
+ * updates tp->rcv_tstamp as a reply to the retransmitted
+ * packet.
+ */
+ if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
tcp_write_err(sk);
goto out;
}
--
2.40.1



2023-07-27 19:50:33

by Eric Dumazet

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
>
> From: Menglong Dong <[email protected]>
>
> In tcp_retransmit_timer(), a window shrunk connection will be regarded
> as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> right all the time.
>
> The retransmits will become zero-window probes in tcp_retransmit_timer()
> if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> TCP_RTO_MAX sooner or later.
>
> However, the timer is not precise enough, as it base on timer wheel.
> Sorry that I am not good at timer, but I know the concept of time-wheel.
> The longer of the timer, the rougher it will be. So the timeout is not
> triggered after TCP_RTO_MAX, but 122877ms as I tested.
>
> Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> once the RTO come up to TCP_RTO_MAX.
>
> Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> which is exact the timestamp of the timeout.
>
> Signed-off-by: Menglong Dong <[email protected]>
> ---
> net/ipv4/tcp_timer.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 470f581eedd4..3a20db15a186 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> tp->snd_una, tp->snd_nxt);
> }
> #endif
> - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> + /* It's a little rough here, we regard any valid packet that
> + * update tp->rcv_tstamp as the reply of the retransmitted
> + * packet.
> + */
> + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> tcp_write_err(sk);
> goto out;
> }


Hmm, this looks like a net candidate, since this is unrelated to the
other patches ?

Neal, what do you think ?

2023-07-28 04:06:38

by Menglong Dong

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <[email protected]> wrote:
>
> On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
> >
> > From: Menglong Dong <[email protected]>
> >
> > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > right all the time.
> >
> > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > TCP_RTO_MAX sooner or later.
> >
> > However, the timer is not precise enough, as it base on timer wheel.
> > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > The longer of the timer, the rougher it will be. So the timeout is not
> > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> >
> > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > once the RTO come up to TCP_RTO_MAX.
> >
> > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > which is exact the timestamp of the timeout.
> >
> > Signed-off-by: Menglong Dong <[email protected]>
> > ---
> > net/ipv4/tcp_timer.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > index 470f581eedd4..3a20db15a186 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > tp->snd_una, tp->snd_nxt);
> > }
> > #endif
> > - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > + /* It's a little rough here, we regard any valid packet that
> > + * update tp->rcv_tstamp as the reply of the retransmitted
> > + * packet.
> > + */
> > + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > tcp_write_err(sk);
> > goto out;
> > }
>
>
> Hmm, this looks like a net candidate, since this is unrelated to the
> other patches ?

Yeah, this patch can be standalone. However, considering the
purpose of this series, it is necessary. Without this patch, the
OOM probe will always time out after a few minutes.

I'm not sure I expressed the problem clearly in the commit log,
so let me explain it in more detail.

Let's mark the timestamp of the 10th timeout of the rtx timer
as TS1. Now, the retransmission happens and the ACK of
the retransmitted packet updates tp->rcv_tstamp to
TS1+rtt.

The RTO is now TCP_RTO_MAX. So let's see what happens
at the 11th timeout. Since we time out after 122877ms,
tcp_jiffies32 is now "TS1+122877ms", and
"tcp_jiffies32 - tp->rcv_tstamp" is
"TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
which is always bigger than TCP_RTO_MAX (120000ms).

>
> Neal, what do you think ?

2023-07-28 05:20:35

by Neal Cardwell

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
> > >
> > > From: Menglong Dong <[email protected]>
> > >
> > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > right all the time.
> > >
> > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > TCP_RTO_MAX sooner or later.
> > >
> > > However, the timer is not precise enough, as it base on timer wheel.
> > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > The longer of the timer, the rougher it will be. So the timeout is not
> > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > >
> > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > once the RTO come up to TCP_RTO_MAX.
> > >
> > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > which is exact the timestamp of the timeout.
> > >
> > > Signed-off-by: Menglong Dong <[email protected]>
> > > ---
> > > net/ipv4/tcp_timer.c | 6 +++++-
> > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > index 470f581eedd4..3a20db15a186 100644
> > > --- a/net/ipv4/tcp_timer.c
> > > +++ b/net/ipv4/tcp_timer.c
> > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > tp->snd_una, tp->snd_nxt);
> > > }
> > > #endif
> > > - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > + /* It's a little rough here, we regard any valid packet that
> > > + * update tp->rcv_tstamp as the reply of the retransmitted
> > > + * packet.
> > > + */
> > > + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > tcp_write_err(sk);
> > > goto out;
> > > }
> >
> >
> > Hmm, this looks like a net candidate, since this is unrelated to the
> > other patches ?
>
> Yeah, this patch can be standalone. However, considering the
> purpose of this series, it is necessary. Without this patch, the
> OOM probe will always timeout after a few minutes.
>
> I'm not sure if I express the problem clearly in the commit log.
> Let's explain it more.
>
> Let's mark the timestamp of the 10th timeout of the rtx timer
> as TS1. Now, the retransmission happens and the ACK of
> the retransmitted packet will update the tp->rcv_tstamp to
> TS1+rtt.
>
> The RTO now is TCP_RTO_MAX. So let's see what will
> happen in the 11th timeout. As we timeout after 122877ms,
> so tcp_jiffies32 now is "TS1+122877ms", and
> "tcp_jiffies32 - tp->rcv_tstamp" is
> "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> which is always bigger than TCP_RTO_MAX, which is 120000ms.
>
> >
> > Neal, what do you think ?

Sorry, I am probably missing something here, but: what would ever make
this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> TCP_RTO_MAX) true? :-)

In your nicely explained scenario, your new expression,
icsk->icsk_timeout - tp->rcv_tstamp, will be:

icsk->icsk_timeout - tp->rcv_tstamp
= TS1 + 120 sec - (TS1+rtt)
= 120 sec - RTT

AFAICT there is no way for that expression to be bigger than
TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)

So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
TCP_RTO_MAX) will always be false, so rather than this patch we may as
well remove the if check and the body of the if block?

To me such a change does not seem like a safe and clear bug fix for
the "net" branch but rather a riskier design change (appropriate for
"net-next" branch) that has connections retry forever when the
receiver retracts the window to zero, under the estimation that this
is preferable to having the connections die in such a case.

There might be apps that depend on the old behavior of having
connections die in such cases, so we might want to have this new
fail-faster behavior guarded by a sysctl in case some sites need to
revert to the older behavior? Not sure...

neal

2023-07-28 06:50:21

by Menglong Dong

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <[email protected]> wrote:
>
> On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <[email protected]> wrote:
> >
> > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
> > > >
> > > > From: Menglong Dong <[email protected]>
> > > >
> > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > right all the time.
> > > >
> > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > TCP_RTO_MAX sooner or later.
> > > >
> > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > >
> > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > once the RTO come up to TCP_RTO_MAX.
> > > >
> > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > which is exact the timestamp of the timeout.
> > > >
> > > > Signed-off-by: Menglong Dong <[email protected]>
> > > > ---
> > > > net/ipv4/tcp_timer.c | 6 +++++-
> > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > index 470f581eedd4..3a20db15a186 100644
> > > > --- a/net/ipv4/tcp_timer.c
> > > > +++ b/net/ipv4/tcp_timer.c
> > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > tp->snd_una, tp->snd_nxt);
> > > > }
> > > > #endif
> > > > - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > + /* It's a little rough here, we regard any valid packet that
> > > > + * update tp->rcv_tstamp as the reply of the retransmitted
> > > > + * packet.
> > > > + */
> > > > + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > tcp_write_err(sk);
> > > > goto out;
> > > > }
> > >
> > >
> > > Hmm, this looks like a net candidate, since this is unrelated to the
> > > other patches ?
> >
> > Yeah, this patch can be standalone. However, considering the
> > purpose of this series, it is necessary. Without this patch, the
> > OOM probe will always timeout after a few minutes.
> >
> > I'm not sure if I express the problem clearly in the commit log.
> > Let's explain it more.
> >
> > Let's mark the timestamp of the 10th timeout of the rtx timer
> > as TS1. Now, the retransmission happens and the ACK of
> > the retransmitted packet will update the tp->rcv_tstamp to
> > TS1+rtt.
> >
> > The RTO now is TCP_RTO_MAX. So let's see what will
> > happen in the 11th timeout. As we timeout after 122877ms,
> > so tcp_jiffies32 now is "TS1+122877ms", and
> > "tcp_jiffies32 - tp->rcv_tstamp" is
> > "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> > which is always bigger than TCP_RTO_MAX, which is 120000ms.
> >
> > >
> > > Neal, what do you think ?
>
> Sorry, I am probably missing something here, but: what would ever make
> this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> > TCP_RTO_MAX) true? :-)
>

If snd_wnd is 0, we need to keep probing until the window
becomes available. Meanwhile, any retransmission that does not get
a corresponding ACK (see what we do in the 1st patch), whether
because the ACK or the retransmitted packet itself was lost,
can make the condition true, as tp->rcv_tstamp is not
updated in time.

This is a little strict: in tcp_probe_timer(), we are allowed to
retransmit the probe0 packet sysctl_tcp_retries2 times, but here
we don't tolerate a single lost packet.

> In your nicely explained scenario, your new expression,
> icsk->icsk_timeout - tp->rcv_tstamp, will be:
>
> icsk->icsk_timeout - tp->rcv_tstamp
> = TS1 + 120 sec - (TS1+rtt)
> = 120 sec - RTT
>
> AFAICT there is no way for that expression to be bigger than
> TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)
>
> So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
> TCP_RTO_MAX) will always be false, so rather than this patch we may as
> well remove the if check and the body of the if block?
>

Hmm... as I explained above, the condition will be true
if real packet loss happens, and I think that was the original
design.

> To me such a change does not seem like a safe and clear bug fix for
> the "net" branch but rather a riskier design change (appropriate for
> "net-next" branch) that has connections retry forever when the
> receiver retracts the window to zero, under the estimation that this
> is preferable to having the connections die in such a case.
>
> There might be apps that depend on the old behavior of having
> connections die in such cases, so we might want to have this new
> fail-faster behavior guarded by a sysctl in case some sites need to
> revert to the older behavior? Not sure...

Yeah, the behavior here will be different for users. I'm not
sure whether any users rely on the old behavior.

What do you think, Eric? Do we need a sysctl here?

Thanks!
Menglong Dong

>
> neal

2023-07-28 09:14:50

by Eric Dumazet

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <[email protected]> wrote:
> >
> > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <[email protected]> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <[email protected]> wrote:
> > > >
> > > > On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
> > > > >
> > > > > From: Menglong Dong <[email protected]>
> > > > >
> > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > right all the time.
> > > > >
> > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > TCP_RTO_MAX sooner or later.
> > > > >
> > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > >
> > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > once the RTO come up to TCP_RTO_MAX.
> > > > >
> > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > which is exact the timestamp of the timeout.
> > > > >
> > > > > Signed-off-by: Menglong Dong <[email protected]>
> > > > > ---
> > > > > net/ipv4/tcp_timer.c | 6 +++++-
> > > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > --- a/net/ipv4/tcp_timer.c
> > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > tp->snd_una, tp->snd_nxt);
> > > > > }
> > > > > #endif
> > > > > - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > + /* It's a little rough here, we regard any valid packet that
> > > > > + * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > + * packet.
> > > > > + */
> > > > > + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > tcp_write_err(sk);
> > > > > goto out;
> > > > > }
> > > >
> > > >
> > > > Hmm, this looks like a net candidate, since this is unrelated to the
> > > > other patches ?
> > >
> > > Yeah, this patch can be standalone. However, considering the
> > > purpose of this series, it is necessary. Without this patch, the
> > > OOM probe will always timeout after a few minutes.
> > >
> > > I'm not sure if I express the problem clearly in the commit log.
> > > Let's explain it more.
> > >
> > > Let's mark the timestamp of the 10th timeout of the rtx timer
> > > as TS1. Now, the retransmission happens and the ACK of
> > > the retransmitted packet will update the tp->rcv_tstamp to
> > > TS1+rtt.
> > >
> > > The RTO now is TCP_RTO_MAX. So let's see what will
> > > happen in the 11th timeout. As we timeout after 122877ms,
> > > so tcp_jiffies32 now is "TS1+122877ms", and
> > > "tcp_jiffies32 - tp->rcv_tstamp" is
> > > "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> > > which is always bigger than TCP_RTO_MAX, which is 120000ms.
> > >
> > > >
> > > > Neal, what do you think ?
> >
> > Sorry, I am probably missing something here, but: what would ever make
> > this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> > > TCP_RTO_MAX) true? :-)
> >
>
> If the snd_wnd is 0, we need to keep probing until the window
> is available. Meanwhile, any retransmission that don't have
> a corresponding ACK (see what we do in the 1st patch), which
> can be caused by the lost of the ACK or the lost of the retransmitted
> packet, can make the condition true, as the tp->rcv_tstamp can't be
> updated in time.
>
> This is a little strict here. In the tcp_probe_timer(), we are allowed to
> retransmit the probe0 packet for sysctl_tcp_retries2 times. But
> we don't allow packets to be lost here.
>
> > In your nicely explained scenario, your new expression,
> > icsk->icsk_timeout - tp->rcv_tstamp, will be:
> >
> > icsk->icsk_timeout - tp->rcv_tstamp
> > = TS1 + 120 sec - (TS1+rtt)
> > = 120 sec - RTT
> >
> > AFAICT there is no way for that expression to be bigger than
> > TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)
> >
> > So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
> > TCP_RTO_MAX) will always be false, so rather than this patch we may as
> > well remove the if check and the body of the if block?
> >
>
> Hmm......as I explained above, the condition will be true
> if the real packet loss happens. And I think it is the origin
> design.
>
> > To me such a change does not seem like a safe and clear bug fix for
> > the "net" branch but rather a riskier design change (appropriate for
> > "net-next" branch) that has connections retry forever when the
> > receiver retracts the window to zero, under the estimation that this
> > is preferable to having the connections die in such a case.
> >
> > There might be apps that depend on the old behavior of having
> > connections die in such cases, so we might want to have this new
> > fail-faster behavior guarded by a sysctl in case some sites need to
> > revert to the older behavior? Not sure...
>
> Yeah, the behavior here will be different for the users. I'm not
> sure if there are any users that rely on such behavior.
>
> What do you think, Eric? Do we need a sysctl here?
>

I honestly do not know what problem you want to solve.

As Neal pointed out, the new condition would never trigger,
so one can question the whole piece of code:
what is its purpose, exactly?

When receiving WIN 0 acks, we should enter the so-called probe0 state.
Maybe the real issue is that the 'receiver' falsely sends WIN X>0 ACKs,
because win probe acks do not really ensure memory is available to
receive new packets?

Maybe provide a packetdrill test to show what kind of issue you are facing...

In Google kernels, we have TCP_RTO_MAX reduced to 30 seconds, and the
following test runs fine.

// Test how sender reacts to unexpected arrival rwin of 0.

`../common/defaults.sh`

// Create a socket.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

// Establish a connection.
+.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
+0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.1 < . 1:1(0) ack 1 win 457
+0 accept(3, ..., ...) = 4

+0 write(4, ..., 20000) = 20000
+0 > P. 1:10001(10000) ack 1
+.1 < . 1:1(0) ack 10001 win 0

// Send zwp since we received rwin of 0 and have data to send.
+.3 > . 10000:10000(0) ack 1
+.1 < . 1:1(0) ack 10001 win 0

+.6 > . 10000:10000(0) ack 1
+.1 < . 1:1(0) ack 10001 win 0

+1.2 > . 10000:10000(0) ack 1
+.1 < . 1:1(0) ack 10001 win 0

+2.4 > . 10000:10000(0) ack 1
+.1 < . 1:1(0) ack 10001 win 0

+4.8 > . 10000:10000(0) ack 1
+.1 < . 1:1(0) ack 10001 win 0

+9.6 > . 10000:10000(0) ack 1
+.1 < . 1:1(0) ack 10001 win 0

+19.2 > . 10000:10000(0) ack 1
+.1 < . 1:1(0) ack 10001 win 0

+30 > . 10000:10000(0) ack 1
+.1 < . 1:1(0) ack 10001 win 0

+30 > . 10000:10000(0) ack 1
+.1 < . 1:1(0) ack 10001 win 193

// Received non-zero window update. Send rest of the data.
+0 > P. 10001:20001(10000) ack 1
+.1 < . 1:1(0) ack 20001 win 457

2023-07-28 15:34:36

by Neal Cardwell

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <[email protected]> wrote:
> >
> > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <[email protected]> wrote:
> > >
> > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <[email protected]> wrote:
> > > >
> > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <[email protected]> wrote:
> > > > >
> > > > > On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
> > > > > >
> > > > > > From: Menglong Dong <[email protected]>
> > > > > >
> > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > right all the time.
> > > > > >
> > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > TCP_RTO_MAX sooner or later.
> > > > > >
> > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > >
> > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > >
> > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > which is exact the timestamp of the timeout.
> > > > > >
> > > > > > Signed-off-by: Menglong Dong <[email protected]>
> > > > > > ---
> > > > > > net/ipv4/tcp_timer.c | 6 +++++-
> > > > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > > tp->snd_una, tp->snd_nxt);
> > > > > > }
> > > > > > #endif
> > > > > > - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > + /* It's a little rough here, we regard any valid packet that
> > > > > > + * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > + * packet.
> > > > > > + */
> > > > > > + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > tcp_write_err(sk);
> > > > > > goto out;
> > > > > > }
> > > > >
> > > > >
> > > > > Hmm, this looks like a net candidate, since this is unrelated to the
> > > > > other patches ?
> > > >
> > > > Yeah, this patch can be standalone. However, considering the
> > > > purpose of this series, it is necessary. Without this patch, the
> > > > OOM probe will always timeout after a few minutes.
> > > >
> > > > I'm not sure if I express the problem clearly in the commit log.
> > > > Let's explain it more.
> > > >
> > > > Let's mark the timestamp of the 10th timeout of the rtx timer
> > > > as TS1. Now, the retransmission happens and the ACK of
> > > > the retransmitted packet will update the tp->rcv_tstamp to
> > > > TS1+rtt.
> > > >
> > > > The RTO now is TCP_RTO_MAX. So let's see what will
> > > > happen in the 11th timeout. As we timeout after 122877ms,
> > > > so tcp_jiffies32 now is "TS1+122877ms", and
> > > > "tcp_jiffies32 - tp->rcv_tstamp" is
> > > > "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> > > > which is always bigger than TCP_RTO_MAX, which is 120000ms.
> > > >
> > > > >
> > > > > Neal, what do you think ?
> > >
> > > Sorry, I am probably missing something here, but: what would ever make
> > > this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> > > > TCP_RTO_MAX) true? :-)
> > >
> >
> > If the snd_wnd is 0, we need to keep probing until the window
> > is available. Meanwhile, any retransmission that don't have
> > a corresponding ACK (see what we do in the 1st patch), which
> > can be caused by the lost of the ACK or the lost of the retransmitted
> > packet, can make the condition true, as the tp->rcv_tstamp can't be
> > updated in time.
> >
> > This is a little strict here. In the tcp_probe_timer(), we are allowed to
> > retransmit the probe0 packet for sysctl_tcp_retries2 times. But
> > we don't allow packets to be lost here.
> >
> > > In your nicely explained scenario, your new expression,
> > > icsk->icsk_timeout - tp->rcv_tstamp, will be:
> > >
> > > icsk->icsk_timeout - tp->rcv_tstamp
> > > = TS1 + 120 sec - (TS1+rtt)
> > > = 120 sec - RTT
> > >
> > > AFAICT there is no way for that expression to be bigger than
> > > TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)
> > >
> > > So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
> > > TCP_RTO_MAX) will always be false, so rather than this patch we may as
> > > well remove the if check and the body of the if block?
> > >
> >
> > Hmm......as I explained above, the condition will be true
> > if the real packet loss happens. And I think it is the origin
> > design.
> >
> > > To me such a change does not seem like a safe and clear bug fix for
> > > the "net" branch but rather a riskier design change (appropriate for
> > > "net-next" branch) that has connections retry forever when the
> > > receiver retracts the window to zero, under the estimation that this
> > > is preferable to having the connections die in such a case.
> > >
> > > There might be apps that depend on the old behavior of having
> > > connections die in such cases, so we might want to have this new
> > > fail-faster behavior guarded by a sysctl in case some sites need to
> > > revert to the older behavior? Not sure...
> >
> > Yeah, the behavior here will be different for the users. I'm not
> > sure if there are any users that rely on such behavior.
> >
> > What do you think, Eric? Do we need a sysctl here?
> >
>
> I honestly do not know what problem you want to solve.
>
> As Neal pointed out, the new condition would not trigger,
> so one can question about the whole piece of code,
> what is its purpose exactly ?
>
> When receiving WIN 0 acks, we should enter the so called probe0 state.
> Maybe the real issue is that the 'receiver' will falsely send WIN X>0 ACKS,
> because win probe acks do not really ensure memory is available to
> receive new packets ?
>
> Maybe provide a packetdrill test to show what kind of issue you are facing...
>
> In Google kernels, we have TCP_MAX_RTO reduced to 30 seconds, and the
> following test runs fine.
>
> // Test how sender reacts to unexpected arrival rwin of 0.
>
> `../common/defaults.sh`
>
> // Create a socket.
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> // Establish a connection.
> +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
> +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> +.1 < . 1:1(0) ack 1 win 457
> +0 accept(3, ..., ...) = 4
>
> +0 write(4, ..., 20000) = 20000
> +0 > P. 1:10001(10000) ack 1
> +.1 < . 1:1(0) ack 10001 win 0
>
> // Send zwp since we received rwin of 0 and have data to send.
> +.3 > . 10000:10000(0) ack 1
> +.1 < . 1:1(0) ack 10001 win 0
>
> +.6 > . 10000:10000(0) ack 1
> +.1 < . 1:1(0) ack 10001 win 0
>
> +1.2 > . 10000:10000(0) ack 1
> +.1 < . 1:1(0) ack 10001 win 0
>
> +2.4 > . 10000:10000(0) ack 1
> +.1 < . 1:1(0) ack 10001 win 0
>
> +4.8 > . 10000:10000(0) ack 1
> +.1 < . 1:1(0) ack 10001 win 0
>
> +9.6 > . 10000:10000(0) ack 1
> +.1 < . 1:1(0) ack 10001 win 0
>
> +19.2 > . 10000:10000(0) ack 1
> +.1 < . 1:1(0) ack 10001 win 0
>
> +30 > . 10000:10000(0) ack 1
> +.1 < . 1:1(0) ack 10001 win 0
>
> +30 > . 10000:10000(0) ack 1
> +.1 < . 1:1(0) ack 10001 win 193
>
> // Received non-zero window update. Send rest of the data.
> +0 > P. 10001:20001(10000) ack 1
> +.1 < . 1:1(0) ack 20001 win 457

In that packetdrill case AFAICT that is the ZWP timer firing, and the
sender sends a ZWP.

I think maybe Menglong is looking more at something like the following
scenario, where at the time the RTO timer fires the data sender finds
the tp->snd_wnd is zero, so it sends a retransmit of the
lowest-sequence data packet.

Here is a packetdrill case and the tcpdump trace on an upstream
net-next kernel... I have not worked out all the details at the end,
but perhaps it can help move the discussion forward:


~/packetdrill/gtests/net/tcp/receiver_window# cat rwin-rto-zero-window.pkt
// Test how sender reacts to unexpected arrival rwin of 0.

`../common/defaults.sh`

// Create a socket.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

// Establish a connection.
+.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
+0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 14>
+.1 < . 1:1(0) ack 1 win 457
+0 accept(3, ..., ...) = 4

+0 write(4, ..., 20000) = 20000
+0 > P. 1:10001(10000) ack 1

// TLP
+.2 > . 10001:11001(1000) ack 1
// Receiver has retracted rwin to 0
// (perhaps from the 2023 proposed OOM code?).
+.1 < . 1:1(0) ack 1 win 0

// RTO, and in tcp_retransmit_timer() we see the receiver window is zero,
// so we take the special if (!tp->snd_wnd...) code path.
+.2 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1 win 0

+.5 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1 win 0

+1.2 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1 win 0

+2.4 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1 win 0

+4.8 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1 win 0

+9.6 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1 win 0

+19.2 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1 win 0

+38.4 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1 win 0

+76.8 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1 win 0

+120 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1 win 0

+120 > . 1:1001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 1000

// Received non-zero window update. Send more data.
+0 > P. 1001:3001(2000) ack 1
+.1 < . 1:1(0) ack 3001 win 1000

----------
When I run that script on a net-next kernel I see the rounding up of
the RTO to 122 secs rather than 120 secs, but for whatever reason the
script does not cause the socket to die early...

The tcpdump trace:

tcpdump -ttt -n -i any port 8080 &

->

~/packetdrill/gtests/net/tcp/receiver_window#
../../packetdrill/packetdrill rwin-rto-zero-window.pkt
00:01:01.370344 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [S], seq 0, win 65535, options [mss
1000,nop,nop,sackOK,nop,wscale 6], length 0
00:00:00.000096 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [S.], seq 3847169154, ack 1, win 65535, options [mss
1460,nop,nop,sackOK,nop,wscale 14], length 0
00:00:00.100277 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 457, length 0
00:00:00.000090 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [P.], seq 1:2001, ack 1, win 4, length 2000: HTTP
00:00:00.000006 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [P.], seq 2001:4001, ack 1, win 4, length 2000: HTTP
00:00:00.000003 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [P.], seq 4001:6001, ack 1, win 4, length 2000: HTTP
00:00:00.000002 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [P.], seq 6001:8001, ack 1, win 4, length 2000: HTTP
00:00:00.000001 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [P.], seq 8001:10001, ack 1, win 4, length 2000: HTTP
00:00:00.209131 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 10001:11001, ack 1, win 4, length 1000: HTTP
00:00:00.100190 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:00:00.203824 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100175 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:00:00.507835 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100192 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:00:01.115858 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100182 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:00:02.331747 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100198 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:00:04.955980 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100197 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:00:09.627985 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100179 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:00:19.355725 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100203 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:00:42.395633 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100202 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:01:17.724059 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100201 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:02:02.779516 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100229 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
00:02:02.779828 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
00:00:00.100230 ? In IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1001, win 1000, length 0
00:00:00.000034 ? Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 11001:12001, ack 1, win 4, length 1000: HTTP
00:00:00.000005 ? Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 12001:13001, ack 1, win 4, length 1000: HTTP

rwin-rto-zero-window.pkt:62: error handling packet: live packet field
tcp_psh: expected: 1 (0x1) vs actual: 0 (0x0)
script packet: 405.390244 P. 1001:3001(2000) ack 1
actual packet: 405.390237 . 11001:13001(2000) ack 1 win 4

2023-07-28 18:11:52

by Neal Cardwell

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Fri, Jul 28, 2023 at 7:25 AM Neal Cardwell <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <[email protected]> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <[email protected]> wrote:
> > > >
> > > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <[email protected]> wrote:
> > > > >
> > > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
> > > > > > >
> > > > > > > From: Menglong Dong <[email protected]>
> > > > > > >
> > > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > > right all the time.
> > > > > > >
> > > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > > TCP_RTO_MAX sooner or later.
> > > > > > >
> > > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > > >
> > > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > > >
> > > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > > which is exact the timestamp of the timeout.
> > > > > > >
> > > > > > > Signed-off-by: Menglong Dong <[email protected]>
> > > > > > > ---
> > > > > > > net/ipv4/tcp_timer.c | 6 +++++-
> > > > > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > > > tp->snd_una, tp->snd_nxt);
> > > > > > > }
> > > > > > > #endif
> > > > > > > - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > + /* It's a little rough here, we regard any valid packet that
> > > > > > > + * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > > + * packet.
> > > > > > > + */
> > > > > > > + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > tcp_write_err(sk);
> > > > > > > goto out;
> > > > > > > }

One potential pre-existing issue with this logic: if the connection is
restarting from idle, then tp->rcv_tstamp could already be a long time
(minutes or hours) in the past even on the first RTO, in which case
the very first RTO that found a zero tp->snd_wnd would find this
check returns true, and would destroy the connection immediately. This
seems extremely brittle.
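To make the arithmetic concrete, here is a small standalone sketch (not kernel code; the function name and the millisecond jiffies values are illustrative) of the patched expression, showing why it stays false in the steady state but fires on the very first RTO when tp->rcv_tstamp is stale after an idle period:

```c
#include <assert.h>
#include <stdint.h>

#define TCP_RTO_MAX_MS 120000u  /* illustrative: TCP_RTO_MAX as milliseconds */

/* Model of the patched check, with jiffies as u32 milliseconds.
 * Unsigned subtraction wraps the same way the kernel's does. */
static int would_kill_connection(uint32_t icsk_timeout, uint32_t rcv_tstamp)
{
	return (uint32_t)(icsk_timeout - rcv_tstamp) > TCP_RTO_MAX_MS;
}
```

In the steady state the timer was armed roughly one RTT after the last ACK, so the difference is TCP_RTO_MAX minus the RTT and the check never fires; with rcv_tstamp an hour in the past, even a short first RTO trips it immediately.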

AFAICT it would be safer to replace this logic with a call to the
standard tcp_write_timeout() logic that has a more robust check to see
if the connection should be destroyed.

neal

2023-07-31 03:51:18

by Menglong Dong

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Fri, Jul 28, 2023 at 10:25 PM Neal Cardwell <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <[email protected]> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <[email protected]> wrote:
> > > >
> > > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <[email protected]> wrote:
> > > > >
> > > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
> > > > > > >
> > > > > > > From: Menglong Dong <[email protected]>
> > > > > > >
> > > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > > right all the time.
> > > > > > >
> > > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > > TCP_RTO_MAX sooner or later.
> > > > > > >
> > > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > > >
> > > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > > >
> > > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > > which is exact the timestamp of the timeout.
> > > > > > >
> > > > > > > Signed-off-by: Menglong Dong <[email protected]>
> > > > > > > ---
> > > > > > > net/ipv4/tcp_timer.c | 6 +++++-
> > > > > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > > > tp->snd_una, tp->snd_nxt);
> > > > > > > }
> > > > > > > #endif
> > > > > > > - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > + /* It's a little rough here, we regard any valid packet that
> > > > > > > + * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > > + * packet.
> > > > > > > + */
> > > > > > > + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > tcp_write_err(sk);
> > > > > > > goto out;
> > > > > > > }
> > > > > >
> > > > > >
> > > > > > Hmm, this looks like a net candidate, since this is unrelated to the
> > > > > > other patches ?
> > > > >
> > > > > Yeah, this patch can be standalone. However, considering the
> > > > > purpose of this series, it is necessary. Without this patch, the
> > > > > OOM probe will always timeout after a few minutes.
> > > > >
> > > > > I'm not sure if I express the problem clearly in the commit log.
> > > > > Let's explain it more.
> > > > >
> > > > > Let's mark the timestamp of the 10th timeout of the rtx timer
> > > > > as TS1. Now, the retransmission happens and the ACK of
> > > > > the retransmitted packet will update the tp->rcv_tstamp to
> > > > > TS1+rtt.
> > > > >
> > > > > The RTO now is TCP_RTO_MAX. So let's see what will
> > > > > happen in the 11th timeout. As we timeout after 122877ms,
> > > > > so tcp_jiffies32 now is "TS1+122877ms", and
> > > > > "tcp_jiffies32 - tp->rcv_tstamp" is
> > > > > "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> > > > > which is always bigger than TCP_RTO_MAX, which is 120000ms.
> > > > >
> > > > > >
> > > > > > Neal, what do you think ?
> > > >
> > > > Sorry, I am probably missing something here, but: what would ever make
> > > > this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> > > > > TCP_RTO_MAX) true? :-)
> > > >
> > >
> > > If the snd_wnd is 0, we need to keep probing until the window
> > > is available. Meanwhile, any retransmission that don't have
> > > a corresponding ACK (see what we do in the 1st patch), which
> > > can be caused by the lost of the ACK or the lost of the retransmitted
> > > packet, can make the condition true, as the tp->rcv_tstamp can't be
> > > updated in time.
> > >
> > > This is a little strict here. In the tcp_probe_timer(), we are allowed to
> > > retransmit the probe0 packet for sysctl_tcp_retries2 times. But
> > > we don't allow packets to be lost here.
> > >
> > > > In your nicely explained scenario, your new expression,
> > > > icsk->icsk_timeout - tp->rcv_tstamp, will be:
> > > >
> > > > icsk->icsk_timeout - tp->rcv_tstamp
> > > > = TS1 + 120 sec - (TS1+rtt)
> > > > = 120 sec - RTT
> > > >
> > > > AFAICT there is no way for that expression to be bigger than
> > > > TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)
> > > >
> > > > So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
> > > > TCP_RTO_MAX) will always be false, so rather than this patch we may as
> > > > well remove the if check and the body of the if block?
> > > >
> > >
> > > Hmm......as I explained above, the condition will be true
> > > if the real packet loss happens. And I think it is the origin
> > > design.
> > >
> > > > To me such a change does not seem like a safe and clear bug fix for
> > > > the "net" branch but rather a riskier design change (appropriate for
> > > > "net-next" branch) that has connections retry forever when the
> > > > receiver retracts the window to zero, under the estimation that this
> > > > is preferable to having the connections die in such a case.
> > > >
> > > > There might be apps that depend on the old behavior of having
> > > > connections die in such cases, so we might want to have this new
> > > > fail-faster behavior guarded by a sysctl in case some sites need to
> > > > revert to the older behavior? Not sure...
> > >
> > > Yeah, the behavior here will be different for the users. I'm not
> > > sure if there are any users that rely on such behavior.
> > >
> > > What do you think, Eric? Do we need a sysctl here?
> > >
> >
> > I honestly do not know what problem you want to solve.
> >
> > As Neal pointed out, the new condition would not trigger,
> > so one can question about the whole piece of code,
> > what is its purpose exactly ?
> >
> > When receiving WIN 0 acks, we should enter the so called probe0 state.
> > Maybe the real issue is that the 'receiver' will falsely send WIN X>0 ACKS,
> > because win probe acks do not really ensure memory is available to
> > receive new packets ?
> >
> > Maybe provide a packetdrill test to show what kind of issue you are facing...
> >
> > In Google kernels, we have TCP_MAX_RTO reduced to 30 seconds, and the
> > following test runs fine.
> >
> > // Test how sender reacts to unexpected arrival rwin of 0.
> >
> > `../common/defaults.sh`
> >
> > // Create a socket.
> > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > +0 bind(3, ..., ...) = 0
> > +0 listen(3, 1) = 0
> >
> > // Establish a connection.
> > +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
> > +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> > +.1 < . 1:1(0) ack 1 win 457
> > +0 accept(3, ..., ...) = 4
> >
> > +0 write(4, ..., 20000) = 20000
> > +0 > P. 1:10001(10000) ack 1
> > +.1 < . 1:1(0) ack 10001 win 0
> >
> > // Send zwp since we received rwin of 0 and have data to send.
> > +.3 > . 10000:10000(0) ack 1
> > +.1 < . 1:1(0) ack 10001 win 0
> >
> > +.6 > . 10000:10000(0) ack 1
> > +.1 < . 1:1(0) ack 10001 win 0
> >
> > +1.2 > . 10000:10000(0) ack 1
> > +.1 < . 1:1(0) ack 10001 win 0
> >
> > +2.4 > . 10000:10000(0) ack 1
> > +.1 < . 1:1(0) ack 10001 win 0
> >
> > +4.8 > . 10000:10000(0) ack 1
> > +.1 < . 1:1(0) ack 10001 win 0
> >
> > +9.6 > . 10000:10000(0) ack 1
> > +.1 < . 1:1(0) ack 10001 win 0
> >
> > +19.2 > . 10000:10000(0) ack 1
> > +.1 < . 1:1(0) ack 10001 win 0
> >
> > +30 > . 10000:10000(0) ack 1
> > +.1 < . 1:1(0) ack 10001 win 0
> >
> > +30 > . 10000:10000(0) ack 1
> > +.1 < . 1:1(0) ack 10001 win 193
> >
> > // Received non-zero window update. Send rest of the data.
> > +0 > P. 10001:20001(10000) ack 1
> > +.1 < . 1:1(0) ack 20001 win 457
>
> In that packetdrill case AFAICT that is the ZWP timer firing, and the
> sender sends a ZWP.
>
> I think maybe Menglong is looking more at something like the following
> scenario, where at the time the RTO timer fires the data sender finds
> the tp->snd_wnd is zero, so it sends a retransmit of the
> lowest-sequence data packet.
>

Sorry for the late reply; I was traveling over the weekend for two days.

Yes, I think it is close to my scenario. The sender will send
a retransmit of the lowest-sequence data packet to probe
the window.

> Here is a packetdrill case and the tcpdump trace on an upstream
> net-next kernel... I have not worked out all the details at the end,
> but perhaps it can help move the discussion forward:
>

packetdrill is more capable than I thought; I never imagined it could do
such a thing.

>
> ~/packetdrill/gtests/net/tcp/receiver_window# cat rwin-rto-zero-window.pkt
> // Test how sender reacts to unexpected arrival rwin of 0.
>
> `../common/defaults.sh`
>
> // Create a socket.
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> // Establish a connection.
> +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
> +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 14>
> +.1 < . 1:1(0) ack 1 win 457
> +0 accept(3, ..., ...) = 4
>
> +0 write(4, ..., 20000) = 20000
> +0 > P. 1:10001(10000) ack 1
>
> // TLP
> +.2 > . 10001:11001(1000) ack 1
> // Receiver has retracted rwin to 0
> // (perhaps from the 2023 proposed OOM code?).
> +.1 < . 1:1(0) ack 1 win 0
>
> // RTO, and in tcp_retransmit_timer() we see the receiver window is zero,
> // so we take the special f (!tp->snd_wnd...) code path.
> +.2 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +.5 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +1.2 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +2.4 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +4.8 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +9.6 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +19.2 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +38.4 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +76.8 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +120 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +120 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1001 win 1000
>
> // Received non-zero window update. Send more data.
> +0 > P. 1001:3001(2000) ack 1
> +.1 < . 1:1(0) ack 3001 win 1000
>
> ----------
> When I run that script on a net-next kernel I see the rounding up of
> the RTO to 122 secs rather than 120 secs, but for whatever reason the
> script does not cause the socket to die early...
>

Enn... it should die; let me dig deeper into the script first.
The script looks nice; I need a little time to learn it.

Thanks!
Menglong Dong

> The tcpdump trace:
>
> tcpdump -ttt -n -i any port 8080 &
>
> ->
>
> ~/packetdrill/gtests/net/tcp/receiver_window#
> ../../packetdrill/packetdrill rwin-rto-zero-window.pkt
> 00:01:01.370344 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [S], seq 0, win 65535, options [mss
> 1000,nop,nop,sackOK,nop,wscale 6], length 0
> 00:00:00.000096 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [S.], seq 3847169154, ack 1, win 65535, options [mss
> 1460,nop,nop,sackOK,nop,wscale 14], length 0
> 00:00:00.100277 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 457, length 0
> 00:00:00.000090 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 1:2001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000006 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 2001:4001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000003 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 4001:6001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000002 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 6001:8001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000001 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 8001:10001, ack 1, win 4, length 2000: HTTP
> 00:00:00.209131 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 10001:11001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100190 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:00.203824 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100175 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:00.507835 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100192 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:01.115858 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100182 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:02.331747 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100198 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:04.955980 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100197 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:09.627985 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100179 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:19.355725 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100203 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:42.395633 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100202 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:01:17.724059 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100201 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:02:02.779516 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100229 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:02:02.779828 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100230 ? In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1001, win 1000, length 0
> 00:00:00.000034 ? Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 11001:12001, ack 1, win 4, length 1000: HTTP
> 00:00:00.000005 ? Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 12001:13001, ack 1, win 4, length 1000: HTTP
>
> rwin-rto-zero-window.pkt:62: error handling packet: live packet field
> tcp_psh: expected: 1 (0x1) vs actual: 0 (0x0)
> script packet: 405.390244 P. 1001:3001(2000) ack 1
> actual packet: 405.390237 . 11001:13001(2000) ack 1 win 4

2023-07-31 04:02:21

by Menglong Dong

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Sat, Jul 29, 2023 at 1:15 AM Neal Cardwell <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 7:25 AM Neal Cardwell <[email protected]> wrote:
> >
> > On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <[email protected]> wrote:
> > > >
> > > > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <[email protected]> wrote:
> > > > >
> > > > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
> > > > > > > >
> > > > > > > > From: Menglong Dong <[email protected]>
> > > > > > > >
> > > > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > > > right all the time.
> > > > > > > >
> > > > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > > > TCP_RTO_MAX sooner or later.
> > > > > > > >
> > > > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > > > >
> > > > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > > > >
> > > > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > > > which is exact the timestamp of the timeout.
> > > > > > > >
> > > > > > > > Signed-off-by: Menglong Dong <[email protected]>
> > > > > > > > ---
> > > > > > > > net/ipv4/tcp_timer.c | 6 +++++-
> > > > > > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > > > > tp->snd_una, tp->snd_nxt);
> > > > > > > > }
> > > > > > > > #endif
> > > > > > > > - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > > + /* It's a little rough here, we regard any valid packet that
> > > > > > > > + * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > > > + * packet.
> > > > > > > > + */
> > > > > > > > + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > > tcp_write_err(sk);
> > > > > > > > goto out;
> > > > > > > > }
>
> One potential pre-existing issue with this logic: if the connection is
> restarting from idle, then tp->rcv_tstamp could already be a long time
> (minutes or hours) in the past even on the first RTO, in which case
> the very first RTO that found a zero tp->snd_wnd would find this
> check returns true, and would destroy the connection immediately. This
> seems extremely brittle.

Yes, this scenario can happen and cause the connection
to break unexpectedly.

>
> AFAICT it would be safer to replace this logic with a call to the
> standard tcp_write_timeout() logic that has a more robust check to see
> if the connection should be destroyed.

Yes, we need a more robust check here. But I think tcp_write_timeout()
is not a good choice: the icsk->icsk_retransmits counter won't increase
and can stay at 0 in this scenario, which makes tcp_write_timeout()
always return 0.
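As a minimal model of that point (hypothetical names, not the kernel's actual tcp_write_timeout() logic): if the retransmit counter never advances on this path, a retries-budget check can never report expiry:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a retries-based expiry test in the style of
 * tcp_write_timeout(): the connection is only declared dead once the
 * retransmit counter exceeds the retry budget (e.g. tcp_retries2). */
static int write_timeout_expired(uint32_t icsk_retransmits,
				 uint32_t retries_budget)
{
	return icsk_retransmits > retries_budget;
}
```

So with icsk_retransmits stuck at 0, write_timeout_expired(0, tcp_retries2) stays false no matter how long the probing continues.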

Enn...let me think again.

>
> neal

2023-07-31 09:31:03

by Menglong Dong

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Fri, Jul 28, 2023 at 10:25 PM Neal Cardwell <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <[email protected]> wrote:
[...]
>
> In that packetdrill case AFAICT that is the ZWP timer firing, and the
> sender sends a ZWP.
>
> I think maybe Menglong is looking more at something like the following
> scenario, where at the time the RTO timer fires the data sender finds
> the tp->snd_wnd is zero, so it sends a retransmit of the
> lowest-sequence data packet.
>
> Here is a packetdrill case and the tcpdump trace on an upstream
> net-next kernel... I have not worked out all the details at the end,
> but perhaps it can help move the discussion forward:
>
>
> ~/packetdrill/gtests/net/tcp/receiver_window# cat rwin-rto-zero-window.pkt
> // Test how sender reacts to unexpected arrival rwin of 0.
>
> `../common/defaults.sh`
>
> // Create a socket.
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> // Establish a connection.
> +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
> +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 14>
> +.1 < . 1:1(0) ack 1 win 457
> +0 accept(3, ..., ...) = 4
>
> +0 write(4, ..., 20000) = 20000
> +0 > P. 1:10001(10000) ack 1
>
> // TLP
> +.2 > . 10001:11001(1000) ack 1
> // Receiver has retracted rwin to 0
> // (perhaps from the 2023 proposed OOM code?).
> +.1 < . 1:1(0) ack 1 win 0
>
> // RTO, and in tcp_retransmit_timer() we see the receiver window is zero,
> // so we take the special f (!tp->snd_wnd...) code path.
> +.2 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +.5 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +1.2 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +2.4 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +4.8 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +9.6 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +19.2 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +38.4 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +76.8 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +120 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1 win 0
>
> +120 > . 1:1001(1000) ack 1
> +.1 < . 1:1(0) ack 1001 win 1000
>
> // Received non-zero window update. Send more data.
> +0 > P. 1001:3001(2000) ack 1
> +.1 < . 1:1(0) ack 3001 win 1000
>
> ----------
> When I run that script on a net-next kernel I see the rounding up of
> the RTO to 122 secs rather than 120 secs, but for whatever reason the
> script does not cause the socket to die early...
>

I think I know the reason now. Without the 2nd patch of this series,
the ACK can't update the rwin to 0, as it will be ignored in
tcp_may_update_window().

However, you can send an ACK that acknowledges new data to update
the rwin to 0. I modified your script, and the socket dies as we
expected:

// Test how sender reacts to unexpected arrival rwin of 0.

`../common/defaults.sh`

// Create a socket.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

// Establish a connection.
+.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
+0 > S. 0:0(0) ack 1 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8>
+.1 < . 1:1(0) ack 1 win 457
+0 accept(3, ..., ...) = 4

+0 write(4, ..., 20000) = 20000
+0 > P. 1:10001(10000) ack 1

// Update the window to 0. "ack 0 win 0" won't update the window, as it
// will be ignored by tcp_may_update_window()
+.1 < . 1:1(0) ack 1001 win 0

// RTO, and in tcp_retransmit_timer() we see the receiver window is zero,
// so we take the special f (!tp->snd_wnd...) code path.
+.2 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+.5 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+1.2 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+2.4 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+4.8 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+9.6 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+19.2 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+38.4 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

+76.8 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0

// socket will die in tcp_retransmit_timer() in the
// "tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX" code path.
// Following retransmit won't happen.
+120 > . 1001:2001(1000) ack 1
+.1 < . 1:1(0) ack 1001 win 0
------------------------------------------------------------------------------

I don't know how to check for the death of the socket with
packetdrill, so I checked it with:
ss -nitme | grep 8080 | grep on
and I can see the socket die after the 120-second timer
expires.

$ packetdrill ./rwin-rto-zero-window.pkt
./rwin-rto-zero-window.pkt:55: error handling packet: Timed out
waiting for packet


> The tcpdump trace:
>
> tcpdump -ttt -n -i any port 8080 &
>
> ->
>
> ~/packetdrill/gtests/net/tcp/receiver_window#
> ../../packetdrill/packetdrill rwin-rto-zero-window.pkt
> 00:01:01.370344 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [S], seq 0, win 65535, options [mss
> 1000,nop,nop,sackOK,nop,wscale 6], length 0
> 00:00:00.000096 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [S.], seq 3847169154, ack 1, win 65535, options [mss
> 1460,nop,nop,sackOK,nop,wscale 14], length 0
> 00:00:00.100277 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 457, length 0
> 00:00:00.000090 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 1:2001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000006 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 2001:4001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000003 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 4001:6001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000002 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 6001:8001, ack 1, win 4, length 2000: HTTP
> 00:00:00.000001 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 8001:10001, ack 1, win 4, length 2000: HTTP
> 00:00:00.209131 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 10001:11001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100190 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:00.203824 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100175 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:00.507835 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100192 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:01.115858 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100182 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:02.331747 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100198 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:04.955980 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100197 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:09.627985 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100179 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:19.355725 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100203 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:00:42.395633 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100202 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:01:17.724059 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100201 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:02:02.779516 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100229 tun0 In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
> 00:02:02.779828 tun0 Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
> 00:00:00.100230 ? In IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1001, win 1000, length 0
> 00:00:00.000034 ? Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 11001:12001, ack 1, win 4, length 1000: HTTP
> 00:00:00.000005 ? Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 12001:13001, ack 1, win 4, length 1000: HTTP
>
> rwin-rto-zero-window.pkt:62: error handling packet: live packet field
> tcp_psh: expected: 1 (0x1) vs actual: 0 (0x0)
> script packet: 405.390244 P. 1001:3001(2000) ack 1
> actual packet: 405.390237 . 11001:13001(2000) ack 1 win 4

2023-07-31 11:23:35

by Menglong Dong

Subject: Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()

On Sat, Jul 29, 2023 at 1:15 AM Neal Cardwell <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 7:25 AM Neal Cardwell <[email protected]> wrote:
> >
> > On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <[email protected]> wrote:
> > > >
> > > > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <[email protected]> wrote:
> > > > >
> > > > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Jul 27, 2023 at 2:52 PM <[email protected]> wrote:
> > > > > > > >
> > > > > > > > From: Menglong Dong <[email protected]>
> > > > > > > >
> > > > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > > > right all the time.
> > > > > > > >
> > > > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > > > TCP_RTO_MAX sooner or later.
> > > > > > > >
> > > > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > > > >
> > > > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > > > >
> > > > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > > > which is exact the timestamp of the timeout.
> > > > > > > >
> > > > > > > > Signed-off-by: Menglong Dong <[email protected]>
> > > > > > > > ---
> > > > > > > > net/ipv4/tcp_timer.c | 6 +++++-
> > > > > > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > > > > tp->snd_una, tp->snd_nxt);
> > > > > > > > }
> > > > > > > > #endif
> > > > > > > > - if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > > + /* It's a little rough here, we regard any valid packet that
> > > > > > > > + * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > > > + * packet.
> > > > > > > > + */
> > > > > > > > + if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > > tcp_write_err(sk);
> > > > > > > > goto out;
> > > > > > > > }
>
> One potential pre-existing issue with this logic: if the connection is
> restarting from idle, then tp->rcv_tstamp could already be a long time
> (minutes or hours) in the past even on the first RTO, in which case
> the very first RTO that found a zero tp->snd_wnd would find this
> check returns true, and would destroy the connection immediately. This
> seems extremely brittle.
>
> AFAICT it would be safer to replace this logic with a call to the
> standard tcp_write_timeout() logic that has a more robust check to see
> if the connection should be destroyed.

How about we check it with:

icsk->icsk_timeout - max(tp->retrans_stamp, tp->rcv_tstamp) > TCP_RTO_MAX

?

Therefore, we wouldn't need much modification. I have to
say that the check here is rough: a single loss of the
retransmitted packet or of its ACK can cause the death
of the socket.

Maybe we need a better approach here, which may require
a fair amount of modification.

BTW, should I split out this patch from this series?

>
> neal