2023-08-11 03:16:17

by Menglong Dong

Subject: [PATCH net-next v4 0/4] net: tcp: support probing OOM

From: Menglong Dong <[email protected]>

This series makes some small changes so that TCP retransmissions become
zero-window probes when the receiver drops the skb because of memory
pressure.

In the 1st patch, we reply with a zero-window ACK if the skb is dropped
because we are out of memory, instead of dropping the skb silently.

In the 2nd patch, we allow a zero-window ACK to update the window.

In the 3rd patch, we fix the unexpected socket death when snd_wnd is 0 in
tcp_retransmit_timer().

In the 4th patch, we refactor the debug message in tcp_retransmit_timer()
to make it more accurate.

After these changes, TCP can keep probing an out-of-memory receiver forever.

Changes since v3:
- make the timeout "2 * TCP_RTO_MAX" in the 3rd patch
- tp->retrans_stamp is not based on jiffies and can't be compared with
icsk->icsk_timeout in the 3rd patch. Fix it.
- introduce the 4th patch

Changes since v2:
- refactor the code to avoid code duplication in the 1st patch
- use after() instead of max() in tcp_rtx_probe0_timed_out()

Changes since v1:
- send 0 rwin ACK for the receive queue empty case when necessary in the
1st patch
- send the ACK immediately by using the ICSK_ACK_NOW flag in the 1st
patch
- consider the case of the connection restarting from idle, as Neal
commented, in the 3rd patch

Menglong Dong (4):
net: tcp: send zero-window ACK when no memory
net: tcp: allow zero-window ACK update the window
net: tcp: fix unexcepted socket die when snd_wnd is 0
net: tcp: refactor the dbg message in tcp_retransmit_timer()

include/net/inet_connection_sock.h | 3 ++-
net/ipv4/tcp_input.c | 20 ++++++++++-----
net/ipv4/tcp_output.c | 14 +++++++---
net/ipv4/tcp_timer.c | 41 ++++++++++++++++++++++--------
4 files changed, 56 insertions(+), 22 deletions(-)

--
2.40.1



2023-08-11 03:46:41

by Menglong Dong

Subject: [PATCH net-next v4 2/4] net: tcp: allow zero-window ACK update the window

From: Menglong Dong <[email protected]>

For now, an ACK can update the window in the following cases, according
to tcp_may_update_window():

1. the ACK acknowledges new data
2. the ACK has new data
3. the ACK expands the window and its seq is valid

Now, we also allow the ACK to update the window if the new window is 0 and
its seq/ack is valid. This covers the case where the receiver replies with
a zero-window ACK when it is under memory stress and can't queue the new
data.

Signed-off-by: Menglong Dong <[email protected]>
---
net/ipv4/tcp_input.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2ac059483410..d34d52fdfdb1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3525,7 +3525,7 @@ static inline bool tcp_may_update_window(const struct tcp_sock *tp,
{
return after(ack, tp->snd_una) ||
after(ack_seq, tp->snd_wl1) ||
- (ack_seq == tp->snd_wl1 && nwin > tp->snd_wnd);
+ (ack_seq == tp->snd_wl1 && (nwin > tp->snd_wnd || !nwin));
}

/* If we update tp->snd_una, also update tp->bytes_acked */
--
2.40.1
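
The rule changed by this patch can be tried outside the kernel. What
follows is a minimal user-space sketch, not the kernel code: struct
snd_state, seq_after() and may_update_window() are hypothetical stand-ins
for struct tcp_sock, after() and tcp_may_update_window(), and only the
predicate itself is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "seq1 is later than seq2", modeled on the kernel's after(). */
static bool seq_after(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq2 - seq1) < 0;
}

/* Hypothetical stand-in for the three tcp_sock fields the check reads. */
struct snd_state {
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	uint32_t snd_wl1;	/* segment seq used for the last window update */
	uint32_t snd_wnd;	/* window the peer advertised last */
};

/* Same predicate as the patched tcp_may_update_window(). */
static bool may_update_window(const struct snd_state *tp, uint32_t ack,
			      uint32_t ack_seq, uint32_t nwin)
{
	return seq_after(ack, tp->snd_una) ||
	       seq_after(ack_seq, tp->snd_wl1) ||
	       (ack_seq == tp->snd_wl1 && (nwin > tp->snd_wnd || !nwin));
}

/* Usage example: only the zero-window case behaves differently. */
void demo_window_update(void)
{
	struct snd_state tp = { .snd_una = 1000, .snd_wl1 = 500, .snd_wnd = 4096 };

	assert(may_update_window(&tp, 1000, 500, 8192));	/* window grows */
	assert(!may_update_window(&tp, 1000, 500, 2048));	/* non-zero shrink is ignored */
	assert(may_update_window(&tp, 1000, 500, 0));		/* zero window is accepted */
}
```

A non-zero shrink on an otherwise unchanged ACK is still ignored; only a
window of exactly 0 gets through, which is what a receiver under memory
pressure sends.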


2023-08-11 03:51:05

by Menglong Dong

Subject: [PATCH net-next v4 4/4] net: tcp: refactor the dbg message in tcp_retransmit_timer()

From: Menglong Dong <[email protected]>

The debug message in tcp_retransmit_timer() is slightly wrong, because
it can be printed even if we did not receive a new ACK packet from the
remote peer.

Change it to "Probing zero-window", as this is an expected case now. The
description may not be correct.

Also add the time since the last ACK we received and the duration of the
retransmission, both of which are useful for debugging.

The message now looks like this:

Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 209ms ago, lasting 209ms
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 404ms ago, lasting 408ms
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 812ms ago, lasting 1224ms

Signed-off-by: Menglong Dong <[email protected]>
---
net/ipv4/tcp_timer.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index f2a52c11e044..74c70fc1003c 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -519,20 +519,23 @@ void tcp_retransmit_timer(struct sock *sk)
* we cannot allow such beasts to hang infinitely.
*/
struct inet_sock *inet = inet_sk(sk);
+ u32 rtx_delta;
+
+ rtx_delta = tcp_time_stamp(tp) - (tp->retrans_stamp ?: tcp_skb_timestamp(skb));
if (sk->sk_family == AF_INET) {
- net_dbg_ratelimited("Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
- &inet->inet_daddr,
- ntohs(inet->inet_dport),
- inet->inet_num,
- tp->snd_una, tp->snd_nxt);
+ net_dbg_ratelimited("Probing zero-window on %pI4:%u/%u, seq=%u:%u, recv %ums ago, lasting %ums\n",
+ &inet->inet_daddr, ntohs(inet->inet_dport),
+ inet->inet_num, tp->snd_una, tp->snd_nxt,
+ jiffies_to_msecs(jiffies - tp->rcv_tstamp),
+ rtx_delta);
}
#if IS_ENABLED(CONFIG_IPV6)
else if (sk->sk_family == AF_INET6) {
- net_dbg_ratelimited("Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
- &sk->sk_v6_daddr,
- ntohs(inet->inet_dport),
- inet->inet_num,
- tp->snd_una, tp->snd_nxt);
+ net_dbg_ratelimited("Probing zero-window on %pI6:%u/%u, seq=%u:%u, recv %ums ago, lasting %ums\n",
+ &sk->sk_v6_daddr, ntohs(inet->inet_dport),
+ inet->inet_num, tp->snd_una, tp->snd_nxt,
+ jiffies_to_msecs(jiffies - tp->rcv_tstamp),
+ rtx_delta);
}
#endif
if (tcp_rtx_probe0_timed_out(sk, skb)) {
--
2.40.1
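
The two numbers added to the message can be reproduced with a small
user-space sketch. This is an illustration under assumptions, not the
kernel code: rtx_duration_ms() and format_probe_msg() are hypothetical
helpers, the address is passed as a ready-made string instead of going
through %pI4, and all timestamps are plain millisecond counters.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Duration of the current retransmission. Like the patch, fall back to the
 * send time of the skb when no retransmit timestamp has been recorded (0).
 */
static uint32_t rtx_duration_ms(uint32_t now_ms, uint32_t retrans_stamp_ms,
				uint32_t skb_sent_ms)
{
	uint32_t start = retrans_stamp_ms ? retrans_stamp_ms : skb_sent_ms;

	return now_ms - start;	/* unsigned subtraction tolerates wraparound */
}

/* Build the IPv4 flavor of the message; returns what snprintf() returns. */
static int format_probe_msg(char *buf, size_t len, const char *daddr,
			    unsigned int dport, unsigned int lport,
			    uint32_t snd_una, uint32_t snd_nxt,
			    uint32_t recv_ago_ms, uint32_t rtx_ms)
{
	return snprintf(buf, len,
			"Probing zero-window on %s:%u/%u, seq=%u:%u, recv %ums ago, lasting %ums",
			daddr, dport, lport,
			(unsigned int)snd_una, (unsigned int)snd_nxt,
			(unsigned int)recv_ago_ms, (unsigned int)rtx_ms);
}

/* Usage example: rebuild the first sample line from the commit message. */
void demo_probe_msg(void)
{
	char buf[160];

	format_probe_msg(buf, sizeof(buf), "127.0.0.1", 9999, 46946,
			 3737778959u, 3737791503u, 209,
			 rtx_duration_ms(1209, 0, 1000));
	assert(strcmp(buf, "Probing zero-window on 127.0.0.1:9999/46946, "
			   "seq=3737778959:3737791503, recv 209ms ago, "
			   "lasting 209ms") == 0);
}
```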


2023-08-11 04:00:16

by Menglong Dong

Subject: [PATCH net-next v4 3/4] net: tcp: fix unexcepted socket die when snd_wnd is 0

From: Menglong Dong <[email protected]>

In tcp_retransmit_timer(), a connection whose window has shrunk is
regarded as timed out if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'.
This is not always right.

The retransmits become zero-window probes in tcp_retransmit_timer() if
'snd_wnd == 0'. Therefore, icsk->icsk_rto will reach TCP_RTO_MAX sooner
or later.

However, the timer can be delayed: in my testing it fired after 122877ms
rather than after exactly TCP_RTO_MAX.

Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
once the RTO reaches TCP_RTO_MAX, and the socket will die.

Fix this by replacing 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
which is exactly the timestamp of the timeout.

However, the connection can restart from idle, in which case
tp->rcv_tstamp could already be a long time (minutes or hours) in the
past even on the first RTO. So we double-check the timeout against the
duration of the retransmission.

Meanwhile, use "2 * TCP_RTO_MAX" as the timeout to avoid the socket
dying too soon.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/
Signed-off-by: Menglong Dong <[email protected]>
---
v4:
- make the timeout "2 * TCP_RTO_MAX"
- tp->retrans_stamp is not based on jiffies and can't be compared with
icsk->icsk_timeout. Fix it.
v3:
- use after() instead of max() in tcp_rtx_probe0_timed_out()
v2:
- consider the case of the connection restarting from idle, as Neal
  commented
---
net/ipv4/tcp_timer.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index d45c96c7f5a4..f2a52c11e044 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -454,6 +454,22 @@ static void tcp_fastopen_synack_timer(struct sock *sk, struct request_sock *req)
req->timeout << req->num_timeout, TCP_RTO_MAX);
}

+static bool tcp_rtx_probe0_timed_out(const struct sock *sk,
+ const struct sk_buff *skb)
+{
+ const struct tcp_sock *tp = tcp_sk(sk);
+ const int timeout = TCP_RTO_MAX * 2;
+ u32 rcv_delta, rtx_delta;
+
+ rcv_delta = inet_csk(sk)->icsk_timeout - tp->rcv_tstamp;
+ if (rcv_delta <= timeout)
+ return false;
+
+ rtx_delta = (u32)msecs_to_jiffies(tcp_time_stamp(tp) -
+ (tp->retrans_stamp ?: tcp_skb_timestamp(skb)));
+
+ return rtx_delta > timeout;
+}

/**
* tcp_retransmit_timer() - The TCP retransmit timeout handler
@@ -519,7 +535,7 @@ void tcp_retransmit_timer(struct sock *sk)
tp->snd_una, tp->snd_nxt);
}
#endif
- if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
+ if (tcp_rtx_probe0_timed_out(sk, skb)) {
tcp_write_err(sk);
goto out;
}
--
2.40.1
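
The two-stage check in tcp_rtx_probe0_timed_out() can be exercised in user
space. The sketch below is a simplified model under stated assumptions,
not the kernel helper: struct probe_times and probe0_timed_out() are
hypothetical names, HZ is assumed to be 1000 so that one jiffy equals one
millisecond (which makes msecs_to_jiffies() the identity), and RTO_MAX
stands in for TCP_RTO_MAX at 120 seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HZ	1000u		/* assumption: 1 jiffy == 1 ms */
#define RTO_MAX	(120u * HZ)	/* stand-in for TCP_RTO_MAX */

/* Hypothetical flattened view of the timestamps the helper reads. */
struct probe_times {
	uint32_t timer_expiry;	/* icsk_timeout, in jiffies */
	uint32_t last_ack;	/* tp->rcv_tstamp, in jiffies */
	uint32_t now_ms;	/* tcp_time_stamp(tp), in ms */
	uint32_t rtx_start_ms;	/* retrans_stamp or skb timestamp, in ms */
};

static bool probe0_timed_out(const struct probe_times *t)
{
	const uint32_t timeout = 2u * RTO_MAX;
	uint32_t rcv_delta, rtx_delta;

	/* Stage 1: measured from the scheduled expiry, not from "now". */
	rcv_delta = t->timer_expiry - t->last_ack;
	if (rcv_delta <= timeout)
		return false;

	/* Stage 2: an old last_ack alone may only mean the flow was idle. */
	rtx_delta = t->now_ms - t->rtx_start_ms;
	return rtx_delta > timeout;
}

/* Usage example: the three situations described in the commit message. */
void demo_probe0_timeout(void)
{
	struct probe_times recent = { 500000, 400000, 500000, 100000 };
	struct probe_times idle   = { 1000000, 0, 1000000, 999000 };
	struct probe_times dead   = { 1000000, 0, 1000000, 700000 };

	assert(!probe0_timed_out(&recent));	/* ACK seen 100 s before expiry */
	assert(!probe0_timed_out(&idle));	/* old ACK, retransmitting for 1 s */
	assert(probe0_timed_out(&dead));	/* no ACK and 300 s of retransmits */
}
```

Because both deltas are unsigned 32-bit subtractions, the comparison
keeps working when the counters wrap.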


2023-08-11 04:52:06

by Menglong Dong

Subject: [PATCH net-next v4 1/4] net: tcp: send zero-window ACK when no memory

From: Menglong Dong <[email protected]>

For now, the skb is dropped when there is no memory, which makes the
client keep retransmitting until it times out. This is not friendly to
users.

In this patch, we reply with a zero-window ACK in this case to update the
sender's snd_wnd to 0. Therefore, the sender won't time out the
connection and will probe the zero window with the retransmits.

Signed-off-by: Menglong Dong <[email protected]>
---
v3:
- refactor the code to avoid code duplication
v2:
- send 0 rwin ACK for the receive queue empty case when necessary
- send the ACK immediately by using the ICSK_ACK_NOW flag
---
include/net/inet_connection_sock.h | 3 ++-
net/ipv4/tcp_input.c | 18 ++++++++++++------
net/ipv4/tcp_output.c | 14 +++++++++++---
3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c2b15f7e5516..be3c858a2ebb 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -164,7 +164,8 @@ enum inet_csk_ack_state_t {
ICSK_ACK_TIMER = 2,
ICSK_ACK_PUSHED = 4,
ICSK_ACK_PUSHED2 = 8,
- ICSK_ACK_NOW = 16 /* Send the next ACK immediately (once) */
+ ICSK_ACK_NOW = 16, /* Send the next ACK immediately (once) */
+ ICSK_ACK_NOMEM = 32,
};

void inet_csk_init_xmit_timers(struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8e96ebe373d7..2ac059483410 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5059,13 +5059,19 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)

/* Ok. In sequence. In window. */
queue_and_out:
- if (skb_queue_len(&sk->sk_receive_queue) == 0)
- sk_forced_mem_schedule(sk, skb->truesize);
- else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
- reason = SKB_DROP_REASON_PROTO_MEM;
- NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
+ if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
+ /* TODO: maybe ratelimit these WIN 0 ACK ? */
+ inet_csk(sk)->icsk_ack.pending |=
+ (ICSK_ACK_NOMEM | ICSK_ACK_NOW);
+ inet_csk_schedule_ack(sk);
sk->sk_data_ready(sk);
- goto drop;
+
+ if (skb_queue_len(&sk->sk_receive_queue)) {
+ reason = SKB_DROP_REASON_PROTO_MEM;
+ NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
+ goto drop;
+ }
+ sk_forced_mem_schedule(sk, skb->truesize);
}

eaten = tcp_queue_rcv(sk, skb, &fragstolen);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c5412ee77fc8..769a558159ee 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -257,11 +257,19 @@ EXPORT_SYMBOL(tcp_select_initial_window);
static u16 tcp_select_window(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
- u32 old_win = tp->rcv_wnd;
- u32 cur_win = tcp_receive_window(tp);
- u32 new_win = __tcp_select_window(sk);
struct net *net = sock_net(sk);
+ u32 old_win = tp->rcv_wnd;
+ u32 cur_win, new_win;
+
+ /* Make the window 0 if we failed to queue the data because we
+ * are out of memory. The window is temporary, so we don't store
+ * it on the socket.
+ */
+ if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+ return 0;

+ cur_win = tcp_receive_window(tp);
+ new_win = __tcp_select_window(sk);
if (new_win < cur_win) {
/* Danger Will Robinson!
* Don't update rcv_wup/rcv_wnd here or else
--
2.40.1
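
The interplay between the new flag and the window selection can be shown
with a short user-space sketch. It is an illustration under assumptions,
not the kernel code: struct rcv_state, mark_nomem() and select_window()
are hypothetical, the computed window is passed in rather than derived
from socket memory, and only the two flag values are taken from the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as in enum inet_csk_ack_state_t after this patch. */
enum {
	ACK_NOW   = 16,	/* send the next ACK immediately */
	ACK_NOMEM = 32,	/* the last data segment could not be queued */
};

/* Hypothetical stand-in for the receiver state involved. */
struct rcv_state {
	uint8_t  ack_pending;	/* models icsk_ack.pending */
	uint32_t rcv_wnd;	/* window remembered on the socket */
};

/* What the receive path does when the memory charge fails. */
static void mark_nomem(struct rcv_state *rs)
{
	rs->ack_pending |= ACK_NOMEM | ACK_NOW;
}

/* Advertise 0 while ACK_NOMEM is set, without storing it on the socket. */
static uint32_t select_window(struct rcv_state *rs, uint32_t computed_win)
{
	if (rs->ack_pending & ACK_NOMEM)
		return 0;

	rs->rcv_wnd = computed_win;
	return computed_win;
}

/* Usage example: the zero window is temporary and leaves rcv_wnd alone. */
void demo_nomem_window(void)
{
	struct rcv_state rs = { .ack_pending = 0, .rcv_wnd = 0 };

	assert(select_window(&rs, 65535) == 65535);
	mark_nomem(&rs);
	assert(select_window(&rs, 65535) == 0);
	assert(rs.rcv_wnd == 65535);
}
```

Keeping the zero out of rcv_wnd means the receiver's own idea of its
window stays unchanged for later segments.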


2023-08-11 08:10:05

by Eric Dumazet

Subject: Re: [PATCH net-next v4 4/4] net: tcp: refactor the dbg message in tcp_retransmit_timer()

On Fri, Aug 11, 2023 at 5:01 AM <[email protected]> wrote:
>
> From: Menglong Dong <[email protected]>
>
> The debug message in tcp_retransmit_timer() is slightly wrong, because
> it can be printed even if we did not receive a new ACK packet from the
> remote peer.
>
> Change it to "Probing zero-window", as this is an expected case now. The
> description may not be correct.
>
> Also add the time since the last ACK we received and the duration of the
> retransmission, both of which are useful for debugging.
>
> The message now looks like this:
>
> Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 209ms ago, lasting 209ms
> Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 404ms ago, lasting 408ms
> Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 812ms ago, lasting 1224ms

Reviewed-by: Eric Dumazet <[email protected]>

2023-08-11 08:31:20

by Eric Dumazet

Subject: Re: [PATCH net-next v4 0/4] net: tcp: support probing OOM

On Fri, Aug 11, 2023 at 5:01 AM <[email protected]> wrote:
>
> From: Menglong Dong <[email protected]>
>
> This series makes some small changes so that TCP retransmissions become
> zero-window probes when the receiver drops the skb because of memory
> pressure.
>
> In the 1st patch, we reply with a zero-window ACK if the skb is dropped
> because we are out of memory, instead of dropping the skb silently.
>
> In the 2nd patch, we allow a zero-window ACK to update the window.
>
> In the 3rd patch, we fix the unexpected socket death when snd_wnd is 0 in
> tcp_retransmit_timer().
>
> In the 4th patch, we refactor the debug message in tcp_retransmit_timer()
> to make it more accurate.
>
> After these changes, TCP can keep probing an out-of-memory receiver forever.
>
> Changes since v3:
> - make the timeout "2 * TCP_RTO_MAX" in the 3rd patch
> - tp->retrans_stamp is not based on jiffies and can't be compared with
> icsk->icsk_timeout in the 3rd patch. Fix it.
> - introduce the 4th patch
>
> Changes since v2:
> - refactor the code to avoid code duplication in the 1st patch
> - use after() instead of max() in tcp_rtx_probe0_timed_out()
>
> Changes since v1:
> - send 0 rwin ACK for the receive queue empty case when necessary in the
> 1st patch
> - send the ACK immediately by using the ICSK_ACK_NOW flag in the 1st
> patch
> - consider the case of the connection restarting from idle, as Neal
> commented, in the 3rd patch

SGTM, thanks.

Reviewed-by: Eric Dumazet <[email protected]>

2023-08-11 08:33:43

by Eric Dumazet

Subject: Re: [PATCH net-next v4 2/4] net: tcp: allow zero-window ACK update the window

On Fri, Aug 11, 2023 at 5:01 AM <[email protected]> wrote:
>
> From: Menglong Dong <[email protected]>
>
> For now, an ACK can update the window in the following cases, according
> to tcp_may_update_window():
>
> 1. the ACK acknowledges new data
> 2. the ACK has new data
> 3. the ACK expands the window and its seq is valid
>
> Now, we also allow the ACK to update the window if the new window is 0 and
> its seq/ack is valid. This covers the case where the receiver replies with
> a zero-window ACK when it is under memory stress and can't queue the new
> data.
>
> Signed-off-by: Menglong Dong <[email protected]>

Reviewed-by: Eric Dumazet <[email protected]>

2023-08-11 09:09:26

by Eric Dumazet

Subject: Re: [PATCH net-next v4 1/4] net: tcp: send zero-window ACK when no memory

On Fri, Aug 11, 2023 at 5:01 AM <[email protected]> wrote:
>
> From: Menglong Dong <[email protected]>
>
> For now, the skb is dropped when there is no memory, which makes the
> client keep retransmitting until it times out. This is not friendly to
> users.
>
> In this patch, we reply with a zero-window ACK in this case to update the
> sender's snd_wnd to 0. Therefore, the sender won't time out the
> connection and will probe the zero window with the retransmits.
>
> Signed-off-by: Menglong Dong <[email protected]>
> ---

Reviewed-by: Eric Dumazet <[email protected]>

2023-08-11 09:32:26

by Eric Dumazet

Subject: Re: [PATCH net-next v4 3/4] net: tcp: fix unexcepted socket die when snd_wnd is 0

On Fri, Aug 11, 2023 at 5:01 AM <[email protected]> wrote:
>
> From: Menglong Dong <[email protected]>
>
> In tcp_retransmit_timer(), a connection whose window has shrunk is
> regarded as timed out if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'.
> This is not always right.
>
> The retransmits become zero-window probes in tcp_retransmit_timer() if
> 'snd_wnd == 0'. Therefore, icsk->icsk_rto will reach TCP_RTO_MAX sooner
> or later.
>
> However, the timer can be delayed: in my testing it fired after 122877ms
> rather than after exactly TCP_RTO_MAX.
>
> Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> once the RTO reaches TCP_RTO_MAX, and the socket will die.
>
> Fix this by replacing 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> which is exactly the timestamp of the timeout.
>
> However, the connection can restart from idle, in which case
> tp->rcv_tstamp could already be a long time (minutes or hours) in the
> past even on the first RTO. So we double-check the timeout against the
> duration of the retransmission.
>
> Meanwhile, use "2 * TCP_RTO_MAX" as the timeout to avoid the socket
> dying too soon.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/
> Signed-off-by: Menglong Dong <[email protected]>

Reviewed-by: Eric Dumazet <[email protected]>

2023-08-13 12:54:23

by patchwork-bot+netdevbpf

Subject: Re: [PATCH net-next v4 0/4] net: tcp: support probing OOM

Hello:

This series was applied to netdev/net-next.git (main)
by David S. Miller <[email protected]>:

On Fri, 11 Aug 2023 10:55:26 +0800 you wrote:
> From: Menglong Dong <[email protected]>
>
> This series makes some small changes so that TCP retransmissions become
> zero-window probes when the receiver drops the skb because of memory
> pressure.
>
> In the 1st patch, we reply with a zero-window ACK if the skb is dropped
> because we are out of memory, instead of dropping the skb silently.
>
> [...]

Here is the summary with links:
- [net-next,v4,1/4] net: tcp: send zero-window ACK when no memory
https://git.kernel.org/netdev/net-next/c/e2142825c120
- [net-next,v4,2/4] net: tcp: allow zero-window ACK update the window
https://git.kernel.org/netdev/net-next/c/800a666141de
- [net-next,v4,3/4] net: tcp: fix unexcepted socket die when snd_wnd is 0
https://git.kernel.org/netdev/net-next/c/e89688e3e978
- [net-next,v4,4/4] net: tcp: refactor the dbg message in tcp_retransmit_timer()
https://git.kernel.org/netdev/net-next/c/031c44b7527a

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html