LinuxLists.cc - [PATCH net-next 0/3] net: tcp: add support of window shrink

2023-05-17 12:47:43

by Menglong Dong

[permalink] [raw]

Subject: [PATCH net-next 0/3] net: tcp: add support of window shrink

From: Menglong Dong <[email protected]>

For now, skb will be dropped when no memory, which makes client keep
retrans util timeout and it's not friendly to the users.

Therefore, now we force to receive one packet on current socket when
the protocol memory is out of the limitation. Then, this socket will
stay in 'no mem' status, util protocol memory is available.

When a socket is in 'no mem' status, it's receive window will become
0, which means window shrink happens. For the sender, it need to
cover this case, and we turn it into zero-window probe status.

In the origin logic, 0 probe is triggered only when there is no any
data in the retrans queue and the receive window can't hold the data
of the 1th packet in the send queue.

Now, let's change it and trigger the 0 probe in such cases:

- if the retrans queue has data and the 1th packet in it is not within
the receive window, which is for window shrink case, as the shrinked
window may can't cover the data in retrans queue.
- no data in the retrans queue and the 1th packet in the send queue is
out of the end of the receive window

And the sysctl 'tcp_wnd_shrink' is also introduced. In order to keep
safe, we disable this feature by default.

*** BLURB HERE ***

Menglong Dong (3):
net: tcp: add sysctl for controling tcp window shrink
net: tcp: send zero-window when no memory
net: tcp: handle window shrink properly

include/net/sock.h | 1 +
include/net/tcp.h | 22 ++++++++++++++++
net/ipv4/sysctl_net_ipv4.c | 9 +++++++
net/ipv4/tcp.c | 3 +++
net/ipv4/tcp_input.c | 53 ++++++++++++++++++++++++++++++++++++++
net/ipv4/tcp_output.c | 10 +++++--
net/ipv4/tcp_timer.c | 4 +--
7 files changed, 97 insertions(+), 5 deletions(-)

--
2.40.1

2023-05-17 13:00:43

by Menglong Dong

[permalink] [raw]

Subject: [PATCH net-next 1/3] net: tcp: add sysctl for controling tcp window shrink

From: Menglong Dong <[email protected]>

Introduce the sysctl 'tcp_wnd_shrink', which will be used in the
following patches.

Signed-off-by: Menglong Dong <[email protected]>
---
include/net/tcp.h | 1 +
net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
net/ipv4/tcp.c | 3 +++
3 files changed, 13 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a0a91a988272..a6cf6d823e34 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -247,6 +247,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
/* sysctl variables for tcp */
extern int sysctl_tcp_max_orphans;
extern long sysctl_tcp_mem[3];
+extern int sysctl_tcp_wnd_shrink;

#define TCP_RACK_LOSS_DETECTION 0x1 /* Use RACK to detect losses */
#define TCP_RACK_STATIC_REO_WND 0x2 /* Use static RACK reo wnd */
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0d0cc4ef2b85..fd6cb5a5c2b9 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -577,6 +577,15 @@ static struct ctl_table ipv4_table[] = {
.extra1 = &sysctl_fib_sync_mem_min,
.extra2 = &sysctl_fib_sync_mem_max,
},
+ {
+ .procname = "tcp_wnd_shrink",
+ .data = &sysctl_tcp_wnd_shrink,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE
+ },
{ }
};

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index fd68d49490f2..db0483b2159f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -297,6 +297,9 @@ EXPORT_SYMBOL(tcp_memory_allocated);
DEFINE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);
EXPORT_PER_CPU_SYMBOL_GPL(tcp_memory_per_cpu_fw_alloc);

+int sysctl_tcp_wnd_shrink __read_mostly;
+EXPORT_SYMBOL(sysctl_tcp_wnd_shrink);
+
#if IS_ENABLED(CONFIG_SMC)
DEFINE_STATIC_KEY_FALSE(tcp_have_smc);
EXPORT_SYMBOL(tcp_have_smc);
--
2.40.1

2023-05-17 13:07:28

by Menglong Dong

[permalink] [raw]

Subject: [PATCH net-next 2/3] net: tcp: send zero-window when no memory

From: Menglong Dong <[email protected]>

For now, skb will be dropped when no memory, which makes client keep
retrans util timeout and it's not friendly to the users.

Therefore, now we force to receive one packet on current socket when
the protocol memory is out of the limitation. Then, this socket will
stay in 'no mem' status, util protocol memory is available.

When a socket is in 'no mem' status, it's receive window will become
0, which means window shrink happens. And the sender need to handle
such window shrink properly, which is done in the next commit.

Signed-off-by: Menglong Dong <[email protected]>
---
include/net/sock.h | 1 +
net/ipv4/tcp_input.c | 12 ++++++++++++
net/ipv4/tcp_output.c | 7 +++++++
3 files changed, 20 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5edf0038867c..90db8a1d7f31 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -957,6 +957,7 @@ enum sock_flags {
SOCK_XDP, /* XDP is attached */
SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
SOCK_RCVMARK, /* Receive SO_MARK ancillary data with packet */
+ SOCK_NO_MEM, /* protocol memory limitation happened */
};

#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a057330d6f59..56e395cb4554 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5047,10 +5047,22 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
if (skb_queue_len(&sk->sk_receive_queue) == 0)
sk_forced_mem_schedule(sk, skb->truesize);
else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
+ if (sysctl_tcp_wnd_shrink)
+ goto do_wnd_shrink;
+
reason = SKB_DROP_REASON_PROTO_MEM;
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
sk->sk_data_ready(sk);
goto drop;
+do_wnd_shrink:
+ if (sock_flag(sk, SOCK_NO_MEM)) {
+ NET_INC_STATS(sock_net(sk),
+ LINUX_MIB_TCPRCVQDROP);
+ sk->sk_data_ready(sk);
+ goto out_of_window;
+ }
+ sk_forced_mem_schedule(sk, skb->truesize);
+ sock_set_flag(sk, SOCK_NO_MEM);
}

eaten = tcp_queue_rcv(sk, skb, &fragstolen);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cfe128b81a01..21dc4f7e0a12 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -300,6 +300,13 @@ static u16 tcp_select_window(struct sock *sk)
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFROMZEROWINDOWADV);
}

+ if (sock_flag(sk, SOCK_NO_MEM)) {
+ if (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 2))
+ sock_reset_flag(sk, SOCK_NO_MEM);
+ else
+ new_win = 0;
+ }
+
return new_win;
}

--
2.40.1

2023-05-17 13:10:17

by Menglong Dong

[permalink] [raw]

Subject: [PATCH net-next 3/3] net: tcp: handle window shrink properly

From: Menglong Dong <[email protected]>

Window shrink is not allowed and also not handled for now, but it's
needed in some case.

In the origin logic, 0 probe is triggered only when there is no any
data in the retrans queue and the receive window can't hold the data
of the 1th packet in the send queue.

Now, let's change it and trigger the 0 probe in such cases:

- if the retrans queue has data and the 1th packet in it is not within
the receive window
- no data in the retrans queue and the 1th packet in the send queue is
out of the end of the receive window

Signed-off-by: Menglong Dong <[email protected]>
---
include/net/tcp.h | 21 +++++++++++++++++++++
net/ipv4/tcp_input.c | 41 +++++++++++++++++++++++++++++++++++++++++
net/ipv4/tcp_output.c | 3 +--
net/ipv4/tcp_timer.c | 4 +---
4 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a6cf6d823e34..9625d0bf00e1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1910,6 +1910,27 @@ static inline void tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb
tcp_chrono_start(sk, TCP_CHRONO_BUSY);
}

+static inline bool tcp_rtx_overflow(const struct sock *sk)
+{
+ struct sk_buff *rtx_head = tcp_rtx_queue_head(sk);
+
+ return rtx_head && after(TCP_SKB_CB(rtx_head)->end_seq,
+ tcp_wnd_end(tcp_sk(sk)));
+}
+
+static inline bool tcp_probe0_needed(const struct sock *sk)
+{
+ /* for the normal case */
+ if (!tcp_sk(sk)->packets_out && !tcp_write_queue_empty(sk))
+ return true;
+
+ if (!sysctl_tcp_wnd_shrink)
+ return false;
+
+ /* for the window shrink case */
+ return tcp_rtx_overflow(sk);
+}
+
/* Insert new before skb on the write queue of sk. */
static inline void tcp_insert_write_queue_before(struct sk_buff *new,
struct sk_buff *skb,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 56e395cb4554..a9ac295502ee 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3188,6 +3188,14 @@ void tcp_rearm_rto(struct sock *sk)
/* Try to schedule a loss probe; if that doesn't work, then schedule an RTO. */
static void tcp_set_xmit_timer(struct sock *sk)
{
+ /* Check if we are already in probe0 state, which means it's
+ * not needed to schedule the RTO. The normal probe0 can't reach
+ * here, so it must be window-shrink probe0 case here.
+ */
+ if (unlikely(inet_csk(sk)->icsk_pending == ICSK_TIME_PROBE0) &&
+ sysctl_tcp_wnd_shrink)
+ return;
+
if (!tcp_schedule_loss_probe(sk, true))
tcp_rearm_rto(sk);
}
@@ -3465,6 +3473,38 @@ static void tcp_ack_probe(struct sock *sk)
}
}

+/**
+ * This function is called only when there are packets in the rtx queue,
+ * which means that the packets out is not 0.
+ *
+ * NOTE: we only handle window shrink case in this part.
+ */
+static void tcp_ack_probe_shrink(struct sock *sk)
+{
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ unsigned long when;
+
+ if (!sysctl_tcp_wnd_shrink)
+ return;
+
+ if (tcp_rtx_overflow(sk)) {
+ when = tcp_probe0_when(sk, TCP_RTO_MAX);
+
+ when = tcp_clamp_probe0_to_user_timeout(sk, when);
+ tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0, when, TCP_RTO_MAX);
+ } else {
+ /* check if recover from window shrink */
+ if (icsk->icsk_pending != ICSK_TIME_PROBE0)
+ return;
+
+ icsk->icsk_backoff = 0;
+ icsk->icsk_probes_tstamp = 0;
+ inet_csk_clear_xmit_timer(sk, ICSK_TIME_PROBE0);
+ if (!tcp_rtx_queue_empty(sk))
+ tcp_retransmit_timer(sk);
+ }
+}
+
static inline bool tcp_ack_is_dubious(const struct sock *sk, const int flag)
{
return !(flag & FLAG_NOT_DUP) || (flag & FLAG_CA_ALERT) ||
@@ -3908,6 +3948,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
sk_dst_confirm(sk);

+ tcp_ack_probe_shrink(sk);
delivered = tcp_newly_delivered(sk, delivered, flag);
lost = tp->lost - lost; /* freshly marked lost */
rs.is_ack_delayed = !!(flag & FLAG_ACK_MAYBE_DELAYED);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 21dc4f7e0a12..eac0532edb61 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4089,14 +4089,13 @@ int tcp_write_wakeup(struct sock *sk, int mib)
void tcp_send_probe0(struct sock *sk)
{
struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
struct net *net = sock_net(sk);
unsigned long timeout;
int err;

err = tcp_write_wakeup(sk, LINUX_MIB_TCPWINPROBE);

- if (tp->packets_out || tcp_write_queue_empty(sk)) {
+ if (!tcp_probe0_needed(sk)) {
/* Cancel probe timer, if it is not required. */
icsk->icsk_probes_out = 0;
icsk->icsk_backoff = 0;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index b839c2f91292..a28606291b7e 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -350,11 +350,9 @@ static void tcp_delack_timer(struct timer_list *t)
static void tcp_probe_timer(struct sock *sk)
{
struct inet_connection_sock *icsk = inet_csk(sk);
- struct sk_buff *skb = tcp_send_head(sk);
- struct tcp_sock *tp = tcp_sk(sk);
int max_probes;

- if (tp->packets_out || !skb) {
+ if (!tcp_probe0_needed(sk)) {
icsk->icsk_probes_out = 0;
icsk->icsk_probes_tstamp = 0;
return;
--
2.40.1

2023-05-17 14:53:27

by Eric Dumazet

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
>
> From: Menglong Dong <[email protected]>
>
> Window shrink is not allowed and also not handled for now, but it's
> needed in some case.
>
> In the origin logic, 0 probe is triggered only when there is no any
> data in the retrans queue and the receive window can't hold the data
> of the 1th packet in the send queue.
>
> Now, let's change it and trigger the 0 probe in such cases:
>
> - if the retrans queue has data and the 1th packet in it is not within
> the receive window
> - no data in the retrans queue and the 1th packet in the send queue is
> out of the end of the receive window

Sorry, I do not understand.

Please provide packetdrill tests for new behavior like that.

Also, such fundamental change would need IETF discussion first.
We do not want linux to cause network collapses just because billions
of devices send more zero probes.

2023-05-17 14:54:26

by Eric Dumazet

[permalink] [raw]

Subject: Re: [PATCH net-next 2/3] net: tcp: send zero-window when no memory

On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
>
> From: Menglong Dong <[email protected]>
>
> For now, skb will be dropped when no memory, which makes client keep
> retrans util timeout and it's not friendly to the users.

Yes, networking needs memory. Trying to deny it is recipe for OOM.

>
> Therefore, now we force to receive one packet on current socket when
> the protocol memory is out of the limitation. Then, this socket will
> stay in 'no mem' status, util protocol memory is available.
>

I think you missed one old patch.

commit ba3bb0e76ccd464bb66665a1941fabe55dadb3ba tcp: fix
SO_RCVLOWAT possible hangs under high mem pressure

> When a socket is in 'no mem' status, it's receive window will become
> 0, which means window shrink happens. And the sender need to handle
> such window shrink properly, which is done in the next commit.
>
> Signed-off-by: Menglong Dong <[email protected]>
> ---
> include/net/sock.h | 1 +
> net/ipv4/tcp_input.c | 12 ++++++++++++
> net/ipv4/tcp_output.c | 7 +++++++
> 3 files changed, 20 insertions(+)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 5edf0038867c..90db8a1d7f31 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -957,6 +957,7 @@ enum sock_flags {
> SOCK_XDP, /* XDP is attached */
> SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> SOCK_RCVMARK, /* Receive SO_MARK ancillary data with packet */
> + SOCK_NO_MEM, /* protocol memory limitation happened */
> };
>
> #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a057330d6f59..56e395cb4554 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5047,10 +5047,22 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
> if (skb_queue_len(&sk->sk_receive_queue) == 0)
> sk_forced_mem_schedule(sk, skb->truesize);

I think you missed this part : We accept at least one packet,
regardless of memory pressure,
if the queue is empty.

So your changelog is misleading.

> else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
> + if (sysctl_tcp_wnd_shrink)

We no longer add global sysctls for TCP. All new sysctls must per net-ns.

> + goto do_wnd_shrink;
> +
> reason = SKB_DROP_REASON_PROTO_MEM;
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
> sk->sk_data_ready(sk);
> goto drop;
> +do_wnd_shrink:
> + if (sock_flag(sk, SOCK_NO_MEM)) {
> + NET_INC_STATS(sock_net(sk),
> + LINUX_MIB_TCPRCVQDROP);
> + sk->sk_data_ready(sk);
> + goto out_of_window;
> + }
> + sk_forced_mem_schedule(sk, skb->truesize);

So now we would accept two packets per TCP socket, and yet EPOLLIN
will not be sent in time ?

packets can consume about 45*4K each, I do not think it is wise to
double receive queue sizes.

What you want instead is simply to send EPOLLIN sooner (when the first
packet is queued instead when the second packet is dropped)
by changing sk_forced_mem_schedule() a bit.

This might matter for applications using SO_RCVLOWAT, but not for
other applications.

2023-05-17 17:00:37

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url: https://github.com/intel-lab-lkp/linux/commits/menglong8-dong-gmail-com/net-tcp-add-sysctl-for-controling-tcp-window-shrink/20230517-204436
base: net-next/main
patch link: https://lore.kernel.org/r/20230517124201.441634-4-imagedong%40tencent.com
patch subject: [PATCH net-next 3/3] net: tcp: handle window shrink properly
config: m68k-allyesconfig
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/3712ab843c151a10d6570ded9d8c77ad831a7a41
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review menglong8-dong-gmail-com/net-tcp-add-sysctl-for-controling-tcp-window-shrink/20230517-204436
git checkout 3712ab843c151a10d6570ded9d8c77ad831a7a41
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash net/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> net/ipv4/tcp_input.c:3477: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* This function is called only when there are packets in the rtx queue,

vim +3477 net/ipv4/tcp_input.c

3475
3476 /**
> 3477 * This function is called only when there are packets in the rtx queue,
3478 * which means that the packets out is not 0.
3479 *
3480 * NOTE: we only handle window shrink case in this part.
3481 */
3482 static void tcp_ack_probe_shrink(struct sock *sk)
3483 {
3484 struct inet_connection_sock *icsk = inet_csk(sk);
3485 unsigned long when;
3486
3487 if (!sysctl_tcp_wnd_shrink)
3488 return;
3489
3490 if (tcp_rtx_overflow(sk)) {
3491 when = tcp_probe0_when(sk, TCP_RTO_MAX);
3492
3493 when = tcp_clamp_probe0_to_user_timeout(sk, when);
3494 tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0, when, TCP_RTO_MAX);
3495 } else {
3496 /* check if recover from window shrink */
3497 if (icsk->icsk_pending != ICSK_TIME_PROBE0)
3498 return;
3499
3500 icsk->icsk_backoff = 0;
3501 icsk->icsk_probes_tstamp = 0;
3502 inet_csk_clear_xmit_timer(sk, ICSK_TIME_PROBE0);
3503 if (!tcp_rtx_queue_empty(sk))
3504 tcp_retransmit_timer(sk);
3505 }
3506 }
3507

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

Attachments:

(No filename) (3.00 kB)
config (295.61 kB)
Download all attachments

2023-05-18 02:19:18

by Menglong Dong

[permalink] [raw]

Subject: Re: [PATCH net-next 2/3] net: tcp: send zero-window when no memory

On Wed, May 17, 2023 at 10:45 PM Eric Dumazet <[email protected]> wrote:
>
> On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> >
> > From: Menglong Dong <[email protected]>
> >
> > For now, skb will be dropped when no memory, which makes client keep
> > retrans util timeout and it's not friendly to the users.
>
> Yes, networking needs memory. Trying to deny it is recipe for OOM.
>
> >
> > Therefore, now we force to receive one packet on current socket when
> > the protocol memory is out of the limitation. Then, this socket will
> > stay in 'no mem' status, util protocol memory is available.
> >
>
> I think you missed one old patch.
>
> commit ba3bb0e76ccd464bb66665a1941fabe55dadb3ba tcp: fix
> SO_RCVLOWAT possible hangs under high mem pressure
>
>
>
> > When a socket is in 'no mem' status, it's receive window will become
> > 0, which means window shrink happens. And the sender need to handle
> > such window shrink properly, which is done in the next commit.
> >
> > Signed-off-by: Menglong Dong <[email protected]>
> > ---
> > include/net/sock.h | 1 +
> > net/ipv4/tcp_input.c | 12 ++++++++++++
> > net/ipv4/tcp_output.c | 7 +++++++
> > 3 files changed, 20 insertions(+)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 5edf0038867c..90db8a1d7f31 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -957,6 +957,7 @@ enum sock_flags {
> > SOCK_XDP, /* XDP is attached */
> > SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> > SOCK_RCVMARK, /* Receive SO_MARK ancillary data with packet */
> > + SOCK_NO_MEM, /* protocol memory limitation happened */
> > };
> >
> > #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index a057330d6f59..56e395cb4554 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5047,10 +5047,22 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
> > if (skb_queue_len(&sk->sk_receive_queue) == 0)
> > sk_forced_mem_schedule(sk, skb->truesize);
>
> I think you missed this part : We accept at least one packet,
> regardless of memory pressure,
> if the queue is empty.
>
> So your changelog is misleading.

Sorry that I didn't describe the problem clearly enough. The problem is
for two cases.

Case 1:

tcp_mem[2] limitation causes packet drop. In some cases, applications
may not read the data in the socket receiving queue quickly enough.
In my case, it will call recv() every 5 minutes. And there are a lot of such
sockets. tcp_mem[2] limitation can happen easily in such a case, and once
this happens, skb will be dropped (the receive queue is not empty) and
the send retrans the skb until timeout and the connection break.

Case 2:

The sender keeps sending small packets and makes the rec_buf full.
Meanwhile, the window is not zero, and the sender will keep retrans
until timeout, as the skb is dropped by the receiver.

>
> > else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
> > + if (sysctl_tcp_wnd_shrink)
>
> We no longer add global sysctls for TCP. All new sysctls must per net-ns.
>
> > + goto do_wnd_shrink;
> > +
> > reason = SKB_DROP_REASON_PROTO_MEM;
> > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
> > sk->sk_data_ready(sk);
> > goto drop;
> > +do_wnd_shrink:
> > + if (sock_flag(sk, SOCK_NO_MEM)) {
> > + NET_INC_STATS(sock_net(sk),
> > + LINUX_MIB_TCPRCVQDROP);
> > + sk->sk_data_ready(sk);
> > + goto out_of_window;
> > + }
> > + sk_forced_mem_schedule(sk, skb->truesize);
>
> So now we would accept two packets per TCP socket, and yet EPOLLIN
> will not be sent in time ?
>
> packets can consume about 45*4K each, I do not think it is wise to
> double receive queue sizes.
>

What we want to do here is to send a ack with zero window. It
may be not necessary to force receive new data here, but to stay
the same with the logic of 'tcp_may_update_window()', only
newer 'ack' in a ack packet can shrink the window.

If we don't receive new data and send a zero-window ack directly
here, it will be weird, as the previous ack with the same 'seq' and 'ack'
has non-zero window.

Thanks!
Menglong Dong

> What you want instead is simply to send EPOLLIN sooner (when the first
> packet is queued instead when the second packet is dropped)
> by changing sk_forced_mem_schedule() a bit.
>
> This might matter for applications using SO_RCVLOWAT, but not for
> other applications.

2023-05-18 02:46:28

by Menglong Dong

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <[email protected]> wrote:
>
> On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> >
> > From: Menglong Dong <[email protected]>
> >
> > Window shrink is not allowed and also not handled for now, but it's
> > needed in some case.
> >
> > In the origin logic, 0 probe is triggered only when there is no any
> > data in the retrans queue and the receive window can't hold the data
> > of the 1th packet in the send queue.
> >
> > Now, let's change it and trigger the 0 probe in such cases:
> >
> > - if the retrans queue has data and the 1th packet in it is not within
> > the receive window
> > - no data in the retrans queue and the 1th packet in the send queue is
> > out of the end of the receive window
>
> Sorry, I do not understand.
>
> Please provide packetdrill tests for new behavior like that.
>

Yes. The problem can be reproduced easily.

1. choose a server machine, decrease it's tcp_mem with:
echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
2. call listen() and accept() on a port, such as 8888. We call
accept() looply and without call recv() to make the data stay
in the receive queue.
3. choose a client machine, and create 100 TCP connection
to the 8888 port of the server. Then, every connection sends
data about 1M.
4. we can see that some of the connection enter the 0-probe
state, but some of them keep retrans again and again. As
the server is up to the tcp_mem[2] and skb is dropped before
the recv_buf full and the connection enter 0-probe state.
Finially, some of these connection will timeout and break.

With this series, all the 100 connections will enter 0-probe
status and connection break won't happen. And the data
trans will recover if we increase tcp_mem or call 'recv()'
on the sockets in the server.

> Also, such fundamental change would need IETF discussion first.
> We do not want linux to cause network collapses just because billions
> of devices send more zero probes.

I think it maybe a good idea to make the connection enter
0-probe, rather than drop the skb silently. What 0-probe
meaning is to wait for space available when the buffer of the
receive queue is full. And maybe we can also use 0-probe
when the "buffer" of "TCP protocol" (which means tcp_mem)
is full?

Am I right?

Thanks!
Menglong Dong

2023-05-18 14:10:52

by Neal Cardwell

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Wed, May 17, 2023 at 10:35 PM Menglong Dong <[email protected]> wrote:
>
> On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <[email protected]> wrote:
> >
> > On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> > >
> > > From: Menglong Dong <[email protected]>
> > >
> > > Window shrink is not allowed and also not handled for now, but it's
> > > needed in some case.
> > >
> > > In the origin logic, 0 probe is triggered only when there is no any
> > > data in the retrans queue and the receive window can't hold the data
> > > of the 1th packet in the send queue.
> > >
> > > Now, let's change it and trigger the 0 probe in such cases:
> > >
> > > - if the retrans queue has data and the 1th packet in it is not within
> > > the receive window
> > > - no data in the retrans queue and the 1th packet in the send queue is
> > > out of the end of the receive window
> >
> > Sorry, I do not understand.
> >
> > Please provide packetdrill tests for new behavior like that.
> >
>
> Yes. The problem can be reproduced easily.
>
> 1. choose a server machine, decrease it's tcp_mem with:
> echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> 2. call listen() and accept() on a port, such as 8888. We call
> accept() looply and without call recv() to make the data stay
> in the receive queue.
> 3. choose a client machine, and create 100 TCP connection
> to the 8888 port of the server. Then, every connection sends
> data about 1M.
> 4. we can see that some of the connection enter the 0-probe
> state, but some of them keep retrans again and again. As
> the server is up to the tcp_mem[2] and skb is dropped before
> the recv_buf full and the connection enter 0-probe state.
> Finially, some of these connection will timeout and break.
>
> With this series, all the 100 connections will enter 0-probe
> status and connection break won't happen. And the data
> trans will recover if we increase tcp_mem or call 'recv()'
> on the sockets in the server.
>
> > Also, such fundamental change would need IETF discussion first.
> > We do not want linux to cause network collapses just because billions
> > of devices send more zero probes.
>
> I think it maybe a good idea to make the connection enter
> 0-probe, rather than drop the skb silently. What 0-probe
> meaning is to wait for space available when the buffer of the
> receive queue is full. And maybe we can also use 0-probe
> when the "buffer" of "TCP protocol" (which means tcp_mem)
> is full?
>
> Am I right?
>
> Thanks!
> Menglong Dong

Thanks for describing the scenario in more detail. (Some kind of
packetdrill script or other program to reproduce this issue would be
nice, too, as Eric noted.)

You mention in step (4.) above that some of the connections keep
retransmitting again and again. Are those connections receiving any
ACKs in response to their retransmissions? Perhaps they are receiving
dupacks? If so, then perhaps we could solve this problem without
depending on a violation of the TCP spec (which says the receive
window should not be retracted) in the following way: when a data
sender suffers a retransmission timeout, and retransmits the first
unacknowledged segment, and receives a dupack for SND.UNA instead of
an ACK covering the RTO-retransmitted segment, then the data sender
should estimate that the receiver doesn't have enough memory to buffer
the retransmitted packet. In that case, the data sender should enter
the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
call tcp_probe_timer().

Basically we could try to enhance the sender-side logic to try to
distinguish between two kinds of problems:

(a) Repeated data packet loss caused by congestion, routing problems,
or connectivity problems. In this case, the data sender uses
ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
retries sysctl_tcp_retries2 times before timing out the connection

(b) A receiver that is repeatedly sending dupacks but not ACKing
retransmitted data because it doesn't have any memory. In this case,
the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
off but keeps retrying as long as the data sender receives ACKs.

AFAICT that would be another way to reach the happy state you mention:
"all the 100 connections will enter 0-probe status and connection
break won't happen", and we could reach that state without violating
the TCP protocol spec and without requiring changes on the receiver
side (so that this fix could help in scenarios where the
memory-constrained receiver is an older stack without special new
behavior).

Eric, Yuchung, Menglong: do you think something like that would work?

neal

2023-05-18 14:41:09

by Menglong Dong

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Thu, May 18, 2023 at 9:40 PM Neal Cardwell <[email protected]> wrote:
>
> On Wed, May 17, 2023 at 10:35 PM Menglong Dong <[email protected]> wrote:
> >
> > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> > > >
> > > > From: Menglong Dong <[email protected]>
> > > >
> > > > Window shrink is not allowed and also not handled for now, but it's
> > > > needed in some case.
> > > >
> > > > In the origin logic, 0 probe is triggered only when there is no any
> > > > data in the retrans queue and the receive window can't hold the data
> > > > of the 1th packet in the send queue.
> > > >
> > > > Now, let's change it and trigger the 0 probe in such cases:
> > > >
> > > > - if the retrans queue has data and the 1th packet in it is not within
> > > > the receive window
> > > > - no data in the retrans queue and the 1th packet in the send queue is
> > > > out of the end of the receive window
> > >
> > > Sorry, I do not understand.
> > >
> > > Please provide packetdrill tests for new behavior like that.
> > >
> >
> > Yes. The problem can be reproduced easily.
> >
> > 1. choose a server machine, decrease it's tcp_mem with:
> > echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > 2. call listen() and accept() on a port, such as 8888. We call
> > accept() looply and without call recv() to make the data stay
> > in the receive queue.
> > 3. choose a client machine, and create 100 TCP connection
> > to the 8888 port of the server. Then, every connection sends
> > data about 1M.
> > 4. we can see that some of the connection enter the 0-probe
> > state, but some of them keep retrans again and again. As
> > the server is up to the tcp_mem[2] and skb is dropped before
> > the recv_buf full and the connection enter 0-probe state.
> > Finially, some of these connection will timeout and break.
> >
> > With this series, all the 100 connections will enter 0-probe
> > status and connection break won't happen. And the data
> > trans will recover if we increase tcp_mem or call 'recv()'
> > on the sockets in the server.
> >
> > > Also, such fundamental change would need IETF discussion first.
> > > We do not want linux to cause network collapses just because billions
> > > of devices send more zero probes.
> >
> > I think it maybe a good idea to make the connection enter
> > 0-probe, rather than drop the skb silently. What 0-probe
> > meaning is to wait for space available when the buffer of the
> > receive queue is full. And maybe we can also use 0-probe
> > when the "buffer" of "TCP protocol" (which means tcp_mem)
> > is full?
> >
> > Am I right?
> >
> > Thanks!
> > Menglong Dong
>
> Thanks for describing the scenario in more detail. (Some kind of
> packetdrill script or other program to reproduce this issue would be
> nice, too, as Eric noted.)
>
> You mention in step (4.) above that some of the connections keep
> retransmitting again and again. Are those connections receiving any
> ACKs in response to their retransmissions? Perhaps they are receiving
> dupacks?

Actually, these packets are dropped without any reply, even dupacks.
skb will be dropped directly when tcp_try_rmem_schedule()
fails in tcp_data_queue(). That's reasonable, as it's
useless to reply a ack to the sender, which will cause the sender
fast retrans the packet, because we are out of memory now, and
retrans can't solve the problem.

> If so, then perhaps we could solve this problem without
> depending on a violation of the TCP spec (which says the receive
> window should not be retracted) in the following way: when a data
> sender suffers a retransmission timeout, and retransmits the first
> unacknowledged segment, and receives a dupack for SND.UNA instead of
> an ACK covering the RTO-retransmitted segment, then the data sender
> should estimate that the receiver doesn't have enough memory to buffer
> the retransmitted packet. In that case, the data sender should enter
> the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
> call tcp_probe_timer().
>
> Basically we could try to enhance the sender-side logic to try to
> distinguish between two kinds of problems:
>
> (a) Repeated data packet loss caused by congestion, routing problems,
> or connectivity problems. In this case, the data sender uses
> ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> retries sysctl_tcp_retries2 times before timing out the connection
>
> (b) A receiver that is repeatedly sending dupacks but not ACKing
> retransmitted data because it doesn't have any memory. In this case,
> the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
> off but keeps retrying as long as the data sender receives ACKs.
>

I'm not sure if this is an ideal method, as it may be not rigorous
to conclude that the receiver is oom with dupacks. A packet can
loss can also cause multi dupacks.

Thanks!
Menglong Dong

> AFAICT that would be another way to reach the happy state you mention:
> "all the 100 connections will enter 0-probe status and connection
> break won't happen", and we could reach that state without violating
> the TCP protocol spec and without requiring changes on the receiver
> side (so that this fix could help in scenarios where the
> memory-constrained receiver is an older stack without special new
> behavior).
>
> Eric, Yuchung, Menglong: do you think something like that would work?
>
> neal

2023-05-18 14:57:06

by Menglong Dong

[permalink] [raw]

Subject: Re: [PATCH net-next 2/3] net: tcp: send zero-window when no memory

On Wed, May 17, 2023 at 10:45 PM Eric Dumazet <[email protected]> wrote:
>
> On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> >
> > From: Menglong Dong <[email protected]>
> >
> > For now, skb will be dropped when no memory, which makes client keep
> > retrans util timeout and it's not friendly to the users.
>
> Yes, networking needs memory. Trying to deny it is recipe for OOM.
>
> >
> > Therefore, now we force to receive one packet on current socket when
> > the protocol memory is out of the limitation. Then, this socket will
> > stay in 'no mem' status, util protocol memory is available.
> >
>
> I think you missed one old patch.
>
> commit ba3bb0e76ccd464bb66665a1941fabe55dadb3ba tcp: fix
> SO_RCVLOWAT possible hangs under high mem pressure
>
>
>
> > When a socket is in 'no mem' status, it's receive window will become
> > 0, which means window shrink happens. And the sender need to handle
> > such window shrink properly, which is done in the next commit.
> >
> > Signed-off-by: Menglong Dong <[email protected]>
> > ---
> > include/net/sock.h | 1 +
> > net/ipv4/tcp_input.c | 12 ++++++++++++
> > net/ipv4/tcp_output.c | 7 +++++++
> > 3 files changed, 20 insertions(+)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 5edf0038867c..90db8a1d7f31 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -957,6 +957,7 @@ enum sock_flags {
> > SOCK_XDP, /* XDP is attached */
> > SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> > SOCK_RCVMARK, /* Receive SO_MARK ancillary data with packet */
> > + SOCK_NO_MEM, /* protocol memory limitation happened */
> > };
> >
> > #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index a057330d6f59..56e395cb4554 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5047,10 +5047,22 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
> > if (skb_queue_len(&sk->sk_receive_queue) == 0)
> > sk_forced_mem_schedule(sk, skb->truesize);
>
> I think you missed this part : We accept at least one packet,
> regardless of memory pressure,
> if the queue is empty.
>
> So your changelog is misleading.
>
> > else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
> > + if (sysctl_tcp_wnd_shrink)
>
> We no longer add global sysctls for TCP. All new sysctls must per net-ns.
>
> > + goto do_wnd_shrink;
> > +
> > reason = SKB_DROP_REASON_PROTO_MEM;
> > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
> > sk->sk_data_ready(sk);
> > goto drop;
> > +do_wnd_shrink:
> > + if (sock_flag(sk, SOCK_NO_MEM)) {
> > + NET_INC_STATS(sock_net(sk),
> > + LINUX_MIB_TCPRCVQDROP);
> > + sk->sk_data_ready(sk);
> > + goto out_of_window;
> > + }
> > + sk_forced_mem_schedule(sk, skb->truesize);
>
> So now we would accept two packets per TCP socket, and yet EPOLLIN
> will not be sent in time ?
>
> packets can consume about 45*4K each, I do not think it is wise to
> double receive queue sizes.
>
> What you want instead is simply to send EPOLLIN sooner (when the first
> packet is queued instead when the second packet is dropped)
> by changing sk_forced_mem_schedule() a bit.
>
> This might matter for applications using SO_RCVLOWAT, but not for
> other applications.

To be more clear, what I talk about here is not to send EPOLLIN
sooner, but try to make the TCP connection, which has a "hang"
receiver and in TCP protocol memory pressure, entry 0-probe
state. And this commit is the first step: make the receiver
shrink the window by sending a zero-window ack.

Thanks!
Menglong Dong

2023-05-18 16:10:29

by Neal Cardwell

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Thu, May 18, 2023 at 10:12 AM Menglong Dong <[email protected]> wrote:
>
> On Thu, May 18, 2023 at 9:40 PM Neal Cardwell <[email protected]> wrote:
> >
> > On Wed, May 17, 2023 at 10:35 PM Menglong Dong <[email protected]> wrote:
> > >
> > > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <[email protected]> wrote:
> > > >
> > > > On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> > > > >
> > > > > From: Menglong Dong <[email protected]>
> > > > >
> > > > > Window shrink is not allowed and also not handled for now, but it's
> > > > > needed in some case.
> > > > >
> > > > > In the origin logic, 0 probe is triggered only when there is no any
> > > > > data in the retrans queue and the receive window can't hold the data
> > > > > of the 1th packet in the send queue.
> > > > >
> > > > > Now, let's change it and trigger the 0 probe in such cases:
> > > > >
> > > > > - if the retrans queue has data and the 1th packet in it is not within
> > > > > the receive window
> > > > > - no data in the retrans queue and the 1th packet in the send queue is
> > > > > out of the end of the receive window
> > > >
> > > > Sorry, I do not understand.
> > > >
> > > > Please provide packetdrill tests for new behavior like that.
> > > >
> > >
> > > Yes. The problem can be reproduced easily.
> > >
> > > 1. choose a server machine, decrease it's tcp_mem with:
> > > echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > > 2. call listen() and accept() on a port, such as 8888. We call
> > > accept() looply and without call recv() to make the data stay
> > > in the receive queue.
> > > 3. choose a client machine, and create 100 TCP connection
> > > to the 8888 port of the server. Then, every connection sends
> > > data about 1M.
> > > 4. we can see that some of the connection enter the 0-probe
> > > state, but some of them keep retrans again and again. As
> > > the server is up to the tcp_mem[2] and skb is dropped before
> > > the recv_buf full and the connection enter 0-probe state.
> > > Finially, some of these connection will timeout and break.
> > >
> > > With this series, all the 100 connections will enter 0-probe
> > > status and connection break won't happen. And the data
> > > trans will recover if we increase tcp_mem or call 'recv()'
> > > on the sockets in the server.
> > >
> > > > Also, such fundamental change would need IETF discussion first.
> > > > We do not want linux to cause network collapses just because billions
> > > > of devices send more zero probes.
> > >
> > > I think it maybe a good idea to make the connection enter
> > > 0-probe, rather than drop the skb silently. What 0-probe
> > > meaning is to wait for space available when the buffer of the
> > > receive queue is full. And maybe we can also use 0-probe
> > > when the "buffer" of "TCP protocol" (which means tcp_mem)
> > > is full?
> > >
> > > Am I right?
> > >
> > > Thanks!
> > > Menglong Dong
> >
> > Thanks for describing the scenario in more detail. (Some kind of
> > packetdrill script or other program to reproduce this issue would be
> > nice, too, as Eric noted.)
> >
> > You mention in step (4.) above that some of the connections keep
> > retransmitting again and again. Are those connections receiving any
> > ACKs in response to their retransmissions? Perhaps they are receiving
> > dupacks?
>
> Actually, these packets are dropped without any reply, even dupacks.
> skb will be dropped directly when tcp_try_rmem_schedule()
> fails in tcp_data_queue(). That's reasonable, as it's
> useless to reply a ack to the sender, which will cause the sender
> fast retrans the packet, because we are out of memory now, and
> retrans can't solve the problem.

I'm not sure I see the problem. If retransmits can't solve the
problem, then why are you proposing that data senders keep
retransmitting forever (via 0-window-probes) in this kind of scenario?

A single dupack without SACK blocks will not cause the sender to fast
retransmit. (Only 3 dupacks would trigger fast retransmit.)

Three or more dupacks without SACK blocks will cause the sender to
fast retransmit the segment above SND.UNA once if the sender doesn't
have SACK support. But in this case AFAICT fast-retransmitting once is
a fine strategy, since the sender should keep retrying transmits (with
backoff) until the receiver potentially has memory available to
receive the packet.

>
> > If so, then perhaps we could solve this problem without
> > depending on a violation of the TCP spec (which says the receive
> > window should not be retracted) in the following way: when a data
> > sender suffers a retransmission timeout, and retransmits the first
> > unacknowledged segment, and receives a dupack for SND.UNA instead of
> > an ACK covering the RTO-retransmitted segment, then the data sender
> > should estimate that the receiver doesn't have enough memory to buffer
> > the retransmitted packet. In that case, the data sender should enter
> > the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
> > call tcp_probe_timer().
> >
> > Basically we could try to enhance the sender-side logic to try to
> > distinguish between two kinds of problems:
> >
> > (a) Repeated data packet loss caused by congestion, routing problems,
> > or connectivity problems. In this case, the data sender uses
> > ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> > retries sysctl_tcp_retries2 times before timing out the connection
> >
> > (b) A receiver that is repeatedly sending dupacks but not ACKing
> > retransmitted data because it doesn't have any memory. In this case,
> > the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
> > off but keeps retrying as long as the data sender receives ACKs.
> >
>
> I'm not sure if this is an ideal method, as it may be not rigorous
> to conclude that the receiver is oom with dupacks. A packet can
> loss can also cause multi dupacks.

When a data sender suffers an RTO and retransmits a single data
packet, it would be very rare for the data sender to receive multiple
pure dupacks without SACKs. This would only happen in the rare case
where (a) the connection did not have SACK enabled, and (b) there was
a hole in the received sequence space and there were still packets in
flight when the (spurioius) RTO fired.

But if we want to be paranoid, then this new response could be written
to only trigger if SACK is enabled (the vast, vast majority of cases).
If SACK is enabled, and an RTO of a data packet starting at sequence
S1 results in the receiver sending only a dupack for S1 without SACK
blocks, then this clearly shows the issue is not packet loss but
suggests a receiver unable to buffer the given data packet, AFAICT.

thanks,
neal

>
> Thanks!
> Menglong Dong
>
> > AFAICT that would be another way to reach the happy state you mention:
> > "all the 100 connections will enter 0-probe status and connection
> > break won't happen", and we could reach that state without violating
> > the TCP protocol spec and without requiring changes on the receiver
> > side (so that this fix could help in scenarios where the
> > memory-constrained receiver is an older stack without special new
> > behavior).
> >
> > Eric, Yuchung, Menglong: do you think something like that would work?
> >
> > neal

2023-05-20 09:16:31

by Menglong Dong

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Fri, May 19, 2023 at 12:03 AM Neal Cardwell <[email protected]> wrote:
>
> On Thu, May 18, 2023 at 10:12 AM Menglong Dong <[email protected]> wrote:
> >
> > On Thu, May 18, 2023 at 9:40 PM Neal Cardwell <[email protected]> wrote:
> > >
> > > On Wed, May 17, 2023 at 10:35 PM Menglong Dong <[email protected]> wrote:
> > > >
> > > > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <[email protected]> wrote:
> > > > >
> > > > > On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> > > > > >
> > > > > > From: Menglong Dong <[email protected]>
> > > > > >
> > > > > > Window shrink is not allowed and also not handled for now, but it's
> > > > > > needed in some case.
> > > > > >
> > > > > > In the origin logic, 0 probe is triggered only when there is no any
> > > > > > data in the retrans queue and the receive window can't hold the data
> > > > > > of the 1th packet in the send queue.
> > > > > >
> > > > > > Now, let's change it and trigger the 0 probe in such cases:
> > > > > >
> > > > > > - if the retrans queue has data and the 1th packet in it is not within
> > > > > > the receive window
> > > > > > - no data in the retrans queue and the 1th packet in the send queue is
> > > > > > out of the end of the receive window
> > > > >
> > > > > Sorry, I do not understand.
> > > > >
> > > > > Please provide packetdrill tests for new behavior like that.
> > > > >
> > > >
> > > > Yes. The problem can be reproduced easily.
> > > >
> > > > 1. choose a server machine, decrease it's tcp_mem with:
> > > > echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > > > 2. call listen() and accept() on a port, such as 8888. We call
> > > > accept() looply and without call recv() to make the data stay
> > > > in the receive queue.
> > > > 3. choose a client machine, and create 100 TCP connection
> > > > to the 8888 port of the server. Then, every connection sends
> > > > data about 1M.
> > > > 4. we can see that some of the connection enter the 0-probe
> > > > state, but some of them keep retrans again and again. As
> > > > the server is up to the tcp_mem[2] and skb is dropped before
> > > > the recv_buf full and the connection enter 0-probe state.
> > > > Finially, some of these connection will timeout and break.
> > > >
> > > > With this series, all the 100 connections will enter 0-probe
> > > > status and connection break won't happen. And the data
> > > > trans will recover if we increase tcp_mem or call 'recv()'
> > > > on the sockets in the server.
> > > >
> > > > > Also, such fundamental change would need IETF discussion first.
> > > > > We do not want linux to cause network collapses just because billions
> > > > > of devices send more zero probes.
> > > >
> > > > I think it maybe a good idea to make the connection enter
> > > > 0-probe, rather than drop the skb silently. What 0-probe
> > > > meaning is to wait for space available when the buffer of the
> > > > receive queue is full. And maybe we can also use 0-probe
> > > > when the "buffer" of "TCP protocol" (which means tcp_mem)
> > > > is full?
> > > >
> > > > Am I right?
> > > >
> > > > Thanks!
> > > > Menglong Dong
> > >
> > > Thanks for describing the scenario in more detail. (Some kind of
> > > packetdrill script or other program to reproduce this issue would be
> > > nice, too, as Eric noted.)
> > >
> > > You mention in step (4.) above that some of the connections keep
> > > retransmitting again and again. Are those connections receiving any
> > > ACKs in response to their retransmissions? Perhaps they are receiving
> > > dupacks?
> >
> > Actually, these packets are dropped without any reply, even dupacks.
> > skb will be dropped directly when tcp_try_rmem_schedule()
> > fails in tcp_data_queue(). That's reasonable, as it's
> > useless to reply a ack to the sender, which will cause the sender
> > fast retrans the packet, because we are out of memory now, and
> > retrans can't solve the problem.
>
> I'm not sure I see the problem. If retransmits can't solve the
> problem, then why are you proposing that data senders keep
> retransmitting forever (via 0-window-probes) in this kind of scenario?
>

Because the connection will break if the count of
retransmits up to tcp_retires2, but probe-0 can keep
for a long time.

> A single dupack without SACK blocks will not cause the sender to fast
> retransmit. (Only 3 dupacks would trigger fast retransmit.)
>
> Three or more dupacks without SACK blocks will cause the sender to
> fast retransmit the segment above SND.UNA once if the sender doesn't
> have SACK support. But in this case AFAICT fast-retransmitting once is
> a fine strategy, since the sender should keep retrying transmits (with
> backoff) until the receiver potentially has memory available to
> receive the packet.
>
> >
> > > If so, then perhaps we could solve this problem without
> > > depending on a violation of the TCP spec (which says the receive
> > > window should not be retracted) in the following way: when a data
> > > sender suffers a retransmission timeout, and retransmits the first
> > > unacknowledged segment, and receives a dupack for SND.UNA instead of
> > > an ACK covering the RTO-retransmitted segment, then the data sender
> > > should estimate that the receiver doesn't have enough memory to buffer
> > > the retransmitted packet. In that case, the data sender should enter
> > > the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
> > > call tcp_probe_timer().
> > >
> > > Basically we could try to enhance the sender-side logic to try to
> > > distinguish between two kinds of problems:
> > >
> > > (a) Repeated data packet loss caused by congestion, routing problems,
> > > or connectivity problems. In this case, the data sender uses
> > > ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> > > retries sysctl_tcp_retries2 times before timing out the connection
> > >
> > > (b) A receiver that is repeatedly sending dupacks but not ACKing
> > > retransmitted data because it doesn't have any memory. In this case,
> > > the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
> > > off but keeps retrying as long as the data sender receives ACKs.
> > >
> >
> > I'm not sure if this is an ideal method, as it may be not rigorous
> > to conclude that the receiver is oom with dupacks. A packet can
> > loss can also cause multi dupacks.
>
> When a data sender suffers an RTO and retransmits a single data
> packet, it would be very rare for the data sender to receive multiple
> pure dupacks without SACKs. This would only happen in the rare case
> where (a) the connection did not have SACK enabled, and (b) there was
> a hole in the received sequence space and there were still packets in
> flight when the (spurioius) RTO fired.
>
> But if we want to be paranoid, then this new response could be written
> to only trigger if SACK is enabled (the vast, vast majority of cases).
> If SACK is enabled, and an RTO of a data packet starting at sequence
> S1 results in the receiver sending only a dupack for S1 without SACK
> blocks, then this clearly shows the issue is not packet loss but
> suggests a receiver unable to buffer the given data packet, AFAICT.
>

Yeah, you are right on this point, multi pure dupacks can
mean out of memory of the receiver. But we still need to
know if the receiver recovers from OOM. Without window
shrink, the window in the ack of zero-window probe packet
is not zero on OOM.

Hi, Eric and kuba, do you have any comments on this
case?

Thanks!
Menglong Dong

> thanks,
> neal
>
> >
> > Thanks!
> > Menglong Dong
> >
> > > AFAICT that would be another way to reach the happy state you mention:
> > > "all the 100 connections will enter 0-probe status and connection
> > > break won't happen", and we could reach that state without violating
> > > the TCP protocol spec and without requiring changes on the receiver
> > > side (so that this fix could help in scenarios where the
> > > memory-constrained receiver is an older stack without special new
> > > behavior).
> > >
> > > Eric, Yuchung, Menglong: do you think something like that would work?
> > >
> > > neal

2023-05-20 14:46:24

by Neal Cardwell

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Sat, May 20, 2023 at 5:08 AM Menglong Dong <[email protected]> wrote:
>
> On Fri, May 19, 2023 at 12:03 AM Neal Cardwell <[email protected]> wrote:
> >
> > On Thu, May 18, 2023 at 10:12 AM Menglong Dong <[email protected]> wrote:
> > >
> > > On Thu, May 18, 2023 at 9:40 PM Neal Cardwell <[email protected]> wrote:
> > > >
> > > > On Wed, May 17, 2023 at 10:35 PM Menglong Dong <[email protected]> wrote:
> > > > >
> > > > > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> > > > > > >
> > > > > > > From: Menglong Dong <[email protected]>
> > > > > > >
> > > > > > > Window shrink is not allowed and also not handled for now, but it's
> > > > > > > needed in some case.
> > > > > > >
> > > > > > > In the origin logic, 0 probe is triggered only when there is no any
> > > > > > > data in the retrans queue and the receive window can't hold the data
> > > > > > > of the 1th packet in the send queue.
> > > > > > >
> > > > > > > Now, let's change it and trigger the 0 probe in such cases:
> > > > > > >
> > > > > > > - if the retrans queue has data and the 1th packet in it is not within
> > > > > > > the receive window
> > > > > > > - no data in the retrans queue and the 1th packet in the send queue is
> > > > > > > out of the end of the receive window
> > > > > >
> > > > > > Sorry, I do not understand.
> > > > > >
> > > > > > Please provide packetdrill tests for new behavior like that.
> > > > > >
> > > > >
> > > > > Yes. The problem can be reproduced easily.
> > > > >
> > > > > 1. choose a server machine, decrease it's tcp_mem with:
> > > > > echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > > > > 2. call listen() and accept() on a port, such as 8888. We call
> > > > > accept() looply and without call recv() to make the data stay
> > > > > in the receive queue.
> > > > > 3. choose a client machine, and create 100 TCP connection
> > > > > to the 8888 port of the server. Then, every connection sends
> > > > > data about 1M.
> > > > > 4. we can see that some of the connection enter the 0-probe
> > > > > state, but some of them keep retrans again and again. As
> > > > > the server is up to the tcp_mem[2] and skb is dropped before
> > > > > the recv_buf full and the connection enter 0-probe state.
> > > > > Finially, some of these connection will timeout and break.
> > > > >
> > > > > With this series, all the 100 connections will enter 0-probe
> > > > > status and connection break won't happen. And the data
> > > > > trans will recover if we increase tcp_mem or call 'recv()'
> > > > > on the sockets in the server.
> > > > >
> > > > > > Also, such fundamental change would need IETF discussion first.
> > > > > > We do not want linux to cause network collapses just because billions
> > > > > > of devices send more zero probes.
> > > > >
> > > > > I think it maybe a good idea to make the connection enter
> > > > > 0-probe, rather than drop the skb silently. What 0-probe
> > > > > meaning is to wait for space available when the buffer of the
> > > > > receive queue is full. And maybe we can also use 0-probe
> > > > > when the "buffer" of "TCP protocol" (which means tcp_mem)
> > > > > is full?
> > > > >
> > > > > Am I right?
> > > > >
> > > > > Thanks!
> > > > > Menglong Dong
> > > >
> > > > Thanks for describing the scenario in more detail. (Some kind of
> > > > packetdrill script or other program to reproduce this issue would be
> > > > nice, too, as Eric noted.)
> > > >
> > > > You mention in step (4.) above that some of the connections keep
> > > > retransmitting again and again. Are those connections receiving any
> > > > ACKs in response to their retransmissions? Perhaps they are receiving
> > > > dupacks?
> > >
> > > Actually, these packets are dropped without any reply, even dupacks.
> > > skb will be dropped directly when tcp_try_rmem_schedule()
> > > fails in tcp_data_queue(). That's reasonable, as it's
> > > useless to reply a ack to the sender, which will cause the sender
> > > fast retrans the packet, because we are out of memory now, and
> > > retrans can't solve the problem.
> >
> > I'm not sure I see the problem. If retransmits can't solve the
> > problem, then why are you proposing that data senders keep
> > retransmitting forever (via 0-window-probes) in this kind of scenario?
> >
>
> Because the connection will break if the count of
> retransmits up to tcp_retires2, but probe-0 can keep
> for a long time.

I see. So it sounds like you agree that retransmits can solve the
problem, as long as the retransmits are using the zero-window probe
state machine (ICSK_TIME_PROBE0, tcp_probe_timer()), which continues
as long as the receiver is sending ACKs. And it sounds like when you
said "retrans can't solve the problem" you didn't literally mean that
retransmits can't solve the problem, but rather you meant that the RTO
state machine, specifically (ICSK_TIME_RETRANS,
tcp_retransmit_timer(), etc) can't solve the problem. I agree with
that assessment that in this scenario tcp_probe_timer() seems like a
solution but tcp_retransmit_timer() does not.

> > A single dupack without SACK blocks will not cause the sender to fast
> > retransmit. (Only 3 dupacks would trigger fast retransmit.)
> >
> > Three or more dupacks without SACK blocks will cause the sender to
> > fast retransmit the segment above SND.UNA once if the sender doesn't
> > have SACK support. But in this case AFAICT fast-retransmitting once is
> > a fine strategy, since the sender should keep retrying transmits (with
> > backoff) until the receiver potentially has memory available to
> > receive the packet.
> >
> > >
> > > > If so, then perhaps we could solve this problem without
> > > > depending on a violation of the TCP spec (which says the receive
> > > > window should not be retracted) in the following way: when a data
> > > > sender suffers a retransmission timeout, and retransmits the first
> > > > unacknowledged segment, and receives a dupack for SND.UNA instead of
> > > > an ACK covering the RTO-retransmitted segment, then the data sender
> > > > should estimate that the receiver doesn't have enough memory to buffer
> > > > the retransmitted packet. In that case, the data sender should enter
> > > > the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
> > > > call tcp_probe_timer().
> > > >
> > > > Basically we could try to enhance the sender-side logic to try to
> > > > distinguish between two kinds of problems:
> > > >
> > > > (a) Repeated data packet loss caused by congestion, routing problems,
> > > > or connectivity problems. In this case, the data sender uses
> > > > ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> > > > retries sysctl_tcp_retries2 times before timing out the connection
> > > >
> > > > (b) A receiver that is repeatedly sending dupacks but not ACKing
> > > > retransmitted data because it doesn't have any memory. In this case,
> > > > the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
> > > > off but keeps retrying as long as the data sender receives ACKs.
> > > >
> > >
> > > I'm not sure if this is an ideal method, as it may be not rigorous
> > > to conclude that the receiver is oom with dupacks. A packet can
> > > loss can also cause multi dupacks.
> >
> > When a data sender suffers an RTO and retransmits a single data
> > packet, it would be very rare for the data sender to receive multiple
> > pure dupacks without SACKs. This would only happen in the rare case
> > where (a) the connection did not have SACK enabled, and (b) there was
> > a hole in the received sequence space and there were still packets in
> > flight when the (spurioius) RTO fired.
> >
> > But if we want to be paranoid, then this new response could be written
> > to only trigger if SACK is enabled (the vast, vast majority of cases).
> > If SACK is enabled, and an RTO of a data packet starting at sequence
> > S1 results in the receiver sending only a dupack for S1 without SACK
> > blocks, then this clearly shows the issue is not packet loss but
> > suggests a receiver unable to buffer the given data packet, AFAICT.
> >
>
> Yeah, you are right on this point, multi pure dupacks can
> mean out of memory of the receiver. But we still need to
> know if the receiver recovers from OOM. Without window
> shrink, the window in the ack of zero-window probe packet
> is not zero on OOM.

But do we need a protocol-violating zero-window in this case? Why not
use my approach suggested above: conveying the OOM condition by
sending an ACK but not ACKing the retransmitted packet?

Thanks,
neal

> Hi, Eric and kuba, do you have any comments on this
> case?
>
> Thanks!
> Menglong Dong
>
> > thanks,
> > neal
> >
> > >
> > > Thanks!
> > > Menglong Dong
> > >
> > > > AFAICT that would be another way to reach the happy state you mention:
> > > > "all the 100 connections will enter 0-probe status and connection
> > > > break won't happen", and we could reach that state without violating
> > > > the TCP protocol spec and without requiring changes on the receiver
> > > > side (so that this fix could help in scenarios where the
> > > > memory-constrained receiver is an older stack without special new
> > > > behavior).
> > > >
> > > > Eric, Yuchung, Menglong: do you think something like that would work?
> > > >
> > > > neal

2023-05-22 03:13:33

by Menglong Dong

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Sat, May 20, 2023 at 10:28 PM Neal Cardwell <[email protected]> wrote:
>
> On Sat, May 20, 2023 at 5:08 AM Menglong Dong <[email protected]> wrote:
> >
> > On Fri, May 19, 2023 at 12:03 AM Neal Cardwell <[email protected]> wrote:
> > >
> > > On Thu, May 18, 2023 at 10:12 AM Menglong Dong <[email protected]> wrote:
> > > >
> > > > On Thu, May 18, 2023 at 9:40 PM Neal Cardwell <[email protected]> wrote:
> > > > >
> > > > > On Wed, May 17, 2023 at 10:35 PM Menglong Dong <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <[email protected]> wrote:
> > > > > > >
> > > > > > > On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> > > > > > > >
> > > > > > > > From: Menglong Dong <[email protected]>
> > > > > > > >
> > > > > > > > Window shrink is not allowed and also not handled for now, but it's
> > > > > > > > needed in some case.
> > > > > > > >
> > > > > > > > In the origin logic, 0 probe is triggered only when there is no any
> > > > > > > > data in the retrans queue and the receive window can't hold the data
> > > > > > > > of the 1th packet in the send queue.
> > > > > > > >
> > > > > > > > Now, let's change it and trigger the 0 probe in such cases:
> > > > > > > >
> > > > > > > > - if the retrans queue has data and the 1th packet in it is not within
> > > > > > > > the receive window
> > > > > > > > - no data in the retrans queue and the 1th packet in the send queue is
> > > > > > > > out of the end of the receive window
> > > > > > >
> > > > > > > Sorry, I do not understand.
> > > > > > >
> > > > > > > Please provide packetdrill tests for new behavior like that.
> > > > > > >
> > > > > >
> > > > > > Yes. The problem can be reproduced easily.
> > > > > >
> > > > > > 1. choose a server machine, decrease it's tcp_mem with:
> > > > > > echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > > > > > 2. call listen() and accept() on a port, such as 8888. We call
> > > > > > accept() looply and without call recv() to make the data stay
> > > > > > in the receive queue.
> > > > > > 3. choose a client machine, and create 100 TCP connection
> > > > > > to the 8888 port of the server. Then, every connection sends
> > > > > > data about 1M.
> > > > > > 4. we can see that some of the connection enter the 0-probe
> > > > > > state, but some of them keep retrans again and again. As
> > > > > > the server is up to the tcp_mem[2] and skb is dropped before
> > > > > > the recv_buf full and the connection enter 0-probe state.
> > > > > > Finially, some of these connection will timeout and break.
> > > > > >
> > > > > > With this series, all the 100 connections will enter 0-probe
> > > > > > status and connection break won't happen. And the data
> > > > > > trans will recover if we increase tcp_mem or call 'recv()'
> > > > > > on the sockets in the server.
> > > > > >
> > > > > > > Also, such fundamental change would need IETF discussion first.
> > > > > > > We do not want linux to cause network collapses just because billions
> > > > > > > of devices send more zero probes.
> > > > > >
> > > > > > I think it maybe a good idea to make the connection enter
> > > > > > 0-probe, rather than drop the skb silently. What 0-probe
> > > > > > meaning is to wait for space available when the buffer of the
> > > > > > receive queue is full. And maybe we can also use 0-probe
> > > > > > when the "buffer" of "TCP protocol" (which means tcp_mem)
> > > > > > is full?
> > > > > >
> > > > > > Am I right?
> > > > > >
> > > > > > Thanks!
> > > > > > Menglong Dong
> > > > >
> > > > > Thanks for describing the scenario in more detail. (Some kind of
> > > > > packetdrill script or other program to reproduce this issue would be
> > > > > nice, too, as Eric noted.)
> > > > >
> > > > > You mention in step (4.) above that some of the connections keep
> > > > > retransmitting again and again. Are those connections receiving any
> > > > > ACKs in response to their retransmissions? Perhaps they are receiving
> > > > > dupacks?
> > > >
> > > > Actually, these packets are dropped without any reply, even dupacks.
> > > > skb will be dropped directly when tcp_try_rmem_schedule()
> > > > fails in tcp_data_queue(). That's reasonable, as it's
> > > > useless to reply a ack to the sender, which will cause the sender
> > > > fast retrans the packet, because we are out of memory now, and
> > > > retrans can't solve the problem.
> > >
> > > I'm not sure I see the problem. If retransmits can't solve the
> > > problem, then why are you proposing that data senders keep
> > > retransmitting forever (via 0-window-probes) in this kind of scenario?
> > >
> >
> > Because the connection will break if the count of
> > retransmits up to tcp_retires2, but probe-0 can keep
> > for a long time.
>
> I see. So it sounds like you agree that retransmits can solve the
> problem, as long as the retransmits are using the zero-window probe
> state machine (ICSK_TIME_PROBE0, tcp_probe_timer()), which continues
> as long as the receiver is sending ACKs. And it sounds like when you
> said "retrans can't solve the problem" you didn't literally mean that
> retransmits can't solve the problem, but rather you meant that the RTO
> state machine, specifically (ICSK_TIME_RETRANS,
> tcp_retransmit_timer(), etc) can't solve the problem. I agree with
> that assessment that in this scenario tcp_probe_timer() seems like a
> solution but tcp_retransmit_timer() does not.
>

Yes, that is indeed what I want to express.

> > > A single dupack without SACK blocks will not cause the sender to fast
> > > retransmit. (Only 3 dupacks would trigger fast retransmit.)
> > >
> > > Three or more dupacks without SACK blocks will cause the sender to
> > > fast retransmit the segment above SND.UNA once if the sender doesn't
> > > have SACK support. But in this case AFAICT fast-retransmitting once is
> > > a fine strategy, since the sender should keep retrying transmits (with
> > > backoff) until the receiver potentially has memory available to
> > > receive the packet.
> > >
> > > >
> > > > > If so, then perhaps we could solve this problem without
> > > > > depending on a violation of the TCP spec (which says the receive
> > > > > window should not be retracted) in the following way: when a data
> > > > > sender suffers a retransmission timeout, and retransmits the first
> > > > > unacknowledged segment, and receives a dupack for SND.UNA instead of
> > > > > an ACK covering the RTO-retransmitted segment, then the data sender
> > > > > should estimate that the receiver doesn't have enough memory to buffer
> > > > > the retransmitted packet. In that case, the data sender should enter
> > > > > the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
> > > > > call tcp_probe_timer().
> > > > >
> > > > > Basically we could try to enhance the sender-side logic to try to
> > > > > distinguish between two kinds of problems:
> > > > >
> > > > > (a) Repeated data packet loss caused by congestion, routing problems,
> > > > > or connectivity problems. In this case, the data sender uses
> > > > > ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> > > > > retries sysctl_tcp_retries2 times before timing out the connection
> > > > >
> > > > > (b) A receiver that is repeatedly sending dupacks but not ACKing
> > > > > retransmitted data because it doesn't have any memory. In this case,
> > > > > the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
> > > > > off but keeps retrying as long as the data sender receives ACKs.
> > > > >
> > > >
> > > > I'm not sure if this is an ideal method, as it may be not rigorous
> > > > to conclude that the receiver is oom with dupacks. A packet can
> > > > loss can also cause multi dupacks.
> > >
> > > When a data sender suffers an RTO and retransmits a single data
> > > packet, it would be very rare for the data sender to receive multiple
> > > pure dupacks without SACKs. This would only happen in the rare case
> > > where (a) the connection did not have SACK enabled, and (b) there was
> > > a hole in the received sequence space and there were still packets in
> > > flight when the (spurioius) RTO fired.
> > >
> > > But if we want to be paranoid, then this new response could be written
> > > to only trigger if SACK is enabled (the vast, vast majority of cases).
> > > If SACK is enabled, and an RTO of a data packet starting at sequence
> > > S1 results in the receiver sending only a dupack for S1 without SACK
> > > blocks, then this clearly shows the issue is not packet loss but
> > > suggests a receiver unable to buffer the given data packet, AFAICT.
> > >
> >
> > Yeah, you are right on this point, multi pure dupacks can
> > mean out of memory of the receiver. But we still need to
> > know if the receiver recovers from OOM. Without window
> > shrink, the window in the ack of zero-window probe packet
> > is not zero on OOM.
>
> But do we need a protocol-violating zero-window in this case? Why not
> use my approach suggested above: conveying the OOM condition by
> sending an ACK but not ACKing the retransmitted packet?
>

I agree with you about the approach you mentioned
about conveying the OOM condition. But that approach
can't convey the recovery from OOM, can it?

Let's see the process. With 3 pure dupack for SND.UNA,
we deem the OOM of the receiver and make the sender
enter zero-window probe state.

The sender will keep sending probe0 packets, and the
receiver will reply an ack. However, as we don't
shrink the window actually, the window in the ack is
not zero on OOM, so we can't know if the receiver has
recovered from OOM and retransmit the data in retransmit
queue.

BTW, the probe0 will send the last byte that was already
acked, so the ack of the probe0 will be a pure dupack.

Did I miss something?

BTW, a previous patch has explained the need to
support window shrink, which should satisfy the RFC
of TCP protocol:

https://lore.kernel.org/netdev/[email protected]/

Thanks!
Menglong Dong

> Thanks,
> neal
>
> > Hi, Eric and kuba, do you have any comments on this
> > case?
> >
> > Thanks!
> > Menglong Dong
> >
> > > thanks,
> > > neal
> > >
> > > >
> > > > Thanks!
> > > > Menglong Dong
> > > >
> > > > > AFAICT that would be another way to reach the happy state you mention:
> > > > > "all the 100 connections will enter 0-probe status and connection
> > > > > break won't happen", and we could reach that state without violating
> > > > > the TCP protocol spec and without requiring changes on the receiver
> > > > > side (so that this fix could help in scenarios where the
> > > > > memory-constrained receiver is an older stack without special new
> > > > > behavior).
> > > > >
> > > > > Eric, Yuchung, Menglong: do you think something like that would work?
> > > > >
> > > > > neal

2023-05-22 15:15:27

by Neal Cardwell

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Sun, May 21, 2023 at 10:55 PM Menglong Dong <[email protected]> wrote:
>
> On Sat, May 20, 2023 at 10:28 PM Neal Cardwell <[email protected]> wrote:
> >
> > On Sat, May 20, 2023 at 5:08 AM Menglong Dong <[email protected]> wrote:
> > >
> > > On Fri, May 19, 2023 at 12:03 AM Neal Cardwell <[email protected]> wrote:
> > > >
> > > > On Thu, May 18, 2023 at 10:12 AM Menglong Dong <[email protected]> wrote:
> > > > >
> > > > > On Thu, May 18, 2023 at 9:40 PM Neal Cardwell <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, May 17, 2023 at 10:35 PM Menglong Dong <[email protected]> wrote:
> > > > > > >
> > > > > > > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > From: Menglong Dong <[email protected]>
> > > > > > > > >
> > > > > > > > > Window shrink is not allowed and also not handled for now, but it's
> > > > > > > > > needed in some case.
> > > > > > > > >
> > > > > > > > > In the origin logic, 0 probe is triggered only when there is no any
> > > > > > > > > data in the retrans queue and the receive window can't hold the data
> > > > > > > > > of the 1th packet in the send queue.
> > > > > > > > >
> > > > > > > > > Now, let's change it and trigger the 0 probe in such cases:
> > > > > > > > >
> > > > > > > > > - if the retrans queue has data and the 1th packet in it is not within
> > > > > > > > > the receive window
> > > > > > > > > - no data in the retrans queue and the 1th packet in the send queue is
> > > > > > > > > out of the end of the receive window
> > > > > > > >
> > > > > > > > Sorry, I do not understand.
> > > > > > > >
> > > > > > > > Please provide packetdrill tests for new behavior like that.
> > > > > > > >
> > > > > > >
> > > > > > > Yes. The problem can be reproduced easily.
> > > > > > >
> > > > > > > 1. choose a server machine, decrease it's tcp_mem with:
> > > > > > > echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > > > > > > 2. call listen() and accept() on a port, such as 8888. We call
> > > > > > > accept() looply and without call recv() to make the data stay
> > > > > > > in the receive queue.
> > > > > > > 3. choose a client machine, and create 100 TCP connection
> > > > > > > to the 8888 port of the server. Then, every connection sends
> > > > > > > data about 1M.
> > > > > > > 4. we can see that some of the connection enter the 0-probe
> > > > > > > state, but some of them keep retrans again and again. As
> > > > > > > the server is up to the tcp_mem[2] and skb is dropped before
> > > > > > > the recv_buf full and the connection enter 0-probe state.
> > > > > > > Finially, some of these connection will timeout and break.
> > > > > > >
> > > > > > > With this series, all the 100 connections will enter 0-probe
> > > > > > > status and connection break won't happen. And the data
> > > > > > > trans will recover if we increase tcp_mem or call 'recv()'
> > > > > > > on the sockets in the server.
> > > > > > >
> > > > > > > > Also, such fundamental change would need IETF discussion first.
> > > > > > > > We do not want linux to cause network collapses just because billions
> > > > > > > > of devices send more zero probes.
> > > > > > >
> > > > > > > I think it maybe a good idea to make the connection enter
> > > > > > > 0-probe, rather than drop the skb silently. What 0-probe
> > > > > > > meaning is to wait for space available when the buffer of the
> > > > > > > receive queue is full. And maybe we can also use 0-probe
> > > > > > > when the "buffer" of "TCP protocol" (which means tcp_mem)
> > > > > > > is full?
> > > > > > >
> > > > > > > Am I right?
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Menglong Dong
> > > > > >
> > > > > > Thanks for describing the scenario in more detail. (Some kind of
> > > > > > packetdrill script or other program to reproduce this issue would be
> > > > > > nice, too, as Eric noted.)
> > > > > >
> > > > > > You mention in step (4.) above that some of the connections keep
> > > > > > retransmitting again and again. Are those connections receiving any
> > > > > > ACKs in response to their retransmissions? Perhaps they are receiving
> > > > > > dupacks?
> > > > >
> > > > > Actually, these packets are dropped without any reply, even dupacks.
> > > > > skb will be dropped directly when tcp_try_rmem_schedule()
> > > > > fails in tcp_data_queue(). That's reasonable, as it's
> > > > > useless to reply a ack to the sender, which will cause the sender
> > > > > fast retrans the packet, because we are out of memory now, and
> > > > > retrans can't solve the problem.
> > > >
> > > > I'm not sure I see the problem. If retransmits can't solve the
> > > > problem, then why are you proposing that data senders keep
> > > > retransmitting forever (via 0-window-probes) in this kind of scenario?
> > > >
> > >
> > > Because the connection will break if the count of
> > > retransmits up to tcp_retires2, but probe-0 can keep
> > > for a long time.
> >
> > I see. So it sounds like you agree that retransmits can solve the
> > problem, as long as the retransmits are using the zero-window probe
> > state machine (ICSK_TIME_PROBE0, tcp_probe_timer()), which continues
> > as long as the receiver is sending ACKs. And it sounds like when you
> > said "retrans can't solve the problem" you didn't literally mean that
> > retransmits can't solve the problem, but rather you meant that the RTO
> > state machine, specifically (ICSK_TIME_RETRANS,
> > tcp_retransmit_timer(), etc) can't solve the problem. I agree with
> > that assessment that in this scenario tcp_probe_timer() seems like a
> > solution but tcp_retransmit_timer() does not.
> >
>
> Yes, that is indeed what I want to express.
>
> > > > A single dupack without SACK blocks will not cause the sender to fast
> > > > retransmit. (Only 3 dupacks would trigger fast retransmit.)
> > > >
> > > > Three or more dupacks without SACK blocks will cause the sender to
> > > > fast retransmit the segment above SND.UNA once if the sender doesn't
> > > > have SACK support. But in this case AFAICT fast-retransmitting once is
> > > > a fine strategy, since the sender should keep retrying transmits (with
> > > > backoff) until the receiver potentially has memory available to
> > > > receive the packet.
> > > >
> > > > >
> > > > > > If so, then perhaps we could solve this problem without
> > > > > > depending on a violation of the TCP spec (which says the receive
> > > > > > window should not be retracted) in the following way: when a data
> > > > > > sender suffers a retransmission timeout, and retransmits the first
> > > > > > unacknowledged segment, and receives a dupack for SND.UNA instead of
> > > > > > an ACK covering the RTO-retransmitted segment, then the data sender
> > > > > > should estimate that the receiver doesn't have enough memory to buffer
> > > > > > the retransmitted packet. In that case, the data sender should enter
> > > > > > the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
> > > > > > call tcp_probe_timer().
> > > > > >
> > > > > > Basically we could try to enhance the sender-side logic to try to
> > > > > > distinguish between two kinds of problems:
> > > > > >
> > > > > > (a) Repeated data packet loss caused by congestion, routing problems,
> > > > > > or connectivity problems. In this case, the data sender uses
> > > > > > ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> > > > > > retries sysctl_tcp_retries2 times before timing out the connection
> > > > > >
> > > > > > (b) A receiver that is repeatedly sending dupacks but not ACKing
> > > > > > retransmitted data because it doesn't have any memory. In this case,
> > > > > > the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
> > > > > > off but keeps retrying as long as the data sender receives ACKs.
> > > > > >
> > > > >
> > > > > I'm not sure if this is an ideal method, as it may be not rigorous
> > > > > to conclude that the receiver is oom with dupacks. A packet can
> > > > > loss can also cause multi dupacks.
> > > >
> > > > When a data sender suffers an RTO and retransmits a single data
> > > > packet, it would be very rare for the data sender to receive multiple
> > > > pure dupacks without SACKs. This would only happen in the rare case
> > > > where (a) the connection did not have SACK enabled, and (b) there was
> > > > a hole in the received sequence space and there were still packets in
> > > > flight when the (spurioius) RTO fired.
> > > >
> > > > But if we want to be paranoid, then this new response could be written
> > > > to only trigger if SACK is enabled (the vast, vast majority of cases).
> > > > If SACK is enabled, and an RTO of a data packet starting at sequence
> > > > S1 results in the receiver sending only a dupack for S1 without SACK
> > > > blocks, then this clearly shows the issue is not packet loss but
> > > > suggests a receiver unable to buffer the given data packet, AFAICT.
> > > >
> > >
> > > Yeah, you are right on this point, multi pure dupacks can
> > > mean out of memory of the receiver. But we still need to
> > > know if the receiver recovers from OOM. Without window
> > > shrink, the window in the ack of zero-window probe packet
> > > is not zero on OOM.
> >
> > But do we need a protocol-violating zero-window in this case? Why not
> > use my approach suggested above: conveying the OOM condition by
> > sending an ACK but not ACKing the retransmitted packet?
> >
>
> I agree with you about the approach you mentioned
> about conveying the OOM condition. But that approach
> can't convey the recovery from OOM, can it?

Yes, my suggested approach can convey the recovery from OOM. The data
receiver conveys the recovery from OOM by buffering and ACKing the
retransmitted data packet.

> Let's see the process. With 3 pure dupack for SND.UNA,
> we deem the OOM of the receiver and make the sender
> enter zero-window probe state.

AFAICT the data sender does not need to wait for 3 pure dpacks for
SND.UNA. AFAICT a data sender that suffers an RTO and finds that its
RTO gets a response that is a single dupack for SND.UNA could estimate
that the receiver is OOM, and enter the zero-window probe state.

> The sender will keep sending probe0 packets, and the
> receiver will reply an ack. However, as we don't
> shrink the window actually, the window in the ack is
> not zero on OOM, so we can't know if the receiver has
> recovered from OOM and retransmit the data in retransmit
> queue.

As I noted above, in my proposal the data receiver conveys the
recovery from OOM by buffering and ACKing the retransmitted data
packet.

> BTW, the probe0 will send the last byte that was already
> acked, so the ack of the probe0 will be a pure dupack.
>
> Did I miss something?

I don't think that's the case. My read of tcp_write_wakeup() is that
it will send an skb that is whatever prefix of 1 MSS is allowed by the
receiver window. In the scenario we are discussing, that would mean
that it sends a full 1 MSS. So AFAICT tcp_write_wakeup() could be used
for sending a data "probe" packet for this OOM case.

> BTW, a previous patch has explained the need to
> support window shrink, which should satisfy the RFC
> of TCP protocol:
>
> https://lore.kernel.org/netdev/[email protected]/

Let's see what Eric thinks about that patch.

neal

> Thanks!
> Menglong Dong
>
> > Thanks,
> > neal
> >
> > > Hi, Eric and kuba, do you have any comments on this
> > > case?
> > >
> > > Thanks!
> > > Menglong Dong
> > >
> > > > thanks,
> > > > neal
> > > >
> > > > >
> > > > > Thanks!
> > > > > Menglong Dong
> > > > >
> > > > > > AFAICT that would be another way to reach the happy state you mention:
> > > > > > "all the 100 connections will enter 0-probe status and connection
> > > > > > break won't happen", and we could reach that state without violating
> > > > > > the TCP protocol spec and without requiring changes on the receiver
> > > > > > side (so that this fix could help in scenarios where the
> > > > > > memory-constrained receiver is an older stack without special new
> > > > > > behavior).
> > > > > >
> > > > > > Eric, Yuchung, Menglong: do you think something like that would work?
> > > > > >
> > > > > > neal

2023-05-23 09:06:52

by Menglong Dong

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Mon, May 22, 2023 at 11:04 PM Neal Cardwell <[email protected]> wrote:
>
> On Sun, May 21, 2023 at 10:55 PM Menglong Dong <[email protected]> wrote:
> >
> > On Sat, May 20, 2023 at 10:28 PM Neal Cardwell <[email protected]> wrote:
> > >
> > > On Sat, May 20, 2023 at 5:08 AM Menglong Dong <[email protected]> wrote:
> > > >
> > > > On Fri, May 19, 2023 at 12:03 AM Neal Cardwell <[email protected]> wrote:
> > > > >
> > > > > On Thu, May 18, 2023 at 10:12 AM Menglong Dong <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, May 18, 2023 at 9:40 PM Neal Cardwell <[email protected]> wrote:
> > > > > > >
> > > > > > > On Wed, May 17, 2023 at 10:35 PM Menglong Dong <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > From: Menglong Dong <[email protected]>
> > > > > > > > > >
> > > > > > > > > > Window shrink is not allowed and also not handled for now, but it's
> > > > > > > > > > needed in some case.
> > > > > > > > > >
> > > > > > > > > > In the origin logic, 0 probe is triggered only when there is no any
> > > > > > > > > > data in the retrans queue and the receive window can't hold the data
> > > > > > > > > > of the 1th packet in the send queue.
> > > > > > > > > >
> > > > > > > > > > Now, let's change it and trigger the 0 probe in such cases:
> > > > > > > > > >
> > > > > > > > > > - if the retrans queue has data and the 1th packet in it is not within
> > > > > > > > > > the receive window
> > > > > > > > > > - no data in the retrans queue and the 1th packet in the send queue is
> > > > > > > > > > out of the end of the receive window
> > > > > > > > >
> > > > > > > > > Sorry, I do not understand.
> > > > > > > > >
> > > > > > > > > Please provide packetdrill tests for new behavior like that.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes. The problem can be reproduced easily.
> > > > > > > >
> > > > > > > > 1. choose a server machine, decrease it's tcp_mem with:
> > > > > > > > echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > > > > > > > 2. call listen() and accept() on a port, such as 8888. We call
> > > > > > > > accept() looply and without call recv() to make the data stay
> > > > > > > > in the receive queue.
> > > > > > > > 3. choose a client machine, and create 100 TCP connection
> > > > > > > > to the 8888 port of the server. Then, every connection sends
> > > > > > > > data about 1M.
> > > > > > > > 4. we can see that some of the connection enter the 0-probe
> > > > > > > > state, but some of them keep retrans again and again. As
> > > > > > > > the server is up to the tcp_mem[2] and skb is dropped before
> > > > > > > > the recv_buf full and the connection enter 0-probe state.
> > > > > > > > Finially, some of these connection will timeout and break.
> > > > > > > >
> > > > > > > > With this series, all the 100 connections will enter 0-probe
> > > > > > > > status and connection break won't happen. And the data
> > > > > > > > trans will recover if we increase tcp_mem or call 'recv()'
> > > > > > > > on the sockets in the server.
> > > > > > > >
> > > > > > > > > Also, such fundamental change would need IETF discussion first.
> > > > > > > > > We do not want linux to cause network collapses just because billions
> > > > > > > > > of devices send more zero probes.
> > > > > > > >
> > > > > > > > I think it maybe a good idea to make the connection enter
> > > > > > > > 0-probe, rather than drop the skb silently. What 0-probe
> > > > > > > > meaning is to wait for space available when the buffer of the
> > > > > > > > receive queue is full. And maybe we can also use 0-probe
> > > > > > > > when the "buffer" of "TCP protocol" (which means tcp_mem)
> > > > > > > > is full?
> > > > > > > >
> > > > > > > > Am I right?
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Menglong Dong
> > > > > > >
> > > > > > > Thanks for describing the scenario in more detail. (Some kind of
> > > > > > > packetdrill script or other program to reproduce this issue would be
> > > > > > > nice, too, as Eric noted.)
> > > > > > >
> > > > > > > You mention in step (4.) above that some of the connections keep
> > > > > > > retransmitting again and again. Are those connections receiving any
> > > > > > > ACKs in response to their retransmissions? Perhaps they are receiving
> > > > > > > dupacks?
> > > > > >
> > > > > > Actually, these packets are dropped without any reply, even dupacks.
> > > > > > skb will be dropped directly when tcp_try_rmem_schedule()
> > > > > > fails in tcp_data_queue(). That's reasonable, as it's
> > > > > > useless to reply a ack to the sender, which will cause the sender
> > > > > > fast retrans the packet, because we are out of memory now, and
> > > > > > retrans can't solve the problem.
> > > > >
> > > > > I'm not sure I see the problem. If retransmits can't solve the
> > > > > problem, then why are you proposing that data senders keep
> > > > > retransmitting forever (via 0-window-probes) in this kind of scenario?
> > > > >
> > > >
> > > > Because the connection will break if the count of
> > > > retransmits up to tcp_retires2, but probe-0 can keep
> > > > for a long time.
> > >
> > > I see. So it sounds like you agree that retransmits can solve the
> > > problem, as long as the retransmits are using the zero-window probe
> > > state machine (ICSK_TIME_PROBE0, tcp_probe_timer()), which continues
> > > as long as the receiver is sending ACKs. And it sounds like when you
> > > said "retrans can't solve the problem" you didn't literally mean that
> > > retransmits can't solve the problem, but rather you meant that the RTO
> > > state machine, specifically (ICSK_TIME_RETRANS,
> > > tcp_retransmit_timer(), etc) can't solve the problem. I agree with
> > > that assessment that in this scenario tcp_probe_timer() seems like a
> > > solution but tcp_retransmit_timer() does not.
> > >
> >
> > Yes, that is indeed what I want to express.
> >
> > > > > A single dupack without SACK blocks will not cause the sender to fast
> > > > > retransmit. (Only 3 dupacks would trigger fast retransmit.)
> > > > >
> > > > > Three or more dupacks without SACK blocks will cause the sender to
> > > > > fast retransmit the segment above SND.UNA once if the sender doesn't
> > > > > have SACK support. But in this case AFAICT fast-retransmitting once is
> > > > > a fine strategy, since the sender should keep retrying transmits (with
> > > > > backoff) until the receiver potentially has memory available to
> > > > > receive the packet.
> > > > >
> > > > > >
> > > > > > > If so, then perhaps we could solve this problem without
> > > > > > > depending on a violation of the TCP spec (which says the receive
> > > > > > > window should not be retracted) in the following way: when a data
> > > > > > > sender suffers a retransmission timeout, and retransmits the first
> > > > > > > unacknowledged segment, and receives a dupack for SND.UNA instead of
> > > > > > > an ACK covering the RTO-retransmitted segment, then the data sender
> > > > > > > should estimate that the receiver doesn't have enough memory to buffer
> > > > > > > the retransmitted packet. In that case, the data sender should enter
> > > > > > > the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
> > > > > > > call tcp_probe_timer().
> > > > > > >
> > > > > > > Basically we could try to enhance the sender-side logic to try to
> > > > > > > distinguish between two kinds of problems:
> > > > > > >
> > > > > > > (a) Repeated data packet loss caused by congestion, routing problems,
> > > > > > > or connectivity problems. In this case, the data sender uses
> > > > > > > ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> > > > > > > retries sysctl_tcp_retries2 times before timing out the connection
> > > > > > >
> > > > > > > (b) A receiver that is repeatedly sending dupacks but not ACKing
> > > > > > > retransmitted data because it doesn't have any memory. In this case,
> > > > > > > the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
> > > > > > > off but keeps retrying as long as the data sender receives ACKs.
> > > > > > >
> > > > > >
> > > > > > I'm not sure if this is an ideal method, as it may be not rigorous
> > > > > > to conclude that the receiver is oom with dupacks. A packet can
> > > > > > loss can also cause multi dupacks.
> > > > >
> > > > > When a data sender suffers an RTO and retransmits a single data
> > > > > packet, it would be very rare for the data sender to receive multiple
> > > > > pure dupacks without SACKs. This would only happen in the rare case
> > > > > where (a) the connection did not have SACK enabled, and (b) there was
> > > > > a hole in the received sequence space and there were still packets in
> > > > > flight when the (spurioius) RTO fired.
> > > > >
> > > > > But if we want to be paranoid, then this new response could be written
> > > > > to only trigger if SACK is enabled (the vast, vast majority of cases).
> > > > > If SACK is enabled, and an RTO of a data packet starting at sequence
> > > > > S1 results in the receiver sending only a dupack for S1 without SACK
> > > > > blocks, then this clearly shows the issue is not packet loss but
> > > > > suggests a receiver unable to buffer the given data packet, AFAICT.
> > > > >
> > > >
> > > > Yeah, you are right on this point, multi pure dupacks can
> > > > mean out of memory of the receiver. But we still need to
> > > > know if the receiver recovers from OOM. Without window
> > > > shrink, the window in the ack of zero-window probe packet
> > > > is not zero on OOM.
> > >
> > > But do we need a protocol-violating zero-window in this case? Why not
> > > use my approach suggested above: conveying the OOM condition by
> > > sending an ACK but not ACKing the retransmitted packet?
> > >
> >
> > I agree with you about the approach you mentioned
> > about conveying the OOM condition. But that approach
> > can't convey the recovery from OOM, can it?
>
> Yes, my suggested approach can convey the recovery from OOM. The data
> receiver conveys the recovery from OOM by buffering and ACKing the
> retransmitted data packet.

Oh, I understand what you mean now. You are saying that
retransmit that first packet in the retransmit queue instead
of zero-window probe packet when OOM of the receiver,
isn't it? In other word, retransmit the unacked data and ignore
the tcp_retries2 when we find the receiver is in OOM state.

That's an option, and we can make the length of the data we
send to 1 byte, which means we keep retransmitting the first
byte that has not be acked in the retransmit queue.

>
> > Let's see the process. With 3 pure dupack for SND.UNA,
> > we deem the OOM of the receiver and make the sender
> > enter zero-window probe state.
>
> AFAICT the data sender does not need to wait for 3 pure dpacks for
> SND.UNA. AFAICT a data sender that suffers an RTO and finds that its
> RTO gets a response that is a single dupack for SND.UNA could estimate
> that the receiver is OOM, and enter the zero-window probe state.
>
> > The sender will keep sending probe0 packets, and the
> > receiver will reply an ack. However, as we don't
> > shrink the window actually, the window in the ack is
> > not zero on OOM, so we can't know if the receiver has
> > recovered from OOM and retransmit the data in retransmit
> > queue.
>
> As I noted above, in my proposal the data receiver conveys the
> recovery from OOM by buffering and ACKing the retransmitted data
> packet.
>
> > BTW, the probe0 will send the last byte that was already
> > acked, so the ack of the probe0 will be a pure dupack.
> >
> > Did I miss something?
>
> I don't think that's the case. My read of tcp_write_wakeup() is that
> it will send an skb that is whatever prefix of 1 MSS is allowed by the
> receiver window. In the scenario we are discussing, that would mean
> that it sends a full 1 MSS. So AFAICT tcp_write_wakeup() could be used
> for sending a data "probe" packet for this OOM case.
>
> > BTW, a previous patch has explained the need to
> > support window shrink, which should satisfy the RFC
> > of TCP protocol:
> >
> > https://lore.kernel.org/netdev/[email protected]/
>
> Let's see what Eric thinks about that patch.
>
> neal
>
>
> > Thanks!
> > Menglong Dong
> >
> > > Thanks,
> > > neal
> > >
> > > > Hi, Eric and kuba, do you have any comments on this
> > > > case?
> > > >
> > > > Thanks!
> > > > Menglong Dong
> > > >
> > > > > thanks,
> > > > > neal
> > > > >
> > > > > >
> > > > > > Thanks!
> > > > > > Menglong Dong
> > > > > >
> > > > > > > AFAICT that would be another way to reach the happy state you mention:
> > > > > > > "all the 100 connections will enter 0-probe status and connection
> > > > > > > break won't happen", and we could reach that state without violating
> > > > > > > the TCP protocol spec and without requiring changes on the receiver
> > > > > > > side (so that this fix could help in scenarios where the
> > > > > > > memory-constrained receiver is an older stack without special new
> > > > > > > behavior).
> > > > > > >
> > > > > > > Eric, Yuchung, Menglong: do you think something like that would work?
> > > > > > >
> > > > > > > neal

2023-05-23 10:46:36

by Eric Dumazet

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Mon, May 22, 2023 at 5:04 PM Neal Cardwell <[email protected]> wrote:
>
> On Sun, May 21, 2023 at 10:55 PM Menglong Dong <[email protected]> wrote:
>
> > BTW, a previous patch has explained the need to
> > support window shrink, which should satisfy the RFC
> > of TCP protocol:
> >
> > https://lore.kernel.org/netdev/[email protected]/
>
> Let's see what Eric thinks about that patch.

I have not received a copy of the patch, so I will not comment on it.

2023-05-23 13:36:38

by Neal Cardwell

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Tue, May 23, 2023 at 4:59 AM Menglong Dong <[email protected]> wrote:
>
> On Mon, May 22, 2023 at 11:04 PM Neal Cardwell <[email protected]> wrote:
> >
> > On Sun, May 21, 2023 at 10:55 PM Menglong Dong <[email protected]> wrote:
> > >
> > > On Sat, May 20, 2023 at 10:28 PM Neal Cardwell <[email protected]> wrote:
> > > >
> > > > On Sat, May 20, 2023 at 5:08 AM Menglong Dong <[email protected]> wrote:
> > > > >
> > > > > On Fri, May 19, 2023 at 12:03 AM Neal Cardwell <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, May 18, 2023 at 10:12 AM Menglong Dong <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, May 18, 2023 at 9:40 PM Neal Cardwell <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Wed, May 17, 2023 at 10:35 PM Menglong Dong <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 17, 2023 at 10:47 PM Eric Dumazet <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, May 17, 2023 at 2:42 PM <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > From: Menglong Dong <[email protected]>
> > > > > > > > > > >
> > > > > > > > > > > Window shrink is not allowed and also not handled for now, but it's
> > > > > > > > > > > needed in some case.
> > > > > > > > > > >
> > > > > > > > > > > In the origin logic, 0 probe is triggered only when there is no any
> > > > > > > > > > > data in the retrans queue and the receive window can't hold the data
> > > > > > > > > > > of the 1th packet in the send queue.
> > > > > > > > > > >
> > > > > > > > > > > Now, let's change it and trigger the 0 probe in such cases:
> > > > > > > > > > >
> > > > > > > > > > > - if the retrans queue has data and the 1th packet in it is not within
> > > > > > > > > > > the receive window
> > > > > > > > > > > - no data in the retrans queue and the 1th packet in the send queue is
> > > > > > > > > > > out of the end of the receive window
> > > > > > > > > >
> > > > > > > > > > Sorry, I do not understand.
> > > > > > > > > >
> > > > > > > > > > Please provide packetdrill tests for new behavior like that.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yes. The problem can be reproduced easily.
> > > > > > > > >
> > > > > > > > > 1. choose a server machine, decrease it's tcp_mem with:
> > > > > > > > > echo '1024 1500 2048' > /proc/sys/net/ipv4/tcp_mem
> > > > > > > > > 2. call listen() and accept() on a port, such as 8888. We call
> > > > > > > > > accept() looply and without call recv() to make the data stay
> > > > > > > > > in the receive queue.
> > > > > > > > > 3. choose a client machine, and create 100 TCP connection
> > > > > > > > > to the 8888 port of the server. Then, every connection sends
> > > > > > > > > data about 1M.
> > > > > > > > > 4. we can see that some of the connection enter the 0-probe
> > > > > > > > > state, but some of them keep retrans again and again. As
> > > > > > > > > the server is up to the tcp_mem[2] and skb is dropped before
> > > > > > > > > the recv_buf full and the connection enter 0-probe state.
> > > > > > > > > Finially, some of these connection will timeout and break.
> > > > > > > > >
> > > > > > > > > With this series, all the 100 connections will enter 0-probe
> > > > > > > > > status and connection break won't happen. And the data
> > > > > > > > > trans will recover if we increase tcp_mem or call 'recv()'
> > > > > > > > > on the sockets in the server.
> > > > > > > > >
> > > > > > > > > > Also, such fundamental change would need IETF discussion first.
> > > > > > > > > > We do not want linux to cause network collapses just because billions
> > > > > > > > > > of devices send more zero probes.
> > > > > > > > >
> > > > > > > > > I think it maybe a good idea to make the connection enter
> > > > > > > > > 0-probe, rather than drop the skb silently. What 0-probe
> > > > > > > > > meaning is to wait for space available when the buffer of the
> > > > > > > > > receive queue is full. And maybe we can also use 0-probe
> > > > > > > > > when the "buffer" of "TCP protocol" (which means tcp_mem)
> > > > > > > > > is full?
> > > > > > > > >
> > > > > > > > > Am I right?
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Menglong Dong
> > > > > > > >
> > > > > > > > Thanks for describing the scenario in more detail. (Some kind of
> > > > > > > > packetdrill script or other program to reproduce this issue would be
> > > > > > > > nice, too, as Eric noted.)
> > > > > > > >
> > > > > > > > You mention in step (4.) above that some of the connections keep
> > > > > > > > retransmitting again and again. Are those connections receiving any
> > > > > > > > ACKs in response to their retransmissions? Perhaps they are receiving
> > > > > > > > dupacks?
> > > > > > >
> > > > > > > Actually, these packets are dropped without any reply, even dupacks.
> > > > > > > skb will be dropped directly when tcp_try_rmem_schedule()
> > > > > > > fails in tcp_data_queue(). That's reasonable, as it's
> > > > > > > useless to reply a ack to the sender, which will cause the sender
> > > > > > > fast retrans the packet, because we are out of memory now, and
> > > > > > > retrans can't solve the problem.
> > > > > >
> > > > > > I'm not sure I see the problem. If retransmits can't solve the
> > > > > > problem, then why are you proposing that data senders keep
> > > > > > retransmitting forever (via 0-window-probes) in this kind of scenario?
> > > > > >
> > > > >
> > > > > Because the connection will break if the count of
> > > > > retransmits up to tcp_retires2, but probe-0 can keep
> > > > > for a long time.
> > > >
> > > > I see. So it sounds like you agree that retransmits can solve the
> > > > problem, as long as the retransmits are using the zero-window probe
> > > > state machine (ICSK_TIME_PROBE0, tcp_probe_timer()), which continues
> > > > as long as the receiver is sending ACKs. And it sounds like when you
> > > > said "retrans can't solve the problem" you didn't literally mean that
> > > > retransmits can't solve the problem, but rather you meant that the RTO
> > > > state machine, specifically (ICSK_TIME_RETRANS,
> > > > tcp_retransmit_timer(), etc) can't solve the problem. I agree with
> > > > that assessment that in this scenario tcp_probe_timer() seems like a
> > > > solution but tcp_retransmit_timer() does not.
> > > >
> > >
> > > Yes, that is indeed what I want to express.
> > >
> > > > > > A single dupack without SACK blocks will not cause the sender to fast
> > > > > > retransmit. (Only 3 dupacks would trigger fast retransmit.)
> > > > > >
> > > > > > Three or more dupacks without SACK blocks will cause the sender to
> > > > > > fast retransmit the segment above SND.UNA once if the sender doesn't
> > > > > > have SACK support. But in this case AFAICT fast-retransmitting once is
> > > > > > a fine strategy, since the sender should keep retrying transmits (with
> > > > > > backoff) until the receiver potentially has memory available to
> > > > > > receive the packet.
> > > > > >
> > > > > > >
> > > > > > > > If so, then perhaps we could solve this problem without
> > > > > > > > depending on a violation of the TCP spec (which says the receive
> > > > > > > > window should not be retracted) in the following way: when a data
> > > > > > > > sender suffers a retransmission timeout, and retransmits the first
> > > > > > > > unacknowledged segment, and receives a dupack for SND.UNA instead of
> > > > > > > > an ACK covering the RTO-retransmitted segment, then the data sender
> > > > > > > > should estimate that the receiver doesn't have enough memory to buffer
> > > > > > > > the retransmitted packet. In that case, the data sender should enter
> > > > > > > > the 0-probe state and repeatedly set the ICSK_TIME_PROBE0 timer to
> > > > > > > > call tcp_probe_timer().
> > > > > > > >
> > > > > > > > Basically we could try to enhance the sender-side logic to try to
> > > > > > > > distinguish between two kinds of problems:
> > > > > > > >
> > > > > > > > (a) Repeated data packet loss caused by congestion, routing problems,
> > > > > > > > or connectivity problems. In this case, the data sender uses
> > > > > > > > ICSK_TIME_RETRANS and tcp_retransmit_timer(), and backs off and only
> > > > > > > > retries sysctl_tcp_retries2 times before timing out the connection
> > > > > > > >
> > > > > > > > (b) A receiver that is repeatedly sending dupacks but not ACKing
> > > > > > > > retransmitted data because it doesn't have any memory. In this case,
> > > > > > > > the data sender uses ICSK_TIME_PROBE0 and tcp_probe_timer(), and backs
> > > > > > > > off but keeps retrying as long as the data sender receives ACKs.
> > > > > > > >
> > > > > > >
> > > > > > > I'm not sure if this is an ideal method, as it may be not rigorous
> > > > > > > to conclude that the receiver is oom with dupacks. A packet can
> > > > > > > loss can also cause multi dupacks.
> > > > > >
> > > > > > When a data sender suffers an RTO and retransmits a single data
> > > > > > packet, it would be very rare for the data sender to receive multiple
> > > > > > pure dupacks without SACKs. This would only happen in the rare case
> > > > > > where (a) the connection did not have SACK enabled, and (b) there was
> > > > > > a hole in the received sequence space and there were still packets in
> > > > > > flight when the (spurioius) RTO fired.
> > > > > >
> > > > > > But if we want to be paranoid, then this new response could be written
> > > > > > to only trigger if SACK is enabled (the vast, vast majority of cases).
> > > > > > If SACK is enabled, and an RTO of a data packet starting at sequence
> > > > > > S1 results in the receiver sending only a dupack for S1 without SACK
> > > > > > blocks, then this clearly shows the issue is not packet loss but
> > > > > > suggests a receiver unable to buffer the given data packet, AFAICT.
> > > > > >
> > > > >
> > > > > Yeah, you are right on this point, multi pure dupacks can
> > > > > mean out of memory of the receiver. But we still need to
> > > > > know if the receiver recovers from OOM. Without window
> > > > > shrink, the window in the ack of zero-window probe packet
> > > > > is not zero on OOM.
> > > >
> > > > But do we need a protocol-violating zero-window in this case? Why not
> > > > use my approach suggested above: conveying the OOM condition by
> > > > sending an ACK but not ACKing the retransmitted packet?
> > > >
> > >
> > > I agree with you about the approach you mentioned
> > > about conveying the OOM condition. But that approach
> > > can't convey the recovery from OOM, can it?
> >
> > Yes, my suggested approach can convey the recovery from OOM. The data
> > receiver conveys the recovery from OOM by buffering and ACKing the
> > retransmitted data packet.
>
> Oh, I understand what you mean now. You are saying that
> retransmit that first packet in the retransmit queue instead
> of zero-window probe packet when OOM of the receiver,
> isn't it? In other word, retransmit the unacked data and ignore
> the tcp_retries2 when we find the receiver is in OOM state.

Yes. The idea would be to use a heuristic to estimate the receiver is
currently OOM and use ICSK_TIME_PROBE0 / tcp_probe_timer() /
tcp_write_wakeup() in this case instead of ICSK_TIME_RETRANS /
tcp_retransmit_timer().

> That's an option, and we can make the length of the data we
> send to 1 byte, which means we keep retransmitting the first
> byte that has not be acked in the retransmit queue.

I don't think it would be worth adding special-case code to only send
1 byte. If the data receiver is not OOM then for CPU and memory
efficiency reasons (as well as simplicity) the data sender should send
it a full MSS. So for those reasons I would suggest that in this
potential approach tcp_write_wakeup() should stay the same.

neal

2023-05-24 12:25:05

by Menglong Dong

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Tue, May 23, 2023 at 9:27 PM Neal Cardwell <[email protected]> wrote:
> > Oh, I understand what you mean now. You are saying that
> > retransmit that first packet in the retransmit queue instead
> > of zero-window probe packet when OOM of the receiver,
> > isn't it? In other word, retransmit the unacked data and ignore
> > the tcp_retries2 when we find the receiver is in OOM state.
>
> Yes. The idea would be to use a heuristic to estimate the receiver is
> currently OOM and use ICSK_TIME_PROBE0 / tcp_probe_timer() /
> tcp_write_wakeup() in this case instead of ICSK_TIME_RETRANS /
> tcp_retransmit_timer().
>

Well, I think that maybe we should use ICSK_TIME_PROBE0 /
tcp_probe_timer() / tcp_retransmit_skb(), isn't it?

What tcp_write_wakeup() does is send new data if the receive
window available, which means push new data into the retransmit
queue. However, what we need to do now is retransmit the first
packet in the rtx queue, isn't it?

In the tcp_ack(), we estimate that if the receiver is OOM and
mark the sk with OOM state, and raise ICSK_TIME_PROBE0.
When new data is acked, we leave the OOM state.

The OOM state can only happen when the rtx queue is not empty,
otherwise the tcp connection will enter normal zero-window probe
state. So when the timeout of ICSK_TIME_PROBE0, we need
retransmit the skb in the rtx queue.

tcp_write_wakeup() don't do the job the retransmit packet, but
send new data.

Am I right?

Thanks!
Menglong Dong

> > That's an option, and we can make the length of the data we
> > send to 1 byte, which means we keep retransmitting the first
> > byte that has not be acked in the retransmit queue.
>
> I don't think it would be worth adding special-case code to only send
> 1 byte. If the data receiver is not OOM then for CPU and memory
> efficiency reasons (as well as simplicity) the data sender should send
> it a full MSS. So for those reasons I would suggest that in this
> potential approach tcp_write_wakeup() should stay the same.
>
> neal

2023-05-24 15:08:31

by Neal Cardwell

[permalink] [raw]

Subject: Re: [PATCH net-next 3/3] net: tcp: handle window shrink properly

On Wed, May 24, 2023 at 8:16 AM Menglong Dong <[email protected]> wrote:
>
> On Tue, May 23, 2023 at 9:27 PM Neal Cardwell <[email protected]> wrote:
> > > Oh, I understand what you mean now. You are saying that
> > > retransmit that first packet in the retransmit queue instead
> > > of zero-window probe packet when OOM of the receiver,
> > > isn't it? In other word, retransmit the unacked data and ignore
> > > the tcp_retries2 when we find the receiver is in OOM state.
> >
> > Yes. The idea would be to use a heuristic to estimate the receiver is
> > currently OOM and use ICSK_TIME_PROBE0 / tcp_probe_timer() /
> > tcp_write_wakeup() in this case instead of ICSK_TIME_RETRANS /
> > tcp_retransmit_timer().
> >
>
> Well, I think that maybe we should use ICSK_TIME_PROBE0 /
> tcp_probe_timer() / tcp_retransmit_skb(), isn't it?
>
> What tcp_write_wakeup() does is send new data if the receive
> window available, which means push new data into the retransmit
> queue. However, what we need to do now is retransmit the first
> packet in the rtx queue, isn't it?
>
> In the tcp_ack(), we estimate that if the receiver is OOM and
> mark the sk with OOM state, and raise ICSK_TIME_PROBE0.
> When new data is acked, we leave the OOM state.
>
> The OOM state can only happen when the rtx queue is not empty,
> otherwise the tcp connection will enter normal zero-window probe
> state. So when the timeout of ICSK_TIME_PROBE0, we need
> retransmit the skb in the rtx queue.
>
> tcp_write_wakeup() don't do the job the retransmit packet, but
> send new data.
>
> Am I right?

Yes, that's a good point that the tcp_write_wakeup() code is not
currently a good fit for the OOM case, since it currently can only
send unsent data.

cheers,
neal

> Thanks!
> Menglong Dong
>
> > > That's an option, and we can make the length of the data we
> > > send to 1 byte, which means we keep retransmitting the first
> > > byte that has not be acked in the retransmit queue.
> >
> > I don't think it would be worth adding special-case code to only send
> > 1 byte. If the data receiver is not OOM then for CPU and memory
> > efficiency reasons (as well as simplicity) the data sender should send
> > it a full MSS. So for those reasons I would suggest that in this
> > potential approach tcp_write_wakeup() should stay the same.
> >
> > neal