From: Neal Cardwell
Date: Thu, 3 Nov 2022 09:54:59 -0400
Subject: Re: [patch net v3] tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent
To: "luwei (O)"
Cc: edumazet@google.com, davem@davemloft.net, yoshfuji@linux-ipv6.org,
    dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com, xemul@parallels.com,
    netdev@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20221102132811.70858-1-luwei32@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 2, 2022 at 10:11 PM luwei (O) wrote:
>
>
> On 2022/11/2 10:46 PM, Neal Cardwell wrote:
> > On Wed, Nov 2, 2022 at 8:23 AM Lu Wei wrote:
> >> If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
> >> of TCPOPT_SACK_PERM is called to enable sack after data is sent
> >> and before data is acked, ...
> > This "before data is acked" phrase does not quite seem to match the
> > sequence below, AFAICT?
> >
> > How about something like:
> >
> > If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM
> > is called to enable SACK after data is sent and the data sender receives a
> > dupack, ...
> yes, thanks for the suggestion
>
> >
> >> ... it will trigger a warning in function
> >> tcp_verify_left_out() as follows:
> >>
> >> ============================================
> >> WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
> >> tcp_timeout_mark_lost+0x154/0x160
> >> tcp_enter_loss+0x2b/0x290
> >> tcp_retransmit_timer+0x50b/0x640
> >> tcp_write_timer_handler+0x1c8/0x340
> >> tcp_write_timer+0xe5/0x140
> >> call_timer_fn+0x3a/0x1b0
> >> __run_timers.part.0+0x1bf/0x2d0
> >> run_timer_softirq+0x43/0xb0
> >> __do_softirq+0xfd/0x373
> >> __irq_exit_rcu+0xf6/0x140
> >>
> >> The warning is caused in the following steps:
> >> 1. a socket named socketA is created
> >> 2. socketA enters repair mode without building a connection
> >> 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
> >>    directly
> >> 4. socketA leaves repair mode
> >> 5. socketA calls sendmsg() to send data, packets_out and sacked_out (dup
> >>    acks received) increase
> >> 6. socketA enters repair mode again
> >> 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
> >> 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
> >>    increases
> >> 9. sacked_out + lost_out > packets_out triggers since lost_out and
> >>    sacked_out increase repeatedly
> >>
> >> In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
> >> Step 7 does not happen, and the warning will not be triggered. As suggested by
> >> Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
> >> already sent. So this patch checks tp->segs_out: TCP_REPAIR_OPTIONS
> >> can be set only if tp->segs_out is 0.
> >>
> >> The socket-tcp tests in CRIU have been run as follows:
> >> $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \
> >>        --ignore-taint
> >>
> >> socket-tcp* represents all socket-tcp tests in test/zdtm/static/.
> >>
> >> Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters")
> >> Signed-off-by: Lu Wei
> >> ---
> >>  net/ipv4/tcp.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >> index ef14efa1fb70..1f5cc32cf0cc 100644
> >> --- a/net/ipv4/tcp.c
> >> +++ b/net/ipv4/tcp.c
> >> @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
> >>         case TCP_REPAIR_OPTIONS:
> >>                 if (!tp->repair)
> >>                         err = -EINVAL;
> >> -               else if (sk->sk_state == TCP_ESTABLISHED)
> >> +               else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out)
> > The tp->segs_out field is only 32 bits wide. By my math, at 200
> > Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a
> > caller could get unlucky or carefully sequence its call to
> > TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the
> > accounting and trigger the kernel warning.
> >
> > How about using some other method to determine if this is safe?
> > Perhaps using tp->bytes_sent, which is a 64-bit field, which by my
> > math would take 23 years to wrap at 200 Gbit/sec?
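
As a quick check of the two wrap estimates quoted above (roughly 260 seconds
for a 32-bit packet counter and roughly 23 years for a 64-bit byte counter at
200 Gbit/sec), here is a small user-space C sketch. It is not kernel code: the
function names are hypothetical, and it assumes a saturated link carrying only
full 1500-byte packets, which is the worst case for wrapping.

```c
#include <assert.h>

/* 2^bits as a double; avoids pulling in <math.h> for this small sketch. */
static double pow2(int bits)
{
    double v = 1.0;
    while (bits-- > 0)
        v *= 2.0;
    return v;
}

/* Seconds until a 32-bit per-packet counter (like tp->segs_out) wraps,
 * assuming every packet is mtu_bytes long and the link is saturated. */
double segs_out_wrap_secs(double link_bits_per_sec, double mtu_bytes)
{
    assert(link_bits_per_sec > 0 && mtu_bytes > 0);
    double pkts_per_sec = link_bits_per_sec / 8.0 / mtu_bytes;
    return pow2(32) / pkts_per_sec;
}

/* Years until a 64-bit byte counter (like tp->bytes_sent) wraps on a
 * saturated link, using 365.25-day years. */
double bytes_sent_wrap_years(double link_bits_per_sec)
{
    assert(link_bits_per_sec > 0);
    double bytes_per_sec = link_bits_per_sec / 8.0;
    double secs = pow2(64) / bytes_per_sec;
    return secs / (365.25 * 24.0 * 3600.0);
}
```

Under those assumptions, segs_out_wrap_secs(200e9, 1500) comes out at about
257.7 seconds and bytes_sent_wrap_years(200e9) at about 23.4 years, which
agrees with both figures in the quoted text.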
> >
> > If we're more paranoid about wrapping we could also check
> > tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either
> > tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more
> > paranoid I suppose we could have a special new bit to track whether
> > we've ever sent something, but that probably seems like overkill?)
> >
> > neal
> > .
>
> I didn't notice that a u32 can easily wrap at very high network throughput,
> thank you neal.
>
> But tp->packets_out should not be used because tp->packets_out can decrease
> when the expected ack is received, so it can decrease to 0 and this is the
> common condition.

To say tp->packets_out should not be used is a bit strong. :-)
Obviously packets_out decreases when packets are ACKed. The point of
checking both tp->bytes_sent and tp->packets_out would be if we are
paranoid enough that we want to prevent this warning in the case where
tp->bytes_sent wraps and becomes zero. If tp->bytes_sent wraps and is
zero, we will be saved from hitting this warning if we deny the
request to set TCP_REPAIR_OPTIONS if tp->packets_out is non-zero.
(Because we can only hit this warning if tp->sacked_out is non-zero,
and tp->sacked_out should only be non-zero if tp->packets_out is
non-zero.)

neal
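
The two-field check described above can be sketched in user-space C as
follows. This is a model for illustration only, not a proposed patch:
struct tp_model and both function names are hypothetical, and the fields
merely mirror the tcp_sock counters named in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of the counters discussed in this thread. */
struct tp_model {
    uint64_t bytes_sent;   /* total bytes ever sent; 64-bit, could in theory wrap to 0 */
    uint32_t packets_out;  /* packets sent and not yet ACKed */
    uint32_t sacked_out;   /* in-flight packets counted as SACKed or dup-ACKed */
    uint32_t lost_out;     /* in-flight packets marked lost */
};

/* The condition whose violation produces the warning in the trace:
 * sacked_out + lost_out must not exceed packets_out. */
bool left_out_is_consistent(const struct tp_model *tp)
{
    assert(tp);
    return (uint64_t)tp->sacked_out + tp->lost_out <= tp->packets_out;
}

/* Allow TCP_REPAIR_OPTIONS only if nothing was ever sent (bytes_sent == 0)
 * and, as a backstop for a wrapped bytes_sent, nothing is in flight. */
bool repair_options_allowed(const struct tp_model *tp)
{
    assert(tp);
    return tp->bytes_sent == 0 && tp->packets_out == 0;
}
```

In this model a socket with bytes_sent == 0 (as after a wrap) but
packets_out == 2 and sacked_out == 1 is still refused, which is the case the
packets_out check is meant to cover. A socket whose data has all been ACKed
(packets_out == 0) is refused by the bytes_sent check alone.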