Received: by 2002:a05:6358:700f:b0:131:369:b2a3 with SMTP id 15csp929559rwo; Wed, 2 Aug 2023 06:26:00 -0700 (PDT) X-Google-Smtp-Source: APBJJlHC+bNKkqR34mQveIUO0I+6UzuE2kKAudZvNsmKaObYh2o7wcHayw/xODClnaKFc3omtoUq X-Received: by 2002:a17:906:30d5:b0:99b:c2b2:e498 with SMTP id b21-20020a17090630d500b0099bc2b2e498mr4754531ejb.52.1690982759887; Wed, 02 Aug 2023 06:25:59 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1690982759; cv=none; d=google.com; s=arc-20160816; b=TrDkje53PvLAfjfnxWORmOeOCv1fEcR8TbLD7nuDtra6kYMqPCzNjVVa1hyxK/ujdF eGylomBnVLCSkp7wFG/Sl2NyGEDGyx6LyAeKeUL4gLPPVoeAEAYMVhuWrA2rEchSSYzw Tgf7Pr9gUBkpDfgizUJmLqstO7rLgIR75ebUVEfg2wDK7uOk100x/f5H2uiV+MxKzHFd JcFOJO4s/0vIazONpwo/f2zUMKNa4265pz67GkpZTed9Dr0cjpQg6h12euowb5JIudpV x2xQcrH+OmYU7zbV2WyrtFed3zo0YbxNA1FCrCQVp0+GwQx8n6O2he946Sudz/iD1qei mfkA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:reply-to:message-id :subject:cc:to:from:date; bh=akv6b1H+G/SwmONZ20OQ6ahWzZk9hTs6GdepAaDwzeU=; fh=2Go3P7myD0Wc1FQAwPw2LeaWIe+JAN6NCg0ixVxBvXs=; b=MlovOafbZQRE5q0Xz/J5EKpJCNLs8qcSa/e3PL0nNCmiQHCGQCBthXa6linpepSa0G klGPDL7oHrMAT3kt+maeaRD8QP9mr0VIzj873vcN7hHHASlfpeFpNolsyic2Nv9CK0Ec fGnt33KGxIxbbrpWt2HTKqBrQ7iRlh5W+q6x05wAqDD+Z99acWwThR1GQurBFs9FwQPK CuBsnMep3O7pbHTls/tow68dJMi2O0PwYvd1emrnnSZT4x4wfO4C4NI7hAc3zeEqiCsB hIhbnfCG4+tRf1UeMC3oCH5YglGLPhi/R/A9p23PlrPI7PwLdBzCUDD/2/7JSeQuca8g G/kQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id k13-20020a170906970d00b009828298ddd5si10909259ejx.259.2023.08.02.06.25.35; Wed, 02 Aug 2023 06:25:59 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234115AbjHBMdS (ORCPT + 99 others); Wed, 2 Aug 2023 08:33:18 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54648 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232364AbjHBMdQ (ORCPT ); Wed, 2 Aug 2023 08:33:16 -0400 Received: from out30-119.freemail.mail.aliyun.com (out30-119.freemail.mail.aliyun.com [115.124.30.119]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E64532701; Wed, 2 Aug 2023 05:33:12 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R111e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046050;MF=tonylu@linux.alibaba.com;NM=1;PH=DS;RN=14;SR=0;TI=SMTPD_---0Vov0kNP_1690979588; Received: from localhost(mailfrom:tonylu@linux.alibaba.com fp:SMTPD_---0Vov0kNP_1690979588) by smtp.aliyun-inc.com; Wed, 02 Aug 2023 20:33:09 +0800 Date: Wed, 2 Aug 2023 20:33:04 +0800 From: Tony Lu To: Gerd Bayer Cc: Wenjia Zhang , Jan Karcher , Paolo Abeni , Karsten Graul , "D . Wythe" , Wen Gu , "David S . Miller" , Eric Dumazet , Jakub Kicinski , linux-s390@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH net 1/2] net/smc: Fix setsockopt and sysctl to specify same buffer size again Message-ID: Reply-To: Tony Lu References: <20230802093313.1501605-1-gbayer@linux.ibm.com> <20230802093313.1501605-2-gbayer@linux.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20230802093313.1501605-2-gbayer@linux.ibm.com> X-Spam-Status: No, score=-9.9 required=5.0 tests=BAYES_00, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,UNPARSEABLE_RELAY,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 02, 2023 at 11:33:12AM +0200, Gerd Bayer wrote: > Commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock > and make them tunable") introduced the net.smc.rmem and net.smc.wmem > sysctls to specify the size of buffers to be used for SMC type > connections. This created a regression for users that specified the > buffer size via setsockopt() as the effective buffer size was now > doubled. The idea of the original patch is going to unbind buffer from clcsock. It's great here to determine whether to use original value or doubled. > > Re-introduce the division by 2 in the SMC buffer create code and level > this out by duplicating the net.smc.[rw]mem values used for initializing > sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both > methods (setsockopt or sysctl) the effective buffer size that they > expect. > > Initialize net.smc.[rw]mem from its own constant of 64kB, respectively. > Internal performance tests show that this value is a good compromise > between throughput/latency and memory consumption. Also, this decouples > it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the > module for SMC protocol was loaded. Check that no more than INT_MAX / 2 > is assigned to net.smc.[rw]mem, in order to avoid any overflow condition > when that is doubled for use in sk_sndbuf or sk_rcvbuf. 64kB may be small in our environment, but that's okay to change it with systemd during boot. Different networking environment, such as with higher latency, should have different buffer size. So two tunable knobs are good enough. > > While at it, drop the confusing sk_buf_size variable from > __smc_buf_create and name "compressed" buffer size variables more > consistently. > > Background: > > Before the commit mentioned above, SMC's buffer allocator in > __smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf > value as initial value to search for appropriate buffers. If the search > resorted to using a bigger buffer when all buffers of the specified > size were busy, the duplicate of the used effective buffer size is > stored back to sk_rcvbuf/sk_sndbuf. > > When available, buffers of exactly the size that a user had specified as > input to setsockopt() were used, despite setsockopt()'s documentation in > "man 7 socket" talking of a mandatory duplication: > > [...] > SO_SNDBUF > Sets or gets the maximum socket send buffer in bytes. > The kernel doubles this value (to allow space for book‐ > keeping overhead) when it is set using setsockopt(2), > and this doubled value is returned by getsockopt(2). > The default value is set by the > /proc/sys/net/core/wmem_default file and the maximum > allowed value is set by the /proc/sys/net/core/wmem_max > file. The minimum (doubled) value for this option is > 2048. > [...] > > Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") > Co-developed-by: Jan Karcher > Signed-off-by: Jan Karcher > Signed-off-by: Gerd Bayer > Reviewed-by: Wenjia Zhang Reviewed-by: Tony Lu > --- > net/smc/af_smc.c | 4 ++-- > net/smc/smc.h | 2 +- > net/smc/smc_clc.c | 4 ++-- > net/smc/smc_core.c | 25 ++++++++++++------------- > net/smc/smc_sysctl.c | 10 ++++++++-- > 5 files changed, 25 insertions(+), 20 deletions(-) > > diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c > index a7f887d91d89..1fcf1e42474a 100644 > --- a/net/smc/af_smc.c > +++ b/net/smc/af_smc.c > @@ -378,8 +378,8 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock, > sk->sk_state = SMC_INIT; > sk->sk_destruct = smc_destruct; > sk->sk_protocol = protocol; > - WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(net->smc.sysctl_wmem)); > - WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(net->smc.sysctl_rmem)); > + WRITE_ONCE(sk->sk_sndbuf, 2 * READ_ONCE(net->smc.sysctl_wmem)); > + WRITE_ONCE(sk->sk_rcvbuf, 2 * READ_ONCE(net->smc.sysctl_rmem)); > smc = smc_sk(sk); > INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work); > INIT_WORK(&smc->connect_work, smc_connect_work); > diff --git a/net/smc/smc.h b/net/smc/smc.h > index 2eeea4cdc718..1f2b912c43d1 100644 > --- a/net/smc/smc.h > +++ b/net/smc/smc.h > @@ -161,7 +161,7 @@ struct smc_connection { > > struct smc_buf_desc *sndbuf_desc; /* send buffer descriptor */ > struct smc_buf_desc *rmb_desc; /* RMBE descriptor */ > - int rmbe_size_short;/* compressed notation */ > + int rmbe_size_comp; /* compressed notation */ nit: spaces are used here, better to use two tabs. > int rmbe_update_limit; > /* lower limit for consumer > * cursor update > diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c > index b9b8b07aa702..c90d9e5dda54 100644 > --- a/net/smc/smc_clc.c > +++ b/net/smc/smc_clc.c > @@ -1007,7 +1007,7 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc, > clc->d0.gid = > conn->lgr->smcd->ops->get_local_gid(conn->lgr->smcd); > clc->d0.token = conn->rmb_desc->token; > - clc->d0.dmbe_size = conn->rmbe_size_short; > + clc->d0.dmbe_size = conn->rmbe_size_comp; > clc->d0.dmbe_idx = 0; > memcpy(&clc->d0.linkid, conn->lgr->id, SMC_LGR_ID_SIZE); > if (version == SMC_V1) { > @@ -1050,7 +1050,7 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc, > clc->r0.qp_mtu = min(link->path_mtu, link->peer_mtu); > break; > } > - clc->r0.rmbe_size = conn->rmbe_size_short; > + clc->r0.rmbe_size = conn->rmbe_size_comp; > clc->r0.rmb_dma_addr = conn->rmb_desc->is_vm ? > cpu_to_be64((uintptr_t)conn->rmb_desc->cpu_addr) : > cpu_to_be64((u64)sg_dma_address > diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c > index 3f465faf2b68..6b78075404d7 100644 > --- a/net/smc/smc_core.c > +++ b/net/smc/smc_core.c > @@ -2309,31 +2309,30 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) > struct smc_connection *conn = &smc->conn; > struct smc_link_group *lgr = conn->lgr; > struct list_head *buf_list; > - int bufsize, bufsize_short; > + int bufsize, bufsize_comp; > struct rw_semaphore *lock; /* lock buffer list */ > bool is_dgraded = false; > - int sk_buf_size; > > if (is_rmb) > /* use socket recv buffer size (w/o overhead) as start value */ > - sk_buf_size = smc->sk.sk_rcvbuf; > + bufsize = smc->sk.sk_rcvbuf / 2; > else > /* use socket send buffer size (w/o overhead) as start value */ > - sk_buf_size = smc->sk.sk_sndbuf; > + bufsize = smc->sk.sk_sndbuf / 2; > > - for (bufsize_short = smc_compress_bufsize(sk_buf_size, is_smcd, is_rmb); > - bufsize_short >= 0; bufsize_short--) { > + for (bufsize_comp = smc_compress_bufsize(bufsize, is_smcd, is_rmb); > + bufsize_comp >= 0; bufsize_comp--) { > if (is_rmb) { > lock = &lgr->rmbs_lock; > - buf_list = &lgr->rmbs[bufsize_short]; > + buf_list = &lgr->rmbs[bufsize_comp]; > } else { > lock = &lgr->sndbufs_lock; > - buf_list = &lgr->sndbufs[bufsize_short]; > + buf_list = &lgr->sndbufs[bufsize_comp]; > } > - bufsize = smc_uncompress_bufsize(bufsize_short); > + bufsize = smc_uncompress_bufsize(bufsize_comp); > > /* check for reusable slot in the link group */ > - buf_desc = smc_buf_get_slot(bufsize_short, lock, buf_list); > + buf_desc = smc_buf_get_slot(bufsize_comp, lock, buf_list); > if (buf_desc) { > buf_desc->is_dma_need_sync = 0; > SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize); > @@ -2377,8 +2376,8 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) > > if (is_rmb) { > conn->rmb_desc = buf_desc; > - conn->rmbe_size_short = bufsize_short; > - smc->sk.sk_rcvbuf = bufsize; > + conn->rmbe_size_comp = bufsize_comp; > + smc->sk.sk_rcvbuf = bufsize * 2; > atomic_set(&conn->bytes_to_rcv, 0); > conn->rmbe_update_limit = > smc_rmb_wnd_update_limit(buf_desc->len); > @@ -2386,7 +2385,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) > smc_ism_set_conn(conn); /* map RMB/smcd_dev to conn */ > } else { > conn->sndbuf_desc = buf_desc; > - smc->sk.sk_sndbuf = bufsize; > + smc->sk.sk_sndbuf = bufsize * 2; > atomic_set(&conn->sndbuf_space, bufsize); > } > return 0; > diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c > index b6f79fabb9d3..0b2a957ca5f5 100644 > --- a/net/smc/smc_sysctl.c > +++ b/net/smc/smc_sysctl.c > @@ -21,6 +21,10 @@ > > static int min_sndbuf = SMC_BUF_MIN_SIZE; > static int min_rcvbuf = SMC_BUF_MIN_SIZE; > +static int max_sndbuf = INT_MAX / 2; > +static int max_rcvbuf = INT_MAX / 2; > +static const int net_smc_wmem_init = (64 * 1024); > +static const int net_smc_rmem_init = (64 * 1024); > > static struct ctl_table smc_table[] = { > { > @@ -53,6 +57,7 @@ static struct ctl_table smc_table[] = { > .mode = 0644, > .proc_handler = proc_dointvec_minmax, > .extra1 = &min_sndbuf, > + .extra2 = &max_sndbuf, > }, > { > .procname = "rmem", > @@ -61,6 +66,7 @@ static struct ctl_table smc_table[] = { > .mode = 0644, > .proc_handler = proc_dointvec_minmax, > .extra1 = &min_rcvbuf, > + .extra2 = &max_rcvbuf, > }, > { } > }; > @@ -88,8 +94,8 @@ int __net_init smc_sysctl_net_init(struct net *net) > net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE; > net->smc.sysctl_smcr_buf_type = SMCR_PHYS_CONT_BUFS; > net->smc.sysctl_smcr_testlink_time = SMC_LLC_TESTLINK_DEFAULT_TIME; > - WRITE_ONCE(net->smc.sysctl_wmem, READ_ONCE(net->ipv4.sysctl_tcp_wmem[1])); > - WRITE_ONCE(net->smc.sysctl_rmem, READ_ONCE(net->ipv4.sysctl_tcp_rmem[1])); > + WRITE_ONCE(net->smc.sysctl_wmem, net_smc_wmem_init); > + WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init); > > return 0; > > -- > 2.41.0