2023-08-02 10:44:37

by Gerd Bayer

[permalink] [raw]
Subject: [PATCH net 0/2] net/smc: Fix effective buffer size

Hi all,

commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock
and make them tunable") started to derive the effective buffer size for
SMC connections inconsistently in case a TCP fallback was used and
memory consumption of SMC with the default settings was doubled when
a connection negotiated SMC. That was not what we want.

This series consolidates the resulting effective buffer size that is
used with SMC sockets, which is based on Jan karcher's effort (see
[1]). For all TCP exchanges (in particular in case of a fall back when
no SMC connection was possible) the values from net.ipv4.tcp_[rw]mem
are used. If SMC succeeds in establishing a SMC connection, the newly
introduced values from net.smc.[rw]mem are used.

net.smc.[rw]mem is initialized to 64kB, respectively. Internal test
have show this to be a good compromise between throughput/latency
and memory consumption. Also net.smc.[rw]mem is now decoupled completely
from any tuning through net.ipv4.tcp_[rw]mem.

If a user chose to tune a socket's receive or send buffer size with
setsockopt, this tuning is now consistently applied to either fall-back
TCP or proper SMC connections over the socket.

Thanks,
Gerd


[1] https://lore.kernel.org/netdev/[email protected]



Gerd Bayer (2):
net/smc: Fix setsockopt and sysctl to specify same buffer size again
net/smc: Use correct buffer sizes when switching between TCP and SMC

net/smc/af_smc.c | 80 +++++++++++++++++++++++++++++++-------------
net/smc/smc.h | 2 +-
net/smc/smc_clc.c | 4 +--
net/smc/smc_core.c | 25 +++++++-------
net/smc/smc_sysctl.c | 10 ++++--
5 files changed, 79 insertions(+), 42 deletions(-)

--
2.41.0



2023-08-02 12:11:57

by Gerd Bayer

[permalink] [raw]
Subject: [PATCH net 1/2] net/smc: Fix setsockopt and sysctl to specify same buffer size again

Commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock
and make them tunable") introduced the net.smc.rmem and net.smc.wmem
sysctls to specify the size of buffers to be used for SMC type
connections. This created a regression for users that specified the
buffer size via setsockopt() as the effective buffer size was now
doubled.

Re-introduce the division by 2 in the SMC buffer create code and level
this out by duplicating the net.smc.[rw]mem values used for initializing
sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both
methods (setsockopt or sysctl) the effective buffer size that they
expect.

Initialize net.smc.[rw]mem from its own constant of 64kB, respectively.
Internal performance tests show that this value is a good compromise
between throughput/latency and memory consumption. Also, this decouples
it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the
module for SMC protocol was loaded. Check that no more than INT_MAX / 2
is assigned to net.smc.[rw]mem, in order to avoid any overflow condition
when that is doubled for use in sk_sndbuf or sk_rcvbuf.

While at it, drop the confusing sk_buf_size variable from
__smc_buf_create and name "compressed" buffer size variables more
consistently.

Background:

Before the commit mentioned above, SMC's buffer allocator in
__smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf
value as initial value to search for appropriate buffers. If the search
resorted to using a bigger buffer when all buffers of the specified
size were busy, the duplicate of the used effective buffer size is
stored back to sk_rcvbuf/sk_sndbuf.

When available, buffers of exactly the size that a user had specified as
input to setsockopt() were used, despite setsockopt()'s documentation in
"man 7 socket" talking of a mandatory duplication:

[...]
SO_SNDBUF
Sets or gets the maximum socket send buffer in bytes.
The kernel doubles this value (to allow space for book‐
keeping overhead) when it is set using setsockopt(2),
and this doubled value is returned by getsockopt(2).
The default value is set by the
/proc/sys/net/core/wmem_default file and the maximum
allowed value is set by the /proc/sys/net/core/wmem_max
file. The minimum (doubled) value for this option is
2048.
[...]

Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Co-developed-by: Jan Karcher <[email protected]>
Signed-off-by: Jan Karcher <[email protected]>
Signed-off-by: Gerd Bayer <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
---
net/smc/af_smc.c | 4 ++--
net/smc/smc.h | 2 +-
net/smc/smc_clc.c | 4 ++--
net/smc/smc_core.c | 25 ++++++++++++-------------
net/smc/smc_sysctl.c | 10 ++++++++--
5 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index a7f887d91d89..1fcf1e42474a 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -378,8 +378,8 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock,
sk->sk_state = SMC_INIT;
sk->sk_destruct = smc_destruct;
sk->sk_protocol = protocol;
- WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(net->smc.sysctl_wmem));
- WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(net->smc.sysctl_rmem));
+ WRITE_ONCE(sk->sk_sndbuf, 2 * READ_ONCE(net->smc.sysctl_wmem));
+ WRITE_ONCE(sk->sk_rcvbuf, 2 * READ_ONCE(net->smc.sysctl_rmem));
smc = smc_sk(sk);
INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work);
INIT_WORK(&smc->connect_work, smc_connect_work);
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 2eeea4cdc718..1f2b912c43d1 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -161,7 +161,7 @@ struct smc_connection {

struct smc_buf_desc *sndbuf_desc; /* send buffer descriptor */
struct smc_buf_desc *rmb_desc; /* RMBE descriptor */
- int rmbe_size_short;/* compressed notation */
+ int rmbe_size_comp; /* compressed notation */
int rmbe_update_limit;
/* lower limit for consumer
* cursor update
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index b9b8b07aa702..c90d9e5dda54 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -1007,7 +1007,7 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc,
clc->d0.gid =
conn->lgr->smcd->ops->get_local_gid(conn->lgr->smcd);
clc->d0.token = conn->rmb_desc->token;
- clc->d0.dmbe_size = conn->rmbe_size_short;
+ clc->d0.dmbe_size = conn->rmbe_size_comp;
clc->d0.dmbe_idx = 0;
memcpy(&clc->d0.linkid, conn->lgr->id, SMC_LGR_ID_SIZE);
if (version == SMC_V1) {
@@ -1050,7 +1050,7 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc,
clc->r0.qp_mtu = min(link->path_mtu, link->peer_mtu);
break;
}
- clc->r0.rmbe_size = conn->rmbe_size_short;
+ clc->r0.rmbe_size = conn->rmbe_size_comp;
clc->r0.rmb_dma_addr = conn->rmb_desc->is_vm ?
cpu_to_be64((uintptr_t)conn->rmb_desc->cpu_addr) :
cpu_to_be64((u64)sg_dma_address
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 3f465faf2b68..6b78075404d7 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -2309,31 +2309,30 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
struct smc_connection *conn = &smc->conn;
struct smc_link_group *lgr = conn->lgr;
struct list_head *buf_list;
- int bufsize, bufsize_short;
+ int bufsize, bufsize_comp;
struct rw_semaphore *lock; /* lock buffer list */
bool is_dgraded = false;
- int sk_buf_size;

if (is_rmb)
/* use socket recv buffer size (w/o overhead) as start value */
- sk_buf_size = smc->sk.sk_rcvbuf;
+ bufsize = smc->sk.sk_rcvbuf / 2;
else
/* use socket send buffer size (w/o overhead) as start value */
- sk_buf_size = smc->sk.sk_sndbuf;
+ bufsize = smc->sk.sk_sndbuf / 2;

- for (bufsize_short = smc_compress_bufsize(sk_buf_size, is_smcd, is_rmb);
- bufsize_short >= 0; bufsize_short--) {
+ for (bufsize_comp = smc_compress_bufsize(bufsize, is_smcd, is_rmb);
+ bufsize_comp >= 0; bufsize_comp--) {
if (is_rmb) {
lock = &lgr->rmbs_lock;
- buf_list = &lgr->rmbs[bufsize_short];
+ buf_list = &lgr->rmbs[bufsize_comp];
} else {
lock = &lgr->sndbufs_lock;
- buf_list = &lgr->sndbufs[bufsize_short];
+ buf_list = &lgr->sndbufs[bufsize_comp];
}
- bufsize = smc_uncompress_bufsize(bufsize_short);
+ bufsize = smc_uncompress_bufsize(bufsize_comp);

/* check for reusable slot in the link group */
- buf_desc = smc_buf_get_slot(bufsize_short, lock, buf_list);
+ buf_desc = smc_buf_get_slot(bufsize_comp, lock, buf_list);
if (buf_desc) {
buf_desc->is_dma_need_sync = 0;
SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize);
@@ -2377,8 +2376,8 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)

if (is_rmb) {
conn->rmb_desc = buf_desc;
- conn->rmbe_size_short = bufsize_short;
- smc->sk.sk_rcvbuf = bufsize;
+ conn->rmbe_size_comp = bufsize_comp;
+ smc->sk.sk_rcvbuf = bufsize * 2;
atomic_set(&conn->bytes_to_rcv, 0);
conn->rmbe_update_limit =
smc_rmb_wnd_update_limit(buf_desc->len);
@@ -2386,7 +2385,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
smc_ism_set_conn(conn); /* map RMB/smcd_dev to conn */
} else {
conn->sndbuf_desc = buf_desc;
- smc->sk.sk_sndbuf = bufsize;
+ smc->sk.sk_sndbuf = bufsize * 2;
atomic_set(&conn->sndbuf_space, bufsize);
}
return 0;
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
index b6f79fabb9d3..0b2a957ca5f5 100644
--- a/net/smc/smc_sysctl.c
+++ b/net/smc/smc_sysctl.c
@@ -21,6 +21,10 @@

static int min_sndbuf = SMC_BUF_MIN_SIZE;
static int min_rcvbuf = SMC_BUF_MIN_SIZE;
+static int max_sndbuf = INT_MAX / 2;
+static int max_rcvbuf = INT_MAX / 2;
+static const int net_smc_wmem_init = (64 * 1024);
+static const int net_smc_rmem_init = (64 * 1024);

static struct ctl_table smc_table[] = {
{
@@ -53,6 +57,7 @@ static struct ctl_table smc_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = &min_sndbuf,
+ .extra2 = &max_sndbuf,
},
{
.procname = "rmem",
@@ -61,6 +66,7 @@ static struct ctl_table smc_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = &min_rcvbuf,
+ .extra2 = &max_rcvbuf,
},
{ }
};
@@ -88,8 +94,8 @@ int __net_init smc_sysctl_net_init(struct net *net)
net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
net->smc.sysctl_smcr_buf_type = SMCR_PHYS_CONT_BUFS;
net->smc.sysctl_smcr_testlink_time = SMC_LLC_TESTLINK_DEFAULT_TIME;
- WRITE_ONCE(net->smc.sysctl_wmem, READ_ONCE(net->ipv4.sysctl_tcp_wmem[1]));
- WRITE_ONCE(net->smc.sysctl_rmem, READ_ONCE(net->ipv4.sysctl_tcp_rmem[1]));
+ WRITE_ONCE(net->smc.sysctl_wmem, net_smc_wmem_init);
+ WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init);

return 0;

--
2.41.0


2023-08-02 12:23:33

by Gerd Bayer

[permalink] [raw]
Subject: [PATCH net 2/2] net/smc: Use correct buffer sizes when switching between TCP and SMC

Tuning of the effective buffer size through setsockopts was working for
SMC traffic only but not for TCP fall-back connections even before
commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and
make them tunable"). That change made it apparent that TCP fall-back
connections would use net.smc.[rw]mem as buffer size instead of
net.ipv4_tcp_[rw]mem.

Amend the code that copies attributes between the (TCP) clcsock and the
SMC socket and adjust buffer sizes appropriately:
- Copy over sk_userlocks so that both sockets agree on whether tuning
via setsockopt is active.
- When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified with
setsockopt. Otherwise, use the sysctl value for TCP/IPv4.
- Likewise, use either values from setsockopt or from sysctl for SMC
(duplicated) on successful SMC connect.

In smc_tcp_listen_work() drop the explicit copy of buffer sizes as that
is taken care of by the attribute copy.

Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Signed-off-by: Gerd Bayer <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-by: Jan Karcher <[email protected]>

---
net/smc/af_smc.c | 76 ++++++++++++++++++++++++++++++++++--------------
1 file changed, 54 insertions(+), 22 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 1fcf1e42474a..1c8ed19041d7 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -436,13 +436,63 @@ static int smc_bind(struct socket *sock, struct sockaddr *uaddr,
return rc;
}

+/* copy only relevant settings and flags of SOL_SOCKET level from smc to
+ * clc socket (since smc is not called for these options from net/core)
+ */
+
+#define SK_FLAGS_SMC_TO_CLC ((1UL << SOCK_URGINLINE) | \
+ (1UL << SOCK_KEEPOPEN) | \
+ (1UL << SOCK_LINGER) | \
+ (1UL << SOCK_BROADCAST) | \
+ (1UL << SOCK_TIMESTAMP) | \
+ (1UL << SOCK_DBG) | \
+ (1UL << SOCK_RCVTSTAMP) | \
+ (1UL << SOCK_RCVTSTAMPNS) | \
+ (1UL << SOCK_LOCALROUTE) | \
+ (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE) | \
+ (1UL << SOCK_RXQ_OVFL) | \
+ (1UL << SOCK_WIFI_STATUS) | \
+ (1UL << SOCK_NOFCS) | \
+ (1UL << SOCK_FILTER_LOCKED) | \
+ (1UL << SOCK_TSTAMP_NEW))
+
+/* if set, use value set by setsockopt() - else use IPv4 or SMC sysctl value */
+static void smc_adjust_sock_bufsizes(struct sock *nsk, struct sock *osk,
+ unsigned long mask)
+{
+ struct net *nnet;
+
+ nnet = nsk->sk_net.net;
+
+ nsk->sk_userlocks = osk->sk_userlocks;
+
+ if (osk->sk_userlocks & SOCK_SNDBUF_LOCK) {
+ nsk->sk_sndbuf = osk->sk_sndbuf;
+ } else {
+ if (mask == SK_FLAGS_SMC_TO_CLC)
+ WRITE_ONCE(nsk->sk_sndbuf,
+ READ_ONCE(nnet->ipv4.sysctl_tcp_wmem[1]));
+ else
+ WRITE_ONCE(nsk->sk_sndbuf,
+ 2 * READ_ONCE(nnet->smc.sysctl_wmem));
+ }
+ if (osk->sk_userlocks & SOCK_RCVBUF_LOCK) {
+ nsk->sk_rcvbuf = osk->sk_rcvbuf;
+ } else {
+ if (mask == SK_FLAGS_SMC_TO_CLC)
+ WRITE_ONCE(nsk->sk_rcvbuf,
+ READ_ONCE(nnet->ipv4.sysctl_tcp_rmem[1]));
+ else
+ WRITE_ONCE(nsk->sk_rcvbuf,
+ 2 * READ_ONCE(nnet->smc.sysctl_rmem));
+ }
+}
+
static void smc_copy_sock_settings(struct sock *nsk, struct sock *osk,
unsigned long mask)
{
/* options we don't get control via setsockopt for */
nsk->sk_type = osk->sk_type;
- nsk->sk_sndbuf = osk->sk_sndbuf;
- nsk->sk_rcvbuf = osk->sk_rcvbuf;
nsk->sk_sndtimeo = osk->sk_sndtimeo;
nsk->sk_rcvtimeo = osk->sk_rcvtimeo;
nsk->sk_mark = osk->sk_mark;
@@ -453,26 +503,10 @@ static void smc_copy_sock_settings(struct sock *nsk, struct sock *osk,

nsk->sk_flags &= ~mask;
nsk->sk_flags |= osk->sk_flags & mask;
+
+ smc_adjust_sock_bufsizes(nsk, osk, mask);
}

-#define SK_FLAGS_SMC_TO_CLC ((1UL << SOCK_URGINLINE) | \
- (1UL << SOCK_KEEPOPEN) | \
- (1UL << SOCK_LINGER) | \
- (1UL << SOCK_BROADCAST) | \
- (1UL << SOCK_TIMESTAMP) | \
- (1UL << SOCK_DBG) | \
- (1UL << SOCK_RCVTSTAMP) | \
- (1UL << SOCK_RCVTSTAMPNS) | \
- (1UL << SOCK_LOCALROUTE) | \
- (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE) | \
- (1UL << SOCK_RXQ_OVFL) | \
- (1UL << SOCK_WIFI_STATUS) | \
- (1UL << SOCK_NOFCS) | \
- (1UL << SOCK_FILTER_LOCKED) | \
- (1UL << SOCK_TSTAMP_NEW))
-/* copy only relevant settings and flags of SOL_SOCKET level from smc to
- * clc socket (since smc is not called for these options from net/core)
- */
static void smc_copy_sock_settings_to_clc(struct smc_sock *smc)
{
smc_copy_sock_settings(smc->clcsock->sk, &smc->sk, SK_FLAGS_SMC_TO_CLC);
@@ -2479,8 +2513,6 @@ static void smc_tcp_listen_work(struct work_struct *work)
sock_hold(lsk); /* sock_put in smc_listen_work */
INIT_WORK(&new_smc->smc_listen_work, smc_listen_work);
smc_copy_sock_settings_to_smc(new_smc);
- new_smc->sk.sk_sndbuf = lsmc->sk.sk_sndbuf;
- new_smc->sk.sk_rcvbuf = lsmc->sk.sk_rcvbuf;
sock_hold(&new_smc->sk); /* sock_put in passive closing */
if (!queue_work(smc_hs_wq, &new_smc->smc_listen_work))
sock_put(&new_smc->sk);
--
2.41.0


2023-08-02 13:24:07

by Tony Lu

[permalink] [raw]
Subject: Re: [PATCH net 2/2] net/smc: Use correct buffer sizes when switching between TCP and SMC

On Wed, Aug 02, 2023 at 11:33:13AM +0200, Gerd Bayer wrote:
> Tuning of the effective buffer size through setsockopts was working for
> SMC traffic only but not for TCP fall-back connections even before
> commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and
> make them tunable"). That change made it apparent that TCP fall-back
> connections would use net.smc.[rw]mem as buffer size instead of
> net.ipv4_tcp_[rw]mem.
>
> Amend the code that copies attributes between the (TCP) clcsock and the
> SMC socket and adjust buffer sizes appropriately:
> - Copy over sk_userlocks so that both sockets agree on whether tuning
> via setsockopt is active.
> - When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified with
> setsockopt. Otherwise, use the sysctl value for TCP/IPv4.
> - Likewise, use either values from setsockopt or from sysctl for SMC
> (duplicated) on successful SMC connect.
>
> In smc_tcp_listen_work() drop the explicit copy of buffer sizes as that
> is taken care of by the attribute copy.
>
> Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
> Signed-off-by: Gerd Bayer <[email protected]>
> Reviewed-by: Wenjia Zhang <[email protected]>
> Reviewed-by: Jan Karcher <[email protected]>

Reviewed-by: Tony Lu <[email protected]>

>
^^^^ nit: a extra new line here.
> ---
> net/smc/af_smc.c | 76 ++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 54 insertions(+), 22 deletions(-)
>
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 1fcf1e42474a..1c8ed19041d7 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -436,13 +436,63 @@ static int smc_bind(struct socket *sock, struct sockaddr *uaddr,
> return rc;
> }
>
> +/* copy only relevant settings and flags of SOL_SOCKET level from smc to
> + * clc socket (since smc is not called for these options from net/core)
> + */
> +
> +#define SK_FLAGS_SMC_TO_CLC ((1UL << SOCK_URGINLINE) | \
> + (1UL << SOCK_KEEPOPEN) | \
> + (1UL << SOCK_LINGER) | \
> + (1UL << SOCK_BROADCAST) | \
> + (1UL << SOCK_TIMESTAMP) | \
> + (1UL << SOCK_DBG) | \
> + (1UL << SOCK_RCVTSTAMP) | \
> + (1UL << SOCK_RCVTSTAMPNS) | \
> + (1UL << SOCK_LOCALROUTE) | \
> + (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE) | \
> + (1UL << SOCK_RXQ_OVFL) | \
> + (1UL << SOCK_WIFI_STATUS) | \
> + (1UL << SOCK_NOFCS) | \
> + (1UL << SOCK_FILTER_LOCKED) | \
> + (1UL << SOCK_TSTAMP_NEW))
> +
> +/* if set, use value set by setsockopt() - else use IPv4 or SMC sysctl value */
> +static void smc_adjust_sock_bufsizes(struct sock *nsk, struct sock *osk,
> + unsigned long mask)
> +{
> + struct net *nnet;
> +
> + nnet = nsk->sk_net.net;

Better to combine these two lines with existed helper.

struct net *net = sock_net(nsk);

> +
> + nsk->sk_userlocks = osk->sk_userlocks;
> +
> + if (osk->sk_userlocks & SOCK_SNDBUF_LOCK) {
> + nsk->sk_sndbuf = osk->sk_sndbuf;
> + } else {
> + if (mask == SK_FLAGS_SMC_TO_CLC)
> + WRITE_ONCE(nsk->sk_sndbuf,
> + READ_ONCE(nnet->ipv4.sysctl_tcp_wmem[1]));
> + else
> + WRITE_ONCE(nsk->sk_sndbuf,
> + 2 * READ_ONCE(nnet->smc.sysctl_wmem));
> + }
> + if (osk->sk_userlocks & SOCK_RCVBUF_LOCK) {
> + nsk->sk_rcvbuf = osk->sk_rcvbuf;
> + } else {
> + if (mask == SK_FLAGS_SMC_TO_CLC)
> + WRITE_ONCE(nsk->sk_rcvbuf,
> + READ_ONCE(nnet->ipv4.sysctl_tcp_rmem[1]));
> + else
> + WRITE_ONCE(nsk->sk_rcvbuf,
> + 2 * READ_ONCE(nnet->smc.sysctl_rmem));
> + }
> +}
> +
> static void smc_copy_sock_settings(struct sock *nsk, struct sock *osk,
> unsigned long mask)
> {
> /* options we don't get control via setsockopt for */
> nsk->sk_type = osk->sk_type;
> - nsk->sk_sndbuf = osk->sk_sndbuf;
> - nsk->sk_rcvbuf = osk->sk_rcvbuf;
> nsk->sk_sndtimeo = osk->sk_sndtimeo;
> nsk->sk_rcvtimeo = osk->sk_rcvtimeo;
> nsk->sk_mark = osk->sk_mark;
> @@ -453,26 +503,10 @@ static void smc_copy_sock_settings(struct sock *nsk, struct sock *osk,
>
> nsk->sk_flags &= ~mask;
> nsk->sk_flags |= osk->sk_flags & mask;
> +
> + smc_adjust_sock_bufsizes(nsk, osk, mask);
> }
>
> -#define SK_FLAGS_SMC_TO_CLC ((1UL << SOCK_URGINLINE) | \
> - (1UL << SOCK_KEEPOPEN) | \
> - (1UL << SOCK_LINGER) | \
> - (1UL << SOCK_BROADCAST) | \
> - (1UL << SOCK_TIMESTAMP) | \
> - (1UL << SOCK_DBG) | \
> - (1UL << SOCK_RCVTSTAMP) | \
> - (1UL << SOCK_RCVTSTAMPNS) | \
> - (1UL << SOCK_LOCALROUTE) | \
> - (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE) | \
> - (1UL << SOCK_RXQ_OVFL) | \
> - (1UL << SOCK_WIFI_STATUS) | \
> - (1UL << SOCK_NOFCS) | \
> - (1UL << SOCK_FILTER_LOCKED) | \
> - (1UL << SOCK_TSTAMP_NEW))
> -/* copy only relevant settings and flags of SOL_SOCKET level from smc to
> - * clc socket (since smc is not called for these options from net/core)
> - */
> static void smc_copy_sock_settings_to_clc(struct smc_sock *smc)
> {
> smc_copy_sock_settings(smc->clcsock->sk, &smc->sk, SK_FLAGS_SMC_TO_CLC);
> @@ -2479,8 +2513,6 @@ static void smc_tcp_listen_work(struct work_struct *work)
> sock_hold(lsk); /* sock_put in smc_listen_work */
> INIT_WORK(&new_smc->smc_listen_work, smc_listen_work);
> smc_copy_sock_settings_to_smc(new_smc);
> - new_smc->sk.sk_sndbuf = lsmc->sk.sk_sndbuf;
> - new_smc->sk.sk_rcvbuf = lsmc->sk.sk_rcvbuf;
> sock_hold(&new_smc->sk); /* sock_put in passive closing */
> if (!queue_work(smc_hs_wq, &new_smc->smc_listen_work))
> sock_put(&new_smc->sk);
> --
> 2.41.0

2023-08-02 13:26:00

by Tony Lu

[permalink] [raw]
Subject: Re: [PATCH net 1/2] net/smc: Fix setsockopt and sysctl to specify same buffer size again

On Wed, Aug 02, 2023 at 11:33:12AM +0200, Gerd Bayer wrote:
> Commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock
> and make them tunable") introduced the net.smc.rmem and net.smc.wmem
> sysctls to specify the size of buffers to be used for SMC type
> connections. This created a regression for users that specified the
> buffer size via setsockopt() as the effective buffer size was now
> doubled.

The idea of the original patch is going to unbind buffer from clcsock.
It's great here to determine whether to use original value or doubled.

>
> Re-introduce the division by 2 in the SMC buffer create code and level
> this out by duplicating the net.smc.[rw]mem values used for initializing
> sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both
> methods (setsockopt or sysctl) the effective buffer size that they
> expect.
>
> Initialize net.smc.[rw]mem from its own constant of 64kB, respectively.
> Internal performance tests show that this value is a good compromise
> between throughput/latency and memory consumption. Also, this decouples
> it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the
> module for SMC protocol was loaded. Check that no more than INT_MAX / 2
> is assigned to net.smc.[rw]mem, in order to avoid any overflow condition
> when that is doubled for use in sk_sndbuf or sk_rcvbuf.

64kB may be small in our environment, but that's okay to change it with
systemd during boot. Different networking environment, such as with
higher latency, should have different buffer size. So two tunable knobs
are good enough.

>
> While at it, drop the confusing sk_buf_size variable from
> __smc_buf_create and name "compressed" buffer size variables more
> consistently.
>
> Background:
>
> Before the commit mentioned above, SMC's buffer allocator in
> __smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf
> value as initial value to search for appropriate buffers. If the search
> resorted to using a bigger buffer when all buffers of the specified
> size were busy, the duplicate of the used effective buffer size is
> stored back to sk_rcvbuf/sk_sndbuf.
>
> When available, buffers of exactly the size that a user had specified as
> input to setsockopt() were used, despite setsockopt()'s documentation in
> "man 7 socket" talking of a mandatory duplication:
>
> [...]
> SO_SNDBUF
> Sets or gets the maximum socket send buffer in bytes.
> The kernel doubles this value (to allow space for book‐
> keeping overhead) when it is set using setsockopt(2),
> and this doubled value is returned by getsockopt(2).
> The default value is set by the
> /proc/sys/net/core/wmem_default file and the maximum
> allowed value is set by the /proc/sys/net/core/wmem_max
> file. The minimum (doubled) value for this option is
> 2048.
> [...]
>
> Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
> Co-developed-by: Jan Karcher <[email protected]>
> Signed-off-by: Jan Karcher <[email protected]>
> Signed-off-by: Gerd Bayer <[email protected]>
> Reviewed-by: Wenjia Zhang <[email protected]>

Reviewed-by: Tony Lu <[email protected]>

> ---
> net/smc/af_smc.c | 4 ++--
> net/smc/smc.h | 2 +-
> net/smc/smc_clc.c | 4 ++--
> net/smc/smc_core.c | 25 ++++++++++++-------------
> net/smc/smc_sysctl.c | 10 ++++++++--
> 5 files changed, 25 insertions(+), 20 deletions(-)
>
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index a7f887d91d89..1fcf1e42474a 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -378,8 +378,8 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock,
> sk->sk_state = SMC_INIT;
> sk->sk_destruct = smc_destruct;
> sk->sk_protocol = protocol;
> - WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(net->smc.sysctl_wmem));
> - WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(net->smc.sysctl_rmem));
> + WRITE_ONCE(sk->sk_sndbuf, 2 * READ_ONCE(net->smc.sysctl_wmem));
> + WRITE_ONCE(sk->sk_rcvbuf, 2 * READ_ONCE(net->smc.sysctl_rmem));
> smc = smc_sk(sk);
> INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work);
> INIT_WORK(&smc->connect_work, smc_connect_work);
> diff --git a/net/smc/smc.h b/net/smc/smc.h
> index 2eeea4cdc718..1f2b912c43d1 100644
> --- a/net/smc/smc.h
> +++ b/net/smc/smc.h
> @@ -161,7 +161,7 @@ struct smc_connection {
>
> struct smc_buf_desc *sndbuf_desc; /* send buffer descriptor */
> struct smc_buf_desc *rmb_desc; /* RMBE descriptor */
> - int rmbe_size_short;/* compressed notation */
> + int rmbe_size_comp; /* compressed notation */

nit: spaces are used here, better to use two tabs.

> int rmbe_update_limit;
> /* lower limit for consumer
> * cursor update
> diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
> index b9b8b07aa702..c90d9e5dda54 100644
> --- a/net/smc/smc_clc.c
> +++ b/net/smc/smc_clc.c
> @@ -1007,7 +1007,7 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc,
> clc->d0.gid =
> conn->lgr->smcd->ops->get_local_gid(conn->lgr->smcd);
> clc->d0.token = conn->rmb_desc->token;
> - clc->d0.dmbe_size = conn->rmbe_size_short;
> + clc->d0.dmbe_size = conn->rmbe_size_comp;
> clc->d0.dmbe_idx = 0;
> memcpy(&clc->d0.linkid, conn->lgr->id, SMC_LGR_ID_SIZE);
> if (version == SMC_V1) {
> @@ -1050,7 +1050,7 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc,
> clc->r0.qp_mtu = min(link->path_mtu, link->peer_mtu);
> break;
> }
> - clc->r0.rmbe_size = conn->rmbe_size_short;
> + clc->r0.rmbe_size = conn->rmbe_size_comp;
> clc->r0.rmb_dma_addr = conn->rmb_desc->is_vm ?
> cpu_to_be64((uintptr_t)conn->rmb_desc->cpu_addr) :
> cpu_to_be64((u64)sg_dma_address
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index 3f465faf2b68..6b78075404d7 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -2309,31 +2309,30 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
> struct smc_connection *conn = &smc->conn;
> struct smc_link_group *lgr = conn->lgr;
> struct list_head *buf_list;
> - int bufsize, bufsize_short;
> + int bufsize, bufsize_comp;
> struct rw_semaphore *lock; /* lock buffer list */
> bool is_dgraded = false;
> - int sk_buf_size;
>
> if (is_rmb)
> /* use socket recv buffer size (w/o overhead) as start value */
> - sk_buf_size = smc->sk.sk_rcvbuf;
> + bufsize = smc->sk.sk_rcvbuf / 2;
> else
> /* use socket send buffer size (w/o overhead) as start value */
> - sk_buf_size = smc->sk.sk_sndbuf;
> + bufsize = smc->sk.sk_sndbuf / 2;
>
> - for (bufsize_short = smc_compress_bufsize(sk_buf_size, is_smcd, is_rmb);
> - bufsize_short >= 0; bufsize_short--) {
> + for (bufsize_comp = smc_compress_bufsize(bufsize, is_smcd, is_rmb);
> + bufsize_comp >= 0; bufsize_comp--) {
> if (is_rmb) {
> lock = &lgr->rmbs_lock;
> - buf_list = &lgr->rmbs[bufsize_short];
> + buf_list = &lgr->rmbs[bufsize_comp];
> } else {
> lock = &lgr->sndbufs_lock;
> - buf_list = &lgr->sndbufs[bufsize_short];
> + buf_list = &lgr->sndbufs[bufsize_comp];
> }
> - bufsize = smc_uncompress_bufsize(bufsize_short);
> + bufsize = smc_uncompress_bufsize(bufsize_comp);
>
> /* check for reusable slot in the link group */
> - buf_desc = smc_buf_get_slot(bufsize_short, lock, buf_list);
> + buf_desc = smc_buf_get_slot(bufsize_comp, lock, buf_list);
> if (buf_desc) {
> buf_desc->is_dma_need_sync = 0;
> SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize);
> @@ -2377,8 +2376,8 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
>
> if (is_rmb) {
> conn->rmb_desc = buf_desc;
> - conn->rmbe_size_short = bufsize_short;
> - smc->sk.sk_rcvbuf = bufsize;
> + conn->rmbe_size_comp = bufsize_comp;
> + smc->sk.sk_rcvbuf = bufsize * 2;
> atomic_set(&conn->bytes_to_rcv, 0);
> conn->rmbe_update_limit =
> smc_rmb_wnd_update_limit(buf_desc->len);
> @@ -2386,7 +2385,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
> smc_ism_set_conn(conn); /* map RMB/smcd_dev to conn */
> } else {
> conn->sndbuf_desc = buf_desc;
> - smc->sk.sk_sndbuf = bufsize;
> + smc->sk.sk_sndbuf = bufsize * 2;
> atomic_set(&conn->sndbuf_space, bufsize);
> }
> return 0;
> diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
> index b6f79fabb9d3..0b2a957ca5f5 100644
> --- a/net/smc/smc_sysctl.c
> +++ b/net/smc/smc_sysctl.c
> @@ -21,6 +21,10 @@
>
> static int min_sndbuf = SMC_BUF_MIN_SIZE;
> static int min_rcvbuf = SMC_BUF_MIN_SIZE;
> +static int max_sndbuf = INT_MAX / 2;
> +static int max_rcvbuf = INT_MAX / 2;
> +static const int net_smc_wmem_init = (64 * 1024);
> +static const int net_smc_rmem_init = (64 * 1024);
>
> static struct ctl_table smc_table[] = {
> {
> @@ -53,6 +57,7 @@ static struct ctl_table smc_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec_minmax,
> .extra1 = &min_sndbuf,
> + .extra2 = &max_sndbuf,
> },
> {
> .procname = "rmem",
> @@ -61,6 +66,7 @@ static struct ctl_table smc_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec_minmax,
> .extra1 = &min_rcvbuf,
> + .extra2 = &max_rcvbuf,
> },
> { }
> };
> @@ -88,8 +94,8 @@ int __net_init smc_sysctl_net_init(struct net *net)
> net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
> net->smc.sysctl_smcr_buf_type = SMCR_PHYS_CONT_BUFS;
> net->smc.sysctl_smcr_testlink_time = SMC_LLC_TESTLINK_DEFAULT_TIME;
> - WRITE_ONCE(net->smc.sysctl_wmem, READ_ONCE(net->ipv4.sysctl_tcp_wmem[1]));
> - WRITE_ONCE(net->smc.sysctl_rmem, READ_ONCE(net->ipv4.sysctl_tcp_rmem[1]));
> + WRITE_ONCE(net->smc.sysctl_wmem, net_smc_wmem_init);
> + WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init);
>
> return 0;
>
> --
> 2.41.0

2023-08-02 17:50:41

by Gerd Bayer

[permalink] [raw]
Subject: Re: [PATCH net 2/2] net/smc: Use correct buffer sizes when switching between TCP and SMC

On Wed, 2023-08-02 at 20:43 +0800, Tony Lu wrote:
> On Wed, Aug 02, 2023 at 11:33:13AM +0200, Gerd Bayer wrote:
> > Tuning of the effective buffer size through setsockopts was working
> > for
> > SMC traffic only but not for TCP fall-back connections even before
> > commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock
> > and
> > make them tunable"). That change made it apparent that TCP fall-
> > back
> > connections would use net.smc.[rw]mem as buffer size instead of
> > net.ipv4_tcp_[rw]mem.
> >
> > Amend the code that copies attributes between the (TCP) clcsock and
> > the
> > SMC socket and adjust buffer sizes appropriately:
> > - Copy over sk_userlocks so that both sockets agree on whether
> > tuning
> >   via setsockopt is active.
> > - When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified
> > with
> >   setsockopt. Otherwise, use the sysctl value for TCP/IPv4.
> > - Likewise, use either values from setsockopt or from sysctl for
> > SMC
> >   (duplicated) on successful SMC connect.
> >
> > In smc_tcp_listen_work() drop the explicit copy of buffer sizes as
> > that
> > is taken care of by the attribute copy.
> >
> > Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock
> > and make them tunable")
> > Signed-off-by: Gerd Bayer <[email protected]>
> > Reviewed-by: Wenjia Zhang <[email protected]>
> > Reviewed-by: Jan Karcher <[email protected]>
>
> Reviewed-by: Tony Lu <[email protected]>
>
> >
> ^^^^ nit: a extra new line here.
I'll clean that up.

> > ---
> >  net/smc/af_smc.c | 76 ++++++++++++++++++++++++++++++++++----------
> > ----
> >  1 file changed, 54 insertions(+), 22 deletions(-)
> >
> >
[...]

> > +/* if set, use value set by setsockopt() - else use IPv4 or SMC
> > sysctl value */
> > +static void smc_adjust_sock_bufsizes(struct sock *nsk, struct sock
> > *osk,
> > +                                    unsigned long mask)
> > +{
> > +       struct net *nnet;
> > +
> > +       nnet = nsk->sk_net.net;
>
> Better to combine these two lines with existed helper.
>
> struct net *net = sock_net(nsk);
Yes, looks much cleaner.

[...]
>

Thank you Tony for your review and comments.
I'll be sending out a v2 with your recommendations - but give people a
little more time to look at this version.

Thanks,
Gerd

2023-08-02 23:44:28

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH net 2/2] net/smc: Use correct buffer sizes when switching between TCP and SMC

Hi Gerd,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]
[also build test ERROR on linus/master v6.5-rc4 next-20230802]
[cannot apply to net/main]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Gerd-Bayer/net-smc-Fix-setsockopt-and-sysctl-to-specify-same-buffer-size-again/20230802-193805
base: net-next/main
patch link: https://lore.kernel.org/r/20230802093313.1501605-3-gbayer%40linux.ibm.com
patch subject: [PATCH net 2/2] net/smc: Use correct buffer sizes when switching between TCP and SMC
config: nios2-randconfig-r006-20230731 (https://download.01.org/0day-ci/archive/20230803/[email protected]/config)
compiler: nios2-linux-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230803/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

net/smc/af_smc.c: In function 'smc_adjust_sock_bufsizes':
>> net/smc/af_smc.c:465:27: error: 'possible_net_t' has no member named 'net'
465 | nnet = nsk->sk_net.net;
| ^


vim +465 net/smc/af_smc.c

438
439 /* copy only relevant settings and flags of SOL_SOCKET level from smc to
440 * clc socket (since smc is not called for these options from net/core)
441 */
442
443 #define SK_FLAGS_SMC_TO_CLC ((1UL << SOCK_URGINLINE) | \
444 (1UL << SOCK_KEEPOPEN) | \
445 (1UL << SOCK_LINGER) | \
446 (1UL << SOCK_BROADCAST) | \
447 (1UL << SOCK_TIMESTAMP) | \
448 (1UL << SOCK_DBG) | \
449 (1UL << SOCK_RCVTSTAMP) | \
450 (1UL << SOCK_RCVTSTAMPNS) | \
451 (1UL << SOCK_LOCALROUTE) | \
452 (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE) | \
453 (1UL << SOCK_RXQ_OVFL) | \
454 (1UL << SOCK_WIFI_STATUS) | \
455 (1UL << SOCK_NOFCS) | \
456 (1UL << SOCK_FILTER_LOCKED) | \
457 (1UL << SOCK_TSTAMP_NEW))
458
459 /* if set, use value set by setsockopt() - else use IPv4 or SMC sysctl value */
460 static void smc_adjust_sock_bufsizes(struct sock *nsk, struct sock *osk,
461 unsigned long mask)
462 {
463 struct net *nnet;
464
> 465 nnet = nsk->sk_net.net;
466
467 nsk->sk_userlocks = osk->sk_userlocks;
468
469 if (osk->sk_userlocks & SOCK_SNDBUF_LOCK) {
470 nsk->sk_sndbuf = osk->sk_sndbuf;
471 } else {
472 if (mask == SK_FLAGS_SMC_TO_CLC)
473 WRITE_ONCE(nsk->sk_sndbuf,
474 READ_ONCE(nnet->ipv4.sysctl_tcp_wmem[1]));
475 else
476 WRITE_ONCE(nsk->sk_sndbuf,
477 2 * READ_ONCE(nnet->smc.sysctl_wmem));
478 }
479 if (osk->sk_userlocks & SOCK_RCVBUF_LOCK) {
480 nsk->sk_rcvbuf = osk->sk_rcvbuf;
481 } else {
482 if (mask == SK_FLAGS_SMC_TO_CLC)
483 WRITE_ONCE(nsk->sk_rcvbuf,
484 READ_ONCE(nnet->ipv4.sysctl_tcp_rmem[1]));
485 else
486 WRITE_ONCE(nsk->sk_rcvbuf,
487 2 * READ_ONCE(nnet->smc.sysctl_rmem));
488 }
489 }
490

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki