2024-01-11 12:01:26

by Wen Gu

Subject: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D

This patch set is the second part of the new version of [1] (the first part
is available at [2]). The changes in this version are listed at the end.

# Background

SMC-D is currently used on IBM Z with the ISM function to optimize the network
interconnect for intra-CPC communication. Inspired by this, we try to make
SMC-D available on non-s390 architectures through a software-implemented
virtual ISM device, the loopback-ism device introduced here, to accelerate
inter-process or inter-container communication within the same OS instance.

# Design

This patch set includes 3 parts:

- Patch #1-#2: preparatory work for loopback-ism.
- Patch #3-#9: implement loopback-ism device.
- Patch #10-#15: memory copy optimization for loopback scenario.

The loopback-ism device is designed as an ISMv2 device and is not limited to
a specific net namespace. Both ends of an inter-process connection (1/1' in the
diagram below) or of an inter-container connection (2/2' in the diagram below)
can find the same available loopback-ism device and choose it during the CLC
handshake.

Container 1 (ns1) Container 2 (ns2)
+-----------------------------------------+ +-------------------------+
| +-------+ +-------+ +-------+ | | +-------+ |
| | App A | | App B | | App C | | | | App D |<-+ |
| +-------+ +---^---+ +-------+ | | +-------+ |(2') |
| |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
| (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
| `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
+---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
| | | |
Kernel | | | |
+----+-------v---+-----------v----------------------------------+---+----+
| | TCP | |
| | | |
| +--------------------------------------------------------------+ |
| |
| +--------------+ |
| | smc loopback | |
+---------------------------+--------------+-----------------------------+

The loopback-ism device creates DMBs (shared memory) for each connection peer.
Since data transfer occurs within the same kernel, the sndbuf of each peer
is only a descriptor that points to the same memory region as the peer DMB, so
the data copy from sndbuf to peer DMB can be avoided in the loopback-ism case.

Container 1 (ns1) Container 2 (ns2)
+-----------------------------------------+ +-------------------------+
| +-------+ | | +-------+ |
| | App C |-----+ | | | App D | |
| +-------+ | | | +-^-----+ |
| | | | | |
| (2) | | | (2') | |
| | | | | |
+---------------|-------------------------+ +----------|--------------+
| |
Kernel | |
+---------------|-----------------------------------------|--------------+
| +--------+ +--v-----+ +--------+ +--------+ |
| |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
| +-----|--+ +--|-----+ +-----|--+ +--------+ |
| +-----|--+ | +-----|--+ |
| | DMB C | +---------------------------------| DMB D | |
| +--------+ +--------+ |
| |
| +--------------+ |
| | smc loopback | |
+---------------------------+--------------+-----------------------------+
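
As a rough illustration of the merged sndbuf/DMB idea above, here is a minimal
user-space sketch. It is not the code from this series: struct buf_desc,
attach_ghost_sndbuf() and sndbuf_write() are hypothetical names chosen for the
example, and plain memory stands in for the kernel buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in; not the kernel's struct smc_buf_desc. */
struct buf_desc {
	char *cpu_addr;	/* start of the backing memory */
	size_t len;	/* size of the backing memory in bytes */
};

/* Attach a "ghost" sndbuf: it owns no memory and aliases the peer DMB. */
void attach_ghost_sndbuf(struct buf_desc *sndbuf,
			 const struct buf_desc *peer_dmb)
{
	assert(sndbuf && peer_dmb);
	sndbuf->cpu_addr = peer_dmb->cpu_addr;
	sndbuf->len = peer_dmb->len;
}

/* Copy user data into the sndbuf. Returns the number of bytes written,
 * or 0 if the data does not fit at the given offset.
 */
size_t sndbuf_write(struct buf_desc *sndbuf, size_t offset,
		    const void *data, size_t size)
{
	if (offset > sndbuf->len || size > sndbuf->len - offset)
		return 0;
	/* This single copy already lands in the peer DMB. */
	memcpy(sndbuf->cpu_addr + offset, data, size);
	return size;
}
```

In this sketch, a write through App C's sndbuf descriptor is immediately
visible in DMB D, so no second copy from sndbuf to DMB is needed.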

# Benchmark Test

* Test environments:
  - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
  - SMC sndbuf/DMB size 1MB.
  - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to its default, 0,
    which means that sndbuf and DMB are merged and no data is copied between
    them.
  - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to its default, 0,
    which means that the DMB is a physically contiguous buffer.

* Test objects:
  - TCP: run on TCP loopback.
  - SMC lo: run on the SMC loopback device.

1. ipc-benchmark (see [3])

- ./<foo> -c 1000000 -s 100

                        TCP        SMC-lo
Message rate (msg/s)    80636      149515 (+85.42%)

2. sockperf

- serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
- clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30

                    TCP        SMC-lo
Bandwidth (MBps)    4909.36    8197.57 (+66.98%)
Latency (us)        6.098      3.383 (-44.52%)

3. nginx/wrk

- serv: <smc_run> nginx
- clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80

                    TCP          SMC-lo
Requests/s          181685.74    246447.77 (+35.65%)

4. redis-benchmark

- serv: <smc_run> redis-server
- clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024

                    TCP         SMC-lo
GET (Requests/s)    85855.34    118553.64 (+38.09%)
SET (Requests/s)    86824.40    125944.58 (+45.06%)


Change log:

v1->RFC:
- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
/sys/devices/virtual/smc/loopback-ism/xfer_bytes
- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
merging sndbuf with peer DMB.
- Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
control of whether to merge sndbuf and DMB. They can be respectively set by:
/sys/devices/virtual/smc/loopback-ism/dmb_type
/sys/devices/virtual/smc/loopback-ism/dmb_copy
The motivation for these two controls is that a performance bottleneck was
found when a vzalloc()ed DMB is used, sndbuf is merged with the DMB, there
are many CPUs, and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is
caused by lock contention on vmap_area_lock [5], which is involved in
memcpy_from_msg() or memcpy_to_msg(). Uladzislau Rezki is currently working on
mitigating the vmap lock contention [6]. That work has a significant effect,
but using virtual memory still has additional overhead compared to using
physical memory. So this new version provides the dmb_type and dmb_copy
controls to suit different scenarios.
- Some minor changes and comment improvements.

RFC->old version([1]):
Link: https://lore.kernel.org/netdev/[email protected]/
- Patch #1: improve the loopback-ism dump, it shows as follows now:
# smcd d
FID Type PCI-ID PCHID InUse #LGs PNET-ID
0000 0 loopback-ism ffff No 0
- Patch #3: introduce the smc_ism_set_v2_capable() helper and set
smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
regardless of whether there is already a device in smcd device list.
- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
to activate or deactivate the loopback-ism.
- Patch #9: introduce the statistics of loopback-ism by
/sys/devices/virtual/smc/loopback-ism/{{tx|rx}_bytes|dmbs_cnt}.
- Some minor changes and comment improvements.

[1] https://lore.kernel.org/netdev/[email protected]/
[2] https://lore.kernel.org/netdev/[email protected]/
[3] https://github.com/goldsborough/ipc-bench
[4] https://lore.kernel.org/all/[email protected]/
[5] https://lore.kernel.org/all/[email protected]/
[6] https://lore.kernel.org/all/[email protected]/

Wen Gu (15):
net/smc: improve SMC-D device dump for virtual ISM
net/smc: decouple specialized struct from SMC-D DMB registration
net/smc: introduce virtual ISM device loopback-ism
net/smc: implement ID-related operations of loopback-ism
net/smc: implement some unsupported operations of loopback-ism
net/smc: implement DMB-related operations of loopback-ism
net/smc: register loopback-ism into SMC-D device list
net/smc: introduce loopback-ism runtime switch
net/smc: introduce loopback-ism statistics attributes
net/smc: add operations to merge sndbuf with peer DMB
net/smc: attach or detach ghost sndbuf to peer DMB
net/smc: adapt cursor update when sndbuf and peer DMB are merged
net/smc: introduce loopback-ism DMB type control
net/smc: introduce loopback-ism DMB data copy control
net/smc: implement DMB-merged operations of loopback-ism

drivers/s390/net/ism_drv.c | 2 +-
include/net/smc.h | 7 +-
net/smc/Kconfig | 13 +
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 28 +-
net/smc/smc_cdc.c | 58 ++-
net/smc/smc_cdc.h | 1 +
net/smc/smc_core.c | 61 +++-
net/smc/smc_core.h | 1 +
net/smc/smc_ism.c | 71 +++-
net/smc/smc_ism.h | 5 +
net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 88 +++++
13 files changed, 1026 insertions(+), 29 deletions(-)
create mode 100644 net/smc/smc_loopback.c
create mode 100644 net/smc/smc_loopback.h

--
2.32.0.3.g01195cf9f



2024-01-11 12:01:47

by Wen Gu

Subject: [PATCH net-next 01/15] net/smc: improve SMC-D device dump for virtual ISM

The introduction of virtual ISM requires improving the SMC-D device
dump. A software-implemented non-PCI device (loopback-ism) should be
handled correctly, and the CHID reserved for virtual ISM should be
obtained from the smcd_ops interface instead of from PCI information.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_ism.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index ac88de2a06a0..66bcfddd3fcf 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -252,12 +252,11 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
char smc_pnet[SMC_MAX_PNETID_LEN + 1];
struct smc_pci_dev smc_pci_dev;
struct nlattr *port_attrs;
+ struct device *device;
struct nlattr *attrs;
- struct ism_dev *ism;
int use_cnt = 0;
void *nlh;

- ism = smcd->priv;
nlh = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
&smc_gen_nl_family, NLM_F_MULTI,
SMC_NETLINK_GET_DEV_SMCD);
@@ -272,7 +271,15 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
if (nla_put_u8(skb, SMC_NLA_DEV_IS_CRIT, use_cnt > 0))
goto errattr;
memset(&smc_pci_dev, 0, sizeof(smc_pci_dev));
- smc_set_pci_values(to_pci_dev(ism->dev.parent), &smc_pci_dev);
+ device = smcd->ops->get_dev(smcd);
+ if (device->parent)
+ smc_set_pci_values(to_pci_dev(device->parent), &smc_pci_dev);
+ if (smc_ism_is_virtual(smcd)) {
+ smc_pci_dev.pci_pchid = smc_ism_get_chid(smcd);
+ if (!device->parent)
+ snprintf(smc_pci_dev.pci_id, sizeof(smc_pci_dev.pci_id),
+ "%s", dev_name(device));
+ }
if (nla_put_u32(skb, SMC_NLA_DEV_PCI_FID, smc_pci_dev.pci_fid))
goto errattr;
if (nla_put_u16(skb, SMC_NLA_DEV_PCI_CHID, smc_pci_dev.pci_pchid))
--
2.32.0.3.g01195cf9f


2024-01-11 12:02:05

by Wen Gu

Subject: [PATCH net-next 02/15] net/smc: decouple specialized struct from SMC-D DMB registration

The struct 'ism_client' is specific to the s390 platform firmware ISM.
So replace it with 'void *' to make the SMC-D DMB registration helper
generic for both virtual ISM and existing ISM.

Signed-off-by: Wen Gu <[email protected]>
---
drivers/s390/net/ism_drv.c | 2 +-
include/net/smc.h | 4 ++--
net/smc/smc_ism.c | 7 ++-----
3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/s390/net/ism_drv.c b/drivers/s390/net/ism_drv.c
index 2c8e964425dc..9b2a52913e76 100644
--- a/drivers/s390/net/ism_drv.c
+++ b/drivers/s390/net/ism_drv.c
@@ -726,7 +726,7 @@ static int smcd_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
}

static int smcd_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
- struct ism_client *client)
+ void *client)
{
return ism_register_dmb(smcd->priv, (struct ism_dmb *)dmb, client);
}
diff --git a/include/net/smc.h b/include/net/smc.h
index c9dcb30e3fd9..6273c3a8b24a 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -50,7 +50,6 @@ struct smcd_dmb {
#define ISM_ERROR 0xFFFF

struct smcd_dev;
-struct ism_client;

struct smcd_gid {
u64 gid;
@@ -61,7 +60,7 @@ struct smcd_ops {
int (*query_remote_gid)(struct smcd_dev *dev, struct smcd_gid *rgid,
u32 vid_valid, u32 vid);
int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
- struct ism_client *client);
+ void *client);
int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
@@ -81,6 +80,7 @@ struct smcd_ops {
struct smcd_dev {
const struct smcd_ops *ops;
void *priv;
+ void *client;
struct list_head list;
spinlock_t lock;
struct smc_connection **conn;
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 66bcfddd3fcf..fb1837d0a861 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -222,7 +222,6 @@ int smc_ism_unregister_dmb(struct smcd_dev *smcd, struct smc_buf_desc *dmb_desc)
int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
struct smc_buf_desc *dmb_desc)
{
-#if IS_ENABLED(CONFIG_ISM)
struct smcd_dmb dmb;
int rc;

@@ -231,7 +230,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
dmb.sba_idx = dmb_desc->sba_idx;
dmb.vlan_id = lgr->vlan_id;
dmb.rgid = lgr->peer_gid.gid;
- rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, &smc_ism_client);
+ rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, lgr->smcd->client);
if (!rc) {
dmb_desc->sba_idx = dmb.sba_idx;
dmb_desc->token = dmb.dmb_tok;
@@ -240,9 +239,6 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
dmb_desc->len = dmb.dmb_len;
}
return rc;
-#else
- return 0;
-#endif
}

static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
@@ -453,6 +449,7 @@ static void smcd_register_dev(struct ism_dev *ism)
if (!smcd)
return;
smcd->priv = ism;
+ smcd->client = &smc_ism_client;
ism_set_priv(ism, &smc_ism_client, smcd);
if (smc_pnetid_by_dev_port(&ism->pdev->dev, 0, smcd->pnetid))
smc_pnetid_by_table_smcd(smcd);
--
2.32.0.3.g01195cf9f


2024-01-11 12:02:22

by Wen Gu

Subject: [PATCH net-next 05/15] net/smc: implement some unsupported operations of loopback-ism

VLAN operations are currently not supported, since there does not seem to
be a strong need for VLANs in the loopback case.

The signal_event operation is not supported either (it is a no-op that
returns 0), since no event currently needs to be processed by the
loopback-ism device.
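
The stub approach can be sketched in user space as follows. This is only an
illustration under assumptions: struct dev_ops, lo_ops_sketch and setup_vlan()
are hypothetical names, and the signatures are simplified compared to the
real smcd_ops. The point it tries to show is that filling every slot with a
stub lets callers invoke the operation without a NULL check and handle the
error code instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified ops table; not the kernel's struct smcd_ops. */
struct dev_ops {
	int (*add_vlan_id)(uint64_t vlan_id);
	int (*signal_event)(uint32_t event_code);
};

/* VLAN handling is not supported: report that explicitly. */
int lo_add_vlan_id(uint64_t vlan_id)
{
	(void)vlan_id;
	return -EOPNOTSUPP;
}

/* Nothing to signal in the loopback case: succeed without doing work. */
int lo_signal_event(uint32_t event_code)
{
	(void)event_code;
	return 0;
}

const struct dev_ops lo_ops_sketch = {
	.add_vlan_id = lo_add_vlan_id,
	.signal_event = lo_signal_event,
};

/* A caller can invoke the op unconditionally and just check the result. */
int setup_vlan(const struct dev_ops *ops, uint64_t vlan_id)
{
	assert(ops && ops->add_vlan_id);
	return ops->add_vlan_id(vlan_id);
}
```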

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 36 +++++++++++++++++++++++++++++++-----
1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 40dff28d837d..353d4a2d69a1 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -50,6 +50,32 @@ static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
return 0;
}

+static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_set_vlan_required(struct smcd_dev *smcd)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_reset_vlan_required(struct smcd_dev *smcd)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_signal_event(struct smcd_dev *dev, struct smcd_gid *rgid,
+ u32 trigger_irq, u32 event_code, u64 info)
+{
+ return 0;
+}
+
static int smc_lo_supports_v2(void)
{
return SMC_LO_V2_CAPABLE;
@@ -78,11 +104,11 @@ static const struct smcd_ops lo_ops = {
.query_remote_gid = smc_lo_query_rgid,
.register_dmb = NULL,
.unregister_dmb = NULL,
- .add_vlan_id = NULL,
- .del_vlan_id = NULL,
- .set_vlan_required = NULL,
- .reset_vlan_required = NULL,
- .signal_event = NULL,
+ .add_vlan_id = smc_lo_add_vlan_id,
+ .del_vlan_id = smc_lo_del_vlan_id,
+ .set_vlan_required = smc_lo_set_vlan_required,
+ .reset_vlan_required = smc_lo_reset_vlan_required,
+ .signal_event = smc_lo_signal_event,
.move_data = NULL,
.supports_v2 = smc_lo_supports_v2,
.get_local_gid = smc_lo_get_local_gid,
--
2.32.0.3.g01195cf9f


2024-01-11 12:02:48

by Wen Gu

Subject: [PATCH net-next 03/15] net/smc: introduce virtual ISM device loopback-ism

This introduces loopback-ism, a kind of virtual ISM device for SMC-Dv2.1.
loopback-ism is implemented in software and serves inter-process or
inter-container SMC communication within the same OS instance. It is
created when the SMC module is loaded and destroyed when the module is
unloaded. Support for loopback-ism can be configured via CONFIG_SMC_LO.
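
The create-on-load, destroy-on-unload ordering relies on the usual goto-based
unwinding, where a failing step undoes only the steps that already succeeded.
A minimal user-space sketch of that pattern, with hypothetical names
(struct init_state, step_up(), step_down(), init_all()) and a fail_at field
that exists only to inject a failure for the example:

```c
#include <assert.h>

enum { STEP_IB, STEP_LO, STEP_ULP, STEP_COUNT };

/* Hypothetical bookkeeping: which steps are up, and which one should fail. */
struct init_state {
	int up[STEP_COUNT];
	int fail_at;	/* step index that fails, or -1 for none */
};

static int step_up(struct init_state *s, int step)
{
	if (s->fail_at == step)
		return -1;
	s->up[step] = 1;
	return 0;
}

static void step_down(struct init_state *s, int step)
{
	s->up[step] = 0;
}

/* Bring the steps up in order; on failure, unwind in reverse order. */
int init_all(struct init_state *s)
{
	int rc;

	assert(s);
	rc = step_up(s, STEP_IB);
	if (rc)
		goto out;
	rc = step_up(s, STEP_LO);
	if (rc)
		goto out_ib;
	rc = step_up(s, STEP_ULP);
	if (rc)
		goto out_lo;
	return 0;

out_lo:
	step_down(s, STEP_LO);
out_ib:
	step_down(s, STEP_IB);
out:
	return rc;
}
```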

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/Kconfig | 13 +++
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 12 ++-
net/smc/smc_loopback.c | 181 +++++++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 33 ++++++++
5 files changed, 239 insertions(+), 2 deletions(-)
create mode 100644 net/smc/smc_loopback.c
create mode 100644 net/smc/smc_loopback.h

diff --git a/net/smc/Kconfig b/net/smc/Kconfig
index 746be3996768..e191f78551f4 100644
--- a/net/smc/Kconfig
+++ b/net/smc/Kconfig
@@ -20,3 +20,16 @@ config SMC_DIAG
smcss.

if unsure, say Y.
+
+config SMC_LO
+ bool "SMC_LO: virtual ISM loopback-ism for SMC"
+ depends on SMC
+ default n
+ help
+ SMC_LO provides a kind of virtual ISM device called loopback-ism
+ for SMCD to upgrade AF_INET TCP connections whose ends share the
+ same kernel.
+ loopback-ism is a software implemented device that does not depend
+ on a specific architecture or hardware.
+
+ if unsure, say N.
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 875efcd126a2..a8c37111abe1 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -4,5 +4,5 @@ obj-$(CONFIG_SMC) += smc.o
obj-$(CONFIG_SMC_DIAG) += smc_diag.o
smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o
-smc-y += smc_tracepoint.o
+smc-y += smc_tracepoint.o smc_loopback.o
smc-$(CONFIG_SYSCTL) += smc_sysctl.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index a2cb30af46cb..189aea09b66e 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -53,6 +53,7 @@
#include "smc_stats.h"
#include "smc_tracepoint.h"
#include "smc_sysctl.h"
+#include "smc_loopback.h"

static DEFINE_MUTEX(smc_server_lgr_pending); /* serialize link group
* creation on server
@@ -3556,15 +3557,23 @@ static int __init smc_init(void)
goto out_sock;
}

+ rc = smc_loopback_init();
+ if (rc) {
+ pr_err("%s: smc_loopback_init fails with %d\n", __func__, rc);
+ goto out_ib;
+ }
+
rc = tcp_register_ulp(&smc_ulp_ops);
if (rc) {
pr_err("%s: tcp_ulp_register fails with %d\n", __func__, rc);
- goto out_ib;
+ goto out_lo;
}

static_branch_enable(&tcp_have_smc);
return 0;

+out_lo:
+ smc_loopback_exit();
out_ib:
smc_ib_unregister_client();
out_sock:
@@ -3602,6 +3611,7 @@ static void __exit smc_exit(void)
tcp_unregister_ulp(&smc_ulp_ops);
sock_unregister(PF_SMC);
smc_core_exit();
+ smc_loopback_exit();
smc_ib_unregister_client();
smc_ism_exit();
destroy_workqueue(smc_close_wq);
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
new file mode 100644
index 000000000000..cbb6625ccd0d
--- /dev/null
+++ b/net/smc/smc_loopback.c
@@ -0,0 +1,181 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared Memory Communications Direct over loopback-ism device.
+ *
+ * Provide a SMC-D loopback-ism device.
+ *
+ * Copyright (c) 2024, Alibaba Inc.
+ *
+ * Author: Wen Gu <[email protected]>
+ * Tony Lu <[email protected]>
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/types.h>
+#include <net/smc.h>
+
+#include "smc_ism.h"
+#include "smc_loopback.h"
+
+#if IS_ENABLED(CONFIG_SMC_LO)
+static const char smc_lo_dev_name[] = "loopback-ism";
+static struct smc_lo_dev *lo_dev;
+static struct class *smc_class;
+
+static const struct smcd_ops lo_ops = {
+ .query_remote_gid = NULL,
+ .register_dmb = NULL,
+ .unregister_dmb = NULL,
+ .add_vlan_id = NULL,
+ .del_vlan_id = NULL,
+ .set_vlan_required = NULL,
+ .reset_vlan_required = NULL,
+ .signal_event = NULL,
+ .move_data = NULL,
+ .supports_v2 = NULL,
+ .get_local_gid = NULL,
+ .get_chid = NULL,
+ .get_dev = NULL,
+};
+
+static struct smcd_dev *smcd_lo_alloc_dev(const struct smcd_ops *ops,
+ int max_dmbs)
+{
+ struct smcd_dev *smcd;
+
+ smcd = kzalloc(sizeof(*smcd), GFP_KERNEL);
+ if (!smcd)
+ return NULL;
+
+ smcd->conn = kcalloc(max_dmbs, sizeof(struct smc_connection *),
+ GFP_KERNEL);
+ if (!smcd->conn)
+ goto out_smcd;
+
+ smcd->ops = ops;
+
+ spin_lock_init(&smcd->lock);
+ spin_lock_init(&smcd->lgr_lock);
+ INIT_LIST_HEAD(&smcd->vlan);
+ INIT_LIST_HEAD(&smcd->lgr_list);
+ init_waitqueue_head(&smcd->lgrs_deleted);
+ return smcd;
+
+out_smcd:
+ kfree(smcd);
+ return NULL;
+}
+
+static int smcd_lo_register_dev(struct smc_lo_dev *ldev)
+{
+ struct smcd_dev *smcd;
+
+ smcd = smcd_lo_alloc_dev(&lo_ops, SMC_LO_MAX_DMBS);
+ if (!smcd)
+ return -ENOMEM;
+ ldev->smcd = smcd;
+ smcd->priv = ldev;
+
+ /* TODO:
+ * register loopback-ism to smcd_dev list.
+ */
+ return 0;
+}
+
+static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
+{
+ struct smcd_dev *smcd = ldev->smcd;
+
+ /* TODO:
+ * unregister loopback-ism from smcd_dev list.
+ */
+ kfree(smcd->conn);
+ kfree(smcd);
+}
+
+static int smc_lo_dev_init(struct smc_lo_dev *ldev)
+{
+ return smcd_lo_register_dev(ldev);
+}
+
+static void smc_lo_dev_exit(struct smc_lo_dev *ldev)
+{
+ smcd_lo_unregister_dev(ldev);
+}
+
+static void smc_lo_dev_release(struct device *dev)
+{
+ struct smc_lo_dev *ldev =
+ container_of(dev, struct smc_lo_dev, dev);
+
+ kfree(ldev);
+}
+
+static int smc_lo_dev_probe(void)
+{
+ struct smc_lo_dev *ldev;
+ int ret;
+
+ smc_class = class_create("smc");
+ if (IS_ERR(smc_class))
+ return PTR_ERR(smc_class);
+
+ ldev = kzalloc(sizeof(*ldev), GFP_KERNEL);
+ if (!ldev) {
+ ret = -ENOMEM;
+ goto destroy_class;
+ }
+
+ ldev->dev.parent = NULL;
+ ldev->dev.class = smc_class;
+ ldev->dev.release = smc_lo_dev_release;
+ device_initialize(&ldev->dev);
+ dev_set_name(&ldev->dev, smc_lo_dev_name);
+ ret = device_add(&ldev->dev);
+ if (ret)
+ goto free_dev;
+
+ ret = smc_lo_dev_init(ldev);
+ if (ret)
+ goto del_dev;
+
+ lo_dev = ldev; /* global loopback device */
+ return 0;
+
+del_dev:
+ device_del(&ldev->dev);
+free_dev:
+ put_device(&ldev->dev);
+destroy_class:
+ class_destroy(smc_class);
+ return ret;
+}
+
+static void smc_lo_dev_remove(void)
+{
+ if (!lo_dev)
+ return;
+
+ smc_lo_dev_exit(lo_dev);
+ device_del(&lo_dev->dev); /* device_add in smc_lo_dev_probe */
+ put_device(&lo_dev->dev); /* device_initialize in smc_lo_dev_probe */
+ class_destroy(smc_class);
+}
+#endif
+
+int smc_loopback_init(void)
+{
+#if IS_ENABLED(CONFIG_SMC_LO)
+ return smc_lo_dev_probe();
+#else
+ return 0;
+#endif
+}
+
+void smc_loopback_exit(void)
+{
+#if IS_ENABLED(CONFIG_SMC_LO)
+ smc_lo_dev_remove();
+#endif
+}
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
new file mode 100644
index 000000000000..9dd44d4c0ca3
--- /dev/null
+++ b/net/smc/smc_loopback.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Shared Memory Communications Direct over loopback-ism device.
+ *
+ * Provide a SMC-D loopback-ism device.
+ *
+ * Copyright (c) 2024, Alibaba Inc.
+ *
+ * Author: Wen Gu <[email protected]>
+ * Tony Lu <[email protected]>
+ *
+ */
+
+#ifndef _SMC_LOOPBACK_H
+#define _SMC_LOOPBACK_H
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <net/smc.h>
+
+#if IS_ENABLED(CONFIG_SMC_LO)
+#define SMC_LO_MAX_DMBS 5000
+
+struct smc_lo_dev {
+ struct smcd_dev *smcd;
+ struct device dev;
+};
+#endif
+
+int smc_loopback_init(void);
+void smc_loopback_exit(void);
+
+#endif /* _SMC_LOOPBACK_H */
--
2.32.0.3.g01195cf9f


2024-01-11 12:03:01

by Wen Gu

Subject: [PATCH net-next 04/15] net/smc: implement ID-related operations of loopback-ism

This implements the GID- and CHID-related operations of the loopback-ism
device. loopback-ism acts as an ISMv2 device. Its GID is generated randomly
by the UUIDv4 algorithm, and its CHID is the reserved value 0xFFFF.
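
To illustrate how a 128-bit UUID maps onto the two 64-bit GID halves and how
the remote-GID check works, here is a minimal user-space sketch. The names
(struct gid128, make_uuid_v4(), gid_from_uuid(), query_rgid()) are
hypothetical, rand() merely stands in for the kernel's random source and is
not suitable for real identifiers, and -1 stands in for -ENETUNREACH.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a GID with a 64-bit extension. */
struct gid128 {
	uint64_t gid;
	uint64_t gid_ext;
};

/* Fill 16 bytes with pseudo-random data and stamp the UUIDv4 bits. */
void make_uuid_v4(uint8_t uuid[16])
{
	for (int i = 0; i < 16; i++)
		uuid[i] = (uint8_t)(rand() & 0xff);
	uuid[6] = (uint8_t)((uuid[6] & 0x0f) | 0x40);	/* version 4 */
	uuid[8] = (uint8_t)((uuid[8] & 0x3f) | 0x80);	/* RFC 4122 variant */
}

/* First 8 UUID bytes become gid, the remaining 8 become gid_ext. */
void gid_from_uuid(struct gid128 *g, const uint8_t uuid[16])
{
	assert(g && uuid);
	memcpy(&g->gid, uuid, sizeof(g->gid));
	memcpy(&g->gid_ext, uuid + sizeof(g->gid), sizeof(g->gid_ext));
}

/* In the loopback case the remote GID must equal the local one. */
int query_rgid(const struct gid128 *local, const struct gid128 *remote)
{
	if (remote->gid != local->gid || remote->gid_ext != local->gid_ext)
		return -1;
	return 0;
}
```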

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 62 ++++++++++++++++++++++++++++++++++++++----
net/smc/smc_loopback.h | 3 ++
2 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index cbb6625ccd0d..40dff28d837d 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -19,12 +19,63 @@
#include "smc_loopback.h"

#if IS_ENABLED(CONFIG_SMC_LO)
+#define SMC_LO_V2_CAPABLE 0x1 /* loopback-ism acts as ISMv2 */
+
static const char smc_lo_dev_name[] = "loopback-ism";
static struct smc_lo_dev *lo_dev;
static struct class *smc_class;

+static void smc_lo_generate_id(struct smc_lo_dev *ldev)
+{
+ struct smcd_gid *lgid = &ldev->local_gid;
+ uuid_t uuid;
+
+ uuid_gen(&uuid);
+ memcpy(&lgid->gid, &uuid, sizeof(lgid->gid));
+ memcpy(&lgid->gid_ext, (u8 *)&uuid + sizeof(lgid->gid),
+ sizeof(lgid->gid_ext));
+
+ ldev->chid = SMC_LO_CHID;
+}
+
+static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
+ u32 vid_valid, u32 vid)
+{
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* rgid should equal to lgid in loopback situation */
+ if (!ldev || rgid->gid != ldev->local_gid.gid ||
+ rgid->gid_ext != ldev->local_gid.gid_ext)
+ return -ENETUNREACH;
+ return 0;
+}
+
+static int smc_lo_supports_v2(void)
+{
+ return SMC_LO_V2_CAPABLE;
+}
+
+static void smc_lo_get_local_gid(struct smcd_dev *smcd,
+ struct smcd_gid *smcd_gid)
+{
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ smcd_gid->gid = ldev->local_gid.gid;
+ smcd_gid->gid_ext = ldev->local_gid.gid_ext;
+}
+
+static u16 smc_lo_get_chid(struct smcd_dev *smcd)
+{
+ return ((struct smc_lo_dev *)smcd->priv)->chid;
+}
+
+static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
+{
+ return &((struct smc_lo_dev *)smcd->priv)->dev;
+}
+
static const struct smcd_ops lo_ops = {
- .query_remote_gid = NULL,
+ .query_remote_gid = smc_lo_query_rgid,
.register_dmb = NULL,
.unregister_dmb = NULL,
.add_vlan_id = NULL,
@@ -33,10 +84,10 @@ static const struct smcd_ops lo_ops = {
.reset_vlan_required = NULL,
.signal_event = NULL,
.move_data = NULL,
- .supports_v2 = NULL,
- .get_local_gid = NULL,
- .get_chid = NULL,
- .get_dev = NULL,
+ .supports_v2 = smc_lo_supports_v2,
+ .get_local_gid = smc_lo_get_local_gid,
+ .get_chid = smc_lo_get_chid,
+ .get_dev = smc_lo_get_dev,
};

static struct smcd_dev *smcd_lo_alloc_dev(const struct smcd_ops *ops,
@@ -96,6 +147,7 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)

static int smc_lo_dev_init(struct smc_lo_dev *ldev)
{
+ smc_lo_generate_id(ldev);
return smcd_lo_register_dev(ldev);
}

diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 9dd44d4c0ca3..55b41133a97f 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -20,10 +20,13 @@

#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_MAX_DMBS 5000
+#define SMC_LO_CHID 0xFFFF

struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
+ u16 chid;
+ struct smcd_gid local_gid;
};
#endif

--
2.32.0.3.g01195cf9f


2024-01-11 12:03:31

by Wen Gu

Subject: [PATCH net-next 08/15] net/smc: introduce loopback-ism runtime switch

This provides a runtime switch to activate or deactivate the loopback-ism
device: echo {1|0} > /sys/devices/virtual/smc/loopback-ism/active.
Writing to it triggers the registration of loopback-ism in, or its removal
from, the SMC-D device list.
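
The switch only acts on state changes: writing 1 to an already active device,
or 0 to an inactive one, does nothing. A minimal user-space sketch of that
logic follows. struct lo_dev, parse_switch() and set_active() are hypothetical
names, the counters exist only to make the transitions observable, and
parse_switch() accepts just "0" or "1" (the kernel's kstrtobool() accepts
more spellings).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state; the counters only make the effect visible. */
struct lo_dev {
	bool active;
	int registrations;
	int removals;
};

/* Accept "0" or "1", optionally followed by a single newline. */
int parse_switch(const char *buf, bool *val)
{
	if ((buf[0] != '0' && buf[0] != '1') ||
	    (buf[1] != '\0' && !(buf[1] == '\n' && buf[2] == '\0')))
		return -1;
	*val = (buf[0] == '1');
	return 0;
}

/* Register or remove the device only when the requested state differs. */
int set_active(struct lo_dev *dev, const char *buf)
{
	bool want;

	assert(dev && buf);
	if (parse_switch(buf, &want))
		return -1;
	if (want && !dev->active) {
		dev->registrations++;	/* stands in for registration */
		dev->active = true;
	} else if (!want && dev->active) {
		dev->removals++;	/* stands in for removal */
		dev->active = false;
	}
	return 0;
}
```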

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 55 ++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 1 +
2 files changed, 56 insertions(+)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index db0b45f8560c..3bf7bf5e8c96 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -27,6 +27,58 @@ static const char smc_lo_dev_name[] = "loopback-ism";
static struct smc_lo_dev *lo_dev;
static struct class *smc_class;

+static int smcd_lo_register_dev(struct smc_lo_dev *ldev);
+static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev);
+
+static ssize_t active_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct smc_lo_dev *ldev =
+ container_of(dev, struct smc_lo_dev, dev);
+
+ return sysfs_emit(buf, "%d\n", ldev->active);
+}
+
+static ssize_t active_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct smc_lo_dev *ldev =
+ container_of(dev, struct smc_lo_dev, dev);
+ bool active;
+ int ret;
+
+ ret = kstrtobool(buf, &active);
+ if (ret)
+ return ret;
+
+ if (active && !ldev->active) {
+ /* activate loopback-ism */
+ ret = smcd_lo_register_dev(ldev);
+ if (ret)
+ return ret;
+ } else if (!active && ldev->active) {
+ /* deactivate loopback-ism */
+ smcd_lo_unregister_dev(ldev);
+ }
+
+ return count;
+}
+static DEVICE_ATTR_RW(active);
+static struct attribute *smc_lo_attrs[] = {
+ &dev_attr_active.attr,
+ NULL,
+};
+
+static struct attribute_group smc_lo_attr_group = {
+ .attrs = smc_lo_attrs,
+};
+
+static const struct attribute_group *smc_lo_attr_groups[] = {
+ &smc_lo_attr_group,
+ NULL,
+};
+
static void smc_lo_generate_id(struct smc_lo_dev *ldev)
{
struct smcd_gid *lgid = &ldev->local_gid;
@@ -282,6 +334,7 @@ static int smcd_lo_register_dev(struct smc_lo_dev *ldev)
mutex_lock(&smcd_dev_list.mutex);
list_add(&smcd->list, &smcd_dev_list.list);
mutex_unlock(&smcd_dev_list.mutex);
+ ldev->active = 1;
pr_warn_ratelimited("smc: adding smcd device %s\n",
smc_lo_dev_name);
return 0;
@@ -293,6 +346,7 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)

pr_warn_ratelimited("smc: removing smcd device %s\n",
smc_lo_dev_name);
+ ldev->active = 0;
smcd->going_away = 1;
smc_smcd_terminate_all(smcd);
mutex_lock(&smcd_dev_list.mutex);
@@ -340,6 +394,7 @@ static int smc_lo_dev_probe(void)

ldev->dev.parent = NULL;
ldev->dev.class = smc_class;
+ ldev->dev.groups = smc_lo_attr_groups;
ldev->dev.release = smc_lo_dev_release;
device_initialize(&ldev->dev);
dev_set_name(&ldev->dev, smc_lo_dev_name);
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 24ab9d747613..02a522e322b4 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -35,6 +35,7 @@ struct smc_lo_dmb_node {
struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
+ u8 active;
u16 chid;
struct smcd_gid local_gid;
rwlock_t dmb_ht_lock;
--
2.32.0.3.g01195cf9f


2024-01-11 12:03:55

by Wen Gu

Subject: [PATCH net-next 06/15] net/smc: implement DMB-related operations of loopback-ism

This implements DMB (un)registration and data move operations of
loopback-ism device.
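
DMB registration first has to reserve a free slot index out of a fixed number
of slots. A minimal user-space sketch of such a bitmap allocator follows.
The names (struct idx_mask, idx_alloc(), idx_free()) are hypothetical, and
unlike the kernel's test_and_set_bit() this version is not atomic, so it
assumes a single thread.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_DMBS	5000
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS	((MAX_DMBS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per slot; a set bit means the slot index is in use. */
struct idx_mask {
	unsigned long bits[MASK_WORDS];
};

/* Reserve the lowest free index. Returns -1 when all slots are taken. */
int idx_alloc(struct idx_mask *m)
{
	for (size_t i = 0; i < MAX_DMBS; i++) {
		unsigned long bit = 1UL << (i % BITS_PER_WORD);

		if (!(m->bits[i / BITS_PER_WORD] & bit)) {
			m->bits[i / BITS_PER_WORD] |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Release a previously reserved index so that it can be reused. */
void idx_free(struct idx_mask *m, int idx)
{
	assert(idx >= 0 && idx < MAX_DMBS);
	m->bits[idx / BITS_PER_WORD] &= ~(1UL << (idx % BITS_PER_WORD));
}
```

A failed registration would release the index again, which is what the
error path of a real implementation has to do as well.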

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_cdc.c | 6 ++
net/smc/smc_cdc.h | 1 +
net/smc/smc_loopback.c | 133 ++++++++++++++++++++++++++++++++++++++++-
net/smc/smc_loopback.h | 13 ++++
4 files changed, 150 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 3c06625ceb20..c820ef197610 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -410,6 +410,12 @@ static void smc_cdc_msg_recv(struct smc_sock *smc, struct smc_cdc_msg *cdc)
static void smcd_cdc_rx_tsklet(struct tasklet_struct *t)
{
struct smc_connection *conn = from_tasklet(conn, t, rx_tsklet);
+
+ smcd_cdc_rx_handler(conn);
+}
+
+void smcd_cdc_rx_handler(struct smc_connection *conn)
+{
struct smcd_cdc_msg *data_cdc;
struct smcd_cdc_msg cdc;
struct smc_sock *smc;
diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
index 696cc11f2303..11559d4ebf2b 100644
--- a/net/smc/smc_cdc.h
+++ b/net/smc/smc_cdc.h
@@ -301,5 +301,6 @@ int smcr_cdc_msg_send_validation(struct smc_connection *conn,
struct smc_wr_buf *wr_buf);
int smc_cdc_init(void) __init;
void smcd_cdc_rx_init(struct smc_connection *conn);
+void smcd_cdc_rx_handler(struct smc_connection *conn);

#endif /* SMC_CDC_H */
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 353d4a2d69a1..f72e7b24fc1a 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -15,11 +15,13 @@
#include <linux/types.h>
#include <net/smc.h>

+#include "smc_cdc.h"
#include "smc_ism.h"
#include "smc_loopback.h"

#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_V2_CAPABLE 0x1 /* loopback-ism acts as ISMv2 */
+#define SMC_DMA_ADDR_INVALID (~(dma_addr_t)0)

static const char smc_lo_dev_name[] = "loopback-ism";
static struct smc_lo_dev *lo_dev;
@@ -50,6 +52,97 @@ static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
return 0;
}

+static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
+ void *client_priv)
+{
+ struct smc_lo_dmb_node *dmb_node, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+ int sba_idx, order, rc;
+ struct page *pages;
+
+ /* check space for new dmb */
+ for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LO_MAX_DMBS) {
+ if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
+ break;
+ }
+ if (sba_idx == SMC_LO_MAX_DMBS)
+ return -ENOSPC;
+
+ dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
+ if (!dmb_node) {
+ rc = -ENOMEM;
+ goto err_bit;
+ }
+
+ dmb_node->sba_idx = sba_idx;
+ order = get_order(dmb->dmb_len);
+ pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
+ __GFP_NOMEMALLOC | __GFP_COMP |
+ __GFP_NORETRY | __GFP_ZERO,
+ order);
+ if (!pages) {
+ rc = -ENOMEM;
+ goto err_node;
+ }
+ dmb_node->cpu_addr = (void *)page_address(pages);
+ dmb_node->len = dmb->dmb_len;
+ dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
+
+again:
+ /* add new dmb into hash table */
+ get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
+ write_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_node->token) {
+ if (tmp_node->token == dmb_node->token) {
+ write_unlock(&ldev->dmb_ht_lock);
+ goto again;
+ }
+ }
+ hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
+ write_unlock(&ldev->dmb_ht_lock);
+
+ dmb->sba_idx = dmb_node->sba_idx;
+ dmb->dmb_tok = dmb_node->token;
+ dmb->cpu_addr = dmb_node->cpu_addr;
+ dmb->dma_addr = dmb_node->dma_addr;
+ dmb->dmb_len = dmb_node->len;
+
+ return 0;
+
+err_node:
+ kfree(dmb_node);
+err_bit:
+ clear_bit(sba_idx, ldev->sba_idx_mask);
+ return rc;
+}
+
+static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* remove dmb from hash table */
+ write_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+ if (tmp_node->token == dmb->dmb_tok) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ write_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ hash_del(&dmb_node->list);
+ write_unlock(&ldev->dmb_ht_lock);
+
+ clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+ kfree(dmb_node->cpu_addr);
+ kfree(dmb_node);
+
+ return 0;
+}
+
static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
{
return -EOPNOTSUPP;
@@ -76,6 +169,38 @@ static int smc_lo_signal_event(struct smcd_dev *dev, struct smcd_gid *rgid,
return 0;
}

+static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
+ unsigned int idx, bool sf, unsigned int offset,
+ void *data, unsigned int size)
+{
+ struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ read_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
+ if (tmp_node->token == dmb_tok) {
+ rmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!rmb_node) {
+ read_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock(&ldev->dmb_ht_lock);
+
+ memcpy((char *)rmb_node->cpu_addr + offset, data, size);
+
+ if (sf) {
+ struct smc_connection *conn =
+ smcd->conn[rmb_node->sba_idx];
+
+ if (conn && !conn->killed)
+ smcd_cdc_rx_handler(conn);
+ }
+ return 0;
+}
+
static int smc_lo_supports_v2(void)
{
return SMC_LO_V2_CAPABLE;
@@ -102,14 +227,14 @@ static struct device *smc_lo_get_dev(struct smcd_dev *smcd)

static const struct smcd_ops lo_ops = {
.query_remote_gid = smc_lo_query_rgid,
- .register_dmb = NULL,
- .unregister_dmb = NULL,
+ .register_dmb = smc_lo_register_dmb,
+ .unregister_dmb = smc_lo_unregister_dmb,
.add_vlan_id = smc_lo_add_vlan_id,
.del_vlan_id = smc_lo_del_vlan_id,
.set_vlan_required = smc_lo_set_vlan_required,
.reset_vlan_required = smc_lo_reset_vlan_required,
.signal_event = smc_lo_signal_event,
- .move_data = NULL,
+ .move_data = smc_lo_move_data,
.supports_v2 = smc_lo_supports_v2,
.get_local_gid = smc_lo_get_local_gid,
.get_chid = smc_lo_get_chid,
@@ -174,6 +299,8 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
static int smc_lo_dev_init(struct smc_lo_dev *ldev)
{
smc_lo_generate_id(ldev);
+ rwlock_init(&ldev->dmb_ht_lock);
+ hash_init(ldev->dmb_ht);
return smcd_lo_register_dev(ldev);
}

diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 55b41133a97f..24ab9d747613 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -20,13 +20,26 @@

#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_MAX_DMBS 5000
+#define SMC_LO_DMBS_HASH_BITS 12
#define SMC_LO_CHID 0xFFFF

+struct smc_lo_dmb_node {
+ struct hlist_node list;
+ u64 token;
+ u32 len;
+ u32 sba_idx;
+ void *cpu_addr;
+ dma_addr_t dma_addr;
+};
+
struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
u16 chid;
struct smcd_gid local_gid;
+ rwlock_t dmb_ht_lock;
+ DECLARE_BITMAP(sba_idx_mask, SMC_LO_MAX_DMBS);
+ DECLARE_HASHTABLE(dmb_ht, SMC_LO_DMBS_HASH_BITS);
};
#endif

--
2.32.0.3.g01195cf9f


2024-01-11 12:04:11

by Wen Gu

Subject: [PATCH net-next 09/15] net/smc: introduce loopback-ism statistics attributes

This introduces some statistics attributes of loopback-ism. They can be
read from /sys/devices/virtual/smc/loopback-ism/{xfer_bytes|dmbs_cnt}.
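
The counters are kept per CPU and summed only when an attribute is read.
A minimal user-space sketch of that idea follows; it is not the kernel
code. NR_SLOTS, struct lo_stats, stat_add() and stat_read() are
hypothetical names, and a plain array stands in for the per-CPU area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_SLOTS 4 /* hypothetical stand-in for the number of possible CPUs */

struct lo_stats {
	uint64_t xfer_bytes;
	uint64_t dmbs_cnt;
};

static struct lo_stats slots[NR_SLOTS]; /* one private copy per "CPU" */

/* Writer side: touch only the slot owned by the current "CPU". */
static void stat_add(int cpu, size_t offset, uint64_t val)
{
	uint64_t *field = (uint64_t *)((unsigned char *)&slots[cpu] + offset);

	*field += val;
}

/* Reader side: fold all slots into one snapshot, field by field. */
static void stats_sum(struct lo_stats *out)
{
	size_t nfields = sizeof(*out) / sizeof(uint64_t);
	uint64_t *sum = (uint64_t *)out;
	size_t i;
	int cpu;

	memset(out, 0, sizeof(*out));
	for (cpu = 0; cpu < NR_SLOTS; cpu++) {
		const uint64_t *src = (const uint64_t *)&slots[cpu];

		for (i = 0; i < nfields; i++)
			sum[i] += src[i];
	}
}

/* Return one summed counter, selected by its offset in struct lo_stats. */
static uint64_t stat_read(size_t offset)
{
	struct lo_stats sum;

	assert(offset < sizeof(sum) && offset % sizeof(uint64_t) == 0);
	stats_sum(&sum);
	return *(uint64_t *)((unsigned char *)&sum + offset);
}
```

Because the fields are unsigned 64-bit values, an increment on one slot
and a decrement (adding (uint64_t)-1) on another still sum to the right
total modulo 2^64, which is why a single slot may look "negative" while
the sum stays correct.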

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 74 ++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 22 +++++++++++++
2 files changed, 96 insertions(+)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 3bf7bf5e8c96..a89dbf84aea5 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -30,6 +30,65 @@ static struct class *smc_class;
static int smcd_lo_register_dev(struct smc_lo_dev *ldev);
static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev);

+static void smc_lo_clear_stats(struct smc_lo_dev *ldev)
+{
+ struct smc_lo_dev_stats64 *tmp;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ tmp = per_cpu_ptr(ldev->stats, cpu);
+ tmp->xfer_bytes = 0;
+ }
+}
+
+static void smc_lo_get_stats(struct smc_lo_dev *ldev,
+ struct smc_lo_dev_stats64 *stats)
+{
+ int size, cpu, i;
+ u64 *src, *sum;
+
+ memset(stats, 0, sizeof(*stats));
+ size = sizeof(*stats) / sizeof(u64);
+ for_each_possible_cpu(cpu) {
+ src = (u64 *)per_cpu_ptr(ldev->stats, cpu);
+ sum = (u64 *)stats;
+ for (i = 0; i < size; i++)
+ *(sum++) += *(src++);
+ }
+}
+
+static ssize_t smc_lo_show_stats(struct device *dev,
+ struct device_attribute *attr,
+ char *buf, unsigned long offset)
+{
+ struct smc_lo_dev *ldev =
+ container_of(dev, struct smc_lo_dev, dev);
+ struct smc_lo_dev_stats64 stats;
+ ssize_t ret = -EINVAL;
+
+ if (WARN_ON(offset > sizeof(struct smc_lo_dev_stats64) ||
+ offset % sizeof(u64) != 0))
+ goto out;
+
+ smc_lo_get_stats(ldev, &stats);
+ ret = sysfs_emit(buf, "%llu\n", *(u64 *)(((u8 *)&stats) + offset));
+out:
+ return ret;
+}
+
+/* generate a read-only statistics attribute */
+#define SMC_LO_DEVICE_ATTR_RO(name) \
+static ssize_t name##_show(struct device *dev, \
+ struct device_attribute *attr, char *buf) \
+{ \
+ return smc_lo_show_stats(dev, attr, buf, \
+ offsetof(struct smc_lo_dev_stats64, name)); \
+} \
+static DEVICE_ATTR_RO(name)
+
+SMC_LO_DEVICE_ATTR_RO(xfer_bytes);
+SMC_LO_DEVICE_ATTR_RO(dmbs_cnt);
+
static ssize_t active_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -67,6 +126,8 @@ static ssize_t active_store(struct device *dev,
static DEVICE_ATTR_RW(active);
static struct attribute *smc_lo_attrs[] = {
&dev_attr_active.attr,
+ &dev_attr_xfer_bytes.attr,
+ &dev_attr_dmbs_cnt.attr,
NULL,
};

@@ -152,6 +213,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
}
hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
write_unlock(&ldev->dmb_ht_lock);
+ SMC_LO_STAT_DMBS_INC(ldev);

dmb->sba_idx = dmb_node->sba_idx;
dmb->dmb_tok = dmb_node->token;
@@ -191,6 +253,7 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
kfree(dmb_node->cpu_addr);
kfree(dmb_node);
+ SMC_LO_STAT_DMBS_DEC(ldev);

return 0;
}
@@ -249,6 +312,8 @@ static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,

if (conn && !conn->killed)
smcd_cdc_rx_handler(conn);
+ } else {
+ SMC_LO_STAT_XFER_BYTES(ldev, size);
}
return 0;
}
@@ -354,6 +419,7 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
mutex_unlock(&smcd_dev_list.mutex);
kfree(smcd->conn);
kfree(smcd);
+ smc_lo_clear_stats(ldev);
}

static int smc_lo_dev_init(struct smc_lo_dev *ldev)
@@ -374,6 +440,7 @@ static void smc_lo_dev_release(struct device *dev)
struct smc_lo_dev *ldev =
container_of(dev, struct smc_lo_dev, dev);

+ free_percpu(ldev->stats);
kfree(ldev);
}

@@ -392,6 +459,13 @@ static int smc_lo_dev_probe(void)
goto destroy_class;
}

+ ldev->stats = alloc_percpu(struct smc_lo_dev_stats64);
+ if (!ldev->stats) {
+ ret = -ENOMEM;
+ kfree(ldev);
+ goto destroy_class;
+ }
+
ldev->dev.parent = NULL;
ldev->dev.class = smc_class;
ldev->dev.groups = smc_lo_attr_groups;
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 02a522e322b4..d4572ca42f08 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -32,16 +32,38 @@ struct smc_lo_dmb_node {
dma_addr_t dma_addr;
};

+struct smc_lo_dev_stats64 {
+ __u64 xfer_bytes;
+ __u64 dmbs_cnt;
+};
+
struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
u8 active;
u16 chid;
struct smcd_gid local_gid;
+ struct smc_lo_dev_stats64 __percpu *stats;
rwlock_t dmb_ht_lock;
DECLARE_BITMAP(sba_idx_mask, SMC_LO_MAX_DMBS);
DECLARE_HASHTABLE(dmb_ht, SMC_LO_DMBS_HASH_BITS);
};
+
+#define SMC_LO_STAT_SUB(ldev, key, val) \
+do { \
+ struct smc_lo_dev_stats64 *_stats = (ldev)->stats; \
+ this_cpu_add((*(_stats)).key, val); \
+} \
+while (0)
+
+#define SMC_LO_STAT_XFER_BYTES(ldev, val) \
+ SMC_LO_STAT_SUB(ldev, xfer_bytes, val)
+
+#define SMC_LO_STAT_DMBS_INC(ldev) \
+ SMC_LO_STAT_SUB(ldev, dmbs_cnt, 1)
+
+#define SMC_LO_STAT_DMBS_DEC(ldev) \
+ SMC_LO_STAT_SUB(ldev, dmbs_cnt, -1)
#endif

int smc_loopback_init(void);
--
2.32.0.3.g01195cf9f


2024-01-11 12:04:39

by Wen Gu

Subject: [PATCH net-next 13/15] net/smc: introduce loopback-ism DMB type control

This provides a way to {get|set} the type of DMB offered by loopback-ism,
i.e. whether it is physically or virtually contiguous memory.

echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_type # physically
echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_type # virtually

The settings take effect after re-activating loopback-ism by:

echo 0 > /sys/devices/virtual/smc/loopback-ism/active
echo 1 > /sys/devices/virtual/smc/loopback-ism/active

After this, the link group and DMBs related to loopback-ism will be
flushed and subsequent DMBs created will be of the desired type.
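
The "write now, apply on re-activation" behaviour can be pictured with
the following user-space sketch. It only models the control flow under
the assumption that the device latches the staged value whenever its
active state changes; struct lo_dev, dmb_type_store() and
dev_set_active() are hypothetical stand-ins, not the driver functions.

```c
#include <assert.h>
#include <stdbool.h>

enum { DMB_PHYS = 0, DMB_VIRT = 1 };

struct lo_dev {
	bool active;
	unsigned int dmb_type; /* value in effect for newly created DMBs */
};

/* Staged value, written by the sysfs-like store below. */
static unsigned int staged_dmb_type = DMB_PHYS;

/* Validate and stage the new type; the device is not touched here. */
static int dmb_type_store(unsigned int val)
{
	if (val != DMB_PHYS && val != DMB_VIRT)
		return -1;
	staged_dmb_type = val;
	return 0;
}

/* Latch the staged value whenever the active state actually changes. */
static void dev_set_active(struct lo_dev *dev, bool on)
{
	if (on != dev->active)
		dev->dmb_type = staged_dmb_type;
	dev->active = on;
}
```

Keeping the staged value separate from the per-device value means DMBs
that already exist never see a type change underneath them.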

The motivation for this control is that a physically contiguous DMB has
the best performance but is usually expensive, while a virtually
contiguous DMB is cheap and performs well in most scenarios. However, if
sndbuf and DMB are merged, the virtual DMB is accessed concurrently on
Tx and Rx, and lock contention in find_vmap_area becomes a bottleneck
when there are many CPUs and CONFIG_HARDENED_USERCOPY is set (see link
below). So an option is provided.

Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 80 +++++++++++++++++++++++++++++++++++-------
net/smc/smc_loopback.h | 6 ++++
2 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index a89dbf84aea5..2e734f8e08f5 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -13,6 +13,7 @@

#include <linux/device.h>
#include <linux/types.h>
+#include <linux/vmalloc.h>
#include <net/smc.h>

#include "smc_cdc.h"
@@ -24,6 +25,7 @@
#define SMC_DMA_ADDR_INVALID (~(dma_addr_t)0)

static const char smc_lo_dev_name[] = "loopback-ism";
+static unsigned int smc_lo_dmb_type = SMC_LO_DMB_PHYS;
static struct smc_lo_dev *lo_dev;
static struct class *smc_class;

@@ -124,8 +126,50 @@ static ssize_t active_store(struct device *dev,
return count;
}
static DEVICE_ATTR_RW(active);
+
+static ssize_t dmb_type_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct smc_lo_dev *ldev =
+ container_of(dev, struct smc_lo_dev, dev);
+ const char *type;
+
+ switch (ldev->dmb_type) {
+ case SMC_LO_DMB_PHYS:
+ type = "Physically contiguous buffer";
+ break;
+ case SMC_LO_DMB_VIRT:
+ type = "Virtually contiguous buffer";
+ break;
+ default:
+ type = "Unknown type";
+ }
+
+ return sysfs_emit(buf, "%d: %s\n", ldev->dmb_type, type);
+}
+
+static ssize_t dmb_type_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned int dmb_type;
+ int ret;
+
+ ret = kstrtouint(buf, 0, &dmb_type);
+ if (ret)
+ return ret;
+
+ if (dmb_type != SMC_LO_DMB_PHYS &&
+ dmb_type != SMC_LO_DMB_VIRT)
+ return -EINVAL;
+
+ smc_lo_dmb_type = dmb_type; /* re-activate to take effect */
+ return count;
+}
+static DEVICE_ATTR_RW(dmb_type);
static struct attribute *smc_lo_attrs[] = {
&dev_attr_active.attr,
+ &dev_attr_dmb_type.attr,
&dev_attr_xfer_bytes.attr,
&dev_attr_dmbs_cnt.attr,
NULL,
@@ -170,8 +214,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
{
struct smc_lo_dmb_node *dmb_node, *tmp_node;
struct smc_lo_dev *ldev = smcd->priv;
- int sba_idx, order, rc;
- struct page *pages;
+ int sba_idx, rc;

/* check space for new dmb */
for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LO_MAX_DMBS) {
@@ -188,16 +231,27 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
}

dmb_node->sba_idx = sba_idx;
- order = get_order(dmb->dmb_len);
- pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
- __GFP_NOMEMALLOC | __GFP_COMP |
- __GFP_NORETRY | __GFP_ZERO,
- order);
- if (!pages) {
- rc = -ENOMEM;
- goto err_node;
+ if (ldev->dmb_type == SMC_LO_DMB_PHYS) {
+ struct page *pages;
+ int order;
+
+ order = get_order(dmb->dmb_len);
+ pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
+ __GFP_NOMEMALLOC | __GFP_COMP |
+ __GFP_NORETRY | __GFP_ZERO,
+ order);
+ if (!pages) {
+ rc = -ENOMEM;
+ goto err_node;
+ }
+ dmb_node->cpu_addr = (void *)page_address(pages);
+ } else {
+ dmb_node->cpu_addr = vzalloc(dmb->dmb_len);
+ if (!dmb_node->cpu_addr) {
+ rc = -ENOMEM;
+ goto err_node;
+ }
}
- dmb_node->cpu_addr = (void *)page_address(pages);
dmb_node->len = dmb->dmb_len;
dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;

@@ -251,7 +305,7 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
write_unlock(&ldev->dmb_ht_lock);

clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
- kfree(dmb_node->cpu_addr);
+ kvfree(dmb_node->cpu_addr);
kfree(dmb_node);
SMC_LO_STAT_DMBS_DEC(ldev);

@@ -396,6 +450,7 @@ static int smcd_lo_register_dev(struct smc_lo_dev *ldev)
ldev->smcd = smcd;
smcd->priv = ldev;
smc_ism_set_v2_capable();
+ ldev->dmb_type = smc_lo_dmb_type;
mutex_lock(&smcd_dev_list.mutex);
list_add(&smcd->list, &smcd_dev_list.list);
mutex_unlock(&smcd_dev_list.mutex);
@@ -419,6 +474,7 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
mutex_unlock(&smcd_dev_list.mutex);
kfree(smcd->conn);
kfree(smcd);
+ ldev->dmb_type = smc_lo_dmb_type;
smc_lo_clear_stats(ldev);
}

diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index d4572ca42f08..8ee5c6805fc4 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -23,6 +23,11 @@
#define SMC_LO_DMBS_HASH_BITS 12
#define SMC_LO_CHID 0xFFFF

+enum {
+ SMC_LO_DMB_PHYS,
+ SMC_LO_DMB_VIRT,
+};
+
struct smc_lo_dmb_node {
struct hlist_node list;
u64 token;
@@ -41,6 +46,7 @@ struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
u8 active;
+ u8 dmb_type;
u16 chid;
struct smcd_gid local_gid;
struct smc_lo_dev_stats64 __percpu *stats;
--
2.32.0.3.g01195cf9f


2024-01-11 12:05:01

by Wen Gu

Subject: [PATCH net-next 14/15] net/smc: introduce loopback-ism DMB data copy control

This provides a way to {get|set} whether loopback-ism device supports
merging sndbuf with peer DMB to eliminate data copies between them.

echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # merging supported, no data copy
echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # merging not supported, data copied

The settings take effect after re-activating loopback-ism by:

echo 0 > /sys/devices/virtual/smc/loopback-ism/active
echo 1 > /sys/devices/virtual/smc/loopback-ism/active

After this, the link group related to loopback-ism will be flushed and
the sndbufs of subsequent connections will be merged or not merged with
peer DMB.

The motivation for this control is that bandwidth improves greatly when
sndbuf and DMB are merged. However, when a virtually contiguous DMB is
provided and merged with sndbuf, it is accessed concurrently on Tx and
Rx, and lock contention in find_vmap_area becomes a bottleneck when
there are many CPUs and CONFIG_HARDENED_USERCOPY is set (see link
below). So an option is provided.

Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 46 ++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 8 +++++++-
2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 2e734f8e08f5..bfbb346ef01a 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -26,6 +26,7 @@

static const char smc_lo_dev_name[] = "loopback-ism";
static unsigned int smc_lo_dmb_type = SMC_LO_DMB_PHYS;
+static unsigned int smc_lo_dmb_copy = SMC_LO_DMB_NOCOPY;
static struct smc_lo_dev *lo_dev;
static struct class *smc_class;

@@ -167,9 +168,52 @@ static ssize_t dmb_type_store(struct device *dev,
return count;
}
static DEVICE_ATTR_RW(dmb_type);
+
+static ssize_t dmb_copy_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct smc_lo_dev *ldev =
+ container_of(dev, struct smc_lo_dev, dev);
+ const char *copy;
+
+ switch (ldev->dmb_copy) {
+ case SMC_LO_DMB_NOCOPY:
+ copy = "sndbuf and DMB merged and no data copied";
+ break;
+ case SMC_LO_DMB_COPY:
+ copy = "sndbuf and DMB separated and data copied";
+ break;
+ default:
+ copy = "Unknown setting";
+ }
+
+ return sysfs_emit(buf, "%d: %s\n", ldev->dmb_copy, copy);
+}
+
+static ssize_t dmb_copy_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned int dmb_copy;
+ int ret;
+
+ ret = kstrtouint(buf, 0, &dmb_copy);
+ if (ret)
+ return ret;
+
+ if (dmb_copy != SMC_LO_DMB_NOCOPY &&
+ dmb_copy != SMC_LO_DMB_COPY)
+ return -EINVAL;
+
+ smc_lo_dmb_copy = dmb_copy; /* re-activate to take effect */
+ return count;
+}
+static DEVICE_ATTR_RW(dmb_copy);
+
static struct attribute *smc_lo_attrs[] = {
&dev_attr_active.attr,
&dev_attr_dmb_type.attr,
+ &dev_attr_dmb_copy.attr,
&dev_attr_xfer_bytes.attr,
&dev_attr_dmbs_cnt.attr,
NULL,
@@ -451,6 +495,7 @@ static int smcd_lo_register_dev(struct smc_lo_dev *ldev)
smcd->priv = ldev;
smc_ism_set_v2_capable();
ldev->dmb_type = smc_lo_dmb_type;
+ ldev->dmb_copy = smc_lo_dmb_copy;
mutex_lock(&smcd_dev_list.mutex);
list_add(&smcd->list, &smcd_dev_list.list);
mutex_unlock(&smcd_dev_list.mutex);
@@ -475,6 +520,7 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
kfree(smcd->conn);
kfree(smcd);
ldev->dmb_type = smc_lo_dmb_type;
+ ldev->dmb_copy = smc_lo_dmb_copy;
smc_lo_clear_stats(ldev);
}

diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 8ee5c6805fc4..7ecb4a35eb36 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -28,6 +28,11 @@ enum {
SMC_LO_DMB_VIRT,
};

+enum {
+ SMC_LO_DMB_NOCOPY,
+ SMC_LO_DMB_COPY,
+};
+
struct smc_lo_dmb_node {
struct hlist_node list;
u64 token;
@@ -45,7 +50,8 @@ struct smc_lo_dev_stats64 {
struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
- u8 active;
+ u8 active : 1;
+ u8 dmb_copy : 1;
u8 dmb_type;
u16 chid;
struct smcd_gid local_gid;
--
2.32.0.3.g01195cf9f


2024-01-11 12:05:53

by Wen Gu

Subject: [PATCH net-next 12/15] net/smc: adapt cursor update when sndbuf and peer DMB are merged

Since the ghost sndbuf shares the same physical memory with the peer DMB,
the cursor update processing needs to be adapted to ensure that the
data to be consumed won't be overwritten.

So in this case, the fin_curs and sndbuf_space that were originally
updated after sending the CDC message should not be updated until the
peer updates cons_curs.
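
A simplified user-space model of the two accounting modes is sketched
below. It is an illustration only: cursors are plain byte counters
instead of wrapping SMC cursors, and struct conn, tx_send() and
peer_consumed() are hypothetical names.

```c
#include <assert.h>

#define BUF_LEN 16u

struct conn {
	unsigned int sent;  /* bytes sent so far (models tx_curs_sent) */
	unsigned int fin;   /* bytes released so far (models tx_curs_fin) */
	unsigned int space; /* free bytes in sndbuf (models sndbuf_space) */
	int merged;         /* non-zero: sndbuf shares memory with peer DMB */
};

static void conn_init(struct conn *c, int merged)
{
	c->sent = 0;
	c->fin = 0;
	c->space = BUF_LEN;
	c->merged = merged;
}

/* Send len bytes; fails if the send buffer has no room for them. */
static int tx_send(struct conn *c, unsigned int len)
{
	if (len > c->space)
		return -1;
	c->space -= len;
	c->sent += len;
	if (!c->merged) {
		/* Data was copied to the peer, so the space is free again. */
		c->space += c->sent - c->fin;
		c->fin = c->sent;
	}
	return 0;
}

/* Peer reports that it has consumed everything up to cons. */
static void peer_consumed(struct conn *c, unsigned int cons)
{
	if (!c->merged)
		return;
	assert(cons >= c->fin && cons <= c->sent);
	/* Only now may the shared region be reused by the sender. */
	c->space += cons - c->fin;
	c->fin = cons;
}
```

In the merged case the sender stalls once the shared region is full and
resumes only after the peer's consumer cursor moves, which is the
property that keeps unread data from being overwritten.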

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_cdc.c | 52 +++++++++++++++++++++++++++++++++++++----------
1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index c820ef197610..e938fe3bcc7c 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -18,6 +18,7 @@
#include "smc_tx.h"
#include "smc_rx.h"
#include "smc_close.h"
+#include "smc_ism.h"

/********************************** send *************************************/

@@ -255,17 +256,25 @@ int smcd_cdc_msg_send(struct smc_connection *conn)
return rc;
smc_curs_copy(&conn->rx_curs_confirmed, &curs, conn);
conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
- /* Calculate transmitted data and increment free send buffer space */
- diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
- &conn->tx_curs_sent);
- /* increased by confirmed number of bytes */
- smp_mb__before_atomic();
- atomic_add(diff, &conn->sndbuf_space);
- /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
- smp_mb__after_atomic();
- smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
+ if (!smc_ism_support_dmb_nocopy(conn->lgr->smcd)) {
+ /* Ghost sndbuf shares the same memory region with
+ * peer DMB, so don't update the tx_curs_fin and
+ * sndbuf_space until peer has consumed the data.
+ */
+ /* Calculate transmitted data and increment free
+ * send buffer space
+ */
+ diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+ &conn->tx_curs_sent);
+ /* increased by confirmed number of bytes */
+ smp_mb__before_atomic();
+ atomic_add(diff, &conn->sndbuf_space);
+ /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+ smp_mb__after_atomic();
+ smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);

- smc_tx_sndbuf_nonfull(smc);
+ smc_tx_sndbuf_nonfull(smc);
+ }
return rc;
}

@@ -323,7 +332,7 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
{
union smc_host_cursor cons_old, prod_old;
struct smc_connection *conn = &smc->conn;
- int diff_cons, diff_prod;
+ int diff_cons, diff_prod, diff_tx;

smc_curs_copy(&prod_old, &conn->local_rx_ctrl.prod, conn);
smc_curs_copy(&cons_old, &conn->local_rx_ctrl.cons, conn);
@@ -339,6 +348,27 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
atomic_add(diff_cons, &conn->peer_rmbe_space);
/* guarantee 0 <= peer_rmbe_space <= peer_rmbe_size */
smp_mb__after_atomic();
+
+ if (conn->lgr->is_smcd &&
+ smc_ism_support_dmb_nocopy(conn->lgr->smcd)) {
+ /* Ghost sndbuf shares the same memory region with
+ * peer RMB, so update tx_curs_fin and sndbuf_space
+ * when peer has consumed the data.
+ */
+ /* calculate peer rmb consumed data */
+ diff_tx = smc_curs_diff(conn->sndbuf_desc->len,
+ &conn->tx_curs_fin,
+ &conn->local_rx_ctrl.cons);
+ /* increase local sndbuf space and fin_curs */
+ smp_mb__before_atomic();
+ atomic_add(diff_tx, &conn->sndbuf_space);
+ /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+ smp_mb__after_atomic();
+ smc_curs_copy(&conn->tx_curs_fin,
+ &conn->local_rx_ctrl.cons, conn);
+
+ smc_tx_sndbuf_nonfull(smc);
+ }
}

diff_prod = smc_curs_diff(conn->rmb_desc->len, &prod_old,
--
2.32.0.3.g01195cf9f


2024-01-11 12:05:55

by Wen Gu

Subject: [PATCH net-next 15/15] net/smc: implement DMB-merged operations of loopback-ism

This implements operations related to merging sndbuf with peer DMB in
loopback-ism. The DMB won't be unregistered until no sndbuf is attached
to it.
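
The lifetime rule can be sketched in user space with C11 atomics, as
below. This is an assumption-laden model rather than the driver code:
the owner holds one reference from registration, each attached sndbuf
holds one more, and whoever drops the last reference is the one allowed
to free. dmb_register(), dmb_attach() and dmb_put() are hypothetical
names; the real patch sleeps on a wait queue instead of returning a flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct dmb_node {
	atomic_uint refcnt;
};

/* Registration: the owner of the DMB holds the initial reference. */
static void dmb_register(struct dmb_node *n)
{
	atomic_init(&n->refcnt, 1);
}

/* Attach a ghost sndbuf: take a reference unless the count is already 0. */
static bool dmb_attach(struct dmb_node *n)
{
	unsigned int old = atomic_load(&n->refcnt);

	do {
		if (old == 0)
			return false; /* DMB is going away */
	} while (!atomic_compare_exchange_weak(&n->refcnt, &old, old + 1));
	return true;
}

/*
 * Drop one reference (used by both unregister and detach).
 * Returns true if this was the last reference, i.e. the memory may be
 * freed now; false means another holder still exists.
 */
static bool dmb_put(struct dmb_node *n)
{
	return atomic_fetch_sub(&n->refcnt, 1) == 1;
}
```

When the owner's dmb_put() returns false, the unregister path has to
wait; the later dmb_put() from the detaching sndbuf returns true and is
the point where a waiter would be woken.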

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 101 +++++++++++++++++++++++++++++++++++++++--
net/smc/smc_loopback.h | 4 ++
2 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index bfbb346ef01a..296a4d1f1a33 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -298,6 +298,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
}
dmb_node->len = dmb->dmb_len;
dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
+ refcount_set(&dmb_node->refcnt, 1);

again:
/* add new dmb into hash table */
@@ -311,6 +312,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
}
hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
write_unlock(&ldev->dmb_ht_lock);
+ atomic_inc(&ldev->dmb_cnt);
SMC_LO_STAT_DMBS_INC(ldev);

dmb->sba_idx = dmb_node->sba_idx;
@@ -333,8 +335,8 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
struct smc_lo_dev *ldev = smcd->priv;

- /* remove dmb from hash table */
- write_lock(&ldev->dmb_ht_lock);
+ /* find dmb from hash table */
+ read_lock(&ldev->dmb_ht_lock);
hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
if (tmp_node->token == dmb->dmb_tok) {
dmb_node = tmp_node;
@@ -342,9 +344,18 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
}
}
if (!dmb_node) {
- write_unlock(&ldev->dmb_ht_lock);
+ read_unlock(&ldev->dmb_ht_lock);
return -EINVAL;
}
+ read_unlock(&ldev->dmb_ht_lock);
+
+ /* wait for peer sndbuf to detach from this dmb */
+ if (!refcount_dec_and_test(&dmb_node->refcnt))
+ wait_event(ldev->dmbs_release,
+ !refcount_read(&dmb_node->refcnt));
+
+ /* remove dmb from hash table */
+ write_lock(&ldev->dmb_ht_lock);
hash_del(&dmb_node->list);
write_unlock(&ldev->dmb_ht_lock);

@@ -353,6 +364,73 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
kfree(dmb_node);
SMC_LO_STAT_DMBS_DEC(ldev);

+ if (atomic_dec_and_test(&ldev->dmb_cnt))
+ wake_up(&ldev->ldev_release);
+ return 0;
+}
+
+static int smc_lo_support_dmb_nocopy(struct smcd_dev *smcd)
+{
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ return (ldev->dmb_copy == SMC_LO_DMB_NOCOPY);
+}
+
+static int smc_lo_attach_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* find dmb_node according to dmb->dmb_tok */
+ read_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+ if (tmp_node->token == dmb->dmb_tok) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ read_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock(&ldev->dmb_ht_lock);
+
+ if (!refcount_inc_not_zero(&dmb_node->refcnt))
+ /* the dmb is being unregistered, but has
+ * not been removed from the hash table.
+ */
+ return -EINVAL;
+
+ /* provide dmb information */
+ dmb->sba_idx = dmb_node->sba_idx;
+ dmb->dmb_tok = dmb_node->token;
+ dmb->cpu_addr = dmb_node->cpu_addr;
+ dmb->dma_addr = dmb_node->dma_addr;
+ dmb->dmb_len = dmb_node->len;
+ return 0;
+}
+
+static int smc_lo_detach_dmb(struct smcd_dev *smcd, u64 token)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* find dmb_node according to dmb->dmb_tok */
+ read_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, token) {
+ if (tmp_node->token == token) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ read_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock(&ldev->dmb_ht_lock);
+
+ if (refcount_dec_and_test(&dmb_node->refcnt))
+ wake_up_all(&ldev->dmbs_release);
return 0;
}

@@ -389,6 +467,14 @@ static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
struct smc_lo_dev *ldev = smcd->priv;

+ /* if sndbuf is merged with peer DMB, there is
+ * no need to copy data from sndbuf to peer DMB.
+ */
+ if (!sf && smc_lo_support_dmb_nocopy(smcd)) {
+ SMC_LO_STAT_XFER_BYTES(ldev, size);
+ return 0;
+ }
+
read_lock(&ldev->dmb_ht_lock);
hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
if (tmp_node->token == dmb_tok) {
@@ -444,6 +530,9 @@ static const struct smcd_ops lo_ops = {
.query_remote_gid = smc_lo_query_rgid,
.register_dmb = smc_lo_register_dmb,
.unregister_dmb = smc_lo_unregister_dmb,
+ .support_dmb_nocopy = smc_lo_support_dmb_nocopy,
+ .attach_dmb = smc_lo_attach_dmb,
+ .detach_dmb = smc_lo_detach_dmb,
.add_vlan_id = smc_lo_add_vlan_id,
.del_vlan_id = smc_lo_del_vlan_id,
.set_vlan_required = smc_lo_set_vlan_required,
@@ -529,12 +618,18 @@ static int smc_lo_dev_init(struct smc_lo_dev *ldev)
smc_lo_generate_id(ldev);
rwlock_init(&ldev->dmb_ht_lock);
hash_init(ldev->dmb_ht);
+ atomic_set(&ldev->dmb_cnt, 0);
+ init_waitqueue_head(&ldev->dmbs_release);
+ init_waitqueue_head(&ldev->ldev_release);
+
return smcd_lo_register_dev(ldev);
}

static void smc_lo_dev_exit(struct smc_lo_dev *ldev)
{
smcd_lo_unregister_dev(ldev);
+ if (atomic_read(&ldev->dmb_cnt))
+ wait_event(ldev->ldev_release, !atomic_read(&ldev->dmb_cnt));
}

static void smc_lo_dev_release(struct device *dev)
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 7ecb4a35eb36..19a1eace2255 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -40,6 +40,7 @@ struct smc_lo_dmb_node {
u32 sba_idx;
void *cpu_addr;
dma_addr_t dma_addr;
+ refcount_t refcnt;
};

struct smc_lo_dev_stats64 {
@@ -56,9 +57,12 @@ struct smc_lo_dev {
u16 chid;
struct smcd_gid local_gid;
struct smc_lo_dev_stats64 __percpu *stats;
+ atomic_t dmb_cnt;
rwlock_t dmb_ht_lock;
DECLARE_BITMAP(sba_idx_mask, SMC_LO_MAX_DMBS);
DECLARE_HASHTABLE(dmb_ht, SMC_LO_DMBS_HASH_BITS);
+ wait_queue_head_t dmbs_release;
+ wait_queue_head_t ldev_release;
};

#define SMC_LO_STAT_SUB(ldev, key, val) \
--
2.32.0.3.g01195cf9f


2024-01-11 12:06:52

by Wen Gu

Subject: [PATCH net-next 10/15] net/smc: add operations to merge sndbuf with peer DMB

In some scenarios using a virtual ISM device, sndbuf can share the same
physical memory region with the peer DMB to avoid copying data from one
side to the other. In such a case the sndbuf is only a descriptor that
describes the shared memory and does not actually occupy memory; it is
more like a ghost buffer.

 +----------+                     +----------+
 | socket A |                     | socket B |
 +----------+                     +----------+
       |                               |
  +--------+                       +--------+
  | sndbuf |                       |  DMB   |
  |  desc  |                       |  desc  |
  +--------+                       +--------+
       |                               |
       |                          +----v-----+
       +-------------------------->  memory  |
                                  +----------+

So this patch introduces three new SMC-D device operations to check
whether this feature is supported by the device, and to {attach|detach}
a ghost sndbuf to a peer DMB. For now only loopback-ism supports this.
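
The new operations are optional, so callers go through wrappers that
tolerate devices that do not implement them. A stand-alone sketch of
that pattern follows; struct dev, struct dev_ops and the lo_* callbacks
are hypothetical simplifications of the smcd structures, kept only to
show the NULL checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct dev;

struct dev_ops {
	int (*support_dmb_nocopy)(struct dev *d);
	int (*attach_dmb)(struct dev *d, uint64_t token);
	int (*detach_dmb)(struct dev *d, uint64_t token);
};

struct dev {
	const struct dev_ops *ops;
	int attached; /* number of ghost sndbufs currently attached */
};

/* Feature check: a missing callback simply means "not supported". */
static bool dev_support_nocopy(struct dev *d)
{
	return d->ops->support_dmb_nocopy && d->ops->support_dmb_nocopy(d);
}

static int dev_attach(struct dev *d, uint64_t token)
{
	if (!d->ops->attach_dmb)
		return -EINVAL;
	return d->ops->attach_dmb(d, token);
}

static int dev_detach(struct dev *d, uint64_t token)
{
	if (!d->ops->detach_dmb)
		return -EINVAL;
	return d->ops->detach_dmb(d, token);
}

/* A loopback-like device that implements all three callbacks. */
static int lo_support(struct dev *d) { (void)d; return 1; }
static int lo_attach(struct dev *d, uint64_t token)
{
	(void)token;
	d->attached++;
	return 0;
}
static int lo_detach(struct dev *d, uint64_t token)
{
	(void)token;
	d->attached--;
	return 0;
}

static const struct dev_ops lo_like_ops = { lo_support, lo_attach, lo_detach };
/* A device that predates the new callbacks leaves them NULL. */
static const struct dev_ops legacy_ops = { NULL, NULL, NULL };
```

With this shape, existing ISM drivers need no change: their ops tables
leave the new members NULL and the wrappers report "unsupported".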

Signed-off-by: Wen Gu <[email protected]>
---
include/net/smc.h | 3 +++
net/smc/smc_ism.c | 40 ++++++++++++++++++++++++++++++++++++++++
net/smc/smc_ism.h | 4 ++++
3 files changed, 47 insertions(+)

diff --git a/include/net/smc.h b/include/net/smc.h
index 6273c3a8b24a..01387631d8a6 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -62,6 +62,9 @@ struct smcd_ops {
int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
void *client);
int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+ int (*support_dmb_nocopy)(struct smcd_dev *dev);
+ int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+ int (*detach_dmb)(struct smcd_dev *dev, u64 token);
int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*set_vlan_required)(struct smcd_dev *dev);
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 4065ebd2e43d..2d2781724932 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -246,6 +246,46 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
return rc;
}

+bool smc_ism_support_dmb_nocopy(struct smcd_dev *smcd)
+{
+ /* for now only loopback-ism supports
+ * merging sndbuf with peer DMB to avoid
+ * data copies between them.
+ */
+ return (smcd->ops->support_dmb_nocopy &&
+ smcd->ops->support_dmb_nocopy(smcd));
+}
+
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token,
+ struct smc_buf_desc *dmb_desc)
+{
+ struct smcd_dmb dmb;
+ int rc = 0;
+
+ if (!dev->ops->attach_dmb)
+ return -EINVAL;
+
+ memset(&dmb, 0, sizeof(dmb));
+ dmb.dmb_tok = token;
+ rc = dev->ops->attach_dmb(dev, &dmb);
+ if (!rc) {
+ dmb_desc->sba_idx = dmb.sba_idx;
+ dmb_desc->token = dmb.dmb_tok;
+ dmb_desc->cpu_addr = dmb.cpu_addr;
+ dmb_desc->dma_addr = dmb.dma_addr;
+ dmb_desc->len = dmb.dmb_len;
+ }
+ return rc;
+}
+
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token)
+{
+ if (!dev->ops->detach_dmb)
+ return -EINVAL;
+
+ return dev->ops->detach_dmb(dev, token);
+}
+
static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
struct sk_buff *skb,
struct netlink_callback *cb)
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 6903cd5d4d4d..8ea5ab737c6f 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -48,6 +48,10 @@ int smc_ism_put_vlan(struct smcd_dev *dev, unsigned short vlan_id);
int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
struct smc_buf_desc *dmb_desc);
int smc_ism_unregister_dmb(struct smcd_dev *dev, struct smc_buf_desc *dmb_desc);
+bool smc_ism_support_dmb_nocopy(struct smcd_dev *smcd);
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token,
+ struct smc_buf_desc *dmb_desc);
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token);
int smc_ism_signal_shutdown(struct smc_link_group *lgr);
void smc_ism_get_system_eid(u8 **eid);
u16 smc_ism_get_chid(struct smcd_dev *dev);
--
2.32.0.3.g01195cf9f
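
Note on the helpers in this patch: smc_ism_support_dmb_nocopy(),
smc_ism_attach_dmb() and smc_ism_detach_dmb() all treat the new smcd_ops
callbacks as optional, so a device that does not implement them is simply
reported as "not supported". The block below is a minimal user-space sketch
of that pattern, not the kernel code. struct opt_dev, struct opt_ops and every
opt_* / lo_* name are hypothetical stand-ins for struct smcd_dev and
struct smcd_ops, reduced to two callbacks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct opt_dev;

/* Hypothetical, reduced stand-in for struct smcd_ops: both ops are optional. */
struct opt_ops {
	int (*support_dmb_nocopy)(struct opt_dev *dev);
	int (*detach_dmb)(struct opt_dev *dev, uint64_t token);
};

/* Hypothetical, reduced stand-in for struct smcd_dev. */
struct opt_dev {
	const struct opt_ops *ops;
};

/* A missing callback means the feature is not supported. */
static bool opt_support_dmb_nocopy(struct opt_dev *dev)
{
	assert(dev && dev->ops);
	return dev->ops->support_dmb_nocopy &&
	       dev->ops->support_dmb_nocopy(dev);
}

/* A missing callback is reported to the caller as -EINVAL. */
static int opt_detach_dmb(struct opt_dev *dev, uint64_t token)
{
	assert(dev && dev->ops);
	if (!dev->ops->detach_dmb)
		return -EINVAL;
	return dev->ops->detach_dmb(dev, token);
}

/* A loopback-like device that implements both callbacks. */
static int lo_support_dmb_nocopy(struct opt_dev *dev)
{
	(void)dev;
	return 1;
}

static int lo_detach_dmb(struct opt_dev *dev, uint64_t token)
{
	(void)dev;
	(void)token;
	return 0;
}

static const struct opt_ops opt_lo_ops = {
	.support_dmb_nocopy = lo_support_dmb_nocopy,
	.detach_dmb = lo_detach_dmb,
};

/* A device that leaves both callbacks unset (NULL). */
static const struct opt_ops opt_hw_ops = { 0 };
```

With a struct opt_dev pointing at opt_lo_ops, opt_support_dmb_nocopy()
returns true and opt_detach_dmb() forwards to the device. With opt_hw_ops the
first returns false and the second returns -EINVAL, which is the behaviour
callers of the real helpers have to be prepared for.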


2024-01-11 12:07:19

by Wen Gu

Subject: [PATCH net-next 11/15] net/smc: attach or detach ghost sndbuf to peer DMB

The ghost sndbuf descriptor is created and attached to the peer DMB
once the peer token is obtained, and it is detached and freed when the
connection is freed.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/af_smc.c | 16 ++++++++++++
net/smc/smc_core.c | 61 +++++++++++++++++++++++++++++++++++++++++++++-
net/smc/smc_core.h | 1 +
3 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 189aea09b66e..96a6e5f13351 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1437,6 +1437,14 @@ static int smc_connect_ism(struct smc_sock *smc,
}

smc_conn_save_peer_info(smc, aclc);
+
+ if (smc_ism_support_dmb_nocopy(smc->conn.lgr->smcd)) {
+ rc = smcd_buf_attach(smc);
+ if (rc) {
+ rc = SMC_CLC_DECL_MEM; /* try to fallback */
+ goto connect_abort;
+ }
+ }
smc_close_init(smc);
smc_rx_init(smc);
smc_tx_init(smc);
@@ -2541,6 +2549,14 @@ static void smc_listen_work(struct work_struct *work)
mutex_unlock(&smc_server_lgr_pending);
}
smc_conn_save_peer_info(new_smc, cclc);
+
+ if (ini->is_smcd &&
+ smc_ism_support_dmb_nocopy(new_smc->conn.lgr->smcd)) {
+ rc = smcd_buf_attach(new_smc);
+ if (rc)
+ goto out_decl;
+ }
+
smc_listen_out_connected(new_smc);
SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk), ini);
goto out_free;
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 95cc95458e2d..da6a8d9c81ea 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1149,6 +1149,20 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
}
}

+static void smcd_buf_detach(struct smc_connection *conn)
+{
+ struct smcd_dev *smcd = conn->lgr->smcd;
+ u64 peer_token = conn->peer_token;
+
+ if (!conn->sndbuf_desc)
+ return;
+
+ smc_ism_detach_dmb(smcd, peer_token);
+
+ kfree(conn->sndbuf_desc);
+ conn->sndbuf_desc = NULL;
+}
+
static void smc_buf_unuse(struct smc_connection *conn,
struct smc_link_group *lgr)
{
@@ -1192,6 +1206,8 @@ void smc_conn_free(struct smc_connection *conn)
if (lgr->is_smcd) {
if (!list_empty(&lgr->list))
smc_ism_unset_conn(conn);
+ if (smc_ism_support_dmb_nocopy(lgr->smcd))
+ smcd_buf_detach(conn);
tasklet_kill(&conn->rx_tsklet);
} else {
smc_cdc_wait_pend_tx_wr(conn);
@@ -1445,6 +1461,8 @@ static void smc_conn_kill(struct smc_connection *conn, bool soft)
smc_sk_wake_ups(smc);
if (conn->lgr->is_smcd) {
smc_ism_unset_conn(conn);
+ if (smc_ism_support_dmb_nocopy(conn->lgr->smcd))
+ smcd_buf_detach(conn);
if (soft)
tasklet_kill(&conn->rx_tsklet);
else
@@ -2458,12 +2476,18 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
int rc;

/* create send buffer */
+ if (is_smcd &&
+ smc_ism_support_dmb_nocopy(smc->conn.lgr->smcd))
+ goto create_rmb;
+
rc = __smc_buf_create(smc, is_smcd, false);
if (rc)
return rc;
+
+create_rmb:
/* create rmb */
rc = __smc_buf_create(smc, is_smcd, true);
- if (rc) {
+ if (rc && smc->conn.sndbuf_desc) {
down_write(&smc->conn.lgr->sndbufs_lock);
list_del(&smc->conn.sndbuf_desc->list);
up_write(&smc->conn.lgr->sndbufs_lock);
@@ -2473,6 +2497,41 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
return rc;
}

+int smcd_buf_attach(struct smc_sock *smc)
+{
+ struct smc_connection *conn = &smc->conn;
+ struct smcd_dev *smcd = conn->lgr->smcd;
+ u64 peer_token = conn->peer_token;
+ struct smc_buf_desc *buf_desc;
+ int rc;
+
+ buf_desc = kzalloc(sizeof(*buf_desc), GFP_KERNEL);
+ if (!buf_desc)
+ return -ENOMEM;
+
+ /* The ghost sndbuf_desc describes the same memory region as
+ * peer RMB. Its lifecycle is consistent with the connection's
+ * and it will be freed with the connections instead of the
+ * link group.
+ */
+ rc = smc_ism_attach_dmb(smcd, peer_token, buf_desc);
+ if (rc)
+ goto free;
+
+ smc->sk.sk_sndbuf = buf_desc->len;
+ buf_desc->cpu_addr =
+ (u8 *)buf_desc->cpu_addr + sizeof(struct smcd_cdc_msg);
+ buf_desc->len -= sizeof(struct smcd_cdc_msg);
+ conn->sndbuf_desc = buf_desc;
+ conn->sndbuf_desc->used = 1;
+ atomic_set(&conn->sndbuf_space, conn->sndbuf_desc->len);
+ return 0;
+
+free:
+ kfree(buf_desc);
+ return rc;
+}
+
static inline int smc_rmb_reserve_rtoken_idx(struct smc_link_group *lgr)
{
int i;
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 1f175376037b..d93cf51dbd7c 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -557,6 +557,7 @@ void smc_smcd_terminate(struct smcd_dev *dev, struct smcd_gid *peer_gid,
void smc_smcd_terminate_all(struct smcd_dev *dev);
void smc_smcr_terminate_all(struct smc_ib_device *smcibdev);
int smc_buf_create(struct smc_sock *smc, bool is_smcd);
+int smcd_buf_attach(struct smc_sock *smc);
int smc_uncompress_bufsize(u8 compressed);
int smc_rmb_rtoken_handling(struct smc_connection *conn, struct smc_link *link,
struct smc_clc_msg_accept_confirm *clc);
--
2.32.0.3.g01195cf9f
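
Note on smcd_buf_attach() in this patch: the ghost sndbuf descriptor owns no
payload memory. It points into the peer's buffer, just past the CDC message
area at the start, and its usable length shrinks by the same amount. The
block below is a minimal user-space sketch of that arithmetic only.
struct demo_cdc_msg, struct demo_buf_desc and the demo_* functions are
hypothetical stand-ins, and the 48-byte header size is an arbitrary
placeholder rather than the real sizeof(struct smcd_cdc_msg).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct smcd_cdc_msg; size is a placeholder. */
struct demo_cdc_msg {
	uint8_t raw[48];
};

/* Hypothetical, reduced stand-in for struct smc_buf_desc. */
struct demo_buf_desc {
	void *cpu_addr;	/* start of the usable send area */
	size_t len;	/* usable length in bytes */
	int used;
};

/* Build a ghost descriptor over memory owned by the peer.
 * Only the descriptor is allocated; the data area is the peer's buffer
 * with the leading CDC header skipped.
 */
static struct demo_buf_desc *demo_buf_attach(void *peer_dmb, size_t peer_len)
{
	struct demo_buf_desc *desc;

	assert(peer_dmb);
	if (peer_len <= sizeof(struct demo_cdc_msg))
		return NULL;	/* no room left for payload */

	desc = calloc(1, sizeof(*desc));
	if (!desc)
		return NULL;

	desc->cpu_addr = (uint8_t *)peer_dmb + sizeof(struct demo_cdc_msg);
	desc->len = peer_len - sizeof(struct demo_cdc_msg);
	desc->used = 1;
	return desc;
}

/* Free the descriptor only; the peer still owns the buffer.
 * Calling this on an already detached descriptor is a no-op.
 */
static void demo_buf_detach(struct demo_buf_desc **desc)
{
	assert(desc);
	if (!*desc)
		return;
	free(*desc);
	*desc = NULL;
}
```

A write through desc->cpu_addr lands directly in the peer's buffer, which is
why no copy from sndbuf to DMB is needed in this scheme. The NULL check in
demo_buf_detach() mirrors the early return on a missing sndbuf_desc in
smcd_buf_detach(), so the detach path can be reached more than once safely.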


2024-01-11 12:09:06

by Wen Gu

Subject: [PATCH net-next 07/15] net/smc: register loopback-ism into SMC-D device list

Once the loopback-ism device is ready, add it to the SMC-D device list as
an ISMv2 device.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_ism.c | 11 +++++++----
net/smc/smc_ism.h | 1 +
net/smc/smc_loopback.c | 20 +++++++++++++-------
3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index fb1837d0a861..4065ebd2e43d 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -91,6 +91,11 @@ bool smc_ism_is_v2_capable(void)
return smc_ism_v2_capable;
}

+void smc_ism_set_v2_capable(void)
+{
+ smc_ism_v2_capable = true;
+}
+
/* Set a connection using this DMBE. */
void smc_ism_set_conn(struct smc_connection *conn)
{
@@ -454,11 +459,9 @@ static void smcd_register_dev(struct ism_dev *ism)
if (smc_pnetid_by_dev_port(&ism->pdev->dev, 0, smcd->pnetid))
smc_pnetid_by_table_smcd(smcd);

+ if (smcd->ops->supports_v2())
+ smc_ism_set_v2_capable();
mutex_lock(&smcd_dev_list.mutex);
- if (list_empty(&smcd_dev_list.list)) {
- if (smcd->ops->supports_v2())
- smc_ism_v2_capable = true;
- }
/* sort list: devices without pnetid before devices with pnetid */
if (smcd->pnetid[0])
list_add_tail(&smcd->list, &smcd_dev_list.list);
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index ffff40c30a06..6903cd5d4d4d 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -52,6 +52,7 @@ int smc_ism_signal_shutdown(struct smc_link_group *lgr);
void smc_ism_get_system_eid(u8 **eid);
u16 smc_ism_get_chid(struct smcd_dev *dev);
bool smc_ism_is_v2_capable(void);
+void smc_ism_set_v2_capable(void);
int smc_ism_init(void);
void smc_ism_exit(void);
int smcd_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index f72e7b24fc1a..db0b45f8560c 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -278,10 +278,12 @@ static int smcd_lo_register_dev(struct smc_lo_dev *ldev)
return -ENOMEM;
ldev->smcd = smcd;
smcd->priv = ldev;
-
- /* TODO:
- * register loopback-ism to smcd_dev list.
- */
+ smc_ism_set_v2_capable();
+ mutex_lock(&smcd_dev_list.mutex);
+ list_add(&smcd->list, &smcd_dev_list.list);
+ mutex_unlock(&smcd_dev_list.mutex);
+ pr_warn_ratelimited("smc: adding smcd device %s\n",
+ smc_lo_dev_name);
return 0;
}

@@ -289,9 +291,13 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
{
struct smcd_dev *smcd = ldev->smcd;

- /* TODO:
- * unregister loopback-ism from smcd_dev list.
- */
+ pr_warn_ratelimited("smc: removing smcd device %s\n",
+ smc_lo_dev_name);
+ smcd->going_away = 1;
+ smc_smcd_terminate_all(smcd);
+ mutex_lock(&smcd_dev_list.mutex);
+ list_del_init(&smcd->list);
+ mutex_unlock(&smcd_dev_list.mutex);
kfree(smcd->conn);
kfree(smcd);
}
--
2.32.0.3.g01195cf9f


2024-01-11 14:01:49

by Simon Horman

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D

On Thu, Jan 11, 2024 at 08:00:21PM +0800, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The
> first part can be referred from [2]), the updated things of this version
> are listed at the end.

...

Hi Wen Gu,

unfortunately net-next is currently closed.

[adapted from text by Jakub]

## Form letter - net-next-closed

The merge window for v6.8 has begun and therefore net-next is closed
for new drivers, features, code refactoring and optimizations.
We are currently accepting bug fixes only.

Please repost when net-next reopens on or after 21st January.

RFC patches sent for review only are obviously welcome at any time.

See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
--
pw-bot: defer

2024-01-11 14:50:50

by Jiri Pirko

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D

Thu, Jan 11, 2024 at 01:00:21PM CET, [email protected] wrote:
>This patch set acts as the second part of the new version of [1] (The first
>part can be referred from [2]), the updated things of this version are listed
>at the end.
>
># Background
>
>SMC-D is now used in IBM z with ISM function to optimize network interconnect
>for intra-CPC communications. Inspired by this, we try to make SMC-D available

Care to provide more details about what ISM and intra-CPC is and what it
is good for?


>on the non-s390 architecture through a software-implemented virtual ISM device,
>that is the loopback-ism device here, to accelerate inter-process or

I see no such device. Is it a netdevice?

If it is "software-implemented", why is it part of smc driver and not
separate soft-device driver? If there is some smc specific code, I guess
there should be some level of separation. Can't this be implemented by
other devices too?



>inter-containers communication within the same OS instance.
>
># Design
>
>This patch set includes 3 parts:
>
> - Patch #1-#2: some prepare work for loopback-ism.
> - Patch #3-#9: implement loopback-ism device.
> - Patch #10-#15: memory copy optimization for loopback scenario.
>
>The loopback-ism device is designed as an ISMv2 device and is not limited to
>a specific net namespace; both ends of an inter-process connection (1/1' in the
>diagram below) or of an inter-container connection (2/2' in the diagram below)
>can find the same available loopback-ism and choose it during the CLC handshake.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ +-------+ +-------+ | | +-------+ |
> | | App A | | App B | | App C | | | | App D |<-+ |
> | +-------+ +---^---+ +-------+ | | +-------+ |(2') |
> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
> | | | |
> Kernel | | | |
> +----+-------v---+-----------v----------------------------------+---+----+
> | | TCP | |
> | | | |
> | +--------------------------------------------------------------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
>The loopback-ism device creates DMBs (shared memory) for each connection peer.
>Since data transfer occurs within the same kernel, the sndbuf of each peer
>is only a descriptor that points to the same memory region as the peer DMB, so
>the data copy from sndbuf to peer DMB can be avoided in the loopback-ism case.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ | | +-------+ |
> | | App C |-----+ | | | App D | |
> | +-------+ | | | +-^-----+ |
> | | | | | |
> | (2) | | | (2') | |
> | | | | | |
> +---------------|-------------------------+ +----------|--------------+
> | |
> Kernel | |
> +---------------|-----------------------------------------|--------------+
> | +--------+ +--v-----+ +--------+ +--------+ |
> | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
> | +-----|--+ +--|-----+ +-----|--+ +--------+ |
> | +-----|--+ | +-----|--+ |
> | | DMB C | +---------------------------------| DMB D | |
> | +--------+ +--------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
># Benchmark Test
>
> * Test environments:
> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
> - SMC sndbuf/DMB size 1MB.
> - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
> which means sndbuf and DMB are merged and no data copied between them.
> - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,

Exposing any configuration knobs and statistics over sysfs for
softdevices does not look correct at all :/ Could you please avoid
sysfs?


> which means DMB is physically contiguous buffer.
>
> * Test object:
> - TCP: run on TCP loopback.
> - SMC lo: run on SMC loopback device.
>
>1. ipc-benchmark (see [3])
>
> - ./<foo> -c 1000000 -s 100
>
> TCP SMC-lo
>Message
>rate (msg/s) 80636 149515(+85.42%)
>
>2. sockperf
>
> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>
> TCP SMC-lo
>Bandwidth(MBps) 4909.36 8197.57(+66.98%)
>Latency(us) 6.098 3.383(-44.52%)
>
>3. nginx/wrk
>
> - serv: <smc_run> nginx
> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>
> TCP SMC-lo
>Requests/s 181685.74 246447.77(+35.65%)
>
>4. redis-benchmark
>
> - serv: <smc_run> redis-server
> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>
> TCP SMC-lo
>GET(Requests/s) 85855.34 118553.64(+38.09%)
>SET(Requests/s) 86824.40 125944.58(+45.06%)
>
>
>Change log:
>
>v1->RFC:
>- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
> /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
> merging sndbuf with peer DMB.
>- Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
> control of whether to merge sndbuf and DMB. They can be respectively set by:
> /sys/devices/virtual/smc/loopback-ism/dmb_type
> /sys/devices/virtual/smc/loopback-ism/dmb_copy
> The motivation for these two controls is that a performance bottleneck was
> found when using vzalloced DMB and sndbuf is merged with DMB, and there are
> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
> vmap lock contention [6]. It has significant effects, but using virtual memory
> still has additional overhead compared to using physical memory.
> So this new version provides controls of dmb_type and dmb_copy to suit
> different scenarios.
>- Some minor changes and comments improvements.
>
>RFC->old version([1]):
>Link: https://lore.kernel.org/netdev/[email protected]/
>- Patch #1: improve the loopback-ism dump, it shows as follows now:
> # smcd d
> FID Type PCI-ID PCHID InUse #LGs PNET-ID
> 0000 0 loopback-ism ffff No 0
>- Patch #3: introduce the smc_ism_set_v2_capable() helper and set
> smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
> regardless of whether there is already a device in smcd device list.
>- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
> to activate or deactivate the loopback-ism.
>- Patch #9: introduce the statistics of loopback-ism by
> /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_bytes|dmbs_cnt}.
>- Some minor changes and comments improvements.
>
>[1] https://lore.kernel.org/netdev/[email protected]/
>[2] https://lore.kernel.org/netdev/[email protected]/
>[3] https://github.com/goldsborough/ipc-bench
>[4] https://lore.kernel.org/all/[email protected]/
>[5] https://lore.kernel.org/all/[email protected]/
>[6] https://lore.kernel.org/all/[email protected]/
>
>Wen Gu (15):
> net/smc: improve SMC-D device dump for virtual ISM
> net/smc: decouple specialized struct from SMC-D DMB registration
> net/smc: introduce virtual ISM device loopback-ism
> net/smc: implement ID-related operations of loopback-ism
> net/smc: implement some unsupported operations of loopback-ism
> net/smc: implement DMB-related operations of loopback-ism
> net/smc: register loopback-ism into SMC-D device list
> net/smc: introduce loopback-ism runtime switch
> net/smc: introduce loopback-ism statistics attributes
> net/smc: add operations to merge sndbuf with peer DMB
> net/smc: attach or detach ghost sndbuf to peer DMB
> net/smc: adapt cursor update when sndbuf and peer DMB are merged
> net/smc: introduce loopback-ism DMB type control
> net/smc: introduce loopback-ism DMB data copy control
> net/smc: implement DMB-merged operations of loopback-ism
>
> drivers/s390/net/ism_drv.c | 2 +-
> include/net/smc.h | 7 +-
> net/smc/Kconfig | 13 +
> net/smc/Makefile | 2 +-
> net/smc/af_smc.c | 28 +-
> net/smc/smc_cdc.c | 58 ++-
> net/smc/smc_cdc.h | 1 +
> net/smc/smc_core.c | 61 +++-
> net/smc/smc_core.h | 1 +
> net/smc/smc_ism.c | 71 +++-
> net/smc/smc_ism.h | 5 +
> net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 88 +++++
> 13 files changed, 1026 insertions(+), 29 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h
>
>--
>2.32.0.3.g01195cf9f
>
>

2024-01-12 02:56:04

by Wen Gu

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 2024/1/11 21:36, Simon Horman wrote:
> On Thu, Jan 11, 2024 at 08:00:21PM +0800, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The
>> first part can be referred from [2]), the updated things of this version
>> are listed at the end.
>
> ...
>
> Hi Wen Gu,
>
> unfortunately net-next is currently closed.
>
> [adapted from text by Jakub]
>
> ## Form letter - net-next-closed
>
> The merge window for v6.8 has begun and therefore net-next is closed
> for new drivers, features, code refactoring and optimizations.
> We are currently accepting bug fixes only.
>
> Please repost when net-next reopens on or after 21st January.
>
> RFC patches sent for review only are obviously welcome at any time.
>
> See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
> --
> pw-bot: defer

Thank you for the notice, Simon. I will follow the development cycle. Thanks again.

2024-01-12 08:31:55

by Wen Gu

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 2024/1/11 22:50, Jiri Pirko wrote:
> Thu, Jan 11, 2024 at 01:00:21PM CET, [email protected] wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>>
>> # Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>
> Care to provide more details about what ISM and intra-CPC is and what it
> is good for?
>

Hi Jiri,

Sure,

ISM (IBM System Z Internal Shared Memory) is a technology that provides the
internal communications capability required for SMC-D. It is a virtual PCI
network adapter that enables direct access to shared virtual memory providing
a highly optimized network interconnect for IBM Z intra-CPC communications.
(It can be found in https://www.ibm.com/docs/en/zos/3.1.0?topic=communications-shared-memory-reference-information
and https://www.ibm.com/docs/en/zos/3.1.0?topic=dv2-ismv2)

CPC (Central processor complex) is an IBM mainframe term to refer to the physical
collection of hardware that includes main storage, one or more central processors,
timers, and channels.
(It can be found in https://www.ibm.com/docs/en/zos-basic-skills?topic=concepts-mainframe-hardware-terminology
and https://www.ibm.com/docs/en/ztpf/2023?topic=support-central-processor-complex-cpc)

SMC (Shared Memory Communications) is a network protocol that allows two SMC
capable peers to communicate using memory that each peer allocates and manages
for their partner’s use. It has two forms:

- SMC over Remote Direct Memory Access (SMC-R)

It is an open protocol that was initially introduced in z/OS V2R1 on the IBM zEC12.
SMC-R is defined in an informational RFC entitled IBM’s Shared Memory Communications
over RDMA (https://tools.ietf.org/html/rfc7609).

- SMC - Direct Memory Access (SMC-D)

It is a variation of SMC-R. SMC-D is closely related to SMC-R but is based on the
Internal Shared Memory (ISM) capabilities introduced with the IBM z13™ (z13) hardware
model.

(SMC protocol can be found in
https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202.1_0.pdf)

So with the ISM function, SMC-D can be used to improve throughput, lower latency and cost,
and maintain existing functions for communications within a CPC.

>
>> on the non-s390 architecture through a software-implemented virtual ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
>
> I see no such device. Is it a netdevice?

Currently, SMC-D depends on ISM and is only available on IBM Z systems. Now we try
to make SMC-D available on system architectures other than s390 (IBM Z).
So 'virtual ISM' is proposed; it plays the role that the firmware ISM plays on IBM Z.
(The virtual ISM support can be found in https://lore.kernel.org/netdev/[email protected]/)

The loopback-ism is the first virtual ISM. It does not rely on a specific architecture
or hardware, and provides the functions that an ISM should have (like a dummy device). It is
designed to be used by SMC-D when communication occurs within the same OS instance.

It is not a typical network device, since it primarily provides exactly the functions
defined by the SMC-D device operations (struct smcd_ops), e.g. it provides and manages the
shared memory (called DMB, Direct Memory Buffer, in SMC).

You can't find it in the current tree because it is introduced by this patchset.
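
To make "provides and manages the shared memory" a bit more concrete, here is
a toy user-space model of a token-based DMB table: one side registers a
buffer and gets a token, the other side attaches by that token and ends up
with a pointer to the same memory. This is only a sketch under assumptions,
not the loopback-ism implementation. All demo_* names and the fixed-size
table are hypothetical, and refusing to unregister a DMB while a peer is
still attached is a policy chosen for this toy, not something taken from the
patchset.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_DMBS 8

/* Hypothetical toy DMB entry; not the kernel's struct smcd_dmb. */
struct demo_dmb {
	uint64_t token;		/* 0 means the slot is free */
	void *cpu_addr;
	size_t len;
	int attached;		/* peers currently attached */
};

struct demo_lo_dev {
	struct demo_dmb dmbs[DEMO_MAX_DMBS];
	uint64_t next_token;
};

static struct demo_dmb *demo_find(struct demo_lo_dev *dev, uint64_t token)
{
	if (!token)
		return NULL;
	for (int i = 0; i < DEMO_MAX_DMBS; i++)
		if (dev->dmbs[i].token == token)
			return &dev->dmbs[i];
	return NULL;
}

/* Owner side: allocate a zeroed buffer and hand back its token. */
static int demo_register_dmb(struct demo_lo_dev *dev, size_t len,
			     uint64_t *token)
{
	assert(dev && token);
	if (!len)
		return -EINVAL;
	for (int i = 0; i < DEMO_MAX_DMBS; i++) {
		struct demo_dmb *dmb = &dev->dmbs[i];

		if (dmb->token)
			continue;
		dmb->cpu_addr = calloc(1, len);
		if (!dmb->cpu_addr)
			return -ENOMEM;
		dmb->len = len;
		dmb->attached = 0;
		dmb->token = ++dev->next_token;
		*token = dmb->token;
		return 0;
	}
	return -ENOSPC;
}

/* Peer side: look the buffer up by token and share the same memory. */
static int demo_attach_dmb(struct demo_lo_dev *dev, uint64_t token,
			   void **addr, size_t *len)
{
	struct demo_dmb *dmb = demo_find(dev, token);

	if (!dmb)
		return -EINVAL;
	dmb->attached++;
	*addr = dmb->cpu_addr;
	*len = dmb->len;
	return 0;
}

static int demo_detach_dmb(struct demo_lo_dev *dev, uint64_t token)
{
	struct demo_dmb *dmb = demo_find(dev, token);

	if (!dmb || !dmb->attached)
		return -EINVAL;
	dmb->attached--;
	return 0;
}

/* Toy policy: a DMB with attached peers cannot be released. */
static int demo_unregister_dmb(struct demo_lo_dev *dev, uint64_t token)
{
	struct demo_dmb *dmb = demo_find(dev, token);

	if (!dmb)
		return -EINVAL;
	if (dmb->attached)
		return -EBUSY;
	free(dmb->cpu_addr);
	*dmb = (struct demo_dmb){ 0 };
	return 0;
}
```

In this model the token is the only thing the two sides need to exchange,
which matches how the peer token obtained during the handshake is used to
attach the ghost sndbuf in patch #11.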

>
> If it is "software-implemented", why is it part of smc driver and not
> separate soft-device driver? If there is some smc specific code, I guess
> there should be some level of separation. Can't this be implemented by
> other devices too?
>

loopback-ism is designed to be used specifically by SMC-D (like the s390 ISM), to
serve as an easily available ISM for the community to test SMC-D, and to accelerate
intra-OS communication (see the benchmark test). So the code is under net/smc.

>
>
>> inter-containers communication within the same OS instance.
>>
>> # Design
>>
>> This patch set includes 3 parts:
>>
>> - Patch #1-#2: some prepare work for loopback-ism.
>> - Patch #3-#9: implement loopback-ism device.
>> - Patch #10-#15: memory copy optimization for loopback scenario.
>>
>> The loopback-ism device is designed as an ISMv2 device and is not limited to
>> a specific net namespace; both ends of an inter-process connection (1/1' in the
>> diagram below) or of an inter-container connection (2/2' in the diagram below)
>> can find the same available loopback-ism and choose it during the CLC handshake.
>>
>> Container 1 (ns1) Container 2 (ns2)
>> +-----------------------------------------+ +-------------------------+
>> | +-------+ +-------+ +-------+ | | +-------+ |
>> | | App A | | App B | | App C | | | | App D |<-+ |
>> | +-------+ +---^---+ +-------+ | | +-------+ |(2') |
>> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
>> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
>> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
>> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
>> | | | |
>> Kernel | | | |
>> +----+-------v---+-----------v----------------------------------+---+----+
>> | | TCP | |
>> | | | |
>> | +--------------------------------------------------------------+ |
>> | |
>> | +--------------+ |
>> | | smc loopback | |
>> +---------------------------+--------------+-----------------------------+
>>
>> The loopback-ism device creates DMBs (shared memory) for each connection peer.
>> Since data transfer occurs within the same kernel, the sndbuf of each peer
>> is only a descriptor that points to the same memory region as the peer DMB, so
>> the data copy from sndbuf to peer DMB can be avoided in the loopback-ism case.
>>
>> Container 1 (ns1) Container 2 (ns2)
>> +-----------------------------------------+ +-------------------------+
>> | +-------+ | | +-------+ |
>> | | App C |-----+ | | | App D | |
>> | +-------+ | | | +-^-----+ |
>> | | | | | |
>> | (2) | | | (2') | |
>> | | | | | |
>> +---------------|-------------------------+ +----------|--------------+
>> | |
>> Kernel | |
>> +---------------|-----------------------------------------|--------------+
>> | +--------+ +--v-----+ +--------+ +--------+ |
>> | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
>> | +-----|--+ +--|-----+ +-----|--+ +--------+ |
>> | +-----|--+ | +-----|--+ |
>> | | DMB C | +---------------------------------| DMB D | |
>> | +--------+ +--------+ |
>> | |
>> | +--------------+ |
>> | | smc loopback | |
>> +---------------------------+--------------+-----------------------------+
>>
>> # Benchmark Test
>>
>> * Test environments:
>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>> - SMC sndbuf/DMB size 1MB.
>> - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>> which means sndbuf and DMB are merged and no data copied between them.
>> - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>
> Exposing any configuration knobs and statistics over sysfs for
> softdevices does not look correct at all :/ Could you please avoid
> sysfs?
>

In previous reviews and calls, we thought loopback-ism needed to be more
like a device and be visible under /sys/devices.

Would you mind explaining why using sysfs for loopback-ism is not correct?
I ask because I see other configuration knobs and statistics under /sys/devices,
e.g. /sys/devices/virtual/net/lo. Thank you!



Thanks again,
Wen Gu

>
>> which means DMB is physically contiguous buffer.
>>
>> * Test object:
>> - TCP: run on TCP loopback.
>> - SMC lo: run on SMC loopback device.
>>
>> 1. ipc-benchmark (see [3])
>>
>> - ./<foo> -c 1000000 -s 100
>>
>> TCP SMC-lo
>> Message
>> rate (msg/s) 80636 149515(+85.42%)
>>
>> 2. sockperf
>>
>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>
>> TCP SMC-lo
>> Bandwidth(MBps) 4909.36 8197.57(+66.98%)
>> Latency(us) 6.098 3.383(-44.52%)
>>
>> 3. nginx/wrk
>>
>> - serv: <smc_run> nginx
>> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>>
>> TCP SMC-lo
>> Requests/s 181685.74 246447.77(+35.65%)
>>
>> 4. redis-benchmark
>>
>> - serv: <smc_run> redis-server
>> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>>
>> TCP SMC-lo
>> GET(Requests/s) 85855.34 118553.64(+38.09%)
>> SET(Requests/s) 86824.40 125944.58(+45.06%)
>>
>>
>> Change log:
>>
>> v1->RFC:
>> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>> /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>> merging sndbuf with peer DMB.
>> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>> control of whether to merge sndbuf and DMB. They can be respectively set by:
>> /sys/devices/virtual/smc/loopback-ism/dmb_type
>> /sys/devices/virtual/smc/loopback-ism/dmb_copy
>> The motivation for these two controls is that a performance bottleneck was
>> found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>> vmap lock contention [6]. It has significant effects, but using virtual memory
>> still has additional overhead compared to using physical memory.
>> So this new version provides controls of dmb_type and dmb_copy to suit
>> different scenarios.
>> - Some minor changes and comments improvements.
>>
>> RFC->old version([1]):
>> Link: https://lore.kernel.org/netdev/[email protected]/
>> - Patch #1: improve the loopback-ism dump, it shows as follows now:
>> # smcd d
>> FID Type PCI-ID PCHID InUse #LGs PNET-ID
>> 0000 0 loopback-ism ffff No 0
>> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>> smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>> regardless of whether there is already a device in smcd device list.
>> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>> to activate or deactivate the loopback-ism.
>> - Patch #9: introduce the statistics of loopback-ism by
>> /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_bytes|dmbs_cnt}.
>> - Some minor changes and comments improvements.
>>
>> [1] https://lore.kernel.org/netdev/[email protected]/
>> [2] https://lore.kernel.org/netdev/[email protected]/
>> [3] https://github.com/goldsborough/ipc-bench
>> [4] https://lore.kernel.org/all/[email protected]/
>> [5] https://lore.kernel.org/all/[email protected]/
>> [6] https://lore.kernel.org/all/[email protected]/
>>
>> Wen Gu (15):
>> net/smc: improve SMC-D device dump for virtual ISM
>> net/smc: decouple specialized struct from SMC-D DMB registration
>> net/smc: introduce virtual ISM device loopback-ism
>> net/smc: implement ID-related operations of loopback-ism
>> net/smc: implement some unsupported operations of loopback-ism
>> net/smc: implement DMB-related operations of loopback-ism
>> net/smc: register loopback-ism into SMC-D device list
>> net/smc: introduce loopback-ism runtime switch
>> net/smc: introduce loopback-ism statistics attributes
>> net/smc: add operations to merge sndbuf with peer DMB
>> net/smc: attach or detach ghost sndbuf to peer DMB
>> net/smc: adapt cursor update when sndbuf and peer DMB are merged
>> net/smc: introduce loopback-ism DMB type control
>> net/smc: introduce loopback-ism DMB data copy control
>> net/smc: implement DMB-merged operations of loopback-ism
>>
>> drivers/s390/net/ism_drv.c | 2 +-
>> include/net/smc.h | 7 +-
>> net/smc/Kconfig | 13 +
>> net/smc/Makefile | 2 +-
>> net/smc/af_smc.c | 28 +-
>> net/smc/smc_cdc.c | 58 ++-
>> net/smc/smc_cdc.h | 1 +
>> net/smc/smc_core.c | 61 +++-
>> net/smc/smc_core.h | 1 +
>> net/smc/smc_ism.c | 71 +++-
>> net/smc/smc_ism.h | 5 +
>> net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++
>> net/smc/smc_loopback.h | 88 +++++
>> 13 files changed, 1026 insertions(+), 29 deletions(-)
>> create mode 100644 net/smc/smc_loopback.c
>> create mode 100644 net/smc/smc_loopback.h
>>
>> --
>> 2.32.0.3.g01195cf9f
>>
>>

2024-01-12 09:10:39

by Jiri Pirko

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D

Fri, Jan 12, 2024 at 09:29:35AM CET, [email protected] wrote:
>
>
>On 2024/1/11 22:50, Jiri Pirko wrote:
>> Thu, Jan 11, 2024 at 01:00:21PM CET, [email protected] wrote:
>> > This patch set acts as the second part of the new version of [1] (The first
>> > part can be referred from [2]), the updated things of this version are listed
>> > at the end.
>> >
>> > # Background
>> >
>> > SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> > for intra-CPC communications. Inspired by this, we try to make SMC-D available
>>
>> Care to provide more details about what ISM and intra-CPC are and what they
>> are good for?
>>
>
>Hi Jiri,
>
>Sure,
>
>ISM (IBM System Z Internal Shared Memory) is a technology that provides the
>internal communications capability required for SMC-D. It is a virtual PCI
>network adapter that enables direct access to shared virtual memory providing
>a highly optimized network interconnect for IBM Z intra-CPC communications.
>(It can be found in https://www.ibm.com/docs/en/zos/3.1.0?topic=communications-shared-memory-reference-information
>and https://www.ibm.com/docs/en/zos/3.1.0?topic=dv2-ismv2)
>
>CPC (Central processor complex) is an IBM mainframe term to refer to the physical
>collection of hardware that includes main storage, one or more central processors,
>timers, and channels.
>(It can be found in https://www.ibm.com/docs/en/zos-basic-skills?topic=concepts-mainframe-hardware-terminology
>and https://www.ibm.com/docs/en/ztpf/2023?topic=support-central-processor-complex-cpc)
>
>SMC (Shared Memory Communications) is a network protocol that allows two SMC
>capable peers to communicate using memory that each peer allocates and manages
>for their partner’s use. It has two forms:
>
>- SMC over Remote Direct Memory Access (SMC-R)
>
> It is an open protocol that was initially introduced in z/OS V2R1 on the IBM zEC12.
> SMC-R is defined in an informational RFC entitled IBM’s Shared Memory Communications
> over RDMA (https://tools.ietf.org/html/rfc7609).
>
>- SMC - Direct Memory Access (SMC-D)
>
> It is a variation of SMC-R. SMC-D is closely related to SMC-R but is based on the
> Internal Shared Memory (ISM) capabilities introduced with the IBM z13™ (z13) hardware
> model.
>
>(SMC protocol can be found in https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202.1_0.pdf)
>
>So with the ISM function, SMC-D can be used to improve throughput, lower latency and cost,
>and maintain existing functions for communications within a CPC.
>
>>
>> > on the non-s390 architecture through a software-implemented virtual ISM device,
>> > that is the loopback-ism device here, to accelerate inter-process or
>>
>> I see no such device. Is it a netdevice?
>
>Currently, SMC-D depends on ISM and is only available on IBM Z systems. Now we try
>to make SMC-D available on system architectures other than s390 and IBM Z.
>So 'virtual ISM' is proposed and acts like the original firmware ISM on IBM Z.
>(The virtual ISM supports can be found in https://lore.kernel.org/netdev/[email protected]/)
>
>The loopback-ism is the first virtual ISM. It does not rely on a specific architecture
>or hardware, and provides the functions that an ISM should have (like a dummy device). It is
>designed to be used by SMC-D when communication occurs within the same OS instance.
>
>It is not a typical network device, since it primarily provides exactly the functions
>defined by the SMC-D device operations (struct smcd_ops), e.g. it provides and manages the
>shared memory (called DMB, Direct Memory Buffer, in SMC).
>
>It can't be found now since it is introduced by this patchset.
>
>>
>> If it is "software-implemented", why is it part of smc driver and not
>> separate soft-device driver? If there is some smc specific code, I guess
>> there should be some level of separation. Can't this be implemented by
>> other devices too?
>>
>
>loopback-ism is designed specifically to be used by SMC-D (like the s390 ISM), to
>serve as an easily available ISM for the community to test SMC-D, and to accelerate
>intra-OS communication (see the benchmark test). So the code is under net/smc.

Got it.


>
>>
>>
>> > inter-containers communication within the same OS instance.
>> >
>> > # Design
>> >
>> > This patch set includes 3 parts:
>> >
>> > - Patch #1-#2: some prepare work for loopback-ism.
>> > - Patch #3-#9: implement loopback-ism device.
>> > - Patch #10-#15: memory copy optimization for loopback scenario.
>> >
>> > The loopback-ism device is designed as a ISMv2 device and not be limited to
>> > a specific net namespace, ends of both inter-process connection (1/1' in diagram
>> > below) or inter-container connection (2/2' in diagram below) can find the same
>> > available loopback-ism and choose it during the CLC handshake.
>> >
>> > Container 1 (ns1) Container 2 (ns2)
>> > +-----------------------------------------+ +-------------------------+
>> > | +-------+ +-------+ +-------+ | | +-------+ |
>> > | | App A | | App B | | App C | | | | App D |<-+ |
>> > | +-------+ +---^---+ +-------+ | | +-------+ |(2') |
>> > | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
>> > | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
>> > | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
>> > +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
>> > | | | |
>> > Kernel | | | |
>> > +----+-------v---+-----------v----------------------------------+---+----+
>> > | | TCP | |
>> > | | | |
>> > | +--------------------------------------------------------------+ |
>> > | |
>> > | +--------------+ |
>> > | | smc loopback | |
>> > +---------------------------+--------------+-----------------------------+
>> >
>> > loopback-ism device creates DMBs (shared memory) for each connection peer.
>> > Since data transfer occurs within the same kernel, the sndbuf of each peer
>> > is only a descriptor and point to the same memory region as peer DMB, so that
>> > the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>> >
>> > Container 1 (ns1) Container 2 (ns2)
>> > +-----------------------------------------+ +-------------------------+
>> > | +-------+ | | +-------+ |
>> > | | App C |-----+ | | | App D | |
>> > | +-------+ | | | +-^-----+ |
>> > | | | | | |
>> > | (2) | | | (2') | |
>> > | | | | | |
>> > +---------------|-------------------------+ +----------|--------------+
>> > | |
>> > Kernel | |
>> > +---------------|-----------------------------------------|--------------+
>> > | +--------+ +--v-----+ +--------+ +--------+ |
>> > | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
>> > | +-----|--+ +--|-----+ +-----|--+ +--------+ |
>> > | +-----|--+ | +-----|--+ |
>> > | | DMB C | +---------------------------------| DMB D | |
>> > | +--------+ +--------+ |
>> > | |
>> > | +--------------+ |
>> > | | smc loopback | |
>> > +---------------------------+--------------+-----------------------------+
>> >
>> > # Benchmark Test
>> >
>> > * Test environments:
>> > - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>> > - SMC sndbuf/DMB size 1MB.
>> > - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>> > which means sndbuf and DMB are merged and no data copied between them.
>> > - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>>
>> Exposing any configuration knobs and statistics over sysfs for
>> softdevices does not look correct at all :/ Could you please avoid
>> sysfs?
>>
>
>In previous reviews and calls, we thought loopback-ism needs to be more
>like a device and be visible under /sys/devices.
>
>Would you mind explaining why using sysfs for loopback-ism is not correct?
>I ask because I saw that some other configurations or statistics exist under /sys/devices,
>e.g. /sys/devices/virtual/net/lo. Thank you!

You have smc_netlink.c exposing a clear netlink API for the subsystem.
Can't you extend it to contain the configuration knobs and expose the stats,
instead of using sysfs?


>
>
>
>Thanks again,
>Wen Gu
>
>>
>> > which means DMB is physically contiguous buffer.
>> >
>> > * Test object:
>> > - TCP: run on TCP loopback.
>> > - SMC lo: run on SMC loopback device.
>> >
>> > 1. ipc-benchmark (see [3])
>> >
>> > - ./<foo> -c 1000000 -s 100
>> >
>> > TCP SMC-lo
>> > Message
>> > rate (msg/s) 80636 149515(+85.42%)
>> >
>> > 2. sockperf
>> >
>> > - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>> > - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>> >
>> > TCP SMC-lo
>> > Bandwidth(MBps) 4909.36 8197.57(+66.98%)
>> > Latency(us) 6.098 3.383(-44.52%)
>> >
>> > 3. nginx/wrk
>> >
>> > - serv: <smc_run> nginx
>> > - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>> >
>> > TCP SMC-lo
>> > Requests/s 181685.74 246447.77(+35.65%)
>> >
>> > 4. redis-benchmark
>> >
>> > - serv: <smc_run> redis-server
>> > - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>> >
>> > TCP SMC-lo
>> > GET(Requests/s) 85855.34 118553.64(+38.09%)
>> > SET(Requests/s) 86824.40 125944.58(+45.06%)
>> >
>> >
>> > Change log:
>> >
>> > v1->RFC:
>> > - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>> > /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>> > - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>> > merging sndbuf with peer DMB.
>> > - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>> > control of whether to merge sndbuf and DMB. They can be respectively set by:
>> > /sys/devices/virtual/smc/loopback-ism/dmb_type
>> > /sys/devices/virtual/smc/loopback-ism/dmb_copy
>> > The motivation for these two control is that a performance bottleneck was
>> > found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>> > many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>> > by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>> > or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>> > vmap lock contention [6]. It has significant effects, but using virtual memory
>> > still has additional overhead compared to using physical memory.
>> > So this new version provides controls of dmb_type and dmb_copy to suit
>> > different scenarios.
>> > - Some minor changes and comments improvements.
>> >
>> > RFC->old version([1]):
>> > Link: https://lore.kernel.org/netdev/[email protected]/
>> > - Patch #1: improve the loopback-ism dump, it shows as follows now:
>> > # smcd d
>> > FID Type PCI-ID PCHID InUse #LGs PNET-ID
>> > 0000 0 loopback-ism ffff No 0
>> > - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>> > smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>> > regardless of whether there is already a device in smcd device list.
>> > - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>> > - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>> > to activate or deactivate the loopback-ism.
>> > - Patch #9: introduce the statistics of loopback-ism by
>> > /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_bytes|dmbs_cnt}.
>> > - Some minor changes and comments improvements.
>> >
>> > [1] https://lore.kernel.org/netdev/[email protected]/
>> > [2] https://lore.kernel.org/netdev/[email protected]/
>> > [3] https://github.com/goldsborough/ipc-bench
>> > [4] https://lore.kernel.org/all/[email protected]/
>> > [5] https://lore.kernel.org/all/[email protected]/
>> > [6] https://lore.kernel.org/all/[email protected]/
>> >
>> > Wen Gu (15):
>> > net/smc: improve SMC-D device dump for virtual ISM
>> > net/smc: decouple specialized struct from SMC-D DMB registration
>> > net/smc: introduce virtual ISM device loopback-ism
>> > net/smc: implement ID-related operations of loopback-ism
>> > net/smc: implement some unsupported operations of loopback-ism
>> > net/smc: implement DMB-related operations of loopback-ism
>> > net/smc: register loopback-ism into SMC-D device list
>> > net/smc: introduce loopback-ism runtime switch
>> > net/smc: introduce loopback-ism statistics attributes
>> > net/smc: add operations to merge sndbuf with peer DMB
>> > net/smc: attach or detach ghost sndbuf to peer DMB
>> > net/smc: adapt cursor update when sndbuf and peer DMB are merged
>> > net/smc: introduce loopback-ism DMB type control
>> > net/smc: introduce loopback-ism DMB data copy control
>> > net/smc: implement DMB-merged operations of loopback-ism
>> >
>> > drivers/s390/net/ism_drv.c | 2 +-
>> > include/net/smc.h | 7 +-
>> > net/smc/Kconfig | 13 +
>> > net/smc/Makefile | 2 +-
>> > net/smc/af_smc.c | 28 +-
>> > net/smc/smc_cdc.c | 58 ++-
>> > net/smc/smc_cdc.h | 1 +
>> > net/smc/smc_core.c | 61 +++-
>> > net/smc/smc_core.h | 1 +
>> > net/smc/smc_ism.c | 71 +++-
>> > net/smc/smc_ism.h | 5 +
>> > net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++
>> > net/smc/smc_loopback.h | 88 +++++
>> > 13 files changed, 1026 insertions(+), 29 deletions(-)
>> > create mode 100644 net/smc/smc_loopback.c
>> > create mode 100644 net/smc/smc_loopback.h
>> >
>> > --
>> > 2.32.0.3.g01195cf9f
>> >
>> >

2024-01-12 12:32:46

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 2024/1/12 17:10, Jiri Pirko wrote:
> Fri, Jan 12, 2024 at 09:29:35AM CET, [email protected] wrote:
>>
>>

<...>

>>>> inter-containers communication within the same OS instance.
>>>>
>>>> # Design
>>>>
>>>> This patch set includes 3 parts:
>>>>
>>>> - Patch #1-#2: some prepare work for loopback-ism.
>>>> - Patch #3-#9: implement loopback-ism device.
>>>> - Patch #10-#15: memory copy optimization for loopback scenario.
>>>>
>>>> The loopback-ism device is designed as a ISMv2 device and not be limited to
>>>> a specific net namespace, ends of both inter-process connection (1/1' in diagram
>>>> below) or inter-container connection (2/2' in diagram below) can find the same
>>>> available loopback-ism and choose it during the CLC handshake.
>>>>
>>>> Container 1 (ns1) Container 2 (ns2)
>>>> +-----------------------------------------+ +-------------------------+
>>>> | +-------+ +-------+ +-------+ | | +-------+ |
>>>> | | App A | | App B | | App C | | | | App D |<-+ |
>>>> | +-------+ +---^---+ +-------+ | | +-------+ |(2') |
>>>> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
>>>> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
>>>> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
>>>> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
>>>> | | | |
>>>> Kernel | | | |
>>>> +----+-------v---+-----------v----------------------------------+---+----+
>>>> | | TCP | |
>>>> | | | |
>>>> | +--------------------------------------------------------------+ |
>>>> | |
>>>> | +--------------+ |
>>>> | | smc loopback | |
>>>> +---------------------------+--------------+-----------------------------+
>>>>
>>>> loopback-ism device creates DMBs (shared memory) for each connection peer.
>>>> Since data transfer occurs within the same kernel, the sndbuf of each peer
>>>> is only a descriptor and point to the same memory region as peer DMB, so that
>>>> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>>>>
>>>> Container 1 (ns1) Container 2 (ns2)
>>>> +-----------------------------------------+ +-------------------------+
>>>> | +-------+ | | +-------+ |
>>>> | | App C |-----+ | | | App D | |
>>>> | +-------+ | | | +-^-----+ |
>>>> | | | | | |
>>>> | (2) | | | (2') | |
>>>> | | | | | |
>>>> +---------------|-------------------------+ +----------|--------------+
>>>> | |
>>>> Kernel | |
>>>> +---------------|-----------------------------------------|--------------+
>>>> | +--------+ +--v-----+ +--------+ +--------+ |
>>>> | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
>>>> | +-----|--+ +--|-----+ +-----|--+ +--------+ |
>>>> | +-----|--+ | +-----|--+ |
>>>> | | DMB C | +---------------------------------| DMB D | |
>>>> | +--------+ +--------+ |
>>>> | |
>>>> | +--------------+ |
>>>> | | smc loopback | |
>>>> +---------------------------+--------------+-----------------------------+
>>>>
>>>> # Benchmark Test
>>>>
>>>> * Test environments:
>>>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>>> - SMC sndbuf/DMB size 1MB.
>>>> - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>>>> which means sndbuf and DMB are merged and no data copied between them.
>>>> - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>>>
>>> Exposing any configuration knobs and statistics over sysfs for
>>> softdevices does not look correct at all :/ Could you please avoid
>>> sysfs?
>>>
>>
>> In previous reviews and calls, we think loopback-ism needs to be more
>> like a device and be visible under /sys/devices.
>>
>> Would you mind explaining why using sysfs for loopback-ism is not correct?
>> since I saw some other configurations or statistics exists under /sys/devices,
>> e.g. /sys/devices/virtual/net/lo. Thank you!
>
> You have smc_netlink.c exposing clear netlink api for the subsystem.
> Can't you extend it to contain the configuration knobs and expose stats
> instead of sysfs?
>

Thank you for the suggestion. I've also considered this approach.

But I chose not to extend the SMC netlink interface because, for now, it is
used for SMC protocol-related attributes, for example:

SMC_NETLINK_GET_SYS_INFO: SMC version, release, v2-capable..
SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group info (lgr id, lgr conn num, lgr role..)
SMC_NETLINK_GET_LINK_SMCR: SMC-R link info (link id, link state, conn cnt..)
SMC_NETLINK_GET_DEV_SMCD: SMC-D device generic info (user cnt, pci_fid, pci_chid, pci_vendor..)
SMC_NETLINK_GET_DEV_SMCR: SMC-R device generic info (dev name, port pnet_id, port valid, port state..)
SMC_NETLINK_GET_STATS: SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...)

The knobs and stats in this patchset, by contrast, are loopback-ism device-specific
attributes, for example:

active: loopback-ism runtime switch
dmb_type: type of DMB provided by loopback-ism
dmb_copy: support for DMB merge of loopback-ism
xfer_bytes: data transferred by loopback-ism
dmbs_cnt: DMB num provided by loopback-ism
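
For illustration only, a user-space reader for one of these attributes could look like the sketch below. It assumes each attribute file holds a single decimal integer followed by a newline, which is the usual sysfs convention but is not confirmed in this thread for loopback-ism. read_attr_ll() is a hypothetical helper name, not part of the patch set.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read one integer-valued attribute file.
 * Assumes the file contains a single decimal number (an assumption,
 * not something this patch set documents).
 * Returns 0 on success and stores the value in *out, -1 on failure.
 */
int read_attr_ll(const char *path, long long *out)
{
	char buf[64];
	char *end;
	FILE *f;

	assert(path && out);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	*out = strtoll(buf, &end, 10);
	if (errno || end == buf)	/* overflow or no digits parsed */
		return -1;
	return 0;
}
```

With the paths named in the cover letter, a call would look like read_attr_ll("/sys/devices/virtual/smc/loopback-ism/dmbs_cnt", &v). Whether that file exists depends on this patch set being applied and the device being present.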

The layering will be:

         +--------------------------------------+
         |                                      |
         |             SMC protocol             |
         |  (attrs by netlink in smc_netlink.c) |
         |                                      |
         +--------------------------------------+
       ------------------smcd_ops------------------
+---------------+ +---------------------+ +--------------+
| loopback-ism  | |  s390 firmware ISM  | |  Possible    |
+---------------+ |                     | |  other       |
(attrs by sysfs   |                     | |  virtual ISM |
in smc_loopback.c)|                     | |              |
                  |                     | |              |
                  +---------------------+ +--------------+

So I chose the current approach to expose the attributes of this lower-layer
loopback-ism device: it restricts loopback-ism-specific code to smc_loopback.c
and keeps the layering clear.
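
As a rough user-space model of this layering (not the kernel code), the sketch below puts a toy ops table between a "protocol" function and a loopback backend. All toy_* names are hypothetical, struct toy_backend_ops is only a stand-in and does not mirror the real struct smcd_ops, and the 64-byte buffer is an arbitrary size chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the ops boundary; NOT the kernel's struct smcd_ops. */
struct toy_backend_ops {
	const char *name;
	/* Copy up to len bytes into the backend buffer; returns bytes accepted. */
	size_t (*move_data)(void *priv, const void *src, size_t len);
};

/* Backend-private state: the upper layer never looks inside this. */
struct toy_loopback {
	char dmb[64];		/* stand-in for a DMB */
	size_t xfer_bytes;	/* backend-specific statistic */
};

size_t toy_lo_move_data(void *priv, const void *src, size_t len)
{
	struct toy_loopback *lo = priv;

	if (len > sizeof(lo->dmb))
		len = sizeof(lo->dmb);	/* truncate to the toy buffer size */
	memcpy(lo->dmb, src, len);
	lo->xfer_bytes += len;
	return len;
}

const struct toy_backend_ops toy_lo_ops = {
	.name = "loopback-ism",
	.move_data = toy_lo_move_data,
};

/* "Protocol" layer: it only sees the ops table, never the backend type. */
size_t toy_proto_send(const struct toy_backend_ops *ops, void *priv,
		      const void *buf, size_t len)
{
	assert(ops && ops->move_data);
	return ops->move_data(priv, buf, len);
}
```

In this model the xfer_bytes counter lives entirely in the backend, which is the separation argued for above: anything below the ops line stays backend-specific.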

Thanks,
Wen Gu
>
>>
>>
>>
>> Thanks again,
>> Wen Gu
>>
>>>
>>>> which means DMB is physically contiguous buffer.
>>>>
>>>> * Test object:
>>>> - TCP: run on TCP loopback.
>>>> - SMC lo: run on SMC loopback device.
>>>>
>>>> 1. ipc-benchmark (see [3])
>>>>
>>>> - ./<foo> -c 1000000 -s 100
>>>>
>>>> TCP SMC-lo
>>>> Message
>>>> rate (msg/s) 80636 149515(+85.42%)
>>>>
>>>> 2. sockperf
>>>>
>>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>>>
>>>> TCP SMC-lo
>>>> Bandwidth(MBps) 4909.36 8197.57(+66.98%)
>>>> Latency(us) 6.098 3.383(-44.52%)
>>>>
>>>> 3. nginx/wrk
>>>>
>>>> - serv: <smc_run> nginx
>>>> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>>>>
>>>> TCP SMC-lo
>>>> Requests/s 181685.74 246447.77(+35.65%)
>>>>
>>>> 4. redis-benchmark
>>>>
>>>> - serv: <smc_run> redis-server
>>>> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>>>>
>>>> TCP SMC-lo
>>>> GET(Requests/s) 85855.34 118553.64(+38.09%)
>>>> SET(Requests/s) 86824.40 125944.58(+45.06%)
>>>>
>>>>
>>>> Change log:
>>>>
>>>> v1->RFC:
>>>> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>>>> /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>>>> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>>>> merging sndbuf with peer DMB.
>>>> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>>>> control of whether to merge sndbuf and DMB. They can be respectively set by:
>>>> /sys/devices/virtual/smc/loopback-ism/dmb_type
>>>> /sys/devices/virtual/smc/loopback-ism/dmb_copy
>>>> The motivation for these two control is that a performance bottleneck was
>>>> found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>>>> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>>>> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>>>> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>>>> vmap lock contention [6]. It has significant effects, but using virtual memory
>>>> still has additional overhead compared to using physical memory.
>>>> So this new version provides controls of dmb_type and dmb_copy to suit
>>>> different scenarios.
>>>> - Some minor changes and comments improvements.
>>>>
>>>> RFC->old version([1]):
>>>> Link: https://lore.kernel.org/netdev/[email protected]/
>>>> - Patch #1: improve the loopback-ism dump, it shows as follows now:
>>>> # smcd d
>>>> FID Type PCI-ID PCHID InUse #LGs PNET-ID
>>>> 0000 0 loopback-ism ffff No 0
>>>> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>>>> smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>>>> regardless of whether there is already a device in smcd device list.
>>>> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>>>> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>>>> to activate or deactivate the loopback-ism.
>>>> - Patch #9: introduce the statistics of loopback-ism by
>>>> /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_bytes|dmbs_cnt}.
>>>> - Some minor changes and comments improvements.
>>>>
>>>> [1] https://lore.kernel.org/netdev/[email protected]/
>>>> [2] https://lore.kernel.org/netdev/[email protected]/
>>>> [3] https://github.com/goldsborough/ipc-bench
>>>> [4] https://lore.kernel.org/all/[email protected]/
>>>> [5] https://lore.kernel.org/all/[email protected]/
>>>> [6] https://lore.kernel.org/all/[email protected]/
>>>>
>>>> Wen Gu (15):
>>>> net/smc: improve SMC-D device dump for virtual ISM
>>>> net/smc: decouple specialized struct from SMC-D DMB registration
>>>> net/smc: introduce virtual ISM device loopback-ism
>>>> net/smc: implement ID-related operations of loopback-ism
>>>> net/smc: implement some unsupported operations of loopback-ism
>>>> net/smc: implement DMB-related operations of loopback-ism
>>>> net/smc: register loopback-ism into SMC-D device list
>>>> net/smc: introduce loopback-ism runtime switch
>>>> net/smc: introduce loopback-ism statistics attributes
>>>> net/smc: add operations to merge sndbuf with peer DMB
>>>> net/smc: attach or detach ghost sndbuf to peer DMB
>>>> net/smc: adapt cursor update when sndbuf and peer DMB are merged
>>>> net/smc: introduce loopback-ism DMB type control
>>>> net/smc: introduce loopback-ism DMB data copy control
>>>> net/smc: implement DMB-merged operations of loopback-ism
>>>>
>>>> drivers/s390/net/ism_drv.c | 2 +-
>>>> include/net/smc.h | 7 +-
>>>> net/smc/Kconfig | 13 +
>>>> net/smc/Makefile | 2 +-
>>>> net/smc/af_smc.c | 28 +-
>>>> net/smc/smc_cdc.c | 58 ++-
>>>> net/smc/smc_cdc.h | 1 +
>>>> net/smc/smc_core.c | 61 +++-
>>>> net/smc/smc_core.h | 1 +
>>>> net/smc/smc_ism.c | 71 +++-
>>>> net/smc/smc_ism.h | 5 +
>>>> net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++
>>>> net/smc/smc_loopback.h | 88 +++++
>>>> 13 files changed, 1026 insertions(+), 29 deletions(-)
>>>> create mode 100644 net/smc/smc_loopback.c
>>>> create mode 100644 net/smc/smc_loopback.h
>>>>
>>>> --
>>>> 2.32.0.3.g01195cf9f
>>>>
>>>>

2024-01-12 15:51:15

by Jiri Pirko

[permalink] [raw]
Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D

Fri, Jan 12, 2024 at 01:32:14PM CET, [email protected] wrote:
>
>
>On 2024/1/12 17:10, Jiri Pirko wrote:
>> Fri, Jan 12, 2024 at 09:29:35AM CET, [email protected] wrote:
>> >
>> >
>
><...>
>
>> > > > inter-containers communication within the same OS instance.
>> > > >
>> > > > # Design
>> > > >
>> > > > This patch set includes 3 parts:
>> > > >
>> > > > - Patch #1-#2: some prepare work for loopback-ism.
>> > > > - Patch #3-#9: implement loopback-ism device.
>> > > > - Patch #10-#15: memory copy optimization for loopback scenario.
>> > > >
>> > > > The loopback-ism device is designed as a ISMv2 device and not be limited to
>> > > > a specific net namespace, ends of both inter-process connection (1/1' in diagram
>> > > > below) or inter-container connection (2/2' in diagram below) can find the same
>> > > > available loopback-ism and choose it during the CLC handshake.
>> > > >
>> > > > Container 1 (ns1) Container 2 (ns2)
>> > > > +-----------------------------------------+ +-------------------------+
>> > > > | +-------+ +-------+ +-------+ | | +-------+ |
>> > > > | | App A | | App B | | App C | | | | App D |<-+ |
>> > > > | +-------+ +---^---+ +-------+ | | +-------+ |(2') |
>> > > > | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
>> > > > | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
>> > > > | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
>> > > > +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
>> > > > | | | |
>> > > > Kernel | | | |
>> > > > +----+-------v---+-----------v----------------------------------+---+----+
>> > > > | | TCP | |
>> > > > | | | |
>> > > > | +--------------------------------------------------------------+ |
>> > > > | |
>> > > > | +--------------+ |
>> > > > | | smc loopback | |
>> > > > +---------------------------+--------------+-----------------------------+
>> > > >
>> > > > loopback-ism device creates DMBs (shared memory) for each connection peer.
>> > > > Since data transfer occurs within the same kernel, the sndbuf of each peer
>> > > > is only a descriptor and point to the same memory region as peer DMB, so that
>> > > > the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>> > > >
>> > > > Container 1 (ns1) Container 2 (ns2)
>> > > > +-----------------------------------------+ +-------------------------+
>> > > > | +-------+ | | +-------+ |
>> > > > | | App C |-----+ | | | App D | |
>> > > > | +-------+ | | | +-^-----+ |
>> > > > | | | | | |
>> > > > | (2) | | | (2') | |
>> > > > | | | | | |
>> > > > +---------------|-------------------------+ +----------|--------------+
>> > > > | |
>> > > > Kernel | |
>> > > > +---------------|-----------------------------------------|--------------+
>> > > > | +--------+ +--v-----+ +--------+ +--------+ |
>> > > > | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
>> > > > | +-----|--+ +--|-----+ +-----|--+ +--------+ |
>> > > > | +-----|--+ | +-----|--+ |
>> > > > | | DMB C | +---------------------------------| DMB D | |
>> > > > | +--------+ +--------+ |
>> > > > | |
>> > > > | +--------------+ |
>> > > > | | smc loopback | |
>> > > > +---------------------------+--------------+-----------------------------+
>> > > >
>> > > > # Benchmark Test
>> > > >
>> > > > * Test environments:
>> > > > - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>> > > > - SMC sndbuf/DMB size 1MB.
>> > > > - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>> > > > which means sndbuf and DMB are merged and no data copied between them.
>> > > > - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>> > >
>> > > Exposing any configuration knobs and statistics over sysfs for
>> > > softdevices does not look correct at all :/ Could you please avoid
>> > > sysfs?
>> > >
>> >
>> > In previous reviews and calls, we think loopback-ism needs to be more
>> > like a device and be visible under /sys/devices.
>> >
>> > Would you mind explaining why using sysfs for loopback-ism is not correct?
>> > since I saw some other configurations or statistics exists under /sys/devices,
>> > e.g. /sys/devices/virtual/net/lo. Thank you!
>>
>> You have smc_netlink.c exposing clear netlink api for the subsystem.
>> Can't you extend it to contain the configuration knobs and expose stats
>> instead of sysfs?
>>
>
>Thank you for the suggestion. I've also considered this approach.
>
>But I didn't choose to extend the smc netlink because for now smc netlink
>are used for SMC protocol related attributes, for example:
>
>SMC_NETLINK_GET_SYS_INFO: SMC version, release, v2-capable..
>SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..)
>SMC_NETLINK_GET_LINK_SMCR: SMC-R link inform (link id, link state, conn cnt..)
>SMC_NETLINK_GET_DEV_SMCD: SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..)
>SMC_NETLINK_GET_DEV_SMCR: SMC-R device generic inform (dev name, port pnet_id, port valid, port state..)
>SMC_NETLINK_GET_STATS: SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...)
>
>And the knobs and stats in this patchset are loopback-ism device specific
>attributes, for example:
>
>active: loopback-ism runtime switch
>dmb_type: type of DMB provided by loopback-ism
>dmb_copy: support for DMB merge of loopback-ism
>xfer_bytes: data transferred by loopback-ism
>dmbs_cnt: DMB num provided by loopback-ism
>
>The layer will be:
>
>          +--------------------------------------+
>          |                                      |
>          |             SMC protocol             |
>          |  (attrs by netlink in smc_netlink.c) |
>          |                                      |
>          +--------------------------------------+
>        ------------------smcd_ops------------------
> +---------------+ +---------------------+ +--------------+
> | loopback-ism  | |  s390 firmware ISM  | |  Possible    |
> +---------------+ |                     | |  other       |
> (attrs by sysfs   |                     | |  virtual ISM |
> in smc_loopback.c)|                     | |              |
>                   |                     | |              |
>                   +---------------------+ +--------------+

So nest it:
SMC_NETLINK_BACKEND_GET_INFO
SMC_NETLINK_BACKEND_GET_STATS
?
I mean, isn't it better to have the backend knobs and stats in one place,
under the same netlink commands and attributes, than under a random sysfs path?
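
One way to picture this nesting is the user-space sketch below: a single generic query entry point that any backend can answer. It is a toy model only. The TOY_BACKEND_* values mirror the two command names proposed above but are made up, no such netlink commands exist in this thread's patch set, and real netlink replies carry typed attributes rather than the formatted strings used here to keep the example short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical commands modelled on the proposal; values are arbitrary. */
enum toy_cmd {
	TOY_BACKEND_GET_INFO,
	TOY_BACKEND_GET_STATS,
};

/* Generic view of a backend; field names follow the sysfs attributes. */
struct toy_backend {
	const char *name;
	int active;
	unsigned long long xfer_bytes;
	unsigned int dmbs_cnt;
};

/*
 * Format the reply for one command into buf.
 * Returns the reply length, or -1 on unknown command or short buffer.
 */
int toy_backend_query(const struct toy_backend *be, enum toy_cmd cmd,
		      char *buf, size_t len)
{
	int n;

	assert(be && buf);

	switch (cmd) {
	case TOY_BACKEND_GET_INFO:
		n = snprintf(buf, len, "name=%s active=%d",
			     be->name, be->active);
		break;
	case TOY_BACKEND_GET_STATS:
		n = snprintf(buf, len, "xfer_bytes=%llu dmbs_cnt=%u",
			     be->xfer_bytes, be->dmbs_cnt);
		break;
	default:
		return -1;
	}
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

The point of the sketch is only that both knobs and stats of any backend are reachable through the same command set, instead of through a per-device sysfs path.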




2024-01-12 16:24:59

by Niklas Schnelle

Subject: Re: [PATCH net-next 14/15] net/smc: introduce loopback-ism DMB data copy control

On Thu, 2024-01-11 at 20:00 +0800, Wen Gu wrote:
> This provides a way to {get|set} whether loopback-ism device supports
> merging sndbuf with peer DMB to eliminate data copies between them.
>
> echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # support
> echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # not support

The two support/no support remarks are a bit confusing because support
here seems to mean "support no-copy mode" while the attribute is more
like "force copy mode". How about:

echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # one DMB mode
echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # copy mode

>
> The settings take effect after re-activating loopback-ism by:
>
> echo 0 > /sys/devices/virtual/smc/loopback-ism/active
> echo 1 > /sys/devices/virtual/smc/loopback-ism/active
>
> After this, the link group related to loopback-ism will be flushed and
> the sndbufs of subsequent connections will be merged or not merged with
> peer DMB.
>
> The motivation of this control is that the bandwidth will be highly
> improved when sndbuf and DMB are merged, but when virtually contiguous
> DMB is provided and merged with sndbuf, it will be concurrently accessed
> on Tx and Rx, then there will be a bottleneck caused by lock contention
> of find_vmap_area when there are many CPUs and CONFIG_HARDENED_USERCOPY
> is set (see link below). So an option is provided.
>
> Link: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Wen Gu <[email protected]>
---8<---

2024-01-13 07:12:27

by Wen Gu

Subject: Re: [PATCH net-next 14/15] net/smc: introduce loopback-ism DMB data copy control



On 2024/1/13 00:24, Niklas Schnelle wrote:
> On Thu, 2024-01-11 at 20:00 +0800, Wen Gu wrote:
>> This provides a way to {get|set} whether loopback-ism device supports
>> merging sndbuf with peer DMB to eliminate data copies between them.
>>
>> echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # support
>> echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # not support
>
> The two support/no support remarks are a bit confusing because support
> here seems to mean "support no-copy mode" while the attribute is more
> like "force copy mode". How about:
>
> echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # one DMB mode
> echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # copy mode
>

Thank you, Niklas!
That makes it much clearer. I will improve it in the next version.
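
A minimal sketch of how the two modes and the re-activation step quoted in this thread could be scripted, assuming the sysfs attributes behave as the patch description says. The function name `set_dmb_mode`, the mode names `one-dmb` and `copy`, and the `LO_ISM_DIR` override are hypothetical and not part of the patch set:

```shell
#!/bin/sh
# Hypothetical helper: select the loopback-ism DMB mode, then re-activate
# the device so that the setting applies to subsequent connections.
# LO_ISM_DIR defaults to the sysfs path quoted in this thread; it can be
# overridden, for example with a scratch directory when trying the script out.
LO_ISM_DIR="${LO_ISM_DIR:-/sys/devices/virtual/smc/loopback-ism}"

set_dmb_mode() {
    case "$1" in
        one-dmb) val=0 ;;   # sndbuf is merged with the peer DMB, no copy
        copy)    val=1 ;;   # data is copied from sndbuf to the peer DMB
        *) echo "usage: set_dmb_mode one-dmb|copy" >&2; return 2 ;;
    esac
    echo "$val" > "$LO_ISM_DIR/dmb_copy" || return 1
    # Re-activate the device; per the patch description this flushes the
    # link groups related to loopback-ism.
    echo 0 > "$LO_ISM_DIR/active" || return 1
    echo 1 > "$LO_ISM_DIR/active" || return 1
}
```

With the function loaded, `set_dmb_mode copy` would correspond to the "copy mode" line above and `set_dmb_mode one-dmb` to the "one DMB mode" line. Writing to the real sysfs path would normally require root.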

>>
>> The settings take effect after re-activating loopback-ism by:
>>
>> echo 0 > /sys/devices/virtual/smc/loopback-ism/active
>> echo 1 > /sys/devices/virtual/smc/loopback-ism/active
>>
>> After this, the link group related to loopback-ism will be flushed and
>> the sndbufs of subsequent connections will be merged or not merged with
>> peer DMB.
>>
>> The motivation of this control is that the bandwidth will be highly
>> improved when sndbuf and DMB are merged, but when virtually contiguous
>> DMB is provided and merged with sndbuf, it will be concurrently accessed
>> on Tx and Rx, then there will be a bottleneck caused by lock contention
>> of find_vmap_area when there are many CPUs and CONFIG_HARDENED_USERCOPY
>> is set (see link below). So an option is provided.
>>
>> Link: https://lore.kernel.org/all/[email protected]/
>> Signed-off-by: Wen Gu <[email protected]>
> ---8<---

2024-01-13 09:22:44

by Wen Gu

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 2024/1/12 23:50, Jiri Pirko wrote:
> Fri, Jan 12, 2024 at 01:32:14PM CET, [email protected] wrote:
>>
>>
>> On 2024/1/12 17:10, Jiri Pirko wrote:
>>> Fri, Jan 12, 2024 at 09:29:35AM CET, [email protected] wrote:
>>>>

<...>

>>>>>> # Benchmark Test
>>>>>>
>>>>>> * Test environments:
>>>>>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>>>>> - SMC sndbuf/DMB size 1MB.
>>>>>> - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>>>>>> which means sndbuf and DMB are merged and no data copied between them.
>>>>>> - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>>>>>
>>>>> Exposing any configuration knobs and statistics over sysfs for
>>>>> softdevices does not look correct at all :/ Could you please avoid
>>>>> sysfs?
>>>>>
>>>>
>>>> In previous reviews and calls, we think loopback-ism needs to be more
>>>> like a device and be visible under /sys/devices.
>>>>
>>>> Would you mind explaining why using sysfs for loopback-ism is not correct?
>>>> since I saw some other configurations or statistics exists under /sys/devices,
>>>> e.g. /sys/devices/virtual/net/lo. Thank you!
>>>
>>> You have smc_netlink.c exposing clear netlink api for the subsystem.
>>> Can't you extend it to contain the configuration knobs and expose stats
>>> instead of sysfs?
>>>
>>
>> Thank you for the suggestion. I've also considered this approach.
>>
>> But I didn't choose to extend the smc netlink because for now smc netlink
>> are used for SMC protocol related attributes, for example:
>>
>> SMC_NETLINK_GET_SYS_INFO: SMC version, release, v2-capable..
>> SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..)
>> SMC_NETLINK_GET_LINK_SMCR: SMC-R link inform (link id, link state, conn cnt..)
>> SMC_NETLINK_GET_DEV_SMCD: SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..)
>> SMC_NETLINK_GET_DEV_SMCR: SMC-R device generic inform (dev name, port pnet_id, port valid, port state..)
>> SMC_NETLINK_GET_STATS: SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...)
>>
>> And the knobs and stats in this patchset are loopback-ism device specific
>> attributes, for example:
>>
>> active: loopback-ism runtime switch
>> dmb_type: type of DMB provided by loopback-ism
>> dmb_copy: support for DMB merge of loopback-ism
>> xfer_bytes: data transferred by loopback-ism
>> dmbs_cnt: DMB num provided by loopback-ism
>>
>> The layer will be:
>>
>>                 +--------------------------------------+
>>                 |                                      |
>>                 |             SMC protocol             |
>>                 | (attrs by netlink in smc_netlink.c)  |
>>                 |                                      |
>>                 +--------------------------------------+
>>         ------------------smcd_ops------------------
>> +---------------+  +---------------------+  +--------------+
>> | loopback-ism  |  |  s390 firmware ISM  |  |  Possible    |
>> +---------------+  |                     |  |  other       |
>> (attrs by sysfs    |                     |  |  virtual ISM |
>>  in smc_loopback.c)|                     |  |              |
>>                    |                     |  |              |
>>                    +---------------------+  +--------------+
>
> So nest it:
> SMC_NETLINK_BACKEND_GET_INFO
> SMC_NETLINK_BACKEND_GET_STATS
> ?
> I mean, isn't it better to have the backend knobs and stats in one place
> under same netlink commands and attributes than random sysfs path ?
>
Thank you for the suggestion.

I think this is not about nesting or gathering knobs and stats. It is
about not coupling underlying device details to the upper-layer SMC stack.

From the SMC perspective, what matters is the abstract operations defined
by smcd_ops, regardless of which underlying device provides them and how.
So IMO the details and configuration of the underlying devices shouldn't
be pulled into SMC.

Besides, the knobs and stats here are specific to the loopback-ism device.
They include a runtime switch, a buffer type choice and a mode choice of
loopback-ism (basically they won't change after being set once). The
other kinds of devices used by SMC-D, e.g. s390 firmware ISM or other
virtual ISMs, have nothing similar.

So I prefer to keep the current solution instead of extending the
upper-layer SMC netlink.
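
For illustration, the loopback-ism attributes listed earlier in this thread could be read from sysfs roughly as follows. This is only a sketch: it assumes the attribute names (active, dmb_type, dmb_copy, xfer_bytes, dmbs_cnt) and the path from this patch set, and `lo_ism_dump` is a hypothetical helper name, not something the patches provide:

```shell
#!/bin/sh
# Hypothetical helper: print each loopback-ism attribute named in this
# thread as "name=value". Attributes that are missing or unreadable are
# reported as "name=<unavailable>".
# The optional first argument overrides the sysfs directory.
lo_ism_dump() {
    dir="${1:-/sys/devices/virtual/smc/loopback-ism}"
    for attr in active dmb_type dmb_copy xfer_bytes dmbs_cnt; do
        if [ -r "$dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dir/$attr")"
        else
            printf '%s=<unavailable>\n' "$attr"
        fi
    done
}
```

Calling `lo_ism_dump` without arguments reads the default path; passing a directory makes it easy to try the helper against a scratch copy of the attributes.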

Thanks,
Wen Gu


2024-01-15 14:11:36

by Jiri Pirko

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D

Sat, Jan 13, 2024 at 10:22:15AM CET, [email protected] wrote:
>
>
>On 2024/1/12 23:50, Jiri Pirko wrote:
>> Fri, Jan 12, 2024 at 01:32:14PM CET, [email protected] wrote:
>> >
>> >
>> > On 2024/1/12 17:10, Jiri Pirko wrote:
>> > > Fri, Jan 12, 2024 at 09:29:35AM CET, [email protected] wrote:
>> > > >
>
><...>
>
>> > > > > > # Benchmark Test
>> > > > > >
>> > > > > > * Test environments:
>> > > > > > - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>> > > > > > - SMC sndbuf/DMB size 1MB.
>> > > > > > - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>> > > > > > which means sndbuf and DMB are merged and no data copied between them.
>> > > > > > - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>> > > > >
>> > > > > Exposing any configuration knobs and statistics over sysfs for
>> > > > > softdevices does not look correct at all :/ Could you please avoid
>> > > > > sysfs?
>> > > > >
>> > > >
>> > > > In previous reviews and calls, we think loopback-ism needs to be more
>> > > > like a device and be visible under /sys/devices.
>> > > >
>> > > > Would you mind explaining why using sysfs for loopback-ism is not correct?
>> > > > since I saw some other configurations or statistics exists under /sys/devices,
>> > > > e.g. /sys/devices/virtual/net/lo. Thank you!
>> > >
>> > > You have smc_netlink.c exposing clear netlink api for the subsystem.
>> > > Can't you extend it to contain the configuration knobs and expose stats
>> > > instead of sysfs?
>> > >
>> >
>> > Thank you for the suggestion. I've also considered this approach.
>> >
>> > But I didn't choose to extend the smc netlink because for now smc netlink
>> > are used for SMC protocol related attributes, for example:
>> >
>> > SMC_NETLINK_GET_SYS_INFO: SMC version, release, v2-capable..
>> > SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..)
>> > SMC_NETLINK_GET_LINK_SMCR: SMC-R link inform (link id, link state, conn cnt..)
>> > SMC_NETLINK_GET_DEV_SMCD: SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..)
>> > SMC_NETLINK_GET_DEV_SMCR: SMC-R device generic inform (dev name, port pnet_id, port valid, port state..)
>> > SMC_NETLINK_GET_STATS: SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...)
>> >
>> > And the knobs and stats in this patchset are loopback-ism device specific
>> > attributes, for example:
>> >
>> > active: loopback-ism runtime switch
>> > dmb_type: type of DMB provided by loopback-ism
>> > dmb_copy: support for DMB merge of loopback-ism
>> > xfer_bytes: data transferred by loopback-ism
>> > dmbs_cnt: DMB num provided by loopback-ism
>> >
>> > The layer will be:
>> >
>> >                 +--------------------------------------+
>> >                 |                                      |
>> >                 |             SMC protocol             |
>> >                 | (attrs by netlink in smc_netlink.c)  |
>> >                 |                                      |
>> >                 +--------------------------------------+
>> >         ------------------smcd_ops------------------
>> > +---------------+  +---------------------+  +--------------+
>> > | loopback-ism  |  |  s390 firmware ISM  |  |  Possible    |
>> > +---------------+  |                     |  |  other       |
>> > (attrs by sysfs    |                     |  |  virtual ISM |
>> >  in smc_loopback.c)|                     |  |              |
>> >                    |                     |  |              |
>> >                    +---------------------+  +--------------+
>>
>> So nest it:
>> SMC_NETLINK_BACKEND_GET_INFO
>> SMC_NETLINK_BACKEND_GET_STATS
>> ?
>> I mean, isn't it better to have the backend knobs and stats in one place
>> under same netlink commands and attributes than random sysfs path ?
>>
>Thank you for suggestion.
>
>I think it is not about nesting or gathering knobs and stats. It is
>about not coupling underlying device details to upper layer SMC stack.
>
>From SMC perspective, it cares about the abstract operations defined
>by smcd_ops, regardless of which underlying devices provide these
>functions and how they provide. So IMO the details or configurations
>of underlying devices shouldn't be involved in SMC.

So you would rather keep the device configuration and info exposed over random
sysfs files? Sorry, that makes no sense to me.


>
>Besides, the knobs and stats here are specific for loopback-ism device,
>they include runtime switch, buffer type choice and mode choice of
>loopback-ism (basically they won't change after being set once). The
>other kinds of devices used by SMC-D, e.g. s390 firmware ISM or other
>virtual ISMs have no similar things.

Okay, it is normal for different drivers to implement different parts of
the UAPI. No problem.


>
>So I prefer to keep the current solution instead of expanding upper
>layer SMC netlink.

Makes no sense to me. This is UAPI from 20 years ago. Is this a time
machine?



2024-01-18 08:37:01

by Wen Gu

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 2024/1/11 20:00, Wen Gu wrote:
> This patch set is the second part of the new version of [1] (the first
> part can be found in [2]). The changes in this version are listed
> at the end.
>

Hi Wenjia and Jan, I would appreciate any thoughts or comments you might have
on this series. Thank you very much!

>
> # Design
>
> This patch set includes 3 parts:
>
> - Patch #1-#2: some preparatory work for loopback-ism.
> - Patch #3-#9: implement loopback-ism device.
> - Patch #10-#15: memory copy optimization for loopback scenario.
>
> The loopback-ism device is designed as an ISMv2 device and is not limited to
> a specific net namespace. Both ends of an inter-process connection (1/1' in the
> diagram below) or an inter-container connection (2/2' in the diagram below) can
> find the same available loopback-ism and choose it during the CLC handshake.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ +-------+ +-------+ | | +-------+ |
> | | App A | | App B | | App C | | | | App D |<-+ |
> | +-------+ +---^---+ +-------+ | | +-------+ |(2') |
> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
> | | | |
> Kernel | | | |
> +----+-------v---+-----------v----------------------------------+---+----+
> | | TCP | |
> | | | |
> | +--------------------------------------------------------------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
> The loopback-ism device creates DMBs (shared memory) for each connection peer.
> Since data transfer occurs within the same kernel, the sndbuf of each peer
> is only a descriptor that points to the same memory region as the peer DMB, so
> the data copy from sndbuf to peer DMB can be avoided in the loopback-ism case.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ | | +-------+ |
> | | App C |-----+ | | | App D | |
> | +-------+ | | | +-^-----+ |
> | | | | | |
> | (2) | | | (2') | |
> | | | | | |
> +---------------|-------------------------+ +----------|--------------+
> | |
> Kernel | |
> +---------------|-----------------------------------------|--------------+
> | +--------+ +--v-----+ +--------+ +--------+ |
> | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
> | +-----|--+ +--|-----+ +-----|--+ +--------+ |
> | +-----|--+ | +-----|--+ |
> | | DMB C | +---------------------------------| DMB D | |
> | +--------+ +--------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
> # Benchmark Test
>
> * Test environments:
> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
> - SMC sndbuf/DMB size 1MB.
> - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
> which means sndbuf and DMB are merged and no data copied between them.
> - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
> which means DMB is physically contiguous buffer.
>
> * Test object:
> - TCP: run on TCP loopback.
> - SMC lo: run on SMC loopback device.
>
> 1. ipc-benchmark (see [3])
>
> - ./<foo> -c 1000000 -s 100
>
> TCP SMC-lo
> Message
> rate (msg/s) 80636 149515(+85.42%)
>
> 2. sockperf
>
> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>
> TCP SMC-lo
> Bandwidth(MBps) 4909.36 8197.57(+66.98%)
> Latency(us) 6.098 3.383(-44.52%)
>
> 3. nginx/wrk
>
> - serv: <smc_run> nginx
> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>
> TCP SMC-lo
> Requests/s 181685.74 246447.77(+35.65%)
>
> 4. redis-benchmark
>
> - serv: <smc_run> redis-server
> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>
> TCP SMC-lo
> GET(Requests/s) 85855.34 118553.64(+38.09%)
> SET(Requests/s) 86824.40 125944.58(+45.06%)
>
>
> Change log:
>
> v1->RFC:
> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
> /sys/devices/virtual/smc/loopback-ism/xfer_bytes
> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
> merging sndbuf with peer DMB.
> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
> control of whether to merge sndbuf and DMB. They can be respectively set by:
> /sys/devices/virtual/smc/loopback-ism/dmb_type
> /sys/devices/virtual/smc/loopback-ism/dmb_copy
> The motivation for these two control is that a performance bottleneck was
> found when using vzalloced DMB and sndbuf is merged with DMB, and there are
> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
> vmap lock contention [6]. It has significant effects, but using virtual memory
> still has additional overhead compared to using physical memory.
> So this new version provides controls of dmb_type and dmb_copy to suit
> different scenarios.
> - Some minor changes and comments improvements.
>
> RFC->old version([1]):
> Link: https://lore.kernel.org/netdev/[email protected]/
> - Patch #1: improve the loopback-ism dump, it shows as follows now:
> # smcd d
> FID Type PCI-ID PCHID InUse #LGs PNET-ID
> 0000 0 loopback-ism ffff No 0
> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
> smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
> regardless of whether there is already a device in smcd device list.
> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
> to activate or deactivate the loopback-ism.
> - Patch #9: introduce the statistics of loopback-ism by
> /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_bytes|dmbs_cnt}.
> - Some minor changes and comments improvements.
>
> [1] https://lore.kernel.org/netdev/[email protected]/
> [2] https://lore.kernel.org/netdev/[email protected]/
> [3] https://github.com/goldsborough/ipc-bench
> [4] https://lore.kernel.org/all/[email protected]/
> [5] https://lore.kernel.org/all/[email protected]/
> [6] https://lore.kernel.org/all/[email protected]/
>
> Wen Gu (15):
> net/smc: improve SMC-D device dump for virtual ISM
> net/smc: decouple specialized struct from SMC-D DMB registration
> net/smc: introduce virtual ISM device loopback-ism
> net/smc: implement ID-related operations of loopback-ism
> net/smc: implement some unsupported operations of loopback-ism
> net/smc: implement DMB-related operations of loopback-ism
> net/smc: register loopback-ism into SMC-D device list
> net/smc: introduce loopback-ism runtime switch
> net/smc: introduce loopback-ism statistics attributes
> net/smc: add operations to merge sndbuf with peer DMB
> net/smc: attach or detach ghost sndbuf to peer DMB
> net/smc: adapt cursor update when sndbuf and peer DMB are merged
> net/smc: introduce loopback-ism DMB type control
> net/smc: introduce loopback-ism DMB data copy control
> net/smc: implement DMB-merged operations of loopback-ism
>
> drivers/s390/net/ism_drv.c | 2 +-
> include/net/smc.h | 7 +-
> net/smc/Kconfig | 13 +
> net/smc/Makefile | 2 +-
> net/smc/af_smc.c | 28 +-
> net/smc/smc_cdc.c | 58 ++-
> net/smc/smc_cdc.h | 1 +
> net/smc/smc_core.c | 61 +++-
> net/smc/smc_core.h | 1 +
> net/smc/smc_ism.c | 71 +++-
> net/smc/smc_ism.h | 5 +
> net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 88 +++++
> 13 files changed, 1026 insertions(+), 29 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h
>

2024-01-18 14:00:09

by Wenjia Zhang

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 18.01.24 09:27, Wen Gu wrote:
>
>
> On 2024/1/11 20:00, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The
>> first
>> part can be referred from [2]), the updated things of this version are
>> listed
>> at the end.
>>
>
> Hi Wenjia and Jan, I would appreciate any thoughts or comments you might
> have
> on this series. Thank you very much!
>
Hi Wen,

I'm still in the middle of the prototype for IPPROTO_SMC and other
issues, so I need more time to review this patch series.

Thank you for your patience!
Wenjia


2024-01-19 01:47:13

by Wen Gu

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 2024/1/18 21:59, Wenjia Zhang wrote:
>
>
> On 18.01.24 09:27, Wen Gu wrote:
>>
>>
>> On 2024/1/11 20:00, Wen Gu wrote:
>>> This patch set acts as the second part of the new version of [1] (The first
>>> part can be referred from [2]), the updated things of this version are listed
>>> at the end.
>>>
>>
>> Hi Wenjia and Jan, I would appreciate any thoughts or comments you might have
>> on this series. Thank you very much!
>>
> Hi Wen,
>
> I'm still in the middle of the prototype for IPPROTO_SMC and other issues, so I need more time to review this patch
> series.
>
> Thank you for your patience!
> Wenjia

Understood. Thank you, Wenjia!

Best regards,
Wen Gu

2024-01-23 14:03:33

by Alexandra Winter

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 19.01.24 02:46, Wen Gu wrote:
>
>
> On 2024/1/18 21:59, Wenjia Zhang wrote:
>>
>>
>> On 18.01.24 09:27, Wen Gu wrote:
>>>
>>>
>>> On 2024/1/11 20:00, Wen Gu wrote:
>>>> This patch set acts as the second part of the new version of [1] (The first
>>>> part can be referred from [2]), the updated things of this version are listed
>>>> at the end.
>>>>
>>>
>>> Hi Wenjia and Jan, I would appreciate any thoughts or comments you might have
>>> on this series. Thank you very much!
>>>
>> Hi Wen,
>>
>> I'm still in the middle of the prototype for IPPROTO_SMC and other issues, so I need more time to review this patch series.
>>
>> Thank you for your patience!
>> Wenjia
>
> Understood. Thank you, Wenjia!
>
> Best regards,
> Wen Gu
>

Hello Wen Gu and others,

I just wanted to let you know that unfortunately both Wenjia and Jan have called in sick and we don't know
when they will be back at work.
So I'm sorry but there may be more delays in the review of this patchset.

Kind regards
Alexandra Winter

2024-01-24 06:33:51

by Wen Gu

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 2024/1/23 22:03, Alexandra Winter wrote:
> Hello Wen Gu and others,
>
> I just wanted to let you know that unfortunately both Wenjia and Jan have called in sick and we don't know
> when they will be back at work.
> So I'm sorry but there may be more delays in the review of this patchset.
>
> Kind regards
> Alexandra Winter

Hi Alexandra,

Thank you for the update. Health comes first. Wishing both Wenjia and Jan
a swift recovery.

Best regards,
Wen Gu

2024-02-05 10:06:20

by Wen Gu

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D


On 2024/1/24 14:33, Wen Gu wrote:
>
>
> On 2024/1/23 22:03, Alexandra Winter wrote:
>> Hello Wen Gu and others,
>>
>> I just wanted to let you know that unfortunately both Wenjia and Jan have called in sick and we don't know
>> when they will be back at work.
>> So I'm sorry but there may be more delays in the review of this patchset.
>>
>> Kind regards
>> Alexandra Winter
>
> Hi Alexandra,
>
> Thank you for the update. Health comes first. Wishing both Wenjia and Jan
> a swift recovery.
>
> Best regards,
> Wen Gu

Hi, Wenjia and Jan

I would like to ask if I should wait for the review of this version
or send a v2 (with some minor changes)?

Thanks!

2024-02-06 12:45:57

by Alexandra Winter

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 11.01.24 13:00, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.
>
> # Background
>
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-implemented virtual ISM device,
> that is the loopback-ism device here, to accelerate inter-process or
> inter-containers communication within the same OS instance.


Hello Wen Gu,

thank you very much for this patchset. I have been looking at it a bit.
I installed it on a test server, but have not yet exercised the loopback-ism device.
I want to continue investigating, but daily work interferes, so I thought I
would send you some comments now. So this is not at all a code review, but some
thoughts and observations about the general concept.


In [1] there was a discussion about an abstraction layer between smc-d and the
ism devices.
I am not sure what you are proposing now, is it an smc-d feature or independent of smc?
In 3/15 you say it is part of the SMC module, but then it has its own device entry.
Didn't you want to use it for other things as well? Or is it an SMC-D only feature?
Is it a device (Config help: "kind of virtual device")? Or an SMC-D feature?

Will we have a class of ism devices (s390 ism, ism-loopback, virtio-ism)
that share common properties (internal API?)
and smc-d will work with any of those?
But they all can exist without smc ?! BTW: This is what we want for s390-ism.
The client-registration interface [2] is currently the way to achieve this.
But maybe we need a more general concept?

Maybe first a preparation patchset that introduces a class/ism
together with an improved API?
In case you want to use ISM devices for other purposes as well.
But then the whole picture of ism-loopback in one patchset (RFC?)
has its benefits as well.


Other points that I noticed:

Naming: smc loopback, ism-loopback, loopback-ism ?

config: why not tristate? Why under net/smc?

/sys/devices/virtual/smc does not initially show up in my installation!!!
root@t35lp50:/sys/devices/virtual/> ls
3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ tty/ vc/ vtconsole/ workqueue/
root@t35lp50:/sys/devices/virtual/> ls smc/loopback-ism
active dmb_copy dmbs_cnt dmb_type subsystem@ uevent xfer_bytes
root@t35lp50:/sys/devices/virtual/> ls
3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ smc/ tty/ vc/ vtconsole/ workqueue/
Is that normal behaviour?

You introduced a class/smc
Maybe class/ism would be better?
The other smc interfaces do not show up in class/smc!! Not so good

Why doesn't it show in smc_rnics?
(Maybe some deficiency of smc_rnics?)

But then it shows in other smc-tools:
root@t35lp50:/sys/> smcd device
FID Type PCI-ID PCHID InUse #LGs PNET-ID
0000 0 loopback-ism ffff No 0
0029 ISM 0000:00:00.0 07c1 No 0 NET1
Nice!

Kind regards
Sandy


[1] https://lore.kernel.org/lkml/[email protected]/
[2] 89e7d2ba61b7 ("net/ism: Add new API for client registration")

2024-02-07 09:08:40

by Wenjia Zhang

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 05.02.24 11:05, Wen Gu wrote:
>
> On 2024/1/24 14:33, Wen Gu wrote:
>>
>>
>> On 2024/1/23 22:03, Alexandra Winter wrote:
>>> Hello Wen Gu and others,
>>>
>>> I just wanted to let you know that unfortunately both Wenjia and Jan
>>> have called in sick and we don't know
>>> when they will be back at work.
>>> So I'm sorry but there may be more delays in the review of this
>>> patchset.
>>>
>>> Kind regards
>>> Alexandra Winter
>>
>> Hi Alexandra,
>>
>> Thank you for the update. Health comes first. Wishing both Wenjia and Jan
>> a swift recovery.
>>
>> Best regards,
>> Wen Gu
>
> Hi, Wenjia and Jan
>
> I would like to ask if I should wait for the review of this version
> or send a v2 (with some minor changes)?
>
> Thanks!

Hi Wen,

I can finally carve out some time for these patches; the review is still
ongoing. I'll send my comments out as soon as I finish all of them.

Thank you for your patience!
Wenjia

2024-02-08 16:13:26

by Wen Gu

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 2024/2/6 20:18, Alexandra Winter wrote:
>
>
> On 11.01.24 13:00, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>>
>> # Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>> on the non-s390 architecture through a software-implemented virtual ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
>> inter-containers communication within the same OS instance.
>
>
> Hello Wen Gu,
>
> thank you very much for this patchset. I have been looking at it a bit.
> I installed it on a test server, but have not yet exercised the loopback-ism device.
> I want to continue investigating, but daily work interferes, so I thought I
> would send you some comments now. So this is not at all a code review, but some
> thoughts and observations about the general concept.
>

Hi Sandy and Wenjia,

Thank you very much for your feedback!
I am working on the detailed replies. As we are on holiday for Chinese New Year,
the progress may be slower. But please feel free to leave any other comments and
feedback, thank you!

Best regards!
Wen Gu

>
> In [1] there was a discussion about an abstraction layer between smc-d and the
> ism devices.
> I am not sure what you are proposing now, is it an smc-d feature or independent of smc?
> In 3/15 you say it is part of the SMC module, but then it has its own device entry.
> Didn't you want to use it for other things as well? Or is it an SMC-D only feature?
> Is it a device (Config help: "kind of virtual device")? Or an SMC-D feature?
>
> Will we have a class of ism devices (s390 ism, ism-loopback, virtio-ism)
> that share common properties (internal API?)
> and smc-d will work with any of those?
> But they all can exist without smc ?! BTW: This is what we want for s390-ism.
> The client-registration interface [2] is currently the way to achieve this.
> But maybe we need a more general concept?
>
> Maybe first a preparation patchset that introduces a class/ism
> together with an improved API?
> In case you want to use ISM devices for other purposes as well.
> But then the whole picture of ism-loopback in one patchset (RFC?)
> has its benefits as well.
>
>
> Other points that I noticed:
>
> Naming: smc loopback, ism-loopback, loopback-ism ?
>
> config: why not tristate? Why under net/smc?
>
> /sys/devices/virtual/smc does not initially show up in my installation!!!
> root@t35lp50:/sys/devices/virtual/> ls
> 3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ tty/ vc/ vtconsole/ workqueue/
> root@t35lp50:/sys/devices/virtual/> ls smc/loopback-ism
> active dmb_copy dmbs_cnt dmb_type subsystem@ uevent xfer_bytes
> root@t35lp50:/sys/devices/virtual/> ls
> 3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ smc/ tty/ vc/ vtconsole/ workqueue/
> Is that normal behaviour?
>
> You introduced a class/smc
> Maybe class/ism would be better?
> The other smc interfaces do not show up in class/smc!! Not so good
>
> Why doesn't it show in smc_rnics?
> (Maybe some deficiency of smc_rnics?)
>
> But then it shows in other smc-tools:
> root@t35lp50:/sys/> smcd device
> FID Type PCI-ID PCHID InUse #LGs PNET-ID
> 0000 0 loopback-ism ffff No 0
> 0029 ISM 0000:00:00.0 07c1 No 0 NET1
> Nice!
>
> Kind regards
> Sandy
>
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] 89e7d2ba61b7 ("net/ism: Add new API for client registration")

2024-02-16 14:10:41

by Wenjia Zhang

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 11.01.24 13:00, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.
>
> # Background
>
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-implemented virtual ISM device,
> that is the loopback-ism device here, to accelerate inter-process or
> inter-containers communication within the same OS instance.
>
> # Design
>
> This patch set includes 3 parts:
>
> - Patch #1-#2: some prepare work for loopback-ism.
> - Patch #3-#9: implement loopback-ism device.
> - Patch #10-#15: memory copy optimization for loopback scenario.
>
> The loopback-ism device is designed as a ISMv2 device and not be limited to
> a specific net namespace, ends of both inter-process connection (1/1' in diagram
> below) or inter-container connection (2/2' in diagram below) can find the same
> available loopback-ism and choose it during the CLC handshake.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ +-------+ +-------+ | | +-------+ |
> | | App A | | App B | | App C | | | | App D |<-+ |
> | +-------+ +---^---+ +-------+ | | +-------+ |(2') |
> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
> | | | |
> Kernel | | | |
> +----+-------v---+-----------v----------------------------------+---+----+
> | | TCP | |
> | | | |
> | +--------------------------------------------------------------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
> loopback-ism device creates DMBs (shared memory) for each connection peer.
> Since data transfer occurs within the same kernel, the sndbuf of each peer
> is only a descriptor and point to the same memory region as peer DMB, so that
> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ | | +-------+ |
> | | App C |-----+ | | | App D | |
> | +-------+ | | | +-^-----+ |
> | | | | | |
> | (2) | | | (2') | |
> | | | | | |
> +---------------|-------------------------+ +----------|--------------+
> | |
> Kernel | |
> +---------------|-----------------------------------------|--------------+
> | +--------+ +--v-----+ +--------+ +--------+ |
> | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
> | +-----|--+ +--|-----+ +-----|--+ +--------+ |
> | +-----|--+ | +-----|--+ |
> | | DMB C | +---------------------------------| DMB D | |
> | +--------+ +--------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
> # Benchmark Test
>
> * Test environments:
> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
> - SMC sndbuf/DMB size 1MB.
> - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
> which means sndbuf and DMB are merged and no data copied between them.
> - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
> which means DMB is physically contiguous buffer.
>
> * Test object:
> - TCP: run on TCP loopback.
> - SMC lo: run on SMC loopback device.
>
> 1. ipc-benchmark (see [3])
>
> - ./<foo> -c 1000000 -s 100
>
> TCP SMC-lo
> Message
> rate (msg/s) 80636 149515(+85.42%)
>
> 2. sockperf
>
> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>
> TCP SMC-lo
> Bandwidth(MBps) 4909.36 8197.57(+66.98%)
> Latency(us) 6.098 3.383(-44.52%)
>
> 3. nginx/wrk
>
> - serv: <smc_run> nginx
> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>
> TCP SMC-lo
> Requests/s 181685.74 246447.77(+35.65%)
>
> 4. redis-benchmark
>
> - serv: <smc_run> redis-server
> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>
> TCP SMC-lo
> GET(Requests/s) 85855.34 118553.64(+38.09%)
> SET(Requests/s) 86824.40 125944.58(+45.06%)
>
>
> Change log:
>
> v1->RFC:
> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
> /sys/devices/virtual/smc/loopback-ism/xfer_bytes
> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
> merging sndbuf with peer DMB.
> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
> control of whether to merge sndbuf and DMB. They can be respectively set by:
> /sys/devices/virtual/smc/loopback-ism/dmb_type
> /sys/devices/virtual/smc/loopback-ism/dmb_copy
> The motivation for these two control is that a performance bottleneck was
> found when using vzalloced DMB and sndbuf is merged with DMB, and there are
> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
> vmap lock contention [6]. It has significant effects, but using virtual memory
> still has additional overhead compared to using physical memory.
> So this new version provides controls of dmb_type and dmb_copy to suit
> different scenarios.
> - Some minor changes and comments improvements.
>
> RFC->old version([1]):
> Link: https://lore.kernel.org/netdev/[email protected]/
> - Patch #1: improve the loopback-ism dump, it shows as follows now:
> # smcd d
> FID Type PCI-ID PCHID InUse #LGs PNET-ID
> 0000 0 loopback-ism ffff No 0
> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
> smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
> regardless of whether there is already a device in smcd device list.
> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
> to activate or deactivate the loopback-ism.
> - Patch #9: introduce the statistics of loopback-ism by
> /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_bytes|dmbs_cnt}.
> - Some minor changes and comments improvements.
>
> [1] https://lore.kernel.org/netdev/[email protected]/
> [2] https://lore.kernel.org/netdev/[email protected]/
> [3] https://github.com/goldsborough/ipc-bench
> [4] https://lore.kernel.org/all/[email protected]/
> [5] https://lore.kernel.org/all/[email protected]/
> [6] https://lore.kernel.org/all/[email protected]/
>
> Wen Gu (15):
> net/smc: improve SMC-D device dump for virtual ISM
> net/smc: decouple specialized struct from SMC-D DMB registration
> net/smc: introduce virtual ISM device loopback-ism
> net/smc: implement ID-related operations of loopback-ism
> net/smc: implement some unsupported operations of loopback-ism
> net/smc: implement DMB-related operations of loopback-ism
> net/smc: register loopback-ism into SMC-D device list
> net/smc: introduce loopback-ism runtime switch
> net/smc: introduce loopback-ism statistics attributes
> net/smc: add operations to merge sndbuf with peer DMB
> net/smc: attach or detach ghost sndbuf to peer DMB
> net/smc: adapt cursor update when sndbuf and peer DMB are merged
> net/smc: introduce loopback-ism DMB type control
> net/smc: introduce loopback-ism DMB data copy control
> net/smc: implement DMB-merged operations of loopback-ism
>
> drivers/s390/net/ism_drv.c | 2 +-
> include/net/smc.h | 7 +-
> net/smc/Kconfig | 13 +
> net/smc/Makefile | 2 +-
> net/smc/af_smc.c | 28 +-
> net/smc/smc_cdc.c | 58 ++-
> net/smc/smc_cdc.h | 1 +
> net/smc/smc_core.c | 61 +++-
> net/smc/smc_core.h | 1 +
> net/smc/smc_ism.c | 71 +++-
> net/smc/smc_ism.h | 5 +
> net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 88 +++++
> 13 files changed, 1026 insertions(+), 29 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h
>
Hi Wen,

Thank you again for your patience!

You can find our comments under the corresponding patches.
We still have some thoughts about the file hierarchy in sysfs and the
names, and we need a bit more time to investigate them.

Thanks,
Gerd & Wenjia

2024-02-16 14:15:00

by Wenjia Zhang

Subject: Re: [PATCH net-next 06/15] net/smc: implement DMB-related operations of loopback-ism



On 11.01.24 13:00, Wen Gu wrote:
> This implements DMB (un)registration and data move operations of
> loopback-ism device.
>
> Signed-off-by: Wen Gu <[email protected]>
> ---
> net/smc/smc_cdc.c | 6 ++
> net/smc/smc_cdc.h | 1 +
> net/smc/smc_loopback.c | 133 ++++++++++++++++++++++++++++++++++++++++-
> net/smc/smc_loopback.h | 13 ++++
> 4 files changed, 150 insertions(+), 3 deletions(-)
>
> diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
> index 3c06625ceb20..c820ef197610 100644
> --- a/net/smc/smc_cdc.c
> +++ b/net/smc/smc_cdc.c
> @@ -410,6 +410,12 @@ static void smc_cdc_msg_recv(struct smc_sock *smc, struct smc_cdc_msg *cdc)
> static void smcd_cdc_rx_tsklet(struct tasklet_struct *t)
> {
> struct smc_connection *conn = from_tasklet(conn, t, rx_tsklet);
> +
> + smcd_cdc_rx_handler(conn);
> +}
> +
> +void smcd_cdc_rx_handler(struct smc_connection *conn)
> +{
> struct smcd_cdc_msg *data_cdc;
> struct smcd_cdc_msg cdc;
> struct smc_sock *smc;
> diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
> index 696cc11f2303..11559d4ebf2b 100644
> --- a/net/smc/smc_cdc.h
> +++ b/net/smc/smc_cdc.h
> @@ -301,5 +301,6 @@ int smcr_cdc_msg_send_validation(struct smc_connection *conn,
> struct smc_wr_buf *wr_buf);
> int smc_cdc_init(void) __init;
> void smcd_cdc_rx_init(struct smc_connection *conn);
> +void smcd_cdc_rx_handler(struct smc_connection *conn);
>
> #endif /* SMC_CDC_H */
> diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
> index 353d4a2d69a1..f72e7b24fc1a 100644
> --- a/net/smc/smc_loopback.c
> +++ b/net/smc/smc_loopback.c
> @@ -15,11 +15,13 @@
> #include <linux/types.h>
> #include <net/smc.h>
>
> +#include "smc_cdc.h"
> #include "smc_ism.h"
> #include "smc_loopback.h"
>
> #if IS_ENABLED(CONFIG_SMC_LO)
> #define SMC_LO_V2_CAPABLE 0x1 /* loopback-ism acts as ISMv2 */
> +#define SMC_DMA_ADDR_INVALID (~(dma_addr_t)0)
>
> static const char smc_lo_dev_name[] = "loopback-ism";
> static struct smc_lo_dev *lo_dev;
> @@ -50,6 +52,97 @@ static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
> return 0;
> }
>
> +static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
> + void *client_priv)
> +{
> + struct smc_lo_dmb_node *dmb_node, *tmp_node;
> + struct smc_lo_dev *ldev = smcd->priv;
> + int sba_idx, order, rc;
> + struct page *pages;
> +
> + /* check space for new dmb */
> + for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LO_MAX_DMBS) {
> + if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
> + break;
> + }
> + if (sba_idx == SMC_LO_MAX_DMBS)
> + return -ENOSPC;
> +
> + dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
> + if (!dmb_node) {
> + rc = -ENOMEM;
> + goto err_bit;
> + }
> +
> + dmb_node->sba_idx = sba_idx;
> + order = get_order(dmb->dmb_len);
> + pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
> + __GFP_NOMEMALLOC | __GFP_COMP |
> + __GFP_NORETRY | __GFP_ZERO,
> + order);
> + if (!pages) {
> + rc = -ENOMEM;
> + goto err_node;
> + }
> + dmb_node->cpu_addr = (void *)page_address(pages);
> + dmb_node->len = dmb->dmb_len;
> + dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
> +
> +again:
> + /* add new dmb into hash table */
> + get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
> + write_lock(&ldev->dmb_ht_lock);
> + hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_node->token) {
> + if (tmp_node->token == dmb_node->token) {
> + write_unlock(&ldev->dmb_ht_lock);
> + goto again;
> + }
> + }
> + hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
> + write_unlock(&ldev->dmb_ht_lock);
> +
write_lock_irqsave()/write_unlock_irqrestore() and
read_lock_irqsave()/read_unlock_irqrestore() should be used instead of
write_lock()/write_unlock() and read_lock()/read_unlock() in order to
keep the lock irq-safe.

> + dmb->sba_idx = dmb_node->sba_idx;
> + dmb->dmb_tok = dmb_node->token;
> + dmb->cpu_addr = dmb_node->cpu_addr;
> + dmb->dma_addr = dmb_node->dma_addr;
> + dmb->dmb_len = dmb_node->len;
> +
> + return 0;
> +
> +err_node:
> + kfree(dmb_node);
> +err_bit:
> + clear_bit(sba_idx, ldev->sba_idx_mask);
> + return rc;
> +}
> +
> +static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
> +{
> + struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
> + struct smc_lo_dev *ldev = smcd->priv;
> +
> + /* remove dmb from hash table */
> + write_lock(&ldev->dmb_ht_lock);
> + hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
> + if (tmp_node->token == dmb->dmb_tok) {
> + dmb_node = tmp_node;
> + break;
> + }
> + }
> + if (!dmb_node) {
> + write_unlock(&ldev->dmb_ht_lock);
> + return -EINVAL;
> + }
> + hash_del(&dmb_node->list);
> + write_unlock(&ldev->dmb_ht_lock);
> +
> + clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
> + kfree(dmb_node->cpu_addr);
> + kfree(dmb_node);
> +
> + return 0;
> +}
> +
> static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
> {
> return -EOPNOTSUPP;
> @@ -76,6 +169,38 @@ static int smc_lo_signal_event(struct smcd_dev *dev, struct smcd_gid *rgid,
> return 0;
> }
>
> +static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
> + unsigned int idx, bool sf, unsigned int offset,
> + void *data, unsigned int size)
> +{
> + struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
> + struct smc_lo_dev *ldev = smcd->priv;
> +
> + read_lock(&ldev->dmb_ht_lock);
> + hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
> + if (tmp_node->token == dmb_tok) {
> + rmb_node = tmp_node;
> + break;
> + }
> + }
> + if (!rmb_node) {
> + read_unlock(&ldev->dmb_ht_lock);
> + return -EINVAL;
> + }
> + read_unlock(&ldev->dmb_ht_lock);
> +
> + memcpy((char *)rmb_node->cpu_addr + offset, data, size);
> +

Should this read_unlock() be placed after the memcpy()? Otherwise rmb_node
could be unregistered and freed while the copy is still in progress.

<...>

2024-02-16 14:25:59

by Wenjia Zhang

Subject: Re: [PATCH net-next 13/15] net/smc: introduce loopback-ism DMB type control



On 11.01.24 13:00, Wen Gu wrote:
> This provides a way to {get|set} type of DMB offered by loopback-ism,
> whether it is physically or virtually contiguous memory.
>
> echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_type # physically
> echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_type # virtually
>
> The settings take effect after re-activating loopback-ism by:
>
> echo 0 > /sys/devices/virtual/smc/loopback-ism/active
> echo 1 > /sys/devices/virtual/smc/loopback-ism/active
>
> After this, the link group and DMBs related to loopback-ism will be
> flushed and subsequent DMBs created will be of the desired type.
>
> The motivation of this control is that physically contiguous DMB has
> best performance but is usually expensive, while the virtually
> contiguous DMB is cheap and perform well in most scenarios, but if
> sndbuf and DMB are merged, virtual DMB will be accessed concurrently
> in Tx and Rx and there will be a bottleneck caused by lock contention
> of find_vmap_area when there are many CPUs and CONFIG_HARDENED_USERCOPY
> is set (see link below). So an option is provided.
>
I'm curious why you say that physically contiguous DMB has the best
performance. We actually saw slightly better performance with the
virtual one than with the physical one.

2024-02-16 14:26:48

by Wenjia Zhang

Subject: Re: [PATCH net-next 14/15] net/smc: introduce loopback-ism DMB data copy control



On 11.01.24 13:00, Wen Gu wrote:
> This provides a way to {get|set} whether loopback-ism device supports
> merging sndbuf with peer DMB to eliminate data copies between them.
>
> echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # support
> echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # not support
>
Besides the same confusion that Niklas already mentioned, the name of the
option does not make clear what it means. What about:
echo 1 > /sys/devices/virtual/smc/loopback-ism/nocopy_support # merge mode
echo 0 > /sys/devices/virtual/smc/loopback-ism/nocopy_support # copy mode

> The settings take effect after re-activating loopback-ism by:
>
> echo 0 > /sys/devices/virtual/smc/loopback-ism/active
> echo 1 > /sys/devices/virtual/smc/loopback-ism/active
>
> After this, the link group related to loopback-ism will be flushed and
> the sndbufs of subsequent connections will be merged or not merged with
> peer DMB.
>
> The motivation of this control is that the bandwidth will be highly
> improved when sndbuf and DMB are merged, but when virtually contiguous
> DMB is provided and merged with sndbuf, it will be concurrently accessed
> on Tx and Rx, then there will be a bottleneck caused by lock contention
> of find_vmap_area when there are many CPUs and CONFIG_HARDENED_USERCOPY
> is set (see link below). So an option is provided.
>
> Link: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Wen Gu <[email protected]>
> ---
We tried some simple workloads, and the performance of the no-copy case
was remarkable. Thus, we're wondering whether it is necessary to have the
tunable setting in this loopback case. Or rather, why do we need the
copy option? Is that because of the bottleneck caused by combining
no-copy with virtually contiguous DMB? Or at least
make no-copy the default.


2024-02-16 14:31:18

by Wenjia Zhang

Subject: Re: [PATCH net-next 03/15] net/smc: introduce virtual ISM device loopback-ism



On 11.01.24 13:00, Wen Gu wrote:
> This introduces a kind of virtual ISM device loopback-ism for SMCDv2.1.
> loopback-ism is implemented by software and serves inter-process or
> inter-container SMC communication in the same OS instance. It is created
> during SMC module loading and destroyed upon unloading. The support for
> loopback-ism can be configured via CONFIG_SMC_LO.
>
> Signed-off-by: Wen Gu <[email protected]>
> ---
> net/smc/Kconfig | 13 +++
> net/smc/Makefile | 2 +-
> net/smc/af_smc.c | 12 ++-
> net/smc/smc_loopback.c | 181 +++++++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 33 ++++++++
> 5 files changed, 239 insertions(+), 2 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h
>
> diff --git a/net/smc/Kconfig b/net/smc/Kconfig
> index 746be3996768..e191f78551f4 100644
> --- a/net/smc/Kconfig
> +++ b/net/smc/Kconfig
> @@ -20,3 +20,16 @@ config SMC_DIAG
> smcss.
>
> if unsure, say Y.
> +
> +config SMC_LO
> + bool "SMC_LO: virtual ISM loopback-ism for SMC"
> + depends on SMC
> + default n
> + help
> + SMC_LO provides a kind of virtual ISM device called loopback-ism
Don't forget to update "s/virtual/emulated/" later. ;-)

<...>

2024-02-16 14:32:32

by Wenjia Zhang

Subject: Re: [PATCH net-next 09/15] net/smc: introduce loopback-ism statistics attributes



On 11.01.24 13:00, Wen Gu wrote:
> This introduces some statistics attributes of loopback-ism. They can be
> read from /sys/devices/virtual/smc/loopback-ism/{xfer_tytes|dmbs_cnt}.
>
> Signed-off-by: Wen Gu <[email protected]>
> ---
> net/smc/smc_loopback.c | 74 ++++++++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 22 +++++++++++++
> 2 files changed, 96 insertions(+)
>

I've read the comments from Jiri and your answer, and I can understand your
reasoning. However, from the perspective of the end user, it makes more
sense to integrate the stats info into 'smcd stats'. Otherwise, users
would be confused about which tool to use for which
statistics. Sure, some improvement of smc-tools is also needed.

2024-02-19 14:05:44

by Wen Gu

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 2024/2/6 20:18, Alexandra Winter wrote:
>
>
> On 11.01.24 13:00, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>>
>> # Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>> on the non-s390 architecture through a software-implemented virtual ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
>> inter-containers communication within the same OS instance.
>
>
> Hello Wen Gu,
>
> thank you very much for this patchset. I have been looking at it a bit.
> I installed it on a test server, but did not yet exercise the loopback-ism device.
> I want to continue investigations, but daily work interferes, so I thought I
> would send you some comments now. So this is not at all a code review, but some
> thoughts and observations about the general concept.
>

Thank you very much, Sandy.

>
> In [1] there was a discussion about an abstraction layer between smc-d and the
> ism devices.
> I am not sure what you are proposing now, is it an smc-d feature or independent of smc?
> In 3/15 you say it is part of the SMC module, but then it has its own device entry.
> Didn't you want to use it for other things as well? Or is it an SMC-D only feature?
> Is it a device (Config help: "kind of virtual device")? Or an SMC-D feature?
>

This patchset aims to propose an SMC feature, which is SMC-D loopback. The main work
to achieve this feature is to implement an Emulated-ISM, which is loopback-ism. The
loopback-ism is a 'built-in' dummy device of SMC and only serves SMC.

SMC-D protocol + 'built-in dummy device' (loopback-ism device) = SMC-D loopback feature.

To provide the runtime switch and statistics of loopback-ism, I need a sysfs
entry for it. Since it doesn't belong to any class (e.g. pci_bus), I created an 'smc'
entry under /sys/devices/virtual/ and put loopback-ism under it.

The other SMC devices, such as RoCE, s390 ISM, virtio-ism will be in their own sysfs
entry, not under the /sys/devices/*virtual*/smc/.

The Config help is somewhat inaccurate. To be more precise, the SMC_LO config is used to
configure whether to enable this built-in dummy device for intra-OS communication.

> Will we have a class of ism devices (s390 ism, ism-loopback, virtio-ism)
> That share common properties (internal API?)
> and smc-d will work with any of those?
> But they all can exist without smc ?! BTW: This is what we want for s390-ism.
> The client-registration interface [2] is currently the way to achieve this.
> But maybe we need a more general concept?
>

I didn't mean to create a class to cover all the ISM devices. It is only for
loopback-ism. Because loopback-ism cannot be classified elsewhere, I created an entry
under /sys/devices/virtual/.

> Maybe first a preparation patchset that introduces a class/ism
> Together with an improved API?
> In case you want to use ISM devices for other purposes as well..
> But then the whole picture of ism-loopback in one patchset (RFC?)
> has its benefits as well.
>

Sorry for causing confusion; I didn't mean to create a class to cover all the ISM devices.
They should be in their own sysfs entries (e.g. pci_bus), since they will be used
outside of SMC. Only loopback-ism belongs solely to SMC.

>
> Other points that I noticed:
>
> Naming: smc loopback, ism-loopback, loopback-ism ?
>
> config: why not tristate? Why under net/smc?
>

'SMC-D loopback' or 'SMC loopback' is used to indicate the feature or capability.
'loopback-ism' is the emulated-ISM device that 'SMC/SMC-D loopback' uses.
('ism-loopback' doesn't seem to appear in my patchset)
If we all agree on these, I will check all the terms in the patches and unify them.

SMC_LO (CONFIG_SMC_LO) is used to configure whether SMC is allowed to use loopback-ism.
It acts as a check in the code, so I defined it as a bool.
And loopback-ism only serves SMC-D loopback, as a feature of SMC, so the implementation
(net/smc/smc_loopback.{c|h}) is under net/smc.

> /sys/devices/virtual/smc does not initially show up in my installation!!!
> root@t35lp50:/sys/devices/virtual/> ls
> 3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ tty/ vc/ vtconsole/ workqueue/
> root@t35lp50:/sys/devices/virtual/> ls smc/loopback-ism
> active dmb_copy dmbs_cnt dmb_type subsystem@ uevent xfer_bytes
> root@t35lp50:/sys/devices/virtual/> ls
> 3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ smc/ tty/ vc/ vtconsole/ workqueue/
> Is that normal behaviour?
>

/sys/devices/virtual/smc is created after SMC module initialization.
During the SMC module initialization, smc_loopback_init() is called, and the
/sys/devices/virtual/smc entry is created.
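
To make the "entry only exists after module init" behaviour concrete, here is a minimal userspace sketch of how a tool could probe the attribute. The helper names (read_attr, loopback_ism_active) are hypothetical, and the assumption that reading 'active' yields "1" or "0" is inferred from the echo commands in this thread, not confirmed by the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Path taken from the thread; it only exists once the SMC module has been
 * initialized with CONFIG_SMC_LO enabled. */
#define LO_ACTIVE_PATH "/sys/devices/virtual/smc/loopback-ism/active"

/*
 * Hypothetical helper: read the first line of a sysfs-style attribute file
 * into buf (NUL-terminated, trailing newline stripped).
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
int read_attr(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (len == 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;              /* e.g. SMC module not loaded yet */
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/*
 * Returns 1 if the attribute reads "1", 0 for any other value,
 * and -1 if the entry is absent (assumed value format, see above).
 */
int loopback_ism_active(const char *path)
{
    char val[16];

    if (read_attr(path, val, sizeof(val)) < 0)
        return -1;
    return strcmp(val, "1") == 0;
}
```

Calling loopback_ism_active(LO_ACTIVE_PATH) before the SMC module is loaded would return -1 in this sketch, which matches the first 'ls' output quoted above.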

> You introduced a class/smc
> Maybe class/ism would be better?
> The other smc interfaces do not show up in class/smc!! Not so good
>

Sorry for causing confusion; I didn't mean to create a class to cover all the ISM devices.
They should be in their own sysfs entries (e.g. pci_bus), since they can be used
outside of SMC. But loopback-ism is an SMC 'built-in' dummy device; it belongs only
to SMC and can't be classified under other entries.


> Why doesn't it show in smc_rnics?
> (Maybe some deficiency of smc_rnics?)
>
smc_rnics can't be used on architectures other than s390.

# ./smc_rnics -a
Error: s390/s390x supported only


> But then it shows in other smc-tools:
> root@t35lp50:/sys/> smcd device
> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
> 0000 0     loopback-ism  ffff   No        0
> 0029 ISM   0000:00:00.0  07c1   No        0  NET1
> Nice!
>

Yes, this is done in patch 01/15.

Best regards,
Wen Gu

> Kind regards
> Sandy
>
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] 89e7d2ba61b7 ("net/ism: Add new API for client registration")

2024-02-20 01:20:56

by Wen Gu

Subject: Re: [PATCH net-next 03/15] net/smc: introduce virtual ISM device loopback-ism



On 2024/2/16 22:11, Wenjia Zhang wrote:
>
>
> On 11.01.24 13:00, Wen Gu wrote:
>> This introduces a kind of virtual ISM device loopback-ism for SMCDv2.1.
>> loopback-ism is implemented by software and serves inter-process or
>> inter-container SMC communication in the same OS instance. It is created
>> during SMC module loading and destroyed upon unloading. The support for
>> loopback-ism can be configured via CONFIG_SMC_LO.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>>   net/smc/Kconfig        |  13 +++
>>   net/smc/Makefile       |   2 +-
>>   net/smc/af_smc.c       |  12 ++-
>>   net/smc/smc_loopback.c | 181 +++++++++++++++++++++++++++++++++++++++++
>>   net/smc/smc_loopback.h |  33 ++++++++
>>   5 files changed, 239 insertions(+), 2 deletions(-)
>>   create mode 100644 net/smc/smc_loopback.c
>>   create mode 100644 net/smc/smc_loopback.h
>>
>> diff --git a/net/smc/Kconfig b/net/smc/Kconfig
>> index 746be3996768..e191f78551f4 100644
>> --- a/net/smc/Kconfig
>> +++ b/net/smc/Kconfig
>> @@ -20,3 +20,16 @@ config SMC_DIAG
>>         smcss.
>>         if unsure, say Y.
>> +
>> +config SMC_LO
>> +    bool "SMC_LO: virtual ISM loopback-ism for SMC"
>> +    depends on SMC
>> +    default n
>> +    help
>> +      SMC_LO provides a kind of virtual ISM device called loopback-ism
> Don't forget to update "s/virtual/emulated/" later. ;-)
>
> <...>

Yes, the new version will change all "virtual ISM" to "Emulated-ISM". Thank you.

2024-02-20 01:55:44

by Wen Gu

Subject: Re: [PATCH net-next 06/15] net/smc: implement DMB-related operations of loopback-ism



On 2024/2/16 22:13, Wenjia Zhang wrote:
>
>
> On 11.01.24 13:00, Wen Gu wrote:
>> This implements DMB (un)registration and data move operations of
>> loopback-ism device.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>>   net/smc/smc_cdc.c      |   6 ++
>>   net/smc/smc_cdc.h      |   1 +
>>   net/smc/smc_loopback.c | 133 ++++++++++++++++++++++++++++++++++++++++-
>>   net/smc/smc_loopback.h |  13 ++++
>>   4 files changed, 150 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
>> index 3c06625ceb20..c820ef197610 100644
>> --- a/net/smc/smc_cdc.c
>> +++ b/net/smc/smc_cdc.c
>> @@ -410,6 +410,12 @@ static void smc_cdc_msg_recv(struct smc_sock *smc, struct smc_cdc_msg *cdc)
>>   static void smcd_cdc_rx_tsklet(struct tasklet_struct *t)
>>   {
>>       struct smc_connection *conn = from_tasklet(conn, t, rx_tsklet);
>> +
>> +    smcd_cdc_rx_handler(conn);
>> +}
>> +
>> +void smcd_cdc_rx_handler(struct smc_connection *conn)
>> +{
>>       struct smcd_cdc_msg *data_cdc;
>>       struct smcd_cdc_msg cdc;
>>       struct smc_sock *smc;
>> diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
>> index 696cc11f2303..11559d4ebf2b 100644
>> --- a/net/smc/smc_cdc.h
>> +++ b/net/smc/smc_cdc.h
>> @@ -301,5 +301,6 @@ int smcr_cdc_msg_send_validation(struct smc_connection *conn,
>>                    struct smc_wr_buf *wr_buf);
>>   int smc_cdc_init(void) __init;
>>   void smcd_cdc_rx_init(struct smc_connection *conn);
>> +void smcd_cdc_rx_handler(struct smc_connection *conn);
>>   #endif /* SMC_CDC_H */
>> diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
>> index 353d4a2d69a1..f72e7b24fc1a 100644
>> --- a/net/smc/smc_loopback.c
>> +++ b/net/smc/smc_loopback.c
>> @@ -15,11 +15,13 @@
>>   #include <linux/types.h>
>>   #include <net/smc.h>
>> +#include "smc_cdc.h"
>>   #include "smc_ism.h"
>>   #include "smc_loopback.h"
>>   #if IS_ENABLED(CONFIG_SMC_LO)
>>   #define SMC_LO_V2_CAPABLE    0x1 /* loopback-ism acts as ISMv2 */
>> +#define SMC_DMA_ADDR_INVALID    (~(dma_addr_t)0)
>>   static const char smc_lo_dev_name[] = "loopback-ism";
>>   static struct smc_lo_dev *lo_dev;
>> @@ -50,6 +52,97 @@ static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
>>       return 0;
>>   }
>> +static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
>> +                   void *client_priv)
>> +{
>> +    struct smc_lo_dmb_node *dmb_node, *tmp_node;
>> +    struct smc_lo_dev *ldev = smcd->priv;
>> +    int sba_idx, order, rc;
>> +    struct page *pages;
>> +
>> +    /* check space for new dmb */
>> +    for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LO_MAX_DMBS) {
>> +        if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
>> +            break;
>> +    }
>> +    if (sba_idx == SMC_LO_MAX_DMBS)
>> +        return -ENOSPC;
>> +
>> +    dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
>> +    if (!dmb_node) {
>> +        rc = -ENOMEM;
>> +        goto err_bit;
>> +    }
>> +
>> +    dmb_node->sba_idx = sba_idx;
>> +    order = get_order(dmb->dmb_len);
>> +    pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
>> +                __GFP_NOMEMALLOC | __GFP_COMP |
>> +                __GFP_NORETRY | __GFP_ZERO,
>> +                order);
>> +    if (!pages) {
>> +        rc = -ENOMEM;
>> +        goto err_node;
>> +    }
>> +    dmb_node->cpu_addr = (void *)page_address(pages);
>> +    dmb_node->len = dmb->dmb_len;
>> +    dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
>> +
>> +again:
>> +    /* add new dmb into hash table */
>> +    get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
>> +    write_lock(&ldev->dmb_ht_lock);
>> +    hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_node->token) {
>> +        if (tmp_node->token == dmb_node->token) {
>> +            write_unlock(&ldev->dmb_ht_lock);
>> +            goto again;
>> +        }
>> +    }
>> +    hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
>> +    write_unlock(&ldev->dmb_ht_lock);
>> +
> The write_lock_irqsave()/write_unlock_irqrestore() and read_lock_irqsave()/read_unlock_irqrestore()should be used
> instead of write_lock()/write_unlock() and read_lock()/read_unlock() in order to keep the lock irq-safe.
>

dmb_ht_lock won't be held in an interrupt or softirq context. The dmb_{register|unregister},
dmb_{attach|detach} and data_move operations all run in process context. So I think write_(un)lock
and read_(un)lock are safe here.

>> +    dmb->sba_idx = dmb_node->sba_idx;
>> +    dmb->dmb_tok = dmb_node->token;
>> +    dmb->cpu_addr = dmb_node->cpu_addr;
>> +    dmb->dma_addr = dmb_node->dma_addr;
>> +    dmb->dmb_len = dmb_node->len;
>> +
>> +    return 0;
>> +
>> +err_node:
>> +    kfree(dmb_node);
>> +err_bit:
>> +    clear_bit(sba_idx, ldev->sba_idx_mask);
>> +    return rc;
>> +}
>> +
>> +static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
>> +{
>> +    struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
>> +    struct smc_lo_dev *ldev = smcd->priv;
>> +
>> +    /* remove dmb from hash table */
>> +    write_lock(&ldev->dmb_ht_lock);
>> +    hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
>> +        if (tmp_node->token == dmb->dmb_tok) {
>> +            dmb_node = tmp_node;
>> +            break;
>> +        }
>> +    }
>> +    if (!dmb_node) {
>> +        write_unlock(&ldev->dmb_ht_lock);
>> +        return -EINVAL;
>> +    }
>> +    hash_del(&dmb_node->list);
>> +    write_unlock(&ldev->dmb_ht_lock);
>> +
>> +    clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
>> +    kfree(dmb_node->cpu_addr);
>> +    kfree(dmb_node);
>> +
>> +    return 0;
>> +}
>> +
>>   static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
>>   {
>>       return -EOPNOTSUPP;
>> @@ -76,6 +169,38 @@ static int smc_lo_signal_event(struct smcd_dev *dev, struct smcd_gid *rgid,
>>       return 0;
>>   }
>> +static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
>> +                unsigned int idx, bool sf, unsigned int offset,
>> +                void *data, unsigned int size)
>> +{
>> +    struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
>> +    struct smc_lo_dev *ldev = smcd->priv;
>> +
>> +    read_lock(&ldev->dmb_ht_lock);
>> +    hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
>> +        if (tmp_node->token == dmb_tok) {
>> +            rmb_node = tmp_node;
>> +            break;
>> +        }
>> +    }
>> +    if (!rmb_node) {
>> +        read_unlock(&ldev->dmb_ht_lock);
>> +        return -EINVAL;
>> +    }
>> +    read_unlock(&ldev->dmb_ht_lock);
>> +
>> +    memcpy((char *)rmb_node->cpu_addr + offset, data, size);
>> +
>
> Should this read_unlock() be placed after the memcpy()?
>

dmb_ht_lock is used to ensure safe access to the DMB hash table of loopback-ism.
The DMB hash table can be accessed by all the connections on loopback-ism, so
it must be protected.

But a given DMB is only used by one connection, and the move_data process is
protected by conn->send_lock (see smcd_tx_sndbuf_nonempty()), so the memcpy(rmb_node)
here is safe and does not race with anything else.

Thanks!

> <...>

2024-02-20 02:45:52

by Wen Gu

Subject: Re: [PATCH net-next 09/15] net/smc: introduce loopback-ism statistics attributes



On 2024/2/16 22:24, Wenjia Zhang wrote:
>
>
> On 11.01.24 13:00, Wen Gu wrote:
>> This introduces some statistics attributes of loopback-ism. They can be
>> read from /sys/devices/virtual/smc/loopback-ism/{xfer_tytes|dmbs_cnt}.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>>   net/smc/smc_loopback.c | 74 ++++++++++++++++++++++++++++++++++++++++++
>>   net/smc/smc_loopback.h | 22 +++++++++++++
>>   2 files changed, 96 insertions(+)
>>
>
> I've read the comments from Jiri and your answer, and I can understand your reasoning. However, from the perspective of the
> end user, it makes more sense to integrate the stats info into 'smcd stats'. Otherwise, users would be confused
> about which tool to use for which statistics. Sure, some improvement of smc-tools is also needed.

Thank you Wenjia.

Let's draw an analogy with RDMA devices, which are used in SMC-R. If we want
to check the RNIC status or statistics, we may use the rdma statistic command, or
the ibv_devinfo command, or check the files under /sys/class/infiniband/mlx5_0. These
provide details or attributes related to *devices*.

Since s390 ISM can be used outside of SMC, I guess it also has its own way (other
than smc-tools) to check the statistics?

What we can see in the smcr stats or smcd stats command is the statistics or
status of the SMC *protocol* layer, such as DMB status, Tx/Rx, connections, fallbacks.

If we put the underlying devices' statistics into smc-tools, should we also
put RNIC statistics or s390 ISM statistics into smcr stats or smcd stats? And
for each future device that can be used by SMC-R/SMC-D, should we add it
to smcr stats and smcd stats? And since the attributes of each device may be different,
should we add entries in smcd stats for each of them?

After considering the above, I believe that the details of the underlying
device should not be exposed to smc (smc-tools). What do you think?

Thanks!

2024-02-20 03:20:12

by Wen Gu

Subject: Re: [PATCH net-next 13/15] net/smc: introduce loopback-ism DMB type control



On 2024/2/16 22:25, Wenjia Zhang wrote:
>
>
> On 11.01.24 13:00, Wen Gu wrote:
>> This provides a way to {get|set} type of DMB offered by loopback-ism,
>> whether it is physically or virtually contiguous memory.
>>
>> echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_type # physically
>> echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_type # virtually
>>
>> The settings take effect after re-activating loopback-ism by:
>>
>> echo 0 > /sys/devices/virtual/smc/loopback-ism/active
>> echo 1 > /sys/devices/virtual/smc/loopback-ism/active
>>
>> After this, the link group and DMBs related to loopback-ism will be
>> flushed and subsequent DMBs created will be of the desired type.
>>
>> The motivation of this control is that physically contiguous DMB has
>> best performance but is usually expensive, while the virtually
>> contiguous DMB is cheap and perform well in most scenarios, but if
>> sndbuf and DMB are merged, virtual DMB will be accessed concurrently
>> in Tx and Rx and there will be a bottleneck caused by lock contention
>> of find_vmap_area when there are many CPUs and CONFIG_HARDENED_USERCOPY
>> is set (see link below). So an option is provided.
>>
> I'm curious why you say that physically contiguous DMB has the best performance. We actually saw slightly better
> performance with the virtual one than with the physical one.

Hi Wenjia, you can find examples from here:

https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/

Excerpted from above:
"
In 48 CPUs qemu environment, the Requests/s increased by 5 times:
- nginx
- wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80

                vzalloced shmem     vzalloced shmem (with this patch set)
Requests/sec    113536.56           583729.93


But it also has some overhead, compared to using kzalloced shared memory
or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area:

                kzalloced shmem     vzalloced shmem (unset CONFIG_HARDENED_USERCOPY)
Requests/sec    831950.39           805164.78
"

Without CONFIG_HARDENED_USERCOPY, the performance of physical-DMB and
virtual-DMB is basically the same (physical-DMB is a bit better). With
CONFIG_HARDENED_USERCOPY in a many-CPU environment, such as the 48 CPUs
here, if we merge sndbuf and DMB, the find_vmap_area lock contention is
heavy and the performance drops noticeably. So I said physical-DMB has
the best performance, since it can guarantee good performance under known
environments.


By the way, we discussed the memory cost before (see [1]), but I found
that when we use s390 ISM (or not merge sndbuf and DMB), the sndbuf also
costs physically contiguous memory.

static struct smc_buf_desc *smcd_new_buf_create(struct smc_link_group *lgr,
                                                bool is_dmb, int bufsize)
{
    <...>
    if (is_dmb) {
        <...>
    } else {
        buf_desc->cpu_addr = kzalloc(bufsize, GFP_KERNEL |
                                     __GFP_NOWARN | __GFP_NORETRY |
                                     __GFP_NOMEMALLOC);
        if (!buf_desc->cpu_addr) {
            kfree(buf_desc);
            return ERR_PTR(-EAGAIN);
        }
        buf_desc->len = bufsize;
    }
    <...>
}

So I wonder whether it is really necessary to use virtual-DMB in loopback-ism. Maybe
we can always use physical-DMB in loopback-ism, then there is no need for the
dmb_type or dmb_copy knobs.

[1] https://lore.kernel.org/netdev/[email protected]/


Thanks!

2024-02-20 03:36:50

by Wen Gu

Subject: Re: [PATCH net-next 14/15] net/smc: introduce loopback-ism DMB data copy control



On 2024/2/16 22:25, Wenjia Zhang wrote:
>
>
> On 11.01.24 13:00, Wen Gu wrote:
>> This provides a way to {get|set} whether loopback-ism device supports
>> merging sndbuf with peer DMB to eliminate data copies between them.
>>
>> echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # support
>> echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # not support
>>
> Besides the same confusion that Niklas already mentioned, the name of the option does not make clear what it means.
> What about:
> echo 1 > /sys/devices/virtual/smc/loopback-ism/nocopy_support # merge mode
> echo 0 > /sys/devices/virtual/smc/loopback-ism/nocopy_support # copy mode
>

OK, if we decide to keep the knobs, I will improve the name. Thanks!

>> The settings take effect after re-activating loopback-ism by:
>>
>> echo 0 > /sys/devices/virtual/smc/loopback-ism/active
>> echo 1 > /sys/devices/virtual/smc/loopback-ism/active
>>
>> After this, the link group related to loopback-ism will be flushed and
>> the sndbufs of subsequent connections will be merged or not merged with
>> peer DMB.
>>
>> The motivation of this control is that the bandwidth will be highly
>> improved when sndbuf and DMB are merged, but when virtually contiguous
>> DMB is provided and merged with sndbuf, it will be concurrently accessed
>> on Tx and Rx, then there will be a bottleneck caused by lock contention
>> of find_vmap_area when there are many CPUs and CONFIG_HARDENED_USERCOPY
>> is set (see link below). So an option is provided.
>>
>> Link: https://lore.kernel.org/all/[email protected]/
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
> We tried some simple workloads, and the performance of the no-copy case was remarkable. Thus, we're wondering whether it is
> necessary to have the tunable setting in this loopback case. Or rather, why do we need the copy option? Is that because
> of the bottleneck caused by combining no-copy with virtually contiguous DMB? Or at least make no-copy
> the default.

Yes, it is because of the bottleneck caused by combining no-copy
with virtual-DMB. If we have to use virtual-DMB and CONFIG_HARDENED_USERCOPY is
set, then we may be forced to use copy mode in many-CPU environments to get
good latency (the bandwidth still drops because of copy mode).
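
For readers following along, the difference between the two modes can be modelled in a few lines of userspace C. This is a simplified sketch under stated assumptions: struct buf_desc, buf_alloc, sndbuf_attach and tx are hypothetical names for illustration, and the model ignores cursors, wrap-around and notification entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a buffer descriptor (not the kernel struct). */
struct buf_desc {
    char *cpu_addr;
    size_t len;
    int owns_mem;   /* 0 when the descriptor only aliases a peer DMB */
};

/* Copy mode: the buffer gets its own zeroed memory. */
int buf_alloc(struct buf_desc *b, size_t len)
{
    b->cpu_addr = calloc(1, len);
    if (!b->cpu_addr)
        return -1;
    b->len = len;
    b->owns_mem = 1;
    return 0;
}

/* Merge (no-copy) mode: sndbuf is only a descriptor for the peer DMB. */
void sndbuf_attach(struct buf_desc *snd, const struct buf_desc *peer_dmb)
{
    snd->cpu_addr = peer_dmb->cpu_addr;
    snd->len = peer_dmb->len;
    snd->owns_mem = 0;
}

/*
 * Tx path: write into sndbuf, then move the data into the peer DMB unless
 * both descriptors point at the same memory.
 * Returns the number of bytes copied between the two buffers,
 * or (size_t)-1 if the data does not fit.
 */
size_t tx(struct buf_desc *snd, struct buf_desc *peer_dmb,
          const void *data, size_t size)
{
    if (size > snd->len || size > peer_dmb->len)
        return (size_t)-1;

    memcpy(snd->cpu_addr, data, size);          /* user data -> sndbuf */
    if (snd->cpu_addr == peer_dmb->cpu_addr)
        return 0;                               /* merged: already in the DMB */

    memcpy(peer_dmb->cpu_addr, snd->cpu_addr, size);    /* copy mode */
    return size;
}
```

In the merged case the sender and the receiver touch the same memory concurrently, which is where the find_vmap_area contention described above comes from when that memory is virtually contiguous and CONFIG_HARDENED_USERCOPY is set.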

But if we agree that physical-DMB is acceptable (it costs 1 physical buffer per conn side
in loopback-ism no-copy mode, the same as what sndbuf costs when using s390 ISM), then
there is no such performance issue and the two knobs can be removed (see also the reply
to patch 13/15 [1]).

[1] https://lore.kernel.org/netdev/[email protected]/

Thanks!

2024-02-20 03:52:42

by Wen Gu

Subject: Re: [PATCH net-next 00/15] net/smc: implement loopback-ism used by SMC-D



On 2024/2/16 22:09, Wenjia Zhang wrote:
>
>
> On 11.01.24 13:00, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>>
>> # Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>> on the non-s390 architecture through a software-implemented virtual ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
>> inter-containers communication within the same OS instance.
>>
>> # Design
>>
>> This patch set includes 3 parts:
>>
>>   - Patch #1-#2: some prepare work for loopback-ism.
>>   - Patch #3-#9: implement loopback-ism device.
>>   - Patch #10-#15: memory copy optimization for loopback scenario.
>>
>> The loopback-ism device is designed as a ISMv2 device and not be limited to
>> a specific net namespace, ends of both inter-process connection (1/1' in diagram
>> below) or inter-container connection (2/2' in diagram below) can find the same
>> available loopback-ism and choose it during the CLC handshake.
>>
>>   Container 1 (ns1)                              Container 2 (ns2)
>>   +-----------------------------------------+    +-------------------------+
>>   | +-------+      +-------+      +-------+ |    |        +-------+        |
>>   | | App A |      | App B |      | App C | |    |        | App D |<-+     |
>>   | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
>>   |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
>>   |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
>>   |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
>>   +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>>                |   |           |                                  |
>>   Kernel       |   |           |                                  |
>>   +----+-------v---+-----------v----------------------------------+---+----+
>>   |    |                            TCP                               |    |
>>   |    |                                                              |    |
>>   |    +--------------------------------------------------------------+    |
>>   |                                                                        |
>>   |                           +--------------+                             |
>>   |                           | smc loopback |                             |
>>   +---------------------------+--------------+-----------------------------+
>>
>> loopback-ism device creates DMBs (shared memory) for each connection peer.
>> Since data transfer occurs within the same kernel, the sndbuf of each peer
>> is only a descriptor and point to the same memory region as peer DMB, so that
>> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>>
>>   Container 1 (ns1)                              Container 2 (ns2)
>>   +-----------------------------------------+    +-------------------------+
>>   | +-------+                               |    |        +-------+        |
>>   | | App C |-----+                         |    |        | App D |        |
>>   | +-------+     |                         |    |        +-^-----+        |
>>   |               |                         |    |          |              |
>>   |           (2) |                         |    |     (2') |              |
>>   |               |                         |    |          |              |
>>   +---------------|-------------------------+    +----------|--------------+
>>                   |                                         |
>>   Kernel          |                                         |
>>   +---------------|-----------------------------------------|--------------+
>>   | +--------+ +--v-----+                           +--------+ +--------+  |
>>   | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
>>   | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
>>   | +-----|--+    |                                 +-----|--+             |
>>   | | DMB C  |    +---------------------------------| DMB D  |             |
>>   | +--------+                                      +--------+             |
>>   |                                                                        |
>>   |                           +--------------+                             |
>>   |                           | smc loopback |                             |
>>   +---------------------------+--------------+-----------------------------+
>>
>> # Benchmark Test
>>
>>   * Test environments:
>>        - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>        - SMC sndbuf/DMB size 1MB.
>>        - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>>          which means sndbuf and DMB are merged and no data copied between them.
>>        - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>>          which means DMB is physically contiguous buffer.
>>
>>   * Test object:
>>        - TCP: run on TCP loopback.
>>        - SMC lo: run on SMC loopback device.
>>
>> 1. ipc-benchmark (see [3])
>>
>>   - ./<foo> -c 1000000 -s 100
>>
>>                              TCP                  SMC-lo
>> Message
>> rate (msg/s)              80636                  149515(+85.42%)
>>
>> 2. sockperf
>>
>>   - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>   - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1
>> -t 30
>>
>>                              TCP                  SMC-lo
>> Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
>> Latency(us)               6.098                   3.383(-44.52%)
>>
>> 3. nginx/wrk
>>
>>   - serv: <smc_run> nginx
>>   - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>>
>>                             TCP                   SMC-lo
>> Requests/s           181685.74                246447.77(+35.65%)
>>
>> 4. redis-benchmark
>>
>>   - serv: <smc_run> redis-server
>>   - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>>
>>                             TCP                   SMC-lo
>> GET(Requests/s)       85855.34                118553.64(+38.09%)
>> SET(Requests/s)       86824.40                125944.58(+45.06%)
>>
>>
>> Change log:
>>
>> v1->RFC:
>> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>>    /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>>    merging sndbuf with peer DMB.
>> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>>    control of whether to merge sndbuf and DMB. They can be respectively set by:
>>    /sys/devices/virtual/smc/loopback-ism/dmb_type
>>    /sys/devices/virtual/smc/loopback-ism/dmb_copy
>>    The motivation for these two controls is that a performance bottleneck was
>>    found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>>    many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>>    by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>>    or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>>    vmap lock contention [6]. It has significant effects, but using virtual memory
>>    still has additional overhead compared to using physical memory.
>>    So this new version provides controls of dmb_type and dmb_copy to suit
>>    different scenarios.
>> - Some minor changes and comments improvements.
>>
>> RFC->old version([1]):
>> Link: https://lore.kernel.org/netdev/[email protected]/
>> - Patch #1: improve the loopback-ism dump, it shows as follows now:
>>    # smcd d
>>    FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>>    0000 0     loopback-ism  ffff   No        0
>> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>>    smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>>    regardless of whether there is already a device in smcd device list.
>> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>>    to activate or deactivate the loopback-ism.
>> - Patch #9: introduce the statistics of loopback-ism by
>>    /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_bytes|dmbs_cnt}.
>> - Some minor changes and comments improvements.
>>
>> [1] https://lore.kernel.org/netdev/[email protected]/
>> [2] https://lore.kernel.org/netdev/[email protected]/
>> [3] https://github.com/goldsborough/ipc-bench
>> [4] https://lore.kernel.org/all/[email protected]/
>> [5] https://lore.kernel.org/all/[email protected]/
>> [6] https://lore.kernel.org/all/[email protected]/
>>
>> Wen Gu (15):
>>    net/smc: improve SMC-D device dump for virtual ISM
>>    net/smc: decouple specialized struct from SMC-D DMB registration
>>    net/smc: introduce virtual ISM device loopback-ism
>>    net/smc: implement ID-related operations of loopback-ism
>>    net/smc: implement some unsupported operations of loopback-ism
>>    net/smc: implement DMB-related operations of loopback-ism
>>    net/smc: register loopback-ism into SMC-D device list
>>    net/smc: introduce loopback-ism runtime switch
>>    net/smc: introduce loopback-ism statistics attributes
>>    net/smc: add operations to merge sndbuf with peer DMB
>>    net/smc: attach or detach ghost sndbuf to peer DMB
>>    net/smc: adapt cursor update when sndbuf and peer DMB are merged
>>    net/smc: introduce loopback-ism DMB type control
>>    net/smc: introduce loopback-ism DMB data copy control
>>    net/smc: implement DMB-merged operations of loopback-ism
>>
>>   drivers/s390/net/ism_drv.c |   2 +-
>>   include/net/smc.h          |   7 +-
>>   net/smc/Kconfig            |  13 +
>>   net/smc/Makefile           |   2 +-
>>   net/smc/af_smc.c           |  28 +-
>>   net/smc/smc_cdc.c          |  58 ++-
>>   net/smc/smc_cdc.h          |   1 +
>>   net/smc/smc_core.c         |  61 +++-
>>   net/smc/smc_core.h         |   1 +
>>   net/smc/smc_ism.c          |  71 +++-
>>   net/smc/smc_ism.h          |   5 +
>>   net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
>>   net/smc/smc_loopback.h     |  88 +++++
>>   13 files changed, 1026 insertions(+), 29 deletions(-)
>>   create mode 100644 net/smc/smc_loopback.c
>>   create mode 100644 net/smc/smc_loopback.h
>>
> Hi Wen,
>
> Thank you for the patience again!
>
> You can find the comments under the corresponding patches respectively.
> About the file hierarchy in sysfs and the names, we still have some thoughts. We need a bit more time to investigate it.
>

Hi Wenjia and Gerd,

Thank you very much!

I answered each comment you left. You can find my thoughts about sysfs and
knobs there. Looking forward to your further reply. Thanks!

Best regards,
Wen Gu

> Thanks,
> Gerd & Wenjia

2024-02-23 14:13:29

by Wenjia Zhang

Subject: Re: [PATCH net-next 06/15] net/smc: implement DMB-related operations of loopback-ism



On 20.02.24 02:55, Wen Gu wrote:
>
>
> On 2024/2/16 22:13, Wenjia Zhang wrote:
>>
>>
>> On 11.01.24 13:00, Wen Gu wrote:
>>> This implements DMB (un)registration and data move operations of
>>> loopback-ism device.
>>>
>>> Signed-off-by: Wen Gu <[email protected]>
>>> ---
>>>   net/smc/smc_cdc.c      |   6 ++
>>>   net/smc/smc_cdc.h      |   1 +
>>>   net/smc/smc_loopback.c | 133 ++++++++++++++++++++++++++++++++++++++++-
>>>   net/smc/smc_loopback.h |  13 ++++
>>>   4 files changed, 150 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
>>> index 3c06625ceb20..c820ef197610 100644
>>> --- a/net/smc/smc_cdc.c
>>> +++ b/net/smc/smc_cdc.c
>>> @@ -410,6 +410,12 @@ static void smc_cdc_msg_recv(struct smc_sock
>>> *smc, struct smc_cdc_msg *cdc)
>>>   static void smcd_cdc_rx_tsklet(struct tasklet_struct *t)
>>>   {
>>>       struct smc_connection *conn = from_tasklet(conn, t, rx_tsklet);
>>> +
>>> +    smcd_cdc_rx_handler(conn);
>>> +}
>>> +
>>> +void smcd_cdc_rx_handler(struct smc_connection *conn)
>>> +{
>>>       struct smcd_cdc_msg *data_cdc;
>>>       struct smcd_cdc_msg cdc;
>>>       struct smc_sock *smc;
>>> diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
>>> index 696cc11f2303..11559d4ebf2b 100644
>>> --- a/net/smc/smc_cdc.h
>>> +++ b/net/smc/smc_cdc.h
>>> @@ -301,5 +301,6 @@ int smcr_cdc_msg_send_validation(struct
>>> smc_connection *conn,
>>>                    struct smc_wr_buf *wr_buf);
>>>   int smc_cdc_init(void) __init;
>>>   void smcd_cdc_rx_init(struct smc_connection *conn);
>>> +void smcd_cdc_rx_handler(struct smc_connection *conn);
>>>   #endif /* SMC_CDC_H */
>>> diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
>>> index 353d4a2d69a1..f72e7b24fc1a 100644
>>> --- a/net/smc/smc_loopback.c
>>> +++ b/net/smc/smc_loopback.c
>>> @@ -15,11 +15,13 @@
>>>   #include <linux/types.h>
>>>   #include <net/smc.h>
>>> +#include "smc_cdc.h"
>>>   #include "smc_ism.h"
>>>   #include "smc_loopback.h"
>>>   #if IS_ENABLED(CONFIG_SMC_LO)
>>>   #define SMC_LO_V2_CAPABLE    0x1 /* loopback-ism acts as ISMv2 */
>>> +#define SMC_DMA_ADDR_INVALID    (~(dma_addr_t)0)
>>>   static const char smc_lo_dev_name[] = "loopback-ism";
>>>   static struct smc_lo_dev *lo_dev;
>>> @@ -50,6 +52,97 @@ static int smc_lo_query_rgid(struct smcd_dev
>>> *smcd, struct smcd_gid *rgid,
>>>       return 0;
>>>   }
>>> +static int smc_lo_register_dmb(struct smcd_dev *smcd, struct
>>> smcd_dmb *dmb,
>>> +                   void *client_priv)
>>> +{
>>> +    struct smc_lo_dmb_node *dmb_node, *tmp_node;
>>> +    struct smc_lo_dev *ldev = smcd->priv;
>>> +    int sba_idx, order, rc;
>>> +    struct page *pages;
>>> +
>>> +    /* check space for new dmb */
>>> +    for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LO_MAX_DMBS) {
>>> +        if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
>>> +            break;
>>> +    }
>>> +    if (sba_idx == SMC_LO_MAX_DMBS)
>>> +        return -ENOSPC;
>>> +
>>> +    dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
>>> +    if (!dmb_node) {
>>> +        rc = -ENOMEM;
>>> +        goto err_bit;
>>> +    }
>>> +
>>> +    dmb_node->sba_idx = sba_idx;
>>> +    order = get_order(dmb->dmb_len);
>>> +    pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
>>> +                __GFP_NOMEMALLOC | __GFP_COMP |
>>> +                __GFP_NORETRY | __GFP_ZERO,
>>> +                order);
>>> +    if (!pages) {
>>> +        rc = -ENOMEM;
>>> +        goto err_node;
>>> +    }
>>> +    dmb_node->cpu_addr = (void *)page_address(pages);
>>> +    dmb_node->len = dmb->dmb_len;
>>> +    dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
>>> +
>>> +again:
>>> +    /* add new dmb into hash table */
>>> +    get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
>>> +    write_lock(&ldev->dmb_ht_lock);
>>> +    hash_for_each_possible(ldev->dmb_ht, tmp_node, list,
>>> dmb_node->token) {
>>> +        if (tmp_node->token == dmb_node->token) {
>>> +            write_unlock(&ldev->dmb_ht_lock);
>>> +            goto again;
>>> +        }
>>> +    }
>>> +    hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
>>> +    write_unlock(&ldev->dmb_ht_lock);
>>> +
>> The write_lock_irqsave()/write_unlock_irqrestore() and
>> read_lock_irqsave()/read_unlock_irqrestore() should be used instead of
>> write_lock()/write_unlock() and read_lock()/read_unlock() in order to
>> keep the lock irq-safe.
>>
>
> dmb_ht_lock won't be held in an interrupt or softirq context. The
> dmb_{register|unregister}, dmb_{attach|detach} and data_move operations
> all run in process context. So I think write_(un)lock and read_(un)lock
> are safe here.

Right, it is not held directly in an interrupt context, but it has a
dependency on conn->send_lock, as you wrote below, and that requires an
irq-safe lock. This matches what we found in a test:

=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
6.8.0-rc4-00787-g8eb4d2392609 #2 Not tainted
-----------------------------------------------------
smcapp/33802 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
00000000a2fc0330 (&ldev->dmb_ht_lock){++++}-{2:2}, at:
smc_lo_move_data+0x84/0x1d0 [>
and this task is already holding:
00000000e4df6f28 (&smc->conn.send_lock){+.-.}-{2:2}, at:
smc_tx_sndbuf_nonempty+0xaa>
which would create a new lock dependency:
(&smc->conn.send_lock){+.-.}-{2:2} -> (&ldev->dmb_ht_lock){++++}-{2:2}
but this new dependency connects a SOFTIRQ-irq-safe lock:
(&smc->conn.send_lock){+.-.}-{2:2}

>
>>> +    dmb->sba_idx = dmb_node->sba_idx;
>>> +    dmb->dmb_tok = dmb_node->token;
>>> +    dmb->cpu_addr = dmb_node->cpu_addr;
>>> +    dmb->dma_addr = dmb_node->dma_addr;
>>> +    dmb->dmb_len = dmb_node->len;
>>> +
>>> +    return 0;
>>> +
>>> +err_node:
>>> +    kfree(dmb_node);
>>> +err_bit:
>>> +    clear_bit(sba_idx, ldev->sba_idx_mask);
>>> +    return rc;
>>> +}
>>> +
>>> +static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct
>>> smcd_dmb *dmb)
>>> +{
>>> +    struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
>>> +    struct smc_lo_dev *ldev = smcd->priv;
>>> +
>>> +    /* remove dmb from hash table */
>>> +    write_lock(&ldev->dmb_ht_lock);
>>> +    hash_for_each_possible(ldev->dmb_ht, tmp_node, list,
>>> dmb->dmb_tok) {
>>> +        if (tmp_node->token == dmb->dmb_tok) {
>>> +            dmb_node = tmp_node;
>>> +            break;
>>> +        }
>>> +    }
>>> +    if (!dmb_node) {
>>> +        write_unlock(&ldev->dmb_ht_lock);
>>> +        return -EINVAL;
>>> +    }
>>> +    hash_del(&dmb_node->list);
>>> +    write_unlock(&ldev->dmb_ht_lock);
>>> +
>>> +    clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
>>> +    kfree(dmb_node->cpu_addr);
>>> +    kfree(dmb_node);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
>>>   {
>>>       return -EOPNOTSUPP;
>>> @@ -76,6 +169,38 @@ static int smc_lo_signal_event(struct smcd_dev
>>> *dev, struct smcd_gid *rgid,
>>>       return 0;
>>>   }
>>> +static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
>>> +                unsigned int idx, bool sf, unsigned int offset,
>>> +                void *data, unsigned int size)
>>> +{
>>> +    struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
>>> +    struct smc_lo_dev *ldev = smcd->priv;
>>> +
>>> +    read_lock(&ldev->dmb_ht_lock);
>>> +    hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
>>> +        if (tmp_node->token == dmb_tok) {
>>> +            rmb_node = tmp_node;
>>> +            break;
>>> +        }
>>> +    }
>>> +    if (!rmb_node) {
>>> +        read_unlock(&ldev->dmb_ht_lock);
>>> +        return -EINVAL;
>>> +    }
>>> +    read_unlock(&ldev->dmb_ht_lock);
>>> +
>>> +    memcpy((char *)rmb_node->cpu_addr + offset, data, size);
>>> +
>>
>> Should this read_unlock be placed behind memcpy()?
>>
>
> dmb_ht_lock is used to ensure safe access to the DMB hash table of
> loopback-ism. The DMB hash table could be accessed by all the connections
> on loopback-ism, so it should be protected.
>
> But a certain DMB is only used by one connection, and the move_data
> process is protected by conn->send_lock (see smcd_tx_sndbuf_nonempty()),
> so the memcpy(rmb_node) here is safe and does not race with anything else.
>
> Thanks!
>
Sounds reasonable.
>> <...>

2024-02-23 14:13:46

by Wenjia Zhang

Subject: Re: [PATCH net-next 09/15] net/smc: introduce loopback-ism statistics attributes



On 20.02.24 03:45, Wen Gu wrote:
>
>
> On 2024/2/16 22:24, Wenjia Zhang wrote:
>>
>>
>> On 11.01.24 13:00, Wen Gu wrote:
>>> This introduces some statistics attributes of loopback-ism. They can be
>>> read from /sys/devices/virtual/smc/loopback-ism/{xfer_bytes|dmbs_cnt}.
>>>
>>> Signed-off-by: Wen Gu <[email protected]>
>>> ---
>>>   net/smc/smc_loopback.c | 74 ++++++++++++++++++++++++++++++++++++++++++
>>>   net/smc/smc_loopback.h | 22 +++++++++++++
>>>   2 files changed, 96 insertions(+)
>>>
>>
>> I've read the comments from Jiri and your answer. I can understand
>> your thought. However, from the perspective of the end user, it makes
>> more sense to integrate the stats info into 'smcd stats'. Otherwise,
>> users would be confused about which tool to use to check which
>> statistic information. Sure, some improvement of the smc-tools is
>> also needed.
>
> Thank you Wenjia.
>
> Let's draw an analogy with RDMA devices, which are used in SMC-R. If we
> want to check the RNIC status or statistics, we may use the rdma statistic
> command, or the ibv_devinfo command, or check the files under
> /sys/class/infiniband/mlx5_0. These provide details or attributes related
> to *devices*.
>
> Since s390 ISM can be used outside of SMC, I guess it also has its own way
> (other than smc-tools) to check the statistics?
>
> What we can see in the smcr stats or smcd stats command is about
> statistics or status of the SMC *protocol* layer, such as DMB status,
> Tx/Rx, connections, fallbacks.
>
> If we put the underlying devices' statistics into smc-tools, should we
> also put RNIC statistics or s390 ISM statistics into smcr stats or smcd
> stats? And for each future device that can be used by SMC-R/SMC-D, should
> we update them into smcr stats and smcd stats? And since the attributes of
> each device may be different, should we add entries in smcd stats for each
> of them?
>
> After considering the above, I believe that the details of the underlying
> device should not be exposed to smc (smc-tools). What do you think?
>
> Thanks!
>
That is a very good point. It really depends on how we understand
*devices* and how we want to use them. The more we think about it, the
more complicated it gets. I tried to find accurate definitions for
modeling virtual devices, hoping that would make things easier.
Unfortunately, that is not easy. Finally, I found this article:
https://lwn.net/Articles/645810/ (Heads up! It is from nine years ago,
and I'm not sure how reliable it is.) With the insight from this article,
I'll try to summarize my thoughts:

It looks good to put loopback-ism under /sys/devices/virtual,
especially since, according to the article,
"
.. it is simply a place to put things that don't belong anywhere else.
"
However, in practice we use this in terms of simulated ISM, which
includes not only loopback-ism but also other devices. Thus, would it not
make sense to classify all of them together? E.g. on the same bus (just a
half-baked idea).

Then the following questions come up:
- How should we organize them?
- Should it show up in smc_rnics?
- How should it be seen from the perspective of the container?
- If we see this loopback-ism as a *device*, should we not put only
device-related information under /sys? Thus, dmbs_cnt seems OK, but
xfer_bytes does not. Besides, we have a field in smcd stats named "Data
transmitted (Bytes)", which should be suitable for this information.


2024-02-23 14:44:40

by Wenjia Zhang

Subject: Re: [PATCH net-next 14/15] net/smc: introduce loopback-ism DMB data copy control



On 20.02.24 04:36, Wen Gu wrote:
>
>
> On 2024/2/16 22:25, Wenjia Zhang wrote:
>>
>>
>> On 11.01.24 13:00, Wen Gu wrote:
>>> This provides a way to {get|set} whether loopback-ism device supports
>>> merging sndbuf with peer DMB to eliminate data copies between them.
>>>
>>> echo 0 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # support
>>> echo 1 > /sys/devices/virtual/smc/loopback-ism/dmb_copy # not support
>>>
>> Besides the same confusion that Niklas already mentioned, the name of
>> the option does not make clear enough what it means. What about:
>> echo 1 > /sys/devices/virtual/smc/loopback-ism/nocopy_support # merge mode
>> echo 0 > /sys/devices/virtual/smc/loopback-ism/nocopy_support # copy mode
>>
>
> OK, if we decide to keep the knobs, I will improve the name. Thanks!
>
>>> The settings take effect after re-activating loopback-ism by:
>>>
>>> echo 0 > /sys/devices/virtual/smc/loopback-ism/active
>>> echo 1 > /sys/devices/virtual/smc/loopback-ism/active
>>>
>>> After this, the link group related to loopback-ism will be flushed and
>>> the sndbufs of subsequent connections will be merged or not merged with
>>> peer DMB.
>>>
>>> The motivation of this control is that the bandwidth will be highly
>>> improved when sndbuf and DMB are merged, but when virtually contiguous
>>> DMB is provided and merged with sndbuf, it will be concurrently accessed
>>> on Tx and Rx, then there will be a bottleneck caused by lock contention
>>> of find_vmap_area when there are many CPUs and CONFIG_HARDENED_USERCOPY
>>> is set (see link below). So an option is provided.
>>>
>>> Link:
>>> https://lore.kernel.org/all/[email protected]/
>>> Signed-off-by: Wen Gu <[email protected]>
>>> ---
>> We tried some simple workloads, and the performance of the no-copy
>> case was remarkable. Thus, we're wondering whether it is necessary to
>> have the tunable setting in this loopback case. Or rather, why do we need
>> the copy option? Is that because of the bottleneck caused by using the
>> combination of no-copy and virtually contiguous DMB? Or at least
>> make no-copy the default.
>
> Yes, it is because of the bottleneck caused by using the combination of
> no-copy and virtual-DMB. If we have to use virtual-DMB and
> CONFIG_HARDENED_USERCOPY is set, then we may be forced to use copy mode in
> environments with many CPUs to get good latency (the bandwidth still
> drops because of copy mode).
>
> But if we agree that physical-DMB is acceptable (it costs 1 physical
> buffer per connection side in loopback-ism no-copy mode, the same as what
> sndbuf costs when using s390 ISM), then there is no such performance issue
> and the two knobs can be removed (see also the reply for patch 13/15 [1]).
>
> [1]
> https://lore.kernel.org/netdev/[email protected]/
>
> Thanks!
Thank you, Wen, for the elaboration! As I said, though we did see some
better performance when using virtually contiguous memory in a
simple test, the improvement was not really significant. Additionally,
our environment is very different from your 48-CPU QEMU environment, and
it also depends on the workload. I think I can understand why you see
better performance when using physically contiguous memory. Anyway, I
don't have any objection to using physical-DMB only. But I would still
like to see whether there are other opinions.
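
To make the copy and no-copy modes discussed in this sub-thread concrete, here is a small user-space model. It is a sketch under assumptions, not the code from patches 10-15: the model_* names, the is_ghost flag and the bounds checks are hypothetical, and plain heap memory stands in for sndbuf and DMB.

```c
/* User-space model only: not kernel code, all names are hypothetical. */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct model_buf {
	char *cpu_addr;
	size_t len;
	int is_ghost;	/* 1: descriptor only, the memory belongs to the peer DMB */
};

/* Copy mode: the sndbuf owns its own memory. */
static int model_sndbuf_alloc(struct model_buf *snd, size_t len)
{
	snd->cpu_addr = calloc(1, len);
	if (!snd->cpu_addr)
		return -1;
	snd->len = len;
	snd->is_ghost = 0;
	return 0;
}

/* No-copy (merged) mode: the sndbuf descriptor aliases the peer DMB. */
static void model_sndbuf_attach(struct model_buf *snd,
				const struct model_buf *peer_dmb)
{
	snd->cpu_addr = peer_dmb->cpu_addr;
	snd->len = peer_dmb->len;
	snd->is_ghost = 1;
}

/* Transmit step: returns the number of bytes copied into the peer DMB.
 * In merged mode the data is already in place, so nothing is copied.
 */
static size_t model_tx(const struct model_buf *snd, struct model_buf *peer_dmb,
		       size_t offset, size_t size)
{
	if (offset > peer_dmb->len || size > peer_dmb->len - offset)
		return 0;
	if (snd->is_ghost)
		return 0;
	if (offset > snd->len || size > snd->len - offset)
		return 0;
	memcpy(peer_dmb->cpu_addr + offset, snd->cpu_addr + offset, size);
	return size;
}

/* Only a sndbuf that owns memory frees it; a ghost just drops the alias. */
static void model_sndbuf_release(struct model_buf *snd)
{
	if (!snd->is_ghost)
		free(snd->cpu_addr);
	snd->cpu_addr = NULL;
	snd->len = 0;
}
```

In copy mode every transmit pays for one memcpy() and each side holds two buffers; in merged mode the sender writes straight into the receiver's memory, which is why the same buffer is then accessed concurrently on Tx and Rx.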

2024-02-26 03:05:31

by Wen Gu

Subject: Re: [PATCH net-next 06/15] net/smc: implement DMB-related operations of loopback-ism



On 2024/2/23 22:12, Wenjia Zhang wrote:
>
>
> On 20.02.24 02:55, Wen Gu wrote:
>>
>>
>> On 2024/2/16 22:13, Wenjia Zhang wrote:
>>>
>>>
>>> On 11.01.24 13:00, Wen Gu wrote:
>>>> This implements DMB (un)registration and data move operations of
>>>> loopback-ism device.
>>>>
>>>> Signed-off-by: Wen Gu <[email protected]>
>>>> ---
>>>>   net/smc/smc_cdc.c      |   6 ++
>>>>   net/smc/smc_cdc.h      |   1 +
>>>>   net/smc/smc_loopback.c | 133 ++++++++++++++++++++++++++++++++++++++++-
>>>>   net/smc/smc_loopback.h |  13 ++++
>>>>   4 files changed, 150 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
>>>> index 3c06625ceb20..c820ef197610 100644
>>>> --- a/net/smc/smc_cdc.c
>>>> +++ b/net/smc/smc_cdc.c
>>>> @@ -410,6 +410,12 @@ static void smc_cdc_msg_recv(struct smc_sock *smc, struct smc_cdc_msg *cdc)
>>>>   static void smcd_cdc_rx_tsklet(struct tasklet_struct *t)
>>>>   {
>>>>       struct smc_connection *conn = from_tasklet(conn, t, rx_tsklet);
>>>> +
>>>> +    smcd_cdc_rx_handler(conn);
>>>> +}
>>>> +
>>>> +void smcd_cdc_rx_handler(struct smc_connection *conn)
>>>> +{
>>>>       struct smcd_cdc_msg *data_cdc;
>>>>       struct smcd_cdc_msg cdc;
>>>>       struct smc_sock *smc;
>>>> diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
>>>> index 696cc11f2303..11559d4ebf2b 100644
>>>> --- a/net/smc/smc_cdc.h
>>>> +++ b/net/smc/smc_cdc.h
>>>> @@ -301,5 +301,6 @@ int smcr_cdc_msg_send_validation(struct smc_connection *conn,
>>>>                    struct smc_wr_buf *wr_buf);
>>>>   int smc_cdc_init(void) __init;
>>>>   void smcd_cdc_rx_init(struct smc_connection *conn);
>>>> +void smcd_cdc_rx_handler(struct smc_connection *conn);
>>>>   #endif /* SMC_CDC_H */
>>>> diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
>>>> index 353d4a2d69a1..f72e7b24fc1a 100644
>>>> --- a/net/smc/smc_loopback.c
>>>> +++ b/net/smc/smc_loopback.c
>>>> @@ -15,11 +15,13 @@
>>>>   #include <linux/types.h>
>>>>   #include <net/smc.h>
>>>> +#include "smc_cdc.h"
>>>>   #include "smc_ism.h"
>>>>   #include "smc_loopback.h"
>>>>   #if IS_ENABLED(CONFIG_SMC_LO)
>>>>   #define SMC_LO_V2_CAPABLE    0x1 /* loopback-ism acts as ISMv2 */
>>>> +#define SMC_DMA_ADDR_INVALID    (~(dma_addr_t)0)
>>>>   static const char smc_lo_dev_name[] = "loopback-ism";
>>>>   static struct smc_lo_dev *lo_dev;
>>>> @@ -50,6 +52,97 @@ static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
>>>>       return 0;
>>>>   }
>>>> +static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
>>>> +                   void *client_priv)
>>>> +{
>>>> +    struct smc_lo_dmb_node *dmb_node, *tmp_node;
>>>> +    struct smc_lo_dev *ldev = smcd->priv;
>>>> +    int sba_idx, order, rc;
>>>> +    struct page *pages;
>>>> +
>>>> +    /* check space for new dmb */
>>>> +    for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LO_MAX_DMBS) {
>>>> +        if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
>>>> +            break;
>>>> +    }
>>>> +    if (sba_idx == SMC_LO_MAX_DMBS)
>>>> +        return -ENOSPC;
>>>> +
>>>> +    dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
>>>> +    if (!dmb_node) {
>>>> +        rc = -ENOMEM;
>>>> +        goto err_bit;
>>>> +    }
>>>> +
>>>> +    dmb_node->sba_idx = sba_idx;
>>>> +    order = get_order(dmb->dmb_len);
>>>> +    pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
>>>> +                __GFP_NOMEMALLOC | __GFP_COMP |
>>>> +                __GFP_NORETRY | __GFP_ZERO,
>>>> +                order);
>>>> +    if (!pages) {
>>>> +        rc = -ENOMEM;
>>>> +        goto err_node;
>>>> +    }
>>>> +    dmb_node->cpu_addr = (void *)page_address(pages);
>>>> +    dmb_node->len = dmb->dmb_len;
>>>> +    dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
>>>> +
>>>> +again:
>>>> +    /* add new dmb into hash table */
>>>> +    get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
>>>> +    write_lock(&ldev->dmb_ht_lock);
>>>> +    hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_node->token) {
>>>> +        if (tmp_node->token == dmb_node->token) {
>>>> +            write_unlock(&ldev->dmb_ht_lock);
>>>> +            goto again;
>>>> +        }
>>>> +    }
>>>> +    hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
>>>> +    write_unlock(&ldev->dmb_ht_lock);
>>>> +
>>> The write_lock_irqsave()/write_unlock_irqrestore() and read_lock_irqsave()/read_unlock_irqrestore() should be used
>>> instead of write_lock()/write_unlock() and read_lock()/read_unlock() in order to keep the lock irq-safe.
>>>
>>
>> dmb_ht_lock won't be held in an interrupt or softirq context. The dmb_{register|unregister},
>> dmb_{attach|detach} and data_move operations all run in process context. So I think write_(un)lock
>> and read_(un)lock are safe here.
>
> Right, it is not held directly in an interrupt context, but it has a dependency on conn->send_lock, as you wrote below,
> and that requires an irq-safe lock. This matches what we found in a test:
>
> =====================================================
> WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> 6.8.0-rc4-00787-g8eb4d2392609 #2 Not tainted
> -----------------------------------------------------
> smcapp/33802 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
> 00000000a2fc0330 (&ldev->dmb_ht_lock){++++}-{2:2}, at: smc_lo_move_data+0x84/0x1d0 [>
> and this task is already holding:
> 00000000e4df6f28 (&smc->conn.send_lock){+.-.}-{2:2}, at: smc_tx_sndbuf_nonempty+0xaa>
> which would create a new lock dependency:
> (&smc->conn.send_lock){+.-.}-{2:2} -> (&ldev->dmb_ht_lock){++++}-{2:2}
> but this new dependency connects a SOFTIRQ-irq-safe lock:
> (&smc->conn.send_lock){+.-.}-{2:2}
>

I understand, thank you Wenjia. I will fix it in the next version.

>>
>>>> +    dmb->sba_idx = dmb_node->sba_idx;
>>>> +    dmb->dmb_tok = dmb_node->token;
>>>> +    dmb->cpu_addr = dmb_node->cpu_addr;
>>>> +    dmb->dma_addr = dmb_node->dma_addr;
>>>> +    dmb->dmb_len = dmb_node->len;
>>>> +
>>>> +    return 0;
>>>> +
>>>> +err_node:
>>>> +    kfree(dmb_node);
>>>> +err_bit:
>>>> +    clear_bit(sba_idx, ldev->sba_idx_mask);
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
>>>> +{
>>>> +    struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
>>>> +    struct smc_lo_dev *ldev = smcd->priv;
>>>> +
>>>> +    /* remove dmb from hash table */
>>>> +    write_lock(&ldev->dmb_ht_lock);
>>>> +    hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
>>>> +        if (tmp_node->token == dmb->dmb_tok) {
>>>> +            dmb_node = tmp_node;
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +    if (!dmb_node) {
>>>> +        write_unlock(&ldev->dmb_ht_lock);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +    hash_del(&dmb_node->list);
>>>> +    write_unlock(&ldev->dmb_ht_lock);
>>>> +
>>>> +    clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
>>>> +    kfree(dmb_node->cpu_addr);
>>>> +    kfree(dmb_node);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>>   static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
>>>>   {
>>>>       return -EOPNOTSUPP;
>>>> @@ -76,6 +169,38 @@ static int smc_lo_signal_event(struct smcd_dev *dev, struct smcd_gid *rgid,
>>>>       return 0;
>>>>   }
>>>> +static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
>>>> +                unsigned int idx, bool sf, unsigned int offset,
>>>> +                void *data, unsigned int size)
>>>> +{
>>>> +    struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
>>>> +    struct smc_lo_dev *ldev = smcd->priv;
>>>> +
>>>> +    read_lock(&ldev->dmb_ht_lock);
>>>> +    hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
>>>> +        if (tmp_node->token == dmb_tok) {
>>>> +            rmb_node = tmp_node;
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +    if (!rmb_node) {
>>>> +        read_unlock(&ldev->dmb_ht_lock);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +    read_unlock(&ldev->dmb_ht_lock);
>>>> +
>>>> +    memcpy((char *)rmb_node->cpu_addr + offset, data, size);
>>>> +
>>>
>>> Should this read_unlock be placed behind memcpy()?
>>>
>>
>> dmb_ht_lock is used to ensure safe access to the DMB hash table of loopback-ism.
>> The DMB hash table could be accessed by all the connections on loopback-ism, so
>> it should be protected.
>>
>> But a certain DMB is only used by one connection, and the move_data process is
>> protected by conn->send_lock (see smcd_tx_sndbuf_nonempty()), so the memcpy(rmb_node)
>> here is safe and does not race with anything else.
>>
>> Thanks!
>>
> sounds reasonable.
>>> <...>

2024-02-26 13:02:56

by Wen Gu

Subject: Re: [PATCH net-next 09/15] net/smc: introduce loopback-ism statistics attributes



On 2024/2/23 22:13, Wenjia Zhang wrote:
>
>
> On 20.02.24 03:45, Wen Gu wrote:
>>
>>
>> On 2024/2/16 22:24, Wenjia Zhang wrote:
>>>
>>>
>>> On 11.01.24 13:00, Wen Gu wrote:
>>>> This introduces some statistics attributes of loopback-ism. They can be
>>>> read from /sys/devices/virtual/smc/loopback-ism/{xfer_tytes|dmbs_cnt}.
>>>>
>>>> Signed-off-by: Wen Gu <[email protected]>
>>>> ---
>>>>   net/smc/smc_loopback.c | 74 ++++++++++++++++++++++++++++++++++++++++++
>>>>   net/smc/smc_loopback.h | 22 +++++++++++++
>>>>   2 files changed, 96 insertions(+)
>>>>
>>>
>>> I've read the comments from Jiri and your answer. I can understand your thought. However, from the perspective of the
>>> end user, it makes more sense to integrate the stats info into 'smcd stats'. Otherwise, users would be
>>> confused about which tool to use to check which statistic information. Sure, some improvement of the smc-tools is
>>> also needed.
>>
>> Thank you Wenjia.
>>
>> Let's draw an analogy with RDMA devices, which are used in SMC-R. If we want
>> to check the RNIC status or statistics, we may use the rdma statistic command, or
>> the ibv_devinfo command, or check the files under /sys/class/infiniband/mlx5_0. These
>> provide details or attributes related to *devices*.
>>
>> Since s390 ISM can be used outside of SMC, I guess it also has its own way (other
>> than smc-tools) to check the statistics?
>>
>> What we can see in the smcr stats or smcd stats command is the statistics or
>> status of the SMC *protocol* layer, such as DMB status, Tx/Rx, connections, fallbacks.
>>
>> If we put the underlying devices' statistics into smc-tools, should we also
>> put RNIC statistics or s390 ISM statistics into smcr stats or smcd stats? And
>> for each future device that can be used by SMC-R/SMC-D, should we add it
>> to smcr stats and smcd stats? And since the attributes of each device may differ,
>> should we add entries in smcd stats for each of them?
>>
>> After considering the above, I believe that the details of the underlying
>> device should not be exposed to smc (smc-tools). What do you think?
>>
>> Thanks!
>>
> That is a very good point. It really depends on how we understand *devices* and how we want to use them. The more we
> think about it, the more complicated it gets. I'm trying to find accurate definitions for modeling virtual devices,
> hoping that would make things easier. Unfortunately, it is not easy. Finally, I found this article:
> https://lwn.net/Articles/645810/ (Heads up! It is from nine years ago, so I'm not sure how reliable it is.) With the
> insight of this article, I'm trying to summarize my thoughts:
>
> It looks good to put the loopback-ism under the /sys/devices/virtual, especially according to the article
> "
> ... it is simply a place to put things that don't belong anywhere else.
> "

Yes, this is also reflected in the implementation of get_device_parent():

static struct kobject *get_device_parent(struct device *dev,
                                         struct device *parent)
{
        <...>
        /*
         * If we have no parent, we live in "virtual".
         * Class-devices with a non class-device as parent, live
         * in a "glue" directory to prevent namespace collisions.
         */
        if (parent == NULL)
                parent_kobj = virtual_device_parent(dev);
        else if (parent->class && !dev->class->ns_type) {
                subsys_put(sp);
                return &parent->kobj;
        } else {
                parent_kobj = &parent->kobj;
        }
        <...>
}
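
As a rough illustration of the placement rule in that excerpt, here is a userspace sketch that computes where a class device would appear in sysfs. `struct model_dev` and `model_sysfs_path()` are hypothetical and not kernel API. The model ignores `ns_type` and every other detail of the driver core, and the PCI path used in the usage note is an invented example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified device model: only the fields the rule needs. */
struct model_dev {
    const char *name;          /* device name, e.g. "loopback-ism" */
    const char *class_name;    /* class the device belongs to, e.g. "smc" */
    const char *parent_path;   /* sysfs path of the parent, NULL if none */
    int parent_has_class;      /* non-zero if the parent is itself a class device */
};

/*
 * Writes the modelled sysfs path into buf and returns the length that
 * snprintf() reports. Namespaced classes (ns_type) are not modelled.
 */
static int model_sysfs_path(const struct model_dev *dev, char *buf, size_t len)
{
    assert(dev && buf && len > 0);

    /* No parent: the device lives under "virtual", grouped by class. */
    if (!dev->parent_path)
        return snprintf(buf, len, "/sys/devices/virtual/%s/%s",
                        dev->class_name, dev->name);

    /* Parent is a class device: the device sits directly below it. */
    if (dev->parent_has_class)
        return snprintf(buf, len, "%s/%s", dev->parent_path, dev->name);

    /* Otherwise a "glue" directory named after the class is inserted. */
    return snprintf(buf, len, "%s/%s/%s", dev->parent_path,
                    dev->class_name, dev->name);
}
```

With `{ "loopback-ism", "smc", NULL, 0 }` the model yields /sys/devices/virtual/smc/loopback-ism, the path used by this patch. Changing only the class name to "misc" would yield /sys/devices/virtual/misc/loopback-ism.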

> However, in practice we use this in terms of simulated ISM, which includes not only loopback-ism, but also other
> ones. Thus, does it not make sense to classify all of them together? E.g. same bus (just a half-baked idea)
>
> Then the following questions are coming up:
> - How should we organize them?
> - Should it show up in the smc_rnics?
> - How should it be seen from the perspective of the container?
> - If we see this loopback-ism as a *device*, should we not put only the device-related information under /sys? Thus,
> dmbs_cnt seems ok, but xfer_tytes does not. Besides, we have a field in smcd stats named "Data transmitted (Bytes)", which
> should be suitable for this information.

Actually I created the 'smc' class under /sys/devices/virtual just to place
loopback-ism, since it doesn't seem to belong to a certain class of device
and serves only SMC. Other 'smc devices', e.g. RDMA devices, s390 ISM and
other Emulated-ISM devices like virtio-ism, all belong to a certain class or bus,
so I have no intention of putting them under the same path.

But now it looks like the 'smc' class and the /sys/devices/virtual/smc path
will lead people to mistakenly think that there is a class of 'SMC devices',
while in fact these 'SMC devices' belong to different classes or buses. They
can be used by SMC and any other users. So I think it is better to avoid
creating such an 'smc' class.

Alternatively, after referring to other examples in the kernel, I think
another choice is to put loopback-ism under /sys/devices/virtual/misc/,
which is for devices that don't fit in a specific class. What do you think?

Thanks a lot!