2024-03-12 14:28:53

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 00/11] net/smc: SMC intra-OS shortcut with loopback-ism

This patch set acts as the second part of the new version of [1] (The first
part can be referred from [2]), the updated things of this version are listed
at the end.

- Background

SMC-D is now used in IBM z with ISM function to optimize network interconnect
for intra-CPC communications. Inspired by this, we try to make SMC-D available
on the non-s390 architecture through a software-implemented Emulated-ISM device,
that is the loopback-ism device here, to accelerate inter-process or
inter-containers communication within the same OS instance.

- Design

This patch set includes 3 parts:

- Patch #1-#2: some prepare work for loopback-ism.
- Patch #3-#7: implement loopback-ism device. Noted that loopback-ism now
serves only SMC and no userspace interface exposed.
- Patch #10-#15: memory copy optimization for intra-OS scenario.

The loopback-ism device is designed as an ISMv2 device and not be limited to
a specific net namespace, ends of both inter-process connection (1/1' in diagram
below) or inter-container connection (2/2' in diagram below) can find the same
available loopback-ism and choose it during the CLC handshake.

Container 1 (ns1) Container 2 (ns2)
+-----------------------------------------+ +-------------------------+
| +-------+ +-------+ +-------+ | | +-------+ |
| | App A | | App B | | App C | | | | App D |<-+ |
| +-------+ +---^---+ +-------+ | | +-------+ |(2') |
| |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
| (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
| `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
+---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
| | | |
Kernel | | | |
+----+-------v---+-----------v----------------------------------+---+----+
| | TCP | |
| | | |
| +--------------------------------------------------------------+ |
| |
| +--------------+ |
| | smc loopback | |
+---------------------------+--------------+-----------------------------+

loopback-ism device creates DMBs (shared memory) for each connection peer.
Since data transfer occurs within the same kernel, the sndbuf of each peer
is only a descriptor and point to the same memory region as peer DMB, so that
the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.

Container 1 (ns1) Container 2 (ns2)
+-----------------------------------------+ +-------------------------+
| +-------+ | | +-------+ |
| | App C |-----+ | | | App D | |
| +-------+ | | | +-^-----+ |
| | | | | |
| (2) | | | (2') | |
| | | | | |
+---------------|-------------------------+ +----------|--------------+
| |
Kernel | |
+---------------|-----------------------------------------|--------------+
| +--------+ +--v-----+ +--------+ +--------+ |
| |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
| +-----|--+ +--|-----+ +-----|--+ +--------+ |
| +-----|--+ | +-----|--+ |
| | DMB C | +---------------------------------| DMB D | |
| +--------+ +--------+ |
| |
| +--------------+ |
| | smc loopback | |
+---------------------------+--------------+-----------------------------+

- Benchmark Test

* Test environments:
- VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
- SMC sndbuf/DMB size 1MB.

* Test object:
- TCP: run on TCP loopback.
- SMC lo: run on SMC loopback-ism.

1. ipc-benchmark (see [3])

- ./<foo> -c 1000000 -s 100

TCP SMC-lo
Message
rate (msg/s) 81433 143938(+76.75%)

2. sockperf

- serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
- clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30

TCP SMC-lo
Bandwidth(MBps) 4903.07 7978.69(+62.73%)
Latency(us) 6.095 3.539(-41.94%)

3. nginx/wrk

- serv: <smc_run> nginx
- clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80

TCP SMC-lo
Requests/s 161665.67 244272.41(+51.10%)

4. redis-benchmark

- serv: <smc_run> redis-server
- clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024

TCP SMC-lo
GET(Requests/s) 88790.23 117474.30(+32.31%)
SET(Requests/s) 87508.20 118623.96(+35.57%)


Change log:

v3->v2:
- Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler()
to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh()
to acquire dmb_ht_lock.

v2->v1:
Link: https://lore.kernel.org/netdev/[email protected]/
- All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1.
- Patch #3: optimized the description of SMC_LO config. Avoid exposing loopback-ism
to sysfs and remove all the knobs until future definition clear.
- Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data().
- Patch #6: defaultly use physical contiguous DMB buffers.
- Patch #11: defaultly enable DMB no-copy for loopback-ism and free the DMB in
unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using
wait_event to keep waiting in unregister_dmb.

v1->RFC:
Link: https://lore.kernel.org/netdev/[email protected]/
- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
/sys/devices/virtual/smc/loopback-ism/xfer_bytes
- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
merging sndbuf with peer DMB.
- Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
control of whether to merge sndbuf and DMB. They can be respectively set by:
/sys/devices/virtual/smc/loopback-ism/dmb_type
/sys/devices/virtual/smc/loopback-ism/dmb_copy
The motivation for these two control is that a performance bottleneck was
found when using vzalloced DMB and sndbuf is merged with DMB, and there are
many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
vmap lock contention [6]. It has significant effects, but using virtual memory
still has additional overhead compared to using physical memory.
So this new version provides controls of dmb_type and dmb_copy to suit
different scenarios.
- Some minor changes and comments improvements.

RFC->old version([1]):
Link: https://lore.kernel.org/netdev/[email protected]/
- Patch #1: improve the loopback-ism dump, it shows as follows now:
# smcd d
FID Type PCI-ID PCHID InUse #LGs PNET-ID
0000 0 loopback-ism ffff No 0
- Patch #3: introduce the smc_ism_set_v2_capable() helper and set
smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
regardless of whether there is already a device in smcd device list.
- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
to activate or deactivate the loopback-ism.
- Patch #9: introduce the statistics of loopback-ism by
/sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
- Some minor changes and comments improvements.

[1] https://lore.kernel.org/netdev/[email protected]/
[2] https://lore.kernel.org/netdev/[email protected]/
[3] https://github.com/goldsborough/ipc-bench
[4] https://lore.kernel.org/all/[email protected]/
[5] https://lore.kernel.org/all/[email protected]/
[6] https://lore.kernel.org/all/[email protected]/

Wen Gu (11):
net/smc: adapt SMC-D device dump for Emulated-ISM
net/smc: decouple ism_client from SMC-D DMB registration
net/smc: introduce loopback-ism for SMC intra-OS shortcut
net/smc: implement ID-related operations of loopback-ism
net/smc: implement some unsupported operations of loopback-ism
net/smc: implement DMB-related operations of loopback-ism
net/smc: register loopback-ism into SMC-D device list
net/smc: add operations to merge sndbuf with peer DMB
net/smc: attach or detach ghost sndbuf to peer DMB
net/smc: adapt cursor update when sndbuf and peer DMB are merged
net/smc: implement DMB-merged operations of loopback-ism

drivers/s390/net/ism_drv.c | 2 +-
include/net/smc.h | 7 +-
net/smc/Kconfig | 13 ++
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 28 ++-
net/smc/smc_cdc.c | 52 ++++-
net/smc/smc_core.c | 61 ++++-
net/smc/smc_core.h | 1 +
net/smc/smc_ism.c | 71 +++++-
net/smc/smc_ism.h | 5 +
net/smc/smc_loopback.c | 462 +++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 52 +++++
12 files changed, 727 insertions(+), 29 deletions(-)
create mode 100644 net/smc/smc_loopback.c
create mode 100644 net/smc/smc_loopback.h

--
2.32.0.3.g01195cf9f



2024-03-12 14:29:17

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 02/11] net/smc: decouple ism_client from SMC-D DMB registration

The struct 'ism_client' is specialized for s390 platform firmware ISM.
So replace it with 'void' to make SMCD DMB registration helper generic
for both Emulated-ISM and existing ISM.

Signed-off-by: Wen Gu <[email protected]>
---
drivers/s390/net/ism_drv.c | 2 +-
include/net/smc.h | 4 ++--
net/smc/smc_ism.c | 7 ++-----
3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/s390/net/ism_drv.c b/drivers/s390/net/ism_drv.c
index 2c8e964425dc..9b2a52913e76 100644
--- a/drivers/s390/net/ism_drv.c
+++ b/drivers/s390/net/ism_drv.c
@@ -726,7 +726,7 @@ static int smcd_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
}

static int smcd_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
- struct ism_client *client)
+ void *client)
{
return ism_register_dmb(smcd->priv, (struct ism_dmb *)dmb, client);
}
diff --git a/include/net/smc.h b/include/net/smc.h
index c9dcb30e3fd9..6273c3a8b24a 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -50,7 +50,6 @@ struct smcd_dmb {
#define ISM_ERROR 0xFFFF

struct smcd_dev;
-struct ism_client;

struct smcd_gid {
u64 gid;
@@ -61,7 +60,7 @@ struct smcd_ops {
int (*query_remote_gid)(struct smcd_dev *dev, struct smcd_gid *rgid,
u32 vid_valid, u32 vid);
int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
- struct ism_client *client);
+ void *client);
int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
@@ -81,6 +80,7 @@ struct smcd_ops {
struct smcd_dev {
const struct smcd_ops *ops;
void *priv;
+ void *client;
struct list_head list;
spinlock_t lock;
struct smc_connection **conn;
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index b6eca4231913..26743a14cf27 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -222,7 +222,6 @@ int smc_ism_unregister_dmb(struct smcd_dev *smcd, struct smc_buf_desc *dmb_desc)
int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
struct smc_buf_desc *dmb_desc)
{
-#if IS_ENABLED(CONFIG_ISM)
struct smcd_dmb dmb;
int rc;

@@ -231,7 +230,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
dmb.sba_idx = dmb_desc->sba_idx;
dmb.vlan_id = lgr->vlan_id;
dmb.rgid = lgr->peer_gid.gid;
- rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, &smc_ism_client);
+ rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, lgr->smcd->client);
if (!rc) {
dmb_desc->sba_idx = dmb.sba_idx;
dmb_desc->token = dmb.dmb_tok;
@@ -240,9 +239,6 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
dmb_desc->len = dmb.dmb_len;
}
return rc;
-#else
- return 0;
-#endif
}

static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
@@ -453,6 +449,7 @@ static void smcd_register_dev(struct ism_dev *ism)
if (!smcd)
return;
smcd->priv = ism;
+ smcd->client = &smc_ism_client;
ism_set_priv(ism, &smc_ism_client, smcd);
if (smc_pnetid_by_dev_port(&ism->pdev->dev, 0, smcd->pnetid))
smc_pnetid_by_table_smcd(smcd);
--
2.32.0.3.g01195cf9f


2024-03-12 14:29:53

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 03/11] net/smc: introduce loopback-ism for SMC intra-OS shortcut

This introduces a kind of Emulated-ISM device named loopback-ism for
SMCv2.1. The loopback-ism device is currently exclusive for SMC usage,
and aims to provide a SMC shortcut for sockets within the same kernel,
leading to improved intra-OS traffic performance. Configuration of this
feature is managed through the config SMC_LO.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/Kconfig | 13 ++++
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 12 ++-
net/smc/smc_loopback.c | 164 +++++++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 33 +++++++++
5 files changed, 222 insertions(+), 2 deletions(-)
create mode 100644 net/smc/smc_loopback.c
create mode 100644 net/smc/smc_loopback.h

diff --git a/net/smc/Kconfig b/net/smc/Kconfig
index 746be3996768..2f1a8706a372 100644
--- a/net/smc/Kconfig
+++ b/net/smc/Kconfig
@@ -20,3 +20,16 @@ config SMC_DIAG
smcss.

if unsure, say Y.
+
+config SMC_LO
+ bool "SMC intra-OS shortcut with loopback-ism"
+ depends on SMC
+ default n
+ help
+ SMC_LO enables the creation of an Emulated-ISM device named
+ loopback-ism in SMC and the use of it for transferring data
+ when communication occurs within the same OS. This helps in
+ convenient testing of SMC-D since loopback-ism is independent
+ of architecture or hardware.
+
+ if unsure, say N.
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 875efcd126a2..a8c37111abe1 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -4,5 +4,5 @@ obj-$(CONFIG_SMC) += smc.o
obj-$(CONFIG_SMC_DIAG) += smc_diag.o
smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o
-smc-y += smc_tracepoint.o
+smc-y += smc_tracepoint.o smc_loopback.o
smc-$(CONFIG_SYSCTL) += smc_sysctl.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 4b52b3b159c0..751a8fdd6eb0 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -53,6 +53,7 @@
#include "smc_stats.h"
#include "smc_tracepoint.h"
#include "smc_sysctl.h"
+#include "smc_loopback.h"

static DEFINE_MUTEX(smc_server_lgr_pending); /* serialize link group
* creation on server
@@ -3557,15 +3558,23 @@ static int __init smc_init(void)
goto out_sock;
}

+ rc = smc_loopback_init();
+ if (rc) {
+ pr_err("%s: smc_loopback_init fails with %d\n", __func__, rc);
+ goto out_ib;
+ }
+
rc = tcp_register_ulp(&smc_ulp_ops);
if (rc) {
pr_err("%s: tcp_ulp_register fails with %d\n", __func__, rc);
- goto out_ib;
+ goto out_lo;
}

static_branch_enable(&tcp_have_smc);
return 0;

+out_lo:
+ smc_loopback_exit();
out_ib:
smc_ib_unregister_client();
out_sock:
@@ -3603,6 +3612,7 @@ static void __exit smc_exit(void)
tcp_unregister_ulp(&smc_ulp_ops);
sock_unregister(PF_SMC);
smc_core_exit();
+ smc_loopback_exit();
smc_ib_unregister_client();
smc_ism_exit();
destroy_workqueue(smc_close_wq);
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
new file mode 100644
index 000000000000..e9170d86e58f
--- /dev/null
+++ b/net/smc/smc_loopback.c
@@ -0,0 +1,164 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared Memory Communications Direct over loopback-ism device.
+ *
+ * Provide a SMC-D loopback-ism device.
+ *
+ * Copyright (c) 2024, Alibaba Inc.
+ *
+ * Author: Wen Gu <[email protected]>
+ * Tony Lu <[email protected]>
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/types.h>
+#include <net/smc.h>
+
+#include "smc_ism.h"
+#include "smc_loopback.h"
+
+#if IS_ENABLED(CONFIG_SMC_LO)
+static const char smc_lo_dev_name[] = "loopback-ism";
+static struct smc_lo_dev *lo_dev;
+
+static const struct smcd_ops lo_ops = {
+ .query_remote_gid = NULL,
+ .register_dmb = NULL,
+ .unregister_dmb = NULL,
+ .add_vlan_id = NULL,
+ .del_vlan_id = NULL,
+ .set_vlan_required = NULL,
+ .reset_vlan_required = NULL,
+ .signal_event = NULL,
+ .move_data = NULL,
+ .supports_v2 = NULL,
+ .get_local_gid = NULL,
+ .get_chid = NULL,
+ .get_dev = NULL,
+};
+
+static struct smcd_dev *smcd_lo_alloc_dev(const struct smcd_ops *ops,
+ int max_dmbs)
+{
+ struct smcd_dev *smcd;
+
+ smcd = kzalloc(sizeof(*smcd), GFP_KERNEL);
+ if (!smcd)
+ return NULL;
+
+ smcd->conn = kcalloc(max_dmbs, sizeof(struct smc_connection *),
+ GFP_KERNEL);
+ if (!smcd->conn)
+ goto out_smcd;
+
+ smcd->ops = ops;
+
+ spin_lock_init(&smcd->lock);
+ spin_lock_init(&smcd->lgr_lock);
+ INIT_LIST_HEAD(&smcd->vlan);
+ INIT_LIST_HEAD(&smcd->lgr_list);
+ init_waitqueue_head(&smcd->lgrs_deleted);
+ return smcd;
+
+out_smcd:
+ kfree(smcd);
+ return NULL;
+}
+
+static int smcd_lo_register_dev(struct smc_lo_dev *ldev)
+{
+ struct smcd_dev *smcd;
+
+ smcd = smcd_lo_alloc_dev(&lo_ops, SMC_LO_MAX_DMBS);
+ if (!smcd)
+ return -ENOMEM;
+ ldev->smcd = smcd;
+ smcd->priv = ldev;
+
+ /* TODO:
+ * register loopback-ism to smcd_dev list.
+ */
+ return 0;
+}
+
+static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
+{
+ struct smcd_dev *smcd = ldev->smcd;
+
+ /* TODO:
+ * unregister loopback-ism from smcd_dev list.
+ */
+ kfree(smcd->conn);
+ kfree(smcd);
+}
+
+static int smc_lo_dev_init(struct smc_lo_dev *ldev)
+{
+ return smcd_lo_register_dev(ldev);
+}
+
+static void smc_lo_dev_exit(struct smc_lo_dev *ldev)
+{
+ smcd_lo_unregister_dev(ldev);
+}
+
+static void smc_lo_dev_release(struct device *dev)
+{
+ struct smc_lo_dev *ldev =
+ container_of(dev, struct smc_lo_dev, dev);
+
+ kfree(ldev);
+}
+
+static int smc_lo_dev_probe(void)
+{
+ struct smc_lo_dev *ldev;
+ int ret;
+
+ ldev = kzalloc(sizeof(*ldev), GFP_KERNEL);
+ if (!ldev)
+ return -ENOMEM;
+
+ ldev->dev.parent = NULL;
+ ldev->dev.release = smc_lo_dev_release;
+ device_initialize(&ldev->dev);
+ dev_set_name(&ldev->dev, smc_lo_dev_name);
+
+ ret = smc_lo_dev_init(ldev);
+ if (ret)
+ goto free_dev;
+
+ lo_dev = ldev; /* global loopback device */
+ return 0;
+
+free_dev:
+ put_device(&ldev->dev);
+ return ret;
+}
+
+static void smc_lo_dev_remove(void)
+{
+ if (!lo_dev)
+ return;
+
+ smc_lo_dev_exit(lo_dev);
+ put_device(&lo_dev->dev); /* device_initialize in smc_lo_dev_probe */
+}
+#endif
+
+int smc_loopback_init(void)
+{
+#if IS_ENABLED(CONFIG_SMC_LO)
+ return smc_lo_dev_probe();
+#else
+ return 0;
+#endif
+}
+
+void smc_loopback_exit(void)
+{
+#if IS_ENABLED(CONFIG_SMC_LO)
+ smc_lo_dev_remove();
+#endif
+}
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
new file mode 100644
index 000000000000..9dd44d4c0ca3
--- /dev/null
+++ b/net/smc/smc_loopback.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Shared Memory Communications Direct over loopback-ism device.
+ *
+ * Provide a SMC-D loopback-ism device.
+ *
+ * Copyright (c) 2024, Alibaba Inc.
+ *
+ * Author: Wen Gu <[email protected]>
+ * Tony Lu <[email protected]>
+ *
+ */
+
+#ifndef _SMC_LOOPBACK_H
+#define _SMC_LOOPBACK_H
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <net/smc.h>
+
+#if IS_ENABLED(CONFIG_SMC_LO)
+#define SMC_LO_MAX_DMBS 5000
+
+struct smc_lo_dev {
+ struct smcd_dev *smcd;
+ struct device dev;
+};
+#endif
+
+int smc_loopback_init(void);
+void smc_loopback_exit(void);
+
+#endif /* _SMC_LOOPBACK_H */
--
2.32.0.3.g01195cf9f


2024-03-12 14:29:59

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 01/11] net/smc: adapt SMC-D device dump for Emulated-ISM

The introduction of Emulated-ISM requires adaptation of SMC-D device
dump. Software implemented non-PCI device (loopback-ism) should be
handled correctly and the CHID reserved for Emulated-ISM should be got
from smcd_ops interface instead of PCI information.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_ism.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index ac88de2a06a0..b6eca4231913 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -252,12 +252,11 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
char smc_pnet[SMC_MAX_PNETID_LEN + 1];
struct smc_pci_dev smc_pci_dev;
struct nlattr *port_attrs;
+ struct device *device;
struct nlattr *attrs;
- struct ism_dev *ism;
int use_cnt = 0;
void *nlh;

- ism = smcd->priv;
nlh = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
&smc_gen_nl_family, NLM_F_MULTI,
SMC_NETLINK_GET_DEV_SMCD);
@@ -272,7 +271,15 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
if (nla_put_u8(skb, SMC_NLA_DEV_IS_CRIT, use_cnt > 0))
goto errattr;
memset(&smc_pci_dev, 0, sizeof(smc_pci_dev));
- smc_set_pci_values(to_pci_dev(ism->dev.parent), &smc_pci_dev);
+ device = smcd->ops->get_dev(smcd);
+ if (device->parent)
+ smc_set_pci_values(to_pci_dev(device->parent), &smc_pci_dev);
+ if (smc_ism_is_emulated(smcd)) {
+ smc_pci_dev.pci_pchid = smc_ism_get_chid(smcd);
+ if (!device->parent)
+ snprintf(smc_pci_dev.pci_id, sizeof(smc_pci_dev.pci_id),
+ "%s", dev_name(device));
+ }
if (nla_put_u32(skb, SMC_NLA_DEV_PCI_FID, smc_pci_dev.pci_fid))
goto errattr;
if (nla_put_u16(skb, SMC_NLA_DEV_PCI_CHID, smc_pci_dev.pci_pchid))
--
2.32.0.3.g01195cf9f


2024-03-12 14:30:00

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 04/11] net/smc: implement ID-related operations of loopback-ism

This implements GID and CHID related operations of loopback-ism device.
loopback-ism acts as an ISMv2. It's GID is generated randomly by UUIDv4
algorithm and CHID is reserved 0xFFFF.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 62 ++++++++++++++++++++++++++++++++++++++----
net/smc/smc_loopback.h | 3 ++
2 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index e9170d86e58f..7656a2474500 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -19,11 +19,62 @@
#include "smc_loopback.h"

#if IS_ENABLED(CONFIG_SMC_LO)
+#define SMC_LO_V2_CAPABLE 0x1 /* loopback-ism acts as ISMv2 */
+
static const char smc_lo_dev_name[] = "loopback-ism";
static struct smc_lo_dev *lo_dev;

+static void smc_lo_generate_id(struct smc_lo_dev *ldev)
+{
+ struct smcd_gid *lgid = &ldev->local_gid;
+ uuid_t uuid;
+
+ uuid_gen(&uuid);
+ memcpy(&lgid->gid, &uuid, sizeof(lgid->gid));
+ memcpy(&lgid->gid_ext, (u8 *)&uuid + sizeof(lgid->gid),
+ sizeof(lgid->gid_ext));
+
+ ldev->chid = SMC_LO_CHID;
+}
+
+static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
+ u32 vid_valid, u32 vid)
+{
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* rgid should be the same as lgid */
+ if (!ldev || rgid->gid != ldev->local_gid.gid ||
+ rgid->gid_ext != ldev->local_gid.gid_ext)
+ return -ENETUNREACH;
+ return 0;
+}
+
+static int smc_lo_supports_v2(void)
+{
+ return SMC_LO_V2_CAPABLE;
+}
+
+static void smc_lo_get_local_gid(struct smcd_dev *smcd,
+ struct smcd_gid *smcd_gid)
+{
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ smcd_gid->gid = ldev->local_gid.gid;
+ smcd_gid->gid_ext = ldev->local_gid.gid_ext;
+}
+
+static u16 smc_lo_get_chid(struct smcd_dev *smcd)
+{
+ return ((struct smc_lo_dev *)smcd->priv)->chid;
+}
+
+static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
+{
+ return &((struct smc_lo_dev *)smcd->priv)->dev;
+}
+
static const struct smcd_ops lo_ops = {
- .query_remote_gid = NULL,
+ .query_remote_gid = smc_lo_query_rgid,
.register_dmb = NULL,
.unregister_dmb = NULL,
.add_vlan_id = NULL,
@@ -32,10 +83,10 @@ static const struct smcd_ops lo_ops = {
.reset_vlan_required = NULL,
.signal_event = NULL,
.move_data = NULL,
- .supports_v2 = NULL,
- .get_local_gid = NULL,
- .get_chid = NULL,
- .get_dev = NULL,
+ .supports_v2 = smc_lo_supports_v2,
+ .get_local_gid = smc_lo_get_local_gid,
+ .get_chid = smc_lo_get_chid,
+ .get_dev = smc_lo_get_dev,
};

static struct smcd_dev *smcd_lo_alloc_dev(const struct smcd_ops *ops,
@@ -95,6 +146,7 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)

static int smc_lo_dev_init(struct smc_lo_dev *ldev)
{
+ smc_lo_generate_id(ldev);
return smcd_lo_register_dev(ldev);
}

diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 9dd44d4c0ca3..55b41133a97f 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -20,10 +20,13 @@

#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_MAX_DMBS 5000
+#define SMC_LO_CHID 0xFFFF

struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
+ u16 chid;
+ struct smcd_gid local_gid;
};
#endif

--
2.32.0.3.g01195cf9f


2024-03-12 14:30:39

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 07/11] net/smc: register loopback-ism into SMC-D device list

After loopback-ism device gets ready, add it to the SMC-D device list as
an ISMv2 device.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_ism.c | 11 +++++++----
net/smc/smc_ism.h | 1 +
net/smc/smc_loopback.c | 20 +++++++++++++-------
3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 26743a14cf27..c50910300a03 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -91,6 +91,11 @@ bool smc_ism_is_v2_capable(void)
return smc_ism_v2_capable;
}

+void smc_ism_set_v2_capable(void)
+{
+ smc_ism_v2_capable = true;
+}
+
/* Set a connection using this DMBE. */
void smc_ism_set_conn(struct smc_connection *conn)
{
@@ -454,11 +459,9 @@ static void smcd_register_dev(struct ism_dev *ism)
if (smc_pnetid_by_dev_port(&ism->pdev->dev, 0, smcd->pnetid))
smc_pnetid_by_table_smcd(smcd);

+ if (smcd->ops->supports_v2())
+ smc_ism_set_v2_capable();
mutex_lock(&smcd_dev_list.mutex);
- if (list_empty(&smcd_dev_list.list)) {
- if (smcd->ops->supports_v2())
- smc_ism_v2_capable = true;
- }
/* sort list: devices without pnetid before devices with pnetid */
if (smcd->pnetid[0])
list_add_tail(&smcd->list, &smcd_dev_list.list);
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 165cd013404b..2ea169c1301c 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -52,6 +52,7 @@ int smc_ism_signal_shutdown(struct smc_link_group *lgr);
void smc_ism_get_system_eid(u8 **eid);
u16 smc_ism_get_chid(struct smcd_dev *dev);
bool smc_ism_is_v2_capable(void);
+void smc_ism_set_v2_capable(void);
int smc_ism_init(void);
void smc_ism_exit(void);
int smcd_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 7335acb03920..c3e1133da8d7 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -275,10 +275,12 @@ static int smcd_lo_register_dev(struct smc_lo_dev *ldev)
return -ENOMEM;
ldev->smcd = smcd;
smcd->priv = ldev;
-
- /* TODO:
- * register loopback-ism to smcd_dev list.
- */
+ smc_ism_set_v2_capable();
+ mutex_lock(&smcd_dev_list.mutex);
+ list_add(&smcd->list, &smcd_dev_list.list);
+ mutex_unlock(&smcd_dev_list.mutex);
+ pr_warn_ratelimited("smc: adding smcd device %s\n",
+ smc_lo_dev_name);
return 0;
}

@@ -286,9 +288,13 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
{
struct smcd_dev *smcd = ldev->smcd;

- /* TODO:
- * unregister loopback-ism from smcd_dev list.
- */
+ pr_warn_ratelimited("smc: removing smcd device %s\n",
+ smc_lo_dev_name);
+ smcd->going_away = 1;
+ smc_smcd_terminate_all(smcd);
+ mutex_lock(&smcd_dev_list.mutex);
+ list_del_init(&smcd->list);
+ mutex_unlock(&smcd_dev_list.mutex);
kfree(smcd->conn);
kfree(smcd);
}
--
2.32.0.3.g01195cf9f


2024-03-12 14:31:09

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 08/11] net/smc: add operations to merge sndbuf with peer DMB

In some scenarios using Emulated-ISM device, sndbuf can share the same
physical memory region with peer DMB to avoid data copy from one side
to the other. In such case the sndbuf is only a descriptor that
describes the shared memory and does not actually occupy memory, it's
more like a ghost buffer.

+----------+ +----------+
| socket A | | socket B |
+----------+ +----------+
| |
+--------+ +--------+
| sndbuf | | DMB |
| desc | | desc |
+--------+ +--------+
| |
| +----v-----+
+--------------------------> memory |
+----------+

So here introduces three new SMC-D device operations to check if this
feature is supported by device, and to {attach|detach} ghost sndbuf to
peer DMB. For now only loopback-ism supports this.

Signed-off-by: Wen Gu <[email protected]>
---
include/net/smc.h | 3 +++
net/smc/smc_ism.c | 40 ++++++++++++++++++++++++++++++++++++++++
net/smc/smc_ism.h | 4 ++++
3 files changed, 47 insertions(+)

diff --git a/include/net/smc.h b/include/net/smc.h
index 6273c3a8b24a..01387631d8a6 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -62,6 +62,9 @@ struct smcd_ops {
int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
void *client);
int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+ int (*support_dmb_nocopy)(struct smcd_dev *dev);
+ int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+ int (*detach_dmb)(struct smcd_dev *dev, u64 token);
int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*set_vlan_required)(struct smcd_dev *dev);
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index c50910300a03..67ab9bce202a 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -246,6 +246,46 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
return rc;
}

+bool smc_ism_support_dmb_nocopy(struct smcd_dev *smcd)
+{
+ /* for now only loopback-ism supports
+ * merging sndbuf with peer DMB to avoid
+ * data copies between them.
+ */
+ return (smcd->ops->support_dmb_nocopy &&
+ smcd->ops->support_dmb_nocopy(smcd));
+}
+
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token,
+ struct smc_buf_desc *dmb_desc)
+{
+ struct smcd_dmb dmb;
+ int rc = 0;
+
+ if (!dev->ops->attach_dmb)
+ return -EINVAL;
+
+ memset(&dmb, 0, sizeof(dmb));
+ dmb.dmb_tok = token;
+ rc = dev->ops->attach_dmb(dev, &dmb);
+ if (!rc) {
+ dmb_desc->sba_idx = dmb.sba_idx;
+ dmb_desc->token = dmb.dmb_tok;
+ dmb_desc->cpu_addr = dmb.cpu_addr;
+ dmb_desc->dma_addr = dmb.dma_addr;
+ dmb_desc->len = dmb.dmb_len;
+ }
+ return rc;
+}
+
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token)
+{
+ if (!dev->ops->detach_dmb)
+ return -EINVAL;
+
+ return dev->ops->detach_dmb(dev, token);
+}
+
static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
struct sk_buff *skb,
struct netlink_callback *cb)
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 2ea169c1301c..83cf99265160 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -48,6 +48,10 @@ int smc_ism_put_vlan(struct smcd_dev *dev, unsigned short vlan_id);
int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
struct smc_buf_desc *dmb_desc);
int smc_ism_unregister_dmb(struct smcd_dev *dev, struct smc_buf_desc *dmb_desc);
+bool smc_ism_support_dmb_nocopy(struct smcd_dev *smcd);
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token,
+ struct smc_buf_desc *dmb_desc);
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token);
int smc_ism_signal_shutdown(struct smc_link_group *lgr);
void smc_ism_get_system_eid(u8 **eid);
u16 smc_ism_get_chid(struct smcd_dev *dev);
--
2.32.0.3.g01195cf9f


2024-03-12 14:31:37

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 06/11] net/smc: implement DMB-related operations of loopback-ism

This implements DMB (un)registration and data move operations of
loopback-ism device.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 131 ++++++++++++++++++++++++++++++++++++++++-
net/smc/smc_loopback.h | 13 ++++
2 files changed, 141 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 253128c77208..7335acb03920 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -15,11 +15,13 @@
#include <linux/types.h>
#include <net/smc.h>

+#include "smc_cdc.h"
#include "smc_ism.h"
#include "smc_loopback.h"

#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_V2_CAPABLE 0x1 /* loopback-ism acts as ISMv2 */
+#define SMC_DMA_ADDR_INVALID (~(dma_addr_t)0)

static const char smc_lo_dev_name[] = "loopback-ism";
static struct smc_lo_dev *lo_dev;
@@ -49,6 +51,93 @@ static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
return 0;
}

+static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
+ void *client_priv)
+{
+ struct smc_lo_dmb_node *dmb_node, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+ int sba_idx, rc;
+
+ /* check space for new dmb */
+ for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LO_MAX_DMBS) {
+ if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
+ break;
+ }
+ if (sba_idx == SMC_LO_MAX_DMBS)
+ return -ENOSPC;
+
+ dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
+ if (!dmb_node) {
+ rc = -ENOMEM;
+ goto err_bit;
+ }
+
+ dmb_node->sba_idx = sba_idx;
+ dmb_node->len = dmb->dmb_len;
+ dmb_node->cpu_addr = kzalloc(dmb_node->len, GFP_KERNEL |
+ __GFP_NOWARN | __GFP_NORETRY |
+ __GFP_NOMEMALLOC);
+ if (!dmb_node->cpu_addr) {
+ rc = -ENOMEM;
+ goto err_node;
+ }
+ dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
+
+again:
+ /* add new dmb into hash table */
+ get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
+ write_lock_bh(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_node->token) {
+ if (tmp_node->token == dmb_node->token) {
+ write_unlock_bh(&ldev->dmb_ht_lock);
+ goto again;
+ }
+ }
+ hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
+ write_unlock_bh(&ldev->dmb_ht_lock);
+
+ dmb->sba_idx = dmb_node->sba_idx;
+ dmb->dmb_tok = dmb_node->token;
+ dmb->cpu_addr = dmb_node->cpu_addr;
+ dmb->dma_addr = dmb_node->dma_addr;
+ dmb->dmb_len = dmb_node->len;
+
+ return 0;
+
+err_node:
+ kfree(dmb_node);
+err_bit:
+ clear_bit(sba_idx, ldev->sba_idx_mask);
+ return rc;
+}
+
+static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* remove dmb from hash table */
+ write_lock_bh(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+ if (tmp_node->token == dmb->dmb_tok) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ write_unlock_bh(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ hash_del(&dmb_node->list);
+ write_unlock_bh(&ldev->dmb_ht_lock);
+
+ clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+ kfree(dmb_node->cpu_addr);
+ kfree(dmb_node);
+
+ return 0;
+}
+
static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
{
return -EOPNOTSUPP;
@@ -75,6 +164,40 @@ static int smc_lo_signal_event(struct smcd_dev *dev, struct smcd_gid *rgid,
return 0;
}

+static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
+ unsigned int idx, bool sf, unsigned int offset,
+ void *data, unsigned int size)
+{
+ struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ read_lock_bh(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
+ if (tmp_node->token == dmb_tok) {
+ rmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!rmb_node) {
+ read_unlock_bh(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock_bh(&ldev->dmb_ht_lock);
+
+ memcpy((char *)rmb_node->cpu_addr + offset, data, size);
+
+ if (sf) {
+ struct smc_connection *conn =
+ smcd->conn[rmb_node->sba_idx];
+
+ if (conn && !conn->killed)
+ tasklet_schedule(&conn->rx_tsklet);
+ else
+ return -EPIPE;
+ }
+ return 0;
+}
+
static int smc_lo_supports_v2(void)
{
return SMC_LO_V2_CAPABLE;
@@ -101,14 +224,14 @@ static struct device *smc_lo_get_dev(struct smcd_dev *smcd)

static const struct smcd_ops lo_ops = {
.query_remote_gid = smc_lo_query_rgid,
- .register_dmb = NULL,
- .unregister_dmb = NULL,
+ .register_dmb = smc_lo_register_dmb,
+ .unregister_dmb = smc_lo_unregister_dmb,
.add_vlan_id = smc_lo_add_vlan_id,
.del_vlan_id = smc_lo_del_vlan_id,
.set_vlan_required = smc_lo_set_vlan_required,
.reset_vlan_required = smc_lo_reset_vlan_required,
.signal_event = smc_lo_signal_event,
- .move_data = NULL,
+ .move_data = smc_lo_move_data,
.supports_v2 = smc_lo_supports_v2,
.get_local_gid = smc_lo_get_local_gid,
.get_chid = smc_lo_get_chid,
@@ -173,6 +296,8 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
static int smc_lo_dev_init(struct smc_lo_dev *ldev)
{
smc_lo_generate_id(ldev);
+ rwlock_init(&ldev->dmb_ht_lock);
+ hash_init(ldev->dmb_ht);
return smcd_lo_register_dev(ldev);
}

diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 55b41133a97f..24ab9d747613 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -20,13 +20,26 @@

#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_MAX_DMBS 5000
+#define SMC_LO_DMBS_HASH_BITS 12
#define SMC_LO_CHID 0xFFFF

+struct smc_lo_dmb_node {
+ struct hlist_node list;
+ u64 token;
+ u32 len;
+ u32 sba_idx;
+ void *cpu_addr;
+ dma_addr_t dma_addr;
+};
+
struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
u16 chid;
struct smcd_gid local_gid;
+ rwlock_t dmb_ht_lock;
+ DECLARE_BITMAP(sba_idx_mask, SMC_LO_MAX_DMBS);
+ DECLARE_HASHTABLE(dmb_ht, SMC_LO_DMBS_HASH_BITS);
};
#endif

--
2.32.0.3.g01195cf9f


2024-03-12 14:31:43

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 09/11] net/smc: attach or detach ghost sndbuf to peer DMB

The ghost sndbuf descriptor will be created and attached to peer DMB
once peer token is obtained and it will be detach and freed when the
connection is freed.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/af_smc.c | 16 ++++++++++++
net/smc/smc_core.c | 61 +++++++++++++++++++++++++++++++++++++++++++++-
net/smc/smc_core.h | 1 +
3 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 751a8fdd6eb0..fce7a5b2ce5c 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1438,6 +1438,14 @@ static int smc_connect_ism(struct smc_sock *smc,
}

smc_conn_save_peer_info(smc, aclc);
+
+ if (smc_ism_support_dmb_nocopy(smc->conn.lgr->smcd)) {
+ rc = smcd_buf_attach(smc);
+ if (rc) {
+ rc = SMC_CLC_DECL_MEM; /* try to fallback */
+ goto connect_abort;
+ }
+ }
smc_close_init(smc);
smc_rx_init(smc);
smc_tx_init(smc);
@@ -2542,6 +2550,14 @@ static void smc_listen_work(struct work_struct *work)
mutex_unlock(&smc_server_lgr_pending);
}
smc_conn_save_peer_info(new_smc, cclc);
+
+ if (ini->is_smcd &&
+ smc_ism_support_dmb_nocopy(new_smc->conn.lgr->smcd)) {
+ rc = smcd_buf_attach(new_smc);
+ if (rc)
+ goto out_decl;
+ }
+
smc_listen_out_connected(new_smc);
SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk), ini);
goto out_free;
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 9b84d5897aa5..fafdb97adfad 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1149,6 +1149,20 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
}
}

+static void smcd_buf_detach(struct smc_connection *conn)
+{
+ struct smcd_dev *smcd = conn->lgr->smcd;
+ u64 peer_token = conn->peer_token;
+
+ if (!conn->sndbuf_desc)
+ return;
+
+ smc_ism_detach_dmb(smcd, peer_token);
+
+ kfree(conn->sndbuf_desc);
+ conn->sndbuf_desc = NULL;
+}
+
static void smc_buf_unuse(struct smc_connection *conn,
struct smc_link_group *lgr)
{
@@ -1192,6 +1206,8 @@ void smc_conn_free(struct smc_connection *conn)
if (lgr->is_smcd) {
if (!list_empty(&lgr->list))
smc_ism_unset_conn(conn);
+ if (smc_ism_support_dmb_nocopy(lgr->smcd))
+ smcd_buf_detach(conn);
tasklet_kill(&conn->rx_tsklet);
} else {
smc_cdc_wait_pend_tx_wr(conn);
@@ -1445,6 +1461,8 @@ static void smc_conn_kill(struct smc_connection *conn, bool soft)
smc_sk_wake_ups(smc);
if (conn->lgr->is_smcd) {
smc_ism_unset_conn(conn);
+ if (smc_ism_support_dmb_nocopy(conn->lgr->smcd))
+ smcd_buf_detach(conn);
if (soft)
tasklet_kill(&conn->rx_tsklet);
else
@@ -2464,12 +2482,18 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
int rc;

/* create send buffer */
+ if (is_smcd &&
+ smc_ism_support_dmb_nocopy(smc->conn.lgr->smcd))
+ goto create_rmb;
+
rc = __smc_buf_create(smc, is_smcd, false);
if (rc)
return rc;
+
+create_rmb:
/* create rmb */
rc = __smc_buf_create(smc, is_smcd, true);
- if (rc) {
+ if (rc && smc->conn.sndbuf_desc) {
down_write(&smc->conn.lgr->sndbufs_lock);
list_del(&smc->conn.sndbuf_desc->list);
up_write(&smc->conn.lgr->sndbufs_lock);
@@ -2479,6 +2503,41 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
return rc;
}

+int smcd_buf_attach(struct smc_sock *smc)
+{
+ struct smc_connection *conn = &smc->conn;
+ struct smcd_dev *smcd = conn->lgr->smcd;
+ u64 peer_token = conn->peer_token;
+ struct smc_buf_desc *buf_desc;
+ int rc;
+
+ buf_desc = kzalloc(sizeof(*buf_desc), GFP_KERNEL);
+ if (!buf_desc)
+ return -ENOMEM;
+
+ /* The ghost sndbuf_desc describes the same memory region as
+ * peer RMB. Its lifecycle is consistent with the connection's
+ * and it will be freed with the connections instead of the
+ * link group.
+ */
+ rc = smc_ism_attach_dmb(smcd, peer_token, buf_desc);
+ if (rc)
+ goto free;
+
+ smc->sk.sk_sndbuf = buf_desc->len;
+ buf_desc->cpu_addr =
+ (u8 *)buf_desc->cpu_addr + sizeof(struct smcd_cdc_msg);
+ buf_desc->len -= sizeof(struct smcd_cdc_msg);
+ conn->sndbuf_desc = buf_desc;
+ conn->sndbuf_desc->used = 1;
+ atomic_set(&conn->sndbuf_space, conn->sndbuf_desc->len);
+ return 0;
+
+free:
+ kfree(buf_desc);
+ return rc;
+}
+
static inline int smc_rmb_reserve_rtoken_idx(struct smc_link_group *lgr)
{
int i;
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 1f175376037b..d93cf51dbd7c 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -557,6 +557,7 @@ void smc_smcd_terminate(struct smcd_dev *dev, struct smcd_gid *peer_gid,
void smc_smcd_terminate_all(struct smcd_dev *dev);
void smc_smcr_terminate_all(struct smc_ib_device *smcibdev);
int smc_buf_create(struct smc_sock *smc, bool is_smcd);
+int smcd_buf_attach(struct smc_sock *smc);
int smc_uncompress_bufsize(u8 compressed);
int smc_rmb_rtoken_handling(struct smc_connection *conn, struct smc_link *link,
struct smc_clc_msg_accept_confirm *clc);
--
2.32.0.3.g01195cf9f


2024-03-12 14:32:12

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 10/11] net/smc: adapt cursor update when sndbuf and peer DMB are merged

Since ghost sndbuf shares the same physical memory with peer DMB,
the cursor update processing needs to be adapted to ensure that the
data to be consumed won't be overwritten.

So in this case, the fin_curs and sndbuf_space that were originally
updated after sending the CDC message should be modified to not be
update until the peer updates cons_curs.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_cdc.c | 52 +++++++++++++++++++++++++++++++++++++----------
1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 3c06625ceb20..bf5b214ec15a 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -18,6 +18,7 @@
#include "smc_tx.h"
#include "smc_rx.h"
#include "smc_close.h"
+#include "smc_ism.h"

/********************************** send *************************************/

@@ -255,17 +256,25 @@ int smcd_cdc_msg_send(struct smc_connection *conn)
return rc;
smc_curs_copy(&conn->rx_curs_confirmed, &curs, conn);
conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
- /* Calculate transmitted data and increment free send buffer space */
- diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
- &conn->tx_curs_sent);
- /* increased by confirmed number of bytes */
- smp_mb__before_atomic();
- atomic_add(diff, &conn->sndbuf_space);
- /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
- smp_mb__after_atomic();
- smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
+ if (!smc_ism_support_dmb_nocopy(conn->lgr->smcd)) {
+ /* Ghost sndbuf shares the same memory region with
+ * peer DMB, so don't update the tx_curs_fin and
+ * sndbuf_space until peer has consumed the data.
+ */
+ /* Calculate transmitted data and increment free
+ * send buffer space
+ */
+ diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+ &conn->tx_curs_sent);
+ /* increased by confirmed number of bytes */
+ smp_mb__before_atomic();
+ atomic_add(diff, &conn->sndbuf_space);
+ /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+ smp_mb__after_atomic();
+ smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);

- smc_tx_sndbuf_nonfull(smc);
+ smc_tx_sndbuf_nonfull(smc);
+ }
return rc;
}

@@ -323,7 +332,7 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
{
union smc_host_cursor cons_old, prod_old;
struct smc_connection *conn = &smc->conn;
- int diff_cons, diff_prod;
+ int diff_cons, diff_prod, diff_tx;

smc_curs_copy(&prod_old, &conn->local_rx_ctrl.prod, conn);
smc_curs_copy(&cons_old, &conn->local_rx_ctrl.cons, conn);
@@ -339,6 +348,27 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
atomic_add(diff_cons, &conn->peer_rmbe_space);
/* guarantee 0 <= peer_rmbe_space <= peer_rmbe_size */
smp_mb__after_atomic();
+
+ if (conn->lgr->is_smcd &&
+ smc_ism_support_dmb_nocopy(conn->lgr->smcd)) {
+ /* Ghost sndbuf shares the same memory region with
+ * peer RMB, so update tx_curs_fin and sndbuf_space
+ * when peer has consumed the data.
+ */
+ /* calculate peer rmb consumed data */
+ diff_tx = smc_curs_diff(conn->sndbuf_desc->len,
+ &conn->tx_curs_fin,
+ &conn->local_rx_ctrl.cons);
+ /* increase local sndbuf space and fin_curs */
+ smp_mb__before_atomic();
+ atomic_add(diff_tx, &conn->sndbuf_space);
+ /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+ smp_mb__after_atomic();
+ smc_curs_copy(&conn->tx_curs_fin,
+ &conn->local_rx_ctrl.cons, conn);
+
+ smc_tx_sndbuf_nonfull(smc);
+ }
}

diff_prod = smc_curs_diff(conn->rmb_desc->len, &prod_old,
--
2.32.0.3.g01195cf9f


2024-03-12 14:33:10

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 11/11] net/smc: implement DMB-merged operations of loopback-ism

This implements operations related to merging sndbuf with peer DMB in
loopback-ism. The DMB won't be freed until no sndbuf is attached to it.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 123 +++++++++++++++++++++++++++++++++++------
net/smc/smc_loopback.h | 3 +
2 files changed, 109 insertions(+), 17 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index c3e1133da8d7..202cd081c758 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -21,6 +21,7 @@

#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_V2_CAPABLE 0x1 /* loopback-ism acts as ISMv2 */
+#define SMC_LO_SUPPORT_NOCOPY 0x1
#define SMC_DMA_ADDR_INVALID (~(dma_addr_t)0)

static const char smc_lo_dev_name[] = "loopback-ism";
@@ -82,6 +83,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
goto err_node;
}
dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
+ refcount_set(&dmb_node->refcnt, 1);

again:
/* add new dmb into hash table */
@@ -95,6 +97,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
}
hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
write_unlock_bh(&ldev->dmb_ht_lock);
+ atomic_inc(&ldev->dmb_cnt);

dmb->sba_idx = dmb_node->sba_idx;
dmb->dmb_tok = dmb_node->token;
@@ -111,13 +114,29 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
return rc;
}

+static void __smc_lo_unregister_dmb(struct smc_lo_dev *ldev,
+ struct smc_lo_dmb_node *dmb_node)
+{
+ /* remove dmb from hash table */
+ write_lock_bh(&ldev->dmb_ht_lock);
+ hash_del(&dmb_node->list);
+ write_unlock_bh(&ldev->dmb_ht_lock);
+
+ clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+ kvfree(dmb_node->cpu_addr);
+ kfree(dmb_node);
+
+ if (atomic_dec_and_test(&ldev->dmb_cnt))
+ wake_up(&ldev->ldev_release);
+}
+
static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
{
struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
struct smc_lo_dev *ldev = smcd->priv;

- /* remove dmb from hash table */
- write_lock_bh(&ldev->dmb_ht_lock);
+ /* find dmb from hash table */
+ read_lock_bh(&ldev->dmb_ht_lock);
hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
if (tmp_node->token == dmb->dmb_tok) {
dmb_node = tmp_node;
@@ -125,16 +144,76 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
}
}
if (!dmb_node) {
- write_unlock_bh(&ldev->dmb_ht_lock);
+ read_unlock_bh(&ldev->dmb_ht_lock);
return -EINVAL;
}
- hash_del(&dmb_node->list);
- write_unlock_bh(&ldev->dmb_ht_lock);
+ read_unlock_bh(&ldev->dmb_ht_lock);

- clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
- kfree(dmb_node->cpu_addr);
- kfree(dmb_node);
+ if (refcount_dec_and_test(&dmb_node->refcnt))
+ __smc_lo_unregister_dmb(ldev, dmb_node);
+ return 0;
+}
+
+static int smc_lo_support_dmb_nocopy(struct smcd_dev *smcd)
+{
+ return SMC_LO_SUPPORT_NOCOPY;
+}
+
+static int smc_lo_attach_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;

+ /* find dmb_node according to dmb->dmb_tok */
+ read_lock_bh(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+ if (tmp_node->token == dmb->dmb_tok) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ read_unlock_bh(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock_bh(&ldev->dmb_ht_lock);
+
+ if (!refcount_inc_not_zero(&dmb_node->refcnt))
+ /* the dmb is being unregistered, but has
+ * not been removed from the hash table.
+ */
+ return -EINVAL;
+
+ /* provide dmb information */
+ dmb->sba_idx = dmb_node->sba_idx;
+ dmb->dmb_tok = dmb_node->token;
+ dmb->cpu_addr = dmb_node->cpu_addr;
+ dmb->dma_addr = dmb_node->dma_addr;
+ dmb->dmb_len = dmb_node->len;
+ return 0;
+}
+
+static int smc_lo_detach_dmb(struct smcd_dev *smcd, u64 token)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* find dmb_node according to dmb->dmb_tok */
+ read_lock_bh(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, token) {
+ if (tmp_node->token == token) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ read_unlock_bh(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock_bh(&ldev->dmb_ht_lock);
+
+ if (refcount_dec_and_test(&dmb_node->refcnt))
+ __smc_lo_unregister_dmb(ldev, dmb_node);
return 0;
}

@@ -170,6 +249,13 @@ static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
{
struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
struct smc_lo_dev *ldev = smcd->priv;
+ struct smc_connection *conn;
+
+ if (!sf)
+ /* since sndbuf is merged with peer DMB, there is
+ * no need to copy data from sndbuf to peer DMB.
+ */
+ return 0;

read_lock_bh(&ldev->dmb_ht_lock);
hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
@@ -186,15 +272,10 @@ static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,

memcpy((char *)rmb_node->cpu_addr + offset, data, size);

- if (sf) {
- struct smc_connection *conn =
- smcd->conn[rmb_node->sba_idx];
-
- if (conn && !conn->killed)
- tasklet_schedule(&conn->rx_tsklet);
- else
- return -EPIPE;
- }
+ conn = smcd->conn[rmb_node->sba_idx];
+ if (!conn || conn->killed)
+ return -EPIPE;
+ tasklet_schedule(&conn->rx_tsklet);
return 0;
}

@@ -226,6 +307,9 @@ static const struct smcd_ops lo_ops = {
.query_remote_gid = smc_lo_query_rgid,
.register_dmb = smc_lo_register_dmb,
.unregister_dmb = smc_lo_unregister_dmb,
+ .support_dmb_nocopy = smc_lo_support_dmb_nocopy,
+ .attach_dmb = smc_lo_attach_dmb,
+ .detach_dmb = smc_lo_detach_dmb,
.add_vlan_id = smc_lo_add_vlan_id,
.del_vlan_id = smc_lo_del_vlan_id,
.set_vlan_required = smc_lo_set_vlan_required,
@@ -304,12 +388,17 @@ static int smc_lo_dev_init(struct smc_lo_dev *ldev)
smc_lo_generate_id(ldev);
rwlock_init(&ldev->dmb_ht_lock);
hash_init(ldev->dmb_ht);
+ atomic_set(&ldev->dmb_cnt, 0);
+ init_waitqueue_head(&ldev->ldev_release);
+
return smcd_lo_register_dev(ldev);
}

static void smc_lo_dev_exit(struct smc_lo_dev *ldev)
{
smcd_lo_unregister_dev(ldev);
+ if (atomic_read(&ldev->dmb_cnt))
+ wait_event(ldev->ldev_release, !atomic_read(&ldev->dmb_cnt));
}

static void smc_lo_dev_release(struct device *dev)
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 24ab9d747613..9156a6c37e65 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -30,6 +30,7 @@ struct smc_lo_dmb_node {
u32 sba_idx;
void *cpu_addr;
dma_addr_t dma_addr;
+ refcount_t refcnt;
};

struct smc_lo_dev {
@@ -37,9 +38,11 @@ struct smc_lo_dev {
struct device dev;
u16 chid;
struct smcd_gid local_gid;
+ atomic_t dmb_cnt;
rwlock_t dmb_ht_lock;
DECLARE_BITMAP(sba_idx_mask, SMC_LO_MAX_DMBS);
DECLARE_HASHTABLE(dmb_ht, SMC_LO_DMBS_HASH_BITS);
+ wait_queue_head_t ldev_release;
};
#endif

--
2.32.0.3.g01195cf9f


2024-03-12 14:35:09

by Jan Karcher

[permalink] [raw]
Subject: Re: [PATCH net-next v3 00/11] net/smc: SMC intra-OS shortcut with loopback-ism



On 12/03/2024 15:27, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.

Hi Wen Gu,

Re-running the tests with this version and finishing up review.
Please give me some time.

Thank you
- Jan

>
> - Background
>
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-implemented Emulated-ISM device,
> that is the loopback-ism device here, to accelerate inter-process or
> inter-containers communication within the same OS instance.
>
> - Design
>
> This patch set includes 3 parts:
>
> - Patch #1-#2: some prepare work for loopback-ism.
> - Patch #3-#7: implement loopback-ism device. Noted that loopback-ism now
> serves only SMC and no userspace interface exposed.
> - Patch #10-#15: memory copy optimization for intra-OS scenario.
>
> The loopback-ism device is designed as an ISMv2 device and not be limited to
> a specific net namespace, ends of both inter-process connection (1/1' in diagram
> below) or inter-container connection (2/2' in diagram below) can find the same
> available loopback-ism and choose it during the CLC handshake.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ +-------+ +-------+ | | +-------+ |
> | | App A | | App B | | App C | | | | App D |<-+ |
> | +-------+ +---^---+ +-------+ | | +-------+ |(2') |
> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
> | | | |
> Kernel | | | |
> +----+-------v---+-----------v----------------------------------+---+----+
> | | TCP | |
> | | | |
> | +--------------------------------------------------------------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
> loopback-ism device creates DMBs (shared memory) for each connection peer.
> Since data transfer occurs within the same kernel, the sndbuf of each peer
> is only a descriptor and point to the same memory region as peer DMB, so that
> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ | | +-------+ |
> | | App C |-----+ | | | App D | |
> | +-------+ | | | +-^-----+ |
> | | | | | |
> | (2) | | | (2') | |
> | | | | | |
> +---------------|-------------------------+ +----------|--------------+
> | |
> Kernel | |
> +---------------|-----------------------------------------|--------------+
> | +--------+ +--v-----+ +--------+ +--------+ |
> | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
> | +-----|--+ +--|-----+ +-----|--+ +--------+ |
> | +-----|--+ | +-----|--+ |
> | | DMB C | +---------------------------------| DMB D | |
> | +--------+ +--------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
> - Benchmark Test
>
> * Test environments:
> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
> - SMC sndbuf/DMB size 1MB.
>
> * Test object:
> - TCP: run on TCP loopback.
> - SMC lo: run on SMC loopback-ism.
>
> 1. ipc-benchmark (see [3])
>
> - ./<foo> -c 1000000 -s 100
>
> TCP SMC-lo
> Message
> rate (msg/s) 81433 143938(+76.75%)
>
> 2. sockperf
>
> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>
> TCP SMC-lo
> Bandwidth(MBps) 4903.07 7978.69(+62.73%)
> Latency(us) 6.095 3.539(-41.94%)
>
> 3. nginx/wrk
>
> - serv: <smc_run> nginx
> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>
> TCP SMC-lo
> Requests/s 161665.67 244272.41(+51.10%)
>
> 4. redis-benchmark
>
> - serv: <smc_run> redis-server
> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>
> TCP SMC-lo
> GET(Requests/s) 88790.23 117474.30(+32.31%)
> SET(Requests/s) 87508.20 118623.96(+35.57%)
>
>
> Change log:
>
> v3->v2:
> - Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler()
> to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh()
> to acquire dmb_ht_lock.
>
> v2->v1:
> Link: https://lore.kernel.org/netdev/[email protected]/
> - All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1.
> - Patch #3: optimized the description of SMC_LO config. Avoid exposing loopback-ism
> to sysfs and remove all the knobs until future definition clear.
> - Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data().
> - Patch #6: defaultly use physical contiguous DMB buffers.
> - Patch #11: defaultly enable DMB no-copy for loopback-ism and free the DMB in
> unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using
> wait_event to keep waiting in unregister_dmb.
>
> v1->RFC:
> Link: https://lore.kernel.org/netdev/[email protected]/
> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
> /sys/devices/virtual/smc/loopback-ism/xfer_bytes
> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
> merging sndbuf with peer DMB.
> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
> control of whether to merge sndbuf and DMB. They can be respectively set by:
> /sys/devices/virtual/smc/loopback-ism/dmb_type
> /sys/devices/virtual/smc/loopback-ism/dmb_copy
> The motivation for these two control is that a performance bottleneck was
> found when using vzalloced DMB and sndbuf is merged with DMB, and there are
> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
> vmap lock contention [6]. It has significant effects, but using virtual memory
> still has additional overhead compared to using physical memory.
> So this new version provides controls of dmb_type and dmb_copy to suit
> different scenarios.
> - Some minor changes and comments improvements.
>
> RFC->old version([1]):
> Link: https://lore.kernel.org/netdev/[email protected]/
> - Patch #1: improve the loopback-ism dump, it shows as follows now:
> # smcd d
> FID Type PCI-ID PCHID InUse #LGs PNET-ID
> 0000 0 loopback-ism ffff No 0
> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
> smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
> regardless of whether there is already a device in smcd device list.
> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
> to activate or deactivate the loopback-ism.
> - Patch #9: introduce the statistics of loopback-ism by
> /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
> - Some minor changes and comments improvements.
>
> [1] https://lore.kernel.org/netdev/[email protected]/
> [2] https://lore.kernel.org/netdev/[email protected]/
> [3] https://github.com/goldsborough/ipc-bench
> [4] https://lore.kernel.org/all/[email protected]/
> [5] https://lore.kernel.org/all/[email protected]/
> [6] https://lore.kernel.org/all/[email protected]/
>
> Wen Gu (11):
> net/smc: adapt SMC-D device dump for Emulated-ISM
> net/smc: decouple ism_client from SMC-D DMB registration
> net/smc: introduce loopback-ism for SMC intra-OS shortcut
> net/smc: implement ID-related operations of loopback-ism
> net/smc: implement some unsupported operations of loopback-ism
> net/smc: implement DMB-related operations of loopback-ism
> net/smc: register loopback-ism into SMC-D device list
> net/smc: add operations to merge sndbuf with peer DMB
> net/smc: attach or detach ghost sndbuf to peer DMB
> net/smc: adapt cursor update when sndbuf and peer DMB are merged
> net/smc: implement DMB-merged operations of loopback-ism
>
> drivers/s390/net/ism_drv.c | 2 +-
> include/net/smc.h | 7 +-
> net/smc/Kconfig | 13 ++
> net/smc/Makefile | 2 +-
> net/smc/af_smc.c | 28 ++-
> net/smc/smc_cdc.c | 52 ++++-
> net/smc/smc_core.c | 61 ++++-
> net/smc/smc_core.h | 1 +
> net/smc/smc_ism.c | 71 +++++-
> net/smc/smc_ism.h | 5 +
> net/smc/smc_loopback.c | 462 +++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 52 +++++
> 12 files changed, 727 insertions(+), 29 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h
>

2024-03-12 14:38:49

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v3 05/11] net/smc: implement some unsupported operations of loopback-ism

vlan operations are not supported currently since the need for vlan in
loopback-ism situation does not seem to be strong.

signal_event operation is not supported since no event now needs to be
processed by loopback-ism device.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 36 +++++++++++++++++++++++++++++++-----
1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 7656a2474500..253128c77208 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -49,6 +49,32 @@ static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
return 0;
}

+static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_set_vlan_required(struct smcd_dev *smcd)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_reset_vlan_required(struct smcd_dev *smcd)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_signal_event(struct smcd_dev *dev, struct smcd_gid *rgid,
+ u32 trigger_irq, u32 event_code, u64 info)
+{
+ return 0;
+}
+
static int smc_lo_supports_v2(void)
{
return SMC_LO_V2_CAPABLE;
@@ -77,11 +103,11 @@ static const struct smcd_ops lo_ops = {
.query_remote_gid = smc_lo_query_rgid,
.register_dmb = NULL,
.unregister_dmb = NULL,
- .add_vlan_id = NULL,
- .del_vlan_id = NULL,
- .set_vlan_required = NULL,
- .reset_vlan_required = NULL,
- .signal_event = NULL,
+ .add_vlan_id = smc_lo_add_vlan_id,
+ .del_vlan_id = smc_lo_del_vlan_id,
+ .set_vlan_required = smc_lo_set_vlan_required,
+ .reset_vlan_required = smc_lo_reset_vlan_required,
+ .signal_event = smc_lo_signal_event,
.move_data = NULL,
.supports_v2 = smc_lo_supports_v2,
.get_local_gid = smc_lo_get_local_gid,
--
2.32.0.3.g01195cf9f


2024-03-12 14:52:43

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v3 00/11] net/smc: SMC intra-OS shortcut with loopback-ism



On 2024/3/12 22:33, Jan Karcher wrote:
>
>
> On 12/03/2024 15:27, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>
> Hi Wen Gu,
>
> Re-running the tests with this version and finishing up review.
> Please give me some time.
>

Sure! Thank you very much for your time and help!

> Thank you
> - Jan
>
>>
>> - Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>> on the non-s390 architecture through a software-implemented Emulated-ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
>> inter-containers communication within the same OS instance.
>>
>> - Design
>>
>> This patch set includes 3 parts:
>>
>>   - Patch #1-#2: some prepare work for loopback-ism.
>>   - Patch #3-#7: implement loopback-ism device. Noted that loopback-ism now
>>     serves only SMC and no userspace interface exposed.
>>   - Patch #10-#15: memory copy optimization for intra-OS scenario.
>>
>> The loopback-ism device is designed as an ISMv2 device and not be limited to
>> a specific net namespace, ends of both inter-process connection (1/1' in diagram
>> below) or inter-container connection (2/2' in diagram below) can find the same
>> available loopback-ism and choose it during the CLC handshake.
>>
>>   Container 1 (ns1)                              Container 2 (ns2)
>>   +-----------------------------------------+    +-------------------------+
>>   | +-------+      +-------+      +-------+ |    |        +-------+        |
>>   | | App A |      | App B |      | App C | |    |        | App D |<-+     |
>>   | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
>>   |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
>>   |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
>>   |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
>>   +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>>                |   |           |                                  |
>>   Kernel       |   |           |                                  |
>>   +----+-------v---+-----------v----------------------------------+---+----+
>>   |    |                            TCP                               |    |
>>   |    |                                                              |    |
>>   |    +--------------------------------------------------------------+    |
>>   |                                                                        |
>>   |                           +--------------+                             |
>>   |                           | smc loopback |                             |
>>   +---------------------------+--------------+-----------------------------+
>>
>> loopback-ism device creates DMBs (shared memory) for each connection peer.
>> Since data transfer occurs within the same kernel, the sndbuf of each peer
>> is only a descriptor and point to the same memory region as peer DMB, so that
>> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>>
>>   Container 1 (ns1)                              Container 2 (ns2)
>>   +-----------------------------------------+    +-------------------------+
>>   | +-------+                               |    |        +-------+        |
>>   | | App C |-----+                         |    |        | App D |        |
>>   | +-------+     |                         |    |        +-^-----+        |
>>   |               |                         |    |          |              |
>>   |           (2) |                         |    |     (2') |              |
>>   |               |                         |    |          |              |
>>   +---------------|-------------------------+    +----------|--------------+
>>                   |                                         |
>>   Kernel          |                                         |
>>   +---------------|-----------------------------------------|--------------+
>>   | +--------+ +--v-----+                           +--------+ +--------+  |
>>   | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
>>   | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
>>   | +-----|--+    |                                 +-----|--+             |
>>   | | DMB C  |    +---------------------------------| DMB D  |             |
>>   | +--------+                                      +--------+             |
>>   |                                                                        |
>>   |                           +--------------+                             |
>>   |                           | smc loopback |                             |
>>   +---------------------------+--------------+-----------------------------+
>>
>> - Benchmark Test
>>
>>   * Test environments:
>>        - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>        - SMC sndbuf/DMB size 1MB.
>>
>>   * Test object:
>>        - TCP: run on TCP loopback.
>>        - SMC lo: run on SMC loopback-ism.
>>
>> 1. ipc-benchmark (see [3])
>>
>>   - ./<foo> -c 1000000 -s 100
>>
>>                              TCP                  SMC-lo
>> Message
>> rate (msg/s)              81433                  143938(+76.75%)
>>
>> 2. sockperf
>>
>>   - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>   - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1
>> -t 30
>>
>>                              TCP                  SMC-lo
>> Bandwidth(MBps)         4903.07                 7978.69(+62.73%)
>> Latency(us)               6.095                   3.539(-41.94%)
>>
>> 3. nginx/wrk
>>
>>   - serv: <smc_run> nginx
>>   - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>>
>>                             TCP                   SMC-lo
>> Requests/s           161665.67                244272.41(+51.10%)
>>
>> 4. redis-benchmark
>>
>>   - serv: <smc_run> redis-server
>>   - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>>
>>                             TCP                   SMC-lo
>> GET(Requests/s)       88790.23                117474.30(+32.31%)
>> SET(Requests/s)       87508.20                118623.96(+35.57%)
>>
>>
>> Change log:
>>
>> v3->v2:
>> - Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler()
>>    to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh()
>>    to acquire dmb_ht_lock.
>>
>> v2->v1:
>> Link: https://lore.kernel.org/netdev/[email protected]/
>> - All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1.
>> - Patch #3: optimized the description of SMC_LO config. Avoid exposing loopback-ism
>>    to sysfs and remove all the knobs until future definition clear.
>> - Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data().
>> - Patch #6: defaultly use physical contiguous DMB buffers.
>> - Patch #11: defaultly enable DMB no-copy for loopback-ism and free the DMB in
>>    unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using
>>    wait_event to keep waiting in unregister_dmb.
>>
>> v1->RFC:
>> Link: https://lore.kernel.org/netdev/[email protected]/
>> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>>    /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>>    merging sndbuf with peer DMB.
>> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>>    control of whether to merge sndbuf and DMB. They can be respectively set by:
>>    /sys/devices/virtual/smc/loopback-ism/dmb_type
>>    /sys/devices/virtual/smc/loopback-ism/dmb_copy
>>    The motivation for these two control is that a performance bottleneck was
>>    found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>>    many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>>    by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>>    or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>>    vmap lock contention [6]. It has significant effects, but using virtual memory
>>    still has additional overhead compared to using physical memory.
>>    So this new version provides controls of dmb_type and dmb_copy to suit
>>    different scenarios.
>> - Some minor changes and comments improvements.
>>
>> RFC->old version([1]):
>> Link: https://lore.kernel.org/netdev/[email protected]/
>> - Patch #1: improve the loopback-ism dump, it shows as follows now:
>>    # smcd d
>>    FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>>    0000 0     loopback-ism  ffff   No        0
>> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>>    smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>>    regardless of whether there is already a device in smcd device list.
>> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>>    to activate or deactivate the loopback-ism.
>> - Patch #9: introduce the statistics of loopback-ism by
>>    /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
>> - Some minor changes and comments improvements.
>>
>> [1] https://lore.kernel.org/netdev/[email protected]/
>> [2] https://lore.kernel.org/netdev/[email protected]/
>> [3] https://github.com/goldsborough/ipc-bench
>> [4] https://lore.kernel.org/all/[email protected]/
>> [5] https://lore.kernel.org/all/[email protected]/
>> [6] https://lore.kernel.org/all/[email protected]/
>>
>> Wen Gu (11):
>>    net/smc: adapt SMC-D device dump for Emulated-ISM
>>    net/smc: decouple ism_client from SMC-D DMB registration
>>    net/smc: introduce loopback-ism for SMC intra-OS shortcut
>>    net/smc: implement ID-related operations of loopback-ism
>>    net/smc: implement some unsupported operations of loopback-ism
>>    net/smc: implement DMB-related operations of loopback-ism
>>    net/smc: register loopback-ism into SMC-D device list
>>    net/smc: add operations to merge sndbuf with peer DMB
>>    net/smc: attach or detach ghost sndbuf to peer DMB
>>    net/smc: adapt cursor update when sndbuf and peer DMB are merged
>>    net/smc: implement DMB-merged operations of loopback-ism
>>
>>   drivers/s390/net/ism_drv.c |   2 +-
>>   include/net/smc.h          |   7 +-
>>   net/smc/Kconfig            |  13 ++
>>   net/smc/Makefile           |   2 +-
>>   net/smc/af_smc.c           |  28 ++-
>>   net/smc/smc_cdc.c          |  52 ++++-
>>   net/smc/smc_core.c         |  61 ++++-
>>   net/smc/smc_core.h         |   1 +
>>   net/smc/smc_ism.c          |  71 +++++-
>>   net/smc/smc_ism.h          |   5 +
>>   net/smc/smc_loopback.c     | 462 +++++++++++++++++++++++++++++++++++++
>>   net/smc/smc_loopback.h     |  52 +++++
>>   12 files changed, 727 insertions(+), 29 deletions(-)
>>   create mode 100644 net/smc/smc_loopback.c
>>   create mode 100644 net/smc/smc_loopback.h
>>

2024-03-12 15:00:48

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH net-next v3 00/11] net/smc: SMC intra-OS shortcut with loopback-ism

On Tue, 2024-03-12 at 22:27 +0800, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.
>
> - Background
>
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-implemented Emulated-ISM device,
> that is the loopback-ism device here, to accelerate inter-process or
> inter-containers communication within the same OS instance.
>
> - Design
>
> This patch set includes 3 parts:
>
> - Patch #1-#2: some prepare work for loopback-ism.
> - Patch #3-#7: implement loopback-ism device. Noted that loopback-ism now
> serves only SMC and no userspace interface exposed.
> - Patch #10-#15: memory copy optimization for intra-OS scenario.
>
> The loopback-ism device is designed as an ISMv2 device and not be limited to
> a specific net namespace, ends of both inter-process connection (1/1' in diagram
> below) or inter-container connection (2/2' in diagram below) can find the same
> available loopback-ism and choose it during the CLC handshake.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ +-------+ +-------+ | | +-------+ |
> | | App A | | App B | | App C | | | | App D |<-+ |
> | +-------+ +---^---+ +-------+ | | +-------+ |(2') |
> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
> | | | |
> Kernel | | | |
> +----+-------v---+-----------v----------------------------------+---+----+
> | | TCP | |
> | | | |
> | +--------------------------------------------------------------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
> loopback-ism device creates DMBs (shared memory) for each connection peer.
> Since data transfer occurs within the same kernel, the sndbuf of each peer
> is only a descriptor and point to the same memory region as peer DMB, so that
> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ | | +-------+ |
> | | App C |-----+ | | | App D | |
> | +-------+ | | | +-^-----+ |
> | | | | | |
> | (2) | | | (2') | |
> | | | | | |
> +---------------|-------------------------+ +----------|--------------+
> | |
> Kernel | |
> +---------------|-----------------------------------------|--------------+
> | +--------+ +--v-----+ +--------+ +--------+ |
> | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
> | +-----|--+ +--|-----+ +-----|--+ +--------+ |
> | +-----|--+ | +-----|--+ |
> | | DMB C | +---------------------------------| DMB D | |
> | +--------+ +--------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
> - Benchmark Test
>
> * Test environments:
> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
> - SMC sndbuf/DMB size 1MB.
>
> * Test object:
> - TCP: run on TCP loopback.
> - SMC lo: run on SMC loopback-ism.
>
> 1. ipc-benchmark (see [3])
>
> - ./<foo> -c 1000000 -s 100
>
> TCP SMC-lo
> Message
> rate (msg/s) 81433 143938(+76.75%)
>
> 2. sockperf
>
> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>
> TCP SMC-lo
> Bandwidth(MBps) 4903.07 7978.69(+62.73%)
> Latency(us) 6.095 3.539(-41.94%)
>
> 3. nginx/wrk
>
> - serv: <smc_run> nginx
> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>
> TCP SMC-lo
> Requests/s 161665.67 244272.41(+51.10%)
>
> 4. redis-benchmark
>
> - serv: <smc_run> redis-server
> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>
> TCP SMC-lo
> GET(Requests/s) 88790.23 117474.30(+32.31%)
> SET(Requests/s) 87508.20 118623.96(+35.57%)
>
>
> Change log:
>
> v3->v2:
> - Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler()
> to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh()
> to acquire dmb_ht_lock.
>
> v2->v1:
> Link: https://lore.kernel.org/netdev/[email protected]/
> - All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1.
> - Patch #3: optimized the description of SMC_LO config. Avoid exposing loopback-ism
> to sysfs and remove all the knobs until future definition clear.
> - Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data().
> - Patch #6: defaultly use physical contiguous DMB buffers.
> - Patch #11: defaultly enable DMB no-copy for loopback-ism and free the DMB in
> unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using
> wait_event to keep waiting in unregister_dmb.
>
> v1->RFC:
> Link: https://lore.kernel.org/netdev/[email protected]/
> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
> /sys/devices/virtual/smc/loopback-ism/xfer_bytes
> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
> merging sndbuf with peer DMB.
> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
> control of whether to merge sndbuf and DMB. They can be respectively set by:
> /sys/devices/virtual/smc/loopback-ism/dmb_type
> /sys/devices/virtual/smc/loopback-ism/dmb_copy
> The motivation for these two control is that a performance bottleneck was
> found when using vzalloced DMB and sndbuf is merged with DMB, and there are
> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
> vmap lock contention [6]. It has significant effects, but using virtual memory
> still has additional overhead compared to using physical memory.
> So this new version provides controls of dmb_type and dmb_copy to suit
> different scenarios.
> - Some minor changes and comments improvements.
>
## Form letter - net-next-closed

The merge window for v6.9 has begun and we have already posted our pull
request. Therefore net-next is closed for new drivers, features, code
refactoring and optimizations. We are currently accepting bug fixes
only.

Please repost when net-next reopens after March 25th.

RFC patches sent for review only are obviously welcome at any time.

See:
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle

--
pw-bot: defer



2024-03-14 10:31:53

by Jan Karcher

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/11] net/smc: adapt SMC-D device dump for Emulated-ISM



On 12/03/2024 15:27, Wen Gu wrote:
> The introduction of Emulated-ISM requires adaptation of SMC-D device
> dump. Software implemented non-PCI device (loopback-ism) should be
> handled correctly and the CHID reserved for Emulated-ISM should be got
> from smcd_ops interface instead of PCI information.
>
> Signed-off-by: Wen Gu <[email protected]>
> ---
> net/smc/smc_ism.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
> index ac88de2a06a0..b6eca4231913 100644
> --- a/net/smc/smc_ism.c
> +++ b/net/smc/smc_ism.c
> @@ -252,12 +252,11 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
> char smc_pnet[SMC_MAX_PNETID_LEN + 1];
> struct smc_pci_dev smc_pci_dev;
> struct nlattr *port_attrs;
> + struct device *device;
> struct nlattr *attrs;
> - struct ism_dev *ism;
> int use_cnt = 0;
> void *nlh;
>
> - ism = smcd->priv;
> nlh = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
> &smc_gen_nl_family, NLM_F_MULTI,
> SMC_NETLINK_GET_DEV_SMCD);
> @@ -272,7 +271,15 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
> if (nla_put_u8(skb, SMC_NLA_DEV_IS_CRIT, use_cnt > 0))
> goto errattr;
> memset(&smc_pci_dev, 0, sizeof(smc_pci_dev));
> - smc_set_pci_values(to_pci_dev(ism->dev.parent), &smc_pci_dev);
> + device = smcd->ops->get_dev(smcd);
> + if (device->parent)
> + smc_set_pci_values(to_pci_dev(device->parent), &smc_pci_dev);
> + if (smc_ism_is_emulated(smcd)) {
> + smc_pci_dev.pci_pchid = smc_ism_get_chid(smcd);
> + if (!device->parent)
> + snprintf(smc_pci_dev.pci_id, sizeof(smc_pci_dev.pci_id),
> + "%s", dev_name(device));
> + }

Hi Wen Gu,

playing around with the loopback-ism device and testing it, i developed
some concerns regarding this exposure. Basically this enables us to see
the loopback device in the `smcd device` tool without any changes.
E.g.:
```
# smcd device
FID Type PCI-ID PCHID InUse #LGs PNET-ID
0000 0 loopback-ism ffff No 0
102a ISM 0000:00:00.0 07c2 No 0 NET1
```

Now the problem with this is that "loopback-ism" is no valid PCI-ID and
should not be there. My first thought was to put an "n/a" instead, but
that opens up another problem. Currently you could set - even if it does
not make sense - a PNET_ID for the loopback device:
```
# smc_pnet -a -D loopback-ism NET1
# smcd device
FID Type PCI-ID PCHID InUse #LGs PNET-ID
0000 0 loopback-ism ffff No 0 *NET1
102a ISM 0000:00:00.0 07c2 No 0 NET1
```
If we would change the PCI-ID to "n/a" it would be a valid input
parameter for the tooling which is... to put it nice... not so beautiful.
I brainstormed this with them team and the problem is more complex.
In theory there shouldn't be PCI information set for the loopback
device. There should be a better abstraction in the netlink interface
that creates this output and the tooling should be made aware of it.

Do you rely on the output currently? What are your thoughts about it?
If not I'd ask you to not fill the netlink interface for the loopback
device and refactor it with the next stage when we create a right
interface for it.

Since the Merge-Window is closed feel free to send new versions as RFC.
Thank you
- Jan

> if (nla_put_u32(skb, SMC_NLA_DEV_PCI_FID, smc_pci_dev.pci_fid))
> goto errattr;
> if (nla_put_u16(skb, SMC_NLA_DEV_PCI_CHID, smc_pci_dev.pci_pchid))

2024-03-15 03:44:38

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/11] net/smc: adapt SMC-D device dump for Emulated-ISM



On 2024/3/14 18:23, Jan Karcher wrote:
>
>
> On 12/03/2024 15:27, Wen Gu wrote:
>> The introduction of Emulated-ISM requires adaptation of SMC-D device
>> dump. Software implemented non-PCI device (loopback-ism) should be
>> handled correctly and the CHID reserved for Emulated-ISM should be got
>> from smcd_ops interface instead of PCI information.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>>   net/smc/smc_ism.c | 13 ++++++++++---
>>   1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
>> index ac88de2a06a0..b6eca4231913 100644
>> --- a/net/smc/smc_ism.c
>> +++ b/net/smc/smc_ism.c
>> @@ -252,12 +252,11 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
>>       char smc_pnet[SMC_MAX_PNETID_LEN + 1];
>>       struct smc_pci_dev smc_pci_dev;
>>       struct nlattr *port_attrs;
>> +    struct device *device;
>>       struct nlattr *attrs;
>> -    struct ism_dev *ism;
>>       int use_cnt = 0;
>>       void *nlh;
>> -    ism = smcd->priv;
>>       nlh = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
>>                 &smc_gen_nl_family, NLM_F_MULTI,
>>                 SMC_NETLINK_GET_DEV_SMCD);
>> @@ -272,7 +271,15 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
>>       if (nla_put_u8(skb, SMC_NLA_DEV_IS_CRIT, use_cnt > 0))
>>           goto errattr;
>>       memset(&smc_pci_dev, 0, sizeof(smc_pci_dev));
>> -    smc_set_pci_values(to_pci_dev(ism->dev.parent), &smc_pci_dev);
>> +    device = smcd->ops->get_dev(smcd);
>> +    if (device->parent)
>> +        smc_set_pci_values(to_pci_dev(device->parent), &smc_pci_dev);
>> +    if (smc_ism_is_emulated(smcd)) {
>> +        smc_pci_dev.pci_pchid = smc_ism_get_chid(smcd);
>> +        if (!device->parent)
>> +            snprintf(smc_pci_dev.pci_id, sizeof(smc_pci_dev.pci_id),
>> +                 "%s", dev_name(device));
>> +    }
>
> Hi Wen Gu,
>
> playing around with the loopback-ism device and testing it, i developed some concerns regarding this exposure. Basically
> this enables us to see the loopback device in the `smcd device` tool without any changes.
> E.g.:
> ```
> # smcd device
> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
> 0000 0     loopback-ism  ffff   No        0
> 102a ISM   0000:00:00.0  07c2   No        0  NET1
> ```
>
> Now the problem with this is that "loopback-ism" is no valid PCI-ID and should not be there. My first thought was to put
> an "n/a" instead, but that opens up another problem. Currently you could set - even if it does not make sense - a
> PNET_ID for the loopback device:
> ```

Yes, and we can exclude loopback-ism in smc_pnet_enter() if necessary.

> # smc_pnet -a -D loopback-ism NET1
> # smcd device
> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
> 0000 0     loopback-ism  ffff   No        0  *NET1
> 102a ISM   0000:00:00.0  07c2   No        0  NET1
> ```
> If we would change the PCI-ID to "n/a" it would be a valid input parameter for the tooling which is... to put it nice...
> not so beautiful.

FID Type PCI-ID PCHID InUse #LGs PNET-ID
0000 0 n/a ffff No 0

IIUC, the problem is that the 'n/a', which would be an input for other tools, is somewhat strange?

Since PCHID 0xffff is clear defined for loopback-ism, I am wondering if it can be a specific sign
of loopback-ism for tooling to not take PCI-ID into account? Would you also consider that inelegant?

> I brainstormed this with them team and the problem is more complex.
> In theory there shouldn't be PCI information set for the loopback device. There should be a better abstraction in the
> netlink interface that creates this output and the tooling should be made aware of it.
>

Yes, it is better. But I couldn't surely picture how the abstraction looks like, and if it is necessary
to introduce it just for a special case of loopback-ism (note that virtio-ISM also has PCI information),
since we can identify loopback-ism by CHID.

> Do you rely on the output currently? What are your thoughts about it?
> If not I'd ask you to not fill the netlink interface for the loopback device and refactor it with the next stage when we
> create a right interface for it.
>

Currently we don't rely on the output, and I have no objection to the proposal that not fill the netlink
interface for loopback-ism and refactor it in the next stage.

> Since the Merge-Window is closed feel free to send new versions as RFC.

OK, I will send the new version as an RFC.

Thank you!

> Thank you
> - Jan
>
>>       if (nla_put_u32(skb, SMC_NLA_DEV_PCI_FID, smc_pci_dev.pci_fid))
>>           goto errattr;
>>       if (nla_put_u16(skb, SMC_NLA_DEV_PCI_CHID, smc_pci_dev.pci_pchid))

2024-03-15 10:30:03

by Jan Karcher

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/11] net/smc: adapt SMC-D device dump for Emulated-ISM



On 15/03/2024 04:44, Wen Gu wrote:
>
>
> On 2024/3/14 18:23, Jan Karcher wrote:
>>
>>
>> On 12/03/2024 15:27, Wen Gu wrote:
>>> The introduction of Emulated-ISM requires adaptation of SMC-D device
>>> dump. Software implemented non-PCI device (loopback-ism) should be
>>> handled correctly and the CHID reserved for Emulated-ISM should be got
>>> from smcd_ops interface instead of PCI information.
>>>
>>> Signed-off-by: Wen Gu <[email protected]>
>>> ---
>>>   net/smc/smc_ism.c | 13 ++++++++++---
>>>   1 file changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
>>> index ac88de2a06a0..b6eca4231913 100644
>>> --- a/net/smc/smc_ism.c
>>> +++ b/net/smc/smc_ism.c
>>> @@ -252,12 +252,11 @@ static int smc_nl_handle_smcd_dev(struct
>>> smcd_dev *smcd,
>>>       char smc_pnet[SMC_MAX_PNETID_LEN + 1];
>>>       struct smc_pci_dev smc_pci_dev;
>>>       struct nlattr *port_attrs;
>>> +    struct device *device;
>>>       struct nlattr *attrs;
>>> -    struct ism_dev *ism;
>>>       int use_cnt = 0;
>>>       void *nlh;
>>> -    ism = smcd->priv;
>>>       nlh = genlmsg_put(skb, NETLINK_CB(cb->skb).portid,
>>> cb->nlh->nlmsg_seq,
>>>                 &smc_gen_nl_family, NLM_F_MULTI,
>>>                 SMC_NETLINK_GET_DEV_SMCD);
>>> @@ -272,7 +271,15 @@ static int smc_nl_handle_smcd_dev(struct
>>> smcd_dev *smcd,
>>>       if (nla_put_u8(skb, SMC_NLA_DEV_IS_CRIT, use_cnt > 0))
>>>           goto errattr;
>>>       memset(&smc_pci_dev, 0, sizeof(smc_pci_dev));
>>> -    smc_set_pci_values(to_pci_dev(ism->dev.parent), &smc_pci_dev);
>>> +    device = smcd->ops->get_dev(smcd);
>>> +    if (device->parent)
>>> +        smc_set_pci_values(to_pci_dev(device->parent), &smc_pci_dev);
>>> +    if (smc_ism_is_emulated(smcd)) {
>>> +        smc_pci_dev.pci_pchid = smc_ism_get_chid(smcd);
>>> +        if (!device->parent)
>>> +            snprintf(smc_pci_dev.pci_id, sizeof(smc_pci_dev.pci_id),
>>> +                 "%s", dev_name(device));
>>> +    }
>>
>> Hi Wen Gu,
>>
>> playing around with the loopback-ism device and testing it, i
>> developed some concerns regarding this exposure. Basically this
>> enables us to see the loopback device in the `smcd device` tool
>> without any changes.
>> E.g.:
>> ```
>> # smcd device
>> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>> 0000 0     loopback-ism  ffff   No        0
>> 102a ISM   0000:00:00.0  07c2   No        0  NET1
>> ```
>>
>> Now the problem with this is that "loopback-ism" is no valid PCI-ID
>> and should not be there. My first thought was to put an "n/a" instead,
>> but that opens up another problem. Currently you could set - even if
>> it does not make sense - a PNET_ID for the loopback device:
>> ```
>
> Yes, and we can exclude loopback-ism in smc_pnet_enter() if necessary.

We could, but we have to make sure we implement those distinctions at
the right location. E.g.: if virtio-ism is coming. Does a PNETID make
sense for a virtio-ism device? Do we want to exclude is also there or do
we want an abstracted layer/mechanism to recognize if a device has a
PNETId capability or not?

>
>> # smc_pnet -a -D loopback-ism NET1
>> # smcd device
>> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>> 0000 0     loopback-ism  ffff   No        0  *NET1
>> 102a ISM   0000:00:00.0  07c2   No        0  NET1
>> ```
>> If we would change the PCI-ID to "n/a" it would be a valid input
>> parameter for the tooling which is... to put it nice... not so beautiful.
>
> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
> 0000 0     n/a           ffff   No        0
>
> IIUC, the problem is that the 'n/a', which would be an input for other
> tools, is somewhat strange?

Exactly.

>
> Since PCHID 0xffff is clear defined for loopback-ism, I am wondering if
> it can be a specific sign
> of loopback-ism for tooling to not take PCI-ID into account? Would you
> also consider that inelegant?

I think deciding on PCHID is the only way we can currently differentiate
what kind of device we are talking about. So my guess would be that
PCHID is going to play a big role in future design decisions.

>
>> I brainstormed this with them team and the problem is more complex.
>> In theory there shouldn't be PCI information set for the loopback
>> device. There should be a better abstraction in the netlink interface
>> that creates this output and the tooling should be made aware of it.
>>
>
> Yes, it is better. But I couldn't surely picture how the abstraction
> looks like, and if it is necessary
> to introduce it just for a special case of loopback-ism (note that
> virtio-ISM also has PCI information),
> since we can identify loopback-ism by CHID.

Please take the following with a grain of salt. I just want to give you
a bit insight of our current train of thought. None of it is final or
set in stone. The idea would be to have device core information that
contain the information which other fields are important for this
device. And corresponding "extensions" that contain the information. The
tooling cvould then decide soley on the core information which features
are supported by a device and which are not.
If that is really needed: Not sure yet. Is this the best solution:
Propably not.
E.g.:

SMC-D netlink abstraction

+------------------------------------+
| Core information |
| (e.g. PCHID, InUse, isPCI, isS390) |
+------------------------------------+

+----------------+
| s390 extension |
| (e.g.FID) |
+----------------+

+--------------------+
| PCI extension |
| (e.g. PCI-ID, ...) |
+--------------------+


>
>> Do you rely on the output currently? What are your thoughts about it?
>> If not I'd ask you to not fill the netlink interface for the loopback
>> device and refactor it with the next stage when we create a right
>> interface for it.
>>
>
> Currently we don't rely on the output, and I have no objection to the
> proposal that not fill the netlink
> interface for loopback-ism and refactor it in the next stage.

Thank you! If you have ideas regarding the design of the interface hit
us up. As soon as we are going to think about this further I'm going to
invite you to those discussions.

>
>> Since the Merge-Window is closed feel free to send new versions as RFC.
>
> OK, I will send the new version as an RFC.
>
> Thank you!

Thanks!
- Jan

>
>> Thank you
>> - Jan
>>
>>>       if (nla_put_u32(skb, SMC_NLA_DEV_PCI_FID, smc_pci_dev.pci_fid))
>>>           goto errattr;
>>>       if (nla_put_u16(skb, SMC_NLA_DEV_PCI_CHID, smc_pci_dev.pci_pchid))

2024-03-15 12:29:48

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/11] net/smc: adapt SMC-D device dump for Emulated-ISM



On 2024/3/15 18:27, Jan Karcher wrote:
>
>
> On 15/03/2024 04:44, Wen Gu wrote:
>>
>>
>> On 2024/3/14 18:23, Jan Karcher wrote:
>>>
>>>
>>> On 12/03/2024 15:27, Wen Gu wrote:
>>>> The introduction of Emulated-ISM requires adaptation of SMC-D device
>>>> dump. Software implemented non-PCI device (loopback-ism) should be
>>>> handled correctly and the CHID reserved for Emulated-ISM should be got
>>>> from smcd_ops interface instead of PCI information.
>>>>
>>>> Signed-off-by: Wen Gu <[email protected]>
>>>> ---
>>>>   net/smc/smc_ism.c | 13 ++++++++++---
>>>>   1 file changed, 10 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
>>>> index ac88de2a06a0..b6eca4231913 100644
>>>> --- a/net/smc/smc_ism.c
>>>> +++ b/net/smc/smc_ism.c
>>>> @@ -252,12 +252,11 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
>>>>       char smc_pnet[SMC_MAX_PNETID_LEN + 1];
>>>>       struct smc_pci_dev smc_pci_dev;
>>>>       struct nlattr *port_attrs;
>>>> +    struct device *device;
>>>>       struct nlattr *attrs;
>>>> -    struct ism_dev *ism;
>>>>       int use_cnt = 0;
>>>>       void *nlh;
>>>> -    ism = smcd->priv;
>>>>       nlh = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
>>>>                 &smc_gen_nl_family, NLM_F_MULTI,
>>>>                 SMC_NETLINK_GET_DEV_SMCD);
>>>> @@ -272,7 +271,15 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
>>>>       if (nla_put_u8(skb, SMC_NLA_DEV_IS_CRIT, use_cnt > 0))
>>>>           goto errattr;
>>>>       memset(&smc_pci_dev, 0, sizeof(smc_pci_dev));
>>>> -    smc_set_pci_values(to_pci_dev(ism->dev.parent), &smc_pci_dev);
>>>> +    device = smcd->ops->get_dev(smcd);
>>>> +    if (device->parent)
>>>> +        smc_set_pci_values(to_pci_dev(device->parent), &smc_pci_dev);
>>>> +    if (smc_ism_is_emulated(smcd)) {
>>>> +        smc_pci_dev.pci_pchid = smc_ism_get_chid(smcd);
>>>> +        if (!device->parent)
>>>> +            snprintf(smc_pci_dev.pci_id, sizeof(smc_pci_dev.pci_id),
>>>> +                 "%s", dev_name(device));
>>>> +    }
>>>
>>> Hi Wen Gu,
>>>
>>> playing around with the loopback-ism device and testing it, i developed some concerns regarding this exposure.
>>> Basically this enables us to see the loopback device in the `smcd device` tool without any changes.
>>> E.g.:
>>> ```
>>> # smcd device
>>> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>>> 0000 0     loopback-ism  ffff   No        0
>>> 102a ISM   0000:00:00.0  07c2   No        0  NET1
>>> ```
>>>
>>> Now the problem with this is that "loopback-ism" is no valid PCI-ID and should not be there. My first thought was to
>>> put an "n/a" instead, but that opens up another problem. Currently you could set - even if it does not make sense - a
>>> PNET_ID for the loopback device:
>>> ```
>>
>> Yes, and we can exclude loopback-ism in smc_pnet_enter() if necessary.
>
> We could, but we have to make sure we implement those distinctions at the right location. E.g.: if virtio-ism is coming.
> Does a PNETID make sense for a virtio-ism device? Do we want to exclude is also there or do we want an abstracted
> layer/mechanism to recognize if a device has a PNETId capability or not?
>

Understand, a long-term view is better.

>>
>>> # smc_pnet -a -D loopback-ism NET1
>>> # smcd device
>>> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>>> 0000 0     loopback-ism  ffff   No        0  *NET1
>>> 102a ISM   0000:00:00.0  07c2   No        0  NET1
>>> ```
>>> If we would change the PCI-ID to "n/a" it would be a valid input parameter for the tooling which is... to put it
>>> nice... not so beautiful.
>>
>> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>> 0000 0     n/a           ffff   No        0
>>
>> IIUC, the problem is that the 'n/a', which would be an input for other tools, is somewhat strange?
>
> Exactly.
>
>>
>> Since PCHID 0xffff is clear defined for loopback-ism, I am wondering if it can be a specific sign
>> of loopback-ism for tooling to not take PCI-ID into account? Would you also consider that inelegant?
>
> I think deciding on PCHID is the only way we can currently differentiate what kind of device we are talking about. So my
> guess would be that PCHID is going to play a big role in future design decisions.
>

Make sense to me.

>>
>>> I brainstormed this with them team and the problem is more complex.
>>> In theory there shouldn't be PCI information set for the loopback device. There should be a better abstraction in the
>>> netlink interface that creates this output and the tooling should be made aware of it.
>>>
>>
>> Yes, it is better. But I couldn't surely picture how the abstraction looks like, and if it is necessary
>> to introduce it just for a special case of loopback-ism (note that virtio-ISM also has PCI information),
>> since we can identify loopback-ism by CHID.
>
> Please take the following with a grain of salt. I just want to give you a bit insight of our current train of thought.
> None of it is final or set in stone. The idea would be to have device core information that contain the information
> which other fields are important for this device. And corresponding "extensions" that contain the information. The
> tooling cvould then decide soley on the core information which features are supported by a device and which are not.
> If that is really needed: Not sure yet. Is this the best solution: Propably not.
> E.g.:
>
> SMC-D netlink abstraction
>
> +------------------------------------+
> | Core information                   |
> | (e.g. PCHID, InUse, isPCI, isS390) |
> +------------------------------------+
>
> +----------------+
> | s390 extension |
> | (e.g.FID)      |
> +----------------+
>
> +--------------------+
> | PCI extension      |
> | (e.g. PCI-ID, ...) |
> +--------------------+
>
>

I like this diagram and it is very clear. So core information will be applicable to all ISM devices,
and the extension information will be specific to some certain kinds.

>>
>>> Do you rely on the output currently? What are your thoughts about it?
>>> If not I'd ask you to not fill the netlink interface for the loopback device and refactor it with the next stage when
>>> we create a right interface for it.
>>>
>>
>> Currently we don't rely on the output, and I have no objection to the proposal that not fill the netlink
>> interface for loopback-ism and refactor it in the next stage.
>
> Thank you! If you have ideas regarding the design of the interface hit us up. As soon as we are going to think about
> this further I'm going to invite you to those discussions.
>

Sure! and thank you very much!


Best regards,
Wen Gu

>> >>> Since the Merge-Window is closed feel free to send new versions as RFC.
>>
>> OK, I will send the new version as an RFC.
>>
>> Thank you!
>
> Thanks!
> - Jan
>
>>
>>> Thank you
>>> - Jan
>>>
>>>>       if (nla_put_u32(skb, SMC_NLA_DEV_PCI_FID, smc_pci_dev.pci_fid))
>>>>           goto errattr;
>>>>       if (nla_put_u16(skb, SMC_NLA_DEV_PCI_CHID, smc_pci_dev.pci_pchid))