Hi, all
# Background
As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
to accelerate TCP applications in cloud environments, improving inter-host
or inter-VM communication.
In addition, we also see value in SMC-D for local inter-process
communication, such as accelerating communication between containers on
the same host. So this RFC provides an SMC-D loopback solution for such
scenarios, bringing a significant improvement in latency and throughput
compared to TCP loopback.
# Design
This patch set provides a kind of SMC-D loopback solution.
Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for
inter-process communication acceleration. Besides loopback acceleration,
the dummy device also meets the requirement mentioned in [2]: providing a
way for the broader community to test SMC-D logic without an ISM device.
+------------------------------------------+
| +-----------+ +-----------+ |
| | process A | | process B | |
| +-----------+ +-----------+ |
| ^ ^ |
| | +---------------+ | |
| | | SMC stack | | |
| +--->| +-----------+ |<--| |
| | | dummy | | |
| | | device | | |
| +-+-----------+-+ |
| VM |
+------------------------------------------+
Patch #3/5, #4/5 and #5/5 provide a way to avoid the data copy from sndbuf
to RMB and improve SMC-D loopback performance. By extending smcd_ops with
two new semantics, attach_dmb and detach_dmb, the sender's sndbuf shares
the same physical memory region with the receiver's RMB. Data copied from
userspace to the sender's sndbuf directly reaches the receiver's RMB
without an unnecessary memory copy within the same kernel.
+----------+ +----------+
| socket A | | socket B |
+----------+ +----------+
| ^
| +---------+ |
regard as | | ----------|
local sndbuf | B's | regard as
| | RMB | local RMB
|-------> | |
+---------+
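To make the attach/detach idea concrete, here is a minimal userspace model of it. This is NOT kernel code and none of these names exist in the patches; it only illustrates the principle: a registered DMB owns a buffer, and attach_dmb resolves a token to that same buffer with refcounting, so "sender sndbuf" and "receiver RMB" alias one memory region and no copy between them is needed.

```c
/* Userspace sketch of the attach_dmb/detach_dmb principle.
 * All names and the fixed-size table are illustrative only.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct dmb_node {
	uint64_t token;
	void *cpu_addr;
	size_t len;
	int refcnt;
	int used;
};

#define MAX_DMBS 16
static struct dmb_node dmb_tbl[MAX_DMBS];

static struct dmb_node *find_dmb(uint64_t token)
{
	for (int i = 0; i < MAX_DMBS; i++)
		if (dmb_tbl[i].used && dmb_tbl[i].token == token)
			return &dmb_tbl[i];
	return NULL;
}

/* register: allocate the RMB and record its token */
static struct dmb_node *register_dmb(uint64_t token, size_t len)
{
	for (int i = 0; i < MAX_DMBS; i++) {
		if (!dmb_tbl[i].used) {
			dmb_tbl[i] = (struct dmb_node){
				.token = token,
				.cpu_addr = calloc(1, len),
				.len = len, .refcnt = 1, .used = 1,
			};
			return &dmb_tbl[i];
		}
	}
	return NULL;
}

/* attach: the sender's sndbuf becomes an alias of the receiver's RMB */
static void *attach_dmb(uint64_t token)
{
	struct dmb_node *node = find_dmb(token);

	if (!node)
		return NULL;
	node->refcnt++;
	return node->cpu_addr;
}

/* detach: drop the alias; free only when the last user is gone */
static int detach_dmb(uint64_t token)
{
	struct dmb_node *node = find_dmb(token);

	if (!node)
		return -1;
	if (--node->refcnt == 0) {
		free(node->cpu_addr);
		node->used = 0;
	}
	return 0;
}
```

Writing through the attached pointer is immediately visible through the registered one, which is exactly why lo_move_data can skip the memcpy for attached buffers.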
# Benchmark Test
* Test environments:
- VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
- SMC sndbuf/RMB size 1MB.
* Test object:
- TCP: run on TCP loopback.
- domain: run on UNIX domain.
- SMC-lo: run on SMC loopback device with patch #1/5 ~ #2/5.
- SMC-lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
1. ipc-benchmark (see [3])
- ./<foo> -c 1000000 -s 100
              TCP        domain           SMC-lo             SMC-lo-nocpy
Message
rate (msg/s)  75140      129548(+72.41%)  152266(+102.64%)   151914(+102.17%)
2. sockperf
- serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
- clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
TCP SMC-lo SMC-lo-nocpy
Bandwidth(MBps) 4943.359 4936.096(-0.15%) 8239.624(+66.68%)
Latency(us) 6.372 3.359(-47.28%) 3.25(-49.00%)
3. iperf3
- serv: <smc_run> taskset -c <cpu> iperf3 -s
- clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
TCP SMC-lo SMC-lo-nocpy
Bitrate(Gb/s) 40.5 41.4(+2.22%) 76.4(+88.64%)
4. nginx/wrk
- serv: <smc_run> nginx
- clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
TCP SMC-lo SMC-lo-nocpy
Requests/s 154643.22 220894.03(+42.84%) 226754.3(+46.63%)
# Discussion
1. API between SMC-D and ISM device
As Jan mentioned in [2], IBM are working on placing an API between SMC-D
and the ISM device for easier use of different "devices" for SMC-D.
So, considering that the introduction of attach_dmb or detach_dmb can
effectively avoid data copying from sndbuf to RMB and brings obvious
throughput advantages in inter-VM or inter-process scenarios, can the
attach/detach semantics be taken into consideration when designing the
API to make it a standard ISM device behavior?
Maybe our RFCs of SMC-D based inter-process acceleration (this one) and
inter-VM acceleration (coming soon, as an update of [1]) can provide some
examples for the new API design. And we would be very glad to discuss
this on the mailing list.
2. Way to select different ISM-like devices
With the proposal of the SMC-D loopback 'device' (this RFC) and the
upcoming device for inter-VM acceleration (an update of [1]), SMC-D has
more options to choose from. So we need to consider how to indicate the
supported devices, how to determine which one to use, and their priority...
IMHO, this may require an update of the CLC message and negotiation
mechanism. Again, we would be very glad to discuss this with you on the
mailing list.
[1] https://lore.kernel.org/netdev/[email protected]/
[2] https://lore.kernel.org/netdev/[email protected]/
[3] https://github.com/goldsborough/ipc-bench
v1->v2
1. Fix some build WARNINGs reported by the kernel test robot
Reported-by: kernel test robot <[email protected]>
2. Add iperf3 test data.
Wen Gu (5):
net/smc: introduce SMC-D loopback device
net/smc: choose loopback device in SMC-D communication
net/smc: add dmb attach and detach interface
net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
net/smc: logic of cursors update in SMC-D loopback connections
include/net/smc.h | 3 +
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 88 +++++++++++-
net/smc/smc_cdc.c | 59 ++++++--
net/smc/smc_cdc.h | 1 +
net/smc/smc_clc.c | 4 +-
net/smc/smc_core.c | 62 +++++++++
net/smc/smc_core.h | 2 +
net/smc/smc_ism.c | 39 +++++-
net/smc/smc_ism.h | 2 +
net/smc/smc_loopback.c | 358 +++++++++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 63 +++++++++
12 files changed, 662 insertions(+), 21 deletions(-)
create mode 100644 net/smc/smc_loopback.c
create mode 100644 net/smc/smc_loopback.h
--
1.8.3.1
This patch extends smcd_ops, adding two more semantics for
SMC-D devices:
- attach_dmb:
Attach an already registered DMB to a specific buf_desc,
so that the DMB can be referred to through this buf_desc.
- detach_dmb:
The reverse of attach_dmb: detach the DMB from the
buf_desc.
This interface extension prepares for eliminating the data
copy from sndbuf to RMB in the SMC-D loopback device.
Signed-off-by: Wen Gu <[email protected]>
---
include/net/smc.h | 2 ++
net/smc/smc_ism.c | 36 ++++++++++++++++++++++++++
net/smc/smc_ism.h | 2 ++
net/smc/smc_loopback.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 4 +++
5 files changed, 113 insertions(+)
diff --git a/include/net/smc.h b/include/net/smc.h
index 7699f97..60a96f7 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -63,6 +63,8 @@ struct smcd_ops {
u32 vid);
int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+ int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+ int (*detach_dmb)(struct smcd_dev *dev, u64 token);
int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*set_vlan_required)(struct smcd_dev *dev);
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 1d10435..2049388 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -202,6 +202,42 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
return rc;
}
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token,
+ struct smc_buf_desc *dmb_desc)
+{
+ struct smcd_dmb dmb;
+ int rc = 0;
+
+ memset(&dmb, 0, sizeof(dmb));
+ dmb.dmb_tok = token;
+
+ /* only support loopback device now */
+ if (!dev->is_loopback)
+ return -EINVAL;
+ if (!dev->ops->attach_dmb)
+ return -EINVAL;
+
+ rc = dev->ops->attach_dmb(dev, &dmb);
+ if (!rc) {
+ dmb_desc->sba_idx = dmb.sba_idx;
+ dmb_desc->token = dmb.dmb_tok;
+ dmb_desc->cpu_addr = dmb.cpu_addr;
+ dmb_desc->dma_addr = dmb.dma_addr;
+ dmb_desc->len = dmb.dmb_len;
+ }
+ return rc;
+}
+
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token)
+{
+ if (!dev->is_loopback)
+ return -EINVAL;
+ if (!dev->ops->detach_dmb)
+ return -EINVAL;
+
+ return dev->ops->detach_dmb(dev, token);
+}
+
static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
struct sk_buff *skb,
struct netlink_callback *cb)
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index d6b2db6..9022979 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -38,6 +38,8 @@ struct smc_ism_vlanid { /* VLAN id set on ISM device */
int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
struct smc_buf_desc *dmb_desc);
int smc_ism_unregister_dmb(struct smcd_dev *dev, struct smc_buf_desc *dmb_desc);
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token, struct smc_buf_desc *dmb_desc);
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token);
int smc_ism_signal_shutdown(struct smc_link_group *lgr);
void smc_ism_get_system_eid(u8 **eid);
u16 smc_ism_get_chid(struct smcd_dev *dev);
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 973382a..bc3ff82 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -68,6 +68,7 @@ static int lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
goto err_node;
}
dmb_node->len = dmb->dmb_len;
+ refcount_set(&dmb_node->refcnt, 1);
/* TODO: token is random but not exclusive !
 * we should look up the token in the dmb hash table; if this token
@@ -78,6 +79,7 @@ static int lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
write_lock(&ldev->dmb_ht_lock);
hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
write_unlock(&ldev->dmb_ht_lock);
+ atomic_inc(&ldev->dmb_cnt);
dmb->sba_idx = dmb_node->sba_idx;
dmb->dmb_tok = dmb_node->token;
@@ -115,9 +117,69 @@ static int lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
write_unlock(&ldev->dmb_ht_lock);
clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+
+ /* wait for dmb refcnt equal to 0 */
+ if (!refcount_dec_and_test(&dmb_node->refcnt))
+ wait_event(ldev->dmbs_release, !refcount_read(&dmb_node->refcnt));
kfree(dmb_node->cpu_addr);
kfree(dmb_node);
+ if (atomic_dec_and_test(&ldev->dmb_cnt))
+ wake_up(&ldev->ldev_release);
+
+ return 0;
+}
+
+static int lo_attach_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct lo_dev *ldev = smcd->priv;
+
+ /* find dmb_node according to dmb->dmb_tok */
+ read_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+ if (tmp_node->token == dmb->dmb_tok) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ read_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock(&ldev->dmb_ht_lock);
+ refcount_inc(&dmb_node->refcnt);
+
+ /* provide dmb information */
+ dmb->sba_idx = dmb_node->sba_idx;
+ dmb->dmb_tok = dmb_node->token;
+ dmb->cpu_addr = dmb_node->cpu_addr;
+ dmb->dma_addr = dmb_node->dma_addr;
+ dmb->dmb_len = dmb_node->len;
+ return 0;
+}
+
+static int lo_detach_dmb(struct smcd_dev *smcd, u64 token)
+{
+ struct lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct lo_dev *ldev = smcd->priv;
+
+ /* find dmb_node according to dmb->dmb_tok */
+ read_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, token) {
+ if (tmp_node->token == token) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ read_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock(&ldev->dmb_ht_lock);
+
+ if (refcount_dec_and_test(&dmb_node->refcnt))
+ wake_up_all(&ldev->dmbs_release);
return 0;
}
@@ -193,6 +255,8 @@ static u16 lo_get_chid(struct smcd_dev *smcd)
.query_remote_gid = lo_query_rgid,
.register_dmb = lo_register_dmb,
.unregister_dmb = lo_unregister_dmb,
+ .attach_dmb = lo_attach_dmb,
+ .detach_dmb = lo_detach_dmb,
.add_vlan_id = lo_add_vlan_id,
.del_vlan_id = lo_del_vlan_id,
.set_vlan_required = lo_set_vlan_required,
@@ -218,6 +282,9 @@ static int lo_dev_init(struct lo_dev *ldev)
ldev->lgid = smcd->local_gid;
rwlock_init(&ldev->dmb_ht_lock);
hash_init(ldev->dmb_ht);
+ atomic_set(&ldev->dmb_cnt, 0);
+ init_waitqueue_head(&ldev->dmbs_release);
+ init_waitqueue_head(&ldev->ldev_release);
return smcd_register_dev(smcd);
}
@@ -255,6 +322,8 @@ static int lo_dev_probe(void)
static void lo_dev_exit(struct lo_dev *ldev)
{
smcd_unregister_dev(ldev->smcd);
+ if (atomic_read(&ldev->dmb_cnt))
+ wait_event(ldev->ldev_release, !atomic_read(&ldev->dmb_cnt));
}
static void lo_dev_remove(void)
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index d7f7815..f4122be 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -32,6 +32,7 @@ struct lo_dmb_node {
u32 sba_idx;
void *cpu_addr;
dma_addr_t dma_addr;
+ refcount_t refcnt;
};
struct lo_dev {
@@ -41,6 +42,9 @@ struct lo_dev {
DECLARE_BITMAP(sba_idx_mask, LODEV_MAX_DMBS);
rwlock_t dmb_ht_lock;
DECLARE_HASHTABLE(dmb_ht, LODEV_MAX_DMBS_BUCKETS);
+ atomic_t dmb_cnt;
+ wait_queue_head_t dmbs_release;
+ wait_queue_head_t ldev_release;
};
struct lo_systemeid {
--
1.8.3.1
This patch allows SMC-D to use the loopback device.
But note that the implementation here is quite simple and informal.
The loopback device is always chosen first, and fallback happens
if loopback communication is impossible.
It still needs to be discussed how the client indicates to the peer
that multiple SMC-D devices are available and how the server picks a
suitable one.
Signed-off-by: Wen Gu <[email protected]>
---
net/smc/af_smc.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++------
net/smc/smc_clc.c | 4 +++-
net/smc/smc_ism.c | 3 ++-
3 files changed, 54 insertions(+), 8 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 9546c02..b9884c8 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -979,6 +979,28 @@ static int smc_find_ism_device(struct smc_sock *smc, struct smc_init_info *ini)
return 0;
}
+/* check if there is a lo device available for this connection. */
+static int smc_find_lo_device(struct smc_sock *smc, struct smc_init_info *ini)
+{
+ struct smcd_dev *sdev;
+
+ mutex_lock(&smcd_dev_list.mutex);
+ list_for_each_entry(sdev, &smcd_dev_list.list, list) {
+ if (sdev->is_loopback && !sdev->going_away &&
+ (!ini->ism_peer_gid[0] ||
+ !smc_ism_cantalk(ini->ism_peer_gid[0], ini->vlan_id,
+ sdev))) {
+ ini->ism_dev[0] = sdev;
+ break;
+ }
+ }
+ mutex_unlock(&smcd_dev_list.mutex);
+ if (!ini->ism_dev[0])
+ return SMC_CLC_DECL_NOSMCDDEV;
+ ini->ism_chid[0] = smc_ism_get_chid(ini->ism_dev[0]);
+ return 0;
+}
+
/* is chid unique for the ism devices that are already determined? */
static bool smc_find_ism_v2_is_unique_chid(u16 chid, struct smc_init_info *ini,
int cnt)
@@ -1044,10 +1066,20 @@ static int smc_find_proposal_devices(struct smc_sock *smc,
{
int rc = 0;
- /* check if there is an ism device available */
+ /* TODO:
+ * How to indicate to peer if ism device and loopback
+ * device are both available ?
+ *
+ * The RFC patch hasn't resolved this, just simply always
+ * chooses loopback device first, and fallback if loopback
+ * communication is impossible.
+ *
+ */
+ /* check if there is an ism or loopback device available */
if (!(ini->smcd_version & SMC_V1) ||
- smc_find_ism_device(smc, ini) ||
- smc_connect_ism_vlan_setup(smc, ini))
+ (smc_find_lo_device(smc, ini) &&
+ (smc_find_ism_device(smc, ini) ||
+ smc_connect_ism_vlan_setup(smc, ini))))
ini->smcd_version &= ~SMC_V1;
/* else ISM V1 is supported for this connection */
@@ -2135,9 +2167,20 @@ static void smc_find_ism_v1_device_serv(struct smc_sock *new_smc,
goto not_found;
ini->is_smcd = true; /* prepare ISM check */
ini->ism_peer_gid[0] = ntohll(pclc_smcd->ism.gid);
- rc = smc_find_ism_device(new_smc, ini);
- if (rc)
- goto not_found;
+
+ /* TODO:
+ * How to know that peer has both loopback and ism device ?
+ *
+ * The RFC patch hasn't resolved this, simply tries loopback
+ * device first, then ism device.
+ */
+ /* find available loopback or ism device */
+ if (smc_find_lo_device(new_smc, ini)) {
+ rc = smc_find_ism_device(new_smc, ini);
+ if (rc)
+ goto not_found;
+ }
+
ini->ism_selected = 0;
rc = smc_listen_ism_init(new_smc, ini);
if (!rc)
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index dfb9797..3887692 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -486,7 +486,9 @@ static int smc_clc_prfx_set4_rcu(struct dst_entry *dst, __be32 ipv4,
return -ENODEV;
in_dev_for_each_ifa_rcu(ifa, in_dev) {
- if (!inet_ifa_match(ipv4, ifa))
+ /* add loopback support */
+ if (inet_addr_type(dev_net(dst->dev), ipv4) != RTN_LOCAL &&
+ !inet_ifa_match(ipv4, ifa))
continue;
prop->prefix_len = inet_mask_len(ifa->ifa_mask);
prop->outgoing_subnet = ifa->ifa_address & ifa->ifa_mask;
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 911fe08..1d10435 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -227,7 +227,8 @@ static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
if (nla_put_u8(skb, SMC_NLA_DEV_IS_CRIT, use_cnt > 0))
goto errattr;
memset(&smc_pci_dev, 0, sizeof(smc_pci_dev));
- smc_set_pci_values(to_pci_dev(smcd->dev.parent), &smc_pci_dev);
+ if (!smcd->is_loopback)
+ smc_set_pci_values(to_pci_dev(smcd->dev.parent), &smc_pci_dev);
if (nla_put_u32(skb, SMC_NLA_DEV_PCI_FID, smc_pci_dev.pci_fid))
goto errattr;
if (nla_put_u16(skb, SMC_NLA_DEV_PCI_CHID, smc_pci_dev.pci_pchid))
--
1.8.3.1
This patch introduces a loopback device for SMC-D, thus enabling
SMC communication between two local sockets within one kernel.
The loopback device supports the basic capabilities defined by SMC-D,
including registering DMBs, unregistering DMBs and moving data.
Considering that there is no ISM device on servers except IBM z13,
the loopback device can also be used as a dummy device to test SMC-D
logic for the broad community.
Signed-off-by: Wen Gu <[email protected]>
---
include/net/smc.h | 1 +
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 12 ++-
net/smc/smc_cdc.c | 6 ++
net/smc/smc_cdc.h | 1 +
net/smc/smc_loopback.c | 282 +++++++++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 59 +++++++++++
7 files changed, 361 insertions(+), 2 deletions(-)
create mode 100644 net/smc/smc_loopback.c
create mode 100644 net/smc/smc_loopback.h
diff --git a/include/net/smc.h b/include/net/smc.h
index c926d33..7699f97 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -93,6 +93,7 @@ struct smcd_dev {
atomic_t lgr_cnt;
wait_queue_head_t lgrs_deleted;
u8 going_away : 1;
+ u8 is_loopback : 1;
};
struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name,
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 875efcd..a8c3711 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -4,5 +4,5 @@ obj-$(CONFIG_SMC) += smc.o
obj-$(CONFIG_SMC_DIAG) += smc_diag.o
smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o
-smc-y += smc_tracepoint.o
+smc-y += smc_tracepoint.o smc_loopback.o
smc-$(CONFIG_SYSCTL) += smc_sysctl.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index e12d4fa..9546c02 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -52,6 +52,7 @@
#include "smc_stats.h"
#include "smc_tracepoint.h"
#include "smc_sysctl.h"
+#include "smc_loopback.h"
static DEFINE_MUTEX(smc_server_lgr_pending); /* serialize link group
* creation on server
@@ -3451,15 +3452,23 @@ static int __init smc_init(void)
goto out_sock;
}
+ rc = smc_loopback_init();
+ if (rc) {
+ pr_err("%s: smc_loopback_init fails with %d\n", __func__, rc);
+ goto out_ib;
+ }
+
rc = tcp_register_ulp(&smc_ulp_ops);
if (rc) {
pr_err("%s: tcp_ulp_register fails with %d\n", __func__, rc);
- goto out_ib;
+ goto out_lo;
}
static_branch_enable(&tcp_have_smc);
return 0;
+out_lo:
+ smc_loopback_exit();
out_ib:
smc_ib_unregister_client();
out_sock:
@@ -3494,6 +3503,7 @@ static void __exit smc_exit(void)
tcp_unregister_ulp(&smc_ulp_ops);
sock_unregister(PF_SMC);
smc_core_exit();
+ smc_loopback_exit();
smc_ib_unregister_client();
destroy_workqueue(smc_close_wq);
destroy_workqueue(smc_tcp_ls_wq);
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 53f63bf..61f5ff7 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -408,6 +408,12 @@ static void smc_cdc_msg_recv(struct smc_sock *smc, struct smc_cdc_msg *cdc)
static void smcd_cdc_rx_tsklet(struct tasklet_struct *t)
{
struct smc_connection *conn = from_tasklet(conn, t, rx_tsklet);
+
+ smcd_cdc_rx_handler(conn);
+}
+
+void smcd_cdc_rx_handler(struct smc_connection *conn)
+{
struct smcd_cdc_msg *data_cdc;
struct smcd_cdc_msg cdc;
struct smc_sock *smc;
diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
index 696cc11..11559d4 100644
--- a/net/smc/smc_cdc.h
+++ b/net/smc/smc_cdc.h
@@ -301,5 +301,6 @@ int smcr_cdc_msg_send_validation(struct smc_connection *conn,
struct smc_wr_buf *wr_buf);
int smc_cdc_init(void) __init;
void smcd_cdc_rx_init(struct smc_connection *conn);
+void smcd_cdc_rx_handler(struct smc_connection *conn);
#endif /* SMC_CDC_H */
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
new file mode 100644
index 0000000..973382a
--- /dev/null
+++ b/net/smc/smc_loopback.c
@@ -0,0 +1,282 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared Memory Communications Direct over loopback device.
+ *
+ * Provide a SMC-D loopback dummy device.
+ *
+ * Copyright (c) 2022, Alibaba Inc.
+ *
+ * Author: Wen Gu <[email protected]>
+ * Tony Lu <[email protected]>
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/types.h>
+#include <net/smc.h>
+
+#include "smc_cdc.h"
+#include "smc_loopback.h"
+
+#define DRV_NAME "smc_lodev"
+
+struct lo_dev *lo_dev;
+
+static struct lo_systemeid LO_SYSTEM_EID = {
+ .seid_string = "SMC-SYSZ-LOSEID000000000",
+ .serial_number = "0000",
+ .type = "0000",
+};
+
+static int lo_query_rgid(struct smcd_dev *smcd, u64 rgid, u32 vid_valid,
+ u32 vid)
+{
+ struct lo_dev *ldev = smcd->priv;
+
+ /* return local gid */
+ if (!ldev || rgid != ldev->lgid)
+ return -ENETUNREACH;
+ return 0;
+}
+
+static int lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct lo_dev *ldev = smcd->priv;
+ struct lo_dmb_node *dmb_node;
+ int sba_idx, rc;
+
+ /* check space for new dmb */
+ for_each_clear_bit(sba_idx, ldev->sba_idx_mask, LODEV_MAX_DMBS) {
+ if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
+ break;
+ }
+ if (sba_idx == LODEV_MAX_DMBS)
+ return -ENOSPC;
+
+ dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
+ if (!dmb_node) {
+ rc = -ENOMEM;
+ goto err_bit;
+ }
+
+ dmb_node->sba_idx = sba_idx;
+ dmb_node->cpu_addr = kzalloc(dmb->dmb_len, GFP_KERNEL |
+ __GFP_NOWARN | __GFP_NORETRY |
+ __GFP_NOMEMALLOC);
+ if (!dmb_node->cpu_addr) {
+ rc = -ENOMEM;
+ goto err_node;
+ }
+ dmb_node->len = dmb->dmb_len;
+
+ /* TODO: token is random but not exclusive !
+ * we should look up the token in the dmb hash table; if this token
+ * already exists, then generate another one.
+ */
+ /* add new dmb into hash table */
+ get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
+ write_lock(&ldev->dmb_ht_lock);
+ hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
+ write_unlock(&ldev->dmb_ht_lock);
+
+ dmb->sba_idx = dmb_node->sba_idx;
+ dmb->dmb_tok = dmb_node->token;
+ dmb->cpu_addr = dmb_node->cpu_addr;
+ dmb->dma_addr = dmb_node->dma_addr;
+ dmb->dmb_len = dmb_node->len;
+
+ return 0;
+
+err_node:
+ kfree(dmb_node);
+err_bit:
+ clear_bit(sba_idx, ldev->sba_idx_mask);
+ return rc;
+}
+
+static int lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct lo_dev *ldev = smcd->priv;
+
+ /* remove dmb from hash table */
+ write_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+ if (tmp_node->token == dmb->dmb_tok) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ write_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ hash_del(&dmb_node->list);
+ write_unlock(&ldev->dmb_ht_lock);
+
+ clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+ kfree(dmb_node->cpu_addr);
+ kfree(dmb_node);
+
+ return 0;
+}
+
+static int lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+ return 0;
+}
+
+static int lo_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+ return 0;
+}
+
+static int lo_set_vlan_required(struct smcd_dev *smcd)
+{
+ return 0;
+}
+
+static int lo_reset_vlan_required(struct smcd_dev *smcd)
+{
+ return 0;
+}
+
+static int lo_signal_ieq(struct smcd_dev *smcd, u64 rgid, u32 trigger_irq,
+ u32 event_code, u64 info)
+{
+ return 0;
+}
+
+static int lo_move_data(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx,
+ bool sf, unsigned int offset, void *data,
+ unsigned int size)
+{
+ struct lo_dmb_node *rmb_node = NULL, *tmp_node;
+ struct lo_dev *ldev = smcd->priv;
+
+ read_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
+ if (tmp_node->token == dmb_tok) {
+ rmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!rmb_node) {
+ read_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock(&ldev->dmb_ht_lock);
+
+ memcpy((char *)rmb_node->cpu_addr + offset, data, size);
+
+ if (sf) {
+ struct smc_connection *conn =
+ smcd->conn[rmb_node->sba_idx];
+
+ if (conn && !conn->killed)
+ smcd_cdc_rx_handler(conn);
+ }
+ return 0;
+}
+
+static u8 *lo_get_system_eid(void)
+{
+ return &LO_SYSTEM_EID.seid_string[0];
+}
+
+static u16 lo_get_chid(struct smcd_dev *smcd)
+{
+ return 0;
+}
+
+static const struct smcd_ops lo_ops = {
+ .query_remote_gid = lo_query_rgid,
+ .register_dmb = lo_register_dmb,
+ .unregister_dmb = lo_unregister_dmb,
+ .add_vlan_id = lo_add_vlan_id,
+ .del_vlan_id = lo_del_vlan_id,
+ .set_vlan_required = lo_set_vlan_required,
+ .reset_vlan_required = lo_reset_vlan_required,
+ .signal_event = lo_signal_ieq,
+ .move_data = lo_move_data,
+ .get_system_eid = lo_get_system_eid,
+ .get_chid = lo_get_chid,
+};
+
+static int lo_dev_init(struct lo_dev *ldev)
+{
+ struct smcd_dev *smcd = ldev->smcd;
+
+ /* smcd related */
+ smcd->is_loopback = 1;
+ smcd->priv = ldev;
+ get_random_bytes(&smcd->local_gid, sizeof(smcd->local_gid));
+
+ /* ldev related */
+ /* TODO: lgid is random but not exclusive !
+ */
+ ldev->lgid = smcd->local_gid;
+ rwlock_init(&ldev->dmb_ht_lock);
+ hash_init(ldev->dmb_ht);
+
+ return smcd_register_dev(smcd);
+}
+
+static int lo_dev_probe(void)
+{
+ struct lo_dev *ldev;
+ int ret;
+
+ ldev = kzalloc(sizeof(*ldev), GFP_KERNEL);
+ if (!ldev)
+ return -ENOMEM;
+
+ ldev->smcd = smcd_alloc_dev(NULL, "smcd-loopback-dev",
+ &lo_ops, LODEV_MAX_DMBS);
+ if (!ldev->smcd) {
+ ret = -ENOMEM;
+ goto err_ldev;
+ }
+
+ ret = lo_dev_init(ldev);
+ if (ret)
+ goto err_smcd;
+
+ lo_dev = ldev;
+ return 0;
+
+err_smcd:
+ smcd_free_dev(ldev->smcd);
+err_ldev:
+ kfree(ldev);
+ return ret;
+}
+
+static void lo_dev_exit(struct lo_dev *ldev)
+{
+ smcd_unregister_dev(ldev->smcd);
+}
+
+static void lo_dev_remove(void)
+{
+ if (!lo_dev)
+ return;
+
+ lo_dev_exit(lo_dev);
+ smcd_free_dev(lo_dev->smcd);
+ kfree(lo_dev);
+}
+
+int smc_loopback_init(void)
+{
+ /* TODO: now lo_dev is a global device shared by
+ * the whole kernel, and can't be referred to by
+ * smc-tools command 'smcd dev'. Need to be improved.
+ */
+ return lo_dev_probe();
+}
+
+void smc_loopback_exit(void)
+{
+ lo_dev_remove();
+}
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
new file mode 100644
index 0000000..d7f7815
--- /dev/null
+++ b/net/smc/smc_loopback.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Shared Memory Communications Direct over loopback device.
+ *
+ * Provide a SMC-D loopback dummy device.
+ *
+ * Copyright (c) 2022, Alibaba Inc.
+ *
+ * Author: Wen Gu <[email protected]>
+ * Tony Lu <[email protected]>
+ *
+ */
+
+#ifndef _SMC_LOOPBACK_H
+#define _SMC_LOOPBACK_H
+
+#include <linux/types.h>
+#include <linux/interrupt.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <net/smc.h>
+
+#include "smc_core.h"
+
+#define LODEV_MAX_DMBS 5000
+#define LODEV_MAX_DMBS_BUCKETS 16
+
+struct lo_dmb_node {
+ struct hlist_node list;
+ u64 token;
+ u32 len;
+ u32 sba_idx;
+ void *cpu_addr;
+ dma_addr_t dma_addr;
+};
+
+struct lo_dev {
+ struct smcd_dev *smcd;
+ /* priv data */
+ u64 lgid;
+ DECLARE_BITMAP(sba_idx_mask, LODEV_MAX_DMBS);
+ rwlock_t dmb_ht_lock;
+ DECLARE_HASHTABLE(dmb_ht, LODEV_MAX_DMBS_BUCKETS);
+};
+
+struct lo_systemeid {
+ u8 seid_string[24];
+ u8 serial_number[4];
+ u8 type[4];
+};
+
+/* smcd loopback dev*/
+extern struct lo_dev *lo_dev;
+
+int smc_loopback_init(void);
+void smc_loopback_exit(void);
+
+#endif /* _SMC_LOOPBACK_H */
+
--
1.8.3.1
Since the local sndbuf of an SMC-D loopback connection shares the same
physical memory region with the peer RMB, the logic of cursor updates
needs to be adapted.
The main difference from the original implementation is the need to
ensure that data copied to the local sndbuf won't overwrite unconsumed
data of the peer.
So, for SMC-D loopback connections:
1. TX
a. don't update fin_curs when sending out a CDC message.
b. fin_curs and sndbuf_space updates are deferred until the peer's
cons_curs update is received.
2. RX
a. same as before. The peer sndbuf is as large as the local RMB,
which guarantees that prod_curs stays behind prep_curs.
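The deferred-update rule above can be modeled in plain C. This is a userspace sketch only, not kernel code: cursors are simplified to monotonically increasing byte counts instead of wrapping cursors, and all names are illustrative. Because sndbuf and the peer RMB are one memory region, sent bytes cannot be reclaimed at CDC-send time; sndbuf_space and tx_curs_fin advance only when the peer's cons cursor comes back.

```c
/* Userspace sketch of the loopback TX bookkeeping (illustrative names). */
#include <assert.h>

struct lo_conn {
	unsigned int len;          /* sndbuf (== peer RMB) size */
	unsigned int sndbuf_space; /* free bytes in sndbuf */
	unsigned int tx_curs_sent; /* bytes handed to the peer */
	unsigned int tx_curs_fin;  /* bytes reclaimed */
};

/* TX: sending consumes space but does NOT advance fin_curs,
 * since the peer still reads these bytes out of the shared region.
 */
static void lo_send(struct lo_conn *c, unsigned int bytes)
{
	assert(bytes <= c->sndbuf_space);
	c->sndbuf_space -= bytes;
	c->tx_curs_sent += bytes;
	/* a non-loopback SMC-D device would reclaim here:
	 * c->tx_curs_fin = c->tx_curs_sent; c->sndbuf_space += bytes;
	 */
}

/* RX of the peer's cons_curs update: only now is the space truly free,
 * so fin_curs catches up to the peer's cons cursor.
 */
static void lo_peer_consumed(struct lo_conn *c, unsigned int cons_curs)
{
	unsigned int diff = cons_curs - c->tx_curs_fin;

	c->sndbuf_space += diff;
	c->tx_curs_fin = cons_curs;
}
```

This mirrors the split in the patch: smcd_cdc_msg_send skips the reclaim for loopback, and smc_cdc_msg_recv_action performs it when the cons_curs update arrives.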
Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_cdc.c | 53 +++++++++++++++++++++++++++++++++++++++-----------
net/smc/smc_loopback.c | 7 +++++++
2 files changed, 49 insertions(+), 11 deletions(-)
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 61f5ff7..586472a 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -253,17 +253,26 @@ int smcd_cdc_msg_send(struct smc_connection *conn)
return rc;
smc_curs_copy(&conn->rx_curs_confirmed, &curs, conn);
conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
- /* Calculate transmitted data and increment free send buffer space */
- diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
- &conn->tx_curs_sent);
- /* increased by confirmed number of bytes */
- smp_mb__before_atomic();
- atomic_add(diff, &conn->sndbuf_space);
- /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
- smp_mb__after_atomic();
- smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
+ if (!conn->lgr->smcd->is_loopback) {
+ /* Note:
+ * For smcd loopback device:
+ *
+ * Don't update the fin_curs and sndbuf_space here.
+ * Update fin_curs when peer consumes the data in RMB.
+ */
- smc_tx_sndbuf_nonfull(smc);
+ /* Calculate transmitted data and increment free send buffer space */
+ diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+ &conn->tx_curs_sent);
+ /* increased by confirmed number of bytes */
+ smp_mb__before_atomic();
+ atomic_add(diff, &conn->sndbuf_space);
+ /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+ smp_mb__after_atomic();
+ smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
+
+ smc_tx_sndbuf_nonfull(smc);
+ }
return rc;
}
@@ -321,7 +330,7 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
{
union smc_host_cursor cons_old, prod_old;
struct smc_connection *conn = &smc->conn;
- int diff_cons, diff_prod;
+ int diff_cons, diff_prod, diff_tx;
smc_curs_copy(&prod_old, &conn->local_rx_ctrl.prod, conn);
smc_curs_copy(&cons_old, &conn->local_rx_ctrl.cons, conn);
@@ -337,6 +346,28 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
atomic_add(diff_cons, &conn->peer_rmbe_space);
/* guarantee 0 <= peer_rmbe_space <= peer_rmbe_size */
smp_mb__after_atomic();
+
+ /* For smcd loopback device:
+ * Update of peer cons_curs indicates that
+ * 1. peer rmbe space increases.
+ * 2. local sndbuf space increases.
+ *
+ * So local sndbuf fin_curs should be equal to peer RMB cons_curs.
+ */
+ if (conn->lgr->is_smcd &&
+ conn->lgr->smcd->is_loopback) {
+ /* calculate peer rmb consumed data */
+ diff_tx = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+ &conn->local_rx_ctrl.cons);
+ /* increase local sndbuf space and fin_curs */
+ smp_mb__before_atomic();
+ atomic_add(diff_tx, &conn->sndbuf_space);
+ /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+ smp_mb__after_atomic();
+ smc_curs_copy(&conn->tx_curs_fin, &conn->local_rx_ctrl.cons, conn);
+
+ smc_tx_sndbuf_nonfull(smc);
+ }
}
diff_prod = smc_curs_diff(conn->rmb_desc->len, &prod_old,
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index bc3ff82..43f0287 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -216,6 +216,13 @@ static int lo_move_data(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx,
struct lo_dmb_node *rmb_node = NULL, *tmp_node;
struct lo_dev *ldev = smcd->priv;
+ if (!sf) {
+ /* no need to move data.
+ * sndbuf is equal to peer rmb.
+ */
+ return 0;
+ }
+
read_lock(&ldev->dmb_ht_lock);
hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
if (tmp_node->token == dmb_tok) {
--
1.8.3.1
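The loopback cursor update in the smc_cdc.c hunk above may be easier to see in a toy model. The sketch below is illustrative only: `struct curs`, `curs_diff()` and `lo_tx_update()` are simplified stand-ins for the kernel's `union smc_host_cursor`, `smc_curs_diff()` and the new loopback branch, not the real definitions.

```c
#include <assert.h>

/* Toy model of a cursor: a wrap count plus an offset into the ring. */
struct curs {
	unsigned short wrap;	/* how many times the cursor wrapped */
	unsigned int count;	/* offset within the buffer */
};

/* Distance from 'from' to 'to' in a ring of the given size, in the
 * spirit of smc_curs_diff(). */
static unsigned int curs_diff(unsigned int size, const struct curs *from,
			      const struct curs *to)
{
	if (from->wrap != to->wrap)
		return size - from->count + to->count;
	return to->count - from->count;
}

/* Loopback shortcut: because the sndbuf and the peer RMB are the same
 * memory, an advance of the peer's consumer cursor immediately frees
 * the same number of bytes in the local sndbuf, and tx_curs_fin
 * catches up to the peer's cons cursor. */
static void lo_tx_update(unsigned int sndbuf_len, unsigned int *sndbuf_space,
			 struct curs *tx_fin, const struct curs *peer_cons)
{
	*sndbuf_space += curs_diff(sndbuf_len, tx_fin, peer_cons);
	*tx_fin = *peer_cons;
}
```

One send/consume round can then be walked through by advancing `peer_cons` and checking that `sndbuf_space` grows by exactly the consumed amount.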
On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
> Hi, all
>
> # Background
>
> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
> to accelerate TCP applications in cloud environment, improving inter-host
> or inter-VM communication.
>
> In addition of these, we also found the value of SMC-D in scenario of local
> inter-process communication, such as accelerate communication between containers
> within the same host. So this RFC tries to provide a SMC-D loopback solution
> in such scenario, to bring a significant improvement in latency and throughput
> compared to TCP loopback.
>
> # Design
>
> This patch set provides a kind of SMC-D loopback solution.
>
> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for the
> inter-process communication acceleration. Except for loopback acceleration,
> the dummy device can also meet the requirements mentioned in [2], which is
> providing a way to test SMC-D logic for broad community without ISM device.
>
> +------------------------------------------+
> | +-----------+ +-----------+ |
> | | process A | | process B | |
> | +-----------+ +-----------+ |
> | ^ ^ |
> | | +---------------+ | |
> | | | SMC stack | | |
> | +--->| +-----------+ |<--| |
> | | | dummy | | |
> | | | device | | |
> | +-+-----------+-+ |
> | VM |
> +------------------------------------------+
>
> Patch #3/5, #4/5, #5/5 provides a way to avoid data copy from sndbuf to RMB
> and improve SMC-D loopback performance. Through extending smcd_ops with two
> new semantic: attach_dmb and detach_dmb, sender's sndbuf shares the same
> physical memory region with receiver's RMB. The data copied from userspace
> to sender's sndbuf directly reaches the receiver's RMB without unnecessary
> memory copy in the same kernel.
>
> +----------+ +----------+
> | socket A | | socket B |
> +----------+ +----------+
> | ^
> | +---------+ |
> regard as | | ----------|
> local sndbuf | B's | regard as
> | | RMB | local RMB
> |-------> | |
> +---------+
Hi Wen Gu,
I maintain the s390 specific PCI support in Linux and would like to
provide a bit of background on this. You're surely wondering why we
even have a copy in there for our ISM virtual PCI device. To understand
why this copy operation exists and why we need to keep it working, one
needs a bit of s390 aka mainframe background.
On s390 all (currently supported) native machines have a mandatory
machine level hypervisor. All OSs, whether z/OS or Linux, run either on
this machine level hypervisor as so called Logical Partitions (LPARs)
or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
in turn runs in an LPAR. Now, in terms of memory, this machine level
hypervisor, sometimes called PR/SM, is, unlike KVM, z/VM, or VMware, a
partitioning hypervisor without paging. This is one of the main reasons
for the very-near-native performance of the machine hypervisor, as the
memory of its guests acts just like native RAM on other systems. It is
never paged out and always accessible to IOMMU-translated DMA from
devices without the need for pinning pages, and besides a trivial
offset/limit adjustment an LPAR's MMU does the same amount of work as
an MMU on a bare metal x86_64/ARM64 box.
It also means, however, that when SMC-D is used to communicate between
LPARs via an ISM device there is no way of mapping the DMBs to the
same physical memory, as there exists no MMU-like layer spanning
partitions that could do such a mapping. Meanwhile, for machine level
firmware, including the ISM virtual PCI device, it is still possible to
_copy_ memory between different memory partitions. So while I do see
the appeal of skipping the memcpy() for loopback, or even between
guests of a paging hypervisor such as KVM, which can map the DMBs onto
the same physical memory, we must keep in mind this original use case
requiring a copy operation.
Thanks,
Niklas
>
> # Benchmark Test
>
> * Test environments:
> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
> - SMC sndbuf/RMB size 1MB.
>
> * Test object:
> - TCP: run on TCP loopback.
> - domain: run on UNIX domain.
> - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
> - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>
> 1. ipc-benchmark (see [3])
>
> - ./<foo> -c 1000000 -s 100
>
> TCP domain SMC-lo SMC-lo-nocpy
> Message
> rate (msg/s) 75140 129548(+72.41%) 152266(+102.64%) 151914(+102.17%)
Interesting that it does beat UNIX domain sockets. Also, see my below
comment for nginx/wrk as this seems very similar.
>
> 2. sockperf
>
> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>
> TCP SMC-lo SMC-lo-nocpy
> Bandwidth(MBps) 4943.359 4936.096(-0.15%) 8239.624(+66.68%)
> Latency(us) 6.372 3.359(-47.28%) 3.25(-49.00%)
>
> 3. iperf3
>
> - serv: <smc_run> taskset -c <cpu> iperf3 -s
> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>
> TCP SMC-lo SMC-lo-nocpy
> Bitrate(Gb/s) 40.5 41.4(+2.22%) 76.4(+88.64%)
>
> 4. nginx/wrk
>
> - serv: <smc_run> nginx
> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>
> TCP SMC-lo SMC-lo-nocpy
> Requests/s 154643.22 220894.03(+42.84%) 226754.3(+46.63%)
This result is very interesting indeed. So with the much more realistic
nginx/wrk workload it seems the copy hurts much less than iperf3/sockperf
would suggest, while SMC-D itself seems to help more. I'd hope that this
translates to actual applications as well. Maybe this makes SMC-D based
loopback interesting even while keeping the copy, at least until we can
come up with a sane way to work a no-copy variant into SMC-D?
>
>
> # Discussion
>
> 1. API between SMC-D and ISM device
>
> As Jan mentioned in [2], IBM are working on placing an API between SMC-D
> and the ISM device for easier use of different "devices" for SMC-D.
>
> So, considering that the introduction of attach_dmb or detach_dmb can
> effectively avoid data copying from sndbuf to RMB and brings obvious
> throughput advantages in inter-VM or inter-process scenarios, can the
> attach/detach semantics be taken into consideration when designing the
> API to make it a standard ISM device behavior?
Due to the reasons explained above this behavior can't be emulated by
ISM devices at least not when crossing partitions. Not sure if we can
still incorporate it in the API and allow for both copying and
remapping SMC-D like devices, it definitely needs careful consideration
and I think also a better understanding of the benefit for real world
workloads.
>
> Maybe our RFC of SMC-D based inter-process acceleration (this one) and
> inter-VM acceleration (will coming soon, which is the update of [1])
> can provide some examples for new API design. And we are very glad to
> discuss this on the mail list.
>
> 2. Way to select different ISM-like devices
>
> With the proposal of SMC-D loopback 'device' (this RFC) and incoming
> device used for inter-VM acceleration as update of [1], SMC-D has more
> options to choose from. So we need to consider that how to indicate
> supported devices, how to determine which one to use, and their priority...
Agree on this part, though it is for the SMC maintainers to decide, I
think we would definitely want to be able to use any upcoming inter-VM
devices on s390 possibly also in conjunction with ISM devices for
communication across partitions.
>
> IMHO, this may require an update of CLC message and negotiation mechanism.
> Again, we are very glad to discuss this with you on the mailing list.
>
> [1] https://lore.kernel.org/netdev/[email protected]/
> [2] https://lore.kernel.org/netdev/[email protected]/
> [3] https://github.com/goldsborough/ipc-bench
>
> v1->v2
> 1. Fix some build WARNINGs complained by kernel test robot
> Reported-by: kernel test robot <[email protected]>
> 2. Add iperf3 test data.
>
> Wen Gu (5):
> net/smc: introduce SMC-D loopback device
> net/smc: choose loopback device in SMC-D communication
> net/smc: add dmb attach and detach interface
> net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
> net/smc: logic of cursors update in SMC-D loopback connections
>
> include/net/smc.h | 3 +
> net/smc/Makefile | 2 +-
> net/smc/af_smc.c | 88 +++++++++++-
> net/smc/smc_cdc.c | 59 ++++++--
> net/smc/smc_cdc.h | 1 +
> net/smc/smc_clc.c | 4 +-
> net/smc/smc_core.c | 62 +++++++++
> net/smc/smc_core.h | 2 +
> net/smc/smc_ism.c | 39 +++++-
> net/smc/smc_ism.h | 2 +
> net/smc/smc_loopback.c | 358 +++++++++++++++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 63 +++++++++
> 12 files changed, 662 insertions(+), 21 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h
>
On 2022/12/20 22:02, Niklas Schnelle wrote:
> On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>> Hi, all
>>
>> # Background
>>
>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>> to accelerate TCP applications in cloud environment, improving inter-host
>> or inter-VM communication.
>>
>> In addition of these, we also found the value of SMC-D in scenario of local
>> inter-process communication, such as accelerate communication between containers
>> within the same host. So this RFC tries to provide a SMC-D loopback solution
>> in such scenario, to bring a significant improvement in latency and throughput
>> compared to TCP loopback.
>>
>> # Design
>>
>> This patch set provides a kind of SMC-D loopback solution.
>>
>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for the
>> inter-process communication acceleration. Except for loopback acceleration,
>> the dummy device can also meet the requirements mentioned in [2], which is
>> providing a way to test SMC-D logic for broad community without ISM device.
>>
>> +------------------------------------------+
>> | +-----------+ +-----------+ |
>> | | process A | | process B | |
>> | +-----------+ +-----------+ |
>> | ^ ^ |
>> | | +---------------+ | |
>> | | | SMC stack | | |
>> | +--->| +-----------+ |<--| |
>> | | | dummy | | |
>> | | | device | | |
>> | +-+-----------+-+ |
>> | VM |
>> +------------------------------------------+
>>
>> Patch #3/5, #4/5, #5/5 provides a way to avoid data copy from sndbuf to RMB
>> and improve SMC-D loopback performance. Through extending smcd_ops with two
>> new semantic: attach_dmb and detach_dmb, sender's sndbuf shares the same
>> physical memory region with receiver's RMB. The data copied from userspace
>> to sender's sndbuf directly reaches the receiver's RMB without unnecessary
>> memory copy in the same kernel.
>>
>> +----------+ +----------+
>> | socket A | | socket B |
>> +----------+ +----------+
>> | ^
>> | +---------+ |
>> regard as | | ----------|
>> local sndbuf | B's | regard as
>> | | RMB | local RMB
>> |-------> | |
>> +---------+
>
> Hi Wen Gu,
>
> I maintain the s390 specific PCI support in Linux and would like to
> provide a bit of background on this. You're surely wondering why we
> even have a copy in there for our ISM virtual PCI device. To understand
> why this copy operation exists and why we need to keep it working, one
> needs a bit of s390 aka mainframe background.
>
> On s390 all (currently supported) native machines have a mandatory
> machine level hypervisor. All OSs whether z/OS or Linux run either on
> this machine level hypervisor as so called Logical Partitions (LPARs)
> or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
> in turn runs in an LPAR. Now, in terms of memory this machine level
> hypervisor sometimes called PR/SM unlike KVM, z/VM, or VMWare is a
> partitioning hypervisor without paging. This is one of the main reasons
> for the very-near-native performance of the machine hypervisor as the
> memory of its guests acts just like native RAM on other systems. It is
> never paged out and always accessible to IOMMU translated DMA from
> devices without the need for pinning pages and besides a trivial
> offset/limit adjustment an LPAR's MMU does the same amount of work as
> an MMU on a bare metal x86_64/ARM64 box.
>
> It also means however that when SMC-D is used to communicate between
> LPARs via an ISM device there is no way of mapping the DMBs to the
> same physical memory as there exists no MMU-like layer spanning
> partitions that could do such a mapping. Meanwhile for machine level
> firmware including the ISM virtual PCI device it is still possible to
> _copy_ memory between different memory partitions. So yeah while I do
> see the appeal of skipping the memcpy() for loopback or even between
> guests of a paging hypervisor such as KVM, which can map the DMBs on
> the same physical memory, we must keep in mind this original use case
> requiring a copy operation.
>
> Thanks,
> Niklas
>
Hi Niklas,
Thank you so much for the complete and detailed explanation! It gives me a brand new
perspective on s390 devices, an area we hadn't dabbled in before. Now I understand
why shared memory is unavailable between different LPARs.
Our original intention in proposing the loopback device and the incoming inter-VM
device (virtio-ism) is to use SMC-D to accelerate communication in cases where no
s390 ISM device exists. In our conception, the s390 ISM device, the loopback device
and the virtio-ism device are parallel and are all abstracted by smcd_ops.
+------------------------+
| SMC-D |
+------------------------+
-------- smcd_ops ---------
+------+ +------+ +------+
| s390 | | loop | |virtio|
| ISM | | back | | -ism |
| dev | | dev | | dev |
+------+ +------+ +------+
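The parallel-backends idea in the diagram above could be sketched roughly as below. This is not the kernel's real `struct smcd_ops` (which has more members and different signatures); `smcd_ops_sketch`, `copy_move` and `lo_attach` are hypothetical names used only to show how a mandatory copy operation and an optional attach operation can coexist in one ops table.

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Rough, illustrative slice of an smcd_ops-like table. */
struct smcd_ops_sketch {
	const char *name;
	/* baseline path: copy data from the sndbuf into the peer DMB */
	int (*move_data)(void *dmb, const void *src, size_t len);
	/* optional nocopy path: map the sender's sndbuf onto the peer DMB */
	int (*attach_dmb)(void **sndbuf, void *dmb);
};

static int copy_move(void *dmb, const void *src, size_t len)
{
	memcpy(dmb, src, len);	/* ISM-style copy between partitions */
	return 0;
}

static int lo_attach(void **sndbuf, void *dmb)
{
	*sndbuf = dmb;		/* loopback: sndbuf aliases the peer DMB */
	return 0;
}

/* The backends sit side by side under the same ops table; only some
 * of them can offer attach_dmb, the rest leave it NULL. */
static const struct smcd_ops_sketch ism_ops = { "s390-ism", copy_move, NULL };
static const struct smcd_ops_sketch lo_ops  = { "loopback", copy_move, lo_attach };
```

In this picture the SMC-D core only ever talks to the ops table, so an s390 ISM device, the loopback device and a virtio-ism device are interchangeable from its point of view.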
We also believe that the existing design and behavior of the s390 ISM device should
remain unchanged. What we hope to get support for is an smcd_ops extension for devices
with optional beneficial capabilities, such as nocopy here (let's call it that for now),
which is really helpful for us in inter-process and inter-VM scenarios.
This also coincides with IBM's intention to add an API between SMC-D and devices to
support various devices for SMC-D, as mentioned in [2]. So we sent out this RFC, and
the incoming virtio-ism RFC, to provide some examples.
>>
>> # Benchmark Test
>>
>> * Test environments:
>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>> - SMC sndbuf/RMB size 1MB.
>>
>> * Test object:
>> - TCP: run on TCP loopback.
>> - domain: run on UNIX domain.
>> - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
>> - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>>
>> 1. ipc-benchmark (see [3])
>>
>> - ./<foo> -c 1000000 -s 100
>>
>> TCP domain SMC-lo SMC-lo-nocpy
>> Message
>> rate (msg/s) 75140 129548(+72.41%) 152266(+102.64%) 151914(+102.17%)
>
> Interesting that it does beat UNIX domain sockets. Also, see my below
> comment for nginx/wrk as this seems very similar.
>
>>
>> 2. sockperf
>>
>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>
>> TCP SMC-lo SMC-lo-nocpy
>> Bandwidth(MBps) 4943.359 4936.096(-0.15%) 8239.624(+66.68%)
>> Latency(us) 6.372 3.359(-47.28%) 3.25(-49.00%)
>>
>> 3. iperf3
>>
>> - serv: <smc_run> taskset -c <cpu> iperf3 -s
>> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>
>> TCP SMC-lo SMC-lo-nocpy
>> Bitrate(Gb/s) 40.5 41.4(+2.22%) 76.4(+88.64%)
>>
>> 4. nginx/wrk
>>
>> - serv: <smc_run> nginx
>> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>
>> TCP SMC-lo SMC-lo-nocpy
>> Requests/s 154643.22 220894.03(+42.84%) 226754.3(+46.63%)
>
>
> This result is very interesting indeed. So with the much more realistic
> nginx/wrk workload it seems the copy hurts much less than iperf3/sockperf
> would suggest, while SMC-D itself seems to help more. I'd hope that this
> translates to actual applications as well. Maybe this makes SMC-D based
> loopback interesting even while keeping the copy, at least until we can
> come up with a sane way to work a no-copy variant into SMC-D?
>
I agree, the nginx/wrk workload is much more realistic for many applications.
But on the cloud we also encounter many other cases similar to sockperf that require
high throughput, such as AI training and big data.
So avoiding the copy between DMBs can help these cases a lot :)
>>
>>
>> # Discussion
>>
>> 1. API between SMC-D and ISM device
>>
>> As Jan mentioned in [2], IBM are working on placing an API between SMC-D
>> and the ISM device for easier use of different "devices" for SMC-D.
>>
>> So, considering that the introduction of attach_dmb or detach_dmb can
>> effectively avoid data copying from sndbuf to RMB and brings obvious
>> throughput advantages in inter-VM or inter-process scenarios, can the
>> attach/detach semantics be taken into consideration when designing the
>> API to make it a standard ISM device behavior?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Due to the reasons explained above this behavior can't be emulated by
> ISM devices at least not when crossing partitions. Not sure if we can
> still incorporate it in the API and allow for both copying and
> remapping SMC-D like devices, it definitely needs careful consideration
> and I think also a better understanding of the benefit for real world
> workloads.
>
I was not rigorous here.
Indeed, nocopy shouldn't be a standard ISM device behavior. Actually, we hope it can be a
standard optional _SMC-D_ device behavior, defined by smcd_ops.
For devices that don't support these options, like the ISM device on the s390 architecture,
.attach_dmb/.detach_dmb and other reasonable extensions (which will be proposed for
discussion in the incoming virtio-ism RFC) can be left NULL or return an error. And devices
that do support them may use them to improve performance in some cases.
In addition, may I know the latest news about the API design? :) For example, its scale:
will it be an almost complete refactor of the existing interface or incremental patching?
And its aim: will it be tailored to exact ISM behavior, or will it reserve some options for
other devices, like nocopy here? From my understanding of [2], it might be the latter?
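The "optional capability" negotiation described above might look roughly like this in code. Everything here is hypothetical (`dev_caps`, `setup_conn`, `share_attach` are made-up names, not kernel symbols); it only illustrates the idea that a NULL or failing attach_dmb cleanly degrades to the copy path.

```c
#include <stddef.h>
#include <assert.h>

/* A device advertises the optional nocopy capability by filling in
 * attach_dmb; an s390-ISM-like device leaves it NULL. */
struct dev_caps {
	int (*attach_dmb)(void **sndbuf, void *dmb);	/* may be NULL */
};

static int share_attach(void **sndbuf, void *dmb)
{
	*sndbuf = dmb;		/* nocopy: alias the peer DMB */
	return 0;
}

/* Returns 1 if the connection ends up sharing memory with the peer
 * DMB, 0 if it must keep a private sndbuf and copy on every send. */
static int setup_conn(const struct dev_caps *dev, void **sndbuf, void *dmb)
{
	if (dev->attach_dmb && dev->attach_dmb(sndbuf, dmb) == 0)
		return 1;
	return 0;
}
```

The key property is that the core logic never assumes nocopy: a copy-only device and a nocopy-capable device go through the same setup call and simply take different branches.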
>>
>> Maybe our RFC of SMC-D based inter-process acceleration (this one) and
>> inter-VM acceleration (will coming soon, which is the update of [1])
>> can provide some examples for new API design. And we are very glad to
>> discuss this on the mail list.
>>
>> 2. Way to select different ISM-like devices
>>
>> With the proposal of SMC-D loopback 'device' (this RFC) and incoming
>> device used for inter-VM acceleration as update of [1], SMC-D has more
>> options to choose from. So we need to consider that how to indicate
>> supported devices, how to determine which one to use, and their priority...
>
> Agree on this part, though it is for the SMC maintainers to decide, I
> think we would definitely want to be able to use any upcoming inter-VM
> devices on s390 possibly also in conjunction with ISM devices for
> communication across partitions.
>
Yes, this part needs to be discussed with the SMC maintainers. And thank you, we would
be very glad if our devices could be used on s390 through these efforts.
Best Regards,
Wen Gu
>>
>> IMHO, this may require an update of CLC message and negotiation mechanism.
>> Again, we are very glad to discuss this with you on the mailing list.
>>
>> [1] https://lore.kernel.org/netdev/[email protected]/
>> [2] https://lore.kernel.org/netdev/[email protected]/
>> [3] https://github.com/goldsborough/ipc-bench
>>
>> v1->v2
>> 1. Fix some build WARNINGs complained by kernel test robot
>> Reported-by: kernel test robot <[email protected]>
>> 2. Add iperf3 test data.
>>
>> Wen Gu (5):
>> net/smc: introduce SMC-D loopback device
>> net/smc: choose loopback device in SMC-D communication
>> net/smc: add dmb attach and detach interface
>> net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
>> net/smc: logic of cursors update in SMC-D loopback connections
>>
>> include/net/smc.h | 3 +
>> net/smc/Makefile | 2 +-
>> net/smc/af_smc.c | 88 +++++++++++-
>> net/smc/smc_cdc.c | 59 ++++++--
>> net/smc/smc_cdc.h | 1 +
>> net/smc/smc_clc.c | 4 +-
>> net/smc/smc_core.c | 62 +++++++++
>> net/smc/smc_core.h | 2 +
>> net/smc/smc_ism.c | 39 +++++-
>> net/smc/smc_ism.h | 2 +
>> net/smc/smc_loopback.c | 358 +++++++++++++++++++++++++++++++++++++++++++++++++
>> net/smc/smc_loopback.h | 63 +++++++++
>> 12 files changed, 662 insertions(+), 21 deletions(-)
>> create mode 100644 net/smc/smc_loopback.c
>> create mode 100644 net/smc/smc_loopback.h
>>
On Tue, Dec 20, 2022 at 03:02:45PM +0100, Niklas Schnelle wrote:
>On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>> Hi, all
>>
>> # Background
>>
>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>> to accelerate TCP applications in cloud environment, improving inter-host
>> or inter-VM communication.
>>
>> In addition of these, we also found the value of SMC-D in scenario of local
>> inter-process communication, such as accelerate communication between containers
>> within the same host. So this RFC tries to provide a SMC-D loopback solution
>> in such scenario, to bring a significant improvement in latency and throughput
>> compared to TCP loopback.
>>
>> # Design
>>
>> This patch set provides a kind of SMC-D loopback solution.
>>
>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for the
>> inter-process communication acceleration. Except for loopback acceleration,
>> the dummy device can also meet the requirements mentioned in [2], which is
>> providing a way to test SMC-D logic for broad community without ISM device.
>>
>> +------------------------------------------+
>> | +-----------+ +-----------+ |
>> | | process A | | process B | |
>> | +-----------+ +-----------+ |
>> | ^ ^ |
>> | | +---------------+ | |
>> | | | SMC stack | | |
>> | +--->| +-----------+ |<--| |
>> | | | dummy | | |
>> | | | device | | |
>> | +-+-----------+-+ |
>> | VM |
>> +------------------------------------------+
>>
>> Patch #3/5, #4/5, #5/5 provides a way to avoid data copy from sndbuf to RMB
>> and improve SMC-D loopback performance. Through extending smcd_ops with two
>> new semantic: attach_dmb and detach_dmb, sender's sndbuf shares the same
>> physical memory region with receiver's RMB. The data copied from userspace
>> to sender's sndbuf directly reaches the receiver's RMB without unnecessary
>> memory copy in the same kernel.
>>
>> +----------+ +----------+
>> | socket A | | socket B |
>> +----------+ +----------+
>> | ^
>> | +---------+ |
>> regard as | | ----------|
>> local sndbuf | B's | regard as
>> | | RMB | local RMB
>> |-------> | |
>> +---------+
>
>Hi Wen Gu,
>
>I maintain the s390 specific PCI support in Linux and would like to
>provide a bit of background on this. You're surely wondering why we
>even have a copy in there for our ISM virtual PCI device. To understand
>why this copy operation exists and why we need to keep it working, one
>needs a bit of s390 aka mainframe background.
>
>On s390 all (currently supported) native machines have a mandatory
>machine level hypervisor. All OSs whether z/OS or Linux run either on
>this machine level hypervisor as so called Logical Partitions (LPARs)
>or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>in turn runs in an LPAR. Now, in terms of memory this machine level
>hypervisor sometimes called PR/SM unlike KVM, z/VM, or VMWare is a
>partitioning hypervisor without paging. This is one of the main reasons
>for the very-near-native performance of the machine hypervisor as the
>memory of its guests acts just like native RAM on other systems. It is
>never paged out and always accessible to IOMMU translated DMA from
>devices without the need for pinning pages and besides a trivial
>offset/limit adjustment an LPAR's MMU does the same amount of work as
>an MMU on a bare metal x86_64/ARM64 box.
>
>It also means however that when SMC-D is used to communicate between
>LPARs via an ISM device there is no way of mapping the DMBs to the
>same physical memory as there exists no MMU-like layer spanning
>partitions that could do such a mapping. Meanwhile for machine level
>firmware including the ISM virtual PCI device it is still possible to
>_copy_ memory between different memory partitions. So yeah while I do
>see the appeal of skipping the memcpy() for loopback or even between
>guests of a paging hypervisor such as KVM, which can map the DMBs on
>the same physical memory, we must keep in mind this original use case
>requiring a copy operation.
>
>Thanks,
>Niklas
>
>>
>> # Benchmark Test
>>
>> * Test environments:
>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>> - SMC sndbuf/RMB size 1MB.
>>
>> * Test object:
>> - TCP: run on TCP loopback.
>> - domain: run on UNIX domain.
>> - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
>> - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>>
>> 1. ipc-benchmark (see [3])
>>
>> - ./<foo> -c 1000000 -s 100
>>
>> TCP domain SMC-lo SMC-lo-nocpy
>> Message
>> rate (msg/s) 75140 129548(+72.41%) 152266(+102.64%) 151914(+102.17%)
>
>Interesting that it does beat UNIX domain sockets. Also, see my below
>comment for nginx/wrk as this seems very similar.
>
>>
>> 2. sockperf
>>
>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>
>> TCP SMC-lo SMC-lo-nocpy
>> Bandwidth(MBps) 4943.359 4936.096(-0.15%) 8239.624(+66.68%)
>> Latency(us) 6.372 3.359(-47.28%) 3.25(-49.00%)
>>
>> 3. iperf3
>>
>> - serv: <smc_run> taskset -c <cpu> iperf3 -s
>> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>
>> TCP SMC-lo SMC-lo-nocpy
>> Bitrate(Gb/s) 40.5 41.4(+2.22%) 76.4(+88.64%)
>>
>> 4. nginx/wrk
>>
>> - serv: <smc_run> nginx
>> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>
>> TCP SMC-lo SMC-lo-nocpy
>> Requests/s 154643.22 220894.03(+42.84%) 226754.3(+46.63%)
>
>
>This result is very interesting indeed. So with the much more realistic
>nginx/wrk workload it seems the copy hurts much less than iperf3/sockperf
>would suggest, while SMC-D itself seems to help more. I'd hope that this
>translates to actual applications as well. Maybe this makes SMC-D based
>loopback interesting even while keeping the copy, at least until we can
>come up with a sane way to work a no-copy variant into SMC-D?
Yes, SMC-D based loopback shows great advantages over TCP loopback, with
or without the copy.
The advantage of zero-copy should be observed when we need to transfer
a large amount of data. But here in this wrk/nginx case, the test file
transferred from server to client is small, so we didn't see much gain.
If we used a large file (e.g. >= 1MB), I think we would observe a much
different result.
Thanks!
On 2022/12/20 11:21, Wen Gu wrote:
> Hi, all
>
> # Background
>
> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
> to accelerate TCP applications in cloud environment, improving inter-host
> or inter-VM communication.
>
> In addition of these, we also found the value of SMC-D in scenario of local
> inter-process communication, such as accelerate communication between containers
> within the same host. So this RFC tries to provide a SMC-D loopback solution
> in such scenario, to bring a significant improvement in latency and throughput
> compared to TCP loopback.
>
> # Design
>
> This patch set provides a kind of SMC-D loopback solution.
>
> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for the
> inter-process communication acceleration. Except for loopback acceleration,
> the dummy device can also meet the requirements mentioned in [2], which is
> providing a way to test SMC-D logic for broad community without ISM device.
>
> +------------------------------------------+
> | +-----------+ +-----------+ |
> | | process A | | process B | |
> | +-----------+ +-----------+ |
> | ^ ^ |
> | | +---------------+ | |
> | | | SMC stack | | |
> | +--->| +-----------+ |<--| |
> | | | dummy | | |
> | | | device | | |
> | +-+-----------+-+ |
> | VM |
> +------------------------------------------+
>
> Patch #3/5, #4/5, #5/5 provides a way to avoid data copy from sndbuf to RMB
> and improve SMC-D loopback performance. Through extending smcd_ops with two
> new semantic: attach_dmb and detach_dmb, sender's sndbuf shares the same
> physical memory region with receiver's RMB. The data copied from userspace
> to sender's sndbuf directly reaches the receiver's RMB without unnecessary
> memory copy in the same kernel.
>
> +----------+ +----------+
> | socket A | | socket B |
> +----------+ +----------+
> | ^
> | +---------+ |
> regard as | | ----------|
> local sndbuf | B's | regard as
> | | RMB | local RMB
> |-------> | |
> +---------+
>
> # Benchmark Test
>
> * Test environments:
> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
> - SMC sndbuf/RMB size 1MB.
>
> * Test object:
> - TCP: run on TCP loopback.
> - domain: run on UNIX domain.
> - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
> - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>
> 1. ipc-benchmark (see [3])
>
> - ./<foo> -c 1000000 -s 100
>
> TCP domain SMC-lo SMC-lo-nocpy
> Message
> rate (msg/s) 75140 129548(+72.41%) 152266(+102.64%) 151914(+102.17%)
>
> 2. sockperf
>
> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>
> TCP SMC-lo SMC-lo-nocpy
> Bandwidth(MBps) 4943.359 4936.096(-0.15%) 8239.624(+66.68%)
> Latency(us) 6.372 3.359(-47.28%) 3.25(-49.00%)
>
> 3. iperf3
>
> - serv: <smc_run> taskset -c <cpu> iperf3 -s
> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>
> TCP SMC-lo SMC-lo-nocpy
> Bitrate(Gb/s) 40.5 41.4(+2.22%) 76.4(+88.64%)
>
> 4. nginx/wrk
>
> - serv: <smc_run> nginx
> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>
> TCP SMC-lo SMC-lo-nocpy
> Requests/s 154643.22 220894.03(+42.84%) 226754.3(+46.63%)
>
>
> # Discussion
>
> 1. API between SMC-D and ISM device
>
> As Jan mentioned in [2], IBM are working on placing an API between SMC-D
> and the ISM device for easier use of different "devices" for SMC-D.
>
> So, considering that the introduction of attach_dmb or detach_dmb can
> effectively avoid data copying from sndbuf to RMB and brings obvious
> throughput advantages in inter-VM or inter-process scenarios, can the
> attach/detach semantics be taken into consideration when designing the
> API to make it a standard ISM device behavior?
>
> Maybe our RFC of SMC-D based inter-process acceleration (this one) and
> inter-VM acceleration (will coming soon, which is the update of [1])
The patch of SMC-D + virtio-ism device is now discussed in virtio community:
https://lists.oasis-open.org/archives/virtio-comment/202212/msg00030.html
> can provide some examples for new API design. And we are very glad to
> discuss this on the mail list.
>
> 2. Way to select different ISM-like devices
>
> With the proposal of SMC-D loopback 'device' (this RFC) and incoming
> device used for inter-VM acceleration as update of [1], SMC-D has more
> options to choose from. So we need to consider that how to indicate
> supported devices, how to determine which one to use, and their priority...
>
> IMHO, this may require an update of CLC message and negotiation mechanism.
> Again, we are very glad to discuss this with you on the mailing list.
>
> [1] https://lore.kernel.org/netdev/[email protected]/
> [2] https://lore.kernel.org/netdev/[email protected]/
> [3] https://github.com/goldsborough/ipc-bench
>
> v1->v2
> 1. Fix some build WARNINGs complained by kernel test robot
> Reported-by: kernel test robot <[email protected]>
> 2. Add iperf3 test data.
>
> Wen Gu (5):
> net/smc: introduce SMC-D loopback device
> net/smc: choose loopback device in SMC-D communication
> net/smc: add dmb attach and detach interface
> net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
> net/smc: logic of cursors update in SMC-D loopback connections
>
> include/net/smc.h | 3 +
> net/smc/Makefile | 2 +-
> net/smc/af_smc.c | 88 +++++++++++-
> net/smc/smc_cdc.c | 59 ++++++--
> net/smc/smc_cdc.h | 1 +
> net/smc/smc_clc.c | 4 +-
> net/smc/smc_core.c | 62 +++++++++
> net/smc/smc_core.h | 2 +
> net/smc/smc_ism.c | 39 +++++-
> net/smc/smc_ism.h | 2 +
> net/smc/smc_loopback.c | 358 +++++++++++++++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 63 +++++++++
> 12 files changed, 662 insertions(+), 21 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h
>
On 21.12.22 14:14, Wen Gu wrote:
>
>
> On 2022/12/20 22:02, Niklas Schnelle wrote:
>
>> On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>>> Hi, all
>>>
>>> # Background
>>>
>>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>>> to accelerate TCP applications in cloud environment, improving inter-host
>>> or inter-VM communication.
>>>
>>> In addition of these, we also found the value of SMC-D in scenario of local
>>> inter-process communication, such as accelerate communication between containers
>>> within the same host. So this RFC tries to provide a SMC-D loopback solution
>>> in such scenario, to bring a significant improvement in latency and throughput
>>> compared to TCP loopback.
>>>
>>> # Design
>>>
>>> This patch set provides a kind of SMC-D loopback solution.
>>>
>>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for the
>>> inter-process communication acceleration. Except for loopback acceleration,
>>> the dummy device can also meet the requirements mentioned in [2], which is
>>> providing a way to test SMC-D logic for broad community without ISM device.
>>>
>>> +------------------------------------------+
>>> | +-----------+ +-----------+ |
>>> | | process A | | process B | |
>>> | +-----------+ +-----------+ |
>>> | ^ ^ |
>>> | | +---------------+ | |
>>> | | | SMC stack | | |
>>> | +--->| +-----------+ |<--| |
>>> | | | dummy | | |
>>> | | | device | | |
>>> | +-+-----------+-+ |
>>> | VM |
>>> +------------------------------------------+
>>>
>>> Patch #3/5, #4/5, #5/5 provides a way to avoid data copy from sndbuf to RMB
>>> and improve SMC-D loopback performance. Through extending smcd_ops with two
>>> new semantic: attach_dmb and detach_dmb, sender's sndbuf shares the same
>>> physical memory region with receiver's RMB. The data copied from userspace
>>> to sender's sndbuf directly reaches the receiver's RMB without unnecessary
>>> memory copy in the same kernel.
>>>
>>> +----------+ +----------+
>>> | socket A | | socket B |
>>> +----------+ +----------+
>>> | ^
>>> | +---------+ |
>>> regard as | | ----------|
>>> local sndbuf | B's | regard as
>>> | | RMB | local RMB
>>> |-------> | |
>>> +---------+
>>
>> Hi Wen Gu,
>>
>> I maintain the s390 specific PCI support in Linux and would like to
>> provide a bit of background on this. You're surely wondering why we
>> even have a copy in there for our ISM virtual PCI device. To understand
>> why this copy operation exists and why we need to keep it working, one
>> needs a bit of s390 aka mainframe background.
>>
>> On s390 all (currently supported) native machines have a mandatory
>> machine level hypervisor. All OSs whether z/OS or Linux run either on
>> this machine level hypervisor as so called Logical Partitions (LPARs)
>> or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>> in turn runs in an LPAR. Now, in terms of memory this machine level
>> hypervisor sometimes called PR/SM unlike KVM, z/VM, or VMWare is a
>> partitioning hypervisor without paging. This is one of the main reasons
>> for the very-near-native performance of the machine hypervisor as the
>> memory of its guests acts just like native RAM on other systems. It is
>> never paged out and always accessible to IOMMU translated DMA from
>> devices without the need for pinning pages and besides a trivial
>> offset/limit adjustment an LPAR's MMU does the same amount of work as
>> an MMU on a bare metal x86_64/ARM64 box.
>>
>> It also means however that when SMC-D is used to communicate between
>> LPARs via an ISM device there is no way of mapping the DMBs to the
>> same physical memory as there exists no MMU-like layer spanning
>> partitions that could do such a mapping. Meanwhile for machine level
>> firmware including the ISM virtual PCI device it is still possible to
>> _copy_ memory between different memory partitions. So yeah while I do
>> see the appeal of skipping the memcpy() for loopback or even between
>> guests of a paging hypervisor such as KVM, which can map the DMBs on
>> the same physical memory, we must keep in mind this original use case
>> requiring a copy operation.
>>
>> Thanks,
>> Niklas
>>
>
> Hi Niklas,
>
> Thank you so much for the complete and detailed explanation! This provides
> me a brand new perspective of s390 device that we hadn't dabbled in before.
> Now I understand why shared memory is unavailable between different LPARs.
>
> Our original intention of proposing loopback device and the incoming device
> (virtio-ism) for inter-VM is to use SMC-D to accelerate communication in the
> case with no existing s390 ISM devices. In our conception, s390 ISM device,
> loopback device and virtio-ism device are parallel and are abstracted by smcd_ops.
>
> +------------------------+
> | SMC-D |
> +------------------------+
> -------- smcd_ops ---------
> +------+ +------+ +------+
> | s390 | | loop | |virtio|
> | ISM | | back | | -ism |
> | dev | | dev | | dev |
> +------+ +------+ +------+
>
> We also believe that keeping the existing design and behavior of s390 ISM
> device is unshaken. What we want to get support for is some smcd_ops extension
> for devices with optional beneficial capability, such as nocopy here (Let's call
> it this for now), which is really helpful for us in inter-process and inter-VM
> scenario.
>
> And coincided with IBM's intention to add APIs between SMC-D and devices to
> support various devices for SMC-D, as mentioned in [2], we send out this RFC and
> the incoming virio-ism RFC, to provide some examples.
>
>>>
>>> # Benchmark Test
>>>
>>> * Test environments:
>>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>> - SMC sndbuf/RMB size 1MB.
>>>
>>> * Test object:
>>> - TCP: run on TCP loopback.
>>> - domain: run on UNIX domain.
>>> - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
>>> - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>>>
>>> 1. ipc-benchmark (see [3])
>>>
>>> - ./<foo> -c 1000000 -s 100
>>>
>>> TCP domain SMC-lo SMC-lo-nocpy
>>> Message
>>> rate (msg/s) 75140 129548(+72.41) 152266(+102.64%) 151914(+102.17%)
>>
>> Interesting that it does beat UNIX domain sockets. Also, see my below
>> comment for nginx/wrk as this seems very similar.
>>
>>>
>>> 2. sockperf
>>>
>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>>
>>> TCP SMC-lo SMC-lo-nocpy
>>> Bandwidth(MBps) 4943.359 4936.096(-0.15%) 8239.624(+66.68%)
>>> Latency(us) 6.372 3.359(-47.28%) 3.25(-49.00%)
>>>
>>> 3. iperf3
>>>
>>> - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>>
>>> TCP SMC-lo SMC-lo-nocpy
>>> Bitrate(Gb/s) 40.5 41.4(+2.22%) 76.4(+88.64%)
>>>
>>> 4. nginx/wrk
>>>
>>> - serv: <smc_run> nginx
>>> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>>
>>> TCP SMC-lo SMC-lo-nocpy
>>> Requests/s 154643.22 220894.03(+42.84%) 226754.3(+46.63%)
>>
>>
>> This result is very interesting indeed. So with the much more realistic
>> nginx/wrk workload it seems to copy hurts much less than the
>> iperf3/sockperf would suggest while SMC-D itself seems to help more.
>> I'd hope that this translates to actual applications as well. Maybe
>> this makes SMC-D based loopback interesting even while keeping the
>> copy, at least until we can come up with a sane way to work a no-copy
>> variant into SMC-D?
>>
>
> I agree, nginx/wrk workload is much more realistic for many applications.
>
> But we also encounter many other cases similar to sockperf on the cloud, which
> requires high throughput, such as AI training and big data.
>
> So avoidance of copying between DMBs can help these cases a lot :)
>
>>>
>>>
>>> # Discussion
>>>
>>> 1. API between SMC-D and ISM device
>>>
>>> As Jan mentioned in [2], IBM are working on placing an API between SMC-D
>>> and the ISM device for easier use of different "devices" for SMC-D.
>>>
>>> So, considering that the introduction of attach_dmb or detach_dmb can
>>> effectively avoid data copying from sndbuf to RMB and brings obvious
>>> throughput advantages in inter-VM or inter-process scenarios, can the
>>> attach/detach semantics be taken into consideration when designing the
>>> API to make it a standard ISM device behavior?
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> Due to the reasons explained above this behavior can't be emulated by
>> ISM devices at least not when crossing partitions. Not sure if we can
>> still incorporate it in the API and allow for both copying and
>> remapping SMC-D like devices, it definitely needs careful consideration
>> and I think also a better understanding of the benefit for real world
>> workloads.
>>
>
> Here I am not rigorous.
>
> Nocopy shouldn't be a standard ISM device behavior indeed. Actually we hope it be a
> standard optional _SMC-D_ device behavior and defined by smcd_ops.
>
> For devices don't support these options, like ISM device on s390 architecture,
> .attach_dmb/.detach_dmb and other reasonable extensions (which will be proposed to
> discuss in incoming virtio-ism RFC) can be set to NULL or return invalid. And for
> devices do support, they may be used for improving performance in some cases.
>
> In addition, can I know more latest news about the API design? :) , like its scale, will
> it be a almost refactor of existing interface or incremental patching? and its object,
> will it be tailored for exact ISM behavior or to reserve some options for other devices,
> like nocopy here? From my understanding of [2], it might be the latter?
>
>>>
>>> Maybe our RFC of SMC-D based inter-process acceleration (this one) and
>>> inter-VM acceleration (will coming soon, which is the update of [1])
>>> can provide some examples for new API design. And we are very glad to
>>> discuss this on the mail list.
>>>
>>> 2. Way to select different ISM-like devices
>>>
>>> With the proposal of SMC-D loopback 'device' (this RFC) and incoming
>>> device used for inter-VM acceleration as update of [1], SMC-D has more
>>> options to choose from. So we need to consider that how to indicate
>>> supported devices, how to determine which one to use, and their priority...
>>
>> Agree on this part, though it is for the SMC maintainers to decide, I
>> think we would definitely want to be able to use any upcoming inter-VM
>> devices on s390 possibly also in conjunction with ISM devices for
>> communication across partitions.
>>
>
> Yes, this part needs to be discussed with SMC maintainers. And thank you, we are very glad
> if our devices can be applied on s390 through the efforts.
>
>
> Best Regards,
> Wen Gu
>
>>>
>>> IMHO, this may require an update of CLC message and negotiation mechanism.
>>> Again, we are very glad to discuss this with you on the mailing list.
As described in the
SMC protocol (including SMC-D): https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202_2.pdf
the CLC messages provide a list of up to 8 ISM devices to choose from.
So I would hope that we can use the existing protocol.
The challenge will be to define GID (Global Interface ID) and CHID (a fabric ID) in
a meaningful way for the new devices.
There is always smcd_ops->query_remote_gid() as a safety net. But the idea is that
a CHID mismatch is a fast way to tell that these 2 interfaces do not match.
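A toy model of this matching logic, with illustrative types rather than the kernel's structures: the CLC proposal carries up to 8 (CHID, GID) pairs, a CHID mismatch rejects a pairing cheaply, and only on a CHID match is the GID compared.

```c
#include <stdint.h>

/* Hypothetical model of v2 device matching: names are illustrative,
 * not kernel API. */
struct ism_entry {
	uint16_t chid;	/* fabric ID */
	uint64_t gid;	/* global interface ID */
};

static int find_matching_device(const struct ism_entry *local, int nlocal,
				const struct ism_entry *proposed, int nprop)
{
	for (int i = 0; i < nlocal; i++)
		for (int j = 0; j < nprop; j++) {
			if (local[i].chid != proposed[j].chid)
				continue;	/* fast reject on CHID */
			if (local[i].gid == proposed[j].gid)
				return i;	/* same fabric and interface */
		}
	return -1;	/* no usable pairing */
}
```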
>>>
>>> [1] https://lore.kernel.org/netdev/[email protected]/
>>> [2] https://lore.kernel.org/netdev/[email protected]/
>>> [3] https://github.com/goldsborough/ipc-bench
>>>
>>> v1->v2
>>> 1. Fix some build WARNINGs complained by kernel test rebot
>>> Reported-by: kernel test robot <[email protected]>
>>> 2. Add iperf3 test data.
>>>
>>> Wen Gu (5):
>>> net/smc: introduce SMC-D loopback device
>>> net/smc: choose loopback device in SMC-D communication
>>> net/smc: add dmb attach and detach interface
>>> net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
>>> net/smc: logic of cursors update in SMC-D loopback connections
>>>
>>> include/net/smc.h | 3 +
>>> net/smc/Makefile | 2 +-
>>> net/smc/af_smc.c | 88 +++++++++++-
>>> net/smc/smc_cdc.c | 59 ++++++--
>>> net/smc/smc_cdc.h | 1 +
>>> net/smc/smc_clc.c | 4 +-
>>> net/smc/smc_core.c | 62 +++++++++
>>> net/smc/smc_core.h | 2 +
>>> net/smc/smc_ism.c | 39 +++++-
>>> net/smc/smc_ism.h | 2 +
>>> net/smc/smc_loopback.c | 358 +++++++++++++++++++++++++++++++++++++++++++++++++
>>> net/smc/smc_loopback.h | 63 +++++++++
>>> 12 files changed, 662 insertions(+), 21 deletions(-)
>>> create mode 100644 net/smc/smc_loopback.c
>>> create mode 100644 net/smc/smc_loopback.h
>>>
On 2023/1/5 00:09, Alexandra Winter wrote:
>
>
> On 21.12.22 14:14, Wen Gu wrote:
>>
>>
>> On 2022/12/20 22:02, Niklas Schnelle wrote:
>>
>>> On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>>>> Hi, all
>>>>
>>>> # Background
>>>>
>>>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>>>> to accelerate TCP applications in cloud environment, improving inter-host
>>>> or inter-VM communication.
>>>>
>>>> In addition of these, we also found the value of SMC-D in scenario of local
>>>> inter-process communication, such as accelerate communication between containers
>>>> within the same host. So this RFC tries to provide a SMC-D loopback solution
>>>> in such scenario, to bring a significant improvement in latency and throughput
>>>> compared to TCP loopback.
>>>>
>>>> # Design
>>>>
>>>> This patch set provides a kind of SMC-D loopback solution.
>>>>
>>>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for the
>>>> inter-process communication acceleration. Except for loopback acceleration,
>>>> the dummy device can also meet the requirements mentioned in [2], which is
>>>> providing a way to test SMC-D logic for broad community without ISM device.
>>>>
>>>> +------------------------------------------+
>>>> | +-----------+ +-----------+ |
>>>> | | process A | | process B | |
>>>> | +-----------+ +-----------+ |
>>>> | ^ ^ |
>>>> | | +---------------+ | |
>>>> | | | SMC stack | | |
>>>> | +--->| +-----------+ |<--| |
>>>> | | | dummy | | |
>>>> | | | device | | |
>>>> | +-+-----------+-+ |
>>>> | VM |
>>>> +------------------------------------------+
>>>>
>>>> Patch #3/5, #4/5, #5/5 provides a way to avoid data copy from sndbuf to RMB
>>>> and improve SMC-D loopback performance. Through extending smcd_ops with two
>>>> new semantic: attach_dmb and detach_dmb, sender's sndbuf shares the same
>>>> physical memory region with receiver's RMB. The data copied from userspace
>>>> to sender's sndbuf directly reaches the receiver's RMB without unnecessary
>>>> memory copy in the same kernel.
>>>>
>>>> +----------+ +----------+
>>>> | socket A | | socket B |
>>>> +----------+ +----------+
>>>> | ^
>>>> | +---------+ |
>>>> regard as | | ----------|
>>>> local sndbuf | B's | regard as
>>>> | | RMB | local RMB
>>>> |-------> | |
>>>> +---------+
>>>
>>> Hi Wen Gu,
>>>
>>> I maintain the s390 specific PCI support in Linux and would like to
>>> provide a bit of background on this. You're surely wondering why we
>>> even have a copy in there for our ISM virtual PCI device. To understand
>>> why this copy operation exists and why we need to keep it working, one
>>> needs a bit of s390 aka mainframe background.
>>>
>>> On s390 all (currently supported) native machines have a mandatory
>>> machine level hypervisor. All OSs whether z/OS or Linux run either on
>>> this machine level hypervisor as so called Logical Partitions (LPARs)
>>> or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>>> in turn runs in an LPAR. Now, in terms of memory this machine level
>>> hypervisor sometimes called PR/SM unlike KVM, z/VM, or VMWare is a
>>> partitioning hypervisor without paging. This is one of the main reasons
>>> for the very-near-native performance of the machine hypervisor as the
>>> memory of its guests acts just like native RAM on other systems. It is
>>> never paged out and always accessible to IOMMU translated DMA from
>>> devices without the need for pinning pages and besides a trivial
>>> offset/limit adjustment an LPAR's MMU does the same amount of work as
>>> an MMU on a bare metal x86_64/ARM64 box.
>>>
>>> It also means however that when SMC-D is used to communicate between
>>> LPARs via an ISM device there is no way of mapping the DMBs to the
>>> same physical memory as there exists no MMU-like layer spanning
>>> partitions that could do such a mapping. Meanwhile for machine level
>>> firmware including the ISM virtual PCI device it is still possible to
>>> _copy_ memory between different memory partitions. So yeah while I do
>>> see the appeal of skipping the memcpy() for loopback or even between
>>> guests of a paging hypervisor such as KVM, which can map the DMBs on
>>> the same physical memory, we must keep in mind this original use case
>>> requiring a copy operation.
>>>
>>> Thanks,
>>> Niklas
>>>
>>
>> Hi Niklas,
>>
>> Thank you so much for the complete and detailed explanation! This provides
>> me a brand new perspective of s390 device that we hadn't dabbled in before.
>> Now I understand why shared memory is unavailable between different LPARs.
>>
>> Our original intention of proposing loopback device and the incoming device
>> (virtio-ism) for inter-VM is to use SMC-D to accelerate communication in the
>> case with no existing s390 ISM devices. In our conception, s390 ISM device,
>> loopback device and virtio-ism device are parallel and are abstracted by smcd_ops.
>>
>> +------------------------+
>> | SMC-D |
>> +------------------------+
>> -------- smcd_ops ---------
>> +------+ +------+ +------+
>> | s390 | | loop | |virtio|
>> | ISM | | back | | -ism |
>> | dev | | dev | | dev |
>> +------+ +------+ +------+
>>
>> We also believe that keeping the existing design and behavior of s390 ISM
>> device is unshaken. What we want to get support for is some smcd_ops extension
>> for devices with optional beneficial capability, such as nocopy here (Let's call
>> it this for now), which is really helpful for us in inter-process and inter-VM
>> scenario.
>>
>> And coincided with IBM's intention to add APIs between SMC-D and devices to
>> support various devices for SMC-D, as mentioned in [2], we send out this RFC and
>> the incoming virio-ism RFC, to provide some examples.
>>
>>>>
>>>> # Benchmark Test
>>>>
>>>> * Test environments:
>>>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>>> - SMC sndbuf/RMB size 1MB.
>>>>
>>>> * Test object:
>>>> - TCP: run on TCP loopback.
>>>> - domain: run on UNIX domain.
>>>> - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
>>>> - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>>>>
>>>> 1. ipc-benchmark (see [3])
>>>>
>>>> - ./<foo> -c 1000000 -s 100
>>>>
>>>> TCP domain SMC-lo SMC-lo-nocpy
>>>> Message
>>>> rate (msg/s) 75140 129548(+72.41) 152266(+102.64%) 151914(+102.17%)
>>>
>>> Interesting that it does beat UNIX domain sockets. Also, see my below
>>> comment for nginx/wrk as this seems very similar.
>>>
>>>>
>>>> 2. sockperf
>>>>
>>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>>>
>>>> TCP SMC-lo SMC-lo-nocpy
>>>> Bandwidth(MBps) 4943.359 4936.096(-0.15%) 8239.624(+66.68%)
>>>> Latency(us) 6.372 3.359(-47.28%) 3.25(-49.00%)
>>>>
>>>> 3. iperf3
>>>>
>>>> - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>>> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>>>
>>>> TCP SMC-lo SMC-lo-nocpy
>>>> Bitrate(Gb/s) 40.5 41.4(+2.22%) 76.4(+88.64%)
>>>>
>>>> 4. nginx/wrk
>>>>
>>>> - serv: <smc_run> nginx
>>>> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>>>
>>>> TCP SMC-lo SMC-lo-nocpy
>>>> Requests/s 154643.22 220894.03(+42.84%) 226754.3(+46.63%)
>>>
>>>
>>> This result is very interesting indeed. So with the much more realistic
>>> nginx/wrk workload it seems to copy hurts much less than the
>>> iperf3/sockperf would suggest while SMC-D itself seems to help more.
>>> I'd hope that this translates to actual applications as well. Maybe
>>> this makes SMC-D based loopback interesting even while keeping the
>>> copy, at least until we can come up with a sane way to work a no-copy
>>> variant into SMC-D?
>>>
>>
>> I agree, nginx/wrk workload is much more realistic for many applications.
>>
>> But we also encounter many other cases similar to sockperf on the cloud, which
>> requires high throughput, such as AI training and big data.
>>
>> So avoidance of copying between DMBs can help these cases a lot :)
>>
>>>>
>>>>
>>>> # Discussion
>>>>
>>>> 1. API between SMC-D and ISM device
>>>>
>>>> As Jan mentioned in [2], IBM are working on placing an API between SMC-D
>>>> and the ISM device for easier use of different "devices" for SMC-D.
>>>>
>>>> So, considering that the introduction of attach_dmb or detach_dmb can
>>>> effectively avoid data copying from sndbuf to RMB and brings obvious
>>>> throughput advantages in inter-VM or inter-process scenarios, can the
>>>> attach/detach semantics be taken into consideration when designing the
>>>> API to make it a standard ISM device behavior?
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> Due to the reasons explained above this behavior can't be emulated by
>>> ISM devices at least not when crossing partitions. Not sure if we can
>>> still incorporate it in the API and allow for both copying and
>>> remapping SMC-D like devices, it definitely needs careful consideration
>>> and I think also a better understanding of the benefit for real world
>>> workloads.
>>>
>>
>> Here I am not rigorous.
>>
>> Nocopy shouldn't be a standard ISM device behavior indeed. Actually we hope it be a
>> standard optional _SMC-D_ device behavior and defined by smcd_ops.
>>
>> For devices don't support these options, like ISM device on s390 architecture,
>> .attach_dmb/.detach_dmb and other reasonable extensions (which will be proposed to
>> discuss in incoming virtio-ism RFC) can be set to NULL or return invalid. And for
>> devices do support, they may be used for improving performance in some cases.
>>
>> In addition, can I know more latest news about the API design? :) , like its scale, will
>> it be a almost refactor of existing interface or incremental patching? and its object,
>> will it be tailored for exact ISM behavior or to reserve some options for other devices,
>> like nocopy here? From my understanding of [2], it might be the latter?
>>
>>>>
>>>> Maybe our RFC of SMC-D based inter-process acceleration (this one) and
>>>> inter-VM acceleration (will coming soon, which is the update of [1])
>>>> can provide some examples for new API design. And we are very glad to
>>>> discuss this on the mail list.
>>>>
>>>> 2. Way to select different ISM-like devices
>>>>
>>>> With the proposal of SMC-D loopback 'device' (this RFC) and incoming
>>>> device used for inter-VM acceleration as update of [1], SMC-D has more
>>>> options to choose from. So we need to consider that how to indicate
>>>> supported devices, how to determine which one to use, and their priority...
>>>
>>> Agree on this part, though it is for the SMC maintainers to decide, I
>>> think we would definitely want to be able to use any upcoming inter-VM
>>> devices on s390 possibly also in conjunction with ISM devices for
>>> communication across partitions.
>>>
>>
>> Yes, this part needs to be discussed with SMC maintainers. And thank you, we are very glad
>> if our devices can be applied on s390 through the efforts.
>>
>>
>> Best Regards,
>> Wen Gu
>>
>>>>
>>>> IMHO, this may require an update of CLC message and negotiation mechanism.
>>>> Again, we are very glad to discuss this with you on the mailing list.
>
> As described in the
> SMC protocol (including SMC-D): https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202_2.pdf
> the CLC messages provide a list of up to 8 ISM devices to choose from.
> So I would hope that we can use the existing protocol.
>
> The challenge will be to define GID (Global Interface ID) and CHID (a fabric ID) in
> a meaningful way for the new devices.
> There is always smcd_ops->query_remote_gid() as a safety net. But the idea is that
> a CHID mismatch is a fast way to tell that these 2 interfaces do not match.
>
>
Hi Winter and all,
Thanks for your reply and suggestions! And sorry for my late reply; it took me
some time to understand the SMC-Dv2 protocol and implementation.
I agree with you. The existing SMC-Dv2 protocol, whose CLC messages include an
ism_dev[] list, can solve the device negotiation problem. And I am very willing to use
the existing protocol, because we all know that a protocol update is a long and complex
process.
If I understand correctly, the SMC-D loopback (dummy) device can work with the existing
SMC-Dv2 protocol as follows. If there is any mistake, please point it out.
# Initialization
- Initialize the loopback device with a unique GID [Q-1].
- Register the loopback device as an SMC-Dv2-capable device with a system_eid whose 24th
or 28th byte is non-zero [Q-2], so that the system's smc_ism_v2_capable will be set
to TRUE and SMC-Dv2 is available.
# Proposal
- Find the loopback device from the smcd_dev_list in smc_find_ism_v2_device_clnt();
- Record the SEID, GID and CHID [Q-3] of the loopback device in the v2 extension part of
the CLC proposal message.
# Accept
- Check the GID/CHID list and SEID in the CLC proposal message, and find a matching local
ISM device from smcd_dev_list in smc_find_ism_v2_device_serv(). If both sides of the
communication are in the same VM and share the same loopback device, the SEID, GID and
CHID will match and the loopback device will be chosen [Q-4].
- Record the loopback device's GID/CHID and the matched SEID in the CLC accept message.
# Confirm
- Confirm the server-selected device (the loopback device) according to the CLC accept message.
- Record the loopback device's GID/CHID and the server-selected SEID in the CLC confirm message.
Following the above process, I have supplemented a patch based on this RFC in the email
attachment. With the attached patch, SMC-D loopback will switch to the SMC-Dv2 protocol.
In the above process, there are some points I would like to consult and discuss, marked
with '[Q-*]' in the description above.
# [Q-1]:
The GID of the loopback device is randomly generated in this RFC patch set, but I will find
a way to make the GID unique in the formal patches. Any suggestions are welcome.
# [Q-2]:
In the Linux implementation, the system_eid of the first registered smcd device determines
the system's smc_ism_v2_capable (see smcd_register_dev()).
And I wonder:
1) How should the system_eid be defined? It can be inferred from the code that the 24th and
28th bytes are special for SMC-Dv2. So in the attached patch, I define the loopback device SEID as
static struct smc_lo_systemeid LO_SYSTEM_EID = {
        .seid_string = "SMC-SYSZ-LOSEID000000000",
        .serial_number = "1000",
        .type = "1000",
};
Is there anything else I need to pay attention to?
2) It seems that only the first added smcd device determines the system's smc_ism_v2_capable?
If two different smcd devices, with v1-indicated and v2-indicated system_eid respectively,
are registered, will the order in which they are registered affect the result of
smc_ism_v2_capable?
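As a small self-contained model of the capability check discussed here (byte positions follow the description above and are 0-indexed; treat the exact rule as an assumption to verify against smcd_register_dev()):

```c
#include <stdbool.h>

/* A 32-byte SEID: 24-byte seid_string + 4-byte serial_number + 4-byte type.
 * The device indicates SMC-Dv2 capability when byte 24 or byte 28
 * (0-indexed) is not the character '0'. */
static bool seid_indicates_v2(const char seid[32])
{
	return seid[24] != '0' || seid[28] != '0';
}
```

With the LO_SYSTEM_EID proposed above, byte 24 is the first character of serial_number ("1000"), so the check would report v2 capability.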
# [Q-3]:
In the attached patch, I define a special CHID (0xFFFF) for the loopback device, as a kind
of 'unassociated ISM CHID' that is not associated with any IP (OSA or HiperSockets)
interfaces. What is your opinion on this?
# [Q-4]:
In the current Linux implementation, the server selects the first successfully initialized
device among the candidates as the final choice in smc_find_ism_v2_device_serv().
for (i = 0; i < matches; i++) {
        ini->smcd_version = SMC_V2;
        ini->is_smcd = true;
        ini->ism_selected = i;
        rc = smc_listen_ism_init(new_smc, ini);
        if (rc) {
                smc_find_ism_store_rc(rc, ini);
                /* try next active ISM device */
                continue;
        }
        return; /* matching and usable V2 ISM device found */
}
IMHO, maybe the candidate devices should have different priorities? For example, the
loopback device may be preferred whenever it is available.
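The priority idea could be sketched like this (purely illustrative data structures, not the kernel's): prefer a usable loopback device among the matched candidates, falling back to the first other usable one.

```c
#include <stdbool.h>

/* Hypothetical candidate record; init_ok models whether a call like
 * smc_listen_ism_init() would succeed for this device. */
struct candidate {
	bool is_loopback;
	bool init_ok;
};

static int pick_device(const struct candidate *c, int matches)
{
	int fallback = -1;

	for (int i = 0; i < matches; i++) {
		if (!c[i].init_ok)
			continue;
		if (c[i].is_loopback)
			return i;	/* loopback preferred */
		if (fallback < 0)
			fallback = i;	/* first usable non-loopback */
	}
	return fallback;	/* -1 if nothing usable */
}
```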
Best Regards,
Wen Gu
>>>>
>>>> [1] https://lore.kernel.org/netdev/[email protected]/
>>>> [2] https://lore.kernel.org/netdev/[email protected]/
>>>> [3] https://github.com/goldsborough/ipc-bench
>>>>
>>>> v1->v2
>>>> 1. Fix some build WARNINGs complained by kernel test rebot
>>>> Reported-by: kernel test robot <[email protected]>
>>>> 2. Add iperf3 test data.
>>>>
>>>> Wen Gu (5):
>>>> net/smc: introduce SMC-D loopback device
>>>> net/smc: choose loopback device in SMC-D communication
>>>> net/smc: add dmb attach and detach interface
>>>> net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
>>>> net/smc: logic of cursors update in SMC-D loopback connections
>>>>
>>>> include/net/smc.h | 3 +
>>>> net/smc/Makefile | 2 +-
>>>> net/smc/af_smc.c | 88 +++++++++++-
>>>> net/smc/smc_cdc.c | 59 ++++++--
>>>> net/smc/smc_cdc.h | 1 +
>>>> net/smc/smc_clc.c | 4 +-
>>>> net/smc/smc_core.c | 62 +++++++++
>>>> net/smc/smc_core.h | 2 +
>>>> net/smc/smc_ism.c | 39 +++++-
>>>> net/smc/smc_ism.h | 2 +
>>>> net/smc/smc_loopback.c | 358 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>> net/smc/smc_loopback.h | 63 +++++++++
>>>> 12 files changed, 662 insertions(+), 21 deletions(-)
>>>> create mode 100644 net/smc/smc_loopback.c
>>>> create mode 100644 net/smc/smc_loopback.h
>>>>
On 12.01.23 13:12, Wen Gu wrote:
>
>
> On 2023/1/5 00:09, Alexandra Winter wrote:
>>
>>
>> On 21.12.22 14:14, Wen Gu wrote:
>>>
>>>
>>> On 2022/12/20 22:02, Niklas Schnelle wrote:
>>>
>>>> On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>>>>> Hi, all
>>>>>
>>>>> # Background
>>>>>
>>>>> As previously mentioned in [1], we (Alibaba Cloud) are trying to
>>>>> use SMC
>>>>> to accelerate TCP applications in cloud environment, improving
>>>>> inter-host
>>>>> or inter-VM communication.
>>>>>
>>>>> In addition of these, we also found the value of SMC-D in scenario
>>>>> of local
>>>>> inter-process communication, such as accelerate communication
>>>>> between containers
>>>>> within the same host. So this RFC tries to provide a SMC-D loopback
>>>>> solution
>>>>> in such scenario, to bring a significant improvement in latency and
>>>>> throughput
>>>>> compared to TCP loopback.
>>>>>
>>>>> # Design
>>>>>
>>>>> This patch set provides a kind of SMC-D loopback solution.
>>>>>
>>>>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing
>>>>> for the
>>>>> inter-process communication acceleration. Except for loopback
>>>>> acceleration,
>>>>> the dummy device can also meet the requirements mentioned in [2],
>>>>> which is
>>>>> providing a way to test SMC-D logic for broad community without ISM
>>>>> device.
>>>>>
>>>>> +------------------------------------------+
>>>>> | +-----------+ +-----------+ |
>>>>> | | process A | | process B | |
>>>>> | +-----------+ +-----------+ |
>>>>> | ^ ^ |
>>>>> | | +---------------+ | |
>>>>> | | | SMC stack | | |
>>>>> | +--->| +-----------+ |<--| |
>>>>> | | | dummy | | |
>>>>> | | | device | | |
>>>>> | +-+-----------+-+ |
>>>>> | VM |
>>>>> +------------------------------------------+
>>>>>
>>>>> Patch #3/5, #4/5, #5/5 provide a way to avoid data copy from
>>>>> sndbuf to RMB
>>>>> and improve SMC-D loopback performance. By extending smcd_ops
>>>>> with two
>>>>> new semantics: attach_dmb and detach_dmb, the sender's sndbuf shares the
>>>>> same
>>>>> physical memory region with receiver's RMB. The data copied from
>>>>> userspace
>>>>> to sender's sndbuf directly reaches the receiver's RMB without
>>>>> unnecessary
>>>>> memory copy in the same kernel.
>>>>>
>>>>> +----------+ +----------+
>>>>> | socket A | | socket B |
>>>>> +----------+ +----------+
>>>>> | ^
>>>>> | +---------+ |
>>>>> regard as | | ----------|
>>>>> local sndbuf | B's | regard as
>>>>> | | RMB | local RMB
>>>>> |-------> | |
>>>>> +---------+
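The attach semantics sketched in the diagram above can be illustrated with a few lines of self-contained userspace C. The types and function names below (`struct dmb`, `lo_attach_dmb`, `lo_detach_dmb`) are illustrative stand-ins, not the proposed smcd_ops signatures:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for a DMB descriptor; not the kernel's type. */
struct dmb {
	unsigned char *cpu_addr;
	size_t len;
};

/* attach: make the local sndbuf descriptor alias the peer RMB's
 * backing memory, so a copy into the sndbuf lands in the RMB. */
static int lo_attach_dmb(struct dmb *sndbuf, const struct dmb *peer_rmb)
{
	sndbuf->cpu_addr = peer_rmb->cpu_addr;
	sndbuf->len = peer_rmb->len;
	return 0;
}

/* detach: drop the alias; the peer RMB itself is untouched. */
static void lo_detach_dmb(struct dmb *sndbuf)
{
	sndbuf->cpu_addr = NULL;
	sndbuf->len = 0;
}
```

After a successful attach, the single copy from userspace into the sndbuf is simultaneously the delivery into the peer's RMB, which is the whole point of patches #3/5-#5/5.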
>>>>
>>>> Hi Wen Gu,
>>>>
>>>> I maintain the s390 specific PCI support in Linux and would like to
>>>> provide a bit of background on this. You're surely wondering why we
>>>> even have a copy in there for our ISM virtual PCI device. To understand
>>>> why this copy operation exists and why we need to keep it working, one
>>>> needs a bit of s390 aka mainframe background.
>>>>
>>>> On s390 all (currently supported) native machines have a mandatory
>>>> machine level hypervisor. All OSs whether z/OS or Linux run either on
>>>> this machine level hypervisor as so called Logical Partitions (LPARs)
>>>> or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>>>> in turn runs in an LPAR. Now, in terms of memory this machine level
>>>> hypervisor sometimes called PR/SM unlike KVM, z/VM, or VMWare is a
>>>> partitioning hypervisor without paging. This is one of the main reasons
>>>> for the very-near-native performance of the machine hypervisor as the
>>>> memory of its guests acts just like native RAM on other systems. It is
>>>> never paged out and always accessible to IOMMU translated DMA from
>>>> devices without the need for pinning pages and besides a trivial
>>>> offset/limit adjustment an LPAR's MMU does the same amount of work as
>>>> an MMU on a bare metal x86_64/ARM64 box.
>>>>
>>>> It also means however that when SMC-D is used to communicate between
>>>> LPARs via an ISM device there is no way of mapping the DMBs to the
>>>> same physical memory as there exists no MMU-like layer spanning
>>>> partitions that could do such a mapping. Meanwhile for machine level
>>>> firmware including the ISM virtual PCI device it is still possible to
>>>> _copy_ memory between different memory partitions. So yeah while I do
>>>> see the appeal of skipping the memcpy() for loopback or even between
>>>> guests of a paging hypervisor such as KVM, which can map the DMBs on
>>>> the same physical memory, we must keep in mind this original use case
>>>> requiring a copy operation.
>>>>
>>>> Thanks,
>>>> Niklas
>>>>
>>>
>>> Hi Niklas,
>>>
>>> Thank you so much for the complete and detailed explanation! This
>>> provides
>>> me a brand new perspective of s390 device that we hadn't dabbled in
>>> before.
>>> Now I understand why shared memory is unavailable between different
>>> LPARs.
>>>
>>> Our original intention of proposing loopback device and the incoming
>>> device
>>> (virtio-ism) for inter-VM is to use SMC-D to accelerate communication
>>> in the
>>> case with no existing s390 ISM devices. In our conception, s390 ISM
>>> device,
>>> loopback device and virtio-ism device are parallel and are abstracted
>>> by smcd_ops.
>>>
>>> +------------------------+
>>> | SMC-D |
>>> +------------------------+
>>> -------- smcd_ops ---------
>>> +------+ +------+ +------+
>>> | s390 | | loop | |virtio|
>>> | ISM | | back | | -ism |
>>> | dev | | dev | | dev |
>>> +------+ +------+ +------+
>>>
>>> We also believe that the existing design and behavior of the s390 ISM
>>> device should remain untouched. What we want to get support for is some
>>> smcd_ops extension for devices with optional beneficial capabilities,
>>> such as nocopy here (let's call it this for now), which is really
>>> helpful for us in inter-process and inter-VM scenarios.
>>>
>>> And coinciding with IBM's intention to add APIs between SMC-D and
>>> devices to support various devices for SMC-D, as mentioned in [2], we
>>> send out this RFC and the incoming virtio-ism RFC, to provide some
>>> examples.
>>>
>>>>>
>>>>> # Benchmark Test
>>>>>
>>>>> * Test environments:
>>>>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>>>> - SMC sndbuf/RMB size 1MB.
>>>>>
>>>>> * Test object:
>>>>> - TCP: run on TCP loopback.
>>>>> - domain: run on UNIX domain.
>>>>> - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
>>>>> - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>>>>>
>>>>> 1. ipc-benchmark (see [3])
>>>>>
>>>>> - ./<foo> -c 1000000 -s 100
>>>>>
>>>>>                  TCP           domain            SMC-lo            SMC-lo-nocpy
>>>>> Message
>>>>> rate (msg/s)     75140         129548(+72.41%)   152266(+102.64%)  151914(+102.17%)
>>>>
>>>> Interesting that it does beat UNIX domain sockets. Also, see my below
>>>> comment for nginx/wrk as this seems very similar.
>>>>
>>>>>
>>>>> 2. sockperf
>>>>>
>>>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp
>>>>> --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>>>>
>>>>>                  TCP           SMC-lo            SMC-lo-nocpy
>>>>> Bandwidth(MBps)  4943.359      4936.096(-0.15%)  8239.624(+66.68%)
>>>>> Latency(us)      6.372         3.359(-47.28%)    3.25(-49.00%)
>>>>>
>>>>> 3. iperf3
>>>>>
>>>>> - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>>>> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>>>>
>>>>>                  TCP           SMC-lo            SMC-lo-nocpy
>>>>> Bitrate(Gb/s)    40.5          41.4(+2.22%)      76.4(+88.64%)
>>>>>
>>>>> 4. nginx/wrk
>>>>>
>>>>> - serv: <smc_run> nginx
>>>>> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>>>>
>>>>>                  TCP           SMC-lo              SMC-lo-nocpy
>>>>> Requests/s       154643.22     220894.03(+42.84%)  226754.3(+46.63%)
>>>>
>>>>
>>>> This result is very interesting indeed. So with the much more realistic
>>>> nginx/wrk workload it seems the copy hurts much less than
>>>> iperf3/sockperf would suggest, while SMC-D itself seems to help more.
>>>> I'd hope that this translates to actual applications as well. Maybe
>>>> this makes SMC-D based loopback interesting even while keeping the
>>>> copy, at least until we can come up with a sane way to work a no-copy
>>>> variant into SMC-D?
>>>>
>>>
>>> I agree, nginx/wrk workload is much more realistic for many
>>> applications.
>>>
>>> But we also encounter many other cases similar to sockperf on the
>>> cloud, which
>>> require high throughput, such as AI training and big data.
>>>
>>> So avoidance of copying between DMBs can help these cases a lot :)
>>>
>>>>>
>>>>>
>>>>> # Discussion
>>>>>
>>>>> 1. API between SMC-D and ISM device
>>>>>
>>>>> As Jan mentioned in [2], IBM are working on placing an API between
>>>>> SMC-D
>>>>> and the ISM device for easier use of different "devices" for SMC-D.
>>>>>
>>>>> So, considering that the introduction of attach_dmb or detach_dmb can
>>>>> effectively avoid data copying from sndbuf to RMB and brings obvious
>>>>> throughput advantages in inter-VM or inter-process scenarios, can the
>>>>> attach/detach semantics be taken into consideration when designing the
>>>>> API to make it a standard ISM device behavior?
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>
>>>> Due to the reasons explained above this behavior can't be emulated by
>>>> ISM devices at least not when crossing partitions. Not sure if we can
>>>> still incorporate it in the API and allow for both copying and
>>>> remapping SMC-D like devices, it definitely needs careful consideration
>>>> and I think also a better understanding of the benefit for real world
>>>> workloads.
>>>>
>>>
>>> I was not rigorous here.
>>>
>>> Nocopy shouldn't be a standard ISM device behavior indeed. Actually,
>>> we hope it can be a standard optional _SMC-D_ device behavior, defined
>>> by smcd_ops.
>>>
>>> For devices that don't support these options, like the ISM device on
>>> s390 architecture, .attach_dmb/.detach_dmb and other reasonable
>>> extensions (which will be proposed for discussion in the incoming
>>> virtio-ism RFC) can be set to NULL or return an error. And for devices
>>> that do support them, they may be used to improve performance in some
>>> cases.
>>>
>>> In addition, may I know the latest news about the API design? :) e.g.
>>> its scale: will it be an almost complete refactor of the existing
>>> interface or incremental patching? And its object: will it be tailored
>>> to exact ISM behavior, or will it reserve some options for other
>>> devices, like nocopy here? From my understanding of [2], it might be
>>> the latter?
>>>
>>>>>
>>>>> Maybe our RFC of SMC-D based inter-process acceleration (this one) and
>>>>> inter-VM acceleration (coming soon, as an update of [1])
>>>>> can provide some examples for new API design. And we are very glad to
>>>>> discuss this on the mail list.
>>>>>
>>>>> 2. Way to select different ISM-like devices
>>>>>
>>>>> With the proposal of SMC-D loopback 'device' (this RFC) and incoming
>>>>> device used for inter-VM acceleration as update of [1], SMC-D has more
>>>>> options to choose from. So we need to consider how to indicate
>>>>> supported devices, how to determine which one to use, and their
>>>>> priority...
>>>>
>>>> Agree on this part, though it is for the SMC maintainers to decide, I
>>>> think we would definitely want to be able to use any upcoming inter-VM
>>>> devices on s390 possibly also in conjunction with ISM devices for
>>>> communication across partitions.
>>>>
>>>
>>> Yes, this part needs to be discussed with SMC maintainers. And thank
>>> you, we would be very glad
>>> if our devices can be applied on s390 through these efforts.
>>>
>>>
>>> Best Regards,
>>> Wen Gu
>>>
>>>>>
>>>>> IMHO, this may require an update of CLC message and negotiation
>>>>> mechanism.
>>>>> Again, we are very glad to discuss this with you on the mailing list.
>>
>> As described in
>> SMC protocol (including SMC-D):
>> https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202_2.pdf
>> the CLC messages provide a list of up to 8 ISM devices to choose from.
>> So I would hope that we can use the existing protocol.
>>
>> The challenge will be to define GID (Global Interface ID) and CHID (a
>> fabric ID) in
>> a meaningful way for the new devices.
>> There is always smcd_ops->query_remote_gid() as a safety net. But the
>> idea is that
>> a CHID mismatch is a fast way to tell that these 2 interfaces do not match.
>>
>>
>
FYI, we just sent the rest of the API to net-next:
https://lore.kernel.org/netdev/[email protected]/T/#t,
which should answer some questions in your patch series.
> Hi Winter and all,
>
> Thanks for your reply and suggestions! And sorry for my late reply
> because it took me
> some time to understand SMC-Dv2 protocol and implementation.
>
> I agree with your opinion. The existing SMC-Dv2 protocol whose CLC
> messages include
> ism_dev[] list can solve the devices negotiation problem. And I am very
> willing to use
> the existing protocol, because we all know that the protocol update is a
> long and complex
> process.
>
> If I understand correctly, SMC-D loopback(dummy) device can coordinate
> with existing
> SMC-Dv2 protocol as follows. If there is any mistake, please point out.
>
>
> # Initialization
>
> - Initialize the loopback device with a unique GID [Q-1].
>
> - Register the loopback device as SMC-Dv2-capable device with a
> system_eid whose 24th
> or 28th byte is non-zero [Q-2], so that this system's
> smc_ism_v2_capable will be set
> to TRUE and SMC-Dv2 is available.
>
The decision point is the VLAN_ID: if it is 0x1FFF, the device will
support V2. I.e., if you can have a subnet with VLAN_ID 0x1FFF, then the
SEID is necessary, so that the serial number or type is non-zero. (*1)
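The SEID byte check at stake here can be shown with a small self-contained helper. `seid_indicates_v2()` mirrors the test in smcd_register_dev(), and `build_seid()` is a hypothetical helper for assembling the 32-byte SEID from its three fields; neither is a kernel function:

```c
#include <assert.h>
#include <string.h>

/* An SEID is 32 bytes: a 24-byte string, a 4-byte serial number and a
 * 4-byte type. smcd_register_dev() treats a non-'0' byte at offset 24
 * (first serial byte) or 28 (first type byte) as SMC-Dv2-capable. */
static int seid_indicates_v2(const char seid[32])
{
	return seid[24] != '0' || seid[28] != '0';
}

/* Assemble an SEID from its three fields (hypothetical helper). */
static void build_seid(char out[32], const char *str24,
		       const char *serial4, const char *type4)
{
	memcpy(out, str24, 24);
	memcpy(out + 24, serial4, 4);
	memcpy(out + 28, type4, 4);
}
```

With the loopback SEID proposed later in this thread (serial "1000", type "1000"), byte 24 is '1', so the device registers as v2-capable.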
>
> # Proposal
>
> - Find the loopback device from the smcd_dev_list in
> smc_find_ism_v2_device_clnt();
>
> - Record the SEID, GID and CHID[Q-3] of loopback device in the v2
> extension part of CLC
> proposal message.
>
>
> # Accept
>
> - Check the GID/CHID list and SEID in CLC proposal message, and find
> local matched ISM
> device from smcd_dev_list in smc_find_ism_v2_device_serv(). If both
> sides of the
> communication are in the same VM and share the same loopback device,
> the SEID, GID and
> CHID will match and loopback device will be chosen [Q-4].
>
> - Record the loopback device's GID/CHID and matched SEID into CLC accept
> message.
>
>
> # Confirm
>
> - Confirm the server-selected device (loopback device) according to CLC
> accept messages.
>
> - Record the loopback device's GID/CHID and server-selected SEID in CLC
> confirm message.
>
>
> Following the above process, I supplemented a patch based on this RFC in the
> email attachment.
> With the attachment patch, SMC-D loopback will switch to use SMC-Dv2
> protocol.
>
>
>
> And in the above process, there are some things I want to consult and
> discuss, which are marked
> with '[Q-*]' in the above description.
>
> # [Q-1]:
>
> The GID of the loopback device is randomly generated in this RFC patch
> set, but I will find a way
> to make the GID unique in the formal patches. Any suggestions are welcome.
>
I think the randomly generated GID is fine in your case; it is
equivalent to the IP address.
>
> # [Q-2]:
>
> In Linux implementation, the system_eid of the first registered smcd
> device will determine the
> system's smc_ism_v2_capable (see smcd_register_dev()).
>
> And I wonder that
>
> 1) How to define the system_eid? It can be inferred from the code that
> the 24th and 28th bytes
> are special for SMC-Dv2. So in the attachment patch, I define the
> loopback device SEID as
>
>     static struct smc_lo_systemeid LO_SYSTEM_EID = {
>         .seid_string = "SMC-SYSZ-LOSEID000000000",
>         .serial_number = "1000",
>         .type = "1000",
>     };
>
> Is there anything else I need to pay attention to?
>
If you just want to use V2, such a definition looks good.
E.g. you can use some unique information from "lshw".
>
> 2) It seems only the first added smcd device determines the system's
> smc_ism_v2_capable? If two
> different smcd devices come respectively with v1-indicated and
> v2-indicated system_eid, will
> the order in which they are registered affect the result of
> smc_ism_v2_capable ?
>
see (*1)
>
> # [Q-3]:
>
> In attachment patch, I define a special CHID (0xFFFF) for loopback
> device, as a kind of
> 'unassociated ISM CHID' that not associated with any IP (OSA or
> HiperSockets) interfaces.
>
> What's your opinion about this?
>
It looks good to me.
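As a sketch of how a reserved loopback CHID would take part in device matching, here is a hedged userspace mock; the struct and field names are assumptions, and only the 0xFFFF value comes from the proposal above:

```c
#include <assert.h>
#include <stdint.h>

/* Proposed reserved CHID for the loopback device ("unassociated"). */
#define SMCD_LO_CHID 0xFFFF

/* Illustrative identifier pair; field names are assumptions. */
struct smcd_id {
	uint16_t chid;
	uint64_t gid;
};

/* CHID mismatch is the cheap early reject; on the same fabric the
 * interfaces match iff the GIDs are equal. */
static int smcd_id_match(const struct smcd_id *a, const struct smcd_id *b)
{
	if (a->chid != b->chid)
		return 0;
	return a->gid == b->gid;
}
```

This also shows why GID uniqueness ([Q-1]) matters: with a shared reserved CHID, the GID comparison is the only thing distinguishing two loopback devices.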
>
> # [Q-4]:
>
> In current Linux implementation, server will select the first
> successfully initialized device
> from the candidates as the final selected one in
> smc_find_ism_v2_device_serv().
>
>     for (i = 0; i < matches; i++) {
>         ini->smcd_version = SMC_V2;
>         ini->is_smcd = true;
>         ini->ism_selected = i;
>         rc = smc_listen_ism_init(new_smc, ini);
>         if (rc) {
>             smc_find_ism_store_rc(rc, ini);
>             /* try next active ISM device */
>             continue;
>         }
>         return; /* matching and usable V2 ISM device found */
>     }
>
> IMHO, maybe candidate devices should have different priorities? For
> example, the loopback device
> may be preferred to use if loopback is available.
>
IMO, I'd prefer such an order: ISM -> loopback -> RoCE,
because ISM for SMC-D is our standard use case, not loopback.
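The suggested ordering (ISM first, then loopback) could be expressed as a priority sort over the matched SMC-D candidates before the server's selection loop; the RoCE/SMC-R fallback happens outside this list anyway. This is a userspace sketch with invented names, not kernel code:

```c
#include <assert.h>
#include <stdlib.h>

/* Invented device kinds; a lower value means a higher priority
 * (ISM before loopback, virtio-ism after). */
enum smcd_dev_kind {
	SMCD_KIND_ISM = 0,
	SMCD_KIND_LOOPBACK = 1,
	SMCD_KIND_VIRTIO = 2,
};

static int smcd_prio_cmp(const void *a, const void *b)
{
	return (int)*(const enum smcd_dev_kind *)a -
	       (int)*(const enum smcd_dev_kind *)b;
}

/* Rank the matched candidates before the server's selection loop,
 * so "first successfully initialized" becomes "highest priority
 * that initializes successfully". */
static void sort_candidates(enum smcd_dev_kind *kinds, size_t n)
{
	qsort(kinds, n, sizeof(*kinds), smcd_prio_cmp);
}
```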
>
> Best Regards,
> Wen Gu
>
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/netdev/[email protected]/
>>>>> [2]
>>>>> https://lore.kernel.org/netdev/[email protected]/
>>>>> [3] https://github.com/goldsborough/ipc-bench
>>>>>
>>>>> v1->v2
>>>>> 1. Fix some build WARNINGs complained by kernel test rebot
>>>>> Reported-by: kernel test robot <[email protected]>
>>>>> 2. Add iperf3 test data.
>>>>>
>>>>> Wen Gu (5):
>>>>> net/smc: introduce SMC-D loopback device
>>>>> net/smc: choose loopback device in SMC-D communication
>>>>> net/smc: add dmb attach and detach interface
>>>>> net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
>>>>> net/smc: logic of cursors update in SMC-D loopback connections
>>>>>
>>>>> include/net/smc.h | 3 +
>>>>> net/smc/Makefile | 2 +-
>>>>> net/smc/af_smc.c | 88 +++++++++++-
>>>>> net/smc/smc_cdc.c | 59 ++++++--
>>>>> net/smc/smc_cdc.h | 1 +
>>>>> net/smc/smc_clc.c | 4 +-
>>>>> net/smc/smc_core.c | 62 +++++++++
>>>>> net/smc/smc_core.h | 2 +
>>>>> net/smc/smc_ism.c | 39 +++++-
>>>>> net/smc/smc_ism.h | 2 +
>>>>> net/smc/smc_loopback.c | 358
>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> net/smc/smc_loopback.h | 63 +++++++++
>>>>> 12 files changed, 662 insertions(+), 21 deletions(-)
>>>>> create mode 100644 net/smc/smc_loopback.c
>>>>> create mode 100644 net/smc/smc_loopback.h
>>>>>
On 2023/1/16 19:01, Wenjia Zhang wrote:
>
>
> On 12.01.23 13:12, Wen Gu wrote:
>>
>>
>> On 2023/1/5 00:09, Alexandra Winter wrote:
>>>
>>>
>>> On 21.12.22 14:14, Wen Gu wrote:
>>>>
>>>>
>>>> On 2022/12/20 22:02, Niklas Schnelle wrote:
>>>>
>>>>> On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>>>>>> Hi, all
>>>>>>
>>>>>> # Background
>>>>>>
>>>>>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>>>>>> to accelerate TCP applications in cloud environment, improving inter-host
>>>>>> or inter-VM communication.
>>>>>>
>>>>>> In addition to these, we also found the value of SMC-D in the scenario of local
>>>>>> inter-process communication, such as accelerating communication between containers
>>>>>> within the same host. So this RFC tries to provide an SMC-D loopback solution
>>>>>> in such scenario, to bring a significant improvement in latency and throughput
>>>>>> compared to TCP loopback.
>>>>>>
>>>>>> # Design
>>>>>>
>>>>>> This patch set provides a kind of SMC-D loopback solution.
>>>>>>
>>>>>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for the
>>>>>> inter-process communication acceleration. Besides loopback acceleration,
>>>>>> the dummy device can also meet the requirements mentioned in [2], which is
>>>>>> providing a way to test SMC-D logic for broad community without ISM device.
>>>>>>
>>>>>> +------------------------------------------+
>>>>>> | +-----------+ +-----------+ |
>>>>>> | | process A | | process B | |
>>>>>> | +-----------+ +-----------+ |
>>>>>> | ^ ^ |
>>>>>> | | +---------------+ | |
>>>>>> | | | SMC stack | | |
>>>>>> | +--->| +-----------+ |<--| |
>>>>>> | | | dummy | | |
>>>>>> | | | device | | |
>>>>>> | +-+-----------+-+ |
>>>>>> | VM |
>>>>>> +------------------------------------------+
>>>>>>
>>>>>> Patch #3/5, #4/5, #5/5 provide a way to avoid data copy from sndbuf to RMB
>>>>>> and improve SMC-D loopback performance. By extending smcd_ops with two
>>>>>> new semantics: attach_dmb and detach_dmb, the sender's sndbuf shares the same
>>>>>> physical memory region with receiver's RMB. The data copied from userspace
>>>>>> to sender's sndbuf directly reaches the receiver's RMB without unnecessary
>>>>>> memory copy in the same kernel.
>>>>>>
>>>>>> +----------+ +----------+
>>>>>> | socket A | | socket B |
>>>>>> +----------+ +----------+
>>>>>> | ^
>>>>>> | +---------+ |
>>>>>> regard as | | ----------|
>>>>>> local sndbuf | B's | regard as
>>>>>> | | RMB | local RMB
>>>>>> |-------> | |
>>>>>> +---------+
>>>>>
>>>>> Hi Wen Gu,
>>>>>
>>>>> I maintain the s390 specific PCI support in Linux and would like to
>>>>> provide a bit of background on this. You're surely wondering why we
>>>>> even have a copy in there for our ISM virtual PCI device. To understand
>>>>> why this copy operation exists and why we need to keep it working, one
>>>>> needs a bit of s390 aka mainframe background.
>>>>>
>>>>> On s390 all (currently supported) native machines have a mandatory
>>>>> machine level hypervisor. All OSs whether z/OS or Linux run either on
>>>>> this machine level hypervisor as so called Logical Partitions (LPARs)
>>>>> or as second/third/… level guests on e.g. a KVM or z/VM hypervisor that
>>>>> in turn runs in an LPAR. Now, in terms of memory this machine level
>>>>> hypervisor sometimes called PR/SM unlike KVM, z/VM, or VMWare is a
>>>>> partitioning hypervisor without paging. This is one of the main reasons
>>>>> for the very-near-native performance of the machine hypervisor as the
>>>>> memory of its guests acts just like native RAM on other systems. It is
>>>>> never paged out and always accessible to IOMMU translated DMA from
>>>>> devices without the need for pinning pages and besides a trivial
>>>>> offset/limit adjustment an LPAR's MMU does the same amount of work as
>>>>> an MMU on a bare metal x86_64/ARM64 box.
>>>>>
>>>>> It also means however that when SMC-D is used to communicate between
>>>>> LPARs via an ISM device there is no way of mapping the DMBs to the
>>>>> same physical memory as there exists no MMU-like layer spanning
>>>>> partitions that could do such a mapping. Meanwhile for machine level
>>>>> firmware including the ISM virtual PCI device it is still possible to
>>>>> _copy_ memory between different memory partitions. So yeah while I do
>>>>> see the appeal of skipping the memcpy() for loopback or even between
>>>>> guests of a paging hypervisor such as KVM, which can map the DMBs on
>>>>> the same physical memory, we must keep in mind this original use case
>>>>> requiring a copy operation.
>>>>>
>>>>> Thanks,
>>>>> Niklas
>>>>>
>>>>
>>>> Hi Niklas,
>>>>
>>>> Thank you so much for the complete and detailed explanation! This provides
>>>> me a brand new perspective of s390 device that we hadn't dabbled in before.
>>>> Now I understand why shared memory is unavailable between different LPARs.
>>>>
>>>> Our original intention of proposing loopback device and the incoming device
>>>> (virtio-ism) for inter-VM is to use SMC-D to accelerate communication in the
>>>> case with no existing s390 ISM devices. In our conception, s390 ISM device,
>>>> loopback device and virtio-ism device are parallel and are abstracted by smcd_ops.
>>>>
>>>> +------------------------+
>>>> | SMC-D |
>>>> +------------------------+
>>>> -------- smcd_ops ---------
>>>> +------+ +------+ +------+
>>>> | s390 | | loop | |virtio|
>>>> | ISM | | back | | -ism |
>>>> | dev | | dev | | dev |
>>>> +------+ +------+ +------+
>>>>
>>>> We also believe that the existing design and behavior of the s390 ISM
>>>> device should remain untouched. What we want to get support for is some smcd_ops
>>>> extension for devices with optional beneficial capabilities, such as nocopy here (let's
>>>> call it this for now), which is really helpful for us in inter-process and inter-VM
>>>> scenarios.
>>>>
>>>> And coinciding with IBM's intention to add APIs between SMC-D and devices to
>>>> support various devices for SMC-D, as mentioned in [2], we send out this RFC and
>>>> the incoming virtio-ism RFC, to provide some examples.
>>>>
>>>>>>
>>>>>> # Benchmark Test
>>>>>>
>>>>>> * Test environments:
>>>>>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>>>>> - SMC sndbuf/RMB size 1MB.
>>>>>>
>>>>>> * Test object:
>>>>>> - TCP: run on TCP loopback.
>>>>>> - domain: run on UNIX domain.
>>>>>> - SMC lo: run on SMC loopback device with patch #1/5 ~ #2/5.
>>>>>> - SMC lo-nocpy: run on SMC loopback device with patch #1/5 ~ #5/5.
>>>>>>
>>>>>> 1. ipc-benchmark (see [3])
>>>>>>
>>>>>> - ./<foo> -c 1000000 -s 100
>>>>>>
>>>>>> TCP domain SMC-lo SMC-lo-nocpy
>>>>>> Message
>>>>>> rate (msg/s)     75140         129548(+72.41%)   152266(+102.64%)   151914(+102.17%)
>>>>>
>>>>> Interesting that it does beat UNIX domain sockets. Also, see my below
>>>>> comment for nginx/wrk as this seems very similar.
>>>>>
>>>>>>
>>>>>> 2. sockperf
>>>>>>
>>>>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>>>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i
>>>>>> 127.0.0.1 -t 30
>>>>>>
>>>>>> TCP SMC-lo SMC-lo-nocpy
>>>>>> Bandwidth(MBps) 4943.359 4936.096(-0.15%) 8239.624(+66.68%)
>>>>>> Latency(us) 6.372 3.359(-47.28%) 3.25(-49.00%)
>>>>>>
>>>>>> 3. iperf3
>>>>>>
>>>>>> - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>>>>> - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>>>>>
>>>>>> TCP SMC-lo SMC-lo-nocpy
>>>>>> Bitrate(Gb/s) 40.5 41.4(+2.22%) 76.4(+88.64%)
>>>>>>
>>>>>> 4. nginx/wrk
>>>>>>
>>>>>> - serv: <smc_run> nginx
>>>>>> - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>>>>>
>>>>>> TCP SMC-lo SMC-lo-nocpy
>>>>>> Requests/s 154643.22 220894.03(+42.84%) 226754.3(+46.63%)
>>>>>
>>>>>
>>>>> This result is very interesting indeed. So with the much more realistic
>>>>> nginx/wrk workload it seems the copy hurts much less than
>>>>> iperf3/sockperf would suggest, while SMC-D itself seems to help more.
>>>>> I'd hope that this translates to actual applications as well. Maybe
>>>>> this makes SMC-D based loopback interesting even while keeping the
>>>>> copy, at least until we can come up with a sane way to work a no-copy
>>>>> variant into SMC-D?
>>>>>
>>>>
>>>> I agree, nginx/wrk workload is much more realistic for many applications.
>>>>
>>>> But we also encounter many other cases similar to sockperf on the cloud, which
>>>> require high throughput, such as AI training and big data.
>>>>
>>>> So avoidance of copying between DMBs can help these cases a lot :)
>>>>
>>>>>>
>>>>>>
>>>>>> # Discussion
>>>>>>
>>>>>> 1. API between SMC-D and ISM device
>>>>>>
>>>>>> As Jan mentioned in [2], IBM are working on placing an API between SMC-D
>>>>>> and the ISM device for easier use of different "devices" for SMC-D.
>>>>>>
>>>>>> So, considering that the introduction of attach_dmb or detach_dmb can
>>>>>> effectively avoid data copying from sndbuf to RMB and brings obvious
>>>>>> throughput advantages in inter-VM or inter-process scenarios, can the
>>>>>> attach/detach semantics be taken into consideration when designing the
>>>>>> API to make it a standard ISM device behavior?
>>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>
>>>>> Due to the reasons explained above this behavior can't be emulated by
>>>>> ISM devices at least not when crossing partitions. Not sure if we can
>>>>> still incorporate it in the API and allow for both copying and
>>>>> remapping SMC-D like devices, it definitely needs careful consideration
>>>>> and I think also a better understanding of the benefit for real world
>>>>> workloads.
>>>>>
>>>>
>>>> I was not rigorous here.
>>>>
>>>> Nocopy shouldn't be a standard ISM device behavior indeed. Actually, we hope it can be a
>>>> standard optional _SMC-D_ device behavior, defined by smcd_ops.
>>>>
>>>> For devices that don't support these options, like the ISM device on s390 architecture,
>>>> .attach_dmb/.detach_dmb and other reasonable extensions (which will be proposed for
>>>> discussion in the incoming virtio-ism RFC) can be set to NULL or return an error. And for
>>>> devices that do support them, they may be used to improve performance in some cases.
>>>>
>>>> In addition, may I know the latest news about the API design? :) e.g. its scale:
>>>> will it be an almost complete refactor of the existing interface or incremental
>>>> patching? And its object: will it be tailored to exact ISM behavior, or will it
>>>> reserve some options for other devices, like nocopy here? From my understanding
>>>> of [2], it might be the latter?
>>>>
>>>>>>
>>>>>> Maybe our RFC of SMC-D based inter-process acceleration (this one) and
>>>>>> inter-VM acceleration (coming soon, as an update of [1])
>>>>>> can provide some examples for new API design. And we are very glad to
>>>>>> discuss this on the mail list.
>>>>>>
>>>>>> 2. Way to select different ISM-like devices
>>>>>>
>>>>>> With the proposal of SMC-D loopback 'device' (this RFC) and incoming
>>>>>> device used for inter-VM acceleration as update of [1], SMC-D has more
>>>>>> options to choose from. So we need to consider how to indicate
>>>>>> supported devices, how to determine which one to use, and their priority...
>>>>>
>>>>> Agree on this part, though it is for the SMC maintainers to decide, I
>>>>> think we would definitely want to be able to use any upcoming inter-VM
>>>>> devices on s390 possibly also in conjunction with ISM devices for
>>>>> communication across partitions.
>>>>>
>>>>
>>>> Yes, this part needs to be discussed with SMC maintainers. And thank you, we would be very glad
>>>> if our devices can be applied on s390 through these efforts.
>>>>
>>>>
>>>> Best Regards,
>>>> Wen Gu
>>>>
>>>>>>
>>>>>> IMHO, this may require an update of CLC message and negotiation mechanism.
>>>>>> Again, we are very glad to discuss this with you on the mailing list.
>>>
>>> As described in
>>> SMC protocol (including SMC-D):
>>> https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202_2.pdf
>>> the CLC messages provide a list of up to 8 ISM devices to choose from.
>>> So I would hope that we can use the existing protocol.
>>>
>>> The challenge will be to define GID (Global Interface ID) and CHID (a fabric ID) in
>>> a meaningful way for the new devices.
>>> There is always smcd_ops->query_remote_gid() as a safety net. But the idea is that
>>> a CHID mismatch is a fast way to tell that these 2 interfaces do not match.
>>>
>>>
>>
>
> FYI, we just sent the rest of the API to net-next:
> https://lore.kernel.org/netdev/[email protected]/T/#t,
> which should answer some questions in your patch series.
>
Hi Wenjia,
Thanks for the notification and your excellent work!
I will study the new API and see whether it fits our needs. If all goes well, I will
refactor my RFC based on the new API and send a v2, most likely after the Chinese New Year.
>
>> Hi Winter and all,
>>
>> Thanks for your reply and suggestions! And sorry for my late reply because it took me
>> some time to understand SMC-Dv2 protocol and implementation.
>>
>> I agree with your opinion. The existing SMC-Dv2 protocol whose CLC messages include
>> ism_dev[] list can solve the devices negotiation problem. And I am very willing to use
>> the existing protocol, because we all know that the protocol update is a long and complex
>> process.
>>
>> If I understand correctly, SMC-D loopback(dummy) device can coordinate with existing
>> SMC-Dv2 protocol as follows. If there is any mistake, please point out.
>>
>>
>> # Initialization
>>
>> - Initialize the loopback device with a unique GID [Q-1].
>>
>> - Register the loopback device as SMC-Dv2-capable device with a system_eid whose 24th
>> or 28th byte is non-zero [Q-2], so that this system's smc_ism_v2_capable will be set
>> to TRUE and SMC-Dv2 is available.
>>
> The decision point is the VLAN_ID: if it is 0x1FFF, the device will support V2. I.e., if you can have a subnet with
> VLAN_ID 0x1FFF, then the SEID is necessary, so that the serial number or type is non-zero. (*1)
In case there is any misunderstanding between us, I would like to rephrase my [Q-2] question:
int smcd_register_dev(struct smcd_dev *smcd)
{
	<...>
	mutex_lock(&smcd_dev_list.mutex);
	if (list_empty(&smcd_dev_list.list)) {
		u8 *system_eid = NULL;

		smcd->ops->get_system_eid(smcd, &system_eid);
		if (system_eid[24] != '0' || system_eid[28] != '0') {
			smc_ism_v2_capable = true;
			memcpy(smc_ism_v2_system_eid, system_eid,
			       SMC_MAX_EID_LEN);
		}
	}
	<...>
}
It can be inferred from smcd_register_dev() that:

1) The 24th and 28th bytes are special and determine whether smc_ism_v2_capable is true.
   Besides these, do other bytes of system_eid have hidden meanings that need attention?

2) Only when smcd_dev_list is empty will the added smcd_dev be checked, and its system_eid
   determines whether smc_ism_v2_capable is true. Why is only the first added device
   checked?

   If the first added smcd_dev has a system_eid whose 24th and 28th bytes are zero, and the
   second added smcd_dev has a system_eid whose 24th and 28th bytes are non-zero, should
   smc_ism_v2_capable be true, since the second smcd_dev has a v2-indicated system_eid?
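One conceivable answer to question 2) is to evaluate every newly registered device rather than only the first one in the list. The following self-contained userspace mock sketches that behavior; the `_mock` names are invented for illustration and this is not a proposed kernel patch:

```c
#include <assert.h>
#include <string.h>

/* Mock state; the real flag and EID live in net/smc code. */
static int smc_ism_v2_capable_mock;
static char smc_ism_v2_system_eid_mock[32];

/* Sketch: unlike the current smcd_register_dev(), which only inspects
 * the first device in smcd_dev_list, check every newly registered
 * device's SEID, so a v2-capable device registered second still turns
 * on SMC-Dv2 support. */
static void mock_register_dev(const char seid[32])
{
	if (!smc_ism_v2_capable_mock &&
	    (seid[24] != '0' || seid[28] != '0')) {
		smc_ism_v2_capable_mock = 1;
		memcpy(smc_ism_v2_system_eid_mock, seid, 32);
	}
}
```

Whether the system EID may change after the first device registered is a protocol question the maintainers would have to answer; the sketch only keeps the first v2-indicating SEID it sees.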
>>
>> # Proposal
>>
>> - Find the loopback device from the smcd_dev_list in smc_find_ism_v2_device_clnt();
>>
>> - Record the SEID, GID and CHID[Q-3] of loopback device in the v2 extension part of CLC
>> proposal message.
>>
>>
>> # Accept
>>
>> - Check the GID/CHID list and SEID in CLC proposal message, and find local matched ISM
>> device from smcd_dev_list in smc_find_ism_v2_device_serv(). If both sides of the
>> communication are in the same VM and share the same loopback device, the SEID, GID and
>> CHID will match and loopback device will be chosen [Q-4].
>>
>> - Record the loopback device's GID/CHID and matched SEID into CLC accept message.
>>
>>
>> # Confirm
>>
>> - Confirm the server-selected device (loopback device) according to CLC accept messages.
>>
>> - Record the loopback device's GID/CHID and server-selected SEID in CLC confirm message.
>>
>>
>> Following the above process, I supplemented a patch based on this RFC in the email attachment.
>> With the attached patch, SMC-D loopback will switch to the SMC-Dv2 protocol.
>>
>>
>>
>> And in the above process, there are some things I want to consult and discuss, which are marked
>> with '[Q-*]' in the above description.
>>
>> # [Q-1]:
>>
>> The GID of the loopback device is randomly generated in this RFC patch set, but I will find a way
>> to make the GID unique in the formal patches. Any suggestions are welcome.
>>
> I think the randomly generated GID is fine in your case, which is equivalent to the IP address.
Since whether the two sides can communicate through loopback is judged by whether the
GIDs of their loopback devices are equal, a random GID brings a risk of misjudgment because
it may not be unique. But considering this is an RFC, I simply used random GIDs.
>>
>> # [Q-2]:
>>
>> In the Linux implementation, the system_eid of the first registered smcd device determines the
>> system's smc_ism_v2_capable (see smcd_register_dev()).
>>
>> And I wonder that
>>
>> 1) How to define the system_eid? It can be inferred from the code that the 24th and 28th bytes
>> are special for SMC-Dv2. So in the attachment patch, I define the loopback device SEID as
>>
>> static struct smc_lo_systemeid LO_SYSTEM_EID = {
>> .seid_string = "SMC-SYSZ-LOSEID000000000",
>> .serial_number = "1000",
>> .type = "1000",
>> };
>>
>> Is there anything else I need to pay attention to?
>>
> If you just want to use V2, such a definition looks good.
> E.g. you can use some unique information from "lshw".
OK, thank you.
>>
>> 2) It seems only the first added smcd device determines the system's smc_ism_v2_capable? If two
>> different smcd devices, respectively with v1-indicated and v2-indicated system_eids, are present, will
>> the order in which they are registered affect the result of smc_ism_v2_capable?
>>
> see (*1)
>>
>> # [Q-3]:
>>
>> In the attachment patch, I define a special CHID (0xFFFF) for the loopback device, as a kind of
>> 'unassociated ISM CHID' that is not associated with any IP (OSA or HiperSockets) interfaces.
>>
>> What's your opinion about this?
>>
> It looks good to me
OK.
>>
>> # [Q-4]:
>>
>> In the current Linux implementation, the server will select the first successfully initialized device
>> from the candidates as the final selection in smc_find_ism_v2_device_serv().
>>
>> 	for (i = 0; i < matches; i++) {
>> 		ini->smcd_version = SMC_V2;
>> 		ini->is_smcd = true;
>> 		ini->ism_selected = i;
>> 		rc = smc_listen_ism_init(new_smc, ini);
>> 		if (rc) {
>> 			smc_find_ism_store_rc(rc, ini);
>> 			/* try next active ISM device */
>> 			continue;
>> 		}
>> 		return; /* matching and usable V2 ISM device found */
>> 	}
>>
>> IMHO, maybe candidate devices should have different priorities? For example, the loopback device
>> may be preferred if loopback is available.
>>
> IMO, I'd prefer such an order: ISM -> loopback -> RoCE,
> because ISM for SMC-D is our standard use case, not loopback.
OK, will follow this order.
>>
>> Best Regards,
>> Wen Gu
>>
>>>>>>
>>>>>> [1] https://lore.kernel.org/netdev/[email protected]/
>>>>>> [2] https://lore.kernel.org/netdev/[email protected]/
>>>>>> [3] https://github.com/goldsborough/ipc-bench
>>>>>>
>>>>>> v1->v2
>>>>>> 1. Fix some build WARNINGs complained about by the kernel test robot
>>>>>> Reported-by: kernel test robot <[email protected]>
>>>>>> 2. Add iperf3 test data.
>>>>>>
>>>>>> Wen Gu (5):
>>>>>> net/smc: introduce SMC-D loopback device
>>>>>> net/smc: choose loopback device in SMC-D communication
>>>>>> net/smc: add dmb attach and detach interface
>>>>>> net/smc: avoid data copy from sndbuf to peer RMB in SMC-D loopback
>>>>>> net/smc: logic of cursors update in SMC-D loopback connections
>>>>>>
>>>>>> include/net/smc.h | 3 +
>>>>>> net/smc/Makefile | 2 +-
>>>>>> net/smc/af_smc.c | 88 +++++++++++-
>>>>>> net/smc/smc_cdc.c | 59 ++++++--
>>>>>> net/smc/smc_cdc.h | 1 +
>>>>>> net/smc/smc_clc.c | 4 +-
>>>>>> net/smc/smc_core.c | 62 +++++++++
>>>>>> net/smc/smc_core.h | 2 +
>>>>>> net/smc/smc_ism.c | 39 +++++-
>>>>>> net/smc/smc_ism.h | 2 +
>>>>>> net/smc/smc_loopback.c | 358 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> net/smc/smc_loopback.h | 63 +++++++++
>>>>>> 12 files changed, 662 insertions(+), 21 deletions(-)
>>>>>> create mode 100644 net/smc/smc_loopback.c
>>>>>> create mode 100644 net/smc/smc_loopback.h
>>>>>>
On 18.01.23 13:15, Wen Gu wrote:
>
>
> On 2023/1/16 19:01, Wenjia Zhang wrote:
>>
>>
>> On 12.01.23 13:12, Wen Gu wrote:
>>>
>>>
>>> On 2023/1/5 00:09, Alexandra Winter wrote:
>>>>
>>>>
>>>> On 21.12.22 14:14, Wen Gu wrote:
>>>>>
>>>>>
>>>>> On 2022/12/20 22:02, Niklas Schnelle wrote:
>>>>>
>>>>>> On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
[...]
>>>>>>>
>>>>>>> 2. Way to select different ISM-like devices
>>>>>>>
>>>>>>> With the proposal of SMC-D loopback 'device' (this RFC) and incoming
>>>>>>> device used for inter-VM acceleration as update of [1], SMC-D has more
>>>>>>> options to choose from. So we need to consider that how to indicate
>>>>>>> supported devices, how to determine which one to use, and their priority...
>>>>>>
[...]
>>>>>>>
>>>>>>> IMHO, this may require an update of CLC message and negotiation mechanism.
>>>>>>> Again, we are very glad to discuss this with you on the mailing list.
>>>>
>>>> As described in
>>>> SMC protocol (including SMC-D): https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202_2.pdf
>>>> the CLC messages provide a list of up to 8 ISM devices to choose from.
>>>> So I would hope that we can use the existing protocol.
>>>>
>>>> The challenge will be to define GID (Global Interface ID) and CHID (a fabric ID) in
>>>> a meaningful way for the new devices.
[...]
>>>
>>> I agree with your opinion. The existing SMC-Dv2 protocol, whose CLC messages include an
>>> ism_dev[] list, can solve the device negotiation problem. And I am very willing to use
>>> the existing protocol, because we all know that a protocol update is a long and complex
>>> process.
>>>
>>> If I understand correctly, SMC-D loopback(dummy) device can coordinate with existing
>>> SMC-Dv2 protocol as follows. If there is any mistake, please point it out.
>>>
>>>
>>> # Initialization
>>>
>>> - Initialize the loopback device with unique GID [Q-1].
As you point out below [Q-1], the issue here is the uniqueness of the 8 byte GID
across all possible SMC peers (they could all offer loopback as a choice).
I was pondering some more about this and I am wondering whether there is another
way to detect that source and target of the CLC proposal are the same Linux instance
(without using the content of the CLC message).
One idea could be:
If the netdev used for the CLC proposal is a local interface, then use smcd-loopback.
What do you think?
Let me try to summarize my pondering as a base for future discussion:
SMC CLC handshake as described in SMC protocol:
------------------------------------------------
The purpose of the handshake is to efficiently find out whether and how SMC can
be used to communicate with a TCP peer (happens over a standard netdevice).
For this purpose the following fields are exchanged:
+ UEID – (32 byte) Defined per system (OS instance).
User defined group of systems that should use SMC between each other.
If the peer has only different UEIDs: Don’t try to use SMC with this peer.
+ SEID – (32 byte) Defined per system.
Maximum space of systems that are able to use SMC-D between each other.
Equals to the machine hardware / first level hypervisor.
(in s390, KVM is at least the second-level hypervisor)
Is unique per machine, derived from unique machine ID.
If the peer has a different SEID: Don’t try to use SMC-D with this peer.
+ CHID – (2 Bytes) Defined per SMC-D interface.
Fabric ID of this interface. Unique against other fabrics on this machine.
Try to find a pair of SMC-D interfaces on the same fabric for the 2 peers
to use for this SMC-D connection.
+ GID – (8 bytes) Defined per SMC-D interface.
ID of this interface.
For s390 ISM, it is actually globally unique. But for SMC-D protocol purposes, it
just needs to be unique within the CHID (fabric).
Use it to identify the chosen interfaces of the 2 peers.
Important usecases:
As the CLC handshake is exchanged via standard netdevice, the 2 peers can be:
- on different machines
- on same machine but different first level guest
- on same KVM host but different KVM guest,
- on same KVM guest (loopback case)
and the proposal list can include any combination of:
- one or more RDMA devices (to use with SMC-R)
- one or more s390 ISM devices on one or more CHIDs
- future: virtio-smcd interface(s)
So for loopback there are 2 options to think about, that can be implemented
with today’s SMC-D protocol definition:
(1) SMC-D loopback is listed in the SMC-D proposal list
Loopback could be one interface in the SMC-Dv2 list of up to 8 CHID/GID pairs proposed.
We could use CHID 0xFFFF to point out this is a loopback.
And then only use this GID if it belongs to both peers. (This may be a small
add-on to today's protocol.)
CON: This requires a unique loopback-GID for every OS instance (on this SEID).
Must also be unique against any other OS that would ever implement SMC-D
loopback, because this could be a handshake with any OS on this SEID.
PRO: It works in all cases where SMC works today.
(2) Find out that both peers of the CLC handshake are actually the same
Linux instance, without using the values of the proposal
One idea is that if a local netdev is used for the handshake, then we are in
the same instance.
CON: I guess this should work, but may not cover all usecases, where
smcd-loopback could be desired. (namespaces, etc..)
I still have some hope that there is another way to detect this somehow…
ideas are very welcome.
PRO: This is independent of the SMC protocol. It would be a Linux-only solution,
actually even *this* Linux only, future implementations could differ.
>>>
>>> - Register the loopback device as SMC-Dv2-capable device with a system_eid whose 24th
>>> or 28th byte is non-zero [Q-2], so that this system's smc_ism_v2_capable will be set
>>> to TRUE and SMC-Dv2 is available.
>>>
>> The decision point is the VLAN_ID: if it is 0x1FFF, the device will support V2. I.e., if you can have a subnet with VLAN_ID 0x1FFF, then the SEID is necessary, so that the serial number or type is non-zero. (*1)
>
I guess there is some misunderstanding of today's code.
The invalid VLAN_ID 0x1FFF is used to signal to s390 ISM hardware that SMC-Dv2 is used, where
the CLC handshake does not have to be between peers on the same IP subnet.
If we cannot set x1FFF, then this is old hardware and we have to use SMC-Dv1 on this machine.
(all ISM interfaces are same hardware level).
If we can use SMC-Dv2 then we need to determine the SEID of this machine, so we can use it
in the CLC handshake. (if we run on v1 hardware, there is no need to determine the SEID)
> In case there is any misunderstanding between us, I would like to rephrase my [Q-2] question:
>
> int smcd_register_dev(struct smcd_dev *smcd)
> {
> 	<...>
> 	mutex_lock(&smcd_dev_list.mutex);
> 	if (list_empty(&smcd_dev_list.list)) {
> 		u8 *system_eid = NULL;
>
> 		smcd->ops->get_system_eid(smcd, &system_eid);
> 		if (system_eid[24] != '0' || system_eid[28] != '0') {
> 			smc_ism_v2_capable = true;
> 			memcpy(smc_ism_v2_system_eid, system_eid,
> 			       SMC_MAX_EID_LEN);
> 		}
> 	}
> 	<...>
> }
>
> It can be inferred from smcd_register_dev() that:
>
> 1) The 24th and 28th bytes are special and determine whether smc_ism_v2_capable is true.
> Besides these, do other bytes of system_eid have hidden meanings that need attention?
>
> 2) Only when smcd_dev_list is empty is the newly added smcd_dev checked, and its system_eid
> determines whether smc_ism_v2_capable is true. Why is only the first added device
> checked?
>
> If the first added smcd_dev has a system_eid whose 24th and 28th bytes are zero, and the
> second added smcd_dev has a system_eid whose 24th and 28th bytes are non-zero, should
> smc_ism_v2_capable be true, since the second smcd_dev has a v2-indicated system_eid?
>
This was a rather indirect way to determine smc_ism_v2_capable,
which is improved in the ism patches currently under review.
>>>
>>> # Proposal
>>>
>>> - Find the loopback device from the smcd_dev_list in smc_find_ism_v2_device_clnt();
>>>
>>> - Record the SEID, GID and CHID[Q-3] of loopback device in the v2 extension part of CLC
>>> proposal message.
>>>
>>>
>>> # Accept
>>>
>>> - Check the GID/CHID list and SEID in CLC proposal message, and find local matched ISM
>>> device from smcd_dev_list in smc_find_ism_v2_device_serv(). If both sides of the
>>> communication are in the same VM and share the same loopback device, the SEID, GID and
>>> CHID will match and loopback device will be chosen [Q-4].
>>>
>>> - Record the loopback device's GID/CHID and matched SEID into CLC accept message.
>>>
>>>
>>> # Confirm
>>>
>>> - Confirm the server-selected device (loopback device) according to CLC accept messages.
>>>
>>> - Record the loopback device's GID/CHID and server-selected SEID in CLC confirm message.
>>>
>>>
>>> Following the above process, I supplemented a patch based on this RFC in the email attachment.
>>> With the attached patch, SMC-D loopback will switch to the SMC-Dv2 protocol.
>>>
>>>
>>>
>>> And in the above process, there are some things I want to consult and discuss, which are marked
>>> with '[Q-*]' in the above description.
>>>
>>> # [Q-1]:
>>>
>>> The GID of the loopback device is randomly generated in this RFC patch set, but I will find a way
>>> to make the GID unique in the formal patches. Any suggestions are welcome.
>>>
>> I think the randomly generated GID is fine in your case, which is equivalent to the IP address.
>
> Since whether the two sides can communicate through loopback is judged by whether the
> GIDs of their loopback devices are equal, a random GID brings a risk of misjudgment because
> it may not be unique. But considering this is an RFC, I simply used random GIDs.
I share your concerns about using random 8 byte numbers for all possible instances.
Collisions may be unlikely, but if they happen, they cannot be detected and have nasty effects.
>
>>>
>>> # [Q-2]:
>>>
>>> In the Linux implementation, the system_eid of the first registered smcd device determines the
>>> system's smc_ism_v2_capable (see smcd_register_dev()).
See above, this is a rather indirect correlation.
>>>
>>> And I wonder that
>>>
>>> 1) How to define the system_eid? It can be inferred from the code that the 24th and 28th bytes
>>> are special for SMC-Dv2. So in the attachment patch, I define the loopback device SEID as
>>>
>>> static struct smc_lo_systemeid LO_SYSTEM_EID = {
>>> .seid_string = "SMC-SYSZ-LOSEID000000000",
>>> .serial_number = "1000",
>>> .type = "1000",
>>> };
>>>
>>> Is there anything else I need to pay attention to?
>>>
>> If you just want to use V2, such a definition looks good.
>> E.g. you can use some unique information from "lshw".
>
> OK, thank you.
>
As mentioned above:
+ SEID – (32 byte) Defined per system.
Maximum space of systems that are able to use SMC-D between each other.
Equals to the machine hardware / first level hypervisor.
> (in s390, KVM is at least the second-level hypervisor)
Is unique per machine, derived from unique machine ID.
> If the peer has a different SEID: Don’t try to use SMC-D with this peer.
We need to continue to use today's values on s390 architecture for backward compatibility!
Other architectures need to also use values that uniquely identify the machine it is
running on.
>>>
>>> 2) It seems only the first added smcd device determines the system's smc_ism_v2_capable? If two
>>> different smcd devices, respectively with v1-indicated and v2-indicated system_eids, are present, will
>>> the order in which they are registered affect the result of smc_ism_v2_capable?
>>>
>> see (*1)
see above: all s390 ISM interfaces on a machine are same hardware level.
>>>
>>> # [Q-3]:
>>>
>>> In the attachment patch, I define a special CHID (0xFFFF) for the loopback device, as a kind of
>>> 'unassociated ISM CHID' that is not associated with any IP (OSA or HiperSockets) interfaces.
>>>
>>> What's your opinion about this?
>>>
>> It looks good to me
>
> OK.
This may be a small add-on to today's protocol, as a special CHID number that is evaluated
differently. But IMHO it would fit with the purpose of the CHID/GID pairs.
0xFFFF cannot appear in today's s390 ISM CHIDs, so this could be backwards compatible.
>
>>>
>>> # [Q-4]:
>>>
>>> In the current Linux implementation, the server will select the first successfully initialized device
>>> from the candidates as the final selection in smc_find_ism_v2_device_serv().
>>>
>>> 	for (i = 0; i < matches; i++) {
>>> 		ini->smcd_version = SMC_V2;
>>> 		ini->is_smcd = true;
>>> 		ini->ism_selected = i;
>>> 		rc = smc_listen_ism_init(new_smc, ini);
>>> 		if (rc) {
>>> 			smc_find_ism_store_rc(rc, ini);
>>> 			/* try next active ISM device */
>>> 			continue;
>>> 		}
>>> 		return; /* matching and usable V2 ISM device found */
>>> 	}
>>>
>>> IMHO, maybe candidate devices should have different priorities? For example, the loopback device
>>> may be preferred if loopback is available.
>>>
>> IMO, I'd prefer such an order: ISM -> loopback -> RoCE,
>> because ISM for SMC-D is our standard use case, not loopback.
>
> OK, will follow this order.
My initial thought would be: loopback -> ISM -> RoCE,
just as it is for netdev loopback.
But that only makes sense if loopback performs better than ISM.
Can we postpone that decision until we have measurements?
On 20.12.22 04:21, Wen Gu wrote:
> This patch introduces a kind of loopback device for SMC-D, thus
> enabling the SMC communication between two local sockets in one
> kernel.
>
> The loopback device supports basic capabilities defined by SMC-D,
> including registering DMB, unregistering DMB and moving data.
>
> Considering that there is no ism device on other servers except
> IBM z13,
Please use the wording 'on other architectures except s390'.
That is how IBM Z is referred to in the Linux kernel.
> the loopback device can be used as a dummy device to
> test SMC-D logic for the broad community.
>
> Signed-off-by: Wen Gu <[email protected]>
> ---
Hello Wen Gu,
as the general design discussions are ongoing, I didn't
do a thorough review. But here are some general remarks
that you may want to consider for future versions.
I would propose to add a module parameter (default off) to enable
SMC-D loopback.
> include/net/smc.h | 1 +
> net/smc/Makefile | 2 +-
> net/smc/af_smc.c | 12 ++-
> net/smc/smc_cdc.c | 6 ++
> net/smc/smc_cdc.h | 1 +
> net/smc/smc_loopback.c | 282 +++++++++++++++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 59 +++++++++++
> 7 files changed, 361 insertions(+), 2 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h
>
I am not convinced that this warrants a separate file.
[...]
>
> +}
> +
> +static int lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
> +{
> + return 0;
> +}
> +
> +static int lo_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
> +{
> + return 0;
> +}
> +
> +static int lo_set_vlan_required(struct smcd_dev *smcd)
> +{
> + return 0;
> +}
> +
> +static int lo_reset_vlan_required(struct smcd_dev *smcd)
> +{
> + return 0;
> +}
The VLAN functions are only required for SMC-Dv1.
It seems you want to provide v1 support for loopback?
That may be nice for testing v1 VLAN support,
but then you need proper VLAN support.
[...]
> +
> +static u8 *lo_get_system_eid(void)
> +{
> + return &LO_SYSTEM_EID.seid_string[0];
> +}
The SEID is for the whole system, not per device.
We probably need to register a different function
for each architecture.
> +
> +static u16 lo_get_chid(struct smcd_dev *smcd)
> +{
> + return 0;
> +}
> +
Shouldn't this return 0xFFFF in your current concept?
On 2023/1/19 20:30, Alexandra Winter wrote:
>
>
> On 18.01.23 13:15, Wen Gu wrote:
>>
>>
>> On 2023/1/16 19:01, Wenjia Zhang wrote:
>>>
>>>
>>> On 12.01.23 13:12, Wen Gu wrote:
>>>>
>>>>
>>>> On 2023/1/5 00:09, Alexandra Winter wrote:
>>>>>
>>>>>
>>>>> On 21.12.22 14:14, Wen Gu wrote:
>>>>>>
>>>>>>
>>>>>> On 2022/12/20 22:02, Niklas Schnelle wrote:
>>>>>>
>>>>>>> On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
> [...]
>>>>>>>>
>>>>>>>> 2. Way to select different ISM-like devices
>>>>>>>>
>>>>>>>> With the proposal of SMC-D loopback 'device' (this RFC) and incoming
>>>>>>>> device used for inter-VM acceleration as update of [1], SMC-D has more
>>>>>>>> options to choose from. So we need to consider that how to indicate
>>>>>>>> supported devices, how to determine which one to use, and their priority...
>>>>>>>
> [...]
>>>>>>>>
>>>>>>>> IMHO, this may require an update of CLC message and negotiation mechanism.
>>>>>>>> Again, we are very glad to discuss this with you on the mailing list.
>>>>>
>>>>> As described in
>>>>> SMC protocol (including SMC-D): https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202_2.pdf
>>>>> the CLC messages provide a list of up to 8 ISM devices to choose from.
>>>>> So I would hope that we can use the existing protocol.
>>>>>
>>>>> The challenge will be to define GID (Global Interface ID) and CHID (a fabric ID) in
>>>>> a meaningful way for the new devices.
> [...]
>>>>
>>>> I agree with your opinion. The existing SMC-Dv2 protocol, whose CLC messages include an
>>>> ism_dev[] list, can solve the device negotiation problem. And I am very willing to use
>>>> the existing protocol, because we all know that a protocol update is a long and complex
>>>> process.
>>>>
>>>> If I understand correctly, SMC-D loopback(dummy) device can coordinate with existing
>>>> SMC-Dv2 protocol as follows. If there is any mistake, please point it out.
>>>>
>>>>
>>>> # Initialization
>>>>
>>>> - Initialize the loopback device with unique GID [Q-1].
>
> As you point out below [Q-1], the issue here is the uniqueness of the 8 byte GID
> across all possible SMC peers (they could all offer loopback as a choice).
>
> I was pondering some more about this and I am wondering, whether there is another
> way to detect that source and target of the CLC proposal are the same Linux instance
> (without using the content of the CLC message).
> One idea could be:
> If the netdev used for the CLC proposal is a local interface, then use smcd-loopback.
> What do you think?
>
Hi Winter,
Thanks a lot for your suggestions, and forgive me for the delay in replying due
to the vacation.
My opinions are below.
> Let me try to summarize my pondering as a base for future discussion:
>
> SMC CLC handshake as described in SMC protocol:
> ------------------------------------------------
> The purpose of the handshake is to efficiently find out whether and how SMC can
> be used to communicate with a TCP peer (happens over a standard netdevice).
> For this purpose the following fields are exchanged:
>
> + UEID – (32 byte) Defined per system (OS instance).
> User defined group of systems that should use SMC between each other.
> If the peer has only different UEIDs: Don’t try to use SMC with this peer.
>
> + SEID – (32 byte) Defined per system.
> Maximum space of systems that are able to use SMC-D between each other.
> Equals to the machine hardware / first level hypervisor.
> (in s390, KVM is at least the second-level hypervisor)
> Is unique per machine, derived from unique machine ID.
> If the peer has a different SEID: Don’t try to use SMC-D with this peer.
>
> + CHID – (2 Bytes) Defined per SMC-D interface.
> Fabric ID of this interface. Unique against other fabrics on this machine.
> Try to find a pair of SMC-D interfaces on the same fabric for the 2 peers
> to use for this SMC-D connection.
>
> + GID – (8 bytes) Defined per SMC-D interface.
> ID of this interface.
> For s390 ISM, it is actually globally unique. But for SMC-D protocol purposes, it
> just needs to be unique within the CHID (fabric).
> Use it to identify the chosen interfaces of the 2 peers.
>
> Important usecases:
> As the CLC handshake is exchanged via standard netdevice, the 2 peers can be:
> - on different machines
> - on same machine but different first level guest
> - on same KVM host but different KVM guest,
> - on same KVM guest (loopback case)
>
> and the proposal list can include any combination of:
> - one or more RDMA devices (to use with SMC-R)
> - one or more s390 ISM devices on one or more CHIDs
> - future: virtio-smcd interface(s)
>
>
> So for loopback there are 2 options to think about, that can be implemented
> with today’s SMC-D protocol definition:
>
> (1) SMC-D loopback is listed in the SMC-D proposal list
>
> Loopback could be one interface in the SMC-Dv2 list of up to 8 CHID/GID pairs proposed.
> We could use CHID 0xFFFF to point out this is a loopback.
> And then only use this GID if it belongs to both peers. (This may be a small
> add-on to today's protocol.)
> CON: This requires a unique loopback-GID for every OS instance (on this SEID).
> Must also be unique against any other OS that would ever implement SMC-D
> loopback, because this could be a handshake with any OS on this SEID.
> PRO: It works in all cases where SMC works today.
>
> (2) Find out that both peers of the CLC handshake are actually the same
> Linux instance, without using the values of the proposal
>
> One idea is that if a local netdev is used for the handshake, then we are in
> the same instance.
> CON: I guess this should work, but may not cover all usecases, where
> smcd-loopback could be desired. (namespaces, etc..)
> I still have some hope that there is another way to detect this somehow…
> ideas are very welcome.
> PRO: This is independent of the SMC protocol. It would be a Linux-only solution,
> actually even *this* Linux only, future implementations could differ.
>
I totally agree with your summary.
I also considered option #2 at the beginning, e.g. using the established clcsock's fib lookup
to judge whether the CLC proposal target is local. But as you mentioned, it is hard to
cover all possible use cases; a particular example is two containers in the same
VM communicating with each other through physical NICs.
So IMHO, it seems hard to avoid exchanging information (like the GID) between the two sides
to confirm that both are on the same Linux instance. Please correct me if I'm wrong.
For now, I prefer option #1. But option #2 is still one of our options. All ideas are welcome.
>>>>
>>>> - Register the loopback device as SMC-Dv2-capable device with a system_eid whose 24th
>>>> or 28th byte is non-zero [Q-2], so that this system's smc_ism_v2_capable will be set
>>>> to TRUE and SMC-Dv2 is available.
>>>>
>>> The decision point is the VLAN_ID: if it is 0x1FFF, the device will support V2. I.e., if you can have a subnet with VLAN_ID 0x1FFF, then the SEID is necessary, so that the serial number or type is non-zero. (*1)
>>
>
> I guess there is some misunderstanding of today's code.
> The invalid VLAN_ID 0x1FFF is used to signal to s390 ISM hardware that SMC-Dv2 is used, where
> the CLC handshake does not have to be between peers on the same IP subnet.
> If we cannot set x1FFF, then this is old hardware and we have to use SMC-Dv1 on this machine.
> (all ISM interfaces are same hardware level).
> If we can use SMC-Dv2 then we need to determine the SEID of this machine, so we can use it
> in the CLC handshake. (if we run on v1 hardware, there is no need to determine the SEID)
>
Thanks for explanation, now I understand.
>> In case there is any misunderstanding between us, I would like to rephrase my [Q-2] question:
>>
>> int smcd_register_dev(struct smcd_dev *smcd)
>> {
>> 	<...>
>> 	mutex_lock(&smcd_dev_list.mutex);
>> 	if (list_empty(&smcd_dev_list.list)) {
>> 		u8 *system_eid = NULL;
>>
>> 		smcd->ops->get_system_eid(smcd, &system_eid);
>> 		if (system_eid[24] != '0' || system_eid[28] != '0') {
>> 			smc_ism_v2_capable = true;
>> 			memcpy(smc_ism_v2_system_eid, system_eid,
>> 			       SMC_MAX_EID_LEN);
>> 		}
>> 	}
>> 	<...>
>> }
>>
>> It can be inferred from smcd_register_dev() that:
>>
>> 1) The 24th and 28th bytes are special and determine whether smc_ism_v2_capable is true.
>> Besides these, do other bytes of system_eid have hidden meanings that need attention?
>>
>> 2) Only when smcd_dev_list is empty is the newly added smcd_dev checked, and its system_eid
>> determines whether smc_ism_v2_capable is true. Why is only the first added device
>> checked?
>>
>> If the first added smcd_dev has a system_eid whose 24th and 28th bytes are zero, and the
>> second added smcd_dev has a system_eid whose 24th and 28th bytes are non-zero, should
>> smc_ism_v2_capable be true, since the second smcd_dev has a v2-indicated system_eid?
>>
>
>
> This was a rather indirect way to determine smc_ism_v2_capable,
> which is improved in the ism patches currently under review.
>
OK, thanks.
>>>>
>>>> # Proposal
>>>>
>>>> - Find the loopback device from the smcd_dev_list in smc_find_ism_v2_device_clnt();
>>>>
>>>> - Record the SEID, GID and CHID[Q-3] of loopback device in the v2 extension part of CLC
>>>> proposal message.
>>>>
>>>>
>>>> # Accept
>>>>
>>>> - Check the GID/CHID list and SEID in CLC proposal message, and find local matched ISM
>>>> device from smcd_dev_list in smc_find_ism_v2_device_serv(). If both sides of the
>>>> communication are in the same VM and share the same loopback device, the SEID, GID and
>>>> CHID will match and loopback device will be chosen [Q-4].
>>>>
>>>> - Record the loopback device's GID/CHID and matched SEID into CLC accept message.
>>>>
>>>>
>>>> # Confirm
>>>>
>>>> - Confirm the server-selected device (loopback device) according to CLC accept messages.
>>>>
>>>> - Record the loopback device's GID/CHID and server-selected SEID in CLC confirm message.
>>>>
>>>>
>>>> Following the above process, I supplemented a patch based on this RFC in the email attachment.
>>>> With the attached patch, SMC-D loopback will switch to the SMC-Dv2 protocol.
>>>>
>>>>
>>>>
>>>> And in the above process, there are some things I want to consult and discuss, which are marked
>>>> with '[Q-*]' in the above description.
>>>>
>>>> # [Q-1]:
>>>>
>>>> The GID of the loopback device is randomly generated in this RFC patch set, but I will find a way
>>>> to make the GID unique in the formal patches. Any suggestions are welcome.
>>>>
>>> I think the randomly generated GID is fine in your case, which is equivalent to the IP address.
>>
>> Since whether the two sides can communicate through loopback is judged by whether the
>> GIDs of their loopback devices are equal, a random GID brings a risk of misjudgment because
>> it may not be unique. But considering this is an RFC, I simply used random GIDs.
>
> I share your concerns about using random 8 byte numbers for all possible instances.
> Collisions may be unlikely, but if they happen, they cannot be detected and have nasty effects.
>
Yes, I'm thinking about it and trying to find a way to avoid collisions, or to safely fall back if a collision happens.
>>
>>>>
>>>> # [Q-2]:
>>>>
>>>> In the Linux implementation, the system_eid of the first registered smcd device determines the
>>>> system's smc_ism_v2_capable (see smcd_register_dev()).
>
> See above, this is a rather indirect correlation.
>
>>>>
>>>> And I wonder that
>>>>
>>>> 1) How to define the system_eid? It can be inferred from the code that the 24th and 28th bytes
>>>> are special for SMC-Dv2. So in the attachment patch, I define the loopback device SEID as
>>>>
>>>> static struct smc_lo_systemeid LO_SYSTEM_EID = {
>>>>         .seid_string = "SMC-SYSZ-LOSEID000000000",
>>>>         .serial_number = "1000",
>>>>         .type = "1000",
>>>> };
>>>>
>>>> Is there anything else I need to pay attention to?
>>>>
>>> If you just want to use V2, such a definition looks good.
>>> e.g. you can use some unique information from "lshw"
>>
>> OK, thank you.
>>
>
> As mentioned above:
> + SEID – (32 byte) Defined per system.
> Maximum space of systems that are able to use SMC-D between each other.
> Equals to the machine hardware / first level hypervisor.
> (on s390, KVM is at least the second-level hypervisor)
> Is unique per machine, derived from unique machine ID.
> If the peer has a different SEID: Don't try to use SMC-D with this peer.
>
> We need to continue to use today's values on s390 architecture for backward compatibility!
> Other architectures need to also use values that uniquely identify the machine it is
> running on.
>
Yes, I agree.
IIUC, on the s390 architecture, if the machine's ISM hardware is V2 capable (can set VLAN_ID 0x1FFF),
then the SEID will be set according to the unique machine ID (for s390, the ID returned from
get_cpu_id() in arch/s390/include/asm/processor.h). So,
- On the s390 architecture, the SMC-D loopback device should use the same SEID as the ISM devices on
  the same machine. If SMC-D loopback only supports V2 and the machine is not V2 capable, then SMC-D
  loopback should not be used.
- On architectures other than s390, we need to find a similar way to generate a machine-unique
  ID as part of the SEID. (perhaps a unified helper is needed to do so?)
>>>>
>>>> 2) It seems that only the first registered smcd device determines the system's smc_ism_v2_capable?
>>>> If two smcd devices are registered, one with a v1-indicated and one with a v2-indicated
>>>> system_eid, will the order in which they are registered affect the result of smc_ism_v2_capable?
>>>>
>>> see (*1)
>
> see above: all s390 ISM interfaces on a machine are same hardware level.
>
OK.
>>>>
>>>> # [Q-3]:
>>>>
>>>> In the attachment patch, I define a special CHID (0xFFFF) for the loopback device, as a kind of
>>>> 'unassociated ISM CHID' that is not associated with any IP (OSA or HiperSockets) interfaces.
>>>>
>>>> What's your opinion about this?
>>>>
>>> It looks good to me
>>
>> OK.
>
> This may be a small add-on to today's protocol, as a special CHID number that is evaluated
> differently. But IMHO it would fit with the purpose of the VHCID/GID pairs.
> 0xFFFF cannot appear in today's s390 ISM CHIDs. So this could be backwards compatible.
>
Thanks, that's good news.
>>
>>>>
>>>> # [Q-4]:
>>>>
>>>> In the current Linux implementation, the server will select the first successfully initialized
>>>> device from the candidates as the final selection in smc_find_ism_v2_device_serv().
>>>>
>>>> 	for (i = 0; i < matches; i++) {
>>>> 		ini->smcd_version = SMC_V2;
>>>> 		ini->is_smcd = true;
>>>> 		ini->ism_selected = i;
>>>> 		rc = smc_listen_ism_init(new_smc, ini);
>>>> 		if (rc) {
>>>> 			smc_find_ism_store_rc(rc, ini);
>>>> 			/* try next active ISM device */
>>>> 			continue;
>>>> 		}
>>>> 		return; /* matching and usable V2 ISM device found */
>>>> 	}
>>>>
>>>> IMHO, maybe the candidate devices should have different priorities? For example, the loopback
>>>> device might be preferred when loopback is available.
>>>>
>>> IMO, I'd prefer an order like: ISM -> loopback -> RoCE,
>>> because ISM, not loopback, is the standard use case for SMC-D.
>>
>> OK, will follow this order.
>
> My initial thought would be: loopback -> ISM -> RoCE,
> just as it is for netdev loopback.
> But that only makes sense if loopback performs better than ISM.
> Can we postpone that decision until we have measurements?
>
Sure, it's very reasonable.
On 2023/1/20 00:25, Alexandra Winter wrote:
>
>
> On 20.12.22 04:21, Wen Gu wrote:
>> This patch introduces a kind of loopback device for SMC-D, thus
>> enabling the SMC communication between two local sockets in one
>> kernel.
>>
>> The loopback device supports basic capabilities defined by SMC-D,
>> including registering DMB, unregistering DMB and moving data.
>>
>> Considering that there is no ism device on other servers except
>> IBM z13,
>
> Please use the wording 'on other architectures except s390'.
> That is how IBM Z is referred to in the Linux kernel.
>
Thanks, will use wording consistent with this.
>
>> the loopback device can be used as a dummy device to
>> test SMC-D logic for the broad community.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>
> Hello Wen Gu,
>
> as the general design discussions are ongoing, I didn't
> do a thorough review. But here are some general remarks
> that you may want to consider for future versions.
> I would propose to add a module parameter (default off) to enable
> SMC-D loopback.
>
OK, will add a module parameter in the future version.
>> include/net/smc.h | 1 +
>> net/smc/Makefile | 2 +-
>> net/smc/af_smc.c | 12 ++-
>> net/smc/smc_cdc.c | 6 ++
>> net/smc/smc_cdc.h | 1 +
>> net/smc/smc_loopback.c | 282 +++++++++++++++++++++++++++++++++++++++++++++++++
>> net/smc/smc_loopback.h | 59 +++++++++++
>> 7 files changed, 361 insertions(+), 2 deletions(-)
>> create mode 100644 net/smc/smc_loopback.c
>> create mode 100644 net/smc/smc_loopback.h
>>
>
> I am not convinced that this warrants a separate file.
IMHO, the dummy device used by SMC-D loopback corresponds to the ISM device.
So I put the dummy device implementation in smc_loopback.c alone, imitating
drivers/s390/net/ism_drv.c. I think it may be clearer to do so.
>
> [...]
>>
>> +}
>> +
>> +static int lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
>> +{
>> +	return 0;
>> +}
>> +
>> +static int lo_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
>> +{
>> +	return 0;
>> +}
>> +
>> +static int lo_set_vlan_required(struct smcd_dev *smcd)
>> +{
>> +	return 0;
>> +}
>> +
>> +static int lo_reset_vlan_required(struct smcd_dev *smcd)
>> +{
>> +	return 0;
>> +}
>
> The VLAN functions are only required for SMC-Dv1
> Seems you want to provide v1 support for loopback?
> May be nice for testing v1 VLAN support.
> But then you need proper VLAN support.
>
Based on the current discussion, I tend to provide only v2 support for loopback,
since v2 is the general trend. So I will fix this in a future version.
> [...]
>> +
>> +static u8 *lo_get_system_eid(void)
>> +{
>> +	return &LO_SYSTEM_EID.seid_string[0];
>> +}
> SEID is for the whole system not per device.
> We probably need to register a different function
> for each architecture.
>
Yes, I agree.
>> +
>> +static u16 lo_get_chid(struct smcd_dev *smcd)
>> +{
>> +	return 0;
>> +}
>> +
>
> Shouldn't this return 0xFFFF in your current concept?
>
>
Yes, this should return 0xFFFF.
I supplemented a patch as attachment in this earlier reply:
https://lore.kernel.org/netdev/[email protected]/
and have amended this.