This patch set acts as the second part of the new version of [1] (The first
part can be referred from [2]), the updated things of this version are listed
at the end.
- Background
SMC-D is now used in IBM z with ISM function to optimize network interconnect
for intra-CPC communications. Inspired by this, we try to make SMC-D available
on the non-s390 architecture through a software-implemented Emulated-ISM device,
that is the loopback-ism device here, to accelerate inter-process or
inter-containers communication within the same OS instance.
- Design
This patch set includes 3 parts:
- Patch #1: some prepare work for loopback-ism.
- Patch #2-#7: implement loopback-ism device and adapt SMC-D for it.
loopback-ism now serves only SMC and no userspace interfaces exposed.
- Patch #8-#11: memory copy optimization for intra-OS scenario.
The loopback-ism device is designed as an ISMv2 device and not be limited to
a specific net namespace, ends of both inter-process connection (1/1' in diagram
below) or inter-container connection (2/2' in diagram below) can find the same
available loopback-ism and choose it during the CLC handshake.
Container 1 (ns1) Container 2 (ns2)
+-----------------------------------------+ +-------------------------+
| +-------+ +-------+ +-------+ | | +-------+ |
| | App A | | App B | | App C | | | | App D |<-+ |
| +-------+ +---^---+ +-------+ | | +-------+ |(2') |
| |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
| (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
| `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
+---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
| | | |
Kernel | | | |
+----+-------v---+-----------v----------------------------------+---+----+
| | TCP | |
| | | |
| +--------------------------------------------------------------+ |
| |
| +--------------+ |
| | smc loopback | |
+---------------------------+--------------+-----------------------------+
loopback-ism device creates DMBs (shared memory) for each connection peer.
Since data transfer occurs within the same kernel, the sndbuf of each peer
is only a descriptor and point to the same memory region as peer DMB, so that
the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
Container 1 (ns1) Container 2 (ns2)
+-----------------------------------------+ +-------------------------+
| +-------+ | | +-------+ |
| | App C |-----+ | | | App D | |
| +-------+ | | | +-^-----+ |
| | | | | |
| (2) | | | (2') | |
| | | | | |
+---------------|-------------------------+ +----------|--------------+
| |
Kernel | |
+---------------|-----------------------------------------|--------------+
| +--------+ +--v-----+ +--------+ +--------+ |
| |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
| +-----|--+ +--|-----+ +-----|--+ +--------+ |
| +-----|--+ | +-----|--+ |
| | DMB C | +---------------------------------| DMB D | |
| +--------+ +--------+ |
| |
| +--------------+ |
| | smc loopback | |
+---------------------------+--------------+-----------------------------+
- Benchmark Test
* Test environments:
- VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
- SMC sndbuf/DMB size 1MB.
* Test object:
- TCP: run on TCP loopback.
- SMC lo: run on SMC loopback-ism.
1. ipc-benchmark (see [3])
- ./<foo> -c 1000000 -s 100
TCP SMC-lo
Message
rate (msg/s) 84991 151293(+78.01%)
2. sockperf
- serv: <smc_run> sockperf sr --tcp
- clnt: <smc_run> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
TCP SMC-lo
Bandwidth(MBps) 5033.569 7987.732(+58.69%)
Latency(us) 5.986 3.398(-43.23%)
3. nginx/wrk
- serv: <smc_run> nginx
- clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
TCP SMC-lo
Requests/s 187951.76 267107.90(+42.12%)
4. redis-benchmark
- serv: <smc_run> redis-server
- clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
TCP SMC-lo
GET(Requests/s) 86132.64 118133.49(+37.15%)
SET(Requests/s) 87374.40 122887.86(+40.65%)
Change log:
v7->v6
- Patch #2: minor: remove unnecessary 'return' of inline smc_loopback_exit().
- Patch #10: minor: directly return 0 instead of 'rc' in smcd_cdc_msg_send().
- all: collect the Reviewed-by tags.
v6->RFC v5
Link: https://lore.kernel.org/netdev/[email protected]/
- Patch #2: make the use of CONFIG_SMC_LO cleaner.
- Patch #5: mark some smcd_ops that loopback-ism doesn't support as
optional and check for the support when they are called.
- Patch #7: keep loopback-ism at the beginning of the SMC-D device list.
- Some expression changes in commit logs and comments.
RFC v5->RFC v4:
Link: https://lore.kernel.org/netdev/[email protected]/
- Patch #2: minor changes in description of config SMC_LO and comments.
- Patch #10: minor changes in comments and if(smc_ism_support_dmb_nocopy())
check in smcd_cdc_msg_send().
- Patch #3: change smc_lo_generate_id() to smc_lo_generate_ids() and SMC_LO_CHID
to SMC_LO_RESERVED_CHID.
- Patch #5: memcpy while holding the ldev->dmb_ht_lock.
- Some expression changes in commit logs.
RFC v4->v3:
Link: https://lore.kernel.org/netdev/[email protected]/
- The merge window of v6.9 is open, so post this series as an RFC.
- Patch #6: since some information fed back by smc_nl_handle_smcd_dev() dose
not apply to Emulated-ISM (including loopback-ism here), loopback-ism is
not exposed through smc netlink for the time being. we may refactor this
part when smc netlink interface is updated.
v3->v2:
Link: https://lore.kernel.org/netdev/[email protected]/
- Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler()
to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh()
to acquire dmb_ht_lock.
v2->v1:
Link: https://lore.kernel.org/netdev/[email protected]/
- All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1.
- Patch #3: optimized the description of SMC_LO config. Avoid exposing loopback-ism
to sysfs and remove all the knobs until future definition clear.
- Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data().
- Patch #6: defaultly use physical contiguous DMB buffers.
- Patch #11: defaultly enable DMB no-copy for loopback-ism and free the DMB in
unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using
wait_event to keep waiting in unregister_dmb.
v1->RFC:
Link: https://lore.kernel.org/netdev/[email protected]/
- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
/sys/devices/virtual/smc/loopback-ism/xfer_bytes
- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
merging sndbuf with peer DMB.
- Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
control of whether to merge sndbuf and DMB. They can be respectively set by:
/sys/devices/virtual/smc/loopback-ism/dmb_type
/sys/devices/virtual/smc/loopback-ism/dmb_copy
The motivation for these two control is that a performance bottleneck was
found when using vzalloced DMB and sndbuf is merged with DMB, and there are
many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
vmap lock contention [6]. It has significant effects, but using virtual memory
still has additional overhead compared to using physical memory.
So this new version provides controls of dmb_type and dmb_copy to suit
different scenarios.
- Some minor changes and comments improvements.
RFC->old version([1]):
Link: https://lore.kernel.org/netdev/[email protected]/
- Patch #1: improve the loopback-ism dump, it shows as follows now:
# smcd d
FID Type PCI-ID PCHID InUse #LGs PNET-ID
0000 0 loopback-ism ffff No 0
- Patch #3: introduce the smc_ism_set_v2_capable() helper and set
smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
regardless of whether there is already a device in smcd device list.
- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
to activate or deactivate the loopback-ism.
- Patch #9: introduce the statistics of loopback-ism by
/sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
- Some minor changes and comments improvements.
[1] https://lore.kernel.org/netdev/[email protected]/
[2] https://lore.kernel.org/netdev/[email protected]/
[3] https://github.com/goldsborough/ipc-bench
[4] https://lore.kernel.org/all/[email protected]/
[5] https://lore.kernel.org/all/[email protected]/
[6] https://lore.kernel.org/all/[email protected]/
Wen Gu (11):
net/smc: decouple ism_client from SMC-D DMB registration
net/smc: introduce loopback-ism for SMC intra-OS shortcut
net/smc: implement ID-related operations of loopback-ism
net/smc: implement DMB-related operations of loopback-ism
net/smc: mark optional smcd_ops and check for support when called
net/smc: ignore loopback-ism when dumping SMC-D devices
net/smc: register loopback-ism into SMC-D device list
net/smc: add operations to merge sndbuf with peer DMB
net/smc: {at|de}tach sndbuf to peer DMB if supported
net/smc: adapt cursor update when sndbuf and peer DMB are merged
net/smc: implement DMB-merged operations of loopback-ism
drivers/s390/net/ism_drv.c | 2 +-
include/net/smc.h | 21 +-
net/smc/Kconfig | 13 ++
net/smc/Makefile | 1 +
net/smc/af_smc.c | 28 ++-
net/smc/smc_cdc.c | 36 +++-
net/smc/smc_core.c | 61 +++++-
net/smc/smc_core.h | 1 +
net/smc/smc_ism.c | 88 ++++++--
net/smc/smc_ism.h | 10 +
net/smc/smc_loopback.c | 427 +++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 61 ++++++
12 files changed, 721 insertions(+), 28 deletions(-)
create mode 100644 net/smc/smc_loopback.c
create mode 100644 net/smc/smc_loopback.h
--
2.32.0.3.g01195cf9f
The struct 'ism_client' is specialized for s390 platform firmware ISM.
So replace it with 'void' to make SMCD DMB registration helper generic
for both Emulated-ISM and existing ISM.
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-and-tested-by: Jan Karcher <[email protected]>
---
drivers/s390/net/ism_drv.c | 2 +-
include/net/smc.h | 4 ++--
net/smc/smc_ism.c | 7 ++-----
3 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/drivers/s390/net/ism_drv.c b/drivers/s390/net/ism_drv.c
index 43778b088ffa..e36e3ea165d3 100644
--- a/drivers/s390/net/ism_drv.c
+++ b/drivers/s390/net/ism_drv.c
@@ -745,7 +745,7 @@ static int smcd_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
}
static int smcd_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
- struct ism_client *client)
+ void *client)
{
return ism_register_dmb(smcd->priv, (struct ism_dmb *)dmb, client);
}
diff --git a/include/net/smc.h b/include/net/smc.h
index 10684d0a33df..542d12372c18 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -47,7 +47,6 @@ struct smcd_dmb {
#define ISM_ERROR 0xFFFF
struct smcd_dev;
-struct ism_client;
struct smcd_gid {
u64 gid;
@@ -58,7 +57,7 @@ struct smcd_ops {
int (*query_remote_gid)(struct smcd_dev *dev, struct smcd_gid *rgid,
u32 vid_valid, u32 vid);
int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
- struct ism_client *client);
+ void *client);
int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
@@ -78,6 +77,7 @@ struct smcd_ops {
struct smcd_dev {
const struct smcd_ops *ops;
void *priv;
+ void *client;
struct list_head list;
spinlock_t lock;
struct smc_connection **conn;
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index ac88de2a06a0..051726586730 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -222,7 +222,6 @@ int smc_ism_unregister_dmb(struct smcd_dev *smcd, struct smc_buf_desc *dmb_desc)
int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
struct smc_buf_desc *dmb_desc)
{
-#if IS_ENABLED(CONFIG_ISM)
struct smcd_dmb dmb;
int rc;
@@ -231,7 +230,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
dmb.sba_idx = dmb_desc->sba_idx;
dmb.vlan_id = lgr->vlan_id;
dmb.rgid = lgr->peer_gid.gid;
- rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, &smc_ism_client);
+ rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, lgr->smcd->client);
if (!rc) {
dmb_desc->sba_idx = dmb.sba_idx;
dmb_desc->token = dmb.dmb_tok;
@@ -240,9 +239,6 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
dmb_desc->len = dmb.dmb_len;
}
return rc;
-#else
- return 0;
-#endif
}
static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
@@ -446,6 +442,7 @@ static void smcd_register_dev(struct ism_dev *ism)
if (!smcd)
return;
smcd->priv = ism;
+ smcd->client = &smc_ism_client;
ism_set_priv(ism, &smc_ism_client, smcd);
if (smc_pnetid_by_dev_port(&ism->pdev->dev, 0, smcd->pnetid))
smc_pnetid_by_table_smcd(smcd);
--
2.32.0.3.g01195cf9f
Some operations are not supported by new introduced Emulated-ISM, so
mark them as optional and check if the device supports them when called.
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-and-tested-by: Jan Karcher <[email protected]>
---
include/net/smc.h | 14 ++++++++------
net/smc/smc_ism.c | 9 ++++++++-
2 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/include/net/smc.h b/include/net/smc.h
index 542d12372c18..33b753115e43 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -59,12 +59,6 @@ struct smcd_ops {
int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
void *client);
int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
- int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
- int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
- int (*set_vlan_required)(struct smcd_dev *dev);
- int (*reset_vlan_required)(struct smcd_dev *dev);
- int (*signal_event)(struct smcd_dev *dev, struct smcd_gid *rgid,
- u32 trigger_irq, u32 event_code, u64 info);
int (*move_data)(struct smcd_dev *dev, u64 dmb_tok, unsigned int idx,
bool sf, unsigned int offset, void *data,
unsigned int size);
@@ -72,6 +66,14 @@ struct smcd_ops {
void (*get_local_gid)(struct smcd_dev *dev, struct smcd_gid *gid);
u16 (*get_chid)(struct smcd_dev *dev);
struct device* (*get_dev)(struct smcd_dev *dev);
+
+ /* optional operations */
+ int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
+ int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
+ int (*set_vlan_required)(struct smcd_dev *dev);
+ int (*reset_vlan_required)(struct smcd_dev *dev);
+ int (*signal_event)(struct smcd_dev *dev, struct smcd_gid *rgid,
+ u32 trigger_irq, u32 event_code, u64 info);
};
struct smcd_dev {
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 051726586730..36459a3c739d 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -126,6 +126,8 @@ int smc_ism_get_vlan(struct smcd_dev *smcd, unsigned short vlanid)
if (!vlanid) /* No valid vlan id */
return -EINVAL;
+ if (!smcd->ops->add_vlan_id)
+ return -EOPNOTSUPP;
/* create new vlan entry, in case we need it */
new_vlan = kzalloc(sizeof(*new_vlan), GFP_KERNEL);
@@ -171,6 +173,8 @@ int smc_ism_put_vlan(struct smcd_dev *smcd, unsigned short vlanid)
if (!vlanid) /* No valid vlan id */
return -EINVAL;
+ if (!smcd->ops->del_vlan_id)
+ return -EOPNOTSUPP;
spin_lock_irqsave(&smcd->lock, flags);
list_for_each_entry(vlan, &smcd->vlan, list) {
@@ -368,7 +372,8 @@ static void smcd_handle_sw_event(struct smc_ism_event_work *wrk)
smc_smcd_terminate(wrk->smcd, &peer_gid, ev_info.vlan_id);
break;
case ISM_EVENT_CODE_TESTLINK: /* Activity timer */
- if (ev_info.code == ISM_EVENT_REQUEST) {
+ if (ev_info.code == ISM_EVENT_REQUEST &&
+ wrk->smcd->ops->signal_event) {
ev_info.code = ISM_EVENT_RESPONSE;
wrk->smcd->ops->signal_event(wrk->smcd,
&peer_gid,
@@ -538,6 +543,8 @@ int smc_ism_signal_shutdown(struct smc_link_group *lgr)
if (lgr->peer_shutdown)
return 0;
+ if (!lgr->smcd->ops->signal_event)
+ return 0;
memcpy(ev_info.uid, lgr->id, SMC_LGR_ID_SIZE);
ev_info.vlan_id = lgr->vlan_id;
--
2.32.0.3.g01195cf9f
This implements operations related to IDs for the loopback-ism device.
loopback-ism uses an Extended GID that is a 128-bit GID instead of the
existing ISM 64-bit GID, and uses the CHID defined with the reserved
value 0xFFFF.
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-and-tested-by: Jan Karcher <[email protected]>
---
net/smc/smc_loopback.c | 62 ++++++++++++++++++++++++++++++++++++++----
net/smc/smc_loopback.h | 3 ++
2 files changed, 60 insertions(+), 5 deletions(-)
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index c364e3e6e3fb..0349632a76c4 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -18,11 +18,62 @@
#include "smc_ism.h"
#include "smc_loopback.h"
+#define SMC_LO_V2_CAPABLE 0x1 /* loopback-ism acts as ISMv2 */
+
static const char smc_lo_dev_name[] = "loopback-ism";
static struct smc_lo_dev *lo_dev;
+static void smc_lo_generate_ids(struct smc_lo_dev *ldev)
+{
+ struct smcd_gid *lgid = &ldev->local_gid;
+ uuid_t uuid;
+
+ uuid_gen(&uuid);
+ memcpy(&lgid->gid, &uuid, sizeof(lgid->gid));
+ memcpy(&lgid->gid_ext, (u8 *)&uuid + sizeof(lgid->gid),
+ sizeof(lgid->gid_ext));
+
+ ldev->chid = SMC_LO_RESERVED_CHID;
+}
+
+static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
+ u32 vid_valid, u32 vid)
+{
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* rgid should be the same as lgid */
+ if (!ldev || rgid->gid != ldev->local_gid.gid ||
+ rgid->gid_ext != ldev->local_gid.gid_ext)
+ return -ENETUNREACH;
+ return 0;
+}
+
+static int smc_lo_supports_v2(void)
+{
+ return SMC_LO_V2_CAPABLE;
+}
+
+static void smc_lo_get_local_gid(struct smcd_dev *smcd,
+ struct smcd_gid *smcd_gid)
+{
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ smcd_gid->gid = ldev->local_gid.gid;
+ smcd_gid->gid_ext = ldev->local_gid.gid_ext;
+}
+
+static u16 smc_lo_get_chid(struct smcd_dev *smcd)
+{
+ return ((struct smc_lo_dev *)smcd->priv)->chid;
+}
+
+static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
+{
+ return &((struct smc_lo_dev *)smcd->priv)->dev;
+}
+
static const struct smcd_ops lo_ops = {
- .query_remote_gid = NULL,
+ .query_remote_gid = smc_lo_query_rgid,
.register_dmb = NULL,
.unregister_dmb = NULL,
.add_vlan_id = NULL,
@@ -31,10 +82,10 @@ static const struct smcd_ops lo_ops = {
.reset_vlan_required = NULL,
.signal_event = NULL,
.move_data = NULL,
- .supports_v2 = NULL,
- .get_local_gid = NULL,
- .get_chid = NULL,
- .get_dev = NULL,
+ .supports_v2 = smc_lo_supports_v2,
+ .get_local_gid = smc_lo_get_local_gid,
+ .get_chid = smc_lo_get_chid,
+ .get_dev = smc_lo_get_dev,
};
static struct smcd_dev *smcd_lo_alloc_dev(const struct smcd_ops *ops,
@@ -94,6 +145,7 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
static int smc_lo_dev_init(struct smc_lo_dev *ldev)
{
+ smc_lo_generate_ids(ldev);
return smcd_lo_register_dev(ldev);
}
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 7fc4b374bc82..d7df79f7bb64 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -20,10 +20,13 @@
#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_MAX_DMBS 5000
+#define SMC_LO_RESERVED_CHID 0xFFFF
struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
+ u16 chid;
+ struct smcd_gid local_gid;
};
int smc_loopback_init(void);
--
2.32.0.3.g01195cf9f
Since loopback-ism is not a PCI device, the PCI information fed back by
smc_nl_handle_smcd_dev() does not apply to loopback-ism. So currently
ignore loopback-ism when dumping SMC-D devices. The netlink function of
loopback-ism will be refactored when SMC netlink interface is updated.
Link: https://lore.kernel.org/r/[email protected]/
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-and-tested-by: Jan Karcher <[email protected]>
---
net/smc/smc_ism.c | 2 ++
net/smc/smc_ism.h | 5 +++++
2 files changed, 7 insertions(+)
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 36459a3c739d..188fd28423c2 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -322,6 +322,8 @@ static void smc_nl_prep_smcd_dev(struct smcd_dev_list *dev_list,
list_for_each_entry(smcd, &dev_list->list, list) {
if (num < snum)
goto next;
+ if (smc_ism_is_loopback(smcd))
+ goto next;
if (smc_nl_handle_smcd_dev(smcd, skb, cb))
goto errout;
next:
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 165cd013404b..322973527c61 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -84,4 +84,9 @@ static inline bool smc_ism_is_emulated(struct smcd_dev *smcd)
return __smc_ism_is_emulated(chid);
}
+static inline bool smc_ism_is_loopback(struct smcd_dev *smcd)
+{
+ return (smcd->ops->get_chid(smcd) == 0xFFFF);
+}
+
#endif
--
2.32.0.3.g01195cf9f
This implements DMB (un)registration and data move operations of
loopback-ism device.
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-and-tested-by: Jan Karcher <[email protected]>
---
net/smc/smc_loopback.c | 129 ++++++++++++++++++++++++++++++++++++++++-
net/smc/smc_loopback.h | 13 +++++
2 files changed, 139 insertions(+), 3 deletions(-)
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 0349632a76c4..d75852549b00 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -15,10 +15,12 @@
#include <linux/types.h>
#include <net/smc.h>
+#include "smc_cdc.h"
#include "smc_ism.h"
#include "smc_loopback.h"
#define SMC_LO_V2_CAPABLE 0x1 /* loopback-ism acts as ISMv2 */
+#define SMC_DMA_ADDR_INVALID (~(dma_addr_t)0)
static const char smc_lo_dev_name[] = "loopback-ism";
static struct smc_lo_dev *lo_dev;
@@ -48,6 +50,125 @@ static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
return 0;
}
+static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
+ void *client_priv)
+{
+ struct smc_lo_dmb_node *dmb_node, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+ int sba_idx, rc;
+
+ /* check space for new dmb */
+ for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LO_MAX_DMBS) {
+ if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
+ break;
+ }
+ if (sba_idx == SMC_LO_MAX_DMBS)
+ return -ENOSPC;
+
+ dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
+ if (!dmb_node) {
+ rc = -ENOMEM;
+ goto err_bit;
+ }
+
+ dmb_node->sba_idx = sba_idx;
+ dmb_node->len = dmb->dmb_len;
+ dmb_node->cpu_addr = kzalloc(dmb_node->len, GFP_KERNEL |
+ __GFP_NOWARN | __GFP_NORETRY |
+ __GFP_NOMEMALLOC);
+ if (!dmb_node->cpu_addr) {
+ rc = -ENOMEM;
+ goto err_node;
+ }
+ dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
+
+again:
+ /* add new dmb into hash table */
+ get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
+ write_lock_bh(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_node->token) {
+ if (tmp_node->token == dmb_node->token) {
+ write_unlock_bh(&ldev->dmb_ht_lock);
+ goto again;
+ }
+ }
+ hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
+ write_unlock_bh(&ldev->dmb_ht_lock);
+
+ dmb->sba_idx = dmb_node->sba_idx;
+ dmb->dmb_tok = dmb_node->token;
+ dmb->cpu_addr = dmb_node->cpu_addr;
+ dmb->dma_addr = dmb_node->dma_addr;
+ dmb->dmb_len = dmb_node->len;
+
+ return 0;
+
+err_node:
+ kfree(dmb_node);
+err_bit:
+ clear_bit(sba_idx, ldev->sba_idx_mask);
+ return rc;
+}
+
+static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* remove dmb from hash table */
+ write_lock_bh(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+ if (tmp_node->token == dmb->dmb_tok) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ write_unlock_bh(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ hash_del(&dmb_node->list);
+ write_unlock_bh(&ldev->dmb_ht_lock);
+
+ clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+ kfree(dmb_node->cpu_addr);
+ kfree(dmb_node);
+
+ return 0;
+}
+
+static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
+ unsigned int idx, bool sf, unsigned int offset,
+ void *data, unsigned int size)
+{
+ struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+ struct smc_connection *conn;
+
+ read_lock_bh(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
+ if (tmp_node->token == dmb_tok) {
+ rmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!rmb_node) {
+ read_unlock_bh(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ memcpy((char *)rmb_node->cpu_addr + offset, data, size);
+ read_unlock_bh(&ldev->dmb_ht_lock);
+
+ if (sf) {
+ conn = smcd->conn[rmb_node->sba_idx];
+ if (conn && !conn->killed)
+ tasklet_schedule(&conn->rx_tsklet);
+ else
+ return -EPIPE;
+ }
+ return 0;
+}
+
static int smc_lo_supports_v2(void)
{
return SMC_LO_V2_CAPABLE;
@@ -74,14 +195,14 @@ static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
static const struct smcd_ops lo_ops = {
.query_remote_gid = smc_lo_query_rgid,
- .register_dmb = NULL,
- .unregister_dmb = NULL,
+ .register_dmb = smc_lo_register_dmb,
+ .unregister_dmb = smc_lo_unregister_dmb,
.add_vlan_id = NULL,
.del_vlan_id = NULL,
.set_vlan_required = NULL,
.reset_vlan_required = NULL,
.signal_event = NULL,
- .move_data = NULL,
+ .move_data = smc_lo_move_data,
.supports_v2 = smc_lo_supports_v2,
.get_local_gid = smc_lo_get_local_gid,
.get_chid = smc_lo_get_chid,
@@ -146,6 +267,8 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
static int smc_lo_dev_init(struct smc_lo_dev *ldev)
{
smc_lo_generate_ids(ldev);
+ rwlock_init(&ldev->dmb_ht_lock);
+ hash_init(ldev->dmb_ht);
return smcd_lo_register_dev(ldev);
}
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index d7df79f7bb64..b8206338309f 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -20,13 +20,26 @@
#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_MAX_DMBS 5000
+#define SMC_LO_DMBS_HASH_BITS 12
#define SMC_LO_RESERVED_CHID 0xFFFF
+struct smc_lo_dmb_node {
+ struct hlist_node list;
+ u64 token;
+ u32 len;
+ u32 sba_idx;
+ void *cpu_addr;
+ dma_addr_t dma_addr;
+};
+
struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
u16 chid;
struct smcd_gid local_gid;
+ rwlock_t dmb_ht_lock;
+ DECLARE_BITMAP(sba_idx_mask, SMC_LO_MAX_DMBS);
+ DECLARE_HASHTABLE(dmb_ht, SMC_LO_DMBS_HASH_BITS);
};
int smc_loopback_init(void);
--
2.32.0.3.g01195cf9f
After the loopback-ism device is ready, add it to the SMC-D device list
as an ISMv2 device, and always keep it at the beginning to ensure it is
preferred for providing a shortcut for data transfer within the same
kernel.
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-and-tested-by: Jan Karcher <[email protected]>
---
net/smc/smc_ism.c | 30 +++++++++++++++++++++---------
net/smc/smc_ism.h | 1 +
net/smc/smc_loopback.c | 20 +++++++++++++-------
3 files changed, 35 insertions(+), 16 deletions(-)
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 188fd28423c2..6bed0a61b746 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -91,6 +91,11 @@ bool smc_ism_is_v2_capable(void)
return smc_ism_v2_capable;
}
+void smc_ism_set_v2_capable(void)
+{
+ smc_ism_v2_capable = true;
+}
+
/* Set a connection using this DMBE. */
void smc_ism_set_conn(struct smc_connection *conn)
{
@@ -439,7 +444,7 @@ static struct smcd_dev *smcd_alloc_dev(struct device *parent, const char *name,
static void smcd_register_dev(struct ism_dev *ism)
{
const struct smcd_ops *ops = ism_get_smcd_ops();
- struct smcd_dev *smcd;
+ struct smcd_dev *smcd, *fentry;
if (!ops)
return;
@@ -454,16 +459,23 @@ static void smcd_register_dev(struct ism_dev *ism)
if (smc_pnetid_by_dev_port(&ism->pdev->dev, 0, smcd->pnetid))
smc_pnetid_by_table_smcd(smcd);
+ if (smcd->ops->supports_v2())
+ smc_ism_set_v2_capable();
mutex_lock(&smcd_dev_list.mutex);
- if (list_empty(&smcd_dev_list.list)) {
- if (smcd->ops->supports_v2())
- smc_ism_v2_capable = true;
- }
- /* sort list: devices without pnetid before devices with pnetid */
- if (smcd->pnetid[0])
+ /* sort list:
+ * - devices without pnetid before devices with pnetid;
+ * - loopback-ism always at the very beginning;
+ */
+ if (!smcd->pnetid[0]) {
+ fentry = list_first_entry_or_null(&smcd_dev_list.list,
+ struct smcd_dev, list);
+ if (fentry && smc_ism_is_loopback(fentry))
+ list_add(&smcd->list, &fentry->list);
+ else
+ list_add(&smcd->list, &smcd_dev_list.list);
+ } else {
list_add_tail(&smcd->list, &smcd_dev_list.list);
- else
- list_add(&smcd->list, &smcd_dev_list.list);
+ }
mutex_unlock(&smcd_dev_list.mutex);
pr_warn_ratelimited("smc: adding smcd device %s with pnetid %.16s%s\n",
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 322973527c61..e6f57e5e1ef9 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -52,6 +52,7 @@ int smc_ism_signal_shutdown(struct smc_link_group *lgr);
void smc_ism_get_system_eid(u8 **eid);
u16 smc_ism_get_chid(struct smcd_dev *dev);
bool smc_ism_is_v2_capable(void);
+void smc_ism_set_v2_capable(void);
int smc_ism_init(void);
void smc_ism_exit(void);
int smcd_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index d75852549b00..94a57f4ee3f9 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -246,10 +246,12 @@ static int smcd_lo_register_dev(struct smc_lo_dev *ldev)
return -ENOMEM;
ldev->smcd = smcd;
smcd->priv = ldev;
-
- /* TODO:
- * register loopback-ism to smcd_dev list.
- */
+ smc_ism_set_v2_capable();
+ mutex_lock(&smcd_dev_list.mutex);
+ list_add(&smcd->list, &smcd_dev_list.list);
+ mutex_unlock(&smcd_dev_list.mutex);
+ pr_warn_ratelimited("smc: adding smcd device %s\n",
+ dev_name(&ldev->dev));
return 0;
}
@@ -257,9 +259,13 @@ static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
{
struct smcd_dev *smcd = ldev->smcd;
- /* TODO:
- * unregister loopback-ism from smcd_dev list.
- */
+ pr_warn_ratelimited("smc: removing smcd device %s\n",
+ dev_name(&ldev->dev));
+ smcd->going_away = 1;
+ smc_smcd_terminate_all(smcd);
+ mutex_lock(&smcd_dev_list.mutex);
+ list_del_init(&smcd->list);
+ mutex_unlock(&smcd_dev_list.mutex);
kfree(smcd->conn);
kfree(smcd);
}
--
2.32.0.3.g01195cf9f
In some scenarios using Emulated-ISM device, sndbuf can share the same
physical memory region with peer DMB to avoid data copy from one side
to the other. In such case the sndbuf is only a descriptor that
describes the shared memory and does not actually occupy memory, it's
more like a ghost buffer.
+----------+ +----------+
| socket A | | socket B |
+----------+ +----------+
| |
+--------+ +--------+
| sndbuf | | DMB |
| desc | | desc |
+--------+ +--------+
| |
| +----v-----+
+--------------------------> memory |
+----------+
So here introduces three new SMC-D device operations to check if this
feature is supported by device, and to {attach|detach} ghost sndbuf to
peer DMB. For now only loopback-ism supports this.
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-and-tested-by: Jan Karcher <[email protected]>
---
include/net/smc.h | 3 +++
net/smc/smc_ism.c | 40 ++++++++++++++++++++++++++++++++++++++++
net/smc/smc_ism.h | 4 ++++
3 files changed, 47 insertions(+)
diff --git a/include/net/smc.h b/include/net/smc.h
index 33b753115e43..db84e4e35080 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -74,6 +74,9 @@ struct smcd_ops {
int (*reset_vlan_required)(struct smcd_dev *dev);
int (*signal_event)(struct smcd_dev *dev, struct smcd_gid *rgid,
u32 trigger_irq, u32 event_code, u64 info);
+ int (*support_dmb_nocopy)(struct smcd_dev *dev);
+ int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+ int (*detach_dmb)(struct smcd_dev *dev, u64 token);
};
struct smcd_dev {
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 6bed0a61b746..84f98e18c7db 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -250,6 +250,46 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
return rc;
}
+bool smc_ism_support_dmb_nocopy(struct smcd_dev *smcd)
+{
+ /* for now only loopback-ism supports
+ * merging sndbuf with peer DMB to avoid
+ * data copies between them.
+ */
+ return (smcd->ops->support_dmb_nocopy &&
+ smcd->ops->support_dmb_nocopy(smcd));
+}
+
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token,
+ struct smc_buf_desc *dmb_desc)
+{
+ struct smcd_dmb dmb;
+ int rc = 0;
+
+ if (!dev->ops->attach_dmb)
+ return -EINVAL;
+
+ memset(&dmb, 0, sizeof(dmb));
+ dmb.dmb_tok = token;
+ rc = dev->ops->attach_dmb(dev, &dmb);
+ if (!rc) {
+ dmb_desc->sba_idx = dmb.sba_idx;
+ dmb_desc->token = dmb.dmb_tok;
+ dmb_desc->cpu_addr = dmb.cpu_addr;
+ dmb_desc->dma_addr = dmb.dma_addr;
+ dmb_desc->len = dmb.dmb_len;
+ }
+ return rc;
+}
+
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token)
+{
+ if (!dev->ops->detach_dmb)
+ return -EINVAL;
+
+ return dev->ops->detach_dmb(dev, token);
+}
+
static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
struct sk_buff *skb,
struct netlink_callback *cb)
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index e6f57e5e1ef9..6763133dd8d0 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -48,6 +48,10 @@ int smc_ism_put_vlan(struct smcd_dev *dev, unsigned short vlan_id);
int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
struct smc_buf_desc *dmb_desc);
int smc_ism_unregister_dmb(struct smcd_dev *dev, struct smc_buf_desc *dmb_desc);
+bool smc_ism_support_dmb_nocopy(struct smcd_dev *smcd);
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token,
+ struct smc_buf_desc *dmb_desc);
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token);
int smc_ism_signal_shutdown(struct smc_link_group *lgr);
void smc_ism_get_system_eid(u8 **eid);
u16 smc_ism_get_chid(struct smcd_dev *dev);
--
2.32.0.3.g01195cf9f
This implements operations related to merging sndbuf with peer DMB in
loopback-ism. The DMB won't be freed until no sndbuf is attached to it.
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-and-tested-by: Jan Karcher <[email protected]>
---
net/smc/smc_loopback.c | 120 +++++++++++++++++++++++++++++++++++------
net/smc/smc_loopback.h | 3 ++
2 files changed, 108 insertions(+), 15 deletions(-)
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 94a57f4ee3f9..3c5f64ca4115 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -20,6 +20,7 @@
#include "smc_loopback.h"
#define SMC_LO_V2_CAPABLE 0x1 /* loopback-ism acts as ISMv2 */
+#define SMC_LO_SUPPORT_NOCOPY 0x1
#define SMC_DMA_ADDR_INVALID (~(dma_addr_t)0)
static const char smc_lo_dev_name[] = "loopback-ism";
@@ -81,6 +82,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
goto err_node;
}
dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
+ refcount_set(&dmb_node->refcnt, 1);
again:
/* add new dmb into hash table */
@@ -94,6 +96,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
}
hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
write_unlock_bh(&ldev->dmb_ht_lock);
+ atomic_inc(&ldev->dmb_cnt);
dmb->sba_idx = dmb_node->sba_idx;
dmb->dmb_tok = dmb_node->token;
@@ -110,13 +113,29 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
return rc;
}
+static void __smc_lo_unregister_dmb(struct smc_lo_dev *ldev,
+ struct smc_lo_dmb_node *dmb_node)
+{
+ /* remove dmb from hash table */
+ write_lock_bh(&ldev->dmb_ht_lock);
+ hash_del(&dmb_node->list);
+ write_unlock_bh(&ldev->dmb_ht_lock);
+
+ clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+ kvfree(dmb_node->cpu_addr);
+ kfree(dmb_node);
+
+ if (atomic_dec_and_test(&ldev->dmb_cnt))
+ wake_up(&ldev->ldev_release);
+}
+
static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
{
struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
struct smc_lo_dev *ldev = smcd->priv;
- /* remove dmb from hash table */
- write_lock_bh(&ldev->dmb_ht_lock);
+ /* find dmb from hash table */
+ read_lock_bh(&ldev->dmb_ht_lock);
hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
if (tmp_node->token == dmb->dmb_tok) {
dmb_node = tmp_node;
@@ -124,16 +143,76 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
}
}
if (!dmb_node) {
- write_unlock_bh(&ldev->dmb_ht_lock);
+ read_unlock_bh(&ldev->dmb_ht_lock);
return -EINVAL;
}
- hash_del(&dmb_node->list);
- write_unlock_bh(&ldev->dmb_ht_lock);
+ read_unlock_bh(&ldev->dmb_ht_lock);
- clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
- kfree(dmb_node->cpu_addr);
- kfree(dmb_node);
+ if (refcount_dec_and_test(&dmb_node->refcnt))
+ __smc_lo_unregister_dmb(ldev, dmb_node);
+ return 0;
+}
+
+static int smc_lo_support_dmb_nocopy(struct smcd_dev *smcd)
+{
+ return SMC_LO_SUPPORT_NOCOPY;
+}
+
+static int smc_lo_attach_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* find dmb_node according to dmb->dmb_tok */
+ read_lock_bh(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+ if (tmp_node->token == dmb->dmb_tok) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ read_unlock_bh(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock_bh(&ldev->dmb_ht_lock);
+
+ if (!refcount_inc_not_zero(&dmb_node->refcnt))
+ /* the dmb is being unregistered, but has
+ * not been removed from the hash table.
+ */
+ return -EINVAL;
+ /* provide dmb information */
+ dmb->sba_idx = dmb_node->sba_idx;
+ dmb->dmb_tok = dmb_node->token;
+ dmb->cpu_addr = dmb_node->cpu_addr;
+ dmb->dma_addr = dmb_node->dma_addr;
+ dmb->dmb_len = dmb_node->len;
+ return 0;
+}
+
+static int smc_lo_detach_dmb(struct smcd_dev *smcd, u64 token)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* find dmb_node according to dmb->dmb_tok */
+ read_lock_bh(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, token) {
+ if (tmp_node->token == token) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ read_unlock_bh(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock_bh(&ldev->dmb_ht_lock);
+
+ if (refcount_dec_and_test(&dmb_node->refcnt))
+ __smc_lo_unregister_dmb(ldev, dmb_node);
return 0;
}
@@ -145,6 +224,12 @@ static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
struct smc_lo_dev *ldev = smcd->priv;
struct smc_connection *conn;
+ if (!sf)
+ /* since sndbuf is merged with peer DMB, there is
+ * no need to copy data from sndbuf to peer DMB.
+ */
+ return 0;
+
read_lock_bh(&ldev->dmb_ht_lock);
hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
if (tmp_node->token == dmb_tok) {
@@ -159,13 +244,10 @@ static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok,
memcpy((char *)rmb_node->cpu_addr + offset, data, size);
read_unlock_bh(&ldev->dmb_ht_lock);
- if (sf) {
- conn = smcd->conn[rmb_node->sba_idx];
- if (conn && !conn->killed)
- tasklet_schedule(&conn->rx_tsklet);
- else
- return -EPIPE;
- }
+ conn = smcd->conn[rmb_node->sba_idx];
+ if (!conn || conn->killed)
+ return -EPIPE;
+ tasklet_schedule(&conn->rx_tsklet);
return 0;
}
@@ -197,6 +279,9 @@ static const struct smcd_ops lo_ops = {
.query_remote_gid = smc_lo_query_rgid,
.register_dmb = smc_lo_register_dmb,
.unregister_dmb = smc_lo_unregister_dmb,
+ .support_dmb_nocopy = smc_lo_support_dmb_nocopy,
+ .attach_dmb = smc_lo_attach_dmb,
+ .detach_dmb = smc_lo_detach_dmb,
.add_vlan_id = NULL,
.del_vlan_id = NULL,
.set_vlan_required = NULL,
@@ -275,12 +360,17 @@ static int smc_lo_dev_init(struct smc_lo_dev *ldev)
smc_lo_generate_ids(ldev);
rwlock_init(&ldev->dmb_ht_lock);
hash_init(ldev->dmb_ht);
+ atomic_set(&ldev->dmb_cnt, 0);
+ init_waitqueue_head(&ldev->ldev_release);
+
return smcd_lo_register_dev(ldev);
}
static void smc_lo_dev_exit(struct smc_lo_dev *ldev)
{
smcd_lo_unregister_dev(ldev);
+ if (atomic_read(&ldev->dmb_cnt))
+ wait_event(ldev->ldev_release, !atomic_read(&ldev->dmb_cnt));
}
static void smc_lo_dev_release(struct device *dev)
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index b8206338309f..6dd4292dae56 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -30,6 +30,7 @@ struct smc_lo_dmb_node {
u32 sba_idx;
void *cpu_addr;
dma_addr_t dma_addr;
+ refcount_t refcnt;
};
struct smc_lo_dev {
@@ -37,9 +38,11 @@ struct smc_lo_dev {
struct device dev;
u16 chid;
struct smcd_gid local_gid;
+ atomic_t dmb_cnt;
rwlock_t dmb_ht_lock;
DECLARE_BITMAP(sba_idx_mask, SMC_LO_MAX_DMBS);
DECLARE_HASHTABLE(dmb_ht, SMC_LO_DMBS_HASH_BITS);
+ wait_queue_head_t ldev_release;
};
int smc_loopback_init(void);
--
2.32.0.3.g01195cf9f
If the device used by SMC-D supports merging local sndbuf to peer DMB,
then create sndbuf descriptor and attach it to peer DMB once peer
token is obtained, and detach and free the sndbuf descriptor when the
connection is freed.
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-and-tested-by: Jan Karcher <[email protected]>
---
net/smc/af_smc.c | 16 ++++++++++++
net/smc/smc_core.c | 61 +++++++++++++++++++++++++++++++++++++++++++++-
net/smc/smc_core.h | 1 +
3 files changed, 77 insertions(+), 1 deletion(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 47f3bc1470bc..9389f0cfa374 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1436,6 +1436,14 @@ static int smc_connect_ism(struct smc_sock *smc,
}
smc_conn_save_peer_info(smc, aclc);
+
+ if (smc_ism_support_dmb_nocopy(smc->conn.lgr->smcd)) {
+ rc = smcd_buf_attach(smc);
+ if (rc) {
+ rc = SMC_CLC_DECL_MEM; /* try to fallback */
+ goto connect_abort;
+ }
+ }
smc_close_init(smc);
smc_rx_init(smc);
smc_tx_init(smc);
@@ -2540,6 +2548,14 @@ static void smc_listen_work(struct work_struct *work)
mutex_unlock(&smc_server_lgr_pending);
}
smc_conn_save_peer_info(new_smc, cclc);
+
+ if (ini->is_smcd &&
+ smc_ism_support_dmb_nocopy(new_smc->conn.lgr->smcd)) {
+ rc = smcd_buf_attach(new_smc);
+ if (rc)
+ goto out_decl;
+ }
+
smc_listen_out_connected(new_smc);
SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk), ini);
goto out_free;
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 9b84d5897aa5..fafdb97adfad 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1149,6 +1149,20 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
}
}
+static void smcd_buf_detach(struct smc_connection *conn)
+{
+ struct smcd_dev *smcd = conn->lgr->smcd;
+ u64 peer_token = conn->peer_token;
+
+ if (!conn->sndbuf_desc)
+ return;
+
+ smc_ism_detach_dmb(smcd, peer_token);
+
+ kfree(conn->sndbuf_desc);
+ conn->sndbuf_desc = NULL;
+}
+
static void smc_buf_unuse(struct smc_connection *conn,
struct smc_link_group *lgr)
{
@@ -1192,6 +1206,8 @@ void smc_conn_free(struct smc_connection *conn)
if (lgr->is_smcd) {
if (!list_empty(&lgr->list))
smc_ism_unset_conn(conn);
+ if (smc_ism_support_dmb_nocopy(lgr->smcd))
+ smcd_buf_detach(conn);
tasklet_kill(&conn->rx_tsklet);
} else {
smc_cdc_wait_pend_tx_wr(conn);
@@ -1445,6 +1461,8 @@ static void smc_conn_kill(struct smc_connection *conn, bool soft)
smc_sk_wake_ups(smc);
if (conn->lgr->is_smcd) {
smc_ism_unset_conn(conn);
+ if (smc_ism_support_dmb_nocopy(conn->lgr->smcd))
+ smcd_buf_detach(conn);
if (soft)
tasklet_kill(&conn->rx_tsklet);
else
@@ -2464,12 +2482,18 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
int rc;
/* create send buffer */
+ if (is_smcd &&
+ smc_ism_support_dmb_nocopy(smc->conn.lgr->smcd))
+ goto create_rmb;
+
rc = __smc_buf_create(smc, is_smcd, false);
if (rc)
return rc;
+
+create_rmb:
/* create rmb */
rc = __smc_buf_create(smc, is_smcd, true);
- if (rc) {
+ if (rc && smc->conn.sndbuf_desc) {
down_write(&smc->conn.lgr->sndbufs_lock);
list_del(&smc->conn.sndbuf_desc->list);
up_write(&smc->conn.lgr->sndbufs_lock);
@@ -2479,6 +2503,41 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
return rc;
}
+int smcd_buf_attach(struct smc_sock *smc)
+{
+ struct smc_connection *conn = &smc->conn;
+ struct smcd_dev *smcd = conn->lgr->smcd;
+ u64 peer_token = conn->peer_token;
+ struct smc_buf_desc *buf_desc;
+ int rc;
+
+ buf_desc = kzalloc(sizeof(*buf_desc), GFP_KERNEL);
+ if (!buf_desc)
+ return -ENOMEM;
+
+ /* The ghost sndbuf_desc describes the same memory region as
+ * peer RMB. Its lifecycle is consistent with the connection's
+ * and it will be freed with the connections instead of the
+ * link group.
+ */
+ rc = smc_ism_attach_dmb(smcd, peer_token, buf_desc);
+ if (rc)
+ goto free;
+
+ smc->sk.sk_sndbuf = buf_desc->len;
+ buf_desc->cpu_addr =
+ (u8 *)buf_desc->cpu_addr + sizeof(struct smcd_cdc_msg);
+ buf_desc->len -= sizeof(struct smcd_cdc_msg);
+ conn->sndbuf_desc = buf_desc;
+ conn->sndbuf_desc->used = 1;
+ atomic_set(&conn->sndbuf_space, conn->sndbuf_desc->len);
+ return 0;
+
+free:
+ kfree(buf_desc);
+ return rc;
+}
+
static inline int smc_rmb_reserve_rtoken_idx(struct smc_link_group *lgr)
{
int i;
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 1f175376037b..d93cf51dbd7c 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -557,6 +557,7 @@ void smc_smcd_terminate(struct smcd_dev *dev, struct smcd_gid *peer_gid,
void smc_smcd_terminate_all(struct smcd_dev *dev);
void smc_smcr_terminate_all(struct smc_ib_device *smcibdev);
int smc_buf_create(struct smc_sock *smc, bool is_smcd);
+int smcd_buf_attach(struct smc_sock *smc);
int smc_uncompress_bufsize(u8 compressed);
int smc_rmb_rtoken_handling(struct smc_connection *conn, struct smc_link *link,
struct smc_clc_msg_accept_confirm *clc);
--
2.32.0.3.g01195cf9f
If the local sndbuf shares the same physical memory with peer DMB,
the cursor update processing needs to be adapted to ensure that the
data to be consumed won't be overwritten.
So in this case, the fin_curs and sndbuf_space that were originally
updated after sending the CDC message should be modified to not be
update until the peer updates cons_curs.
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-and-tested-by: Jan Karcher <[email protected]>
---
net/smc/smc_cdc.c | 36 ++++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 3c06625ceb20..619b3bab3824 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -18,6 +18,7 @@
#include "smc_tx.h"
#include "smc_rx.h"
#include "smc_close.h"
+#include "smc_ism.h"
/********************************** send *************************************/
@@ -255,6 +256,14 @@ int smcd_cdc_msg_send(struct smc_connection *conn)
return rc;
smc_curs_copy(&conn->rx_curs_confirmed, &curs, conn);
conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
+
+ if (smc_ism_support_dmb_nocopy(conn->lgr->smcd))
+ /* if local sndbuf shares the same memory region with
+ * peer DMB, then don't update the tx_curs_fin
+ * and sndbuf_space until peer has consumed the data.
+ */
+ return 0;
+
/* Calculate transmitted data and increment free send buffer space */
diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
&conn->tx_curs_sent);
@@ -266,7 +275,7 @@ int smcd_cdc_msg_send(struct smc_connection *conn)
smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
smc_tx_sndbuf_nonfull(smc);
- return rc;
+ return 0;
}
/********************************* receive ***********************************/
@@ -323,7 +332,7 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
{
union smc_host_cursor cons_old, prod_old;
struct smc_connection *conn = &smc->conn;
- int diff_cons, diff_prod;
+ int diff_cons, diff_prod, diff_tx;
smc_curs_copy(&prod_old, &conn->local_rx_ctrl.prod, conn);
smc_curs_copy(&cons_old, &conn->local_rx_ctrl.cons, conn);
@@ -339,6 +348,29 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
atomic_add(diff_cons, &conn->peer_rmbe_space);
/* guarantee 0 <= peer_rmbe_space <= peer_rmbe_size */
smp_mb__after_atomic();
+
+ /* if local sndbuf shares the same memory region with
+ * peer RMB, then update tx_curs_fin and sndbuf_space
+ * here since peer has already consumed the data.
+ */
+ if (conn->lgr->is_smcd &&
+ smc_ism_support_dmb_nocopy(conn->lgr->smcd)) {
+ /* Calculate consumed data and
+ * increment free send buffer space.
+ */
+ diff_tx = smc_curs_diff(conn->sndbuf_desc->len,
+ &conn->tx_curs_fin,
+ &conn->local_rx_ctrl.cons);
+ /* increase local sndbuf space and fin_curs */
+ smp_mb__before_atomic();
+ atomic_add(diff_tx, &conn->sndbuf_space);
+ /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+ smp_mb__after_atomic();
+ smc_curs_copy(&conn->tx_curs_fin,
+ &conn->local_rx_ctrl.cons, conn);
+
+ smc_tx_sndbuf_nonfull(smc);
+ }
}
diff_prod = smc_curs_diff(conn->rmb_desc->len, &prod_old,
--
2.32.0.3.g01195cf9f
On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.
>
> - Background
>
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-implemented Emulated-ISM device,
> that is the loopback-ism device here, to accelerate inter-process or
> inter-containers communication within the same OS instance.
Just FYI:
Cilium has implemented this kind of shortcut with sockmap and sockops.
In fact, for intra-OS case, it is _very_ simple. The core code is less
than 50 lines. Please take a look here:
https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
similiar eBPF things for inter-OS (aka VM) case.
More importantly, even LD_PRELOAD is not needed for this eBPF approach.
:)
Thanks.
Hello:
This series was applied to netdev/net-next.git (main)
by Paolo Abeni <[email protected]>:
On Sun, 28 Apr 2024 14:07:27 +0800 you wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.
>
> - Background
>
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-implemented Emulated-ISM device,
> that is the loopback-ism device here, to accelerate inter-process or
> inter-containers communication within the same OS instance.
>
> [...]
Here is the summary with links:
- [net-next,v7,01/11] net/smc: decouple ism_client from SMC-D DMB registration
https://git.kernel.org/netdev/net-next/c/784c46f5467c
- [net-next,v7,02/11] net/smc: introduce loopback-ism for SMC intra-OS shortcut
https://git.kernel.org/netdev/net-next/c/46ac64419ded
- [net-next,v7,03/11] net/smc: implement ID-related operations of loopback-ism
https://git.kernel.org/netdev/net-next/c/45783ee85bf3
- [net-next,v7,04/11] net/smc: implement DMB-related operations of loopback-ism
https://git.kernel.org/netdev/net-next/c/f7a22071dbf3
- [net-next,v7,05/11] net/smc: mark optional smcd_ops and check for support when called
https://git.kernel.org/netdev/net-next/c/d1d8d0b6c7c6
- [net-next,v7,06/11] net/smc: ignore loopback-ism when dumping SMC-D devices
https://git.kernel.org/netdev/net-next/c/c8df2d449f64
- [net-next,v7,07/11] net/smc: register loopback-ism into SMC-D device list
https://git.kernel.org/netdev/net-next/c/04791343d858
- [net-next,v7,08/11] net/smc: add operations to merge sndbuf with peer DMB
https://git.kernel.org/netdev/net-next/c/439888826858
- [net-next,v7,09/11] net/smc: {at|de}tach sndbuf to peer DMB if supported
https://git.kernel.org/netdev/net-next/c/ae2be35cbed2
- [net-next,v7,10/11] net/smc: adapt cursor update when sndbuf and peer DMB are merged
https://git.kernel.org/netdev/net-next/c/cc0ab806fc52
- [net-next,v7,11/11] net/smc: implement DMB-merged operations of loopback-ism
https://git.kernel.org/netdev/net-next/c/c3a910f2380f
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
On 2024/4/28 23:49, Cong Wang wrote:
> On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>>
>> - Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>> on the non-s390 architecture through a software-implemented Emulated-ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
>> inter-containers communication within the same OS instance.
>
> Just FYI:
>
> Cilium has implemented this kind of shortcut with sockmap and sockops.
> In fact, for intra-OS case, it is _very_ simple. The core code is less
> than 50 lines. Please take a look here:
> https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
>
> Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
> similiar eBPF things for inter-OS (aka VM) case.
>
> More importantly, even LD_PRELOAD is not needed for this eBPF approach.
> :)
>
> Thanks.
Hi, Cong. Thank you very much for the information. I learned about sockmap
before and from my perspective smcd loopback and sockmap each have their own
pros and cons.
The pros of smcd loopback is that it uses a standard process that defined
by RFC-7609 for negotiation, this CLC handshake helps smc correctly determine
whether the tcp connection should be upgraded no matter what middleware the
connection passes, e.g. through NAT. So we don't need to pay extra effort to
check whether the connection should be shortcut, unlike checking various policy
by bpf_sock_ops_ipv4() in sockmap. And since the handshake automatically select
different underlay devices for different scenarios (loopback-ism in intra-OS,
ISM in inter-VM of IBM z and RDMA in inter-VM of different hosts), various
scenarios can be covered through one smc protocol stack.
The cons of smcd loopback is also related to the CLC handshake, one more round
handshake may cause smc to perform worse than TCP in short-lived connection
scenarios. So we basically use smc upgrade in long-lived connection scenarios
and are exploring IPPROTO_SMC[1] to provide lossless fallback under adverse cases.
And we are also working on other upgrade ways than LD_PRELOAD, e.g. using eBPF
hook[2] with IPPROTO_SMC, to enhance the usability.
[1] https://lore.kernel.org/netdev/[email protected]/
[2] https://lore.kernel.org/all/ac84be00f97072a46f8a72b4e2be46cbb7fa5053.1692147782.git.geliang.tang@suse.com/
Thanks!
On Tue, May 07, 2024 at 10:34:09PM +0800, Wen Gu wrote:
>
>
> On 2024/4/28 23:49, Cong Wang wrote:
> > On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
> > > This patch set acts as the second part of the new version of [1] (The first
> > > part can be referred from [2]), the updated things of this version are listed
> > > at the end.
> > >
> > > - Background
> > >
> > > SMC-D is now used in IBM z with ISM function to optimize network interconnect
> > > for intra-CPC communications. Inspired by this, we try to make SMC-D available
> > > on the non-s390 architecture through a software-implemented Emulated-ISM device,
> > > that is the loopback-ism device here, to accelerate inter-process or
> > > inter-containers communication within the same OS instance.
> >
> > Just FYI:
> >
> > Cilium has implemented this kind of shortcut with sockmap and sockops.
> > In fact, for intra-OS case, it is _very_ simple. The core code is less
> > than 50 lines. Please take a look here:
> > https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
> >
> > Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
> > similiar eBPF things for inter-OS (aka VM) case.
> >
> > More importantly, even LD_PRELOAD is not needed for this eBPF approach.
> > :)
> >
> > Thanks.
>
> Hi, Cong. Thank you very much for the information. I learned about sockmap
> before and from my perspective smcd loopback and sockmap each have their own
> pros and cons.
>
> The pros of smcd loopback is that it uses a standard process that defined
> by RFC-7609 for negotiation, this CLC handshake helps smc correctly determine
> whether the tcp connection should be upgraded no matter what middleware the
> connection passes, e.g. through NAT. So we don't need to pay extra effort to
> check whether the connection should be shortcut, unlike checking various policy
> by bpf_sock_ops_ipv4() in sockmap. And since the handshake automatically select
> different underlay devices for different scenarios (loopback-ism in intra-OS,
> ISM in inter-VM of IBM z and RDMA in inter-VM of different hosts), various
> scenarios can be covered through one smc protocol stack.
>
> The cons of smcd loopback is also related to the CLC handshake, one more round
> handshake may cause smc to perform worse than TCP in short-lived connection
> scenarios. So we basically use smc upgrade in long-lived connection scenarios
> and are exploring IPPROTO_SMC[1] to provide lossless fallback under adverse cases.
You don't have to bother RFC's, since you could define your own TCP
options. And, the eBPF approach could also use TCP options whenver
needed. Cilium probably does not use them only because for intra-OS case
it is too simple to bother TCP options, as everything can be shared via a
shared socketmap.
In reality, the setup is not that complex. In many cases we already know
whether we have VM or container (or mixed) setup before we develop (as
a part of requirement gathering). And they rarely change.
Taking one step back, the discovery of VM or container or loopback cases
could be done via TCP options too, to deal with complex cases like
KataContainer. There is no reason to bother RFC's, maybe except the RDMA
case.
In fact, this is an advantage to me. We don't need to argue with anyone
on our own TCP option or eBPF code, we don't even have to share our own
eBPF code here.
>
> And we are also working on other upgrade ways than LD_PRELOAD, e.g. using eBPF
> hook[2] with IPPROTO_SMC, to enhance the usability.
That is wrong IMHO, because basically it just overwrites kernel modules
with eBPF, not how eBPF is supposed to be used. IOW, you could not use
it at all without SMC/MPTCP modules.
BTW, this approach does not work for kernel sockets, because you only
hook __sys_socket().
Of course, for sockmap or sockops, they could be used independently for
any other purposes. I hope now you could see the flexiblities of eBPF
over kernel modules.
Thanks.
On 2024/5/8 00:10, Cong Wang wrote:
> On Tue, May 07, 2024 at 10:34:09PM +0800, Wen Gu wrote:
>>
>>
>> On 2024/4/28 23:49, Cong Wang wrote:
>>> On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
>>>> This patch set acts as the second part of the new version of [1] (The first
>>>> part can be referred from [2]), the updated things of this version are listed
>>>> at the end.
>>>>
>>>> - Background
>>>>
>>>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>>>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>>>> on the non-s390 architecture through a software-implemented Emulated-ISM device,
>>>> that is the loopback-ism device here, to accelerate inter-process or
>>>> inter-containers communication within the same OS instance.
>>>
>>> Just FYI:
>>>
>>> Cilium has implemented this kind of shortcut with sockmap and sockops.
>>> In fact, for intra-OS case, it is _very_ simple. The core code is less
>>> than 50 lines. Please take a look here:
>>> https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
>>>
>>> Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
>>> similiar eBPF things for inter-OS (aka VM) case.
>>>
>>> More importantly, even LD_PRELOAD is not needed for this eBPF approach.
>>> :)
>>>
>>> Thanks.
>>
>> Hi, Cong. Thank you very much for the information. I learned about sockmap
>> before and from my perspective smcd loopback and sockmap each have their own
>> pros and cons.
>>
>> The pros of smcd loopback is that it uses a standard process that defined
>> by RFC-7609 for negotiation, this CLC handshake helps smc correctly determine
>> whether the tcp connection should be upgraded no matter what middleware the
>> connection passes, e.g. through NAT. So we don't need to pay extra effort to
>> check whether the connection should be shortcut, unlike checking various policy
>> by bpf_sock_ops_ipv4() in sockmap. And since the handshake automatically select
>> different underlay devices for different scenarios (loopback-ism in intra-OS,
>> ISM in inter-VM of IBM z and RDMA in inter-VM of different hosts), various
>> scenarios can be covered through one smc protocol stack.
>>
>> The cons of smcd loopback is also related to the CLC handshake, one more round
>> handshake may cause smc to perform worse than TCP in short-lived connection
>> scenarios. So we basically use smc upgrade in long-lived connection scenarios
>> and are exploring IPPROTO_SMC[1] to provide lossless fallback under adverse cases.
>
> You don't have to bother RFC's, since you could define your own TCP
> options. And, the eBPF approach could also use TCP options whenver
> needed. Cilium probably does not use them only because for intra-OS case
> it is too simple to bother TCP options, as everything can be shared via a
> shared socketmap.
>
> In reality, the setup is not that complex. In many cases we already know
> whether we have VM or container (or mixed) setup before we develop (as
> a part of requirement gathering). And they rarely change.
>
> Taking one step back, the discovery of VM or container or loopback cases
> could be done via TCP options too, to deal with complex cases like
> KataContainer. There is no reason to bother RFC's, maybe except the RDMA
> case.
>
> In fact, this is an advantage to me. We don't need to argue with anyone
> on our own TCP option or eBPF code, we don't even have to share our own
> eBPF code here.
>
Private TCP option could be a historical burden and a risk for compatibility,
so IMHO it doesn't work, at least for SMC. Besides, the smc handshake process
I mentioned is not just about TCP option, that is the first step. There are
another 3-way CLC handshake to choose right underlay for different cases
or safelly fallback to TCP, keep users out of bother. Lastly, the process was
designed for different cases and smcd loopback is one of the whole picture,
so simplicity of intra-OS case does not mean the existing handshake is meaningless
nor smcd loopback should give up using existing process but follow sockmap.
>>
>> And we are also working on other upgrade ways than LD_PRELOAD, e.g. using eBPF
>> hook[2] with IPPROTO_SMC, to enhance the usability.
>
> That is wrong IMHO, because basically it just overwrites kernel modules
> with eBPF, not how eBPF is supposed to be used. IOW, you could not use
> it at all without SMC/MPTCP modules.
>
Yes, it expects to be used for SMC/MPTCP modules.
> BTW, this approach does not work for kernel sockets, because you only
> hook __sys_socket().
>
In fact the purpose of this is mainly to transparently upgrade applications'
TCP sockets, so kernel sockets are not the target.
> Of course, for sockmap or sockops, they could be used independently for
> any other purposes. I hope now you could see the flexiblities of eBPF
> over kernel modules.
>
Yes, I agree with the pros of eBPF way, like flexiblities you mentioned.
As I said above, from my perspective they both have their own pros and cons.
> Thanks.
Thanks!
On Tue, May 07, 2024 at 09:10:41AM -0700, Cong Wang wrote:
> On Tue, May 07, 2024 at 10:34:09PM +0800, Wen Gu wrote:
> >
> >
> > On 2024/4/28 23:49, Cong Wang wrote:
> > > On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
> > > > This patch set acts as the second part of the new version of [1] (The first
> > > > part can be referred from [2]), the updated things of this version are listed
> > > > at the end.
> > > >
> > > > - Background
> > > >
> > > > SMC-D is now used in IBM z with ISM function to optimize network interconnect
> > > > for intra-CPC communications. Inspired by this, we try to make SMC-D available
> > > > on the non-s390 architecture through a software-implemented Emulated-ISM device,
> > > > that is the loopback-ism device here, to accelerate inter-process or
> > > > inter-containers communication within the same OS instance.
> > >
> > > Just FYI:
> > >
> > > Cilium has implemented this kind of shortcut with sockmap and sockops.
> > > In fact, for intra-OS case, it is _very_ simple. The core code is less
> > > than 50 lines. Please take a look here:
> > > https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
> > >
> > > Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
> > > similiar eBPF things for inter-OS (aka VM) case.
> > >
> > > More importantly, even LD_PRELOAD is not needed for this eBPF approach.
> > > :)
> > >
> > > Thanks.
> >
> > Hi, Cong. Thank you very much for the information. I learned about sockmap
> > before and from my perspective smcd loopback and sockmap each have their own
> > pros and cons.
> >
> > The pros of smcd loopback is that it uses a standard process that defined
> > by RFC-7609 for negotiation, this CLC handshake helps smc correctly determine
> > whether the tcp connection should be upgraded no matter what middleware the
> > connection passes, e.g. through NAT. So we don't need to pay extra effort to
> > check whether the connection should be shortcut, unlike checking various policy
> > by bpf_sock_ops_ipv4() in sockmap. And since the handshake automatically select
> > different underlay devices for different scenarios (loopback-ism in intra-OS,
> > ISM in inter-VM of IBM z and RDMA in inter-VM of different hosts), various
> > scenarios can be covered through one smc protocol stack.
> >
> > The cons of smcd loopback is also related to the CLC handshake, one more round
> > handshake may cause smc to perform worse than TCP in short-lived connection
> > scenarios. So we basically use smc upgrade in long-lived connection scenarios
> > and are exploring IPPROTO_SMC[1] to provide lossless fallback under adverse cases.
>
> You don't have to bother RFC's, since you could define your own TCP
> options. And, the eBPF approach could also use TCP options whenver
> needed. Cilium probably does not use them only because for intra-OS case
> it is too simple to bother TCP options, as everything can be shared via a
> shared socketmap.
You can define and use any private TCP options but that is not the right
way for a inter-host network protocol, especially for different subnet,
arch or OS (Linux, z/OS and so on). As the essence of communication
between any two parties, everyone need to abide by the same standards. I
also have to admit that SMC is a standard protocol and we need to extend
the protocol spec through standard and appropriate methods, such as IETF
RFC and protocol white paper. If it is only a temporary acceleration
solution used on a small scale, this restriction are not required.
>
> In reality, the setup is not that complex. In many cases we already know
> whether we have VM or container (or mixed) setup before we develop (as
> a part of requirement gathering). And they rarely change.
To running SMC in nowadays cloud infra, TCP handshake is not the most
efficient but the most practical way, for we can't touch the infra
under VM, even don't know what kind of environment.
>
> Taking one step back, the discovery of VM or container or loopback cases
> could be done via TCP options too, to deal with complex cases like
> KataContainer. There is no reason to bother RFC's, maybe except the RDMA
> case.
>
> In fact, this is an advantage to me. We don't need to argue with anyone
> on our own TCP option or eBPF code, we don't even have to share our own
> eBPF code here.
Actually I am looking forward to learn about the whole way of eBPF. Both
SMC and eBPF are trying to solve this issue in their own view. I don't
think one must be replaced by another one for now. To define these
inter/intra-OS/host scene and captivate interest are more important, I
think, than seeking a one-size-fits-all solution for the present time.
>
> >
> > And we are also working on other upgrade ways than LD_PRELOAD, e.g. using eBPF
> > hook[2] with IPPROTO_SMC, to enhance the usability.
>
> That is wrong IMHO, because basically it just overwrites kernel modules
> with eBPF, not how eBPF is supposed to be used. IOW, you could not use
> it at all without SMC/MPTCP modules.
The eBPF hookers are considering as a part of SMC modules. It should not
be used out of SMC module for now at least.
>
> BTW, this approach does not work for kernel sockets, because you only
> hook __sys_socket().
Yep, SMC should not replace kernel socket for now. In our scenario,
almost all applications suitable for SMC only run in user space.
Thanks,
Tony Lu
>
> Of course, for sockmap or sockops, they could be used independently for
> any other purposes. I hope now you could see the flexiblities of eBPF
> over kernel modules.
>
> Thanks.