2023-09-24 15:17:29

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism

Hi, all

# Background

SMC-D is now used in IBM z with ISM function to optimize network interconnect
for intra-CPC communications. Inspired by this, we try to make SMC-D available
on the non-s390 architecture through a software-simulated virtual ISM device,
such as loopback-ism device here, to accelerate inter-process or inter-containers
communication within the same OS.

# Design

This patch set includes 4 parts:

- Patch #1-#3: decouple ISM device hard code from SMC-D stack.
- Patch #4-#8: implement virtual ISM extension defined in SMCv2.1.
- Patch #9-#13: implement loopback-ism device.
- Patch #14-#18: memory copy optimization for the case using loopback.

The loopback-ism device is designed as a kernel device and not be limited to
a specific net namespace, ends of both inter-process connection (1/1' in diagram
below) or inter-container connection (2/2' in diagram below) will find that peer
shares the same loopback-ism device during the CLC handshake. Then loopback-ism
device will be chosen.

Container 1 (ns1) Container 2 (ns2)
+-----------------------------------------+ +-------------------------+
| +-------+ +-------+ +-------+ | | +-------+ |
| | App A | | App B | | App C | | | | App D |<-+ |
| +-------+ +---^---+ +-------+ | | +-------+ |(2') |
| |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
| (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
| `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
+---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
| | | |
Kernel | | | |
+----+-------v---+-----------v----------------------------------+---+----+
| | TCP | |
| | | |
| +--------------------------------------------------------------+ |
| |
| +--------------+ |
| | smc loopback | |
+---------------------------+--------------+-----------------------------+


loopback-ism device allocs RMBs and sndbufs for each connection peer and 'moves'
data from sndbuf at one end to RMB at the other end. Since communication occurs
within the same kernel, the sndbuf can be mapped to peer RMB so that the data
copy in loopback-ism case can be avoided.

Container 1 (ns1) Container 2 (ns2)
+-----------------------------------------+ +-------------------------+
| +-------+ +-------+ +-------+ | | +-------+ |
| | App A | | App B | | App C | | | | App D | |
| +-------+ +--^----+ +-------+ | | +---^---+ |
| | | | | | | |
| (1) | (1') | (2) | | | (2') | |
| | | | | | | |
+-------|-----------|---------------|-----+ +------------|------------+
| | | |
Kernel | | | |
+-------|-----------|---------------|-----------------------|------------+
| +-----v-+ +-------+ +---v---+ +-------+ |
| | snd A |-+ | RMB B |<--+ | snd C |-+ +->| RMB D | |
| +-------+ | +-------+ | +-------+ | | +-------+ |
| +-------+ | +-------+ | +-------+ | | +-------+ |
| | RMB A | | | snd B | | | RMB C | | | | snd D | |
| +-------+ | +-------+ | +-------+ | | +-------+ |
| | +-------------v+ | |
| +-------------->| smc loopback |---------+ |
+---------------------------+--------------+-----------------------------+

# Benchmark Test

* Test environments:
- VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
- SMC sndbuf/RMB size 1MB.

* Test object:
- TCP: run on TCP loopback.
- domain: run on UNIX domain.
- SMC lo: run on SMC loopback device.

1. ipc-benchmark (see [1])

- ./<foo> -c 1000000 -s 100

TCP SMC-lo
Message
rate (msg/s) 81539 151251(+85.50%)

2. sockperf

- serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
- clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30

TCP SMC-lo
Bandwidth(MBps) 5313.66 8270.51(+55.65%)
Latency(us) 5.806 3.207(-44.76%)

3. nginx/wrk

- serv: <smc_run> nginx
- clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80

TCP SMC-lo
Requests/s 194641.79 258656.13(+32.89%)

4. redis-benchmark

- serv: <smc_run> redis-server
- clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024

TCP SMC-lo
GET(Requests/s) 85855.34 115640.35(+34.69%)
SET(Requests/s) 86337.15 118203.30(+36.90%)

[1] https://github.com/goldsborough/ipc-bench

v3->v4:
- Fix build warning of patch#12 about:
dmb_node->dma_addr = (dma_addr_t)dmb_node->cpu_addr;
- Replace kzalloc with vzalloc to alloc DMB in loopback-ism, which
may cause throughput or QPS to drop by 2% to 8%. see:
https://lore.kernel.org/netdev/[email protected]/
- Add SMC_LO in Kconfig to turn on/off loopback-ism.
- Make extension GID cleaner.

v2->v3:
- Fix build warning of patch#1 and patch#10.

v1->v2:
- Fix build error on s390 arch.

Wen Gu (18):
net/smc: decouple ism_dev from SMC-D device dump
net/smc: decouple ism_dev from SMC-D DMB registration
net/smc: extract v2 check helper from SMC-D device registration
net/smc: support SMCv2.x supplemental features negotiation
net/smc: reserve CHID range for SMC-D virtual device
net/smc: extend GID to 128bits only for virtual ISM device
net/smc: disable SEID on non-s390 architecture
net/smc: enable virtual ISM device feature bit
net/smc: introduce SMC-D loopback device
net/smc: implement ID-related operations of loopback
net/smc: implement some unsupported operations of loopback
net/smc: implement DMB-related operations of loopback
net/smc: register loopback device as SMC-Dv2 device
net/smc: add operation for getting DMB attribute
net/smc: add operations for DMB attach and detach
net/smc: avoid data copy from sndbuf to peer RMB in SMC-D
net/smc: modify cursor update logic when sndbuf mapped to RMB
net/smc: add interface implementation of loopback device

drivers/s390/net/ism_drv.c | 20 +-
include/net/smc.h | 32 ++-
include/uapi/linux/smc.h | 3 +
include/uapi/linux/smc_diag.h | 2 +
net/smc/Kconfig | 13 ++
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 88 ++++++--
net/smc/smc.h | 7 +
net/smc/smc_cdc.c | 56 ++++-
net/smc/smc_cdc.h | 1 +
net/smc/smc_clc.c | 64 ++++--
net/smc/smc_clc.h | 10 +-
net/smc/smc_core.c | 111 +++++++++-
net/smc/smc_core.h | 9 +-
net/smc/smc_diag.c | 11 +-
net/smc/smc_ism.c | 100 ++++++---
net/smc/smc_ism.h | 24 ++-
net/smc/smc_loopback.c | 489 ++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 54 +++++
net/smc/smc_pnet.c | 4 +-
20 files changed, 996 insertions(+), 104 deletions(-)
create mode 100644 net/smc/smc_loopback.c
create mode 100644 net/smc/smc_loopback.h

--
1.8.3.1


2023-09-24 15:17:41

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 02/18] net/smc: decouple ism_dev from SMC-D DMB registration

This patch tries to decouple ISM device from SMC-D DMB registration,
So that the register_dmb option is not restricted to ISM devices.

Signed-off-by: Wen Gu <[email protected]>
---
drivers/s390/net/ism_drv.c | 2 +-
include/net/smc.h | 4 ++--
net/smc/smc_ism.c | 7 ++-----
3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/s390/net/ism_drv.c b/drivers/s390/net/ism_drv.c
index 6df7f37..a34e913 100644
--- a/drivers/s390/net/ism_drv.c
+++ b/drivers/s390/net/ism_drv.c
@@ -781,7 +781,7 @@ static int smcd_query_rgid(struct smcd_dev *smcd, u64 rgid, u32 vid_valid,
}

static int smcd_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
- struct ism_client *client)
+ void *client)
{
return ism_register_dmb(smcd->priv, (struct ism_dmb *)dmb, client);
}
diff --git a/include/net/smc.h b/include/net/smc.h
index a002552..f75116e 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -50,13 +50,12 @@ struct smcd_dmb {
#define ISM_ERROR 0xFFFF

struct smcd_dev;
-struct ism_client;

struct smcd_ops {
int (*query_remote_gid)(struct smcd_dev *dev, u64 rgid, u32 vid_valid,
u32 vid);
int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
- struct ism_client *client);
+ void *client);
int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
@@ -77,6 +76,7 @@ struct smcd_ops {
struct smcd_dev {
const struct smcd_ops *ops;
void *priv;
+ void *client;
struct list_head list;
spinlock_t lock;
struct smc_connection **conn;
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index fc551f4..455ae0a 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -200,7 +200,6 @@ int smc_ism_unregister_dmb(struct smcd_dev *smcd, struct smc_buf_desc *dmb_desc)
int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
struct smc_buf_desc *dmb_desc)
{
-#if IS_ENABLED(CONFIG_ISM)
struct smcd_dmb dmb;
int rc;

@@ -209,7 +208,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
dmb.sba_idx = dmb_desc->sba_idx;
dmb.vlan_id = lgr->vlan_id;
dmb.rgid = lgr->peer_gid;
- rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, &smc_ism_client);
+ rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, lgr->smcd->client);
if (!rc) {
dmb_desc->sba_idx = dmb.sba_idx;
dmb_desc->token = dmb.dmb_tok;
@@ -218,9 +217,6 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
dmb_desc->len = dmb.dmb_len;
}
return rc;
-#else
- return 0;
-#endif
}

static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
@@ -421,6 +417,7 @@ static void smcd_register_dev(struct ism_dev *ism)
if (!smcd)
return;
smcd->priv = ism;
+ smcd->client = &smc_ism_client;
ism_set_priv(ism, &smc_ism_client, smcd);
if (smc_pnetid_by_dev_port(&ism->pdev->dev, 0, smcd->pnetid))
smc_pnetid_by_table_smcd(smcd);
--
1.8.3.1

2023-09-24 15:17:42

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 04/18] net/smc: support SMCv2.x supplemental features negotiation

This patch adds SMCv2.1 supplemental features negotiation. Supported
SMCv2.1 supplemental features are represented by feature_mask in FCE
header of CLC messages.

Server Client
Proposal(features(c-mask bits))
<-----------------------------------
Accept(features(s-mask bits))
----------------------------------->
Confirm(features(s&c-mask bits))
<-----------------------------------

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_clc.c | 7 +++++++
net/smc/smc_clc.h | 6 ++++--
net/smc/smc_core.h | 1 +
3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index 8deb46c..125b0d2 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -426,6 +426,7 @@ static int smc_clc_fill_fce(struct smc_clc_first_contact_ext_v2x *fce,
memset(fce, 0, sizeof(*fce));
fce->fce_v2_base.os_type = SMC_CLC_OS_LINUX;
fce->fce_v2_base.release = ini->release_nr;
+ fce->fce_v2_base.feature_mask = htons(ini->feature_mask);
memcpy(fce->fce_v2_base.hostname, smc_hostname, sizeof(smc_hostname));
if (ini->is_smcd && ini->release_nr < SMC_RELEASE_1) {
ret = sizeof(struct smc_clc_first_contact_ext);
@@ -906,6 +907,7 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
pclc_smcd->v2_ext_offset = htons(v2_ext_offset);
plen += sizeof(*v2_ext);

+ v2_ext->feature_mask = htons(ini->feature_mask);
read_lock(&smc_clc_eid_table.lock);
v2_ext->hdr.eid_cnt = smc_clc_eid_table.ueid_cnt;
plen += smc_clc_eid_table.ueid_cnt * SMC_MAX_EID_LEN;
@@ -1219,6 +1221,7 @@ int smc_clc_clnt_v2x_features_validate(struct smc_clc_first_contact_ext *fce,
return SMC_CLC_DECL_MAXLINKERR;
ini->max_links = fce_v2x->max_links;
}
+ ini->feature_mask &= ntohs(fce->feature_mask);

return 0;
}
@@ -1250,6 +1253,10 @@ int smc_clc_v2x_features_confirm_check(struct smc_clc_msg_accept_confirm *cclc,
return SMC_CLC_DECL_MAXLINKERR;
}

+ if (~(ini->feature_mask) & ntohs(fce->feature_mask))
+ return SMC_CLC_DECL_FEATUNSUPP;
+ ini->feature_mask = ntohs(fce->feature_mask);
+
return 0;
}

diff --git a/net/smc/smc_clc.h b/net/smc/smc_clc.h
index c5c8e7d..bcf37c8 100644
--- a/net/smc/smc_clc.h
+++ b/net/smc/smc_clc.h
@@ -48,6 +48,7 @@
#define SMC_CLC_DECL_RELEASEERR 0x03030009 /* release version negotiate failed */
#define SMC_CLC_DECL_MAXCONNERR 0x0303000a /* max connections negotiate failed */
#define SMC_CLC_DECL_MAXLINKERR 0x0303000b /* max links negotiate failed */
+#define SMC_CLC_DECL_FEATUNSUPP 0x0303000c /* supplemental features not supported */
#define SMC_CLC_DECL_MODEUNSUPP 0x03040000 /* smc modes do not match (R or D)*/
#define SMC_CLC_DECL_RMBE_EC 0x03050000 /* peer has eyecatcher in RMBE */
#define SMC_CLC_DECL_OPTUNSUPP 0x03060000 /* fastopen sockopt not supported */
@@ -138,7 +139,8 @@ struct smc_clc_v2_extension {
u8 roce[16]; /* RoCEv2 GID */
u8 max_conns;
u8 max_links;
- u8 reserved[14];
+ __be16 feature_mask;
+ u8 reserved[12];
u8 user_eids[][SMC_MAX_EID_LEN];
};

@@ -234,7 +236,7 @@ struct smc_clc_first_contact_ext {
u8 release : 4,
os_type : 4;
#endif
- u8 reserved2[2];
+ __be16 feature_mask;
u8 hostname[SMC_MAX_HOSTNAME_LEN];
};

diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 120027d..9f65678 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -401,6 +401,7 @@ struct smc_init_info {
u8 max_links;
u8 first_contact_peer;
u8 first_contact_local;
+ u16 feature_mask;
unsigned short vlan_id;
u32 rc;
u8 negotiated_eid[SMC_MAX_EID_LEN];
--
1.8.3.1

2023-09-24 15:18:02

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 03/18] net/smc: extract v2 check helper from SMC-D device registration

This patch extracts v2-capable logic from the process of registering the
ISM device as an SMC-D device, so that the registration process of other
underlying devices can reuse it.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_ism.c | 29 ++++++++++++++++++-----------
net/smc/smc_ism.h | 1 +
2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 455ae0a..8f1ba74 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -69,6 +69,22 @@ bool smc_ism_is_v2_capable(void)
return smc_ism_v2_capable;
}

+/* must be called under smcd_dev_list.mutex lock */
+void smc_ism_check_v2_capable(struct smcd_dev *smcd)
+{
+ u8 *system_eid = NULL;
+
+ if (smc_ism_v2_capable)
+ return;
+
+ system_eid = smcd->ops->get_system_eid();
+ if (smcd->ops->supports_v2()) {
+ smc_ism_v2_capable = true;
+ memcpy(smc_ism_v2_system_eid, system_eid,
+ SMC_MAX_EID_LEN);
+ }
+}
+
/* Set a connection using this DMBE. */
void smc_ism_set_conn(struct smc_connection *conn)
{
@@ -423,16 +439,7 @@ static void smcd_register_dev(struct ism_dev *ism)
smc_pnetid_by_table_smcd(smcd);

mutex_lock(&smcd_dev_list.mutex);
- if (list_empty(&smcd_dev_list.list)) {
- u8 *system_eid = NULL;
-
- system_eid = smcd->ops->get_system_eid();
- if (smcd->ops->supports_v2()) {
- smc_ism_v2_capable = true;
- memcpy(smc_ism_v2_system_eid, system_eid,
- SMC_MAX_EID_LEN);
- }
- }
+ smc_ism_check_v2_capable(smcd);
/* sort list: devices without pnetid before devices with pnetid */
if (smcd->pnetid[0])
list_add_tail(&smcd->list, &smcd_dev_list.list);
@@ -535,10 +542,10 @@ int smc_ism_init(void)
{
int rc = 0;

-#if IS_ENABLED(CONFIG_ISM)
smc_ism_v2_capable = false;
memset(smc_ism_v2_system_eid, 0, SMC_MAX_EID_LEN);

+#if IS_ENABLED(CONFIG_ISM)
rc = ism_register_client(&smc_ism_client);
#endif
return rc;
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 832b2f4..14d2e77 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -42,6 +42,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
void smc_ism_get_system_eid(u8 **eid);
u16 smc_ism_get_chid(struct smcd_dev *dev);
bool smc_ism_is_v2_capable(void);
+void smc_ism_check_v2_capable(struct smcd_dev *dev);
int smc_ism_init(void);
void smc_ism_exit(void);
int smcd_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb);
--
1.8.3.1

2023-09-24 15:18:06

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 05/18] net/smc: reserve CHID range for SMC-D virtual device

This patch reserve CHID range from 0xFF00 to 0xFFFF for SMC-D virtual
device and introduces helpers to identify them.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_ism.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 14d2e77..2ecc8de 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -15,6 +15,9 @@

#include "smc.h"

+#define SMC_VIRT_ISM_CHID_MAX 0xFFFF
+#define SMC_VIRT_ISM_CHID_MIN 0xFF00
+
struct smcd_dev_list { /* List of SMCD devices */
struct list_head list;
struct mutex mutex; /* Protects list of devices */
@@ -57,4 +60,16 @@ static inline int smc_ism_write(struct smcd_dev *smcd, u64 dmb_tok,
return rc < 0 ? rc : 0;
}

+static inline bool __smc_ism_is_virtdev(u16 chid)
+{
+ return (chid >= SMC_VIRT_ISM_CHID_MIN && chid <= SMC_VIRT_ISM_CHID_MAX);
+}
+
+static inline bool smc_ism_is_virtdev(struct smcd_dev *smcd)
+{
+ u16 chid = smcd->ops->get_chid(smcd);
+
+ return __smc_ism_is_virtdev(chid);
+}
+
#endif
--
1.8.3.1

2023-09-24 15:18:20

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 08/18] net/smc: enable virtual ISM device feature bit

This patch enables the first supplement feature of SMCv2.1, that
is the support for virtual ISM devices.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/af_smc.c | 2 ++
net/smc/smc.h | 7 +++++++
2 files changed, 9 insertions(+)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 5bb41404..c5b7716 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1528,6 +1528,7 @@ static int __smc_connect(struct smc_sock *smc)
ini->smcr_version = SMC_V1 | SMC_V2;
ini->smc_type_v1 = SMC_TYPE_B;
ini->smc_type_v2 = SMC_TYPE_B;
+ ini->feature_mask = SMC_FEATURE_MASK;

/* get vlan id from IP device */
if (smc_vlan_by_tcpsk(smc->clcsock, ini)) {
@@ -1993,6 +1994,7 @@ static int smc_listen_v2_check(struct smc_sock *new_smc,
ini->smc_type_v2 = pclc->hdr.typev2;
ini->smcd_version = smcd_indicated(ini->smc_type_v1) ? SMC_V1 : 0;
ini->smcr_version = smcr_indicated(ini->smc_type_v1) ? SMC_V1 : 0;
+ ini->feature_mask = SMC_FEATURE_MASK;
if (pclc->hdr.version > SMC_V1) {
if (smcd_indicated(ini->smc_type_v2))
ini->smcd_version |= SMC_V2;
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 24745fd..71b9640 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -58,6 +58,13 @@ enum smc_state { /* possible states of an SMC socket */
SMC_PROCESSABORT = 27,
};

+enum smc_supp_feature { /* supplemental features supported by current version */
+ SMC_SPF_VIRT_ISM_DEV = 0,
+};
+
+#define SMC_FEATURE_MASK \
+ (BIT(SMC_SPF_VIRT_ISM_DEV))
+
struct smc_link_group;

struct smc_wr_rx_hdr { /* common prefix part of LLC and CDC to demultiplex */
--
1.8.3.1

2023-09-24 15:18:28

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 11/18] net/smc: implement some unsupported operations of loopback

Vlan operations are not supported currently since the need for vlan in
loopback does not seem to be strong.

Signal_event operation is not supported since no event now needs to be
processed by loopback device.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 36 +++++++++++++++++++++++++++++++-----
1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 610af99..4341803 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -75,6 +75,32 @@ static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
return 0;
}

+static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_del_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_set_vlan_required(struct smcd_dev *smcd)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_reset_vlan_required(struct smcd_dev *smcd)
+{
+ return -EOPNOTSUPP;
+}
+
+static int smc_lo_signal_event(struct smcd_dev *dev, struct smcd_gid *rgid,
+ u32 trigger_irq, u32 event_code, u64 info)
+{
+ return 0;
+}
+
static int smc_lo_supports_v2(void)
{
return SMC_LO_SUPPORTS_V2;
@@ -108,11 +134,11 @@ static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
.query_remote_gid = smc_lo_query_rgid,
.register_dmb = NULL,
.unregister_dmb = NULL,
- .add_vlan_id = NULL,
- .del_vlan_id = NULL,
- .set_vlan_required = NULL,
- .reset_vlan_required = NULL,
- .signal_event = NULL,
+ .add_vlan_id = smc_lo_add_vlan_id,
+ .del_vlan_id = smc_lo_del_vlan_id,
+ .set_vlan_required = smc_lo_set_vlan_required,
+ .reset_vlan_required = smc_lo_reset_vlan_required,
+ .signal_event = smc_lo_signal_event,
.move_data = NULL,
.supports_v2 = smc_lo_supports_v2,
.get_system_eid = smc_lo_get_system_eid,
--
1.8.3.1

2023-09-24 15:18:33

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 12/18] net/smc: implement DMB-related operations of loopback

This patch implements DMB registration, unregistration and data move
operations of SMC-D loopback.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_cdc.c | 6 +++
net/smc/smc_cdc.h | 1 +
net/smc/smc_loopback.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++--
net/smc/smc_loopback.h | 13 +++++
4 files changed, 144 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 89105e9..2641800 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -411,6 +411,12 @@ static void smc_cdc_msg_recv(struct smc_sock *smc, struct smc_cdc_msg *cdc)
static void smcd_cdc_rx_tsklet(struct tasklet_struct *t)
{
struct smc_connection *conn = from_tasklet(conn, t, rx_tsklet);
+
+ smcd_cdc_rx_handler(conn);
+}
+
+void smcd_cdc_rx_handler(struct smc_connection *conn)
+{
struct smcd_cdc_msg *data_cdc;
struct smcd_cdc_msg cdc;
struct smc_sock *smc;
diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
index 696cc11..11559d4 100644
--- a/net/smc/smc_cdc.h
+++ b/net/smc/smc_cdc.h
@@ -301,5 +301,6 @@ int smcr_cdc_msg_send_validation(struct smc_connection *conn,
struct smc_wr_buf *wr_buf);
int smc_cdc_init(void) __init;
void smcd_cdc_rx_init(struct smc_connection *conn);
+void smcd_cdc_rx_handler(struct smc_connection *conn);

#endif /* SMC_CDC_H */
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 4341803..eb13291 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -16,11 +16,13 @@
#include <linux/smc.h>
#include <net/smc.h>

+#include "smc_cdc.h"
#include "smc_ism.h"
#include "smc_loopback.h"

#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_SUPPORTS_V2 0x1 /* SMC-D loopback supports SMC-Dv2 */
+#define SMC_DMA_ADDR_INVALID (~(dma_addr_t)0)

static const char smc_lo_dev_name[] = "smc_lo";
static struct smcd_seid SMC_LO_SEID = {
@@ -75,6 +77,91 @@ static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
return 0;
}

+static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
+ void *client_priv)
+{
+ struct smc_lo_dmb_node *dmb_node, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+ int sba_idx, rc;
+
+ /* check space for new dmb */
+ for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LODEV_MAX_DMBS) {
+ if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
+ break;
+ }
+ if (sba_idx == SMC_LODEV_MAX_DMBS)
+ return -ENOSPC;
+
+ dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
+ if (!dmb_node) {
+ rc = -ENOMEM;
+ goto err_bit;
+ }
+
+ dmb_node->sba_idx = sba_idx;
+ dmb_node->cpu_addr = vzalloc(dmb->dmb_len);
+ if (!dmb_node->cpu_addr) {
+ rc = -ENOMEM;
+ goto err_node;
+ }
+ dmb_node->len = dmb->dmb_len;
+ dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
+
+again:
+ /* add new dmb into hash table */
+ get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
+ write_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_node->token) {
+ if (tmp_node->token == dmb_node->token) {
+ write_unlock(&ldev->dmb_ht_lock);
+ goto again;
+ }
+ }
+ hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
+ write_unlock(&ldev->dmb_ht_lock);
+
+ dmb->sba_idx = dmb_node->sba_idx;
+ dmb->dmb_tok = dmb_node->token;
+ dmb->cpu_addr = dmb_node->cpu_addr;
+ dmb->dma_addr = dmb_node->dma_addr;
+ dmb->dmb_len = dmb_node->len;
+
+ return 0;
+
+err_node:
+ kfree(dmb_node);
+err_bit:
+ clear_bit(sba_idx, ldev->sba_idx_mask);
+ return rc;
+}
+
+static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* remove dmb from hash table */
+ write_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+ if (tmp_node->token == dmb->dmb_tok) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ write_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ hash_del(&dmb_node->list);
+ write_unlock(&ldev->dmb_ht_lock);
+
+ clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
+ vfree(dmb_node->cpu_addr);
+ kfree(dmb_node);
+
+ return 0;
+}
+
static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
{
return -EOPNOTSUPP;
@@ -101,6 +188,38 @@ static int smc_lo_signal_event(struct smcd_dev *dev, struct smcd_gid *rgid,
return 0;
}

+static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx,
+ bool sf, unsigned int offset, void *data,
+ unsigned int size)
+{
+ struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ read_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
+ if (tmp_node->token == dmb_tok) {
+ rmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!rmb_node) {
+ read_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock(&ldev->dmb_ht_lock);
+
+ memcpy((char *)rmb_node->cpu_addr + offset, data, size);
+
+ if (sf) {
+ struct smc_connection *conn =
+ smcd->conn[rmb_node->sba_idx];
+
+ if (conn && !conn->killed)
+ smcd_cdc_rx_handler(conn);
+ }
+ return 0;
+}
+
static int smc_lo_supports_v2(void)
{
return SMC_LO_SUPPORTS_V2;
@@ -132,14 +251,14 @@ static struct device *smc_lo_get_dev(struct smcd_dev *smcd)

static const struct smcd_ops lo_ops = {
.query_remote_gid = smc_lo_query_rgid,
- .register_dmb = NULL,
- .unregister_dmb = NULL,
+ .register_dmb = smc_lo_register_dmb,
+ .unregister_dmb = smc_lo_unregister_dmb,
.add_vlan_id = smc_lo_add_vlan_id,
.del_vlan_id = smc_lo_del_vlan_id,
.set_vlan_required = smc_lo_set_vlan_required,
.reset_vlan_required = smc_lo_reset_vlan_required,
.signal_event = smc_lo_signal_event,
- .move_data = NULL,
+ .move_data = smc_lo_move_data,
.supports_v2 = smc_lo_supports_v2,
.get_system_eid = smc_lo_get_system_eid,
.get_local_gid = smc_lo_get_local_gid,
@@ -213,6 +332,8 @@ static void smc_lo_dev_release(struct device *dev)
static int smc_lo_dev_init(struct smc_lo_dev *ldev)
{
smc_lo_generate_id(ldev);
+ rwlock_init(&ldev->dmb_ht_lock);
+ hash_init(ldev->dmb_ht);

return smcd_lo_register_dev(ldev);
}
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 71a52da..a5b501b 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -21,12 +21,25 @@
#if IS_ENABLED(CONFIG_SMC_LO)
#define SMC_LO_CHID 0xFFFF
#define SMC_LODEV_MAX_DMBS 5000
+#define SMC_LODEV_DMBS_HASH_BITS 12
+
+struct smc_lo_dmb_node {
+ struct hlist_node list;
+ u64 token;
+ u32 len;
+ u32 sba_idx;
+ void *cpu_addr;
+ dma_addr_t dma_addr;
+};

struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
u16 chid;
struct smcd_gid local_gid;
+ DECLARE_BITMAP(sba_idx_mask, SMC_LODEV_MAX_DMBS);
+ rwlock_t dmb_ht_lock;
+ DECLARE_HASHTABLE(dmb_ht, SMC_LODEV_DMBS_HASH_BITS);
};
#endif

--
1.8.3.1

2023-09-24 15:18:39

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 13/18] net/smc: register loopback device as SMC-Dv2 device

After loopback device gets ready, add it to the smcd_dev list as an
SMC-Dv2 device for use by SMC-D protocol.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index eb13291..8375575 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -304,18 +304,27 @@ static int smcd_lo_register_dev(struct smc_lo_dev *ldev)

ldev->smcd = smcd;
smcd->priv = ldev;
-
- /* TODO:
- * register smc_lo to smcd_dev list.
- */
+ mutex_lock(&smcd_dev_list.mutex);
+ smc_ism_check_v2_capable(smcd);
+ list_add(&smcd->list, &smcd_dev_list.list);
+ mutex_unlock(&smcd_dev_list.mutex);
+ pr_warn_ratelimited("smc: adding smcd device %s with pnetid %.16s%s\n",
+ smc_lo_dev_name, smcd->pnetid,
+ smcd->pnetid_by_user ? " (user defined)" : "");
return 0;
}

static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
{
- /* TODO:
- * unregister smc_lo from smcd_dev list.
- */
+ struct smcd_dev *smcd = ldev->smcd;
+
+ pr_warn_ratelimited("smc: removing smcd device %s\n",
+ smc_lo_dev_name);
+ smcd->going_away = 1;
+ smc_smcd_terminate_all(smcd);
+ mutex_lock(&smcd_dev_list.mutex);
+ list_del_init(&smcd->list);
+ mutex_unlock(&smcd_dev_list.mutex);
}

static void smc_lo_dev_release(struct device *dev)
--
1.8.3.1

2023-09-24 15:18:41

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 10/18] net/smc: implement ID-related operations of loopback

This patch implements GID/CHID/SEID related operations of SMC-D loopback
device. In loopback device, GID is generated by UUIDv4 algorithm, CHID is
reserved 0xFFFF, SEID is generated using the same algorithm as ISM device
under s390 architecture, and is 0 and disabled under non-s390 architecture.

Signed-off-by: Wen Gu <[email protected]>
---
include/net/smc.h | 6 ++++
net/smc/smc_ism.h | 2 ++
net/smc/smc_loopback.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++----
net/smc/smc_loopback.h | 3 ++
4 files changed, 101 insertions(+), 6 deletions(-)

diff --git a/include/net/smc.h b/include/net/smc.h
index d8db5e1..7c2d35c 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -41,6 +41,12 @@ struct smcd_dmb {
dma_addr_t dma_addr;
};

+struct smcd_seid {
+ u8 seid_string[24];
+ u8 serial_number[4];
+ u8 type[4];
+};
+
#define ISM_EVENT_DMB 0
#define ISM_EVENT_GID 1
#define ISM_EVENT_SWR 2
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index e6ea08c..7ab82dd 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -15,6 +15,8 @@

#include "smc.h"

+#define S390_ISM_IDENT_MASK 0x00FFFF
+
#define SMC_VIRT_ISM_CHID_MAX 0xFFFF
#define SMC_VIRT_ISM_CHID_MIN 0xFF00

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 4631236..610af99 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -13,17 +13,99 @@

#include <linux/device.h>
#include <linux/types.h>
+#include <linux/smc.h>
#include <net/smc.h>

#include "smc_ism.h"
#include "smc_loopback.h"

#if IS_ENABLED(CONFIG_SMC_LO)
+#define SMC_LO_SUPPORTS_V2 0x1 /* SMC-D loopback supports SMC-Dv2 */
+
static const char smc_lo_dev_name[] = "smc_lo";
+static struct smcd_seid SMC_LO_SEID = {
+ .seid_string = "IBM-SYSZ-ISMSEID00000000",
+ .serial_number = "0000",
+ .type = "0000",
+};
+
static struct smc_lo_dev *lo_dev;

+static void smc_lo_create_seid(struct smcd_seid *seid)
+{
+#if IS_ENABLED(CONFIG_S390)
+ struct cpuid id;
+ u16 ident_tail;
+ char tmp[5];
+
+ get_cpu_id(&id);
+ ident_tail = (u16)(id.ident & S390_ISM_IDENT_MASK);
+ snprintf(tmp, 5, "%04X", ident_tail);
+ memcpy(&seid->serial_number, tmp, 4);
+ snprintf(tmp, 5, "%04X", id.machine);
+ memcpy(&seid->type, tmp, 4);
+#else
+ memset(seid, 0, SMC_MAX_EID_LEN);
+#endif
+}
+
+static void smc_lo_generate_id(struct smc_lo_dev *ldev)
+{
+ struct smcd_gid *lgid = &ldev->local_gid;
+ uuid_t uuid;
+
+ uuid_gen(&uuid);
+ memcpy(&lgid->gid, &uuid, sizeof(lgid->gid));
+ memcpy(&lgid->gid_ext, (u8 *)&uuid + sizeof(lgid->gid),
+ sizeof(lgid->gid_ext));
+
+ ldev->chid = SMC_LO_CHID;
+ smc_lo_create_seid(&SMC_LO_SEID);
+}
+
+static int smc_lo_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
+ u32 vid_valid, u32 vid)
+{
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* rgid should equal to lgid in loopback */
+ if (!ldev || rgid->gid != ldev->local_gid.gid ||
+ rgid->gid_ext != ldev->local_gid.gid_ext)
+ return -ENETUNREACH;
+ return 0;
+}
+
+static int smc_lo_supports_v2(void)
+{
+ return SMC_LO_SUPPORTS_V2;
+}
+
+static u8 *smc_lo_get_system_eid(void)
+{
+ return SMC_LO_SEID.seid_string;
+}
+
+static void smc_lo_get_local_gid(struct smcd_dev *smcd,
+ struct smcd_gid *smcd_gid)
+{
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ smcd_gid->gid = ldev->local_gid.gid;
+ smcd_gid->gid_ext = ldev->local_gid.gid_ext;
+}
+
+static u16 smc_lo_get_chid(struct smcd_dev *smcd)
+{
+ return ((struct smc_lo_dev *)smcd->priv)->chid;
+}
+
+static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
+{
+ return &((struct smc_lo_dev *)smcd->priv)->dev;
+}
+
static const struct smcd_ops lo_ops = {
- .query_remote_gid = NULL,
+ .query_remote_gid = smc_lo_query_rgid,
.register_dmb = NULL,
.unregister_dmb = NULL,
.add_vlan_id = NULL,
@@ -32,11 +114,11 @@
.reset_vlan_required = NULL,
.signal_event = NULL,
.move_data = NULL,
- .supports_v2 = NULL,
- .get_system_eid = NULL,
- .get_local_gid = NULL,
- .get_chid = NULL,
- .get_dev = NULL,
+ .supports_v2 = smc_lo_supports_v2,
+ .get_system_eid = smc_lo_get_system_eid,
+ .get_local_gid = smc_lo_get_local_gid,
+ .get_chid = smc_lo_get_chid,
+ .get_dev = smc_lo_get_dev,
};

static struct smcd_dev *smcd_lo_alloc_dev(const struct smcd_ops *ops,
@@ -104,6 +186,8 @@ static void smc_lo_dev_release(struct device *dev)

static int smc_lo_dev_init(struct smc_lo_dev *ldev)
{
+ smc_lo_generate_id(ldev);
+
return smcd_lo_register_dev(ldev);
}

diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index 39aa1dc..71a52da 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -19,11 +19,14 @@
#include <net/smc.h>

#if IS_ENABLED(CONFIG_SMC_LO)
+#define SMC_LO_CHID 0xFFFF
#define SMC_LODEV_MAX_DMBS 5000

struct smc_lo_dev {
struct smcd_dev *smcd;
struct device dev;
+ u16 chid;
+ struct smcd_gid local_gid;
};
#endif

--
1.8.3.1

2023-09-24 15:18:45

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 18/18] net/smc: add interface implementation of loopback device

This patch completes the specific implementation of loopback device
for the newly added SMC-D DMB-related interface.

The loopback device always provides mappable DMB because the device
users are in the same OS instance.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_loopback.c | 104 ++++++++++++++++++++++++++++++++++++++++++++-----
net/smc/smc_loopback.h | 5 +++
2 files changed, 99 insertions(+), 10 deletions(-)

diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
index 8375575..e9d2d62 100644
--- a/net/smc/smc_loopback.c
+++ b/net/smc/smc_loopback.c
@@ -106,6 +106,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
}
dmb_node->len = dmb->dmb_len;
dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
+ refcount_set(&dmb_node->refcnt, 1);

again:
/* add new dmb into hash table */
@@ -119,6 +120,7 @@ static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
}
hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
write_unlock(&ldev->dmb_ht_lock);
+ atomic_inc(&ldev->dmb_cnt);

dmb->sba_idx = dmb_node->sba_idx;
dmb->dmb_tok = dmb_node->token;
@@ -140,18 +142,27 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
struct smc_lo_dev *ldev = smcd->priv;

- /* remove dmb from hash table */
- write_lock(&ldev->dmb_ht_lock);
+ /* find dmb from hash table */
+ read_lock(&ldev->dmb_ht_lock);
hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
if (tmp_node->token == dmb->dmb_tok) {
dmb_node = tmp_node;
+ dmb_node->freeing = 1;
break;
}
}
if (!dmb_node) {
- write_unlock(&ldev->dmb_ht_lock);
+ read_unlock(&ldev->dmb_ht_lock);
return -EINVAL;
}
+ read_unlock(&ldev->dmb_ht_lock);
+
+ /* wait for dmb refcnt to be 0 */
+ if (!refcount_dec_and_test(&dmb_node->refcnt))
+ wait_event(ldev->dmbs_release, !refcount_read(&dmb_node->refcnt));
+
+ /* remove dmb from hash table */
+ write_lock(&ldev->dmb_ht_lock);
hash_del(&dmb_node->list);
write_unlock(&ldev->dmb_ht_lock);

@@ -159,9 +170,69 @@ static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
vfree(dmb_node->cpu_addr);
kfree(dmb_node);

+ if (atomic_dec_and_test(&ldev->dmb_cnt))
+ wake_up(&ldev->ldev_release);
return 0;
}

+static int smc_lo_attach_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* find dmb_node according to dmb->dmb_tok */
+ read_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
+ if (tmp_node->token == dmb->dmb_tok && !tmp_node->freeing) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ read_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ refcount_inc(&dmb_node->refcnt);
+ read_unlock(&ldev->dmb_ht_lock);
+
+ /* provide dmb information */
+ dmb->sba_idx = dmb_node->sba_idx;
+ dmb->dmb_tok = dmb_node->token;
+ dmb->cpu_addr = dmb_node->cpu_addr;
+ dmb->dma_addr = dmb_node->dma_addr;
+ dmb->dmb_len = dmb_node->len;
+ return 0;
+}
+
+static int smc_lo_detach_dmb(struct smcd_dev *smcd, u64 token)
+{
+ struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
+ struct smc_lo_dev *ldev = smcd->priv;
+
+ /* find dmb_node according to dmb->dmb_tok */
+ read_lock(&ldev->dmb_ht_lock);
+ hash_for_each_possible(ldev->dmb_ht, tmp_node, list, token) {
+ if (tmp_node->token == token) {
+ dmb_node = tmp_node;
+ break;
+ }
+ }
+ if (!dmb_node) {
+ read_unlock(&ldev->dmb_ht_lock);
+ return -EINVAL;
+ }
+ read_unlock(&ldev->dmb_ht_lock);
+
+ if (refcount_dec_and_test(&dmb_node->refcnt))
+ wake_up_all(&ldev->dmbs_release);
+ return 0;
+}
+
+static int smc_lo_get_dev_attr(struct smcd_dev *smcd)
+{
+ return BIT(ISM_ATTR_DMB_MAP);
+}
+
static int smc_lo_add_vlan_id(struct smcd_dev *smcd, u64 vlan_id)
{
return -EOPNOTSUPP;
@@ -194,7 +265,15 @@ static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx
{
struct smc_lo_dmb_node *rmb_node = NULL, *tmp_node;
struct smc_lo_dev *ldev = smcd->priv;
-
+ struct smc_connection *conn;
+
+ if (!sf) {
+ /* local sndbuf shares the same physical memory with
+ * peer RMB, so no need to copy data from local sndbuf
+ * to peer RMB.
+ */
+ return 0;
+ }
read_lock(&ldev->dmb_ht_lock);
hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_tok) {
if (tmp_node->token == dmb_tok) {
@@ -210,13 +289,10 @@ static int smc_lo_move_data(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx

memcpy((char *)rmb_node->cpu_addr + offset, data, size);

- if (sf) {
- struct smc_connection *conn =
- smcd->conn[rmb_node->sba_idx];
+ conn = smcd->conn[rmb_node->sba_idx];
+ if (conn && !conn->killed)
+ smcd_cdc_rx_handler(conn);

- if (conn && !conn->killed)
- smcd_cdc_rx_handler(conn);
- }
return 0;
}

@@ -253,6 +329,8 @@ static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
.query_remote_gid = smc_lo_query_rgid,
.register_dmb = smc_lo_register_dmb,
.unregister_dmb = smc_lo_unregister_dmb,
+ .attach_dmb = smc_lo_attach_dmb,
+ .detach_dmb = smc_lo_detach_dmb,
.add_vlan_id = smc_lo_add_vlan_id,
.del_vlan_id = smc_lo_del_vlan_id,
.set_vlan_required = smc_lo_set_vlan_required,
@@ -264,6 +342,7 @@ static struct device *smc_lo_get_dev(struct smcd_dev *smcd)
.get_local_gid = smc_lo_get_local_gid,
.get_chid = smc_lo_get_chid,
.get_dev = smc_lo_get_dev,
+ .get_dev_attr = smc_lo_get_dev_attr,
};

static struct smcd_dev *smcd_lo_alloc_dev(const struct smcd_ops *ops,
@@ -343,6 +422,9 @@ static int smc_lo_dev_init(struct smc_lo_dev *ldev)
smc_lo_generate_id(ldev);
rwlock_init(&ldev->dmb_ht_lock);
hash_init(ldev->dmb_ht);
+ atomic_set(&ldev->dmb_cnt, 0);
+ init_waitqueue_head(&ldev->dmbs_release);
+ init_waitqueue_head(&ldev->ldev_release);

return smcd_lo_register_dev(ldev);
}
@@ -376,6 +458,8 @@ static int smc_lo_dev_probe(void)
static void smc_lo_dev_exit(struct smc_lo_dev *ldev)
{
smcd_lo_unregister_dev(ldev);
+ if (atomic_read(&ldev->dmb_cnt))
+ wait_event(ldev->ldev_release, !atomic_read(&ldev->dmb_cnt));
}

static void smc_lo_dev_remove(void)
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
index a5b501b..e42c807 100644
--- a/net/smc/smc_loopback.h
+++ b/net/smc/smc_loopback.h
@@ -30,6 +30,8 @@ struct smc_lo_dmb_node {
u32 sba_idx;
void *cpu_addr;
dma_addr_t dma_addr;
+ refcount_t refcnt;
+ u8 freeing : 1;
};

struct smc_lo_dev {
@@ -40,6 +42,9 @@ struct smc_lo_dev {
DECLARE_BITMAP(sba_idx_mask, SMC_LODEV_MAX_DMBS);
rwlock_t dmb_ht_lock;
DECLARE_HASHTABLE(dmb_ht, SMC_LODEV_DMBS_HASH_BITS);
+ atomic_t dmb_cnt;
+ wait_queue_head_t dmbs_release;
+ wait_queue_head_t ldev_release;
};
#endif

--
1.8.3.1

2023-09-24 15:18:47

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 14/18] net/smc: add operation for getting DMB attribute

On s390, ISM devices are used on machine level hypervisors which is a
partitioning hypervisor without paging. The sndbufs and peer DMBs in such
case can't be mapped to the same physical memory.

However in other cases, such as communication within the same OS instance
with loopback, the sndbufs and peer DMBs can be mapped to the same physical
memory to avoid memory copy from sndbufs to peer DMBs.

So this patch introduces an operation to smcd_ops to judge whether the
sndbufs-DMB map is available. And for reuse, the interface is designed to
return DMB attribute, not only the sndbuf-DMB map feature.

Signed-off-by: Wen Gu <[email protected]>
---
include/net/smc.h | 5 +++++
net/smc/smc_ism.c | 8 ++++++++
net/smc/smc_ism.h | 1 +
3 files changed, 14 insertions(+)

diff --git a/include/net/smc.h b/include/net/smc.h
index 7c2d35c..917572fb 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -55,6 +55,10 @@ struct smcd_seid {

#define ISM_ERROR 0xFFFF

+enum {
+ ISM_ATTR_DMB_MAP = 0,
+};
+
struct smcd_dev;

struct smcd_gid {
@@ -82,6 +86,7 @@ struct smcd_ops {
void (*get_local_gid)(struct smcd_dev *dev, struct smcd_gid *gid);
u16 (*get_chid)(struct smcd_dev *dev);
struct device* (*get_dev)(struct smcd_dev *dev);
+ int (*get_dev_attr)(struct smcd_dev *dev);
};

struct smcd_dev {
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 0922fab..9b31d00 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -214,6 +214,14 @@ int smc_ism_unregister_dmb(struct smcd_dev *smcd, struct smc_buf_desc *dmb_desc)
return rc;
}

+bool smc_ism_dmb_mappable(struct smcd_dev *smcd)
+{
+ if (smcd->ops->get_dev_attr &&
+ (smcd->ops->get_dev_attr(smcd) & BIT(ISM_ATTR_DMB_MAP)))
+ return true;
+ return false;
+}
+
int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
struct smc_buf_desc *dmb_desc)
{
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 7ab82dd..cef212c 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -44,6 +44,7 @@ int smc_ism_cantalk(struct smcd_gid *peer_gid, unsigned short vlan_id,
int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
struct smc_buf_desc *dmb_desc);
int smc_ism_unregister_dmb(struct smcd_dev *dev, struct smc_buf_desc *dmb_desc);
+bool smc_ism_dmb_mappable(struct smcd_dev *smcd);
int smc_ism_signal_shutdown(struct smc_link_group *lgr);
void smc_ism_get_system_eid(u8 **eid);
u16 smc_ism_get_chid(struct smcd_dev *dev);
--
1.8.3.1

2023-09-24 15:18:49

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device

This patch introduces a kind of loopback device for SMC-D. The device
is created when SMC module is loaded and destroyed when the SMC module
is unloaded. The loopback device is a kernel device used only by the
SMC module and is not restricted by net namespace, so it can be used
for local inter-process or inter-container communication.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/Kconfig | 13 ++++
net/smc/Makefile | 2 +-
net/smc/af_smc.c | 12 +++-
net/smc/smc_loopback.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++++
net/smc/smc_loopback.h | 33 ++++++++++
5 files changed, 223 insertions(+), 2 deletions(-)
create mode 100644 net/smc/smc_loopback.c
create mode 100644 net/smc/smc_loopback.h

diff --git a/net/smc/Kconfig b/net/smc/Kconfig
index 1ab3c5a..3e9e03e 100644
--- a/net/smc/Kconfig
+++ b/net/smc/Kconfig
@@ -19,3 +19,16 @@ config SMC_DIAG
smcss.

if unsure, say Y.
+
+config SMC_LO
+ bool "SMC_LO: virtual loopback-ism for SMC"
+ depends on SMC
+ default n
+ help
+ SMC_LO provides a kind of virtual ISM device called loopback-ism
+ for SMC-D to upgrade AF_INET TCP connections whose ends share the
+ same kernel.
+ loopback-ism is a software-implemented simulation device that does
+ not depend on a specific architecture or hardware.
+
+ if unsure, say N.
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 875efcd..a8c3711 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -4,5 +4,5 @@ obj-$(CONFIG_SMC) += smc.o
obj-$(CONFIG_SMC_DIAG) += smc_diag.o
smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o
-smc-y += smc_tracepoint.o
+smc-y += smc_tracepoint.o smc_loopback.o
smc-$(CONFIG_SYSCTL) += smc_sysctl.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index c5b7716..6435659 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -53,6 +53,7 @@
#include "smc_stats.h"
#include "smc_tracepoint.h"
#include "smc_sysctl.h"
+#include "smc_loopback.h"

static DEFINE_MUTEX(smc_server_lgr_pending); /* serialize link group
* creation on server
@@ -3565,15 +3566,23 @@ static int __init smc_init(void)
goto out_sock;
}

+ rc = smc_loopback_init();
+ if (rc) {
+ pr_err("%s: smc_loopback_init fails with %d\n", __func__, rc);
+ goto out_ib;
+ }
+
rc = tcp_register_ulp(&smc_ulp_ops);
if (rc) {
pr_err("%s: tcp_ulp_register fails with %d\n", __func__, rc);
- goto out_ib;
+ goto out_lo;
}

static_branch_enable(&tcp_have_smc);
return 0;

+out_lo:
+ smc_loopback_exit();
out_ib:
smc_ib_unregister_client();
out_sock:
@@ -3611,6 +3620,7 @@ static void __exit smc_exit(void)
tcp_unregister_ulp(&smc_ulp_ops);
sock_unregister(PF_SMC);
smc_core_exit();
+ smc_loopback_exit();
smc_ib_unregister_client();
smc_ism_exit();
destroy_workqueue(smc_close_wq);
diff --git a/net/smc/smc_loopback.c b/net/smc/smc_loopback.c
new file mode 100644
index 0000000..4631236
--- /dev/null
+++ b/net/smc/smc_loopback.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared Memory Communications Direct over loopback device.
+ *
+ * Provide a SMC-D loopback dummy device.
+ *
+ * Copyright (c) 2022, Alibaba Inc.
+ *
+ * Author: Wen Gu <[email protected]>
+ * Tony Lu <[email protected]>
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/types.h>
+#include <net/smc.h>
+
+#include "smc_ism.h"
+#include "smc_loopback.h"
+
+#if IS_ENABLED(CONFIG_SMC_LO)
+static const char smc_lo_dev_name[] = "smc_lo";
+static struct smc_lo_dev *lo_dev;
+
+static const struct smcd_ops lo_ops = {
+ .query_remote_gid = NULL,
+ .register_dmb = NULL,
+ .unregister_dmb = NULL,
+ .add_vlan_id = NULL,
+ .del_vlan_id = NULL,
+ .set_vlan_required = NULL,
+ .reset_vlan_required = NULL,
+ .signal_event = NULL,
+ .move_data = NULL,
+ .supports_v2 = NULL,
+ .get_system_eid = NULL,
+ .get_local_gid = NULL,
+ .get_chid = NULL,
+ .get_dev = NULL,
+};
+
+static struct smcd_dev *smcd_lo_alloc_dev(const struct smcd_ops *ops,
+ int max_dmbs)
+{
+ struct smcd_dev *smcd;
+
+ smcd = kzalloc(sizeof(*smcd), GFP_KERNEL);
+ if (!smcd)
+ return NULL;
+
+ smcd->conn = kcalloc(max_dmbs, sizeof(struct smc_connection *),
+ GFP_KERNEL);
+ if (!smcd->conn)
+ goto out_smcd;
+
+ smcd->ops = ops;
+
+ spin_lock_init(&smcd->lock);
+ spin_lock_init(&smcd->lgr_lock);
+ INIT_LIST_HEAD(&smcd->vlan);
+ INIT_LIST_HEAD(&smcd->lgr_list);
+ init_waitqueue_head(&smcd->lgrs_deleted);
+ return smcd;
+
+out_smcd:
+ kfree(smcd);
+ return NULL;
+}
+
+static int smcd_lo_register_dev(struct smc_lo_dev *ldev)
+{
+ struct smcd_dev *smcd;
+
+ smcd = smcd_lo_alloc_dev(&lo_ops, SMC_LODEV_MAX_DMBS);
+ if (!smcd)
+ return -ENOMEM;
+
+ ldev->smcd = smcd;
+ smcd->priv = ldev;
+
+ /* TODO:
+ * register smc_lo to smcd_dev list.
+ */
+ return 0;
+}
+
+static void smcd_lo_unregister_dev(struct smc_lo_dev *ldev)
+{
+ /* TODO:
+ * unregister smc_lo from smcd_dev list.
+ */
+}
+
+static void smc_lo_dev_release(struct device *dev)
+{
+ struct smc_lo_dev *ldev =
+ container_of(dev, struct smc_lo_dev, dev);
+ struct smcd_dev *smcd = ldev->smcd;
+
+ kfree(smcd->conn);
+ kfree(smcd);
+ kfree(ldev);
+}
+
+static int smc_lo_dev_init(struct smc_lo_dev *ldev)
+{
+ return smcd_lo_register_dev(ldev);
+}
+
+static int smc_lo_dev_probe(void)
+{
+ struct smc_lo_dev *ldev;
+ int ret;
+
+ ldev = kzalloc(sizeof(*ldev), GFP_KERNEL);
+ if (!ldev)
+ return -ENOMEM;
+
+ ldev->dev.parent = NULL;
+ ldev->dev.release = smc_lo_dev_release;
+ device_initialize(&ldev->dev);
+ dev_set_name(&ldev->dev, smc_lo_dev_name);
+
+ ret = smc_lo_dev_init(ldev);
+ if (ret)
+ goto free_dev;
+
+ lo_dev = ldev; /* global loopback device */
+ return 0;
+
+free_dev:
+ kfree(ldev);
+ return ret;
+}
+
+static void smc_lo_dev_exit(struct smc_lo_dev *ldev)
+{
+ smcd_lo_unregister_dev(ldev);
+}
+
+static void smc_lo_dev_remove(void)
+{
+ if (!lo_dev)
+ return;
+
+ smc_lo_dev_exit(lo_dev);
+ put_device(&lo_dev->dev); /* device_initialize in smc_lo_dev_probe */
+}
+#endif
+
+int smc_loopback_init(void)
+{
+#if IS_ENABLED(CONFIG_SMC_LO)
+ return smc_lo_dev_probe();
+#else
+ return 0;
+#endif
+}
+
+void smc_loopback_exit(void)
+{
+#if IS_ENABLED(CONFIG_SMC_LO)
+ smc_lo_dev_remove();
+#endif
+}
diff --git a/net/smc/smc_loopback.h b/net/smc/smc_loopback.h
new file mode 100644
index 0000000..39aa1dc
--- /dev/null
+++ b/net/smc/smc_loopback.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Shared Memory Communications Direct over loopback device.
+ *
+ * Provide a SMC-D loopback dummy device.
+ *
+ * Copyright (c) 2022, Alibaba Inc.
+ *
+ * Author: Wen Gu <[email protected]>
+ * Tony Lu <[email protected]>
+ *
+ */
+
+#ifndef _SMC_LOOPBACK_H
+#define _SMC_LOOPBACK_H
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <net/smc.h>
+
+#if IS_ENABLED(CONFIG_SMC_LO)
+#define SMC_LODEV_MAX_DMBS 5000
+
+struct smc_lo_dev {
+ struct smcd_dev *smcd;
+ struct device dev;
+};
+#endif
+
+int smc_loopback_init(void);
+void smc_loopback_exit(void);
+
+#endif /* _SMC_LOOPBACK_H */
--
1.8.3.1

2023-09-24 15:18:50

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 17/18] net/smc: modify cursor update logic when sndbuf mapped to RMB

Since local sndbuf shares the same physical memory with peer RMB after
mapping, the logic related to cursor update needs to be adapted to ensure
that the data written by local won't overwrite the data that has not been
consumed by the peer.

So in this case, the fin_curs and sndbuf_space that were originally
updated after sending the CDC message should be modified to not be update
until the peer updates cons_curs.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_cdc.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 39 insertions(+), 11 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 2641800..7cc677d 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -18,6 +18,7 @@
#include "smc_tx.h"
#include "smc_rx.h"
#include "smc_close.h"
+#include "smc_ism.h"

/********************************** send *************************************/

@@ -256,17 +257,24 @@ int smcd_cdc_msg_send(struct smc_connection *conn)
return rc;
smc_curs_copy(&conn->rx_curs_confirmed, &curs, conn);
conn->local_rx_ctrl.prod_flags.cons_curs_upd_req = 0;
- /* Calculate transmitted data and increment free send buffer space */
- diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
- &conn->tx_curs_sent);
- /* increased by confirmed number of bytes */
- smp_mb__before_atomic();
- atomic_add(diff, &conn->sndbuf_space);
- /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
- smp_mb__after_atomic();
- smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
+ if (!smc_ism_dmb_mappable(conn->lgr->smcd)) {
+ /* If local sndbuf has been mapped to peer RMB, then
+ * don't update the tx_curs_fin and sndbuf_space until
+ * peer has consumed the data in RMB.
+ */

- smc_tx_sndbuf_nonfull(smc);
+ /* Calculate transmitted data and increment free send buffer space */
+ diff = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+ &conn->tx_curs_sent);
+ /* increased by confirmed number of bytes */
+ smp_mb__before_atomic();
+ atomic_add(diff, &conn->sndbuf_space);
+ /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+ smp_mb__after_atomic();
+ smc_curs_copy(&conn->tx_curs_fin, &conn->tx_curs_sent, conn);
+
+ smc_tx_sndbuf_nonfull(smc);
+ }
return rc;
}

@@ -324,7 +332,7 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
{
union smc_host_cursor cons_old, prod_old;
struct smc_connection *conn = &smc->conn;
- int diff_cons, diff_prod;
+ int diff_cons, diff_prod, diff_tx;

smc_curs_copy(&prod_old, &conn->local_rx_ctrl.prod, conn);
smc_curs_copy(&cons_old, &conn->local_rx_ctrl.cons, conn);
@@ -340,6 +348,26 @@ static void smc_cdc_msg_recv_action(struct smc_sock *smc,
atomic_add(diff_cons, &conn->peer_rmbe_space);
/* guarantee 0 <= peer_rmbe_space <= peer_rmbe_size */
smp_mb__after_atomic();
+
+ if (conn->lgr->is_smcd &&
+ smc_ism_dmb_mappable(conn->lgr->smcd)) {
+ /* If local sndbuf has been mapped to peer RMB, then
+ * update tx_curs_fin and sndbuf_space when peer has
+ * consumed the data in it's RMB.
+ */
+
+ /* calculate peer rmb consumed data */
+ diff_tx = smc_curs_diff(conn->sndbuf_desc->len, &conn->tx_curs_fin,
+ &conn->local_rx_ctrl.cons);
+ /* increase local sndbuf space and fin_curs */
+ smp_mb__before_atomic();
+ atomic_add(diff_tx, &conn->sndbuf_space);
+ /* guarantee 0 <= sndbuf_space <= sndbuf_desc->len */
+ smp_mb__after_atomic();
+ smc_curs_copy(&conn->tx_curs_fin, &conn->local_rx_ctrl.cons, conn);
+
+ smc_tx_sndbuf_nonfull(smc);
+ }
}

diff_prod = smc_curs_diff(conn->rmb_desc->len, &prod_old,
--
1.8.3.1

2023-09-24 15:18:58

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 06/18] net/smc: extend GID to 128bits only for virtual ISM device

Virtual ISM devices are introduced to SMC-Dv2.1 protocal, whose GIDs
are 128-bits UUIDs as defined by RFC4122. And note that the GIDs of
ISM devices still remain 64-bits.

This patch adapts the relevant codes, such as CLC handshake, to make
it compatible with 128 bits GID.

Signed-off-by: Wen Gu <[email protected]>
---
drivers/s390/net/ism_drv.c | 18 +++++++------
include/net/smc.h | 15 +++++++----
include/uapi/linux/smc.h | 3 +++
include/uapi/linux/smc_diag.h | 2 ++
net/smc/af_smc.c | 60 +++++++++++++++++++++++++++++++++----------
net/smc/smc_clc.c | 43 +++++++++++++++++++++----------
net/smc/smc_clc.h | 4 +--
net/smc/smc_core.c | 41 +++++++++++++++++++++--------
net/smc/smc_core.h | 7 ++---
net/smc/smc_diag.c | 11 ++++++--
net/smc/smc_ism.c | 18 ++++++++-----
net/smc/smc_ism.h | 3 ++-
net/smc/smc_pnet.c | 4 +--
13 files changed, 163 insertions(+), 66 deletions(-)

diff --git a/drivers/s390/net/ism_drv.c b/drivers/s390/net/ism_drv.c
index a34e913..8e42eb2 100644
--- a/drivers/s390/net/ism_drv.c
+++ b/drivers/s390/net/ism_drv.c
@@ -774,10 +774,10 @@ static void __exit ism_exit(void)
/*************************** SMC-D Implementation *****************************/

#if IS_ENABLED(CONFIG_SMC)
-static int smcd_query_rgid(struct smcd_dev *smcd, u64 rgid, u32 vid_valid,
- u32 vid)
+static int smcd_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
+ u32 vid_valid, u32 vid)
{
- return ism_query_rgid(smcd->priv, rgid, vid_valid, vid);
+ return ism_query_rgid(smcd->priv, rgid->gid, vid_valid, vid);
}

static int smcd_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
@@ -811,10 +811,10 @@ static int smcd_reset_vlan_required(struct smcd_dev *smcd)
return ism_cmd_simple(smcd->priv, ISM_RESET_VLAN);
}

-static int smcd_signal_ieq(struct smcd_dev *smcd, u64 rgid, u32 trigger_irq,
- u32 event_code, u64 info)
+static int smcd_signal_ieq(struct smcd_dev *smcd, struct smcd_gid *rgid,
+ u32 trigger_irq, u32 event_code, u64 info)
{
- return ism_signal_ieq(smcd->priv, rgid, trigger_irq, event_code, info);
+ return ism_signal_ieq(smcd->priv, rgid->gid, trigger_irq, event_code, info);
}

static int smcd_move(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx,
@@ -830,9 +830,11 @@ static int smcd_supports_v2(void)
SYSTEM_EID.type[0] != '0';
}

-static u64 smcd_get_local_gid(struct smcd_dev *smcd)
+static void smcd_get_local_gid(struct smcd_dev *smcd,
+ struct smcd_gid *smcd_gid)
{
- return ism_get_local_gid(smcd->priv);
+ smcd_gid->gid = ism_get_local_gid(smcd->priv);
+ smcd_gid->gid_ext = 0;
}

static u16 smcd_get_chid(struct smcd_dev *smcd)
diff --git a/include/net/smc.h b/include/net/smc.h
index f75116e..d8db5e1 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -51,9 +51,14 @@ struct smcd_dmb {

struct smcd_dev;

+struct smcd_gid {
+ u64 gid;
+ u64 gid_ext;
+};
+
struct smcd_ops {
- int (*query_remote_gid)(struct smcd_dev *dev, u64 rgid, u32 vid_valid,
- u32 vid);
+ int (*query_remote_gid)(struct smcd_dev *dev, struct smcd_gid *rgid,
+ u32 vid_valid, u32 vid);
int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
void *client);
int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
@@ -61,14 +66,14 @@ struct smcd_ops {
int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*set_vlan_required)(struct smcd_dev *dev);
int (*reset_vlan_required)(struct smcd_dev *dev);
- int (*signal_event)(struct smcd_dev *dev, u64 rgid, u32 trigger_irq,
- u32 event_code, u64 info);
+ int (*signal_event)(struct smcd_dev *dev, struct smcd_gid *rgid,
+ u32 trigger_irq, u32 event_code, u64 info);
int (*move_data)(struct smcd_dev *dev, u64 dmb_tok, unsigned int idx,
bool sf, unsigned int offset, void *data,
unsigned int size);
int (*supports_v2)(void);
u8* (*get_system_eid)(void);
- u64 (*get_local_gid)(struct smcd_dev *dev);
+ void (*get_local_gid)(struct smcd_dev *dev, struct smcd_gid *gid);
u16 (*get_chid)(struct smcd_dev *dev);
struct device* (*get_dev)(struct smcd_dev *dev);
};
diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h
index 837fcd4..0d2f020 100644
--- a/include/uapi/linux/smc.h
+++ b/include/uapi/linux/smc.h
@@ -99,6 +99,9 @@ enum {
SMC_NLA_LGR_V2_OS, /* u8 */
SMC_NLA_LGR_V2_NEG_EID, /* string */
SMC_NLA_LGR_V2_PEER_HOST, /* string */
+ SMC_NLA_LGR_V2_PAD, /* flag */
+ SMC_NLA_LGR_V2_GID_EXT, /* u64 */
+ SMC_NLA_LGR_V2_PEER_GID_EXT, /* u64 */
__SMC_NLA_LGR_V2_MAX,
SMC_NLA_LGR_V2_MAX = __SMC_NLA_LGR_V2_MAX - 1
};
diff --git a/include/uapi/linux/smc_diag.h b/include/uapi/linux/smc_diag.h
index 8cb3a6f..78b7dfb 100644
--- a/include/uapi/linux/smc_diag.h
+++ b/include/uapi/linux/smc_diag.h
@@ -107,6 +107,8 @@ struct smcd_diag_dmbinfo { /* SMC-D Socket internals */
__aligned_u64 my_gid; /* My GID */
__aligned_u64 token; /* Token of DMB */
__aligned_u64 peer_token; /* Token of remote DMBE */
+ __aligned_u64 peer_gid_ext; /* Peer GID extension */
+ __aligned_u64 my_gid_ext; /* My GID extension */
};

#endif /* _UAPI_SMC_DIAG_H_ */
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index bacdd97..5bb41404 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -719,7 +719,7 @@ static void smcd_conn_save_peer_info(struct smc_sock *smc,
int bufsize = smc_uncompress_bufsize(clc->d0.dmbe_size);

smc->conn.peer_rmbe_idx = clc->d0.dmbe_idx;
- smc->conn.peer_token = clc->d0.token;
+ smc->conn.peer_token = ntohll(clc->d0.token);
/* msg header takes up space in the buffer */
smc->conn.peer_rmbe_size = bufsize - sizeof(struct smcd_cdc_msg);
atomic_set(&smc->conn.peer_rmbe_space, smc->conn.peer_rmbe_size);
@@ -1044,7 +1044,8 @@ static int smc_find_ism_v2_device_clnt(struct smc_sock *smc,
{
int rc = SMC_CLC_DECL_NOSMCDDEV;
struct smcd_dev *smcd;
- int i = 1;
+ int i = 1, slot = 1;
+ bool is_virtdev;
u16 chid;

if (smcd_indicated(ini->smc_type_v1))
@@ -1056,14 +1057,21 @@ static int smc_find_ism_v2_device_clnt(struct smc_sock *smc,
chid = smc_ism_get_chid(smcd);
if (!smc_find_ism_v2_is_unique_chid(chid, ini, i))
continue;
+ is_virtdev = __smc_ism_is_virtdev(chid);
if (!smc_pnet_is_pnetid_set(smcd->pnetid) ||
smc_pnet_is_ndev_pnetid(sock_net(&smc->sk), smcd->pnetid)) {
+ if (is_virtdev && slot == SMC_MAX_ISM_DEVS)
+ /* no enough space for virtual ISM devices, whose GID
+ * takes 2 slots. Try the next potential ISM device.
+ */
+ continue;
ini->ism_dev[i] = smcd;
ini->ism_chid[i] = chid;
ini->is_smcd = true;
rc = 0;
i++;
- if (i > SMC_MAX_ISM_DEVS)
+ slot = is_virtdev ? slot + 2 : slot + 1;
+ if (slot > SMC_MAX_ISM_DEVS)
break;
}
}
@@ -1409,8 +1417,13 @@ static int smc_connect_ism(struct smc_sock *smc,
rc = smc_v2_determine_accepted_chid(aclc_v2, ini);
if (rc)
return rc;
+
+ if (__smc_ism_is_virtdev(ini->ism_chid[ini->ism_selected]))
+ ini->ism_peer_gid[ini->ism_selected].gid_ext =
+ ntohll(aclc_v2->d1.gid_ext);
+ /* for non-virtual ISM devices, peer gid_ext remains 0. */
}
- ini->ism_peer_gid[ini->ism_selected] = aclc->d0.gid;
+ ini->ism_peer_gid[ini->ism_selected].gid = ntohll(aclc->d0.gid);

/* there is only one lgr role for SMC-D; use server lock */
mutex_lock(&smc_server_lgr_pending);
@@ -2101,7 +2114,8 @@ static bool smc_is_already_selected(struct smcd_dev *smcd,

/* check for ISM devices matching proposed ISM devices */
static void smc_check_ism_v2_match(struct smc_init_info *ini,
- u16 proposed_chid, u64 proposed_gid,
+ u16 proposed_chid,
+ struct smcd_gid *proposed_gid,
unsigned int *matches)
{
struct smcd_dev *smcd;
@@ -2113,7 +2127,11 @@ static void smc_check_ism_v2_match(struct smc_init_info *ini,
continue;
if (smc_ism_get_chid(smcd) == proposed_chid &&
!smc_ism_cantalk(proposed_gid, ISM_RESERVED_VLANID, smcd)) {
- ini->ism_peer_gid[*matches] = proposed_gid;
+ ini->ism_peer_gid[*matches].gid = proposed_gid->gid;
+ if (__smc_ism_is_virtdev(proposed_chid))
+ ini->ism_peer_gid[*matches].gid_ext =
+ proposed_gid->gid_ext;
+ /* non-virtual ISM's peer gid_ext remains 0. */
ini->ism_dev[*matches] = smcd;
(*matches)++;
break;
@@ -2135,9 +2153,11 @@ static void smc_find_ism_v2_device_serv(struct smc_sock *new_smc,
struct smc_clc_v2_extension *smc_v2_ext;
struct smc_clc_msg_smcd *pclc_smcd;
unsigned int matches = 0;
+ struct smcd_gid smcd_gid;
u8 smcd_version;
u8 *eid = NULL;
int i, rc;
+ u16 chid;

if (!(ini->smcd_version & SMC_V2) || !smcd_indicated(ini->smc_type_v2))
goto not_found;
@@ -2147,18 +2167,31 @@ static void smc_find_ism_v2_device_serv(struct smc_sock *new_smc,
smcd_v2_ext = smc_get_clc_smcd_v2_ext(smc_v2_ext);

mutex_lock(&smcd_dev_list.mutex);
- if (pclc_smcd->ism.chid)
+ if (pclc_smcd->ism.chid) {
/* check for ISM device matching proposed native ISM device */
+ smcd_gid.gid = ntohll(pclc_smcd->ism.gid);
+ smcd_gid.gid_ext = 0;
smc_check_ism_v2_match(ini, ntohs(pclc_smcd->ism.chid),
- ntohll(pclc_smcd->ism.gid), &matches);
+ &smcd_gid, &matches);
+ }
for (i = 1; i <= smc_v2_ext->hdr.ism_gid_cnt; i++) {
/* check for ISM devices matching proposed non-native ISM
* devices
*/
- smc_check_ism_v2_match(ini,
- ntohs(smcd_v2_ext->gidchid[i - 1].chid),
- ntohll(smcd_v2_ext->gidchid[i - 1].gid),
- &matches);
+ smcd_gid.gid = ntohll(smcd_v2_ext->gidchid[i - 1].gid);
+ smcd_gid.gid_ext = 0;
+ chid = ntohs(smcd_v2_ext->gidchid[i - 1].chid);
+ if (__smc_ism_is_virtdev(chid)) {
+ if (i >= smc_v2_ext->hdr.ism_gid_cnt ||
+ chid != ntohs(smcd_v2_ext->gidchid[i].chid))
+ /* check if next slot exists and the next slot's
+ * chid is consistent with the current slot.
+ */
+ continue;
+
+ smcd_gid.gid_ext = ntohll(smcd_v2_ext->gidchid[i++].gid);
+ }
+ smc_check_ism_v2_match(ini, chid, &smcd_gid, &matches);
}
mutex_unlock(&smcd_dev_list.mutex);

@@ -2207,7 +2240,8 @@ static void smc_find_ism_v1_device_serv(struct smc_sock *new_smc,
if (!(ini->smcd_version & SMC_V1) || !smcd_indicated(ini->smc_type_v1))
goto not_found;
ini->is_smcd = true; /* prepare ISM check */
- ini->ism_peer_gid[0] = ntohll(pclc_smcd->ism.gid);
+ ini->ism_peer_gid[0].gid = ntohll(pclc_smcd->ism.gid);
+ ini->ism_peer_gid[0].gid_ext = 0;
rc = smc_find_ism_device(new_smc, ini);
if (rc)
goto not_found;
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index 125b0d2..aeb9643 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -882,11 +882,13 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
ETH_ALEN);
}
if (smcd_indicated(ini->smc_type_v1)) {
+ struct smcd_gid smcd_gid;
+
/* add SMC-D specifics */
if (ini->ism_dev[0]) {
smcd = ini->ism_dev[0];
- pclc_smcd->ism.gid =
- htonll(smcd->ops->get_local_gid(smcd));
+ smcd->ops->get_local_gid(smcd, &smcd_gid);
+ pclc_smcd->ism.gid = htonll(smcd_gid.gid);
pclc_smcd->ism.chid =
htons(smc_ism_get_chid(ini->ism_dev[0]));
}
@@ -919,10 +921,11 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
read_unlock(&smc_clc_eid_table.lock);
}
if (smcd_indicated(ini->smc_type_v2)) {
+ struct smcd_gid smcd_gid;
u8 *eid = NULL;
+ int slot = 0;

v2_ext->hdr.flag.seid = smc_clc_eid_table.seid_enabled;
- v2_ext->hdr.ism_gid_cnt = ini->ism_offered_cnt;
v2_ext->hdr.smcd_v2_ext_offset = htons(sizeof(*v2_ext) -
offsetofend(struct smc_clnt_opts_area_hdr,
smcd_v2_ext_offset) +
@@ -934,14 +937,21 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
if (ini->ism_offered_cnt) {
for (i = 1; i <= ini->ism_offered_cnt; i++) {
smcd = ini->ism_dev[i];
- gidchids[i - 1].gid =
- htonll(smcd->ops->get_local_gid(smcd));
- gidchids[i - 1].chid =
+ smcd->ops->get_local_gid(smcd, &smcd_gid);
+ gidchids[slot].chid =
htons(smc_ism_get_chid(ini->ism_dev[i]));
+ gidchids[slot].gid = htonll(smcd_gid.gid);
+ if (__smc_ism_is_virtdev(gidchids[slot].chid)) {
+ /* virtual-ism takes two slots */
+ gidchids[slot + 1].chid = gidchids[slot].chid;
+ gidchids[slot + 1].gid = htonll(smcd_gid.gid_ext);
+ slot++;
+ }
+ slot++;
}
- plen += ini->ism_offered_cnt *
- sizeof(struct smc_clc_smcd_gid_chid);
+ plen += slot * sizeof(struct smc_clc_smcd_gid_chid);
}
+ v2_ext->hdr.ism_gid_cnt = slot;
}
if (smcr_indicated(ini->smc_type_v2)) {
memcpy(v2_ext->roce, ini->smcrv2.ib_gid_v2, SMC_GID_SIZE);
@@ -977,7 +987,7 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
vec[i++].iov_len = sizeof(*smcd_v2_ext);
if (ini->ism_offered_cnt) {
vec[i].iov_base = gidchids;
- vec[i++].iov_len = ini->ism_offered_cnt *
+ vec[i++].iov_len = v2_ext->hdr.ism_gid_cnt *
sizeof(struct smc_clc_smcd_gid_chid);
}
}
@@ -1019,23 +1029,28 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc,
if (first_contact)
clc->hdr.typev2 |= SMC_FIRST_CONTACT_MASK;
if (conn->lgr->is_smcd) {
+ struct smcd_gid smcd_gid;
+ u16 chid;
+
/* SMC-D specific settings */
memcpy(clc->hdr.eyecatcher, SMCD_EYECATCHER,
sizeof(SMCD_EYECATCHER));
+ conn->lgr->smcd->ops->get_local_gid(conn->lgr->smcd, &smcd_gid);
clc->hdr.typev1 = SMC_TYPE_D;
- clc->d0.gid =
- conn->lgr->smcd->ops->get_local_gid(conn->lgr->smcd);
- clc->d0.token = conn->rmb_desc->token;
+ clc->d0.gid = htonll(smcd_gid.gid);
+ clc->d0.token = htonll(conn->rmb_desc->token);
clc->d0.dmbe_size = conn->rmbe_size_comp;
clc->d0.dmbe_idx = 0;
memcpy(&clc->d0.linkid, conn->lgr->id, SMC_LGR_ID_SIZE);
if (version == SMC_V1) {
clc->hdr.length = htons(SMCD_CLC_ACCEPT_CONFIRM_LEN);
} else {
- clc_v2->d1.chid =
- htons(smc_ism_get_chid(conn->lgr->smcd));
+ chid = smc_ism_get_chid(conn->lgr->smcd);
+ clc_v2->d1.chid = htons(chid);
if (eid && eid[0])
memcpy(clc_v2->d1.eid, eid, SMC_MAX_EID_LEN);
+ if (__smc_ism_is_virtdev(chid))
+ clc_v2->d1.gid_ext = htonll(smcd_gid.gid_ext);
len = SMCD_CLC_ACCEPT_CONFIRM_LEN_V2;
if (first_contact) {
fce_len = smc_clc_fill_fce(&fce, ini);
diff --git a/net/smc/smc_clc.h b/net/smc/smc_clc.h
index bcf37c8..611763a 100644
--- a/net/smc/smc_clc.h
+++ b/net/smc/smc_clc.h
@@ -281,8 +281,8 @@ struct smc_clc_msg_accept_confirm_v2 { /* clc accept / confirm message */
struct smcd_clc_msg_accept_confirm_common d0;
__be16 chid;
u8 eid[SMC_MAX_EID_LEN];
- u8 reserved5[8];
- } d1;
+ __be64 gid_ext;
+ } __packed d1;
};
};

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index d520ee6..6d7c738 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -284,6 +284,9 @@ static int smc_nl_fill_lgr_v2_common(struct smc_link_group *lgr,
{
char smc_host[SMC_MAX_HOSTNAME_LEN + 1];
char smc_eid[SMC_MAX_EID_LEN + 1];
+ struct smcd_dev *smcd = lgr->smcd;
+ struct smcd_gid smcd_gid;
+ bool is_virtdev;

if (nla_put_u8(skb, SMC_NLA_LGR_V2_VER, lgr->smc_version))
goto errv2attr;
@@ -299,6 +302,16 @@ static int smc_nl_fill_lgr_v2_common(struct smc_link_group *lgr,
smc_eid[SMC_MAX_EID_LEN] = 0;
if (nla_put_string(skb, SMC_NLA_LGR_V2_NEG_EID, smc_eid))
goto errv2attr;
+ smcd->ops->get_local_gid(smcd, &smcd_gid);
+ is_virtdev = smc_ism_is_virtdev(smcd);
+ if (nla_put_u64_64bit(skb, SMC_NLA_LGR_V2_GID_EXT,
+ is_virtdev ? smcd_gid.gid_ext : 0,
+ SMC_NLA_LGR_V2_PAD))
+ goto errv2attr;
+ if (nla_put_u64_64bit(skb, SMC_NLA_LGR_V2_PEER_GID_EXT,
+ is_virtdev ? lgr->peer_gid.gid_ext : 0,
+ SMC_NLA_LGR_V2_PAD))
+ goto errv2attr;

nla_nest_end(skb, v2_attrs);
return 0;
@@ -506,6 +519,7 @@ static int smc_nl_fill_smcd_lgr(struct smc_link_group *lgr,
{
char smc_pnet[SMC_MAX_PNETID_LEN + 1];
struct smcd_dev *smcd = lgr->smcd;
+ struct smcd_gid smcd_gid;
struct nlattr *attrs;
void *nlh;

@@ -521,11 +535,11 @@ static int smc_nl_fill_smcd_lgr(struct smc_link_group *lgr,

if (nla_put_u32(skb, SMC_NLA_LGR_D_ID, *((u32 *)&lgr->id)))
goto errattr;
+ smcd->ops->get_local_gid(smcd, &smcd_gid);
if (nla_put_u64_64bit(skb, SMC_NLA_LGR_D_GID,
- smcd->ops->get_local_gid(smcd),
- SMC_NLA_LGR_D_PAD))
+ smcd_gid.gid, SMC_NLA_LGR_D_PAD))
goto errattr;
- if (nla_put_u64_64bit(skb, SMC_NLA_LGR_D_PEER_GID, lgr->peer_gid,
+ if (nla_put_u64_64bit(skb, SMC_NLA_LGR_D_PEER_GID, lgr->peer_gid.gid,
SMC_NLA_LGR_D_PAD))
goto errattr;
if (nla_put_u8(skb, SMC_NLA_LGR_D_VLAN_ID, lgr->vlan_id))
@@ -876,7 +890,8 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini)
/* SMC-D specific settings */
smcd = ini->ism_dev[ini->ism_selected];
get_device(smcd->ops->get_dev(smcd));
- lgr->peer_gid = ini->ism_peer_gid[ini->ism_selected];
+ memcpy(&lgr->peer_gid, &ini->ism_peer_gid[ini->ism_selected],
+ sizeof(struct smcd_gid));
lgr->smcd = ini->ism_dev[ini->ism_selected];
lgr_list = &ini->ism_dev[ini->ism_selected]->lgr_list;
lgr_lock = &lgr->smcd->lgr_lock;
@@ -1514,7 +1529,8 @@ void smc_lgr_terminate_sched(struct smc_link_group *lgr)
}

/* Called when peer lgr shutdown (regularly or abnormally) is received */
-void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid, unsigned short vlan)
+void smc_smcd_terminate(struct smcd_dev *dev, struct smcd_gid *peer_gid,
+ unsigned short vlan)
{
struct smc_link_group *lgr, *l;
LIST_HEAD(lgr_free_list);
@@ -1522,7 +1538,10 @@ void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid, unsigned short vlan)
/* run common cleanup function and build free list */
spin_lock_bh(&dev->lgr_lock);
list_for_each_entry_safe(lgr, l, &dev->lgr_list, list) {
- if ((!peer_gid || lgr->peer_gid == peer_gid) &&
+ if ((!peer_gid->gid ||
+ (lgr->peer_gid.gid == peer_gid->gid &&
+ !smc_ism_is_virtdev(dev) ? 1 :
+ lgr->peer_gid.gid_ext == peer_gid->gid_ext)) &&
(vlan == VLAN_VID_MASK || lgr->vlan_id == vlan)) {
if (peer_gid) /* peer triggered termination */
lgr->peer_shutdown = 1;
@@ -1859,10 +1878,12 @@ static bool smcr_lgr_match(struct smc_link_group *lgr, u8 smcr_version,
return false;
}

-static bool smcd_lgr_match(struct smc_link_group *lgr,
- struct smcd_dev *smcismdev, u64 peer_gid)
+static bool smcd_lgr_match(struct smc_link_group *lgr, struct smcd_dev *smcismdev,
+ struct smcd_gid *peer_gid)
{
- return lgr->peer_gid == peer_gid && lgr->smcd == smcismdev;
+ return lgr->peer_gid.gid == peer_gid->gid && lgr->smcd == smcismdev &&
+ smc_ism_is_virtdev(smcismdev) ?
+ (lgr->peer_gid.gid_ext == peer_gid->gid_ext) : 1;
}

/* create a new SMC connection (and a new link group if necessary) */
@@ -1892,7 +1913,7 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
write_lock_bh(&lgr->conns_lock);
if ((ini->is_smcd ?
smcd_lgr_match(lgr, ini->ism_dev[ini->ism_selected],
- ini->ism_peer_gid[ini->ism_selected]) :
+ &ini->ism_peer_gid[ini->ism_selected]) :
smcr_lgr_match(lgr, ini->smcr_version,
ini->peer_systemid,
ini->peer_gid, ini->peer_mac, role,
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 9f65678..d57eb9b 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -17,6 +17,7 @@
#include <linux/pci.h>
#include <rdma/ib_verbs.h>
#include <net/genetlink.h>
+#include <net/smc.h>

#include "smc.h"
#include "smc_ib.h"
@@ -355,7 +356,7 @@ struct smc_link_group {
/* max links can be added in lgr */
};
struct { /* SMC-D */
- u64 peer_gid;
+ struct smcd_gid peer_gid;
/* Peer GID (remote) */
struct smcd_dev *smcd;
/* ISM device for VLAN reg. */
@@ -417,7 +418,7 @@ struct smc_init_info {
u32 ib_clcqpn;
struct smc_init_info_smcrv2 smcrv2;
/* SMC-D */
- u64 ism_peer_gid[SMC_MAX_ISM_DEVS + 1];
+ struct smcd_gid ism_peer_gid[SMC_MAX_ISM_DEVS + 1];
struct smcd_dev *ism_dev[SMC_MAX_ISM_DEVS + 1];
u16 ism_chid[SMC_MAX_ISM_DEVS + 1];
u8 ism_offered_cnt; /* # of ISM devices offered */
@@ -545,7 +546,7 @@ static inline void smc_set_pci_values(struct pci_dev *pci_dev,
void smc_lgr_put(struct smc_link_group *lgr);
void smcr_port_add(struct smc_ib_device *smcibdev, u8 ibport);
void smcr_port_err(struct smc_ib_device *smcibdev, u8 ibport);
-void smc_smcd_terminate(struct smcd_dev *dev, u64 peer_gid,
+void smc_smcd_terminate(struct smcd_dev *dev, struct smcd_gid *peer_gid,
unsigned short vlan);
void smc_smcd_terminate_all(struct smcd_dev *dev);
void smc_smcr_terminate_all(struct smc_ib_device *smcibdev);
diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c
index 7ff2152..a7a48b6 100644
--- a/net/smc/smc_diag.c
+++ b/net/smc/smc_diag.c
@@ -21,6 +21,7 @@

#include "smc.h"
#include "smc_core.h"
+#include "smc_ism.h"

struct smc_diag_dump_ctx {
int pos[2];
@@ -168,12 +169,18 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb,
struct smc_connection *conn = &smc->conn;
struct smcd_diag_dmbinfo dinfo;
struct smcd_dev *smcd = conn->lgr->smcd;
+ struct smcd_gid smcd_gid;

memset(&dinfo, 0, sizeof(dinfo));

dinfo.linkid = *((u32 *)conn->lgr->id);
- dinfo.peer_gid = conn->lgr->peer_gid;
- dinfo.my_gid = smcd->ops->get_local_gid(smcd);
+ dinfo.peer_gid = conn->lgr->peer_gid.gid;
+ smcd->ops->get_local_gid(smcd, &smcd_gid);
+ dinfo.my_gid = smcd_gid.gid;
+ if (smc_ism_is_virtdev(smcd)) {
+ dinfo.peer_gid_ext = conn->lgr->peer_gid.gid_ext;
+ dinfo.my_gid_ext = smcd_gid.gid_ext;
+ }
dinfo.token = conn->rmb_desc->token;
dinfo.peer_token = conn->peer_token;

diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 8f1ba74..0922fab 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -44,7 +44,8 @@ static void smcd_handle_irq(struct ism_dev *ism, unsigned int dmbno,
#endif

/* Test if an ISM communication is possible - same CPC */
-int smc_ism_cantalk(u64 peer_gid, unsigned short vlan_id, struct smcd_dev *smcd)
+int smc_ism_cantalk(struct smcd_gid *peer_gid, unsigned short vlan_id,
+ struct smcd_dev *smcd)
{
return smcd->ops->query_remote_gid(smcd, peer_gid, vlan_id ? 1 : 0,
vlan_id);
@@ -223,7 +224,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
dmb.dmb_len = dmb_len;
dmb.sba_idx = dmb_desc->sba_idx;
dmb.vlan_id = lgr->vlan_id;
- dmb.rgid = lgr->peer_gid;
+ dmb.rgid = lgr->peer_gid.gid;
rc = lgr->smcd->ops->register_dmb(lgr->smcd, &dmb, lgr->smcd->client);
if (!rc) {
dmb_desc->sba_idx = dmb.sba_idx;
@@ -353,18 +354,20 @@ struct smc_ism_event_work {

static void smcd_handle_sw_event(struct smc_ism_event_work *wrk)
{
+ struct smcd_gid peer_gid = { .gid = wrk->event.tok,
+ .gid_ext = 0 };
union smcd_sw_event_info ev_info;

ev_info.info = wrk->event.info;
switch (wrk->event.code) {
case ISM_EVENT_CODE_SHUTDOWN: /* Peer shut down DMBs */
- smc_smcd_terminate(wrk->smcd, wrk->event.tok, ev_info.vlan_id);
+ smc_smcd_terminate(wrk->smcd, &peer_gid, ev_info.vlan_id);
break;
case ISM_EVENT_CODE_TESTLINK: /* Activity timer */
if (ev_info.code == ISM_EVENT_REQUEST) {
ev_info.code = ISM_EVENT_RESPONSE;
wrk->smcd->ops->signal_event(wrk->smcd,
- wrk->event.tok,
+ &peer_gid,
ISM_EVENT_REQUEST_IR,
ISM_EVENT_CODE_TESTLINK,
ev_info.info);
@@ -378,10 +381,13 @@ static void smc_ism_event_work(struct work_struct *work)
{
struct smc_ism_event_work *wrk =
container_of(work, struct smc_ism_event_work, work);
+ struct smcd_gid smcd_gid = { .gid = wrk->event.tok,
+ .gid_ext = 0 };

switch (wrk->event.type) {
case ISM_EVENT_GID: /* GID event, token is peer GID */
- smc_smcd_terminate(wrk->smcd, wrk->event.tok, VLAN_VID_MASK);
+ smcd_gid.gid = wrk->event.tok;
+ smc_smcd_terminate(wrk->smcd, &smcd_gid, VLAN_VID_MASK);
break;
case ISM_EVENT_DMB:
break;
@@ -530,7 +536,7 @@ int smc_ism_signal_shutdown(struct smc_link_group *lgr)
memcpy(ev_info.uid, lgr->id, SMC_LGR_ID_SIZE);
ev_info.vlan_id = lgr->vlan_id;
ev_info.code = ISM_EVENT_REQUEST;
- rc = lgr->smcd->ops->signal_event(lgr->smcd, lgr->peer_gid,
+ rc = lgr->smcd->ops->signal_event(lgr->smcd, &lgr->peer_gid,
ISM_EVENT_REQUEST_IR,
ISM_EVENT_CODE_SHUTDOWN,
ev_info.info);
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index 2ecc8de..e6ea08c 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -33,7 +33,8 @@ struct smc_ism_vlanid { /* VLAN id set on ISM device */

struct smcd_dev;

-int smc_ism_cantalk(u64 peer_gid, unsigned short vlan_id, struct smcd_dev *dev);
+int smc_ism_cantalk(struct smcd_gid *peer_gid, unsigned short vlan_id,
+ struct smcd_dev *dev);
void smc_ism_set_conn(struct smc_connection *conn);
void smc_ism_unset_conn(struct smc_connection *conn);
int smc_ism_get_vlan(struct smcd_dev *dev, unsigned short vlan_id);
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index 1177540..9f2c58c 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -1103,8 +1103,8 @@ static void smc_pnet_find_ism_by_pnetid(struct net_device *ndev,
list_for_each_entry(ismdev, &smcd_dev_list.list, list) {
if (smc_pnet_match(ismdev->pnetid, ndev_pnetid) &&
!ismdev->going_away &&
- (!ini->ism_peer_gid[0] ||
- !smc_ism_cantalk(ini->ism_peer_gid[0], ini->vlan_id,
+ (!ini->ism_peer_gid[0].gid ||
+ !smc_ism_cantalk(&ini->ism_peer_gid[0], ini->vlan_id,
ismdev))) {
ini->ism_dev[0] = ismdev;
break;
--
1.8.3.1

2023-09-24 15:18:59

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 16/18] net/smc: avoid data copy from sndbuf to peer RMB in SMC-D

This patch aims to avoid data copy from local sndbuf to peer RMB by mapping
local sndbuf to peer RMB when DMBs have ISM_ATTR_DMB_MAP attribute.

After this, local sndbuf and peer RMB share the same physical memory.

+----------+ +----------+
| socket A | | socket B |
+----------+ +----------+
| ^
| +---------+ |
regard as | | regard as
local sndbuf | B's | local RMB
| | RMB | |
|-------> | |-----------|
+---------+

1. From the perspective of RMB:

a. Created or reused when connection is created.
b. Unused and recycled to lgr buffer pool when connection is freed.
c. Freed when link group is freed.

2. From the perspective of sndbuf:

a. Mapped to peer RMB by the rtoken exchanged through CLC message.
Then accessing local sndbuf is equivalent to accessing peer RMB.
c. Unmapped from peer RMB and freed when connection is freed. Won't be
recycled to lgr buffer pool.

Therefore, the data written to local sndbuf will directly reach peer RMB.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/af_smc.c | 14 +++++++++++
net/smc/smc_core.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
net/smc/smc_core.h | 1 +
3 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 6435659..47556c3 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1442,6 +1442,12 @@ static int smc_connect_ism(struct smc_sock *smc,
}

smc_conn_save_peer_info(smc, aclc);
+
+ if (smc_ism_dmb_mappable(smc->conn.lgr->smcd)) {
+ rc = smcd_buf_attach(smc);
+ if (rc)
+ goto connect_abort;
+ }
smc_close_init(smc);
smc_rx_init(smc);
smc_tx_init(smc);
@@ -2550,6 +2556,14 @@ static void smc_listen_work(struct work_struct *work)
mutex_unlock(&smc_server_lgr_pending);
}
smc_conn_save_peer_info(new_smc, cclc);
+
+ if (ini->is_smcd &&
+ smc_ism_dmb_mappable(new_smc->conn.lgr->smcd)) {
+ rc = smcd_buf_attach(new_smc);
+ if (rc)
+ goto out_decl;
+ }
+
smc_listen_out_connected(new_smc);
SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk), ini);
goto out_free;
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 6d7c738..96b1def 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1154,6 +1154,20 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
}
}

+static void smcd_buf_detach(struct smc_connection *conn)
+{
+ struct smcd_dev *smcd = conn->lgr->smcd;
+ u64 peer_token = conn->peer_token;
+
+ if (!conn->sndbuf_desc)
+ return;
+
+ smc_ism_detach_dmb(smcd, peer_token);
+
+ kfree(conn->sndbuf_desc);
+ conn->sndbuf_desc = NULL;
+}
+
static void smc_buf_unuse(struct smc_connection *conn,
struct smc_link_group *lgr)
{
@@ -1198,6 +1212,10 @@ void smc_conn_free(struct smc_connection *conn)
if (!list_empty(&lgr->list))
smc_ism_unset_conn(conn);
tasklet_kill(&conn->rx_tsklet);
+
+ /* detach sndbuf from peer RMB */
+ if (smc_ism_dmb_mappable(lgr->smcd))
+ smcd_buf_detach(conn);
} else {
smc_cdc_wait_pend_tx_wr(conn);
if (current_work() != &conn->abort_work)
@@ -2459,15 +2477,23 @@ void smc_rmb_sync_sg_for_cpu(struct smc_connection *conn)
*/
int smc_buf_create(struct smc_sock *smc, bool is_smcd)
{
+ bool sndbuf_created = false;
int rc;

+ if (is_smcd &&
+ smc_ism_dmb_mappable(smc->conn.lgr->smcd))
+ goto create_rmb;
+
/* create send buffer */
rc = __smc_buf_create(smc, is_smcd, false);
if (rc)
return rc;
+ sndbuf_created = true;
+
+create_rmb:
/* create rmb */
rc = __smc_buf_create(smc, is_smcd, true);
- if (rc) {
+ if (rc && sndbuf_created) {
down_write(&smc->conn.lgr->sndbufs_lock);
list_del(&smc->conn.sndbuf_desc->list);
up_write(&smc->conn.lgr->sndbufs_lock);
@@ -2477,6 +2503,48 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
return rc;
}

+int smcd_buf_attach(struct smc_sock *smc)
+{
+ struct smc_connection *conn = &smc->conn;
+ struct smcd_dev *smcd = conn->lgr->smcd;
+ u64 peer_token = conn->peer_token;
+ struct smc_buf_desc *buf_desc;
+ int rc;
+
+ buf_desc = kzalloc(sizeof(*buf_desc), GFP_KERNEL);
+ if (!buf_desc)
+ return -ENOMEM;
+
+ /* map local sndbuf desc to peer RMB, so operations on local
+ * sndbuf are equivalent to operations on peer RMB.
+ */
+ rc = smc_ism_attach_dmb(smcd, peer_token, buf_desc);
+ if (rc) {
+ rc = SMC_CLC_DECL_MEM;
+ goto free;
+ }
+
+ smc->sk.sk_sndbuf = buf_desc->len;
+ buf_desc->cpu_addr = (u8 *)buf_desc->cpu_addr + sizeof(struct smcd_cdc_msg);
+ buf_desc->len -= sizeof(struct smcd_cdc_msg);
+ conn->sndbuf_desc = buf_desc;
+ conn->sndbuf_desc->used = 1;
+ atomic_set(&conn->sndbuf_space, conn->sndbuf_desc->len);
+ return 0;
+
+free:
+ if (conn->rmb_desc) {
+ /* free local RMB as well */
+ down_write(&conn->lgr->rmbs_lock);
+ list_del(&conn->rmb_desc->list);
+ up_write(&conn->lgr->rmbs_lock);
+ smc_buf_free(conn->lgr, true, conn->rmb_desc);
+ conn->rmb_desc = NULL;
+ }
+ kfree(buf_desc);
+ return rc;
+}
+
static inline int smc_rmb_reserve_rtoken_idx(struct smc_link_group *lgr)
{
int i;
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index d57eb9b..2cba119 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -551,6 +551,7 @@ void smc_smcd_terminate(struct smcd_dev *dev, struct smcd_gid *peer_gid,
void smc_smcd_terminate_all(struct smcd_dev *dev);
void smc_smcr_terminate_all(struct smc_ib_device *smcibdev);
int smc_buf_create(struct smc_sock *smc, bool is_smcd);
+int smcd_buf_attach(struct smc_sock *smc);
int smc_uncompress_bufsize(u8 compressed);
int smc_rmb_rtoken_handling(struct smc_connection *conn, struct smc_link *link,
struct smc_clc_msg_accept_confirm *clc);
--
1.8.3.1

2023-09-24 15:19:00

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 15/18] net/smc: add operations for DMB attach and detach

This patch extends smcd_ops, adding two more semantic for SMC-D DMB:

- attach_dmb:
Attach an already registered DMB to a specific buf_desc, so that we
can refer to the DMB through this buf_desc.

- detach_dmb:
Reverse operation of attach_dmb. detach the DMB from the buf_desc.

This interface extension is to prepare for the avoidance of memory copy
from sndbuf to RMB with SMC-D device whose DMBs has ISM_ATTR_DMB_MAP
attribute.

Signed-off-by: Wen Gu <[email protected]>
---
include/net/smc.h | 2 ++
net/smc/smc_ism.c | 31 +++++++++++++++++++++++++++++++
net/smc/smc_ism.h | 2 ++
3 files changed, 35 insertions(+)

diff --git a/include/net/smc.h b/include/net/smc.h
index 917572fb..39a21ed 100644
--- a/include/net/smc.h
+++ b/include/net/smc.h
@@ -72,6 +72,8 @@ struct smcd_ops {
int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
void *client);
int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+ int (*attach_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
+ int (*detach_dmb)(struct smcd_dev *dev, u64 token);
int (*add_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
int (*set_vlan_required)(struct smcd_dev *dev);
diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
index 9b31d00..32d5d96 100644
--- a/net/smc/smc_ism.c
+++ b/net/smc/smc_ism.c
@@ -244,6 +244,37 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int dmb_len,
return rc;
}

+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token,
+ struct smc_buf_desc *dmb_desc)
+{
+ struct smcd_dmb dmb;
+ int rc = 0;
+
+ memset(&dmb, 0, sizeof(dmb));
+ dmb.dmb_tok = token;
+
+ if (!dev->ops->attach_dmb)
+ return -EINVAL;
+
+ rc = dev->ops->attach_dmb(dev, &dmb);
+ if (!rc) {
+ dmb_desc->sba_idx = dmb.sba_idx;
+ dmb_desc->token = dmb.dmb_tok;
+ dmb_desc->cpu_addr = dmb.cpu_addr;
+ dmb_desc->dma_addr = dmb.dma_addr;
+ dmb_desc->len = dmb.dmb_len;
+ }
+ return rc;
+}
+
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token)
+{
+ if (!dev->ops->detach_dmb)
+ return -EINVAL;
+
+ return dev->ops->detach_dmb(dev, token);
+}
+
static int smc_nl_handle_smcd_dev(struct smcd_dev *smcd,
struct sk_buff *skb,
struct netlink_callback *cb)
diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
index cef212c..f30aae4 100644
--- a/net/smc/smc_ism.h
+++ b/net/smc/smc_ism.h
@@ -45,6 +45,8 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
struct smc_buf_desc *dmb_desc);
int smc_ism_unregister_dmb(struct smcd_dev *dev, struct smc_buf_desc *dmb_desc);
bool smc_ism_dmb_mappable(struct smcd_dev *smcd);
+int smc_ism_attach_dmb(struct smcd_dev *dev, u64 token, struct smc_buf_desc *dmb_desc);
+int smc_ism_detach_dmb(struct smcd_dev *dev, u64 token);
int smc_ism_signal_shutdown(struct smc_link_group *lgr);
void smc_ism_get_system_eid(u8 **eid);
u16 smc_ism_get_chid(struct smcd_dev *dev);
--
1.8.3.1

2023-09-24 15:19:07

by Wen Gu

[permalink] [raw]
Subject: [PATCH net-next v4 07/18] net/smc: disable SEID on non-s390 architecture

On s390 architecture SMC-Dv2 SEID is generated by ISMv2 device with
partial cpuid. It is predefined and unique to represents the physical
machine (CPC), the maximum communication space of SMC-D.

On non-s390 archs like x86/64 or ARM, there is no similar information
to identify physical machine, especially in virtualization scenarios
like KVM, etc. In such cases, SEID is forcibly disabled and the user-
defined UEID will be used to represent the communicable space.

Signed-off-by: Wen Gu <[email protected]>
---
net/smc/smc_clc.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index aeb9643..deb43db 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -155,10 +155,12 @@ static int smc_clc_ueid_remove(char *ueid)
rc = 0;
}
}
+#if IS_ENABLED(CONFIG_S390)
if (!rc && !smc_clc_eid_table.ueid_cnt) {
smc_clc_eid_table.seid_enabled = 1;
rc = -EAGAIN; /* indicate success and enabling of seid */
}
+#endif
write_unlock(&smc_clc_eid_table.lock);
return rc;
}
@@ -273,22 +275,30 @@ int smc_nl_dump_seid(struct sk_buff *skb, struct netlink_callback *cb)

int smc_nl_enable_seid(struct sk_buff *skb, struct genl_info *info)
{
+#if IS_ENABLED(CONFIG_S390)
write_lock(&smc_clc_eid_table.lock);
smc_clc_eid_table.seid_enabled = 1;
write_unlock(&smc_clc_eid_table.lock);
return 0;
+#else
+ return -EOPNOTSUPP;
+#endif
}

int smc_nl_disable_seid(struct sk_buff *skb, struct genl_info *info)
{
int rc = 0;

+#if IS_ENABLED(CONFIG_S390)
write_lock(&smc_clc_eid_table.lock);
if (!smc_clc_eid_table.ueid_cnt)
rc = -ENOENT;
else
smc_clc_eid_table.seid_enabled = 0;
write_unlock(&smc_clc_eid_table.lock);
+#else
+ rc = -EOPNOTSUPP;
+#endif
return rc;
}

@@ -1292,7 +1302,11 @@ void __init smc_clc_init(void)
INIT_LIST_HEAD(&smc_clc_eid_table.list);
rwlock_init(&smc_clc_eid_table.lock);
smc_clc_eid_table.ueid_cnt = 0;
+#if IS_ENABLED(CONFIG_S390)
smc_clc_eid_table.seid_enabled = 1;
+#else
+ smc_clc_eid_table.seid_enabled = 0;
+#endif
}

void smc_clc_exit(void)
--
1.8.3.1

2023-09-25 01:47:29

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 12/18] net/smc: implement DMB-related operations of loopback



On 2023/9/25 07:29, kernel test robot wrote:
> Hi Wen,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on net-next/main]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Wen-Gu/net-smc-decouple-ism_dev-from-SMC-D-device-dump/20230924-231933
> base: net-next/main
> patch link: https://lore.kernel.org/r/1695568613-125057-13-git-send-email-guwen%40linux.alibaba.com
> patch subject: [PATCH net-next v4 12/18] net/smc: implement DMB-related operations of loopback
> config: mips-allmodconfig (https://download.01.org/0day-ci/archive/20230925/[email protected]/config)
> compiler: mips-linux-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230925/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All error/warnings (new ones prefixed by >>):
>
> net/smc/smc_loopback.c: In function 'smc_lo_register_dmb':
>>> net/smc/smc_loopback.c:102:30: error: implicit declaration of function 'vzalloc'; did you mean 'kvzalloc'? [-Werror=implicit-function-declaration]
> 102 | dmb_node->cpu_addr = vzalloc(dmb->dmb_len);
> | ^~~~~~~
> | kvzalloc
>>> net/smc/smc_loopback.c:102:28: warning: assignment to 'void *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
> 102 | dmb_node->cpu_addr = vzalloc(dmb->dmb_len);
> | ^
> net/smc/smc_loopback.c: In function 'smc_lo_unregister_dmb':
>>> net/smc/smc_loopback.c:159:9: error: implicit declaration of function 'vfree'; did you mean 'kvfree'? [-Werror=implicit-function-declaration]
> 159 | vfree(dmb_node->cpu_addr);
> | ^~~~~
> | kvfree
> cc1: some warnings being treated as errors
>

It can be fixed by including corresponding header file:

#include <linux/vmalloc.h>


Continue to wait for other review comments and will fix this in the next version.

Thanks.

>
> vim +102 net/smc/smc_loopback.c
>
> 79
> 80 static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
> 81 void *client_priv)
> 82 {
> 83 struct smc_lo_dmb_node *dmb_node, *tmp_node;
> 84 struct smc_lo_dev *ldev = smcd->priv;
> 85 int sba_idx, rc;
> 86
> 87 /* check space for new dmb */
> 88 for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LODEV_MAX_DMBS) {
> 89 if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
> 90 break;
> 91 }
> 92 if (sba_idx == SMC_LODEV_MAX_DMBS)
> 93 return -ENOSPC;
> 94
> 95 dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
> 96 if (!dmb_node) {
> 97 rc = -ENOMEM;
> 98 goto err_bit;
> 99 }
> 100
> 101 dmb_node->sba_idx = sba_idx;
> > 102 dmb_node->cpu_addr = vzalloc(dmb->dmb_len);
> 103 if (!dmb_node->cpu_addr) {
> 104 rc = -ENOMEM;
> 105 goto err_node;
> 106 }
> 107 dmb_node->len = dmb->dmb_len;
> 108 dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
> 109
> 110 again:
> 111 /* add new dmb into hash table */
> 112 get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
> 113 write_lock(&ldev->dmb_ht_lock);
> 114 hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_node->token) {
> 115 if (tmp_node->token == dmb_node->token) {
> 116 write_unlock(&ldev->dmb_ht_lock);
> 117 goto again;
> 118 }
> 119 }
> 120 hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
> 121 write_unlock(&ldev->dmb_ht_lock);
> 122
> 123 dmb->sba_idx = dmb_node->sba_idx;
> 124 dmb->dmb_tok = dmb_node->token;
> 125 dmb->cpu_addr = dmb_node->cpu_addr;
> 126 dmb->dma_addr = dmb_node->dma_addr;
> 127 dmb->dmb_len = dmb_node->len;
> 128
> 129 return 0;
> 130
> 131 err_node:
> 132 kfree(dmb_node);
> 133 err_bit:
> 134 clear_bit(sba_idx, ldev->sba_idx_mask);
> 135 return rc;
> 136 }
> 137
> 138 static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
> 139 {
> 140 struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
> 141 struct smc_lo_dev *ldev = smcd->priv;
> 142
> 143 /* remove dmb from hash table */
> 144 write_lock(&ldev->dmb_ht_lock);
> 145 hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
> 146 if (tmp_node->token == dmb->dmb_tok) {
> 147 dmb_node = tmp_node;
> 148 break;
> 149 }
> 150 }
> 151 if (!dmb_node) {
> 152 write_unlock(&ldev->dmb_ht_lock);
> 153 return -EINVAL;
> 154 }
> 155 hash_del(&dmb_node->list);
> 156 write_unlock(&ldev->dmb_ht_lock);
> 157
> 158 clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
> > 159 vfree(dmb_node->cpu_addr);
> 160 kfree(dmb_node);
> 161
> 162 return 0;
> 163 }
> 164
>

2023-09-25 09:58:49

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH net-next v4 12/18] net/smc: implement DMB-related operations of loopback

Hi Wen,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url: https://github.com/intel-lab-lkp/linux/commits/Wen-Gu/net-smc-decouple-ism_dev-from-SMC-D-device-dump/20230924-231933
base: net-next/main
patch link: https://lore.kernel.org/r/1695568613-125057-13-git-send-email-guwen%40linux.alibaba.com
patch subject: [PATCH net-next v4 12/18] net/smc: implement DMB-related operations of loopback
config: mips-allmodconfig (https://download.01.org/0day-ci/archive/20230925/[email protected]/config)
compiler: mips-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230925/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All error/warnings (new ones prefixed by >>):

net/smc/smc_loopback.c: In function 'smc_lo_register_dmb':
>> net/smc/smc_loopback.c:102:30: error: implicit declaration of function 'vzalloc'; did you mean 'kvzalloc'? [-Werror=implicit-function-declaration]
102 | dmb_node->cpu_addr = vzalloc(dmb->dmb_len);
| ^~~~~~~
| kvzalloc
>> net/smc/smc_loopback.c:102:28: warning: assignment to 'void *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
102 | dmb_node->cpu_addr = vzalloc(dmb->dmb_len);
| ^
net/smc/smc_loopback.c: In function 'smc_lo_unregister_dmb':
>> net/smc/smc_loopback.c:159:9: error: implicit declaration of function 'vfree'; did you mean 'kvfree'? [-Werror=implicit-function-declaration]
159 | vfree(dmb_node->cpu_addr);
| ^~~~~
| kvfree
cc1: some warnings being treated as errors


vim +102 net/smc/smc_loopback.c

79
80 static int smc_lo_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
81 void *client_priv)
82 {
83 struct smc_lo_dmb_node *dmb_node, *tmp_node;
84 struct smc_lo_dev *ldev = smcd->priv;
85 int sba_idx, rc;
86
87 /* check space for new dmb */
88 for_each_clear_bit(sba_idx, ldev->sba_idx_mask, SMC_LODEV_MAX_DMBS) {
89 if (!test_and_set_bit(sba_idx, ldev->sba_idx_mask))
90 break;
91 }
92 if (sba_idx == SMC_LODEV_MAX_DMBS)
93 return -ENOSPC;
94
95 dmb_node = kzalloc(sizeof(*dmb_node), GFP_KERNEL);
96 if (!dmb_node) {
97 rc = -ENOMEM;
98 goto err_bit;
99 }
100
101 dmb_node->sba_idx = sba_idx;
> 102 dmb_node->cpu_addr = vzalloc(dmb->dmb_len);
103 if (!dmb_node->cpu_addr) {
104 rc = -ENOMEM;
105 goto err_node;
106 }
107 dmb_node->len = dmb->dmb_len;
108 dmb_node->dma_addr = SMC_DMA_ADDR_INVALID;
109
110 again:
111 /* add new dmb into hash table */
112 get_random_bytes(&dmb_node->token, sizeof(dmb_node->token));
113 write_lock(&ldev->dmb_ht_lock);
114 hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb_node->token) {
115 if (tmp_node->token == dmb_node->token) {
116 write_unlock(&ldev->dmb_ht_lock);
117 goto again;
118 }
119 }
120 hash_add(ldev->dmb_ht, &dmb_node->list, dmb_node->token);
121 write_unlock(&ldev->dmb_ht_lock);
122
123 dmb->sba_idx = dmb_node->sba_idx;
124 dmb->dmb_tok = dmb_node->token;
125 dmb->cpu_addr = dmb_node->cpu_addr;
126 dmb->dma_addr = dmb_node->dma_addr;
127 dmb->dmb_len = dmb_node->len;
128
129 return 0;
130
131 err_node:
132 kfree(dmb_node);
133 err_bit:
134 clear_bit(sba_idx, ldev->sba_idx_mask);
135 return rc;
136 }
137
138 static int smc_lo_unregister_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb)
139 {
140 struct smc_lo_dmb_node *dmb_node = NULL, *tmp_node;
141 struct smc_lo_dev *ldev = smcd->priv;
142
143 /* remove dmb from hash table */
144 write_lock(&ldev->dmb_ht_lock);
145 hash_for_each_possible(ldev->dmb_ht, tmp_node, list, dmb->dmb_tok) {
146 if (tmp_node->token == dmb->dmb_tok) {
147 dmb_node = tmp_node;
148 break;
149 }
150 }
151 if (!dmb_node) {
152 write_unlock(&ldev->dmb_ht_lock);
153 return -EINVAL;
154 }
155 hash_del(&dmb_node->list);
156 write_unlock(&ldev->dmb_ht_lock);
157
158 clear_bit(dmb_node->sba_idx, ldev->sba_idx_mask);
> 159 vfree(dmb_node->cpu_addr);
160 kfree(dmb_node);
161
162 return 0;
163 }
164

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-09-25 13:59:16

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device



On 2023/9/25 19:50, Alexandra Winter wrote:
>
>
> On 24.09.23 17:16, Wen Gu wrote:
>> This patch introduces a kind of loopback device for SMC-D. The device
>> is created when SMC module is loaded and destroyed when the SMC module
>> is unloaded. The loopback device is a kernel device used only by the
>> SMC module and is not restricted by net namespace, so it can be used
>> for local inter-process or inter-container communication.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>> net/smc/Kconfig | 13 ++++
>> net/smc/Makefile | 2 +-
>> net/smc/af_smc.c | 12 +++-
>> net/smc/smc_loopback.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++++
>> net/smc/smc_loopback.h | 33 ++++++++++
>> 5 files changed, 223 insertions(+), 2 deletions(-)
>> create mode 100644 net/smc/smc_loopback.c
>> create mode 100644 net/smc/smc_loopback.h
>
>
> Hello Wen Gu,
>
> thank you for adding the Kconfig, so the distributions can decide when to offer this feature.
>
> I propose you add some kind of runtime switch as well. Not every user who loads the SMC module
> may want to exploit smcd-loopback. Especially in native environements without containers.
>
> If no RoCE interfaces or no ISM interfaces exist, the respective handling is skipped in SMC.
> If loopback is always created unconditionally, there is no way to opt-out.

Yes, I need to think about this. Make a runtime switch to enable/disable the loopback-ism just
like ip link up/down. An rough idea is to add an smc-tools command, like 'smcd device disable/enable loopback'.

Thank you very much.

Regards,
Wen Gu

2023-09-25 14:43:40

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device



On 2023/9/25 21:29, Alexandra Winter wrote:
>
>
> On 25.09.23 13:50, Alexandra Winter wrote:
>>
>>
>> On 24.09.23 17:16, Wen Gu wrote:
>>> This patch introduces a kind of loopback device for SMC-D. The device
>>> is created when SMC module is loaded and destroyed when the SMC module
>>> is unloaded. The loopback device is a kernel device used only by the
>>> SMC module and is not restricted by net namespace, so it can be used
>>> for local inter-process or inter-container communication.
>>>
>>> Signed-off-by: Wen Gu <[email protected]>
>>> ---
>>> net/smc/Kconfig | 13 ++++
>>> net/smc/Makefile | 2 +-
>>> net/smc/af_smc.c | 12 +++-
>>> net/smc/smc_loopback.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++++
>>> net/smc/smc_loopback.h | 33 ++++++++++
>>> 5 files changed, 223 insertions(+), 2 deletions(-)
>>> create mode 100644 net/smc/smc_loopback.c
>>> create mode 100644 net/smc/smc_loopback.h
>>
>>
>> Hello Wen Gu,
>>
>> thank you for adding the Kconfig, so the distributions can decide when to offer this feature.
>>
>> I propose you add some kind of runtime switch as well. Not every user who loads the SMC module
>> may want to exploit smcd-loopback. Especially in native environements without containers.
>>
>> If no RoCE interfaces or no ISM interfaces exist, the respective handling is skipped in SMC.
>> If loopback is always created unconditionally, there is no way to opt-out.
>>
>
> Another thing came to my mind:
>
> When loopback is created and registered when the SMC module is loaded, it will implicitly always have highest priority, right?
> That should be stated somewhere.
> Also, if you create a runtime switch this will change, so then you need to decide about priority of loopback vs ISM device (and other future smcd-devices).

Yes. I think the question may become 'How users to define the priority of existing the smcd devices'. In the past,
all the ISMv2 has nearly same performance so priority is not very important. But now there are other virtual ISM
devices, they perform differently.

My rough idea is defining a fixed priority, such as whenever loopback-ism is enabled, it is always the first in the
slots. If fixed priority is not appropriate, low-priority devices can be prioritized by disabling high-priority devices.

So it seems that the runtime switch of the loopback-ism is even more necessary.

Thanks,
Wen Gu


2023-09-25 15:24:49

by Dust Li

[permalink] [raw]
Subject: Re: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device

On Mon, Sep 25, 2023 at 01:50:22PM +0200, Alexandra Winter wrote:
>
>
>On 24.09.23 17:16, Wen Gu wrote:
>> This patch introduces a kind of loopback device for SMC-D. The device
>> is created when SMC module is loaded and destroyed when the SMC module
>> is unloaded. The loopback device is a kernel device used only by the
>> SMC module and is not restricted by net namespace, so it can be used
>> for local inter-process or inter-container communication.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>> net/smc/Kconfig | 13 ++++
>> net/smc/Makefile | 2 +-
>> net/smc/af_smc.c | 12 +++-
>> net/smc/smc_loopback.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++++
>> net/smc/smc_loopback.h | 33 ++++++++++
>> 5 files changed, 223 insertions(+), 2 deletions(-)
>> create mode 100644 net/smc/smc_loopback.c
>> create mode 100644 net/smc/smc_loopback.h
>
>
>Hello Wen Gu,
>
>thank you for adding the Kconfig, so the distributions can decide when to offer this feature.
>
>I propose you add some kind of runtime switch as well. Not every user who loads the SMC module
>may want to exploit smcd-loopback. Especially in native environements without containers.
>
>If no RoCE interfaces or no ISM interfaces exist, the respective handling is skipped in SMC.
>If loopback is always created unconditionally, there is no way to opt-out.

Hi Sandy,

After talking to Wen Gu offline, I think the real issue here might be
we don't have an abstract layer in SMC, something like net/core/dev.c

Without this, we cannot do:

1. Enable/disable those devices dynamically
Currently, If we want to disable a SMC-R device to communicate with
others, we need to refer to 'ip link set dev xxx down' to disable the
netdevice, then Infiniband subsystem will notify SMC that the state of
the IB device has changed. We cannot explicitly choose not to use some
specific IB/RoCE devices without disable totally.
If the loopback device need to support enable/disable itself, I
think it might be better to enable this feature for all SMC devices.

2. Do statistics per device
Now, we have to relay on IB/RoCE devices' hardware statistics to see
how many packets/bytes we have sent through this device.

Both the above issues get worse when the IB/RoCE device is shared by SMC
and userspace RDMA applications. If SMC-R and userspace RDMA applications
run at the same time, we can't enable the device to run userspace RDMA
applications while block it from running SMC. For statistics, we cannot
tell how many packets/bytes were sent by SMC and how many were sent by
userspace RDMA applications.

So I think those are better to support in the SMC layer.

Best regards!
Dust

2023-09-25 17:12:57

by Alexandra Winter

[permalink] [raw]
Subject: Re: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device



On 24.09.23 17:16, Wen Gu wrote:
> This patch introduces a kind of loopback device for SMC-D. The device
> is created when SMC module is loaded and destroyed when the SMC module
> is unloaded. The loopback device is a kernel device used only by the
> SMC module and is not restricted by net namespace, so it can be used
> for local inter-process or inter-container communication.
>
> Signed-off-by: Wen Gu <[email protected]>
> ---
> net/smc/Kconfig | 13 ++++
> net/smc/Makefile | 2 +-
> net/smc/af_smc.c | 12 +++-
> net/smc/smc_loopback.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 33 ++++++++++
> 5 files changed, 223 insertions(+), 2 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h


Hello Wen Gu,

thank you for adding the Kconfig, so the distributions can decide when to offer this feature.

I propose you add some kind of runtime switch as well. Not every user who loads the SMC module
may want to exploit smcd-loopback. Especially in native environements without containers.

If no RoCE interfaces or no ISM interfaces exist, the respective handling is skipped in SMC.
If loopback is always created unconditionally, there is no way to opt-out.

2023-09-25 20:01:05

by Alexandra Winter

[permalink] [raw]
Subject: Re: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device



On 25.09.23 13:50, Alexandra Winter wrote:
>
>
> On 24.09.23 17:16, Wen Gu wrote:
>> This patch introduces a kind of loopback device for SMC-D. The device
>> is created when SMC module is loaded and destroyed when the SMC module
>> is unloaded. The loopback device is a kernel device used only by the
>> SMC module and is not restricted by net namespace, so it can be used
>> for local inter-process or inter-container communication.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>> net/smc/Kconfig | 13 ++++
>> net/smc/Makefile | 2 +-
>> net/smc/af_smc.c | 12 +++-
>> net/smc/smc_loopback.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++++
>> net/smc/smc_loopback.h | 33 ++++++++++
>> 5 files changed, 223 insertions(+), 2 deletions(-)
>> create mode 100644 net/smc/smc_loopback.c
>> create mode 100644 net/smc/smc_loopback.h
>
>
> Hello Wen Gu,
>
> thank you for adding the Kconfig, so the distributions can decide when to offer this feature.
>
> I propose you add some kind of runtime switch as well. Not every user who loads the SMC module
> may want to exploit smcd-loopback. Especially in native environements without containers.
>
> If no RoCE interfaces or no ISM interfaces exist, the respective handling is skipped in SMC.
> If loopback is always created unconditionally, there is no way to opt-out.
>

Another thing came to my mind:

When loopback is created and registered when the SMC module is loaded, it will implicitly always have highest priority, right?
That should be stated somewhere.
Also, if you create a runtime switch this will change, so then you need to decide about priority of loopback vs ISM device (and other future smcd-devices).

2023-09-26 13:44:07

by Alexandra Winter

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism



On 24.09.23 17:16, Wen Gu wrote:
> This patch set includes 4 parts:
>
> - Patch #1-#3: decouple ISM device hard code from SMC-D stack.
> - Patch #4-#8: implement virtual ISM extension defined in SMCv2.1.
> - Patch #9-#13: implement loopback-ism device.
> - Patch #14-#18: memory copy optimization for the case using loopback.


Your cover letter is very well helpful, thanks a lot for that.
I really like the way you grouped the patches.
Just a thought:
If it takes too long to get this large patchset reviewed, you could
split it up into smaller sets and get them upstream one after the other.

I think it is especially valuable to more crisply define the interface between
SMC-D and the smc-d devices, given that virtio-ism may soon be a third kind of
smcd device.

2023-09-26 15:28:43

by Alexandra Winter

[permalink] [raw]
Subject: Re: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device



On 25.09.23 17:18, Dust Li wrote:
>> Hello Wen Gu,
>>
>> thank you for adding the Kconfig, so the distributions can decide when to offer this feature.
>>
>> I propose you add some kind of runtime switch as well. Not every user who loads the SMC module
>> may want to exploit smcd-loopback. Especially in native environements without containers.
>>
>> If no RoCE interfaces or no ISM interfaces exist, the respective handling is skipped in SMC.
>> If loopback is always created unconditionally, there is no way to opt-out.
> Hi Sandy,
>
> After talking to Wen Gu offline, I think the real issue here might be
> we don't have an abstract layer in SMC, something like net/core/dev.c
>
> Without this, we cannot do:
>
> 1. Enable/disable those devices dynamically
> Currently, If we want to disable a SMC-R device to communicate with
> others, we need to refer to 'ip link set dev xxx down' to disable the
> netdevice, then Infiniband subsystem will notify SMC that the state of
> the IB device has changed. We cannot explicitly choose not to use some
> specific IB/RoCE devices without disable totally.
> If the loopback device need to support enable/disable itself, I
> think it might be better to enable this feature for all SMC devices.
>
> 2. Do statistics per device
> Now, we have to relay on IB/RoCE devices' hardware statistics to see
> how many packets/bytes we have sent through this device.
>
> Both the above issues get worse when the IB/RoCE device is shared by SMC
> and userspace RDMA applications. If SMC-R and userspace RDMA applications
> run at the same time, we can't enable the device to run userspace RDMA
> applications while block it from running SMC. For statistics, we cannot
> tell how many packets/bytes were sent by SMC and how many were sent by
> userspace RDMA applications.
>
> So I think those are better to support in the SMC layer.
>
> Best regards!
> Dust

Thank you very much for your considerations. I also think a generic handling
of these requirements in the smc layer would be best. Especially, if we want
to add virtio-ism support soon. There we will face the same issues again.
Let's hear what others think about this.

2023-09-27 17:34:45

by Alexandra Winter

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism



On 24.09.23 17:16, Wen Gu wrote:
> Wen Gu (18):
> net/smc: decouple ism_dev from SMC-D device dump
> net/smc: decouple ism_dev from SMC-D DMB registration
> net/smc: extract v2 check helper from SMC-D device registration
> net/smc: support SMCv2.x supplemental features negotiation
> net/smc: reserve CHID range for SMC-D virtual device
> net/smc: extend GID to 128bits only for virtual ISM device
> net/smc: disable SEID on non-s390 architecture
> net/smc: enable virtual ISM device feature bit
> net/smc: introduce SMC-D loopback device
> net/smc: implement ID-related operations of loopback
> net/smc: implement some unsupported operations of loopback
> net/smc: implement DMB-related operations of loopback
> net/smc: register loopback device as SMC-Dv2 device
> net/smc: add operation for getting DMB attribute
> net/smc: add operations for DMB attach and detach
> net/smc: avoid data copy from sndbuf to peer RMB in SMC-D
> net/smc: modify cursor update logic when sndbuf mapped to RMB
> net/smc: add interface implementation of loopback device
>
> drivers/s390/net/ism_drv.c | 20 +-
> include/net/smc.h | 32 ++-
> include/uapi/linux/smc.h | 3 +
> include/uapi/linux/smc_diag.h | 2 +
> net/smc/Kconfig | 13 ++
> net/smc/Makefile | 2 +-
> net/smc/af_smc.c | 88 ++++++--
> net/smc/smc.h | 7 +
> net/smc/smc_cdc.c | 56 ++++-
> net/smc/smc_cdc.h | 1 +
> net/smc/smc_clc.c | 64 ++++--
> net/smc/smc_clc.h | 10 +-
> net/smc/smc_core.c | 111 +++++++++-
> net/smc/smc_core.h | 9 +-
> net/smc/smc_diag.c | 11 +-
> net/smc/smc_ism.c | 100 ++++++---
> net/smc/smc_ism.h | 24 ++-
> net/smc/smc_loopback.c | 489 ++++++++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h | 54 +++++
> net/smc/smc_pnet.c | 4 +-
> 20 files changed, 996 insertions(+), 104 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h


Hello Wen Gu,

I applied and built your patches and noticed some things that you may want to consider in the next version:

Series should be split up [2]

Several lines exceed 80 columns [1][3]

'git clang-format HEAD~18' finds several formatting issues.
Maybe not all of them need to be fixed.

codespell *.patch
0006-net-smc-extend-GID-to-128bits-only-for-virtual-ISM-d.patch:7: protocal ==> protocol

With your patches applied I get some new warnings [4]:
Seems there are some ntoh conversions missing

CHECK net/smc/af_smc.c
net/smc/af_smc.c:723:32: warning: cast to restricted __be64
net/smc/af_smc.c:1427:52: warning: cast to restricted __be64
CHECK net/smc/smc_pnet.c
CHECK net/smc/smc_ib.c
CHECK net/smc/smc_clc.c
net/smc/smc_clc.c:954:72: warning: incorrect type in argument 1 (different base types)
net/smc/smc_clc.c:954:72: expected unsigned short [usertype] chid
net/smc/smc_clc.c:954:72: got restricted __be16 [usertype] chid
net/smc/smc_clc.c:1050:29: warning: incorrect type in assignment (different base types)
net/smc/smc_clc.c:1050:29: expected unsigned long long [usertype] gid
net/smc/smc_clc.c:1050:29: got restricted __be64 [usertype]
net/smc/smc_clc.c:1051:31: warning: incorrect type in assignment (different base types)
net/smc/smc_clc.c:1051:31: expected unsigned long long [usertype] token
net/smc/smc_clc.c:1051:31: got restricted __be64 [usertype]


[1] linux/Documentation/process/coding-style.rst
[2] https://www.kernel.org/doc/html/v6.3/process/maintainer-netdev.html?highlight=network
[3] scripts/checkpatch.pl --strict --max-line-length=80 --git HEAD-18
[4] make C=2 CF=-D__CHECK_ENDIAN__ M=net/smc -Wunused-function -Wimplicit-fallthrough -Wincompatible-function-pointer-types-strict



When I installed the patches, I noticed that
> smcd info
showed an SEID, even though I had no ISM device --> good


> smcd device
FID Type PCI-ID PCHID InUse #LGs PNET-ID
0000 0 0000 Yes 2

This needs some improvements.., but I'm not sure what is the best way to display virtual smcd interfaces in the smc-tools.


I was able to do SMC transfers via the smcd-loopback feature :-D


2023-09-28 07:04:59

by Jan Karcher

[permalink] [raw]
Subject: Re: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device



On 26/09/2023 09:24, Alexandra Winter wrote:
>
>
> On 25.09.23 17:18, Dust Li wrote:
>>> Hello Wen Gu,
>>>
>>> thank you for adding the Kconfig, so the distributions can decide when to offer this feature.
>>>
>>> I propose you add some kind of runtime switch as well. Not every user who loads the SMC module
>>> may want to exploit smcd-loopback. Especially in native environements without containers.
>>>
>>> If no RoCE interfaces or no ISM interfaces exist, the respective handling is skipped in SMC.
>>> If loopback is always created unconditionally, there is no way to opt-out.
>> Hi Sandy,
>>
>> After talking to Wen Gu offline, I think the real issue here might be
>> we don't have an abstract layer in SMC, something like net/core/dev.c
>>
>> Without this, we cannot do:
>>
>> 1. Enable/disable those devices dynamically
>> Currently, If we want to disable a SMC-R device to communicate with
>> others, we need to refer to 'ip link set dev xxx down' to disable the
>> netdevice, then Infiniband subsystem will notify SMC that the state of
>> the IB device has changed. We cannot explicitly choose not to use some
>> specific IB/RoCE devices without disable totally.
>> If the loopback device need to support enable/disable itself, I
>> think it might be better to enable this feature for all SMC devices.
>>
>> 2. Do statistics per device
>> Now, we have to relay on IB/RoCE devices' hardware statistics to see
>> how many packets/bytes we have sent through this device.
>>
>> Both the above issues get worse when the IB/RoCE device is shared by SMC
>> and userspace RDMA applications. If SMC-R and userspace RDMA applications
>> run at the same time, we can't enable the device to run userspace RDMA
>> applications while block it from running SMC. For statistics, we cannot
>> tell how many packets/bytes were sent by SMC and how many were sent by
>> userspace RDMA applications.
>>
>> So I think those are better to support in the SMC layer.
>>
>> Best regards!
>> Dust
>
> Thank you very much for your considerations. I also think a generic handling
> of these requirements in the smc layer would be best. Especially, if we want
> to add virtio-ism support soon. There we will face the same issues again.
> Let's hear what others think about this.
>
>

Thanks you Sandy for bringing it up and Dust Li & Wen Gu for your thoughts.
I agree that such a runtime switch is needed and also that this generic
handling would be good in the smc layer.

2023-09-28 08:49:13

by Jan Karcher

[permalink] [raw]
Subject: Re: [PATCH net-next v4 03/18] net/smc: extract v2 check helper from SMC-D device registration



On 24/09/2023 17:16, Wen Gu wrote:
> This patch extracts v2-capable logic from the process of registering the
> ISM device as an SMC-D device, so that the registration process of other
> underlying devices can reuse it.
>
> Signed-off-by: Wen Gu <[email protected]>
> ---
> net/smc/smc_ism.c | 29 ++++++++++++++++++-----------
> net/smc/smc_ism.h | 1 +
> 2 files changed, 19 insertions(+), 11 deletions(-)
>
> diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
> index 455ae0a..8f1ba74 100644
> --- a/net/smc/smc_ism.c
> +++ b/net/smc/smc_ism.c
> @@ -69,6 +69,22 @@ bool smc_ism_is_v2_capable(void)
> return smc_ism_v2_capable;
> }
>
> +/* must be called under smcd_dev_list.mutex lock */
> +void smc_ism_check_v2_capable(struct smcd_dev *smcd)
> +{
> + u8 *system_eid = NULL;
> +
> + if (smc_ism_v2_capable)
> + return;
> +
> + system_eid = smcd->ops->get_system_eid();
> + if (smcd->ops->supports_v2()) {
> + smc_ism_v2_capable = true;
> + memcpy(smc_ism_v2_system_eid, system_eid,
> + SMC_MAX_EID_LEN);
> + }
> +}
> +
> /* Set a connection using this DMBE. */
> void smc_ism_set_conn(struct smc_connection *conn)
> {
> @@ -423,16 +439,7 @@ static void smcd_register_dev(struct ism_dev *ism)
> smc_pnetid_by_table_smcd(smcd);
>
> mutex_lock(&smcd_dev_list.mutex);
> - if (list_empty(&smcd_dev_list.list)) {
> - u8 *system_eid = NULL;
> -
> - system_eid = smcd->ops->get_system_eid();
> - if (smcd->ops->supports_v2()) {
> - smc_ism_v2_capable = true;
> - memcpy(smc_ism_v2_system_eid, system_eid,
> - SMC_MAX_EID_LEN);
> - }
> - }
> + smc_ism_check_v2_capable(smcd);

The list_empty check is omitted here which means the
smc_ism_check_v2_capable does not touch the list.
So i think the call could be placed prior the mutex_lock.

> /* sort list: devices without pnetid before devices with pnetid */
> if (smcd->pnetid[0])
> list_add_tail(&smcd->list, &smcd_dev_list.list);
> @@ -535,10 +542,10 @@ int smc_ism_init(void)
> {
> int rc = 0;
>
> -#if IS_ENABLED(CONFIG_ISM)
> smc_ism_v2_capable = false;
> memset(smc_ism_v2_system_eid, 0, SMC_MAX_EID_LEN);
>
> +#if IS_ENABLED(CONFIG_ISM)
> rc = ism_register_client(&smc_ism_client);
> #endif
> return rc;
> diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
> index 832b2f4..14d2e77 100644
> --- a/net/smc/smc_ism.h
> +++ b/net/smc/smc_ism.h
> @@ -42,6 +42,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
> void smc_ism_get_system_eid(u8 **eid);
> u16 smc_ism_get_chid(struct smcd_dev *dev);
> bool smc_ism_is_v2_capable(void);
> +void smc_ism_check_v2_capable(struct smcd_dev *dev);
> int smc_ism_init(void);
> void smc_ism_exit(void);
> int smcd_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb);

2023-09-28 11:54:38

by Jan Karcher

[permalink] [raw]
Subject: Re: [PATCH net-next v4 05/18] net/smc: reserve CHID range for SMC-D virtual device



On 24/09/2023 17:16, Wen Gu wrote:
> This patch reserve CHID range from 0xFF00 to 0xFFFF for SMC-D virtual

The current state is that 0xFF00 - 0xFFFF is the range of all virtual
SMC-D devices. This range devides into:
- 0xFF00 - 0xFFFE is for virto-ism
- 0xFFFF is for loopback


> device and introduces helpers to identify them.
>
> Signed-off-by: Wen Gu <[email protected]>
> ---
> net/smc/smc_ism.h | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
> index 14d2e77..2ecc8de 100644
> --- a/net/smc/smc_ism.h
> +++ b/net/smc/smc_ism.h
> @@ -15,6 +15,9 @@
>
> #include "smc.h"
>
> +#define SMC_VIRT_ISM_CHID_MAX 0xFFFF

SMC_VIRT_ISM_MAX is 0xFFFE. Or do you mean virtual devices as the whole
group. If yes i think that this naming will be very confusing in a few
months/years.
Maybe something like SMC_VIRTUAL_DEV_CHID_{MIN|MAX}?

> +#define SMC_VIRT_ISM_CHID_MIN 0xFF00
> +
> struct smcd_dev_list { /* List of SMCD devices */
> struct list_head list;
> struct mutex mutex; /* Protects list of devices */
> @@ -57,4 +60,16 @@ static inline int smc_ism_write(struct smcd_dev *smcd, u64 dmb_tok,
> return rc < 0 ? rc : 0;
> }
>
> +static inline bool __smc_ism_is_virtdev(u16 chid)
> +{
> + return (chid >= SMC_VIRT_ISM_CHID_MIN && chid <= SMC_VIRT_ISM_CHID_MAX);
> +}
> +
> +static inline bool smc_ism_is_virtdev(struct smcd_dev *smcd)
> +{
> + u16 chid = smcd->ops->get_chid(smcd);
> +
> + return __smc_ism_is_virtdev(chid);
> +}
> +
> #endif

2023-09-28 13:19:14

by Alexandra Winter

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism



On 27.09.23 17:16, Alexandra Winter wrote:
> Hello Wen Gu,
>
> I applied and built your patches and noticed some things that you may want to consider in the next version:


FYI, patchwork basically complains about many the same issues:
https://patchwork.kernel.org/project/netdevbpf/list/?series=787037&state=*

In general you should run those check BEFORE you send the patches and not rely on patchwork.

2023-09-28 16:51:00

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism



On 2023/9/27 23:16, Alexandra Winter wrote:
>
>
> On 24.09.23 17:16, Wen Gu wrote:
>> Wen Gu (18):
>> net/smc: decouple ism_dev from SMC-D device dump
>> net/smc: decouple ism_dev from SMC-D DMB registration
>> net/smc: extract v2 check helper from SMC-D device registration
>> net/smc: support SMCv2.x supplemental features negotiation
>> net/smc: reserve CHID range for SMC-D virtual device
>> net/smc: extend GID to 128bits only for virtual ISM device
>> net/smc: disable SEID on non-s390 architecture
>> net/smc: enable virtual ISM device feature bit
>> net/smc: introduce SMC-D loopback device
>> net/smc: implement ID-related operations of loopback
>> net/smc: implement some unsupported operations of loopback
>> net/smc: implement DMB-related operations of loopback
>> net/smc: register loopback device as SMC-Dv2 device
>> net/smc: add operation for getting DMB attribute
>> net/smc: add operations for DMB attach and detach
>> net/smc: avoid data copy from sndbuf to peer RMB in SMC-D
>> net/smc: modify cursor update logic when sndbuf mapped to RMB
>> net/smc: add interface implementation of loopback device
>>
>> drivers/s390/net/ism_drv.c | 20 +-
>> include/net/smc.h | 32 ++-
>> include/uapi/linux/smc.h | 3 +
>> include/uapi/linux/smc_diag.h | 2 +
>> net/smc/Kconfig | 13 ++
>> net/smc/Makefile | 2 +-
>> net/smc/af_smc.c | 88 ++++++--
>> net/smc/smc.h | 7 +
>> net/smc/smc_cdc.c | 56 ++++-
>> net/smc/smc_cdc.h | 1 +
>> net/smc/smc_clc.c | 64 ++++--
>> net/smc/smc_clc.h | 10 +-
>> net/smc/smc_core.c | 111 +++++++++-
>> net/smc/smc_core.h | 9 +-
>> net/smc/smc_diag.c | 11 +-
>> net/smc/smc_ism.c | 100 ++++++---
>> net/smc/smc_ism.h | 24 ++-
>> net/smc/smc_loopback.c | 489 ++++++++++++++++++++++++++++++++++++++++++
>> net/smc/smc_loopback.h | 54 +++++
>> net/smc/smc_pnet.c | 4 +-
>> 20 files changed, 996 insertions(+), 104 deletions(-)
>> create mode 100644 net/smc/smc_loopback.c
>> create mode 100644 net/smc/smc_loopback.h
>
>
> Hello Wen Gu,
>
> I applied and built your patches and noticed some things that you may want to consider in the next version:
>
> Series should be split up [2]
>
> Several lines exceed 80 columns [1][3]
>
> 'git clang-format HEAD~18' finds several formatting issues.
> Maybe not all of them need to be fixed.
>
> codespell *.patch
> 0006-net-smc-extend-GID-to-128bits-only-for-virtual-ISM-d.patch:7: protocal ==> protocol
>
> With your patches applied I get some new warnings [4]:
> Seems there are some ntoh conversions missing
>
> CHECK net/smc/af_smc.c
> net/smc/af_smc.c:723:32: warning: cast to restricted __be64
> net/smc/af_smc.c:1427:52: warning: cast to restricted __be64
> CHECK net/smc/smc_pnet.c
> CHECK net/smc/smc_ib.c
> CHECK net/smc/smc_clc.c
> net/smc/smc_clc.c:954:72: warning: incorrect type in argument 1 (different base types)
> net/smc/smc_clc.c:954:72: expected unsigned short [usertype] chid
> net/smc/smc_clc.c:954:72: got restricted __be16 [usertype] chid
> net/smc/smc_clc.c:1050:29: warning: incorrect type in assignment (different base types)
> net/smc/smc_clc.c:1050:29: expected unsigned long long [usertype] gid
> net/smc/smc_clc.c:1050:29: got restricted __be64 [usertype]
> net/smc/smc_clc.c:1051:31: warning: incorrect type in assignment (different base types)
> net/smc/smc_clc.c:1051:31: expected unsigned long long [usertype] token
> net/smc/smc_clc.c:1051:31: got restricted __be64 [usertype]
>
>
> [1] linux/Documentation/process/coding-style.rst
> [2] https://www.kernel.org/doc/html/v6.3/process/maintainer-netdev.html?highlight=network
> [3] scripts/checkpatch.pl --strict --max-line-length=80 --git HEAD-18
> [4] make C=2 CF=-D__CHECK_ENDIAN__ M=net/smc -Wunused-function -Wimplicit-fallthrough -Wincompatible-function-pointer-types-strict
>
>

Hi Sandy,

Thank you very much for your detailed comments. They are really helpful.
I will check and fix them in my next version.

>
> When I installed the patches, I noticed that
>> smcd info
> showed an SEID, even though I had no ISM device --> good
>

Yes, virtual ISM device will return the right SEID (same as that returned
by ISM device) on s390 arch.

>
>> smcd device
> FID Type PCI-ID PCHID InUse #LGs PNET-ID
> 0000 0 0000 Yes 2
>
> This needs some improvements.., but I'm not sure what is the best way to display virtual smcd interfaces in the smc-tools.
>

Right. My idea is to add the display of the device's name and GID
along with the PCI information (all zero) to indicate it's virtual
smcd device.

>
> I was able to do SMC transfers via the smcd-loopback feature :-D
>

Glad to hear this. I will improve the issues you mentioned in the next
version and next version will be split up.

Thank you very much.
Wen Gu

2023-09-28 17:38:47

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism



On 2023/9/28 16:56, Alexandra Winter wrote:
>
>
> On 27.09.23 17:16, Alexandra Winter wrote:
>> Hello Wen Gu,
>>
>> I applied and built your patches and noticed some things that you may want to consider in the next version:
>
>
> FYI, patchwork basically complains about many the same issues:
> https://patchwork.kernel.org/project/netdevbpf/list/?series=787037&state=*
>
> In general you should run those check BEFORE you send the patches and not rely on patchwork.
Thank you Sandy. I seem to have not seen the specific content of these checks. May I ask how to
run those patchwork check locally? So that I can make sure everything is ok before send them.

2023-09-28 18:48:42

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device



On 2023/9/28 11:16, Jan Karcher wrote:
>
>
> On 26/09/2023 09:24, Alexandra Winter wrote:
>>
>>
>> On 25.09.23 17:18, Dust Li wrote:
>>>> Hello Wen Gu,
>>>>
>>>> thank you for adding the Kconfig, so the distributions can decide when to offer this feature.
>>>>
>>>> I propose you add some kind of runtime switch as well. Not every user who loads the SMC module
>>>> may want to exploit smcd-loopback. Especially in native environements without containers.
>>>>
>>>> If no RoCE interfaces or no ISM interfaces exist, the respective handling is skipped in SMC.
>>>> If loopback is always created unconditionally, there is no way to opt-out.
>>> Hi Sandy,
>>>
>>> After talking to Wen Gu offline, I think the real issue here might be
>>> we don't have an abstract layer in SMC, something like net/core/dev.c
>>>
>>> Without this, we cannot do:
>>>
>>> 1. Enable/disable those devices dynamically
>>>     Currently, If we want to disable a SMC-R device to communicate with
>>>     others, we need to refer to 'ip link set dev xxx down' to disable the
>>>     netdevice, then Infiniband subsystem will notify SMC that the state of
>>>     the IB device has changed. We cannot explicitly choose not to use some
>>>     specific IB/RoCE devices without disable totally.
>>>     If the loopback device need to support enable/disable itself, I
>>>     think it might be better to enable this feature for all SMC devices.
>>>
>>> 2. Do statistics per device
>>>     Now, we have to relay on IB/RoCE devices' hardware statistics to see
>>>     how many packets/bytes we have sent through this device.
>>>
>>> Both the above issues get worse when the IB/RoCE device is shared by SMC
>>> and userspace RDMA applications. If SMC-R and userspace RDMA applications
>>> run at the same time, we can't enable the device to run userspace RDMA
>>> applications while block it from running SMC. For statistics, we cannot
>>> tell how many packets/bytes were sent by SMC and how many were sent by
>>> userspace RDMA applications.
>>>
>>> So I think those are better to support in the SMC layer.
>>>
>>> Best regards!
>>> Dust
>>
>> Thank you very much for your considerations. I also think a generic handling
>> of these requirements in the smc layer would be best. Especially, if we want
>> to add virtio-ism support soon. There we will face the same issues again.
>> Let's hear what others think about this.
>>
>>
>
> Thanks you Sandy for bringing it up and Dust Li & Wen Gu for your thoughts.
> I agree that such a runtime switch is needed and also that this generic handling would be good in the smc layer.

Right. runtime switch is necessary. I'm trying some ways to see which one is more suitable.


As for implementing a abstract layer that capable of handling 1) enable/disable SMC usage of
RDMA/ISM devices. 2) count packets/bytes of RDMA/ISM devices that generated/consumed by SMC,
I believe it would be helpful, and IMHO its architecture may be:

----------------------------------------------
SMC protocol
(af_smc.c / smc_core.c / smc_clc.c ...)
----------------------------------------------
Abstract layer of SMC device
(define SMC device common operations)
----------------------------------------------
RDMA device | (virt) ISM device
(smc_ib.c) | (smc_ism.c / smc_loopback.c)
----------------------------------------------

But I also believe this may require a lot of works and may be a long-term job.

If only for the virtual ISM device, e.g.loopback-ism, I am considering adding it to the Linux
device tree (/sys/devices/virtual/) to make it more 'device-like', and controlling its
enable/disable and get the statistics through some files, such as
echo 1 > /sys/devices/virtual/loopback-ism/alive
or
cat /sys/devices/virtual/loopback-ism/statistics/{rx|tx}_{bytes|packets}
(similar to what tcp lo have in /sys/devices/virtual/net/lo)

What are your thoughts on it? Thanks.


--
A little off-topic, it's currently China's National Day holiday, which lasts for about a week,
so we are now on vacation. As a result, my responses might be a bit slower, but I will still
make time to check/reply the mail and prepare for my new version. Thank you all very much!

Regards,
Wen Gu

2023-09-28 19:47:54

by Alexandra Winter

[permalink] [raw]
Subject: Re: [PATCH net-next v4 05/18] net/smc: reserve CHID range for SMC-D virtual device



On 28.09.23 05:08, Jan Karcher wrote:
> On 24/09/2023 17:16, Wen Gu wrote:
>> This patch reserve CHID range from 0xFF00 to 0xFFFF for SMC-D virtual
>
> The current state is that 0xFF00 - 0xFFFF is the range of all virtual SMC-D devices. This range devides into:
> - 0xFF00 - 0xFFFE is for virto-ism
> - 0xFFFF is for loopback
>
>
>> device and introduces helpers to identify them.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>>   net/smc/smc_ism.h | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
>> index 14d2e77..2ecc8de 100644
>> --- a/net/smc/smc_ism.h
>> +++ b/net/smc/smc_ism.h
>> @@ -15,6 +15,9 @@
>>     #include "smc.h"
>>   +#define SMC_VIRT_ISM_CHID_MAX        0xFFFF
>
> SMC_VIRT_ISM_MAX is 0xFFFE. Or do you mean virtual devices as the whole group. If yes i think that this naming will be very confusing in a few months/years.
> Maybe something like SMC_VIRTUAL_DEV_CHID_{MIN|MAX}?


IMO names are important. They can make future lives easier or harder.

Your first group of patches aims at 'decouple ISM device hard code from SMC-D stack'
Maybe now would be a good point in time to decide what ISM should mean in net/smc.
a) the s390 ISM devices
b) SMC-D devices in general
I would vote for a). (today a) and b) can be found in the code, as well as the term smcd_dev)

Then like Jan wrote above:
"0xFF00 - 0xFFFF is the range of all virtual SMC-D devices" and it should NOT be called SMC_VIRT_ISM_CHID_MAX.


Then in many places in net/smc 'ism' should be replaces by 'smcd_dev' or something similar.
Wen Gu, is that something you would offer to do as part of the preparation work for this series?

2023-09-30 01:02:00

by Alexandra Winter

[permalink] [raw]
Subject: Re: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device



On 28.09.23 20:35, Wen Gu wrote:
>
>
> On 2023/9/28 11:16, Jan Karcher wrote:
>>
>>
>> On 26/09/2023 09:24, Alexandra Winter wrote:
>>>
>>>
>>> On 25.09.23 17:18, Dust Li wrote:
>>>>> Hello Wen Gu,
>>>>>
>>>>> thank you for adding the Kconfig, so the distributions can decide when to offer this feature.
>>>>>
>>>>> I propose you add some kind of runtime switch as well. Not every user who loads the SMC module
>>>>> may want to exploit smcd-loopback. Especially in native environements without containers.
>>>>>
>>>>> If no RoCE interfaces or no ISM interfaces exist, the respective handling is skipped in SMC.
>>>>> If loopback is always created unconditionally, there is no way to opt-out.
>>>> Hi Sandy,
>>>>
>>>> After talking to Wen Gu offline, I think the real issue here might be
>>>> we don't have an abstract layer in SMC, something like net/core/dev.c
>>>>
>>>> Without this, we cannot do:
>>>>
>>>> 1. Enable/disable those devices dynamically
>>>>     Currently, If we want to disable a SMC-R device to communicate with
>>>>     others, we need to refer to 'ip link set dev xxx down' to disable the
>>>>     netdevice, then Infiniband subsystem will notify SMC that the state of
>>>>     the IB device has changed. We cannot explicitly choose not to use some
>>>>     specific IB/RoCE devices without disable totally.
>>>>     If the loopback device need to support enable/disable itself, I
>>>>     think it might be better to enable this feature for all SMC devices.
>>>>
>>>> 2. Do statistics per device
>>>>     Now, we have to relay on IB/RoCE devices' hardware statistics to see
>>>>     how many packets/bytes we have sent through this device.
>>>>
>>>> Both the above issues get worse when the IB/RoCE device is shared by SMC
>>>> and userspace RDMA applications. If SMC-R and userspace RDMA applications
>>>> run at the same time, we can't enable the device to run userspace RDMA
>>>> applications while block it from running SMC. For statistics, we cannot
>>>> tell how many packets/bytes were sent by SMC and how many were sent by
>>>> userspace RDMA applications.
>>>>
>>>> So I think those are better to support in the SMC layer.
>>>>
>>>> Best regards!
>>>> Dust
>>>
>>> Thank you very much for your considerations. I also think a generic handling
>>> of these requirements in the smc layer would be best. Especially, if we want
>>> to add virtio-ism support soon. There we will face the same issues again.
>>> Let's hear what others think about this.
>>>
>>>
>>
>> Thanks you Sandy for bringing it up and Dust Li & Wen Gu for your thoughts.
>> I agree that such a runtime switch is needed and also that this generic handling would be good in the smc layer.
>
> Right. runtime switch is necessary. I'm trying some ways to see which one is more suitable.
>
>
> As for implementing a abstract layer that capable of handling 1) enable/disable SMC usage of
> RDMA/ISM devices. 2) count packets/bytes of RDMA/ISM devices that generated/consumed by SMC,
> I believe it would be helpful, and IMHO its architecture may be:
>
> ----------------------------------------------
>                   SMC protocol
>     (af_smc.c / smc_core.c / smc_clc.c ...)
> ----------------------------------------------
>           Abstract layer of SMC device
>       (define SMC device common operations)
> ----------------------------------------------
>   RDMA device |        (virt) ISM device
>   (smc_ib.c)  |   (smc_ism.c / smc_loopback.c)
> ----------------------------------------------
>
> But I also believe this may require a lot of works and may be a long-term job.
>

I like that concept a lot. If we can agree on a direction, we can define
meaningful pieces and approach it piece by piece.


> If only for the virtual ISM device, e.g.loopback-ism, I am considering adding it to the Linux
> device tree (/sys/devices/virtual/) to make it more 'device-like', and controlling its
> enable/disable and get the statistics through some files, such as
> echo 1 > /sys/devices/virtual/loopback-ism/alive
> or
> cat /sys/devices/virtual/loopback-ism/statistics/{rx|tx}_{bytes|packets}
> (similar to what tcp lo have in /sys/devices/virtual/net/lo)
>
> What are your thoughts on it? Thanks.
>

Makes sense to me, but I don't have too much experience in that area.
I have never seen an attribute called 'alive' before.
I think attributes like 'power', 'enable' or 'online' are used for other device types.

>
> --
> A little off-topic, it's currently China's National Day holiday, which lasts for about a week,
> so we are now on vacation. As a result, my responses might be a bit slower, but I will still
> make time to check/reply the mail and prepare for my new version. Thank you all very much!
>
> Regards,
> Wen Gu

Next week is Germany's national holiday, so many of us are out as well.

2023-09-30 08:42:14

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 03/18] net/smc: extract v2 check helper from SMC-D device registration



On 2023/9/28 11:08, Jan Karcher wrote:
>
>
> On 24/09/2023 17:16, Wen Gu wrote:
>> This patch extracts v2-capable logic from the process of registering the
>> ISM device as an SMC-D device, so that the registration process of other
>> underlying devices can reuse it.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>>   net/smc/smc_ism.c | 29 ++++++++++++++++++-----------
>>   net/smc/smc_ism.h |  1 +
>>   2 files changed, 19 insertions(+), 11 deletions(-)
>>
>> diff --git a/net/smc/smc_ism.c b/net/smc/smc_ism.c
>> index 455ae0a..8f1ba74 100644
>> --- a/net/smc/smc_ism.c
>> +++ b/net/smc/smc_ism.c
>> @@ -69,6 +69,22 @@ bool smc_ism_is_v2_capable(void)
>>       return smc_ism_v2_capable;
>>   }
>> +/* must be called under smcd_dev_list.mutex lock */
>> +void smc_ism_check_v2_capable(struct smcd_dev *smcd)
>> +{
>> +    u8 *system_eid = NULL;
>> +
>> +    if (smc_ism_v2_capable)
>> +        return;
>> +
>> +    system_eid = smcd->ops->get_system_eid();
>> +    if (smcd->ops->supports_v2()) {
>> +        smc_ism_v2_capable = true;
>> +        memcpy(smc_ism_v2_system_eid, system_eid,
>> +               SMC_MAX_EID_LEN);
>> +    }
>> +}
>> +
>>   /* Set a connection using this DMBE. */
>>   void smc_ism_set_conn(struct smc_connection *conn)
>>   {
>> @@ -423,16 +439,7 @@ static void smcd_register_dev(struct ism_dev *ism)
>>           smc_pnetid_by_table_smcd(smcd);
>>       mutex_lock(&smcd_dev_list.mutex);
>> -    if (list_empty(&smcd_dev_list.list)) {
>> -        u8 *system_eid = NULL;
>> -
>> -        system_eid = smcd->ops->get_system_eid();
>> -        if (smcd->ops->supports_v2()) {
>> -            smc_ism_v2_capable = true;
>> -            memcpy(smc_ism_v2_system_eid, system_eid,
>> -                   SMC_MAX_EID_LEN);
>> -        }
>> -    }
>> +    smc_ism_check_v2_capable(smcd);
>
> The list_empty check is omitted here which means the smc_ism_check_v2_capable does not touch the list.
> So i think the call could be placed prior the mutex_lock.
>

Good catch. I omitted the list_empty check in this version but forget to remove 'the
lock comments' and place the helper prior to the mutex_lock. It will be fixed.

Thank you.

>>       /* sort list: devices without pnetid before devices with pnetid */
>>       if (smcd->pnetid[0])
>>           list_add_tail(&smcd->list, &smcd_dev_list.list);
>> @@ -535,10 +542,10 @@ int smc_ism_init(void)
>>   {
>>       int rc = 0;
>> -#if IS_ENABLED(CONFIG_ISM)
>>       smc_ism_v2_capable = false;
>>       memset(smc_ism_v2_system_eid, 0, SMC_MAX_EID_LEN);
>> +#if IS_ENABLED(CONFIG_ISM)
>>       rc = ism_register_client(&smc_ism_client);
>>   #endif
>>       return rc;
>> diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
>> index 832b2f4..14d2e77 100644
>> --- a/net/smc/smc_ism.h
>> +++ b/net/smc/smc_ism.h
>> @@ -42,6 +42,7 @@ int smc_ism_register_dmb(struct smc_link_group *lgr, int buf_size,
>>   void smc_ism_get_system_eid(u8 **eid);
>>   u16 smc_ism_get_chid(struct smcd_dev *dev);
>>   bool smc_ism_is_v2_capable(void);
>> +void smc_ism_check_v2_capable(struct smcd_dev *dev);
>>   int smc_ism_init(void);
>>   void smc_ism_exit(void);
>>   int smcd_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb);

2023-10-04 08:27:37

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 05/18] net/smc: reserve CHID range for SMC-D virtual device



On 2023/9/28 17:10, Alexandra Winter wrote:
>
>
> On 28.09.23 05:08, Jan Karcher wrote:
>> On 24/09/2023 17:16, Wen Gu wrote:
>>> This patch reserve CHID range from 0xFF00 to 0xFFFF for SMC-D virtual
>>
>> The current state is that 0xFF00 - 0xFFFF is the range of all virtual SMC-D devices. This range devides into:
>> - 0xFF00 - 0xFFFE is for virto-ism
>> - 0xFFFF is for loopback
>>
>>
>>> device and introduces helpers to identify them.
>>>
>>> Signed-off-by: Wen Gu <[email protected]>
>>> ---
>>>   net/smc/smc_ism.h | 15 +++++++++++++++
>>>   1 file changed, 15 insertions(+)
>>>
>>> diff --git a/net/smc/smc_ism.h b/net/smc/smc_ism.h
>>> index 14d2e77..2ecc8de 100644
>>> --- a/net/smc/smc_ism.h
>>> +++ b/net/smc/smc_ism.h
>>> @@ -15,6 +15,9 @@
>>>     #include "smc.h"
>>>   +#define SMC_VIRT_ISM_CHID_MAX        0xFFFF
>>
>> SMC_VIRT_ISM_MAX is 0xFFFE. Or do you mean virtual devices as the whole group. If yes i think that this naming will be very confusing in a few months/years.
>> Maybe something like SMC_VIRTUAL_DEV_CHID_{MIN|MAX}?
>
>
> IMO names are important. They can make future lives easier or harder.
>


Hi Sandy and Jan,

I agree with your opinion that names are important.

I view these terms in this way:

SMC-D devices (smcd_dev)
|
|- s390 ISM devices (ISM, ism_dev)
|
|- virtual ISM devices (virtual ISM, smc_lo_dev)
| |
| |- loopback-ism
| |
| |- virtio-ism
|
|- maybe future devices

SMC_VIRT_ISM_CHID_MAX was introduced to represent the maximum CHID of virtual ISM devices. CHIDs used
by virtual ISM devices should be in range of [SMC_VIRT_ISM_CHID_MIN, SMC_VIRT_ISM_CHID_MAX].

I think the problem here is that SMC_VIRT_ISM_CHID_MAX might be misunderstood as CHID of virtio-ism?
Then I will change them to SMC_VIRTUAL_ISM_CHID_{MAX|MIN}.

> Your first group of patches aims at 'decouple ISM device hard code from SMC-D stack'
> Maybe now would be a good point in time to decide what ISM should mean in net/smc.
> a) the s390 ISM devices
> b) SMC-D devices in general
> I would vote for a). (today a) and b) can be found in the code, as well as the term smcd_dev)
>
> Then like Jan wrote above:
> "0xFF00 - 0xFFFF is the range of all virtual SMC-D devices" and it should NOT be called SMC_VIRT_ISM_CHID_MAX.
>

Yes, I also vote for a).

But IMHO, loopback-ism and virtio-ism should be better classified as 'virtual ISM devices', like
what describes in the specification, rather than 'virtual SMC-D devices', since they are intended
to emulate ISM devices for using SMC-D on non-s390 systems.

>
> Then in many places in net/smc 'ism' should be replaces by 'smcd_dev' or something similar.
> Wen Gu, is that something you would offer to do as part of the preparation work for this series?

Yes. But I'm not sure which 'ism' words you suggested to be replaced with 'smcd_dev'/'smcd'?

IMHO, in some generic codes like SMC-D operations (smcd_ops) or SMC-D device dump, they should
be generic to all kinds of SMC-D devices, so struct ism_dev or struct ism_client should not be used,
that is what patch#1 & #2 want to do.

But in some operations related to underlay device, like smcd_ism_register_dmb(), smc_ism_cantalk(),
and etc in smc_ism.c. They works for both s390 ISM devices and virtual ISM devices. I think they can
keep 'ism' in the helpers' name as they are now.

What do you think?

Thanks and regards,
Wen Gu




2023-10-04 08:42:51

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism



On 2023/9/29 21:31, Alexandra Winter wrote:
>
>
> On 28.09.23 19:29, Wen Gu wrote:
>>
>>
>> On 2023/9/28 16:56, Alexandra Winter wrote:
>>>
>>>
>>> On 27.09.23 17:16, Alexandra Winter wrote:
>>>> Hello Wen Gu,
>>>>
>>>> I applied and built your patches and noticed some things that you may want to consider in the next version:
>>>
>>>
>>> FYI, patchwork basically complains about many the same issues:
>>> https://patchwork.kernel.org/project/netdevbpf/list/?series=787037&state=*
>>>
>>> In general you should run those check BEFORE you send the patches and not rely on patchwork.
>> Thank you Sandy. I seem to have not seen the specific content of these checks. May I ask how to
>> run those patchwork check locally? So that I can make sure everything is ok before send them.
>>
>
> Citing from Documentation/process/maintainer-netdev.rst :
>
> "patchwork checks
> ~~~~~~~~~~~~~~~~
>
> Checks in patchwork are mostly simple wrappers around existing kernel
> scripts, the sources are available at:
>
> https://github.com/kuba-moo/nipa/tree/master/tests
>
> **Do not** post your patches just to run them through the checks.
> You must ensure that your patches are ready by testing them locally
> before posting to the mailing list. The patchwork build bot instance
> gets overloaded very easily and netdev@vger really doesn't need more
> traffic if we can help it."
>
> HTH

Thank you! Sandy.

2023-10-04 09:07:33

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 09/18] net/smc: introduce SMC-D loopback device


On 2023/9/29 22:08, Alexandra Winter wrote:
>
>
> On 28.09.23 20:35, Wen Gu wrote:
>>
>>
>> On 2023/9/28 11:16, Jan Karcher wrote:
>>>
>>>
>>> On 26/09/2023 09:24, Alexandra Winter wrote:
>>>>
>>>>
>>>> On 25.09.23 17:18, Dust Li wrote:
>>>>>> Hello Wen Gu,
>>>>>>
>>>>>> thank you for adding the Kconfig, so the distributions can decide when to offer this feature.
>>>>>>
>>>>>> I propose you add some kind of runtime switch as well. Not every user who loads the SMC module
>>>>>> may want to exploit smcd-loopback. Especially in native environements without containers.
>>>>>>
>>>>>> If no RoCE interfaces or no ISM interfaces exist, the respective handling is skipped in SMC.
>>>>>> If loopback is always created unconditionally, there is no way to opt-out.
>>>>> Hi Sandy,
>>>>>
>>>>> After talking to Wen Gu offline, I think the real issue here might be
>>>>> we don't have an abstract layer in SMC, something like net/core/dev.c
>>>>>
>>>>> Without this, we cannot do:
>>>>>
>>>>> 1. Enable/disable those devices dynamically
>>>>>     Currently, If we want to disable a SMC-R device to communicate with
>>>>>     others, we need to refer to 'ip link set dev xxx down' to disable the
>>>>>     netdevice, then Infiniband subsystem will notify SMC that the state of
>>>>>     the IB device has changed. We cannot explicitly choose not to use some
>>>>>     specific IB/RoCE devices without disable totally.
>>>>>     If the loopback device need to support enable/disable itself, I
>>>>>     think it might be better to enable this feature for all SMC devices.
>>>>>
>>>>> 2. Do statistics per device
>>>>>     Now, we have to relay on IB/RoCE devices' hardware statistics to see
>>>>>     how many packets/bytes we have sent through this device.
>>>>>
>>>>> Both the above issues get worse when the IB/RoCE device is shared by SMC
>>>>> and userspace RDMA applications. If SMC-R and userspace RDMA applications
>>>>> run at the same time, we can't enable the device to run userspace RDMA
>>>>> applications while block it from running SMC. For statistics, we cannot
>>>>> tell how many packets/bytes were sent by SMC and how many were sent by
>>>>> userspace RDMA applications.
>>>>>
>>>>> So I think those are better to support in the SMC layer.
>>>>>
>>>>> Best regards!
>>>>> Dust
>>>>
>>>> Thank you very much for your considerations. I also think a generic handling
>>>> of these requirements in the smc layer would be best. Especially, if we want
>>>> to add virtio-ism support soon. There we will face the same issues again.
>>>> Let's hear what others think about this.
>>>>
>>>>
>>>
>>> Thanks you Sandy for bringing it up and Dust Li & Wen Gu for your thoughts.
>>> I agree that such a runtime switch is needed and also that this generic handling would be good in the smc layer.
>>
>> Right. runtime switch is necessary. I'm trying some ways to see which one is more suitable.
>>
>>
>> As for implementing a abstract layer that capable of handling 1) enable/disable SMC usage of
>> RDMA/ISM devices. 2) count packets/bytes of RDMA/ISM devices that generated/consumed by SMC,
>> I believe it would be helpful, and IMHO its architecture may be:
>>
>> ----------------------------------------------
>>                   SMC protocol
>>     (af_smc.c / smc_core.c / smc_clc.c ...)
>> ----------------------------------------------
>>           Abstract layer of SMC device
>>       (define SMC device common operations)
>> ----------------------------------------------
>>   RDMA device |        (virt) ISM device
>>   (smc_ib.c)  |   (smc_ism.c / smc_loopback.c)
>> ----------------------------------------------
>>
>> But I also believe this may require a lot of works and may be a long-term job.
>>
>
> I like that concept a lot. If we can agree on a direction, we can define
> meaningful pieces and approach it piece by piece.
>

Yes. It can be added to our interlock's backup list.

>
>> If only for the virtual ISM device, e.g.loopback-ism, I am considering adding it to the Linux
>> device tree (/sys/devices/virtual/) to make it more 'device-like', and controlling its
>> enable/disable and get the statistics through some files, such as
>> echo 1 > /sys/devices/virtual/loopback-ism/alive
>> or
>> cat /sys/devices/virtual/loopback-ism/statistics/{rx|tx}_{bytes|packets}
>> (similar to what tcp lo have in /sys/devices/virtual/net/lo)
>>
>> What are your thoughts on it? Thanks.
>>
>
> Makes sense to me, but I don't have too much experience in that area.
> I have never seen an attribute called 'alive' before.
> I think attributes like 'power', 'enable' or 'online' are used for other device types.
>

Thanks. I will refer to existing devices for reference.

>>
>> --
>> A little off-topic, it's currently China's National Day holiday, which lasts for about a week,
>> so we are now on vacation. As a result, my responses might be a bit slower, but I will still
>> make time to check/reply the mail and prepare for my new version. Thank you all very much!
>>
>> Regards,
>> Wen Gu
>
> Next week is Germany's national holiday, so many of us are out as well.

Have a nice holiday! :)

2023-10-05 15:48:29

by Niklas Schnelle

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism

On Sun, 2023-09-24 at 23:16 +0800, Wen Gu wrote:
> Hi, all
>
> # Background
>
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-simulated virtual ISM device,
> such as loopback-ism device here, to accelerate inter-process or inter-containers
> communication within the same OS.
>
> # Design
>
> This patch set includes 4 parts:
>
> - Patch #1-#3: decouple ISM device hard code from SMC-D stack.
> - Patch #4-#8: implement virtual ISM extension defined in SMCv2.1.
> - Patch #9-#13: implement loopback-ism device.
> - Patch #14-#18: memory copy optimization for the case using loopback.
>
> The loopback-ism device is designed as a kernel device and not be limited to
> a specific net namespace, ends of both inter-process connection (1/1' in diagram
> below) or inter-container connection (2/2' in diagram below) will find that peer
> shares the same loopback-ism device during the CLC handshake. Then loopback-ism
> device will be chosen.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ +-------+ +-------+ | | +-------+ |
> | | App A | | App B | | App C | | | | App D |<-+ |
> | +-------+ +---^---+ +-------+ | | +-------+ |(2') |
> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
> | | | |
> Kernel | | | |
> +----+-------v---+-----------v----------------------------------+---+----+
> | | TCP | |
> | | | |
> | +--------------------------------------------------------------+ |
> | |
> | +--------------+ |
> | | smc loopback | |
> +---------------------------+--------------+-----------------------------+
>
>
> loopback-ism device allocs RMBs and sndbufs for each connection peer and 'moves'
> data from sndbuf at one end to RMB at the other end. Since communication occurs
> within the same kernel, the sndbuf can be mapped to peer RMB so that the data
> copy in loopback-ism case can be avoided.
>
> Container 1 (ns1) Container 2 (ns2)
> +-----------------------------------------+ +-------------------------+
> | +-------+ +-------+ +-------+ | | +-------+ |
> | | App A | | App B | | App C | | | | App D | |
> | +-------+ +--^----+ +-------+ | | +---^---+ |
> | | | | | | | |
> | (1) | (1') | (2) | | | (2') | |
> | | | | | | | |
> +-------|-----------|---------------|-----+ +------------|------------+
> | | | |
> Kernel | | | |
> +-------|-----------|---------------|-----------------------|------------+
> | +-----v-+ +-------+ +---v---+ +-------+ |
> | | snd A |-+ | RMB B |<--+ | snd C |-+ +->| RMB D | |
> | +-------+ | +-------+ | +-------+ | | +-------+ |
> | +-------+ | +-------+ | +-------+ | | +-------+ |
> | | RMB A | | | snd B | | | RMB C | | | | snd D | |
> | +-------+ | +-------+ | +-------+ | | +-------+ |
> | | +-------------v+ | |
> | +-------------->| smc loopback |---------+ |
> +---------------------------+--------------+-----------------------------+
>
> # Benchmark Test
>
> * Test environments:
> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
> - SMC sndbuf/RMB size 1MB.
>
> * Test object:
> - TCP: run on TCP loopback.
> - domain: run on UNIX domain.
> - SMC lo: run on SMC loopback device.
>
> 1. ipc-benchmark (see [1])
>
> - ./<foo> -c 1000000 -s 100
>
> TCP SMC-lo
> Message
> rate (msg/s) 81539 151251(+85.50%)
>
> 2. sockperf
>
> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>
> TCP SMC-lo
> Bandwidth(MBps) 5313.66 8270.51(+55.65%)
> Latency(us) 5.806 3.207(-44.76%)
>
> 3. nginx/wrk
>
> - serv: <smc_run> nginx
> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>
> TCP SMC-lo
> Requests/s 194641.79 258656.13(+32.89%)
>
> 4. redis-benchmark
>
> - serv: <smc_run> redis-server
> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>
> TCP SMC-lo
> GET(Requests/s) 85855.34 115640.35(+34.69%)
> SET(Requests/s) 86337.15 118203.30(+36.90%)
>
> [1] https://github.com/goldsborough/ipc-bench
>

Hi Wen Gu,

I've been trying out your series with iperf3, qperf, and uperf on
s390x. I'm using network namespaces with a ConnectX VF from the same
card in each namespace for the initial TCP/IP connection i.e. initially
it goes out to a real NIC even if that can switch internally. All of
these look great for streaming workloads both in terms of performance
and stability. With a Connect-Request-Response workload and uperf
however I've run into issues. The test configuration I use is as
follows:

Client Command:

# host=$ip_server ip netns exec client smc_run uperf -m tcp_crr.xml

Server Command:

# ip netns exec server smc_run uperf -s &> /dev/null

Uperf tcp_crr.xml:

<?xml version="1.0"?>
<profile name="TCP_CRR">
<group nthreads="12">
<transaction duration="120">
<flowop type="connect" options="remotehost=$host protocol=tcp" />
<flowop type="write" options="size=200"/>
<flowop type="read" options="size=1000"/>
<flowop type="disconnect" />
</transaction>
</group>
</profile>

The workload first runs fine but then after about 4 GB of data
transferred fails with "Connection refused" and "Connection reset by
peer" errors. The failure is not permanent however and re-running
the streaming workloads run fine again (with both uperf server and
client restarted). So I suspect something gets stuck in either the
client or server sockets. The same workload runs fine with TCP/IP of
course.

Thanks,
Niklas


2023-10-05 17:09:29

by Alexandra Winter

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism



On 28.09.23 19:29, Wen Gu wrote:
>
>
> On 2023/9/28 16:56, Alexandra Winter wrote:
>>
>>
>> On 27.09.23 17:16, Alexandra Winter wrote:
>>> Hello Wen Gu,
>>>
>>> I applied and built your patches and noticed some things that you may want to consider in the next version:
>>
>>
>> FYI, patchwork basically complains about many the same issues:
>> https://patchwork.kernel.org/project/netdevbpf/list/?series=787037&state=*
>>
>> In general you should run those check BEFORE you send the patches and not rely on patchwork.
> Thank you Sandy. I seem to have not seen the specific content of these checks. May I ask how to
> run those patchwork check locally? So that I can make sure everything is ok before send them.
>

Citing from Documentation/process/maintainer-netdev.rst :

"patchwork checks
~~~~~~~~~~~~~~~~

Checks in patchwork are mostly simple wrappers around existing kernel
scripts, the sources are available at:

https://github.com/kuba-moo/nipa/tree/master/tests

**Do not** post your patches just to run them through the checks.
You must ensure that your patches are ready by testing them locally
before posting to the mailing list. The patchwork build bot instance
gets overloaded very easily and netdev@vger really doesn't need more
traffic if we can help it."

HTH

2023-10-08 07:20:05

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism



On 2023/10/5 16:21, Niklas Schnelle wrote:

>
> Hi Wen Gu,
>
> I've been trying out your series with iperf3, qperf, and uperf on
> s390x. I'm using network namespaces with a ConnectX VF from the same
> card in each namespace for the initial TCP/IP connection i.e. initially
> it goes out to a real NIC even if that can switch internally. All of
> these look great for streaming workloads both in terms of performance
> and stability. With a Connect-Request-Response workload and uperf
> however I've run into issues. The test configuration I use is as
> follows:
>
> Client Command:
>
> # host=$ip_server ip netns exec client smc_run uperf -m tcp_crr.xml
>
> Server Command:
>
> # ip netns exec server smc_run uperf -s &> /dev/null
>
> Uperf tcp_crr.xml:
>
> <?xml version="1.0"?>
> <profile name="TCP_CRR">
> <group nthreads="12">
> <transaction duration="120">
> <flowop type="connect" options="remotehost=$host protocol=tcp" />
> <flowop type="write" options="size=200"/>
> <flowop type="read" options="size=1000"/>
> <flowop type="disconnect" />
> </transaction>
> </group>
> </profile>
>
> The workload first runs fine but then after about 4 GB of data
> transferred fails with "Connection refused" and "Connection reset by
> peer" errors. The failure is not permanent however and re-running
> the streaming workloads run fine again (with both uperf server and
> client restarted). So I suspect something gets stuck in either the
> client or server sockets. The same workload runs fine with TCP/IP of
> course.
>
> Thanks,
> Niklas
>
>

Hi Niklas,

Thank you very much for the test. With the test example you provided, I've
reproduced the issue in my VM. And moreover, sometimes the test complains
with 'Error saying goodbye with <ip>'

I'll figure out what's going on here.

Thanks!
Wen Gu

2023-10-12 07:55:08

by Dust Li

[permalink] [raw]
Subject: Re: [PATCH net-next v4 06/18] net/smc: extend GID to 128bits only for virtual ISM device

On Sun, Sep 24, 2023 at 11:16:41PM +0800, Wen Gu wrote:
>Virtual ISM devices are introduced to SMC-Dv2.1 protocal, whose GIDs
>are 128-bits UUIDs as defined by RFC4122. And note that the GIDs of
>ISM devices still remain 64-bits.
>
>This patch adapts the relevant codes, such as CLC handshake, to make
>it compatible with 128 bits GID.
>
>Signed-off-by: Wen Gu <[email protected]>
>---
> drivers/s390/net/ism_drv.c | 18 +++++++------
> include/net/smc.h | 15 +++++++----
> include/uapi/linux/smc.h | 3 +++
> include/uapi/linux/smc_diag.h | 2 ++
> net/smc/af_smc.c | 60 +++++++++++++++++++++++++++++++++----------
> net/smc/smc_clc.c | 43 +++++++++++++++++++++----------
> net/smc/smc_clc.h | 4 +--
> net/smc/smc_core.c | 41 +++++++++++++++++++++--------
> net/smc/smc_core.h | 7 ++---
> net/smc/smc_diag.c | 11 ++++++--
> net/smc/smc_ism.c | 18 ++++++++-----
> net/smc/smc_ism.h | 3 ++-
> net/smc/smc_pnet.c | 4 +--
> 13 files changed, 163 insertions(+), 66 deletions(-)
>
>diff --git a/drivers/s390/net/ism_drv.c b/drivers/s390/net/ism_drv.c
>index a34e913..8e42eb2 100644
>--- a/drivers/s390/net/ism_drv.c
>+++ b/drivers/s390/net/ism_drv.c
>@@ -774,10 +774,10 @@ static void __exit ism_exit(void)
> /*************************** SMC-D Implementation *****************************/
>
> #if IS_ENABLED(CONFIG_SMC)
>-static int smcd_query_rgid(struct smcd_dev *smcd, u64 rgid, u32 vid_valid,
>- u32 vid)
>+static int smcd_query_rgid(struct smcd_dev *smcd, struct smcd_gid *rgid,
>+ u32 vid_valid, u32 vid)
> {
>- return ism_query_rgid(smcd->priv, rgid, vid_valid, vid);
>+ return ism_query_rgid(smcd->priv, rgid->gid, vid_valid, vid);
> }
>
> static int smcd_register_dmb(struct smcd_dev *smcd, struct smcd_dmb *dmb,
>@@ -811,10 +811,10 @@ static int smcd_reset_vlan_required(struct smcd_dev *smcd)
> return ism_cmd_simple(smcd->priv, ISM_RESET_VLAN);
> }
>
>-static int smcd_signal_ieq(struct smcd_dev *smcd, u64 rgid, u32 trigger_irq,
>- u32 event_code, u64 info)
>+static int smcd_signal_ieq(struct smcd_dev *smcd, struct smcd_gid *rgid,
>+ u32 trigger_irq, u32 event_code, u64 info)
> {
>- return ism_signal_ieq(smcd->priv, rgid, trigger_irq, event_code, info);
>+ return ism_signal_ieq(smcd->priv, rgid->gid, trigger_irq, event_code, info);
> }
>
> static int smcd_move(struct smcd_dev *smcd, u64 dmb_tok, unsigned int idx,
>@@ -830,9 +830,11 @@ static int smcd_supports_v2(void)
> SYSTEM_EID.type[0] != '0';
> }
>
>-static u64 smcd_get_local_gid(struct smcd_dev *smcd)
>+static void smcd_get_local_gid(struct smcd_dev *smcd,
>+ struct smcd_gid *smcd_gid)
> {
>- return ism_get_local_gid(smcd->priv);
>+ smcd_gid->gid = ism_get_local_gid(smcd->priv);
>+ smcd_gid->gid_ext = 0;
> }
>
> static u16 smcd_get_chid(struct smcd_dev *smcd)
>diff --git a/include/net/smc.h b/include/net/smc.h
>index f75116e..d8db5e1 100644
>--- a/include/net/smc.h
>+++ b/include/net/smc.h
>@@ -51,9 +51,14 @@ struct smcd_dmb {
>
> struct smcd_dev;
>
>+struct smcd_gid {
>+ u64 gid;
>+ u64 gid_ext;
>+};
>+
> struct smcd_ops {
>- int (*query_remote_gid)(struct smcd_dev *dev, u64 rgid, u32 vid_valid,
>- u32 vid);
>+ int (*query_remote_gid)(struct smcd_dev *dev, struct smcd_gid *rgid,
>+ u32 vid_valid, u32 vid);
> int (*register_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb,
> void *client);
> int (*unregister_dmb)(struct smcd_dev *dev, struct smcd_dmb *dmb);
>@@ -61,14 +66,14 @@ struct smcd_ops {
> int (*del_vlan_id)(struct smcd_dev *dev, u64 vlan_id);
> int (*set_vlan_required)(struct smcd_dev *dev);
> int (*reset_vlan_required)(struct smcd_dev *dev);
>- int (*signal_event)(struct smcd_dev *dev, u64 rgid, u32 trigger_irq,
>- u32 event_code, u64 info);
>+ int (*signal_event)(struct smcd_dev *dev, struct smcd_gid *rgid,
>+ u32 trigger_irq, u32 event_code, u64 info);
> int (*move_data)(struct smcd_dev *dev, u64 dmb_tok, unsigned int idx,
> bool sf, unsigned int offset, void *data,
> unsigned int size);
> int (*supports_v2)(void);
> u8* (*get_system_eid)(void);
>- u64 (*get_local_gid)(struct smcd_dev *dev);
>+ void (*get_local_gid)(struct smcd_dev *dev, struct smcd_gid *gid);
> u16 (*get_chid)(struct smcd_dev *dev);
> struct device* (*get_dev)(struct smcd_dev *dev);
> };
>diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h
>index 837fcd4..0d2f020 100644
>--- a/include/uapi/linux/smc.h
>+++ b/include/uapi/linux/smc.h
>@@ -99,6 +99,9 @@ enum {
> SMC_NLA_LGR_V2_OS, /* u8 */
> SMC_NLA_LGR_V2_NEG_EID, /* string */
> SMC_NLA_LGR_V2_PEER_HOST, /* string */
>+ SMC_NLA_LGR_V2_PAD, /* flag */
>+ SMC_NLA_LGR_V2_GID_EXT, /* u64 */
>+ SMC_NLA_LGR_V2_PEER_GID_EXT, /* u64 */

Why those abbributes are add here, which was supposed to add common
abbritubtes ?


> __SMC_NLA_LGR_V2_MAX,
> SMC_NLA_LGR_V2_MAX = __SMC_NLA_LGR_V2_MAX - 1
> };
>diff --git a/include/uapi/linux/smc_diag.h b/include/uapi/linux/smc_diag.h
>index 8cb3a6f..78b7dfb 100644
>--- a/include/uapi/linux/smc_diag.h
>+++ b/include/uapi/linux/smc_diag.h
>@@ -107,6 +107,8 @@ struct smcd_diag_dmbinfo { /* SMC-D Socket internals */
> __aligned_u64 my_gid; /* My GID */
> __aligned_u64 token; /* Token of DMB */
> __aligned_u64 peer_token; /* Token of remote DMBE */
>+ __aligned_u64 peer_gid_ext; /* Peer GID extension */
>+ __aligned_u64 my_gid_ext; /* My GID extension */
> };
>
> #endif /* _UAPI_SMC_DIAG_H_ */
>diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
>index bacdd97..5bb41404 100644
>--- a/net/smc/af_smc.c
>+++ b/net/smc/af_smc.c
>@@ -719,7 +719,7 @@ static void smcd_conn_save_peer_info(struct smc_sock *smc,
> int bufsize = smc_uncompress_bufsize(clc->d0.dmbe_size);
>
> smc->conn.peer_rmbe_idx = clc->d0.dmbe_idx;
>- smc->conn.peer_token = clc->d0.token;
>+ smc->conn.peer_token = ntohll(clc->d0.token);
> /* msg header takes up space in the buffer */
> smc->conn.peer_rmbe_size = bufsize - sizeof(struct smcd_cdc_msg);
> atomic_set(&smc->conn.peer_rmbe_space, smc->conn.peer_rmbe_size);
>@@ -1044,7 +1044,8 @@ static int smc_find_ism_v2_device_clnt(struct smc_sock *smc,
> {
> int rc = SMC_CLC_DECL_NOSMCDDEV;
> struct smcd_dev *smcd;
>- int i = 1;
>+ int i = 1, slot = 1;
>+ bool is_virtdev;
> u16 chid;
>
> if (smcd_indicated(ini->smc_type_v1))
>@@ -1056,14 +1057,21 @@ static int smc_find_ism_v2_device_clnt(struct smc_sock *smc,
> chid = smc_ism_get_chid(smcd);
> if (!smc_find_ism_v2_is_unique_chid(chid, ini, i))
> continue;
>+ is_virtdev = __smc_ism_is_virtdev(chid);
> if (!smc_pnet_is_pnetid_set(smcd->pnetid) ||
> smc_pnet_is_ndev_pnetid(sock_net(&smc->sk), smcd->pnetid)) {
>+ if (is_virtdev && slot == SMC_MAX_ISM_DEVS)
>+ /* no enough space for virtual ISM devices, whose GID
>+ * takes 2 slots. Try the next potential ISM device.
>+ */
>+ continue;
> ini->ism_dev[i] = smcd;
> ini->ism_chid[i] = chid;
> ini->is_smcd = true;
> rc = 0;
> i++;
>- if (i > SMC_MAX_ISM_DEVS)
>+ slot = is_virtdev ? slot + 2 : slot + 1;
>+ if (slot > SMC_MAX_ISM_DEVS)
> break;
> }
> }
>@@ -1409,8 +1417,13 @@ static int smc_connect_ism(struct smc_sock *smc,
> rc = smc_v2_determine_accepted_chid(aclc_v2, ini);
> if (rc)
> return rc;
>+
>+ if (__smc_ism_is_virtdev(ini->ism_chid[ini->ism_selected]))
>+ ini->ism_peer_gid[ini->ism_selected].gid_ext =
>+ ntohll(aclc_v2->d1.gid_ext);
>+ /* for non-virtual ISM devices, peer gid_ext remains 0. */
> }
>- ini->ism_peer_gid[ini->ism_selected] = aclc->d0.gid;
>+ ini->ism_peer_gid[ini->ism_selected].gid = ntohll(aclc->d0.gid);
>
> /* there is only one lgr role for SMC-D; use server lock */
> mutex_lock(&smc_server_lgr_pending);
>@@ -2101,7 +2114,8 @@ static bool smc_is_already_selected(struct smcd_dev *smcd,
>
> /* check for ISM devices matching proposed ISM devices */
> static void smc_check_ism_v2_match(struct smc_init_info *ini,
>- u16 proposed_chid, u64 proposed_gid,
>+ u16 proposed_chid,
>+ struct smcd_gid *proposed_gid,
> unsigned int *matches)
> {
> struct smcd_dev *smcd;
>@@ -2113,7 +2127,11 @@ static void smc_check_ism_v2_match(struct smc_init_info *ini,
> continue;
> if (smc_ism_get_chid(smcd) == proposed_chid &&
> !smc_ism_cantalk(proposed_gid, ISM_RESERVED_VLANID, smcd)) {
>- ini->ism_peer_gid[*matches] = proposed_gid;
>+ ini->ism_peer_gid[*matches].gid = proposed_gid->gid;
>+ if (__smc_ism_is_virtdev(proposed_chid))
>+ ini->ism_peer_gid[*matches].gid_ext =
>+ proposed_gid->gid_ext;
>+ /* non-virtual ISM's peer gid_ext remains 0. */
> ini->ism_dev[*matches] = smcd;
> (*matches)++;
> break;
>@@ -2135,9 +2153,11 @@ static void smc_find_ism_v2_device_serv(struct smc_sock *new_smc,
> struct smc_clc_v2_extension *smc_v2_ext;
> struct smc_clc_msg_smcd *pclc_smcd;
> unsigned int matches = 0;
>+ struct smcd_gid smcd_gid;
> u8 smcd_version;
> u8 *eid = NULL;
> int i, rc;
>+ u16 chid;
>
> if (!(ini->smcd_version & SMC_V2) || !smcd_indicated(ini->smc_type_v2))
> goto not_found;
>@@ -2147,18 +2167,31 @@ static void smc_find_ism_v2_device_serv(struct smc_sock *new_smc,
> smcd_v2_ext = smc_get_clc_smcd_v2_ext(smc_v2_ext);
>
> mutex_lock(&smcd_dev_list.mutex);
>- if (pclc_smcd->ism.chid)
>+ if (pclc_smcd->ism.chid) {
> /* check for ISM device matching proposed native ISM device */
>+ smcd_gid.gid = ntohll(pclc_smcd->ism.gid);
>+ smcd_gid.gid_ext = 0;
> smc_check_ism_v2_match(ini, ntohs(pclc_smcd->ism.chid),
>- ntohll(pclc_smcd->ism.gid), &matches);
>+ &smcd_gid, &matches);
>+ }
> for (i = 1; i <= smc_v2_ext->hdr.ism_gid_cnt; i++) {
> /* check for ISM devices matching proposed non-native ISM
> * devices
> */
>- smc_check_ism_v2_match(ini,
>- ntohs(smcd_v2_ext->gidchid[i - 1].chid),
>- ntohll(smcd_v2_ext->gidchid[i - 1].gid),
>- &matches);
>+ smcd_gid.gid = ntohll(smcd_v2_ext->gidchid[i - 1].gid);
>+ smcd_gid.gid_ext = 0;
>+ chid = ntohs(smcd_v2_ext->gidchid[i - 1].chid);
>+ if (__smc_ism_is_virtdev(chid)) {
>+ if (i >= smc_v2_ext->hdr.ism_gid_cnt ||
>+ chid != ntohs(smcd_v2_ext->gidchid[i].chid))
>+ /* check if next slot exists and the next slot's
>+ * chid is consistent with the current slot.
>+ */
>+ continue;
>+
>+ smcd_gid.gid_ext = ntohll(smcd_v2_ext->gidchid[i++].gid);
>+ }
>+ smc_check_ism_v2_match(ini, chid, &smcd_gid, &matches);
> }
> mutex_unlock(&smcd_dev_list.mutex);
>
>@@ -2207,7 +2240,8 @@ static void smc_find_ism_v1_device_serv(struct smc_sock *new_smc,
> if (!(ini->smcd_version & SMC_V1) || !smcd_indicated(ini->smc_type_v1))
> goto not_found;
> ini->is_smcd = true; /* prepare ISM check */
>- ini->ism_peer_gid[0] = ntohll(pclc_smcd->ism.gid);
>+ ini->ism_peer_gid[0].gid = ntohll(pclc_smcd->ism.gid);
>+ ini->ism_peer_gid[0].gid_ext = 0;
> rc = smc_find_ism_device(new_smc, ini);
> if (rc)
> goto not_found;
>diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
>index 125b0d2..aeb9643 100644
>--- a/net/smc/smc_clc.c
>+++ b/net/smc/smc_clc.c
>@@ -882,11 +882,13 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
> ETH_ALEN);
> }
> if (smcd_indicated(ini->smc_type_v1)) {
>+ struct smcd_gid smcd_gid;
>+
> /* add SMC-D specifics */
> if (ini->ism_dev[0]) {
> smcd = ini->ism_dev[0];
>- pclc_smcd->ism.gid =
>- htonll(smcd->ops->get_local_gid(smcd));
>+ smcd->ops->get_local_gid(smcd, &smcd_gid);
>+ pclc_smcd->ism.gid = htonll(smcd_gid.gid);
> pclc_smcd->ism.chid =
> htons(smc_ism_get_chid(ini->ism_dev[0]));
> }
>@@ -919,10 +921,11 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
> read_unlock(&smc_clc_eid_table.lock);
> }
> if (smcd_indicated(ini->smc_type_v2)) {
>+ struct smcd_gid smcd_gid;
> u8 *eid = NULL;
>+ int slot = 0;
>
> v2_ext->hdr.flag.seid = smc_clc_eid_table.seid_enabled;
>- v2_ext->hdr.ism_gid_cnt = ini->ism_offered_cnt;
> v2_ext->hdr.smcd_v2_ext_offset = htons(sizeof(*v2_ext) -
> offsetofend(struct smc_clnt_opts_area_hdr,
> smcd_v2_ext_offset) +
>@@ -934,14 +937,21 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
> if (ini->ism_offered_cnt) {
> for (i = 1; i <= ini->ism_offered_cnt; i++) {
> smcd = ini->ism_dev[i];
>- gidchids[i - 1].gid =
>- htonll(smcd->ops->get_local_gid(smcd));
>- gidchids[i - 1].chid =
>+ smcd->ops->get_local_gid(smcd, &smcd_gid);
>+ gidchids[slot].chid =
> htons(smc_ism_get_chid(ini->ism_dev[i]));
>+ gidchids[slot].gid = htonll(smcd_gid.gid);
>+ if (__smc_ism_is_virtdev(gidchids[slot].chid)) {
>+ /* virtual-ism takes two slots */
>+ gidchids[slot + 1].chid = gidchids[slot].chid;
>+ gidchids[slot + 1].gid = htonll(smcd_gid.gid_ext);
>+ slot++;
>+ }
>+ slot++;
> }
>- plen += ini->ism_offered_cnt *
>- sizeof(struct smc_clc_smcd_gid_chid);
>+ plen += slot * sizeof(struct smc_clc_smcd_gid_chid);
> }
>+ v2_ext->hdr.ism_gid_cnt = slot;
> }
> if (smcr_indicated(ini->smc_type_v2)) {
> memcpy(v2_ext->roce, ini->smcrv2.ib_gid_v2, SMC_GID_SIZE);
>@@ -977,7 +987,7 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
> vec[i++].iov_len = sizeof(*smcd_v2_ext);
> if (ini->ism_offered_cnt) {
> vec[i].iov_base = gidchids;
>- vec[i++].iov_len = ini->ism_offered_cnt *
>+ vec[i++].iov_len = v2_ext->hdr.ism_gid_cnt *
> sizeof(struct smc_clc_smcd_gid_chid);
> }
> }
>@@ -1019,23 +1029,28 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc,
> if (first_contact)
> clc->hdr.typev2 |= SMC_FIRST_CONTACT_MASK;
> if (conn->lgr->is_smcd) {
>+ struct smcd_gid smcd_gid;
>+ u16 chid;
>+
> /* SMC-D specific settings */
> memcpy(clc->hdr.eyecatcher, SMCD_EYECATCHER,
> sizeof(SMCD_EYECATCHER));
>+ conn->lgr->smcd->ops->get_local_gid(conn->lgr->smcd, &smcd_gid);
> clc->hdr.typev1 = SMC_TYPE_D;
>- clc->d0.gid =
>- conn->lgr->smcd->ops->get_local_gid(conn->lgr->smcd);
>- clc->d0.token = conn->rmb_desc->token;
>+ clc->d0.gid = htonll(smcd_gid.gid);
>+ clc->d0.token = htonll(conn->rmb_desc->token);
> clc->d0.dmbe_size = conn->rmbe_size_comp;
> clc->d0.dmbe_idx = 0;
> memcpy(&clc->d0.linkid, conn->lgr->id, SMC_LGR_ID_SIZE);
> if (version == SMC_V1) {
> clc->hdr.length = htons(SMCD_CLC_ACCEPT_CONFIRM_LEN);
> } else {
>- clc_v2->d1.chid =
>- htons(smc_ism_get_chid(conn->lgr->smcd));
>+ chid = smc_ism_get_chid(conn->lgr->smcd);
>+ clc_v2->d1.chid = htons(chid);
> if (eid && eid[0])
> memcpy(clc_v2->d1.eid, eid, SMC_MAX_EID_LEN);
>+ if (__smc_ism_is_virtdev(chid))
>+ clc_v2->d1.gid_ext = htonll(smcd_gid.gid_ext);
> len = SMCD_CLC_ACCEPT_CONFIRM_LEN_V2;
> if (first_contact) {
> fce_len = smc_clc_fill_fce(&fce, ini);
>diff --git a/net/smc/smc_clc.h b/net/smc/smc_clc.h
>index bcf37c8..611763a 100644
>--- a/net/smc/smc_clc.h
>+++ b/net/smc/smc_clc.h
>@@ -281,8 +281,8 @@ struct smc_clc_msg_accept_confirm_v2 { /* clc accept / confirm message */
> struct smcd_clc_msg_accept_confirm_common d0;
> __be16 chid;
> u8 eid[SMC_MAX_EID_LEN];
>- u8 reserved5[8];
>- } d1;
>+ __be64 gid_ext;
>+ } __packed d1;
> };
> };
>
>diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
>index d520ee6..6d7c738 100644
>--- a/net/smc/smc_core.c
>+++ b/net/smc/smc_core.c
>@@ -284,6 +284,9 @@ static int smc_nl_fill_lgr_v2_common(struct smc_link_group *lgr,
> {
> char smc_host[SMC_MAX_HOSTNAME_LEN + 1];
> char smc_eid[SMC_MAX_EID_LEN + 1];
>+ struct smcd_dev *smcd = lgr->smcd;
>+ struct smcd_gid smcd_gid;
>+ bool is_virtdev;
>
> if (nla_put_u8(skb, SMC_NLA_LGR_V2_VER, lgr->smc_version))
> goto errv2attr;
>@@ -299,6 +302,16 @@ static int smc_nl_fill_lgr_v2_common(struct smc_link_group *lgr,
> smc_eid[SMC_MAX_EID_LEN] = 0;
> if (nla_put_string(skb, SMC_NLA_LGR_V2_NEG_EID, smc_eid))
> goto errv2attr;
>+ smcd->ops->get_local_gid(smcd, &smcd_gid);
>+ is_virtdev = smc_ism_is_virtdev(smcd);
>+ if (nla_put_u64_64bit(skb, SMC_NLA_LGR_V2_GID_EXT,
>+ is_virtdev ? smcd_gid.gid_ext : 0,
>+ SMC_NLA_LGR_V2_PAD))
>+ goto errv2attr;
>+ if (nla_put_u64_64bit(skb, SMC_NLA_LGR_V2_PEER_GID_EXT,
>+ is_virtdev ? lgr->peer_gid.gid_ext : 0,
>+ SMC_NLA_LGR_V2_PAD))
>+ goto errv2attr;

I ran into a kernel panic which pointed to here, and it turns out the
smcd here is NULL. See below:

But taking a closer look at the code, I'm wondering why those SMCD
related attributes are filled in smc_nl_fill_lgr_v2_common() which
should only fill the common attributes ?



[17567.395214] BUG: kernel NULL pointer dereference, address:
0000000000000000
[17567.395729] #PF: supervisor read access in kernel mode
[17567.396086] #PF: error_code(0x0000) - not-present page
[17567.396442] PGD 0 P4D 0
[17567.396623] Oops: 0000 [#1] SMP NOPTI
[17567.396873] CPU: 7 PID: 34729 Comm: smcr Tainted: G W E
6.6.0-rc2-00669-gf4cfa8873d90-dirty #577
[17567.397528] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[17567.398272] RIP: 0010:smc_nl_fill_lgr_v2_common.isra.0+0x1e0/0x2e0
[smc]
[17567.398734] Code: 21 0f 87 0e 01 00 00 0f 84 fc 00 00 00 8d 50 01 48
8d 4c 24 3f be 03 00 00 00 48 89 ef e8 28 c4 1a e1 85 c0 0f 85 72 fe ff
ff <49> 8b 45 00 4c 89 ef 48 8d 74 24 08 48 8b 40 68 ff d0 0f 1f 00 49
[17567.399950] RSP: 0018:ffffc90002adb820 EFLAGS: 00010246
[17567.400297] RAX: 0000000000000000 RBX: ffff8881cad2402c RCX:
0000000000000057
[17567.400769] RDX: 0000000000000000 RSI: ffffc90002adb85f RDI:
ffff8881cad24074
[17567.401237] RBP: ffff888107d4cd00 R08: 0020202020202020 R09:
2020202020202020
[17567.401709] R10: 2020202020202020 R11: 2020202020415955 R12:
ffff8882673e0000
[17567.402176] R13: 0000000000000000 R14: ffff8881cad23fe4 R15:
ffff88813fbe4b60
[17567.402650] FS: 00007f679f655740(0000) GS:ffff88842fdc0000(0000)
knlGS:0000000000000000
[17567.403179] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17567.403561] CR2: 0000000000000000 CR3: 00000001680d2005 CR4:
0000000000370ee0
[17567.404029] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[17567.404501] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[17567.404968] Call Trace:
[17567.405140] <TASK>
[17567.405287] ? __die_body+0x1f/0x70
[17567.405533] ? page_fault_oops+0x14c/0x440
[17567.405813] ? exc_page_fault+0x69/0x120
[17567.406079] ? asm_exc_page_fault+0x26/0x30
[17567.406361] ? smc_nl_fill_lgr_v2_common.isra.0+0x1e0/0x2e0 [smc]
[17567.406778] smc_nl_fill_lgr_list.constprop.0+0x368/0x4e0 [smc]
[17567.407180] smcr_nl_get_link+0x17/0x20 [smc]
[17567.407484] genl_dumpit+0x32/0x90
[17567.407722] netlink_dump+0x19d/0x3b0
[17567.407971] __netlink_dump_start+0x1d3/0x290
[17567.408266] genl_family_rcv_msg_dumpit.isra.0+0x7d/0xd0
[17567.408625] ? __pfx_genl_start+0x10/0x10
[17567.408897] ? __pfx_genl_dumpit+0x10/0x10
[17567.409172] ? __pfx_genl_done+0x10/0x10
[17567.409437] genl_rcv_msg+0x113/0x2a0
[17567.409690] ? __pfx_smcr_nl_get_link+0x10/0x10 [smc]
[17567.410040] ? __pfx_genl_rcv_msg+0x10/0x10
[17567.410312] netlink_rcv_skb+0x58/0x110
[17567.410567] genl_rcv+0x28/0x40
[17567.410777] netlink_unicast+0x181/0x240
[17567.411033] netlink_sendmsg+0x240/0x4a0
[17567.411288] sock_sendmsg+0xb1/0xc0
[17567.411524] ____sys_sendmsg+0x20f/0x300
[17567.411780] ? copy_msghdr_from_user+0x62/0x80
[17567.412067] ___sys_sendmsg+0x81/0xc0
[17567.412306] ? folio_add_lru+0x2b/0x30
[17567.412555] ? do_anonymous_page+0x18d/0x4e0
[17567.412837] ? __handle_mm_fault+0x47f/0x7c0
[17567.413115] __sys_sendmsg+0x4d/0x80
[17567.413349] ? exit_to_user_mode_prepare+0x3c/0x190
[17567.413669] do_syscall_64+0x3c/0x90
[17567.413908] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[17567.414234] RIP: 0033:0x7f679f79c177
[17567.414467] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f
1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f
05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[17567.415640] RSP: 002b:00007ffe6d08c168 EFLAGS: 00000246 ORIG_RAX:
000000000000002e
[17567.416119] RAX: ffffffffffffffda RBX: 00005619cb470390 RCX:
00007f679f79c177
[17567.416574] RDX: 0000000000000000 RSI: 00007ffe6d08c1a0 RDI:
0000000000000003
[17567.417026] RBP: 00005619cb4702a0 R08: 0000000000000004 R09:
0000000000000300
[17567.417481] R10: 0000000000000004 R11: 0000000000000246 R12:
00005619cb4704b0
[17567.417932] R13: 00007ffe6d08c1a0 R14: 0000000000000000 R15:
0000000000000000
[17567.418385] </TASK>
[17567.418535] Modules linked in: smc_diag(E) smc(E) rpcrdma(E)
sunrpc(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E) target_core_mod(E)
ib_ipoib(E) ib_iser(E) libiscsi(E) scsi_transport_iscsi(E) mlx5_ib(E)
rfkill(E)
[17567.419722] CR2: 0000000000000000
[17567.419940] ---[ end trace 0000000000000000 ]---
[17567.420237] RIP: 0010:smc_nl_fill_lgr_v2_common.isra.0+0x1e0/0x2e0
[smc]
[17567.420676] Code: 21 0f 87 0e 01 00 00 0f 84 fc 00 00 00 8d 50 01 48
8d 4c 24 3f be 03 00 00 00 48 89 ef e8 28 c4 1a e1 85 c0 0f 85 72 fe ff
ff <49> 8b 45 00 4c 89 ef 48 8d 74 24 08 48 8b 40 68 ff d0 0f 1f 00 49
[17567.421842] RSP: 0018:ffffc90002adb820 EFLAGS: 00010246
[17567.422176] RAX: 0000000000000000 RBX: ffff8881cad2402c RCX:
0000000000000057
[17567.422631] RDX: 0000000000000000 RSI: ffffc90002adb85f RDI:
ffff8881cad24074
[17567.423082] RBP: ffff888107d4cd00 R08: 0020202020202020 R09:
2020202020202020
[17567.423536] R10: 2020202020202020 R11: 2020202020415955 R12:
ffff8882673e0000
[17567.423987] R13: 0000000000000000 R14: ffff8881cad23fe4 R15:
ffff88813fbe4b60
[17567.424438] FS: 00007f679f655740(0000) GS:ffff88842fdc0000(0000)
knlGS:0000000000000000
[17567.424950] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17567.425317] CR2: 0000000000000000 CR3: 00000001680d2005 CR4:
0000000000370ee0
[17567.425773] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[17567.426225] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[17567.426677] Kernel panic - not syncing: Fatal exception in interrupt
[17567.427583] Kernel Offset: disabled
[17567.427817] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt ]---


Best regards,
Dust



>
> nla_nest_end(skb, v2_attrs);
> return 0;

2023-10-12 13:24:41

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 06/18] net/smc: extend GID to 128bits only for virtual ISM device



On 2023/10/12 15:54, Dust Li wrote:

> On Sun, Sep 24, 2023 at 11:16:41PM +0800, Wen Gu wrote:
>> Virtual ISM devices are introduced to SMC-Dv2.1 protocal, whose GIDs
>> are 128-bits UUIDs as defined by RFC4122. And note that the GIDs of
>> ISM devices still remain 64-bits.
>>
>> This patch adapts the relevant codes, such as CLC handshake, to make
>> it compatible with 128 bits GID.
>>
>> Signed-off-by: Wen Gu <[email protected]>
>> ---
>> drivers/s390/net/ism_drv.c | 18 +++++++------
>> include/net/smc.h | 15 +++++++----
>> include/uapi/linux/smc.h | 3 +++
>> include/uapi/linux/smc_diag.h | 2 ++
>> net/smc/af_smc.c | 60 +++++++++++++++++++++++++++++++++----------
>> net/smc/smc_clc.c | 43 +++++++++++++++++++++----------
>> net/smc/smc_clc.h | 4 +--
>> net/smc/smc_core.c | 41 +++++++++++++++++++++--------
>> net/smc/smc_core.h | 7 ++---
>> net/smc/smc_diag.c | 11 ++++++--
>> net/smc/smc_ism.c | 18 ++++++++-----
>> net/smc/smc_ism.h | 3 ++-
>> net/smc/smc_pnet.c | 4 +--
>> 13 files changed, 163 insertions(+), 66 deletions(-)
>>

<...>

>> diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h
>> index 837fcd4..0d2f020 100644
>> --- a/include/uapi/linux/smc.h
>> +++ b/include/uapi/linux/smc.h
>> @@ -99,6 +99,9 @@ enum {
>> SMC_NLA_LGR_V2_OS, /* u8 */
>> SMC_NLA_LGR_V2_NEG_EID, /* string */
>> SMC_NLA_LGR_V2_PEER_HOST, /* string */
>> + SMC_NLA_LGR_V2_PAD, /* flag */
>> + SMC_NLA_LGR_V2_GID_EXT, /* u64 */
>> + SMC_NLA_LGR_V2_PEER_GID_EXT, /* u64 */
>
> Why those abbributes are add here, which was supposed to add common
> abbritubtes ?
>

Thanks! After looking back at this code, I also feel that the extended GID
attributes should be moved to other places. The nested attributes here are
shared by both SMC-R and SMC-D.

So similar to the SMC_NLA_LGR_R_V2 nested attributes introduced by SMC-R,
a new SMC_NLA_LGR_D_V2 nested attributes may be more suitable,
and processed by a new smc_nl_fill_smcr_lgr_v2() helper in smc_nl_fill_smcd_lgr().

<...>

>> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
>> index d520ee6..6d7c738 100644
>> --- a/net/smc/smc_core.c
>> +++ b/net/smc/smc_core.c
>> @@ -284,6 +284,9 @@ static int smc_nl_fill_lgr_v2_common(struct smc_link_group *lgr,
>> {
>> char smc_host[SMC_MAX_HOSTNAME_LEN + 1];
>> char smc_eid[SMC_MAX_EID_LEN + 1];
>> + struct smcd_dev *smcd = lgr->smcd;
>> + struct smcd_gid smcd_gid;
>> + bool is_virtdev;
>>
>> if (nla_put_u8(skb, SMC_NLA_LGR_V2_VER, lgr->smc_version))
>> goto errv2attr;
>> @@ -299,6 +302,16 @@ static int smc_nl_fill_lgr_v2_common(struct smc_link_group *lgr,
>> smc_eid[SMC_MAX_EID_LEN] = 0;
>> if (nla_put_string(skb, SMC_NLA_LGR_V2_NEG_EID, smc_eid))
>> goto errv2attr;
>> + smcd->ops->get_local_gid(smcd, &smcd_gid);
>> + is_virtdev = smc_ism_is_virtdev(smcd);
>> + if (nla_put_u64_64bit(skb, SMC_NLA_LGR_V2_GID_EXT,
>> + is_virtdev ? smcd_gid.gid_ext : 0,
>> + SMC_NLA_LGR_V2_PAD))
>> + goto errv2attr;
>> + if (nla_put_u64_64bit(skb, SMC_NLA_LGR_V2_PEER_GID_EXT,
>> + is_virtdev ? lgr->peer_gid.gid_ext : 0,
>> + SMC_NLA_LGR_V2_PAD))
>> + goto errv2attr;
>
> I ran into a kernel panic which pointed to here, and it turns out the
> smcd here is NULL. See below:
>
> But taking a closer look at the code, I'm wondering why those SMCD
> related attributes are filled in smc_nl_fill_lgr_v2_common() which
> should only fill the common attributes ?
>
>

I guess the crash occurs when running in SMC-R. Due to this improper
code, the invalid lgr->smcd was accessed mistakenly in the common helper.

So as mentioned above, this extended GID related code will be moved out
of this common helper and fix this issue. Thanks!

Regards,
Wen Gu

>
> [17567.395214] BUG: kernel NULL pointer dereference, address:
> 0000000000000000
> [17567.395729] #PF: supervisor read access in kernel mode
> [17567.396086] #PF: error_code(0x0000) - not-present page
> [17567.396442] PGD 0 P4D 0
> [17567.396623] Oops: 0000 [#1] SMP NOPTI
> [17567.396873] CPU: 7 PID: 34729 Comm: smcr Tainted: G W E
> 6.6.0-rc2-00669-gf4cfa8873d90-dirty #577
> [17567.397528] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [17567.398272] RIP: 0010:smc_nl_fill_lgr_v2_common.isra.0+0x1e0/0x2e0
> [smc]
> [17567.398734] Code: 21 0f 87 0e 01 00 00 0f 84 fc 00 00 00 8d 50 01 48
> 8d 4c 24 3f be 03 00 00 00 48 89 ef e8 28 c4 1a e1 85 c0 0f 85 72 fe ff
> ff <49> 8b 45 00 4c 89 ef 48 8d 74 24 08 48 8b 40 68 ff d0 0f 1f 00 49
> [17567.399950] RSP: 0018:ffffc90002adb820 EFLAGS: 00010246
> [17567.400297] RAX: 0000000000000000 RBX: ffff8881cad2402c RCX:
> 0000000000000057
> [17567.400769] RDX: 0000000000000000 RSI: ffffc90002adb85f RDI:
> ffff8881cad24074
> [17567.401237] RBP: ffff888107d4cd00 R08: 0020202020202020 R09:
> 2020202020202020
> [17567.401709] R10: 2020202020202020 R11: 2020202020415955 R12:
> ffff8882673e0000
> [17567.402176] R13: 0000000000000000 R14: ffff8881cad23fe4 R15:
> ffff88813fbe4b60
> [17567.402650] FS: 00007f679f655740(0000) GS:ffff88842fdc0000(0000)
> knlGS:0000000000000000
> [17567.403179] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [17567.403561] CR2: 0000000000000000 CR3: 00000001680d2005 CR4:
> 0000000000370ee0
> [17567.404029] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [17567.404501] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [17567.404968] Call Trace:
> [17567.405140] <TASK>
> [17567.405287] ? __die_body+0x1f/0x70
> [17567.405533] ? page_fault_oops+0x14c/0x440
> [17567.405813] ? exc_page_fault+0x69/0x120
> [17567.406079] ? asm_exc_page_fault+0x26/0x30
> [17567.406361] ? smc_nl_fill_lgr_v2_common.isra.0+0x1e0/0x2e0 [smc]
> [17567.406778] smc_nl_fill_lgr_list.constprop.0+0x368/0x4e0 [smc]
> [17567.407180] smcr_nl_get_link+0x17/0x20 [smc]
> [17567.407484] genl_dumpit+0x32/0x90
> [17567.407722] netlink_dump+0x19d/0x3b0
> [17567.407971] __netlink_dump_start+0x1d3/0x290
> [17567.408266] genl_family_rcv_msg_dumpit.isra.0+0x7d/0xd0
> [17567.408625] ? __pfx_genl_start+0x10/0x10
> [17567.408897] ? __pfx_genl_dumpit+0x10/0x10
> [17567.409172] ? __pfx_genl_done+0x10/0x10
> [17567.409437] genl_rcv_msg+0x113/0x2a0
> [17567.409690] ? __pfx_smcr_nl_get_link+0x10/0x10 [smc]
> [17567.410040] ? __pfx_genl_rcv_msg+0x10/0x10
> [17567.410312] netlink_rcv_skb+0x58/0x110
> [17567.410567] genl_rcv+0x28/0x40
> [17567.410777] netlink_unicast+0x181/0x240
> [17567.411033] netlink_sendmsg+0x240/0x4a0
> [17567.411288] sock_sendmsg+0xb1/0xc0
> [17567.411524] ____sys_sendmsg+0x20f/0x300
> [17567.411780] ? copy_msghdr_from_user+0x62/0x80
> [17567.412067] ___sys_sendmsg+0x81/0xc0
> [17567.412306] ? folio_add_lru+0x2b/0x30
> [17567.412555] ? do_anonymous_page+0x18d/0x4e0
> [17567.412837] ? __handle_mm_fault+0x47f/0x7c0
> [17567.413115] __sys_sendmsg+0x4d/0x80
> [17567.413349] ? exit_to_user_mode_prepare+0x3c/0x190
> [17567.413669] do_syscall_64+0x3c/0x90
> [17567.413908] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [17567.414234] RIP: 0033:0x7f679f79c177
> [17567.414467] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f
> 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f
> 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
> [17567.415640] RSP: 002b:00007ffe6d08c168 EFLAGS: 00000246 ORIG_RAX:
> 000000000000002e
> [17567.416119] RAX: ffffffffffffffda RBX: 00005619cb470390 RCX:
> 00007f679f79c177
> [17567.416574] RDX: 0000000000000000 RSI: 00007ffe6d08c1a0 RDI:
> 0000000000000003
> [17567.417026] RBP: 00005619cb4702a0 R08: 0000000000000004 R09:
> 0000000000000300
> [17567.417481] R10: 0000000000000004 R11: 0000000000000246 R12:
> 00005619cb4704b0
> [17567.417932] R13: 00007ffe6d08c1a0 R14: 0000000000000000 R15:
> 0000000000000000
> [17567.418385] </TASK>
> [17567.418535] Modules linked in: smc_diag(E) smc(E) rpcrdma(E)
> sunrpc(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E) target_core_mod(E)
> ib_ipoib(E) ib_iser(E) libiscsi(E) scsi_transport_iscsi(E) mlx5_ib(E)
> rfkill(E)
> [17567.419722] CR2: 0000000000000000
> [17567.419940] ---[ end trace 0000000000000000 ]---
> [17567.420237] RIP: 0010:smc_nl_fill_lgr_v2_common.isra.0+0x1e0/0x2e0
> [smc]
> [17567.420676] Code: 21 0f 87 0e 01 00 00 0f 84 fc 00 00 00 8d 50 01 48
> 8d 4c 24 3f be 03 00 00 00 48 89 ef e8 28 c4 1a e1 85 c0 0f 85 72 fe ff
> ff <49> 8b 45 00 4c 89 ef 48 8d 74 24 08 48 8b 40 68 ff d0 0f 1f 00 49
> [17567.421842] RSP: 0018:ffffc90002adb820 EFLAGS: 00010246
> [17567.422176] RAX: 0000000000000000 RBX: ffff8881cad2402c RCX:
> 0000000000000057
> [17567.422631] RDX: 0000000000000000 RSI: ffffc90002adb85f RDI:
> ffff8881cad24074
> [17567.423082] RBP: ffff888107d4cd00 R08: 0020202020202020 R09:
> 2020202020202020
> [17567.423536] R10: 2020202020202020 R11: 2020202020415955 R12:
> ffff8882673e0000
> [17567.423987] R13: 0000000000000000 R14: ffff8881cad23fe4 R15:
> ffff88813fbe4b60
> [17567.424438] FS: 00007f679f655740(0000) GS:ffff88842fdc0000(0000)
> knlGS:0000000000000000
> [17567.424950] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [17567.425317] CR2: 0000000000000000 CR3: 00000001680d2005 CR4:
> 0000000000370ee0
> [17567.425773] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [17567.426225] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [17567.426677] Kernel panic - not syncing: Fatal exception in interrupt
> [17567.427583] Kernel Offset: disabled
> [17567.427817] ---[ end Kernel panic - not syncing: Fatal exception in
> interrupt ]---
>
>
> Best regards,
> Dust
>
>
>
>>
>> nla_nest_end(skb, v2_attrs);
>> return 0;

2023-10-17 03:49:22

by Wen Gu

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism



On 2023/10/8 15:19, Wen Gu wrote:
>
>
> On 2023/10/5 16:21, Niklas Schnelle wrote:
>
>>
>> Hi Wen Gu,
>>
>> I've been trying out your series with iperf3, qperf, and uperf on
>> s390x. I'm using network namespaces with a ConnectX VF from the same
>> card in each namespace for the initial TCP/IP connection i.e. initially
>> it goes out to a real NIC even if that can switch internally. All of
>> these look great for streaming workloads both in terms of performance
>> and stability. With a Connect-Request-Response workload and uperf
>> however I've run into issues. The test configuration I use is as
>> follows:
>>
>> Client Command:
>>
>> # host=$ip_server ip netns exec client smc_run uperf -m tcp_crr.xml
>>
>> Server Command:
>>
>> # ip netns exec server smc_run uperf -s &> /dev/null
>>
>> Uperf tcp_crr.xml:
>>
>> <?xml version="1.0"?>
>> <profile name="TCP_CRR">
>>          <group nthreads="12">
>>                  <transaction duration="120">
>>                          <flowop type="connect" options="remotehost=$host protocol=tcp" />
>>                          <flowop type="write" options="size=200"/>
>>                          <flowop type="read" options="size=1000"/>
>>                          <flowop type="disconnect" />
>>                  </transaction>
>>          </group>
>> </profile>
>>
>> The workload first runs fine but then after about 4 GB of data
>> transferred fails with "Connection refused" and "Connection reset by
>> peer" errors. The failure is not permanent however and re-running
>> the streaming workloads run fine again (with both uperf server and
>> client restarted). So I suspect something gets stuck in either the
>> client or server sockets. The same workload runs fine with TCP/IP of
>> course.
>>
>> Thanks,
>> Niklas
>>
>>
>
> Hi Niklas,
>
> Thank you very much for the test. With the test example you provided, I've
> reproduced the issue in my VM. And moreover, sometimes the test complains
> with 'Error saying goodbye with <ip>'
>
> I'll figure out what's going on here.
>
> Thanks!
> Wen Gu

I think that there is a common issue for SMC-R and SMC-D. I also reproduce
'connection reset by peer' and 'Error saying goodbye with <ip>' when using
SMC-R under the same test condition. They occur at the end of the test.

When the uperf test time ends, some signals are sent. At this point there
are usually some SMC connections doing CLC handshake. I catch some -EINTR(-4)
in client and -ECONNRESET(-104) in server returned from smc_clc_wait_msg,
(correspondingly handshake error counts also increase) and TCP RST packets
sent to terminate the CLC TCP connection(clcsock).

I am not sure if this should be considered as a bydesign or a bug of SMC.
From an application perspective, the conn reset behavior only happens when
using SMC.

@Wenjia, could you please take a look at this?

Thanks,
Wen Gu

2023-10-18 13:24:43

by Alexandra Winter

[permalink] [raw]
Subject: Re: [PATCH net-next v4 10/18] net/smc: implement ID-related operations of loopback



On 24.09.23 17:16, Wen Gu wrote:
> This patch implements GID/CHID/SEID related operations of SMC-D loopback device. In loopback device, GID is generated by UUIDv4 algorithm, CHID is reserved 0xFFFF, SEID is generated using the same algorithm as ISM device under s390 architecture, and is 0 and disabled under non-s390 architecture. Signed-off-by: Wen Gu <[email protected]> ---

IMO, get_system_eid should not be part of smcd_ops. And should not be provided by an smcd device.
It is a system_eid is a global value that is valid for all smcd interfaces of this system (os instance).
So I think it should be provided by the smc module.

I agree it needs to be architecture dependent and same as today for s390.

2023-10-18 19:44:22

by Wenjia Zhang

[permalink] [raw]
Subject: Re: [PATCH net-next v4 00/18] net/smc: implement virtual ISM extension and loopback-ism



On 17.10.23 05:49, Wen Gu wrote:
>
>
> On 2023/10/8 15:19, Wen Gu wrote:
>>
>>
>> On 2023/10/5 16:21, Niklas Schnelle wrote:
>>
>>>
>>> Hi Wen Gu,
>>>
>>> I've been trying out your series with iperf3, qperf, and uperf on
>>> s390x. I'm using network namespaces with a ConnectX VF from the same
>>> card in each namespace for the initial TCP/IP connection i.e. initially
>>> it goes out to a real NIC even if that can switch internally. All of
>>> these look great for streaming workloads both in terms of performance
>>> and stability. With a Connect-Request-Response workload and uperf
>>> however I've run into issues. The test configuration I use is as
>>> follows:
>>>
>>> Client Command:
>>>
>>> # host=$ip_server ip netns exec client smc_run uperf -m tcp_crr.xml
>>>
>>> Server Command:
>>>
>>> # ip netns exec server smc_run uperf -s &> /dev/null
>>>
>>> Uperf tcp_crr.xml:
>>>
>>> <?xml version="1.0"?>
>>> <profile name="TCP_CRR">
>>>          <group nthreads="12">
>>>                  <transaction duration="120">
>>>                          <flowop type="connect"
>>> options="remotehost=$host protocol=tcp" />
>>>                          <flowop type="write" options="size=200"/>
>>>                          <flowop type="read" options="size=1000"/>
>>>                          <flowop type="disconnect" />
>>>                  </transaction>
>>>          </group>
>>> </profile>
>>>
>>> The workload first runs fine but then after about 4 GB of data
>>> transferred fails with "Connection refused" and "Connection reset by
>>> peer" errors. The failure is not permanent however and re-running
>>> the streaming workloads run fine again (with both uperf server and
>>> client restarted). So I suspect something gets stuck in either the
>>> client or server sockets. The same workload runs fine with TCP/IP of
>>> course.
>>>
>>> Thanks,
>>> Niklas
>>>
>>>
>>
>> Hi Niklas,
>>
>> Thank you very much for the test. With the test example you provided,
>> I've
>> reproduced the issue in my VM. And moreover, sometimes the test complains
>> with 'Error saying goodbye with <ip>'
>>
>> I'll figure out what's going on here.
>>
>> Thanks!
>> Wen Gu
>
> I think that there is a common issue for SMC-R and SMC-D. I also reproduce
> 'connection reset by peer' and 'Error saying goodbye with <ip>' when using
> SMC-R under the same test condition. They occur at the end of the test.
>
> When the uperf test time ends, some signals are sent. At this point there
> are usually some SMC connections doing CLC handshake. I catch some
> -EINTR(-4)
> in client and -ECONNRESET(-104) in server returned from smc_clc_wait_msg,
> (correspondingly handshake error counts also increase) and TCP RST packets
> sent to terminate the CLC TCP connection(clcsock).
>
> I am not sure if this should be considered as a bydesign or a bug of SMC.
> From an application perspective, the conn reset behavior only happens when
> using SMC.
>
> @Wenjia, could you please take a look at this?
>
> Thanks,
> Wen Gu

Hi Wen,

Do you mean the bug in smc_clc_wait_msg()?
If yes, I can not see any problem in the smc_clc_wait_msg(). From your
description, it looks to me like the server should get the CLC_PROPOSAL
message, but nothing in it while the client is waiting for the accept
CLC_ACCEPT message from the server until the wait loops is broken out.

Thanks,
Wenjia