2021-04-05 05:25:11

by Leon Romanovsky

[permalink] [raw]
Subject: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

From: Leon Romanovsky <[email protected]>

From Avihai,

Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
imposed on PCI transactions, and thus, can improve performance.

Until now, relaxed ordering could be set only by user space applications
for user MRs. The following patch series enables relaxed ordering for the
kernel ULPs as well. Relaxed ordering is an optional capability, and as
such, it is ignored by vendors that don't support it.

The following test results show the performance improvement achieved
with relaxed ordering. The test was performed on a NVIDIA A100 in order
to check performance of storage infrastructure over xprtrdma:

Without Relaxed Ordering:
READ: bw=16.5GiB/s (17.7GB/s), 16.5GiB/s-16.5GiB/s (17.7GB/s-17.7GB/s),
io=1987GiB (2133GB), run=120422-120422msec

With relaxed ordering:
READ: bw=72.9GiB/s (78.2GB/s), 72.9GiB/s-72.9GiB/s (78.2GB/s-78.2GB/s),
io=2367GiB (2542GB), run=32492-32492msec

Thanks

Avihai Horon (10):
RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()
RDMA/iser: Enable Relaxed Ordering
RDMA/rtrs: Enable Relaxed Ordering
RDMA/srp: Enable Relaxed Ordering
nvme-rdma: Enable Relaxed Ordering
cifs: smbd: Enable Relaxed Ordering
net/rds: Enable Relaxed Ordering
net/smc: Enable Relaxed Ordering
xprtrdma: Enable Relaxed Ordering

drivers/infiniband/core/mr_pool.c | 7 +-
drivers/infiniband/core/rw.c | 12 ++--
drivers/infiniband/core/verbs.c | 26 +++++--
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 2 +-
drivers/infiniband/hw/bnxt_re/ib_verbs.h | 2 +-
drivers/infiniband/hw/cxgb4/iw_cxgb4.h | 2 +-
drivers/infiniband/hw/cxgb4/mem.c | 2 +-
drivers/infiniband/hw/hns/hns_roce_device.h | 2 +-
drivers/infiniband/hw/hns/hns_roce_mr.c | 2 +-
drivers/infiniband/hw/i40iw/i40iw_verbs.c | 3 +-
drivers/infiniband/hw/mlx4/mlx4_ib.h | 2 +-
drivers/infiniband/hw/mlx4/mr.c | 2 +-
drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 ++--
drivers/infiniband/hw/mlx5/mr.c | 61 ++++++++--------
drivers/infiniband/hw/mlx5/wr.c | 69 ++++++++++++++-----
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +-
drivers/infiniband/hw/ocrdma/ocrdma_verbs.h | 2 +-
drivers/infiniband/hw/qedr/verbs.c | 2 +-
drivers/infiniband/hw/qedr/verbs.h | 2 +-
drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 4 +-
.../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h | 2 +-
drivers/infiniband/sw/rdmavt/mr.c | 3 +-
drivers/infiniband/sw/rdmavt/mr.h | 2 +-
drivers/infiniband/sw/rxe/rxe_verbs.c | 2 +-
drivers/infiniband/sw/siw/siw_verbs.c | 2 +-
drivers/infiniband/sw/siw/siw_verbs.h | 2 +-
drivers/infiniband/ulp/iser/iser_memory.c | 10 ++-
drivers/infiniband/ulp/iser/iser_verbs.c | 6 +-
drivers/infiniband/ulp/rtrs/rtrs-clt.c | 6 +-
drivers/infiniband/ulp/rtrs/rtrs-srv.c | 15 ++--
drivers/infiniband/ulp/srp/ib_srp.c | 8 +--
drivers/nvme/host/rdma.c | 19 +++--
fs/cifs/smbdirect.c | 17 +++--
include/rdma/ib_verbs.h | 11 ++-
include/rdma/mr_pool.h | 3 +-
net/rds/ib_frmr.c | 7 +-
net/smc/smc_ib.c | 3 +-
net/smc/smc_wr.c | 3 +-
net/sunrpc/xprtrdma/frwr_ops.c | 10 +--
39 files changed, 209 insertions(+), 140 deletions(-)

--
2.30.2


2021-04-05 05:25:11

by Leon Romanovsky

[permalink] [raw]
Subject: [PATCH rdma-next 03/10] RDMA/iser: Enable Relaxed Ordering

From: Avihai Horon <[email protected]>

Enable Relaxed Ordering for iser.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/ulp/iser/iser_memory.c | 10 ++++------
drivers/infiniband/ulp/iser/iser_verbs.c | 6 ++++--
2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index afec40da9b58..29849c9e82e8 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -271,9 +271,8 @@ iser_reg_sig_mr(struct iscsi_iser_task *iser_task,
wr->wr.send_flags = 0;
wr->mr = mr;
wr->key = mr->rkey;
- wr->access = IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_READ |
- IB_ACCESS_REMOTE_WRITE;
+ wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+ IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
rsc->mr_valid = 1;

sig_reg->sge.lkey = mr->lkey;
@@ -318,9 +317,8 @@ static int iser_fast_reg_mr(struct iscsi_iser_task *iser_task,
wr->wr.num_sge = 0;
wr->mr = mr;
wr->key = mr->rkey;
- wr->access = IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE |
- IB_ACCESS_REMOTE_READ;
+ wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_REMOTE_READ | IB_ACCESS_RELAXED_ORDERING;

rsc->mr_valid = 1;

diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 3c370ee25f2f..1e236b1cf29b 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -121,7 +121,8 @@ iser_create_fastreg_desc(struct iser_device *device,
else
mr_type = IB_MR_TYPE_MEM_REG;

- desc->rsc.mr = ib_alloc_mr(pd, mr_type, size, 0);
+ desc->rsc.mr = ib_alloc_mr(pd, mr_type, size,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(desc->rsc.mr)) {
ret = PTR_ERR(desc->rsc.mr);
iser_err("Failed to allocate ib_fast_reg_mr err=%d\n", ret);
@@ -129,7 +130,8 @@ iser_create_fastreg_desc(struct iser_device *device,
}

if (pi_enable) {
- desc->rsc.sig_mr = ib_alloc_mr_integrity(pd, size, size, 0);
+ desc->rsc.sig_mr = ib_alloc_mr_integrity(pd, size, size,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(desc->rsc.sig_mr)) {
ret = PTR_ERR(desc->rsc.sig_mr);
iser_err("Failed to allocate sig_mr err=%d\n", ret);
--
2.30.2

2021-04-05 05:25:11

by Leon Romanovsky

[permalink] [raw]
Subject: [PATCH rdma-next 04/10] RDMA/rtrs: Enable Relaxed Ordering

From: Avihai Horon <[email protected]>

Enable Relaxed Ordering fro rtrs client and server.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/ulp/rtrs/rtrs-clt.c | 6 ++++--
drivers/infiniband/ulp/rtrs/rtrs-srv.c | 15 ++++++++-------
2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.c b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
index 0d3960ed5b2b..a3fbb47a3574 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
@@ -1099,7 +1099,8 @@ static int rtrs_clt_read_req(struct rtrs_clt_io_req *req)
.mr = req->mr,
.key = req->mr->rkey,
.access = (IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE),
+ IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING),
};
wr = &rwr.wr;

@@ -1260,7 +1261,8 @@ static int alloc_sess_reqs(struct rtrs_clt_sess *sess)
goto out;

req->mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
- sess->max_pages_per_mr, 0);
+ sess->max_pages_per_mr,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(req->mr)) {
err = PTR_ERR(req->mr);
req->mr = NULL;
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv.c b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
index 575f31ff20fd..c28ed5e2245d 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-srv.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
@@ -312,8 +312,8 @@ static int rdma_write_sg(struct rtrs_srv_op *id)
rwr.mr = srv_mr->mr;
rwr.wr.send_flags = 0;
rwr.key = srv_mr->mr->rkey;
- rwr.access = (IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE);
+ rwr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING);
msg = srv_mr->iu->buf;
msg->buf_id = cpu_to_le16(id->msg_id);
msg->type = cpu_to_le16(RTRS_MSG_RKEY_RSP);
@@ -432,8 +432,8 @@ static int send_io_resp_imm(struct rtrs_srv_con *con, struct rtrs_srv_op *id,
rwr.wr.send_flags = 0;
rwr.mr = srv_mr->mr;
rwr.key = srv_mr->mr->rkey;
- rwr.access = (IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE);
+ rwr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING);
msg = srv_mr->iu->buf;
msg->buf_id = cpu_to_le16(id->msg_id);
msg->type = cpu_to_le16(RTRS_MSG_RKEY_RSP);
@@ -638,7 +638,7 @@ static int map_cont_bufs(struct rtrs_srv_sess *sess)
goto free_sg;
}
mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
- sgt->nents, 0);
+ sgt->nents, IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(mr)) {
err = PTR_ERR(mr);
goto unmap_sg;
@@ -823,8 +823,9 @@ static int process_info_req(struct rtrs_srv_con *con,
rwr[mri].wr.send_flags = 0;
rwr[mri].mr = mr;
rwr[mri].key = mr->rkey;
- rwr[mri].access = (IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE);
+ rwr[mri].access =
+ (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING);
reg_wr = &rwr[mri].wr;
}

--
2.30.2

2021-04-05 05:25:23

by Leon Romanovsky

[permalink] [raw]
Subject: [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()

From: Avihai Horon <[email protected]>

Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
local_dma_lkey.

This will take effect only for devices that don't pre-allocate the lkey
but allocate it per PD allocation.

Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/core/verbs.c | 3 ++-
drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a1782f8a6ca0..9b719f7d6fd5 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
pd->local_dma_lkey = device->local_dma_lkey;
else
- mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
+ mr_access_flags |=
+ IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;

if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
pr_warn("%s: enabling unsafe global rkey\n", caller);
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
index b3fa783698a0..d74827694f92 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
@@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
int ret;

/* Support only LOCAL_WRITE flag for DMA MRs */
+ acc &= ~IB_ACCESS_RELAXED_ORDERING;
if (acc & ~IB_ACCESS_LOCAL_WRITE) {
dev_warn(&dev->pdev->dev,
"unsupported dma mr access flags %#x\n", acc);
--
2.30.2

2021-04-05 05:25:31

by Leon Romanovsky

[permalink] [raw]
Subject: [PATCH rdma-next 07/10] cifs: smbd: Enable Relaxed Ordering

From: Avihai Horon <[email protected]>

Enable Relaxed Ordering for smbd.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
fs/cifs/smbdirect.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 647098a5cf3b..1e86dc8bbe85 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2178,8 +2178,10 @@ static void smbd_mr_recovery_work(struct work_struct *work)
continue;
}

- smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
- info->max_frmr_depth, 0);
+ smbdirect_mr->mr =
+ ib_alloc_mr(info->pd, info->mr_type,
+ info->max_frmr_depth,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(smbdirect_mr->mr)) {
log_rdma_mr(ERR, "ib_alloc_mr failed mr_type=%x max_frmr_depth=%x\n",
info->mr_type,
@@ -2244,7 +2246,8 @@ static int allocate_mr_list(struct smbd_connection *info)
if (!smbdirect_mr)
goto out;
smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
- info->max_frmr_depth, 0);
+ info->max_frmr_depth,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(smbdirect_mr->mr)) {
log_rdma_mr(ERR, "ib_alloc_mr failed mr_type=%x max_frmr_depth=%x\n",
info->mr_type, info->max_frmr_depth);
@@ -2406,9 +2409,10 @@ struct smbd_mr *smbd_register_mr(
reg_wr->wr.send_flags = IB_SEND_SIGNALED;
reg_wr->mr = smbdirect_mr->mr;
reg_wr->key = smbdirect_mr->mr->rkey;
- reg_wr->access = writing ?
- IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
- IB_ACCESS_REMOTE_READ;
+ reg_wr->access =
+ (writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+ IB_ACCESS_REMOTE_READ) |
+ IB_ACCESS_RELAXED_ORDERING;

/*
* There is no need for waiting for complemtion on ib_post_send
--
2.30.2

2021-04-05 05:25:41

by Leon Romanovsky

[permalink] [raw]
Subject: [PATCH rdma-next 08/10] net/rds: Enable Relaxed Ordering

From: Avihai Horon <[email protected]>

Enable Relaxed Ordering for rds.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
net/rds/ib_frmr.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 694eb916319e..1a60c2c90c78 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -76,7 +76,7 @@ static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,

frmr = &ibmr->u.frmr;
frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG,
- pool->max_pages, 0);
+ pool->max_pages, IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(frmr->mr)) {
pr_warn("RDS/IB: %s failed to allocate MR", __func__);
err = PTR_ERR(frmr->mr);
@@ -156,9 +156,8 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
reg_wr.wr.num_sge = 0;
reg_wr.mr = frmr->mr;
reg_wr.key = frmr->mr->rkey;
- reg_wr.access = IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_READ |
- IB_ACCESS_REMOTE_WRITE;
+ reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+ IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
reg_wr.wr.send_flags = IB_SEND_SIGNALED;

ret = ib_post_send(ibmr->ic->i_cm_id->qp, &reg_wr.wr, NULL);
--
2.30.2

2021-04-05 05:25:59

by Leon Romanovsky

[permalink] [raw]
Subject: [PATCH rdma-next 06/10] nvme-rdma: Enable Relaxed Ordering

From: Avihai Horon <[email protected]>

Enable Relaxed Ordering for nvme.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/nvme/host/rdma.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4dbc17311e0b..8f106b20b39c 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -532,9 +532,8 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
*/
pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev, queue->pi_support) + 1;
ret = ib_mr_pool_init(queue->qp, &queue->qp->rdma_mrs,
- queue->queue_size,
- IB_MR_TYPE_MEM_REG,
- pages_per_mr, 0, 0);
+ queue->queue_size, IB_MR_TYPE_MEM_REG,
+ pages_per_mr, 0, IB_ACCESS_RELAXED_ORDERING);
if (ret) {
dev_err(queue->ctrl->ctrl.device,
"failed to initialize MR pool sized %d for QID %d\n",
@@ -545,7 +544,8 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
if (queue->pi_support) {
ret = ib_mr_pool_init(queue->qp, &queue->qp->sig_mrs,
queue->queue_size, IB_MR_TYPE_INTEGRITY,
- pages_per_mr, pages_per_mr, 0);
+ pages_per_mr, pages_per_mr,
+ IB_ACCESS_RELAXED_ORDERING);
if (ret) {
dev_err(queue->ctrl->ctrl.device,
"failed to initialize PI MR pool sized %d for QID %d\n",
@@ -1382,9 +1382,9 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
req->reg_wr.wr.num_sge = 0;
req->reg_wr.mr = req->mr;
req->reg_wr.key = req->mr->rkey;
- req->reg_wr.access = IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_READ |
- IB_ACCESS_REMOTE_WRITE;
+ req->reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+ IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING;

sg->addr = cpu_to_le64(req->mr->iova);
put_unaligned_le24(req->mr->length, sg->length);
@@ -1488,9 +1488,8 @@ static int nvme_rdma_map_sg_pi(struct nvme_rdma_queue *queue,
wr->wr.send_flags = 0;
wr->mr = req->mr;
wr->key = req->mr->rkey;
- wr->access = IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_READ |
- IB_ACCESS_REMOTE_WRITE;
+ wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+ IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;

sg->addr = cpu_to_le64(req->mr->iova);
put_unaligned_le24(req->mr->length, sg->length);
--
2.30.2

2021-04-05 05:26:06

by Leon Romanovsky

[permalink] [raw]
Subject: [PATCH rdma-next 09/10] net/smc: Enable Relaxed Ordering

From: Avihai Horon <[email protected]>

Enable Relaxed Ordering for smc.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
net/smc/smc_ib.c | 3 ++-
net/smc/smc_wr.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 4e91ed3dc265..6b65c5d1f957 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -579,7 +579,8 @@ int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
return 0; /* already done */

buf_slot->mr_rx[link_idx] =
- ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order, 0);
+ ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(buf_slot->mr_rx[link_idx])) {
int rc;

diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index cbc73a7e4d59..78e9650621f1 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -555,7 +555,8 @@ static void smc_wr_init_sge(struct smc_link *lnk)
lnk->wr_reg.wr.num_sge = 0;
lnk->wr_reg.wr.send_flags = IB_SEND_SIGNALED;
lnk->wr_reg.wr.opcode = IB_WR_REG_MR;
- lnk->wr_reg.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
+ lnk->wr_reg.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING;
}

void smc_wr_free_link(struct smc_link *lnk)
--
2.30.2

2021-04-05 05:27:11

by Leon Romanovsky

[permalink] [raw]
Subject: [PATCH rdma-next 10/10] xprtrdma: Enable Relaxed Ordering

From: Avihai Horon <[email protected]>

Enable Relaxed Ordering for xprtrdma.

Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.

Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
net/sunrpc/xprtrdma/frwr_ops.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index cfbdd197cdfe..f9334c0a1a13 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -135,7 +135,8 @@ int frwr_mr_init(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr *mr)
struct ib_mr *frmr;
int rc;

- frmr = ib_alloc_mr(ep->re_pd, ep->re_mrtype, depth, 0);
+ frmr = ib_alloc_mr(ep->re_pd, ep->re_mrtype, depth,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(frmr))
goto out_mr_err;

@@ -339,9 +340,10 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
reg_wr = &mr->frwr.fr_regwr;
reg_wr->mr = ibmr;
reg_wr->key = ibmr->rkey;
- reg_wr->access = writing ?
- IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
- IB_ACCESS_REMOTE_READ;
+ reg_wr->access =
+ (writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+ IB_ACCESS_REMOTE_READ) |
+ IB_ACCESS_RELAXED_ORDERING;

mr->mr_handle = ibmr->rkey;
mr->mr_length = ibmr->length;
--
2.30.2

2021-04-05 20:44:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <[email protected]>
>
> >From Avihai,
>
> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> imposed on PCI transactions, and thus, can improve performance.
>
> Until now, relaxed ordering could be set only by user space applications
> for user MRs. The following patch series enables relaxed ordering for the
> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> such, it is ignored by vendors that don't support it.
>
> The following test results show the performance improvement achieved
> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> to check performance of storage infrastructure over xprtrdma:

Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
What does that have to do with storage protocols?

Also if you enable this for basically all kernel ULPs, why not have
an opt-out into strict ordering for the cases that need it (if there are
any).

2021-04-05 20:48:26

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <[email protected]>
> >
> > >From Avihai,
> >
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> >
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> >
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed on a NVIDIA A100 in order
> > to check performance of storage infrastructure over xprtrdma:
>
> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> What does that have to do with storage protocols?

This system is in use by our storage oriented customer who performed the
test. He runs drivers/infiniband/* stack from the upstream, simply backported
to specific kernel version.

The performance boost is seen in other systems too.

>
> Also if you enable this for basically all kernel ULPs, why not have
> an opt-out into strict ordering for the cases that need it (if there are
> any).

The RO property is optional, it can only improve. In addition, all in-kernel ULPs
don't need strict ordering. I can be mistaken here and Jason will correct me, it
is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it.

Thanks

2021-04-06 03:21:55

by Santosh Shilimkar

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs


> On Apr 5, 2021, at 7:08 AM, Leon Romanovsky <[email protected]> wrote:
>
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <[email protected]>
>>>
>>>> From Avihai,
>>>
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>>
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>>
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>>
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
>
> This system is in use by our storage oriented customer who performed the
> test. He runs drivers/infiniband/* stack from the upstream, simply backported
> to specific kernel version.
>
> The performance boost is seen in other systems too.
>
>>
>> Also if you enable this for basically all kernel ULPs, why not have
>> an opt-out into strict ordering for the cases that need it (if there are
>> any).
>
> The RO property is optional, it can only improve. In addition, all in-kernel ULPs
> don't need strict ordering. I can be mistaken here and Jason will correct me, it
> is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it.
>

Sticking to in-kernel ULP’s don’t need strict ordering, why don’t you enable this
for HCA’s which supports it by default instead of patching very ULPs. Some ULPs
in future may forget to specify such flag.

Regards,
Santosh

2021-04-06 07:14:08

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On 4/5/2021 10:08 AM, Leon Romanovsky wrote:
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <[email protected]>
>>>
>>> >From Avihai,
>>>
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>>
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>>
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>>
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
>
> This system is in use by our storage oriented customer who performed the
> test. He runs drivers/infiniband/* stack from the upstream, simply backported
> to specific kernel version.
>
> The performance boost is seen in other systems too.

We need to see more information about this test, and platform.

What correctness testing was done, and how was it verified? What
PCI bus type(s) were tested, and with what adapters? What storage
workload was generated, and were all possible RDMA exchanges by
each ULP exercised?

>> Also if you enable this for basically all kernel ULPs, why not have
>> an opt-out into strict ordering for the cases that need it (if there are
>> any).
>
> The RO property is optional, it can only improve. In addition, all in-kernel ULPs
> don't need strict ordering. I can be mistaken here and Jason will correct me, it
> is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it.

+1 on Christoph's comment.

I woud hope most well-architected ULPs will support relaxed ordering,
but storage workloads, in my experience, can find ways to cause failure
in adapters. I would not suggest making this the default behavior
without extensive testing.

Tom.

2021-04-06 07:16:56

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()

On 4/5/2021 1:23 AM, Leon Romanovsky wrote:
> From: Avihai Horon <[email protected]>
>
> Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
> local_dma_lkey.
>
> This will take effect only for devices that don't pre-allocate the lkey
> but allocate it per PD allocation.
>
> Signed-off-by: Avihai Horon <[email protected]>
> Reviewed-by: Michael Guralnik <[email protected]>
> Signed-off-by: Leon Romanovsky <[email protected]>
> ---
> drivers/infiniband/core/verbs.c | 3 ++-
> drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index a1782f8a6ca0..9b719f7d6fd5 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
> if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
> pd->local_dma_lkey = device->local_dma_lkey;
> else
> - mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
> + mr_access_flags |=
> + IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;

So, do local_dma_lkey's get relaxed ordering unconditionally?

> if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
> pr_warn("%s: enabling unsafe global rkey\n", caller);
> diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> index b3fa783698a0..d74827694f92 100644
> --- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> +++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> @@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
> int ret;
>
> /* Support only LOCAL_WRITE flag for DMA MRs */
> + acc &= ~IB_ACCESS_RELAXED_ORDERING;
> if (acc & ~IB_ACCESS_LOCAL_WRITE) {
> dev_warn(&dev->pdev->dev,
> "unsupported dma mr access flags %#x\n", acc);

Why does the pvrdma driver require relaxed ordering to be off?

Tom.

2021-04-06 07:48:08

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <[email protected]>
> >
> > >From Avihai,
> >
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> >
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> >
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed on a NVIDIA A100 in order
> > to check performance of storage infrastructure over xprtrdma:
>
> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> What does that have to do with storage protocols?

I think it is a typo (or at least mit makes no sense to be talking
about NFS with a GPU chip) Probably it should be a DGX A100 which is a
dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
workload.

AMD dual socket systems are well known to benefit from relaxed
ordering, people have been doing this in userspace for a while now
with the opt in.

What surprises me is the performance difference, I hadn't heard it is
4x!

Jason

2021-04-06 09:15:10

by Adit Ranadive

[permalink] [raw]
Subject: Re: [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()

On 4/5/21 11:01 AM, Tom Talpey wrote:
> On 4/5/2021 1:23 AM, Leon Romanovsky wrote:
>> From: Avihai Horon <[email protected]>
>>
>> Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
>> local_dma_lkey.
>>
>> This will take effect only for devices that don't pre-allocate the lkey
>> but allocate it per PD allocation.
>>
>> Signed-off-by: Avihai Horon <[email protected]>
>> Reviewed-by: Michael Guralnik <[email protected]>
>> Signed-off-by: Leon Romanovsky <[email protected]>
>> ---
>>   drivers/infiniband/core/verbs.c              | 3 ++-
>>   drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
>>   2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
>> index a1782f8a6ca0..9b719f7d6fd5 100644
>> --- a/drivers/infiniband/core/verbs.c
>> +++ b/drivers/infiniband/core/verbs.c
>> @@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
>>       if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
>>           pd->local_dma_lkey = device->local_dma_lkey;
>>       else
>> -        mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
>> +        mr_access_flags |=
>> +            IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
>
> So, do local_dma_lkey's get relaxed ordering unconditionally?
>
>>       if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
>>           pr_warn("%s: enabling unsafe global rkey\n", caller);
>> diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
>> index b3fa783698a0..d74827694f92 100644
>> --- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
>> +++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
>> @@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
>>       int ret;
>>         /* Support only LOCAL_WRITE flag for DMA MRs */
>> +    acc &= ~IB_ACCESS_RELAXED_ORDERING;
>>       if (acc & ~IB_ACCESS_LOCAL_WRITE) {
>>           dev_warn(&dev->pdev->dev,
>>                "unsupported dma mr access flags %#x\n", acc);
>
> Why does the pvrdma driver require relaxed ordering to be off?

PVRDMA doesn't support any other flags other than LOCAL_WRITE for
DMA MRs so the MR creation will fail if any new unconditionally added
flag isn't cleared.

2021-04-06 11:20:54

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs



> On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <[email protected]>
>>>
>>>> From Avihai,
>>>
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>>
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>>
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>>
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
>
> I think it is a typo (or at least mit makes no sense to be talking
> about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> workload.

We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.

I have an old Haswell dual-socket system in my lab, but otherwise
I'm not sure I have a platform that would be interesting for such a
test.


> AMD dual socket systems are well known to benefit from relaxed
> ordering, people have been doing this in userspace for a while now
> with the opt in.


--
Chuck Lever



2021-04-06 11:21:31

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <[email protected]> wrote:
> > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <[email protected]>
> >>>
> >>>> From Avihai,
> >>>
> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> >>> imposed on PCI transactions, and thus, can improve performance.
> >>>
> >>> Until now, relaxed ordering could be set only by user space applications
> >>> for user MRs. The following patch series enables relaxed ordering for the
> >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> >>> such, it is ignored by vendors that don't support it.
> >>>
> >>> The following test results show the performance improvement achieved
> >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> >>> to check performance of storage infrastructure over xprtrdma:
> >>
> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> >> What does that have to do with storage protocols?
> >
> > I think it is a typo (or at least mit makes no sense to be talking
> > about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> > dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> > workload.
>
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.
>
> I have an old Haswell dual-socket system in my lab, but otherwise
> I'm not sure I have a platform that would be interesting for such a
> test.

Not sure if Haswell will be useful for such testing. It looks like many
of those subscribe to 'quirk_relaxedordering_disable'.

2021-04-06 12:56:41

by Honggang LI

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <[email protected]>
>
> From Avihai,
>
> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> imposed on PCI transactions, and thus, can improve performance.
>
> Until now, relaxed ordering could be set only by user space applications
> for user MRs. The following patch series enables relaxed ordering for the
> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> such, it is ignored by vendors that don't support it.
>
> The following test results show the performance improvement achieved

Did you test this patchset with CPU does not support relaxed ordering?

We observed significantly performance degradation when run perftest with
relaxed ordering enabled over old CPU.

https://github.com/linux-rdma/perftest/issues/116

thanks

2021-04-06 13:19:41

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <[email protected]>
> >
> > From Avihai,
> >
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> >
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> >
> > The following test results show the performance improvement achieved
>
> Did you test this patchset with CPU does not support relaxed ordering?

I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation
and they are not interesting from performance point of view.

>
> We observed significantly performance degradation when run perftest with
> relaxed ordering enabled over old CPU.
>
> https://github.com/linux-rdma/perftest/issues/116

The perftest is slightly different, but you pointed to the valid point.
We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit
and arguably this was needed to be done in perftest too.

Thanks

>
> thanks
>

2021-04-06 13:22:36

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>
>
> > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <[email protected]> wrote:
> >
> > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <[email protected]>
> >>>
> >>>> From Avihai,
> >>>
> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> >>> imposed on PCI transactions, and thus, can improve performance.
> >>>
> >>> Until now, relaxed ordering could be set only by user space applications
> >>> for user MRs. The following patch series enables relaxed ordering for the
> >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> >>> such, it is ignored by vendors that don't support it.
> >>>
> >>> The following test results show the performance improvement achieved
> >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> >>> to check performance of storage infrastructure over xprtrdma:
> >>
> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> >> What does that have to do with storage protocols?
> >
> > I think it is a typo (or at least mit makes no sense to be talking
> > about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> > dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> > workload.
>
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.

I will ask to provide more details.

>
> I have an old Haswell dual-socket system in my lab, but otherwise
> I'm not sure I have a platform that would be interesting for such a
> test.

We don't have such old systems too.

>
>
> > AMD dual socket systems are well known to benefit from relaxed
> > ordering, people have been doing this in userspace for a while now
> > with the opt in.
>
>
> --
> Chuck Lever
>
>
>

2021-04-06 13:36:59

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH rdma-next 02/10] RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()

On Mon, Apr 05, 2021 at 02:01:16PM -0400, Tom Talpey wrote:
> On 4/5/2021 1:23 AM, Leon Romanovsky wrote:
> > From: Avihai Horon <[email protected]>
> >
> > Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
> > local_dma_lkey.
> >
> > This will take effect only for devices that don't pre-allocate the lkey
> > but allocate it per PD allocation.
> >
> > Signed-off-by: Avihai Horon <[email protected]>
> > Reviewed-by: Michael Guralnik <[email protected]>
> > Signed-off-by: Leon Romanovsky <[email protected]>
> > ---
> > drivers/infiniband/core/verbs.c | 3 ++-
> > drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
> > 2 files changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > index a1782f8a6ca0..9b719f7d6fd5 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
> > if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
> > pd->local_dma_lkey = device->local_dma_lkey;
> > else
> > - mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
> > + mr_access_flags |=
> > + IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
>
> So, do local_dma_lkey's get relaxed ordering unconditionally?

Yes, in mlx5, this lkey is created on the fly.

Thanks

2021-04-06 21:00:53

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:

> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.

RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. eg the mlx5 ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.

We know there are platforms with broken RO implementations (like
Haswell) but the kernel is supposed to globally turn off RO on all
those cases. I'd be a bit surprised if we discover any more from this
series.

On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.

Still, obviously people should test on the platforms they have.

Jason

2021-04-06 21:02:47

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote:
> On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
> > On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <[email protected]>
> > >
> > > From Avihai,
> > >
> > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > > imposed on PCI transactions, and thus, can improve performance.
> > >
> > > Until now, relaxed ordering could be set only by user space applications
> > > for user MRs. The following patch series enables relaxed ordering for the
> > > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > > such, it is ignored by vendors that don't support it.
> > >
> > > The following test results show the performance improvement achieved
> >
> > Did you test this patchset with CPU does not support relaxed ordering?
>
> I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation
> and they are not interesting from performance point of view.
>
> >
> > We observed significantly performance degradation when run perftest with
> > relaxed ordering enabled over old CPU.
> >
> > https://github.com/linux-rdma/perftest/issues/116
>
> The perftest is slightly different, but you pointed to the valid point.
> We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit
> and arguably this was needed to be done in perftest too.

No, the PCI device should not have the RO bit set in this situation.
It is something mlx5_core needs to do. We can't push this into
applications.

There should be no performance difference from asking for
IBV_ACCESS_RELAXED_ORDERING when RO is disabled at the PCI config and
not asking for it at all.

Either the platform has working relaxed ordering that gives a
performance gain and the RO config spec bit should be set, or it
doesn't and the bit should be clear.

This is not something to decide in userspace, or in RDMA. At worst it
becomes another platform specific PCI tunable people have to set.

I thought the old haswell systems were quirked to disable RO globally
anyhow?

Jason

2021-04-09 14:47:51

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs



> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
>
> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>
>>> We need to get a better idea what correctness testing has been done,
>>> and whether positive correctness testing results can be replicated
>>> on a variety of platforms.
>> RO has been rolling out slowly on mlx5 over a few years and storage
>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>> turned on for a long time, userspace HPC applications have been using
>> it for a while now too.
>
> I'd love to see RO be used more, it was always something the RDMA
> specs supported and carefully architected for. My only concern is
> that it's difficult to get right, especially when the platforms
> have been running strictly-ordered for so long. The ULPs need
> testing, and a lot of it.
>
>> We know there are platforms with broken RO implementations (like
>> Haswell) but the kernel is supposed to globally turn off RO on all
>> those cases. I'd be a bit surprised if we discover any more from this
>> series.
>> On the other hand there are platforms that get huge speed ups from
>> turning this on, AMD is one example, there are a bunch in the ARM
>> world too.
>
> My belief is that the biggest risk is from situations where completions
> are batched, and therefore polling is used to detect them without
> interrupts (which explicitly). The RO pipeline will completely reorder
> DMA writes, and consumers which infer ordering from memory contents may
> break. This can even apply within the provider code, which may attempt
> to poll WR and CQ structures, and be tripped up.

You are referring specifically to RPC/RDMA depending on Receive
completions to guarantee that previous RDMA Writes have been
retired? Or is there a particular implementation practice in
the Linux RPC/RDMA code that worries you?


> The Mellanox adapter, itself, historically has strict in-order DMA
> semantics, and while it's great to relax that, changing it by default
> for all consumers is something to consider very cautiously.
>
>> Still, obviously people should test on the platforms they have.
>
> Yes, and "test" be taken seriously with focus on ULP data integrity.
> Speedups will mean nothing if the data is ever damaged.

I agree that data integrity comes first.

Since I currently don't have facilities to test RO in my lab, the
community will have to agree on a set of tests and expected results
that specifically exercise the corner cases you are concerned about.


--
Chuck Lever



2021-04-09 15:33:26

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>
>
>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
>>
>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>
>>>> We need to get a better idea what correctness testing has been done,
>>>> and whether positive correctness testing results can be replicated
>>>> on a variety of platforms.
>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>> turned on for a long time, userspace HPC applications have been using
>>> it for a while now too.
>>
>> I'd love to see RO be used more, it was always something the RDMA
>> specs supported and carefully architected for. My only concern is
>> that it's difficult to get right, especially when the platforms
>> have been running strictly-ordered for so long. The ULPs need
>> testing, and a lot of it.
>>
>>> We know there are platforms with broken RO implementations (like
>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>> those cases. I'd be a bit surprised if we discover any more from this
>>> series.
>>> On the other hand there are platforms that get huge speed ups from
>>> turning this on, AMD is one example, there are a bunch in the ARM
>>> world too.
>>
>> My belief is that the biggest risk is from situations where completions
>> are batched, and therefore polling is used to detect them without
>> interrupts (which explicitly). The RO pipeline will completely reorder
>> DMA writes, and consumers which infer ordering from memory contents may
>> break. This can even apply within the provider code, which may attempt
>> to poll WR and CQ structures, and be tripped up.
>
> You are referring specifically to RPC/RDMA depending on Receive
> completions to guarantee that previous RDMA Writes have been
> retired? Or is there a particular implementation practice in
> the Linux RPC/RDMA code that worries you?

Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
is hopefully unfounded, is that the RO pipeline might not have flushed
when a completion is posted *after* posting an interrupt.

Something like this...

RDMA Write arrives
PCIe RO Write for data
PCIe RO Write for data
...
RDMA Write arrives
PCIe RO Write for data
...
RDMA Send arrives
PCIe RO Write for receive data
PCIe RO Write for receive descriptor
PCIe interrupt (flushes RO pipeline for all three ops above)

RPC/RDMA polls CQ
Reaps receive completion

RDMA Send arrives
PCIe RO Write for receive data
PCIe RO write for receive descriptor
Does *not* interrupt, since CQ not armed

RPC/RDMA continues to poll CQ
Reaps receive completion
PCIe RO writes not yet flushed
Processes incomplete in-memory data
Bzzzt

Hopefully, the adapter performs a PCIe flushing read, or something
to avoid this when an interrupt is not generated. Alternatively, I'm
overly paranoid.

Tom.

>> The Mellanox adapter, itself, historically has strict in-order DMA
>> semantics, and while it's great to relax that, changing it by default
>> for all consumers is something to consider very cautiously.
>>
>>> Still, obviously people should test on the platforms they have.
>>
>> Yes, and "test" be taken seriously with focus on ULP data integrity.
>> Speedups will mean nothing if the data is ever damaged.
>
> I agree that data integrity comes first.
>
> Since I currently don't have facilities to test RO in my lab, the
> community will have to agree on a set of tests and expected results
> that specifically exercise the corner cases you are concerned about.
>
>
> --
> Chuck Lever
>
>
>
>

2021-04-09 16:29:38

by Haakon Bugge

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs



> On 9 Apr 2021, at 17:32, Tom Talpey <[email protected]> wrote:
>
> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
>>>
>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>
>>>>> We need to get a better idea what correctness testing has been done,
>>>>> and whether positive correctness testing results can be replicated
>>>>> on a variety of platforms.
>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>> turned on for a long time, userspace HPC applications have been using
>>>> it for a while now too.
>>>
>>> I'd love to see RO be used more, it was always something the RDMA
>>> specs supported and carefully architected for. My only concern is
>>> that it's difficult to get right, especially when the platforms
>>> have been running strictly-ordered for so long. The ULPs need
>>> testing, and a lot of it.
>>>
>>>> We know there are platforms with broken RO implementations (like
>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>> series.
>>>> On the other hand there are platforms that get huge speed ups from
>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>> world too.
>>>
>>> My belief is that the biggest risk is from situations where completions
>>> are batched, and therefore polling is used to detect them without
>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>> DMA writes, and consumers which infer ordering from memory contents may
>>> break. This can even apply within the provider code, which may attempt
>>> to poll WR and CQ structures, and be tripped up.
>> You are referring specifically to RPC/RDMA depending on Receive
>> completions to guarantee that previous RDMA Writes have been
>> retired? Or is there a particular implementation practice in
>> the Linux RPC/RDMA code that worries you?
>
> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
> is hopefully unfounded, is that the RO pipeline might not have flushed
> when a completion is posted *after* posting an interrupt.
>
> Something like this...
>
> RDMA Write arrives
> PCIe RO Write for data
> PCIe RO Write for data
> ...
> RDMA Write arrives
> PCIe RO Write for data
> ...
> RDMA Send arrives
> PCIe RO Write for receive data
> PCIe RO Write for receive descriptor

Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will shure prior written RO date has global visibility when the CQE can be observed.



> PCIe interrupt (flushes RO pipeline for all three ops above)

Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility.

And the MSI-X write likewise, to avoid spurious interrupts.



Thxs, Håkon

>
> RPC/RDMA polls CQ
> Reaps receive completion
>
> RDMA Send arrives
> PCIe RO Write for receive data
> PCIe RO write for receive descriptor
> Does *not* interrupt, since CQ not armed
>
> RPC/RDMA continues to poll CQ
> Reaps receive completion
> PCIe RO writes not yet flushed
> Processes incomplete in-memory data
> Bzzzt
>
> Hopefully, the adapter performs a PCIe flushing read, or something
> to avoid this when an interrupt is not generated. Alternatively, I'm
> overly paranoid.
>
> Tom.
>
>>> The Mellanox adapter, itself, historically has strict in-order DMA
>>> semantics, and while it's great to relax that, changing it by default
>>> for all consumers is something to consider very cautiously.
>>>
>>>> Still, obviously people should test on the platforms they have.
>>>
>>> Yes, and "test" be taken seriously with focus on ULP data integrity.
>>> Speedups will mean nothing if the data is ever damaged.
>> I agree that data integrity comes first.
>> Since I currently don't have facilities to test RO in my lab, the
>> community will have to agree on a set of tests and expected results
>> that specifically exercise the corner cases you are concerned about.
>> --
>> Chuck Lever

2021-04-09 16:41:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote:

> My belief is that the biggest risk is from situations where completions
> are batched, and therefore polling is used to detect them without
> interrupts (which explicitly).

We don't do this in the kernel.

All kernel ULPs only read data after they observe the CQE. We do not
have "last data polling" and our interrupt model does not support some
hacky "interrupt means go and use the data" approach.

ULPs have to be designed this way to use the DMA API properly.
Fencing a DMA before it is completed by the HW will cause IOMMU
errors.

Userspace is a different story, but that will remain as-is with
optional relaxed ordering.

Jason

2021-04-09 17:44:51

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On 4/9/2021 12:40 PM, Jason Gunthorpe wrote:
> On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote:
>
>> My belief is that the biggest risk is from situations where completions
>> are batched, and therefore polling is used to detect them without
>> interrupts (which explicitly).
>
> We don't do this in the kernel.
>
> All kernel ULPs only read data after they observe the CQE. We do not
> have "last data polling" and our interrupt model does not support some
> hacky "interrupt means go and use the data" approach.
>
> ULPs have to be designed this way to use the DMA API properly.

Yep. Totally agree.

My concern was about the data being written as relaxed, and the CQE
racing it. I'll reply in the other fork.


> Fencing a DMA before it is completed by the HW will cause IOMMU
> errors.
>
> Userspace is a different story, but that will remain as-is with
> optional relaxed ordering.
>
> Jason
>

2021-04-09 17:49:45

by Tom Talpey

[permalink] [raw]
Subject: Re: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

On 4/9/2021 12:27 PM, Haakon Bugge wrote:
>
>
>> On 9 Apr 2021, at 17:32, Tom Talpey <[email protected]> wrote:
>>
>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
>>>>
>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>
>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>> and whether positive correctness testing results can be replicated
>>>>>> on a variety of platforms.
>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>> turned on for a long time, userspace HPC applications have been using
>>>>> it for a while now too.
>>>>
>>>> I'd love to see RO be used more, it was always something the RDMA
>>>> specs supported and carefully architected for. My only concern is
>>>> that it's difficult to get right, especially when the platforms
>>>> have been running strictly-ordered for so long. The ULPs need
>>>> testing, and a lot of it.
>>>>
>>>>> We know there are platforms with broken RO implementations (like
>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>> series.
>>>>> On the other hand there are platforms that get huge speed ups from
>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>> world too.
>>>>
>>>> My belief is that the biggest risk is from situations where completions
>>>> are batched, and therefore polling is used to detect them without
>>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>> break. This can even apply within the provider code, which may attempt
>>>> to poll WR and CQ structures, and be tripped up.
>>> You are referring specifically to RPC/RDMA depending on Receive
>>> completions to guarantee that previous RDMA Writes have been
>>> retired? Or is there a particular implementation practice in
>>> the Linux RPC/RDMA code that worries you?
>>
>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>> is hopefully unfounded, is that the RO pipeline might not have flushed
>> when a completion is posted *after* posting an interrupt.
>>
>> Something like this...
>>
>> RDMA Write arrives
>> PCIe RO Write for data
>> PCIe RO Write for data
>> ...
>> RDMA Write arrives
>> PCIe RO Write for data
>> ...
>> RDMA Send arrives
>> PCIe RO Write for receive data
>> PCIe RO Write for receive descriptor
>
> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will shure prior written RO date has global visibility when the CQE can be observed.

I wasn't aware that a strongly-ordered PCIe Write will ensure that
prior relaxed-ordered writes went first. If that's the case, I'm
fine with it - as long as the providers are correctly coded!!

>> PCIe interrupt (flushes RO pipeline for all three ops above)
>
> Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility.
>
> And the MSI-X write likewise, to avoid spurious interrupts.

Ok, and yes agreed the same principle would apply.

Is there any implication if a PCIe switch were present on the
motherboard? The switch is allowed to do some creative routing
if the operation is relaxed, correct?

Tom.

> Thxs, Håkon
>
>>
>> RPC/RDMA polls CQ
>> Reaps receive completion
>>
>> RDMA Send arrives
>> PCIe RO Write for receive data
>> PCIe RO write for receive descriptor
>> Does *not* interrupt, since CQ not armed
>>
>> RPC/RDMA continues to poll CQ
>> Reaps receive completion
>> PCIe RO writes not yet flushed
>> Processes incomplete in-memory data
>> Bzzzt
>>
>> Hopefully, the adapter performs a PCIe flushing read, or something
>> to avoid this when an interrupt is not generated. Alternatively, I'm
>> overly paranoid.
>>
>> Tom.
>>
>>>> The Mellanox adapter, itself, historically has strict in-order DMA
>>>> semantics, and while it's great to relax that, changing it by default
>>>> for all consumers is something to consider very cautiously.
>>>>
>>>>> Still, obviously people should test on the platforms they have.
>>>>
>>>> Yes, and "test" be taken seriously with focus on ULP data integrity.
>>>> Speedups will mean nothing if the data is ever damaged.
>>> I agree that data integrity comes first.
>>> Since I currently don't have facilities to test RO in my lab, the
>>> community will have to agree on a set of tests and expected results
>>> that specifically exercise the corner cases you are concerned about.
>>> --
>>> Chuck Lever
>

2021-04-10 13:31:04

by David Laight

[permalink] [raw]
Subject: RE: [PATCH rdma-next 00/10] Enable relaxed ordering for ULPs

From: Tom Talpey
> Sent: 09 April 2021 18:49
> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
> >
> >
> >> On 9 Apr 2021, at 17:32, Tom Talpey <[email protected]> wrote:
> >>
> >> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
> >>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
> >>>>
> >>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
> >>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> >>>>>
> >>>>>> We need to get a better idea what correctness testing has been done,
> >>>>>> and whether positive correctness testing results can be replicated
> >>>>>> on a variety of platforms.
> >>>>> RO has been rolling out slowly on mlx5 over a few years and storage
> >>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
> >>>>> turned on for a long time, userspace HPC applications have been using
> >>>>> it for a while now too.
> >>>>
> >>>> I'd love to see RO be used more, it was always something the RDMA
> >>>> specs supported and carefully architected for. My only concern is
> >>>> that it's difficult to get right, especially when the platforms
> >>>> have been running strictly-ordered for so long. The ULPs need
> >>>> testing, and a lot of it.
> >>>>
> >>>>> We know there are platforms with broken RO implementations (like
> >>>>> Haswell) but the kernel is supposed to globally turn off RO on all
> >>>>> those cases. I'd be a bit surprised if we discover any more from this
> >>>>> series.
> >>>>> On the other hand there are platforms that get huge speed ups from
> >>>>> turning this on, AMD is one example, there are a bunch in the ARM
> >>>>> world too.
> >>>>
> >>>> My belief is that the biggest risk is from situations where completions
> >>>> are batched, and therefore polling is used to detect them without
> >>>> interrupts (which explicitly). The RO pipeline will completely reorder
> >>>> DMA writes, and consumers which infer ordering from memory contents may
> >>>> break. This can even apply within the provider code, which may attempt
> >>>> to poll WR and CQ structures, and be tripped up.
> >>> You are referring specifically to RPC/RDMA depending on Receive
> >>> completions to guarantee that previous RDMA Writes have been
> >>> retired? Or is there a particular implementation practice in
> >>> the Linux RPC/RDMA code that worries you?
> >>
> >> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
> >> is hopefully unfounded, is that the RO pipeline might not have flushed
> >> when a completion is posted *after* posting an interrupt.
> >>
> >> Something like this...
> >>
> >> RDMA Write arrives
> >> PCIe RO Write for data
> >> PCIe RO Write for data
> >> ...
> >> RDMA Write arrives
> >> PCIe RO Write for data
> >> ...
> >> RDMA Send arrives
> >> PCIe RO Write for receive data
> >> PCIe RO Write for receive descriptor
> >
> > Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
> it will shure prior written RO date has global visibility when the CQE can be observed.
>
> I wasn't aware that a strongly-ordered PCIe Write will ensure that
> prior relaxed-ordered writes went first. If that's the case, I'm
> fine with it - as long as the providers are correctly coded!!

I remember trying to read the relevant section of the PCIe spec.
(Possibly in a book that was trying to make it easier to understand!)
It is about as clear as mud.

I presume this is all about allowing PCIe targets (eg ethernet cards)
to use relaxed ordering on write requests to host memory.
And that such writes can be completed out of order?

It isn't entirely clear that you aren't talking of letting the
cpu do 'relaxed order' writes to PCIe targets!

For a typical ethernet driver the receive interrupt just means
'go and look at the receive descriptor ring'.
So there is an absolute requirement that the writes for data
buffer complete before the write to the receive descriptor.
There is no requirement for the interrupt (requested after the
descriptor write) to have been seen by the cpu.

Quite often the driver will find the 'receive complete'
descriptor when processing frames from an earlier interrupt
(and nothing to do in response to the interrupt itself).

So the write to the receive descriptor would have to have RO clear
to ensure that all the buffer writes complete first.

(The furthest I've got into PCIe internals was fixing the bug
in some vendor-supplied FPGA logic that failed to correctly
handle multiple data TLP responses to a single read TLP.
Fortunately it wasn't in the hard-IP bit.)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)