From: Leon Romanovsky <[email protected]>
From Avihai,
Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
imposed on PCI transactions, and thus, can improve performance.
Until now, relaxed ordering could be set only by user space applications
for user MRs. The following patch series enables relaxed ordering for the
kernel ULPs as well. Relaxed ordering is an optional capability, and as
such, it is ignored by vendors that don't support it.
The following test results show the performance improvement achieved
with relaxed ordering. The test was performed on a NVIDIA A100 in order
to check performance of storage infrastructure over xprtrdma:
Without Relaxed Ordering:
READ: bw=16.5GiB/s (17.7GB/s), 16.5GiB/s-16.5GiB/s (17.7GB/s-17.7GB/s),
io=1987GiB (2133GB), run=120422-120422msec
With relaxed ordering:
READ: bw=72.9GiB/s (78.2GB/s), 72.9GiB/s-72.9GiB/s (78.2GB/s-78.2GB/s),
io=2367GiB (2542GB), run=32492-32492msec
Thanks
Avihai Horon (10):
RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()
RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd()
RDMA/iser: Enable Relaxed Ordering
RDMA/rtrs: Enable Relaxed Ordering
RDMA/srp: Enable Relaxed Ordering
nvme-rdma: Enable Relaxed Ordering
cifs: smbd: Enable Relaxed Ordering
net/rds: Enable Relaxed Ordering
net/smc: Enable Relaxed Ordering
xprtrdma: Enable Relaxed Ordering
drivers/infiniband/core/mr_pool.c | 7 +-
drivers/infiniband/core/rw.c | 12 ++--
drivers/infiniband/core/verbs.c | 26 +++++--
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 2 +-
drivers/infiniband/hw/bnxt_re/ib_verbs.h | 2 +-
drivers/infiniband/hw/cxgb4/iw_cxgb4.h | 2 +-
drivers/infiniband/hw/cxgb4/mem.c | 2 +-
drivers/infiniband/hw/hns/hns_roce_device.h | 2 +-
drivers/infiniband/hw/hns/hns_roce_mr.c | 2 +-
drivers/infiniband/hw/i40iw/i40iw_verbs.c | 3 +-
drivers/infiniband/hw/mlx4/mlx4_ib.h | 2 +-
drivers/infiniband/hw/mlx4/mr.c | 2 +-
drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 ++--
drivers/infiniband/hw/mlx5/mr.c | 61 ++++++++--------
drivers/infiniband/hw/mlx5/wr.c | 69 ++++++++++++++-----
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +-
drivers/infiniband/hw/ocrdma/ocrdma_verbs.h | 2 +-
drivers/infiniband/hw/qedr/verbs.c | 2 +-
drivers/infiniband/hw/qedr/verbs.h | 2 +-
drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 4 +-
.../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h | 2 +-
drivers/infiniband/sw/rdmavt/mr.c | 3 +-
drivers/infiniband/sw/rdmavt/mr.h | 2 +-
drivers/infiniband/sw/rxe/rxe_verbs.c | 2 +-
drivers/infiniband/sw/siw/siw_verbs.c | 2 +-
drivers/infiniband/sw/siw/siw_verbs.h | 2 +-
drivers/infiniband/ulp/iser/iser_memory.c | 10 ++-
drivers/infiniband/ulp/iser/iser_verbs.c | 6 +-
drivers/infiniband/ulp/rtrs/rtrs-clt.c | 6 +-
drivers/infiniband/ulp/rtrs/rtrs-srv.c | 15 ++--
drivers/infiniband/ulp/srp/ib_srp.c | 8 +--
drivers/nvme/host/rdma.c | 19 +++--
fs/cifs/smbdirect.c | 17 +++--
include/rdma/ib_verbs.h | 11 ++-
include/rdma/mr_pool.h | 3 +-
net/rds/ib_frmr.c | 7 +-
net/smc/smc_ib.c | 3 +-
net/smc/smc_wr.c | 3 +-
net/sunrpc/xprtrdma/frwr_ops.c | 10 +--
39 files changed, 209 insertions(+), 140 deletions(-)
--
2.30.2
From: Avihai Horon <[email protected]>
Enable Relaxed Ordering for iser.
Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.
Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/ulp/iser/iser_memory.c | 10 ++++------
drivers/infiniband/ulp/iser/iser_verbs.c | 6 ++++--
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index afec40da9b58..29849c9e82e8 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -271,9 +271,8 @@ iser_reg_sig_mr(struct iscsi_iser_task *iser_task,
wr->wr.send_flags = 0;
wr->mr = mr;
wr->key = mr->rkey;
- wr->access = IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_READ |
- IB_ACCESS_REMOTE_WRITE;
+ wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+ IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
rsc->mr_valid = 1;
sig_reg->sge.lkey = mr->lkey;
@@ -318,9 +317,8 @@ static int iser_fast_reg_mr(struct iscsi_iser_task *iser_task,
wr->wr.num_sge = 0;
wr->mr = mr;
wr->key = mr->rkey;
- wr->access = IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE |
- IB_ACCESS_REMOTE_READ;
+ wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_REMOTE_READ | IB_ACCESS_RELAXED_ORDERING;
rsc->mr_valid = 1;
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 3c370ee25f2f..1e236b1cf29b 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -121,7 +121,8 @@ iser_create_fastreg_desc(struct iser_device *device,
else
mr_type = IB_MR_TYPE_MEM_REG;
- desc->rsc.mr = ib_alloc_mr(pd, mr_type, size, 0);
+ desc->rsc.mr = ib_alloc_mr(pd, mr_type, size,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(desc->rsc.mr)) {
ret = PTR_ERR(desc->rsc.mr);
iser_err("Failed to allocate ib_fast_reg_mr err=%d\n", ret);
@@ -129,7 +130,8 @@ iser_create_fastreg_desc(struct iser_device *device,
}
if (pi_enable) {
- desc->rsc.sig_mr = ib_alloc_mr_integrity(pd, size, size, 0);
+ desc->rsc.sig_mr = ib_alloc_mr_integrity(pd, size, size,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(desc->rsc.sig_mr)) {
ret = PTR_ERR(desc->rsc.sig_mr);
iser_err("Failed to allocate sig_mr err=%d\n", ret);
--
2.30.2
From: Avihai Horon <[email protected]>
Enable Relaxed Ordering fro rtrs client and server.
Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.
Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/ulp/rtrs/rtrs-clt.c | 6 ++++--
drivers/infiniband/ulp/rtrs/rtrs-srv.c | 15 ++++++++-------
2 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-clt.c b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
index 0d3960ed5b2b..a3fbb47a3574 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c
@@ -1099,7 +1099,8 @@ static int rtrs_clt_read_req(struct rtrs_clt_io_req *req)
.mr = req->mr,
.key = req->mr->rkey,
.access = (IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE),
+ IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING),
};
wr = &rwr.wr;
@@ -1260,7 +1261,8 @@ static int alloc_sess_reqs(struct rtrs_clt_sess *sess)
goto out;
req->mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
- sess->max_pages_per_mr, 0);
+ sess->max_pages_per_mr,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(req->mr)) {
err = PTR_ERR(req->mr);
req->mr = NULL;
diff --git a/drivers/infiniband/ulp/rtrs/rtrs-srv.c b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
index 575f31ff20fd..c28ed5e2245d 100644
--- a/drivers/infiniband/ulp/rtrs/rtrs-srv.c
+++ b/drivers/infiniband/ulp/rtrs/rtrs-srv.c
@@ -312,8 +312,8 @@ static int rdma_write_sg(struct rtrs_srv_op *id)
rwr.mr = srv_mr->mr;
rwr.wr.send_flags = 0;
rwr.key = srv_mr->mr->rkey;
- rwr.access = (IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE);
+ rwr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING);
msg = srv_mr->iu->buf;
msg->buf_id = cpu_to_le16(id->msg_id);
msg->type = cpu_to_le16(RTRS_MSG_RKEY_RSP);
@@ -432,8 +432,8 @@ static int send_io_resp_imm(struct rtrs_srv_con *con, struct rtrs_srv_op *id,
rwr.wr.send_flags = 0;
rwr.mr = srv_mr->mr;
rwr.key = srv_mr->mr->rkey;
- rwr.access = (IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE);
+ rwr.access = (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING);
msg = srv_mr->iu->buf;
msg->buf_id = cpu_to_le16(id->msg_id);
msg->type = cpu_to_le16(RTRS_MSG_RKEY_RSP);
@@ -638,7 +638,7 @@ static int map_cont_bufs(struct rtrs_srv_sess *sess)
goto free_sg;
}
mr = ib_alloc_mr(sess->s.dev->ib_pd, IB_MR_TYPE_MEM_REG,
- sgt->nents, 0);
+ sgt->nents, IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(mr)) {
err = PTR_ERR(mr);
goto unmap_sg;
@@ -823,8 +823,9 @@ static int process_info_req(struct rtrs_srv_con *con,
rwr[mri].wr.send_flags = 0;
rwr[mri].mr = mr;
rwr[mri].key = mr->rkey;
- rwr[mri].access = (IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE);
+ rwr[mri].access =
+ (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING);
reg_wr = &rwr[mri].wr;
}
--
2.30.2
From: Avihai Horon <[email protected]>
Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
local_dma_lkey.
This will take effect only for devices that don't pre-allocate the lkey
but allocate it per PD allocation.
Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/core/verbs.c | 3 ++-
drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a1782f8a6ca0..9b719f7d6fd5 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
pd->local_dma_lkey = device->local_dma_lkey;
else
- mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
+ mr_access_flags |=
+ IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
pr_warn("%s: enabling unsafe global rkey\n", caller);
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
index b3fa783698a0..d74827694f92 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
@@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
int ret;
/* Support only LOCAL_WRITE flag for DMA MRs */
+ acc &= ~IB_ACCESS_RELAXED_ORDERING;
if (acc & ~IB_ACCESS_LOCAL_WRITE) {
dev_warn(&dev->pdev->dev,
"unsupported dma mr access flags %#x\n", acc);
--
2.30.2
From: Avihai Horon <[email protected]>
Enable Relaxed Ordering for smbd.
Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.
Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
fs/cifs/smbdirect.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 647098a5cf3b..1e86dc8bbe85 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2178,8 +2178,10 @@ static void smbd_mr_recovery_work(struct work_struct *work)
continue;
}
- smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
- info->max_frmr_depth, 0);
+ smbdirect_mr->mr =
+ ib_alloc_mr(info->pd, info->mr_type,
+ info->max_frmr_depth,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(smbdirect_mr->mr)) {
log_rdma_mr(ERR, "ib_alloc_mr failed mr_type=%x max_frmr_depth=%x\n",
info->mr_type,
@@ -2244,7 +2246,8 @@ static int allocate_mr_list(struct smbd_connection *info)
if (!smbdirect_mr)
goto out;
smbdirect_mr->mr = ib_alloc_mr(info->pd, info->mr_type,
- info->max_frmr_depth, 0);
+ info->max_frmr_depth,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(smbdirect_mr->mr)) {
log_rdma_mr(ERR, "ib_alloc_mr failed mr_type=%x max_frmr_depth=%x\n",
info->mr_type, info->max_frmr_depth);
@@ -2406,9 +2409,10 @@ struct smbd_mr *smbd_register_mr(
reg_wr->wr.send_flags = IB_SEND_SIGNALED;
reg_wr->mr = smbdirect_mr->mr;
reg_wr->key = smbdirect_mr->mr->rkey;
- reg_wr->access = writing ?
- IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
- IB_ACCESS_REMOTE_READ;
+ reg_wr->access =
+ (writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+ IB_ACCESS_REMOTE_READ) |
+ IB_ACCESS_RELAXED_ORDERING;
/*
* There is no need for waiting for complemtion on ib_post_send
--
2.30.2
From: Avihai Horon <[email protected]>
Enable Relaxed Ordering for rds.
Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.
Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
net/rds/ib_frmr.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
index 694eb916319e..1a60c2c90c78 100644
--- a/net/rds/ib_frmr.c
+++ b/net/rds/ib_frmr.c
@@ -76,7 +76,7 @@ static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
frmr = &ibmr->u.frmr;
frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG,
- pool->max_pages, 0);
+ pool->max_pages, IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(frmr->mr)) {
pr_warn("RDS/IB: %s failed to allocate MR", __func__);
err = PTR_ERR(frmr->mr);
@@ -156,9 +156,8 @@ static int rds_ib_post_reg_frmr(struct rds_ib_mr *ibmr)
reg_wr.wr.num_sge = 0;
reg_wr.mr = frmr->mr;
reg_wr.key = frmr->mr->rkey;
- reg_wr.access = IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_READ |
- IB_ACCESS_REMOTE_WRITE;
+ reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+ IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
reg_wr.wr.send_flags = IB_SEND_SIGNALED;
ret = ib_post_send(ibmr->ic->i_cm_id->qp, ®_wr.wr, NULL);
--
2.30.2
From: Avihai Horon <[email protected]>
Enable Relaxed Ordering for nvme.
Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.
Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/nvme/host/rdma.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4dbc17311e0b..8f106b20b39c 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -532,9 +532,8 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
*/
pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev, queue->pi_support) + 1;
ret = ib_mr_pool_init(queue->qp, &queue->qp->rdma_mrs,
- queue->queue_size,
- IB_MR_TYPE_MEM_REG,
- pages_per_mr, 0, 0);
+ queue->queue_size, IB_MR_TYPE_MEM_REG,
+ pages_per_mr, 0, IB_ACCESS_RELAXED_ORDERING);
if (ret) {
dev_err(queue->ctrl->ctrl.device,
"failed to initialize MR pool sized %d for QID %d\n",
@@ -545,7 +544,8 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
if (queue->pi_support) {
ret = ib_mr_pool_init(queue->qp, &queue->qp->sig_mrs,
queue->queue_size, IB_MR_TYPE_INTEGRITY,
- pages_per_mr, pages_per_mr, 0);
+ pages_per_mr, pages_per_mr,
+ IB_ACCESS_RELAXED_ORDERING);
if (ret) {
dev_err(queue->ctrl->ctrl.device,
"failed to initialize PI MR pool sized %d for QID %d\n",
@@ -1382,9 +1382,9 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
req->reg_wr.wr.num_sge = 0;
req->reg_wr.mr = req->mr;
req->reg_wr.key = req->mr->rkey;
- req->reg_wr.access = IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_READ |
- IB_ACCESS_REMOTE_WRITE;
+ req->reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+ IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING;
sg->addr = cpu_to_le64(req->mr->iova);
put_unaligned_le24(req->mr->length, sg->length);
@@ -1488,9 +1488,8 @@ static int nvme_rdma_map_sg_pi(struct nvme_rdma_queue *queue,
wr->wr.send_flags = 0;
wr->mr = req->mr;
wr->key = req->mr->rkey;
- wr->access = IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_READ |
- IB_ACCESS_REMOTE_WRITE;
+ wr->access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
+ IB_ACCESS_REMOTE_WRITE | IB_ACCESS_RELAXED_ORDERING;
sg->addr = cpu_to_le64(req->mr->iova);
put_unaligned_le24(req->mr->length, sg->length);
--
2.30.2
From: Avihai Horon <[email protected]>
Enable Relaxed Ordering for smc.
Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.
Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
net/smc/smc_ib.c | 3 ++-
net/smc/smc_wr.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 4e91ed3dc265..6b65c5d1f957 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -579,7 +579,8 @@ int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
return 0; /* already done */
buf_slot->mr_rx[link_idx] =
- ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order, 0);
+ ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(buf_slot->mr_rx[link_idx])) {
int rc;
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index cbc73a7e4d59..78e9650621f1 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -555,7 +555,8 @@ static void smc_wr_init_sge(struct smc_link *lnk)
lnk->wr_reg.wr.num_sge = 0;
lnk->wr_reg.wr.send_flags = IB_SEND_SIGNALED;
lnk->wr_reg.wr.opcode = IB_WR_REG_MR;
- lnk->wr_reg.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
+ lnk->wr_reg.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_RELAXED_ORDERING;
}
void smc_wr_free_link(struct smc_link *lnk)
--
2.30.2
From: Avihai Horon <[email protected]>
Enable Relaxed Ordering for xprtrdma.
Relaxed Ordering is an optional access flag and as such, it is ignored
by vendors that don't support it.
Signed-off-by: Avihai Horon <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
net/sunrpc/xprtrdma/frwr_ops.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index cfbdd197cdfe..f9334c0a1a13 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -135,7 +135,8 @@ int frwr_mr_init(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr *mr)
struct ib_mr *frmr;
int rc;
- frmr = ib_alloc_mr(ep->re_pd, ep->re_mrtype, depth, 0);
+ frmr = ib_alloc_mr(ep->re_pd, ep->re_mrtype, depth,
+ IB_ACCESS_RELAXED_ORDERING);
if (IS_ERR(frmr))
goto out_mr_err;
@@ -339,9 +340,10 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt,
reg_wr = &mr->frwr.fr_regwr;
reg_wr->mr = ibmr;
reg_wr->key = ibmr->rkey;
- reg_wr->access = writing ?
- IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
- IB_ACCESS_REMOTE_READ;
+ reg_wr->access =
+ (writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+ IB_ACCESS_REMOTE_READ) |
+ IB_ACCESS_RELAXED_ORDERING;
mr->mr_handle = ibmr->rkey;
mr->mr_length = ibmr->length;
--
2.30.2
On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <[email protected]>
>
> >From Avihai,
>
> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> imposed on PCI transactions, and thus, can improve performance.
>
> Until now, relaxed ordering could be set only by user space applications
> for user MRs. The following patch series enables relaxed ordering for the
> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> such, it is ignored by vendors that don't support it.
>
> The following test results show the performance improvement achieved
> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> to check performance of storage infrastructure over xprtrdma:
Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
What does that have to do with storage protocols?
Also if you enable this for basically all kernel ULPs, why not have
an opt-out into strict ordering for the cases that need it (if there are
any).
On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <[email protected]>
> >
> > >From Avihai,
> >
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> >
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> >
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed on a NVIDIA A100 in order
> > to check performance of storage infrastructure over xprtrdma:
>
> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> What does that have to do with storage protocols?
This system is in use by our storage oriented customer who performed the
test. He runs drivers/infiniband/* stack from the upstream, simply backported
to specific kernel version.
The performance boost is seen in other systems too.
>
> Also if you enable this for basically all kernel ULPs, why not have
> an opt-out into strict ordering for the cases that need it (if there are
> any).
The RO property is optional, it can only improve. In addition, all in-kernel ULPs
don't need strict ordering. I can be mistaken here and Jason will correct me, it
is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it.
Thanks
> On Apr 5, 2021, at 7:08 AM, Leon Romanovsky <[email protected]> wrote:
>
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <[email protected]>
>>>
>>>> From Avihai,
>>>
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>>
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>>
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>>
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
>
> This system is in use by our storage oriented customer who performed the
> test. He runs drivers/infiniband/* stack from the upstream, simply backported
> to specific kernel version.
>
> The performance boost is seen in other systems too.
>
>>
>> Also if you enable this for basically all kernel ULPs, why not have
>> an opt-out into strict ordering for the cases that need it (if there are
>> any).
>
> The RO property is optional, it can only improve. In addition, all in-kernel ULPs
> don't need strict ordering. I can be mistaken here and Jason will correct me, it
> is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it.
>
Sticking to in-kernel ULP’s don’t need strict ordering, why don’t you enable this
for HCA’s which supports it by default instead of patching very ULPs. Some ULPs
in future may forget to specify such flag.
Regards,
Santosh
On 4/5/2021 10:08 AM, Leon Romanovsky wrote:
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <[email protected]>
>>>
>>> >From Avihai,
>>>
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>>
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>>
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>>
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
>
> This system is in use by our storage oriented customer who performed the
> test. He runs drivers/infiniband/* stack from the upstream, simply backported
> to specific kernel version.
>
> The performance boost is seen in other systems too.
We need to see more information about this test, and platform.
What correctness testing was done, and how was it verified? What
PCI bus type(s) were tested, and with what adapters? What storage
workload was generated, and were all possible RDMA exchanges by
each ULP exercised?
>> Also if you enable this for basically all kernel ULPs, why not have
>> an opt-out into strict ordering for the cases that need it (if there are
>> any).
>
> The RO property is optional, it can only improve. In addition, all in-kernel ULPs
> don't need strict ordering. I can be mistaken here and Jason will correct me, it
> is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it.
+1 on Christoph's comment.
I woud hope most well-architected ULPs will support relaxed ordering,
but storage workloads, in my experience, can find ways to cause failure
in adapters. I would not suggest making this the default behavior
without extensive testing.
Tom.
On 4/5/2021 1:23 AM, Leon Romanovsky wrote:
> From: Avihai Horon <[email protected]>
>
> Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
> local_dma_lkey.
>
> This will take effect only for devices that don't pre-allocate the lkey
> but allocate it per PD allocation.
>
> Signed-off-by: Avihai Horon <[email protected]>
> Reviewed-by: Michael Guralnik <[email protected]>
> Signed-off-by: Leon Romanovsky <[email protected]>
> ---
> drivers/infiniband/core/verbs.c | 3 ++-
> drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index a1782f8a6ca0..9b719f7d6fd5 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
> if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
> pd->local_dma_lkey = device->local_dma_lkey;
> else
> - mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
> + mr_access_flags |=
> + IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
So, do local_dma_lkey's get relaxed ordering unconditionally?
> if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
> pr_warn("%s: enabling unsafe global rkey\n", caller);
> diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> index b3fa783698a0..d74827694f92 100644
> --- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> +++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
> @@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
> int ret;
>
> /* Support only LOCAL_WRITE flag for DMA MRs */
> + acc &= ~IB_ACCESS_RELAXED_ORDERING;
> if (acc & ~IB_ACCESS_LOCAL_WRITE) {
> dev_warn(&dev->pdev->dev,
> "unsupported dma mr access flags %#x\n", acc);
Why does the pvrdma driver require relaxed ordering to be off?
Tom.
On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <[email protected]>
> >
> > >From Avihai,
> >
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> >
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> >
> > The following test results show the performance improvement achieved
> > with relaxed ordering. The test was performed on a NVIDIA A100 in order
> > to check performance of storage infrastructure over xprtrdma:
>
> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> What does that have to do with storage protocols?
I think it is a typo (or at least mit makes no sense to be talking
about NFS with a GPU chip) Probably it should be a DGX A100 which is a
dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
workload.
AMD dual socket systems are well known to benefit from relaxed
ordering, people have been doing this in userspace for a while now
with the opt in.
What surprises me is the performance difference, I hadn't heard it is
4x!
Jason
On 4/5/21 11:01 AM, Tom Talpey wrote:
> On 4/5/2021 1:23 AM, Leon Romanovsky wrote:
>> From: Avihai Horon <[email protected]>
>>
>> Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
>> local_dma_lkey.
>>
>> This will take effect only for devices that don't pre-allocate the lkey
>> but allocate it per PD allocation.
>>
>> Signed-off-by: Avihai Horon <[email protected]>
>> Reviewed-by: Michael Guralnik <[email protected]>
>> Signed-off-by: Leon Romanovsky <[email protected]>
>> ---
>> drivers/infiniband/core/verbs.c | 3 ++-
>> drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
>> index a1782f8a6ca0..9b719f7d6fd5 100644
>> --- a/drivers/infiniband/core/verbs.c
>> +++ b/drivers/infiniband/core/verbs.c
>> @@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
>> if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
>> pd->local_dma_lkey = device->local_dma_lkey;
>> else
>> - mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
>> + mr_access_flags |=
>> + IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
>
> So, do local_dma_lkey's get relaxed ordering unconditionally?
>
>> if (flags & IB_PD_UNSAFE_GLOBAL_RKEY) {
>> pr_warn("%s: enabling unsafe global rkey\n", caller);
>> diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
>> index b3fa783698a0..d74827694f92 100644
>> --- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
>> +++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c
>> @@ -66,6 +66,7 @@ struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
>> int ret;
>> /* Support only LOCAL_WRITE flag for DMA MRs */
>> + acc &= ~IB_ACCESS_RELAXED_ORDERING;
>> if (acc & ~IB_ACCESS_LOCAL_WRITE) {
>> dev_warn(&dev->pdev->dev,
>> "unsupported dma mr access flags %#x\n", acc);
>
> Why does the pvrdma driver require relaxed ordering to be off?
PVRDMA doesn't support any other flags other than LOCAL_WRITE for
DMA MRs so the MR creation will fail if any new unconditionally added
flag isn't cleared.
> On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <[email protected]>
>>>
>>>> From Avihai,
>>>
>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>> imposed on PCI transactions, and thus, can improve performance.
>>>
>>> Until now, relaxed ordering could be set only by user space applications
>>> for user MRs. The following patch series enables relaxed ordering for the
>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>> such, it is ignored by vendors that don't support it.
>>>
>>> The following test results show the performance improvement achieved
>>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
>>> to check performance of storage infrastructure over xprtrdma:
>>
>> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
>> What does that have to do with storage protocols?
>
> I think it is a typo (or at least mit makes no sense to be talking
> about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> workload.
We need to get a better idea what correctness testing has been done,
and whether positive correctness testing results can be replicated
on a variety of platforms.
I have an old Haswell dual-socket system in my lab, but otherwise
I'm not sure I have a platform that would be interesting for such a
test.
> AMD dual socket systems are well known to benefit from relaxed
> ordering, people have been doing this in userspace for a while now
> with the opt in.
--
Chuck Lever
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <[email protected]> wrote:
> > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <[email protected]>
> >>>
> >>>> From Avihai,
> >>>
> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> >>> imposed on PCI transactions, and thus, can improve performance.
> >>>
> >>> Until now, relaxed ordering could be set only by user space applications
> >>> for user MRs. The following patch series enables relaxed ordering for the
> >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> >>> such, it is ignored by vendors that don't support it.
> >>>
> >>> The following test results show the performance improvement achieved
> >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> >>> to check performance of storage infrastructure over xprtrdma:
> >>
> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> >> What does that have to do with storage protocols?
> >
> > I think it is a typo (or at least mit makes no sense to be talking
> > about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> > dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> > workload.
>
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.
>
> I have an old Haswell dual-socket system in my lab, but otherwise
> I'm not sure I have a platform that would be interesting for such a
> test.
Not sure if Haswell will be useful for such testing. It looks like many
of those subscribe to 'quirk_relaxedordering_disable'.
On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <[email protected]>
>
> From Avihai,
>
> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> imposed on PCI transactions, and thus, can improve performance.
>
> Until now, relaxed ordering could be set only by user space applications
> for user MRs. The following patch series enables relaxed ordering for the
> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> such, it is ignored by vendors that don't support it.
>
> The following test results show the performance improvement achieved
Did you test this patchset with CPU does not support relaxed ordering?
We observed significantly performance degradation when run perftest with
relaxed ordering enabled over old CPU.
https://github.com/linux-rdma/perftest/issues/116
thanks
On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <[email protected]>
> >
> > From Avihai,
> >
> > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > imposed on PCI transactions, and thus, can improve performance.
> >
> > Until now, relaxed ordering could be set only by user space applications
> > for user MRs. The following patch series enables relaxed ordering for the
> > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > such, it is ignored by vendors that don't support it.
> >
> > The following test results show the performance improvement achieved
>
> Did you test this patchset with CPU does not support relaxed ordering?
I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation
and they are not interesting from performance point of view.
>
> We observed significantly performance degradation when run perftest with
> relaxed ordering enabled over old CPU.
>
> https://github.com/linux-rdma/perftest/issues/116
The perftest is slightly different, but you pointed to the valid point.
We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit
and arguably this was needed to be done in perftest too.
Thanks
>
> thanks
>
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>
>
> > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <[email protected]> wrote:
> >
> > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote:
> >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <[email protected]>
> >>>
> >>>> From Avihai,
> >>>
> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> >>> imposed on PCI transactions, and thus, can improve performance.
> >>>
> >>> Until now, relaxed ordering could be set only by user space applications
> >>> for user MRs. The following patch series enables relaxed ordering for the
> >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
> >>> such, it is ignored by vendors that don't support it.
> >>>
> >>> The following test results show the performance improvement achieved
> >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order
> >>> to check performance of storage infrastructure over xprtrdma:
> >>
> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all?
> >> What does that have to do with storage protocols?
> >
> > I think it is a typo (or at least mit makes no sense to be talking
> > about NFS with a GPU chip) Probably it should be a DGX A100 which is a
> > dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA
> > workload.
>
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.
I will ask to provide more details.
>
> I have an old Haswell dual-socket system in my lab, but otherwise
> I'm not sure I have a platform that would be interesting for such a
> test.
We don't have such old systems too.
>
>
> > AMD dual socket systems are well known to benefit from relaxed
> > ordering, people have been doing this in userspace for a while now
> > with the opt in.
>
>
> --
> Chuck Lever
>
>
>
On Mon, Apr 05, 2021 at 02:01:16PM -0400, Tom Talpey wrote:
> On 4/5/2021 1:23 AM, Leon Romanovsky wrote:
> > From: Avihai Horon <[email protected]>
> >
> > Enable Relaxed Ordering in __ib_alloc_pd() allocation of the
> > local_dma_lkey.
> >
> > This will take effect only for devices that don't pre-allocate the lkey
> > but allocate it per PD allocation.
> >
> > Signed-off-by: Avihai Horon <[email protected]>
> > Reviewed-by: Michael Guralnik <[email protected]>
> > Signed-off-by: Leon Romanovsky <[email protected]>
> > ---
> > drivers/infiniband/core/verbs.c | 3 ++-
> > drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 1 +
> > 2 files changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > index a1782f8a6ca0..9b719f7d6fd5 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -287,7 +287,8 @@ struct ib_pd *__ib_alloc_pd(struct ib_device *device, unsigned int flags,
> > if (device->attrs.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
> > pd->local_dma_lkey = device->local_dma_lkey;
> > else
> > - mr_access_flags |= IB_ACCESS_LOCAL_WRITE;
> > + mr_access_flags |=
> > + IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
>
> So, do local_dma_lkey's get relaxed ordering unconditionally?
Yes, in mlx5, this lkey is created on the fly.
Thanks
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> We need to get a better idea what correctness testing has been done,
> and whether positive correctness testing results can be replicated
> on a variety of platforms.
RO has been rolling out slowly on mlx5 over a few years and storage
ULPs are the last to change. eg the mlx5 ethernet driver has had RO
turned on for a long time, userspace HPC applications have been using
it for a while now too.
We know there are platforms with broken RO implementations (like
Haswell) but the kernel is supposed to globally turn off RO on all
those cases. I'd be a bit surprised if we discover any more from this
series.
On the other hand there are platforms that get huge speed ups from
turning this on, AMD is one example, there are a bunch in the ARM
world too.
Still, obviously people should test on the platforms they have.
Jason
On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote:
> On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
> > On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <[email protected]>
> > >
> > > From Avihai,
> > >
> > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
> > > imposed on PCI transactions, and thus, can improve performance.
> > >
> > > Until now, relaxed ordering could be set only by user space applications
> > > for user MRs. The following patch series enables relaxed ordering for the
> > > kernel ULPs as well. Relaxed ordering is an optional capability, and as
> > > such, it is ignored by vendors that don't support it.
> > >
> > > The following test results show the performance improvement achieved
> >
> > Did you test this patchset with CPU does not support relaxed ordering?
>
> I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation
> and they are not interesting from performance point of view.
>
> >
> > We observed significantly performance degradation when run perftest with
> > relaxed ordering enabled over old CPU.
> >
> > https://github.com/linux-rdma/perftest/issues/116
>
> The perftest is slightly different, but you pointed to the valid point.
> We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit
> and arguably this was needed to be done in perftest too.
No, the PCI device should not have the RO bit set in this situation.
It is something mlx5_core needs to do. We can't push this into
applications.
There should be no performance difference from asking for
IBV_ACCESS_RELAXED_ORDERING when RO is disabled at the PCI config and
not asking for it at all.
Either the platform has working relaxed ordering that gives a
performance gain and the RO config spec bit should be set, or it
doesn't and the bit should be clear.
This is not something to decide in userspace, or in RDMA. At worst it
becomes another platform specific PCI tunable people have to set.
I thought the old haswell systems were quirked to disable RO globally
anyhow?
Jason
> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
>
> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>
>>> We need to get a better idea what correctness testing has been done,
>>> and whether positive correctness testing results can be replicated
>>> on a variety of platforms.
>> RO has been rolling out slowly on mlx5 over a few years and storage
>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>> turned on for a long time, userspace HPC applications have been using
>> it for a while now too.
>
> I'd love to see RO be used more, it was always something the RDMA
> specs supported and carefully architected for. My only concern is
> that it's difficult to get right, especially when the platforms
> have been running strictly-ordered for so long. The ULPs need
> testing, and a lot of it.
>
>> We know there are platforms with broken RO implementations (like
>> Haswell) but the kernel is supposed to globally turn off RO on all
>> those cases. I'd be a bit surprised if we discover any more from this
>> series.
>> On the other hand there are platforms that get huge speed ups from
>> turning this on, AMD is one example, there are a bunch in the ARM
>> world too.
>
> My belief is that the biggest risk is from situations where completions
> are batched, and therefore polling is used to detect them without
> interrupts (which explicitly). The RO pipeline will completely reorder
> DMA writes, and consumers which infer ordering from memory contents may
> break. This can even apply within the provider code, which may attempt
> to poll WR and CQ structures, and be tripped up.
You are referring specifically to RPC/RDMA depending on Receive
completions to guarantee that previous RDMA Writes have been
retired? Or is there a particular implementation practice in
the Linux RPC/RDMA code that worries you?
> The Mellanox adapter, itself, historically has strict in-order DMA
> semantics, and while it's great to relax that, changing it by default
> for all consumers is something to consider very cautiously.
>
>> Still, obviously people should test on the platforms they have.
>
> Yes, and "test" be taken seriously with focus on ULP data integrity.
> Speedups will mean nothing if the data is ever damaged.
I agree that data integrity comes first.
Since I currently don't have facilities to test RO in my lab, the
community will have to agree on a set of tests and expected results
that specifically exercise the corner cases you are concerned about.
--
Chuck Lever
On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>
>
>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
>>
>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>
>>>> We need to get a better idea what correctness testing has been done,
>>>> and whether positive correctness testing results can be replicated
>>>> on a variety of platforms.
>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>> turned on for a long time, userspace HPC applications have been using
>>> it for a while now too.
>>
>> I'd love to see RO be used more, it was always something the RDMA
>> specs supported and carefully architected for. My only concern is
>> that it's difficult to get right, especially when the platforms
>> have been running strictly-ordered for so long. The ULPs need
>> testing, and a lot of it.
>>
>>> We know there are platforms with broken RO implementations (like
>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>> those cases. I'd be a bit surprised if we discover any more from this
>>> series.
>>> On the other hand there are platforms that get huge speed ups from
>>> turning this on, AMD is one example, there are a bunch in the ARM
>>> world too.
>>
>> My belief is that the biggest risk is from situations where completions
>> are batched, and therefore polling is used to detect them without
>> interrupts (which explicitly). The RO pipeline will completely reorder
>> DMA writes, and consumers which infer ordering from memory contents may
>> break. This can even apply within the provider code, which may attempt
>> to poll WR and CQ structures, and be tripped up.
>
> You are referring specifically to RPC/RDMA depending on Receive
> completions to guarantee that previous RDMA Writes have been
> retired? Or is there a particular implementation practice in
> the Linux RPC/RDMA code that worries you?
Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
is hopefully unfounded, is that the RO pipeline might not have flushed
when a completion is posted *after* posting an interrupt.
Something like this...
RDMA Write arrives
PCIe RO Write for data
PCIe RO Write for data
...
RDMA Write arrives
PCIe RO Write for data
...
RDMA Send arrives
PCIe RO Write for receive data
PCIe RO Write for receive descriptor
PCIe interrupt (flushes RO pipeline for all three ops above)
RPC/RDMA polls CQ
Reaps receive completion
RDMA Send arrives
PCIe RO Write for receive data
PCIe RO write for receive descriptor
Does *not* interrupt, since CQ not armed
RPC/RDMA continues to poll CQ
Reaps receive completion
PCIe RO writes not yet flushed
Processes incomplete in-memory data
Bzzzt
Hopefully, the adapter performs a PCIe flushing read, or something
to avoid this when an interrupt is not generated. Alternatively, I'm
overly paranoid.
Tom.
>> The Mellanox adapter, itself, historically has strict in-order DMA
>> semantics, and while it's great to relax that, changing it by default
>> for all consumers is something to consider very cautiously.
>>
>>> Still, obviously people should test on the platforms they have.
>>
>> Yes, and "test" be taken seriously with focus on ULP data integrity.
>> Speedups will mean nothing if the data is ever damaged.
>
> I agree that data integrity comes first.
>
> Since I currently don't have facilities to test RO in my lab, the
> community will have to agree on a set of tests and expected results
> that specifically exercise the corner cases you are concerned about.
>
>
> --
> Chuck Lever
>
>
>
>
> On 9 Apr 2021, at 17:32, Tom Talpey <[email protected]> wrote:
>
> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
>>>
>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>
>>>>> We need to get a better idea what correctness testing has been done,
>>>>> and whether positive correctness testing results can be replicated
>>>>> on a variety of platforms.
>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>> turned on for a long time, userspace HPC applications have been using
>>>> it for a while now too.
>>>
>>> I'd love to see RO be used more, it was always something the RDMA
>>> specs supported and carefully architected for. My only concern is
>>> that it's difficult to get right, especially when the platforms
>>> have been running strictly-ordered for so long. The ULPs need
>>> testing, and a lot of it.
>>>
>>>> We know there are platforms with broken RO implementations (like
>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>> series.
>>>> On the other hand there are platforms that get huge speed ups from
>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>> world too.
>>>
>>> My belief is that the biggest risk is from situations where completions
>>> are batched, and therefore polling is used to detect them without
>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>> DMA writes, and consumers which infer ordering from memory contents may
>>> break. This can even apply within the provider code, which may attempt
>>> to poll WR and CQ structures, and be tripped up.
>> You are referring specifically to RPC/RDMA depending on Receive
>> completions to guarantee that previous RDMA Writes have been
>> retired? Or is there a particular implementation practice in
>> the Linux RPC/RDMA code that worries you?
>
> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
> is hopefully unfounded, is that the RO pipeline might not have flushed
> when a completion is posted *after* posting an interrupt.
>
> Something like this...
>
> RDMA Write arrives
> PCIe RO Write for data
> PCIe RO Write for data
> ...
> RDMA Write arrives
> PCIe RO Write for data
> ...
> RDMA Send arrives
> PCIe RO Write for receive data
> PCIe RO Write for receive descriptor
Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will shure prior written RO date has global visibility when the CQE can be observed.
> PCIe interrupt (flushes RO pipeline for all three ops above)
Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility.
And the MSI-X write likewise, to avoid spurious interrupts.
Thxs, Håkon
>
> RPC/RDMA polls CQ
> Reaps receive completion
>
> RDMA Send arrives
> PCIe RO Write for receive data
> PCIe RO write for receive descriptor
> Does *not* interrupt, since CQ not armed
>
> RPC/RDMA continues to poll CQ
> Reaps receive completion
> PCIe RO writes not yet flushed
> Processes incomplete in-memory data
> Bzzzt
>
> Hopefully, the adapter performs a PCIe flushing read, or something
> to avoid this when an interrupt is not generated. Alternatively, I'm
> overly paranoid.
>
> Tom.
>
>>> The Mellanox adapter, itself, historically has strict in-order DMA
>>> semantics, and while it's great to relax that, changing it by default
>>> for all consumers is something to consider very cautiously.
>>>
>>>> Still, obviously people should test on the platforms they have.
>>>
>>> Yes, and "test" be taken seriously with focus on ULP data integrity.
>>> Speedups will mean nothing if the data is ever damaged.
>> I agree that data integrity comes first.
>> Since I currently don't have facilities to test RO in my lab, the
>> community will have to agree on a set of tests and expected results
>> that specifically exercise the corner cases you are concerned about.
>> --
>> Chuck Lever
On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote:
> My belief is that the biggest risk is from situations where completions
> are batched, and therefore polling is used to detect them without
> interrupts (which explicitly).
We don't do this in the kernel.
All kernel ULPs only read data after they observe the CQE. We do not
have "last data polling" and our interrupt model does not support some
hacky "interrupt means go and use the data" approach.
ULPs have to be designed this way to use the DMA API properly.
Fencing a DMA before it is completed by the HW will cause IOMMU
errors.
Userspace is a different story, but that will remain as-is with
optional relaxed ordering.
Jason
On 4/9/2021 12:40 PM, Jason Gunthorpe wrote:
> On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote:
>
>> My belief is that the biggest risk is from situations where completions
>> are batched, and therefore polling is used to detect them without
>> interrupts (which explicitly).
>
> We don't do this in the kernel.
>
> All kernel ULPs only read data after they observe the CQE. We do not
> have "last data polling" and our interrupt model does not support some
> hacky "interrupt means go and use the data" approach.
>
> ULPs have to be designed this way to use the DMA API properly.
Yep. Totally agree.
My concern was about the data being written as relaxed, and the CQE
racing it. I'll reply in the other fork.
> Fencing a DMA before it is completed by the HW will cause IOMMU
> errors.
>
> Userspace is a different story, but that will remain as-is with
> optional relaxed ordering.
>
> Jason
>
On 4/9/2021 12:27 PM, Haakon Bugge wrote:
>
>
>> On 9 Apr 2021, at 17:32, Tom Talpey <[email protected]> wrote:
>>
>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
>>>>
>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>
>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>> and whether positive correctness testing results can be replicated
>>>>>> on a variety of platforms.
>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>> turned on for a long time, userspace HPC applications have been using
>>>>> it for a while now too.
>>>>
>>>> I'd love to see RO be used more, it was always something the RDMA
>>>> specs supported and carefully architected for. My only concern is
>>>> that it's difficult to get right, especially when the platforms
>>>> have been running strictly-ordered for so long. The ULPs need
>>>> testing, and a lot of it.
>>>>
>>>>> We know there are platforms with broken RO implementations (like
>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>> series.
>>>>> On the other hand there are platforms that get huge speed ups from
>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>> world too.
>>>>
>>>> My belief is that the biggest risk is from situations where completions
>>>> are batched, and therefore polling is used to detect them without
>>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>> break. This can even apply within the provider code, which may attempt
>>>> to poll WR and CQ structures, and be tripped up.
>>> You are referring specifically to RPC/RDMA depending on Receive
>>> completions to guarantee that previous RDMA Writes have been
>>> retired? Or is there a particular implementation practice in
>>> the Linux RPC/RDMA code that worries you?
>>
>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>> is hopefully unfounded, is that the RO pipeline might not have flushed
>> when a completion is posted *after* posting an interrupt.
>>
>> Something like this...
>>
>> RDMA Write arrives
>> PCIe RO Write for data
>> PCIe RO Write for data
>> ...
>> RDMA Write arrives
>> PCIe RO Write for data
>> ...
>> RDMA Send arrives
>> PCIe RO Write for receive data
>> PCIe RO Write for receive descriptor
>
> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will shure prior written RO date has global visibility when the CQE can be observed.
I wasn't aware that a strongly-ordered PCIe Write will ensure that
prior relaxed-ordered writes went first. If that's the case, I'm
fine with it - as long as the providers are correctly coded!!
>> PCIe interrupt (flushes RO pipeline for all three ops above)
>
> Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility.
>
> And the MSI-X write likewise, to avoid spurious interrupts.
Ok, and yes agreed the same principle would apply.
Is there any implication if a PCIe switch were present on the
motherboard? The switch is allowed to do some creative routing
if the operation is relaxed, correct?
Tom.
> Thxs, Håkon
>
>>
>> RPC/RDMA polls CQ
>> Reaps receive completion
>>
>> RDMA Send arrives
>> PCIe RO Write for receive data
>> PCIe RO write for receive descriptor
>> Does *not* interrupt, since CQ not armed
>>
>> RPC/RDMA continues to poll CQ
>> Reaps receive completion
>> PCIe RO writes not yet flushed
>> Processes incomplete in-memory data
>> Bzzzt
>>
>> Hopefully, the adapter performs a PCIe flushing read, or something
>> to avoid this when an interrupt is not generated. Alternatively, I'm
>> overly paranoid.
>>
>> Tom.
>>
>>>> The Mellanox adapter, itself, historically has strict in-order DMA
>>>> semantics, and while it's great to relax that, changing it by default
>>>> for all consumers is something to consider very cautiously.
>>>>
>>>>> Still, obviously people should test on the platforms they have.
>>>>
>>>> Yes, and "test" be taken seriously with focus on ULP data integrity.
>>>> Speedups will mean nothing if the data is ever damaged.
>>> I agree that data integrity comes first.
>>> Since I currently don't have facilities to test RO in my lab, the
>>> community will have to agree on a set of tests and expected results
>>> that specifically exercise the corner cases you are concerned about.
>>> --
>>> Chuck Lever
>
From: Tom Talpey
> Sent: 09 April 2021 18:49
> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
> >
> >
> >> On 9 Apr 2021, at 17:32, Tom Talpey <[email protected]> wrote:
> >>
> >> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
> >>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
> >>>>
> >>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
> >>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
> >>>>>
> >>>>>> We need to get a better idea what correctness testing has been done,
> >>>>>> and whether positive correctness testing results can be replicated
> >>>>>> on a variety of platforms.
> >>>>> RO has been rolling out slowly on mlx5 over a few years and storage
> >>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
> >>>>> turned on for a long time, userspace HPC applications have been using
> >>>>> it for a while now too.
> >>>>
> >>>> I'd love to see RO be used more, it was always something the RDMA
> >>>> specs supported and carefully architected for. My only concern is
> >>>> that it's difficult to get right, especially when the platforms
> >>>> have been running strictly-ordered for so long. The ULPs need
> >>>> testing, and a lot of it.
> >>>>
> >>>>> We know there are platforms with broken RO implementations (like
> >>>>> Haswell) but the kernel is supposed to globally turn off RO on all
> >>>>> those cases. I'd be a bit surprised if we discover any more from this
> >>>>> series.
> >>>>> On the other hand there are platforms that get huge speed ups from
> >>>>> turning this on, AMD is one example, there are a bunch in the ARM
> >>>>> world too.
> >>>>
> >>>> My belief is that the biggest risk is from situations where completions
> >>>> are batched, and therefore polling is used to detect them without
> >>>> interrupts (which explicitly). The RO pipeline will completely reorder
> >>>> DMA writes, and consumers which infer ordering from memory contents may
> >>>> break. This can even apply within the provider code, which may attempt
> >>>> to poll WR and CQ structures, and be tripped up.
> >>> You are referring specifically to RPC/RDMA depending on Receive
> >>> completions to guarantee that previous RDMA Writes have been
> >>> retired? Or is there a particular implementation practice in
> >>> the Linux RPC/RDMA code that worries you?
> >>
> >> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
> >> is hopefully unfounded, is that the RO pipeline might not have flushed
> >> when a completion is posted *after* posting an interrupt.
> >>
> >> Something like this...
> >>
> >> RDMA Write arrives
> >> PCIe RO Write for data
> >> PCIe RO Write for data
> >> ...
> >> RDMA Write arrives
> >> PCIe RO Write for data
> >> ...
> >> RDMA Send arrives
> >> PCIe RO Write for receive data
> >> PCIe RO Write for receive descriptor
> >
> > Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
> it will shure prior written RO date has global visibility when the CQE can be observed.
>
> I wasn't aware that a strongly-ordered PCIe Write will ensure that
> prior relaxed-ordered writes went first. If that's the case, I'm
> fine with it - as long as the providers are correctly coded!!
I remember trying to read the relevant section of the PCIe spec.
(Possibly in a book that was trying to make it easier to understand!)
It is about as clear as mud.
I presume this is all about allowing PCIe targets (eg ethernet cards)
to use relaxed ordering on write requests to host memory.
And that such writes can be completed out of order?
It isn't entirely clear that you aren't talking of letting the
cpu do 'relaxed order' writes to PCIe targets!
For a typical ethernet driver the receive interrupt just means
'go and look at the receive descriptor ring'.
So there is an absolute requirement that the writes for data
buffer complete before the write to the receive descriptor.
There is no requirement for the interrupt (requested after the
descriptor write) to have been seen by the cpu.
Quite often the driver will find the 'receive complete'
descriptor when processing frames from an earlier interrupt
(and nothing to do in response to the interrupt itself).
So the write to the receive descriptor would have to have RO clear
to ensure that all the buffer writes complete first.
(The furthest I've got into PCIe internals was fixing the bug
in some vendor-supplied FPGA logic that failed to correctly
handle multiple data TLP responses to a single read TLP.
Fortunately it wasn't in the hard-IP bit.)
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 4/6/2021 2:53 PM, Jason Gunthorpe wrote:
> On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote:
>> On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote:
>>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote:
>>>> From: Leon Romanovsky <[email protected]>
>>>>
>>>> From Avihai,
>>>>
>>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering
>>>> imposed on PCI transactions, and thus, can improve performance.
>>>>
>>>> Until now, relaxed ordering could be set only by user space applications
>>>> for user MRs. The following patch series enables relaxed ordering for the
>>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as
>>>> such, it is ignored by vendors that don't support it.
>>>>
>>>> The following test results show the performance improvement achieved
>>> Did you test this patchset with CPU does not support relaxed ordering?
>> I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation
>> and they are not interesting from performance point of view.
>>
>>> We observed significantly performance degradation when run perftest with
>>> relaxed ordering enabled over old CPU.
>>>
>>> https://github.com/linux-rdma/perftest/issues/116
>> The perftest is slightly different, but you pointed to the valid point.
>> We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit
>> and arguably this was needed to be done in perftest too.
> No, the PCI device should not have the RO bit set in this situation.
> It is something mlx5_core needs to do. We can't push this into
> applications.
pcie_relaxed_ordering_enabled is called in
drivers/net/ethernet/mellanox/mlx5/core/en_common.c so probably need to
move it to
mlx5_core in this series.
>
> There should be no performance difference from asking for
> IBV_ACCESS_RELAXED_ORDERING when RO is disabled at the PCI config and
> not asking for it at all.
>
> Either the platform has working relaxed ordering that gives a
> performance gain and the RO config spec bit should be set, or it
> doesn't and the bit should be clear.
is this the case today ?
>
> This is not something to decide in userspace, or in RDMA. At worst it
> becomes another platform specific PCI tunable people have to set.
>
> I thought the old haswell systems were quirked to disable RO globally
> anyhow?
>
> Jason
> On 10 Apr 2021, at 15:30, David Laight <[email protected]> wrote:
>
> From: Tom Talpey
>> Sent: 09 April 2021 18:49
>> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
>>>
>>>
>>>> On 9 Apr 2021, at 17:32, Tom Talpey <[email protected]> wrote:
>>>>
>>>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
>>>>>>
>>>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>>>
>>>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>>>> and whether positive correctness testing results can be replicated
>>>>>>>> on a variety of platforms.
>>>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>>>> turned on for a long time, userspace HPC applications have been using
>>>>>>> it for a while now too.
>>>>>>
>>>>>> I'd love to see RO be used more, it was always something the RDMA
>>>>>> specs supported and carefully architected for. My only concern is
>>>>>> that it's difficult to get right, especially when the platforms
>>>>>> have been running strictly-ordered for so long. The ULPs need
>>>>>> testing, and a lot of it.
>>>>>>
>>>>>>> We know there are platforms with broken RO implementations (like
>>>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>>>> series.
>>>>>>> On the other hand there are platforms that get huge speed ups from
>>>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>>>> world too.
>>>>>>
>>>>>> My belief is that the biggest risk is from situations where completions
>>>>>> are batched, and therefore polling is used to detect them without
>>>>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>>>> break. This can even apply within the provider code, which may attempt
>>>>>> to poll WR and CQ structures, and be tripped up.
>>>>> You are referring specifically to RPC/RDMA depending on Receive
>>>>> completions to guarantee that previous RDMA Writes have been
>>>>> retired? Or is there a particular implementation practice in
>>>>> the Linux RPC/RDMA code that worries you?
>>>>
>>>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>>>> is hopefully unfounded, is that the RO pipeline might not have flushed
>>>> when a completion is posted *after* posting an interrupt.
>>>>
>>>> Something like this...
>>>>
>>>> RDMA Write arrives
>>>> PCIe RO Write for data
>>>> PCIe RO Write for data
>>>> ...
>>>> RDMA Write arrives
>>>> PCIe RO Write for data
>>>> ...
>>>> RDMA Send arrives
>>>> PCIe RO Write for receive data
>>>> PCIe RO Write for receive descriptor
>>>
>>> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
>> it will shure prior written RO date has global visibility when the CQE can be observed.
>>
>> I wasn't aware that a strongly-ordered PCIe Write will ensure that
>> prior relaxed-ordered writes went first. If that's the case, I'm
>> fine with it - as long as the providers are correctly coded!!
The PCIe spec (Table Ordering Rules Summary) is quite clear here (A Posted request is Memory Write Request in this context):
A Posted Request must not pass another Posted Request unless A2b applies.
A2b: A Posted Request with RO Set is permitted to pass another Posted Request.
Thxs, Håkon
>
> I remember trying to read the relevant section of the PCIe spec.
> (Possibly in a book that was trying to make it easier to understand!)
> It is about as clear as mud.
>
> I presume this is all about allowing PCIe targets (eg ethernet cards)
> to use relaxed ordering on write requests to host memory.
> And that such writes can be completed out of order?
>
> It isn't entirely clear that you aren't talking of letting the
> cpu do 'relaxed order' writes to PCIe targets!
>
> For a typical ethernet driver the receive interrupt just means
> 'go and look at the receive descriptor ring'.
> So there is an absolute requirement that the writes for data
> buffer complete before the write to the receive descriptor.
> There is no requirement for the interrupt (requested after the
> descriptor write) to have been seen by the cpu.
>
> Quite often the driver will find the 'receive complete'
> descriptor when processing frames from an earlier interrupt
> (and nothing to do in response to the interrupt itself).
>
> So the write to the receive descriptor would have to have RO clear
> to ensure that all the buffer writes complete first.
>
> (The furthest I've got into PCIe internals was fixing the bug
> in some vendor-supplied FPGA logic that failed to correctly
> handle multiple data TLP responses to a single read TLP.
> Fortunately it wasn't in the hard-IP bit.)
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
On 4/12/2021 2:32 PM, Haakon Bugge wrote:
>
>
>> On 10 Apr 2021, at 15:30, David Laight <[email protected]> wrote:
>>
>> From: Tom Talpey
>>> Sent: 09 April 2021 18:49
>>> On 4/9/2021 12:27 PM, Haakon Bugge wrote:
>>>>
>>>>
>>>>> On 9 Apr 2021, at 17:32, Tom Talpey <[email protected]> wrote:
>>>>>
>>>>> On 4/9/2021 10:45 AM, Chuck Lever III wrote:
>>>>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <[email protected]> wrote:
>>>>>>>
>>>>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote:
>>>>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote:
>>>>>>>>
>>>>>>>>> We need to get a better idea what correctness testing has been done,
>>>>>>>>> and whether positive correctness testing results can be replicated
>>>>>>>>> on a variety of platforms.
>>>>>>>> RO has been rolling out slowly on mlx5 over a few years and storage
>>>>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO
>>>>>>>> turned on for a long time, userspace HPC applications have been using
>>>>>>>> it for a while now too.
>>>>>>>
>>>>>>> I'd love to see RO be used more, it was always something the RDMA
>>>>>>> specs supported and carefully architected for. My only concern is
>>>>>>> that it's difficult to get right, especially when the platforms
>>>>>>> have been running strictly-ordered for so long. The ULPs need
>>>>>>> testing, and a lot of it.
>>>>>>>
>>>>>>>> We know there are platforms with broken RO implementations (like
>>>>>>>> Haswell) but the kernel is supposed to globally turn off RO on all
>>>>>>>> those cases. I'd be a bit surprised if we discover any more from this
>>>>>>>> series.
>>>>>>>> On the other hand there are platforms that get huge speed ups from
>>>>>>>> turning this on, AMD is one example, there are a bunch in the ARM
>>>>>>>> world too.
>>>>>>>
>>>>>>> My belief is that the biggest risk is from situations where completions
>>>>>>> are batched, and therefore polling is used to detect them without
>>>>>>> interrupts (which explicitly). The RO pipeline will completely reorder
>>>>>>> DMA writes, and consumers which infer ordering from memory contents may
>>>>>>> break. This can even apply within the provider code, which may attempt
>>>>>>> to poll WR and CQ structures, and be tripped up.
>>>>>> You are referring specifically to RPC/RDMA depending on Receive
>>>>>> completions to guarantee that previous RDMA Writes have been
>>>>>> retired? Or is there a particular implementation practice in
>>>>>> the Linux RPC/RDMA code that worries you?
>>>>>
>>>>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which
>>>>> is hopefully unfounded, is that the RO pipeline might not have flushed
>>>>> when a completion is posted *after* posting an interrupt.
>>>>>
>>>>> Something like this...
>>>>>
>>>>> RDMA Write arrives
>>>>> PCIe RO Write for data
>>>>> PCIe RO Write for data
>>>>> ...
>>>>> RDMA Write arrives
>>>>> PCIe RO Write for data
>>>>> ...
>>>>> RDMA Send arrives
>>>>> PCIe RO Write for receive data
>>>>> PCIe RO Write for receive descriptor
>>>>
>>>> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then
>>> it will shure prior written RO date has global visibility when the CQE can be observed.
>>>
>>> I wasn't aware that a strongly-ordered PCIe Write will ensure that
>>> prior relaxed-ordered writes went first. If that's the case, I'm
>>> fine with it - as long as the providers are correctly coded!!
>
> The PCIe spec (Table Ordering Rules Summary) is quite clear here (A Posted request is Memory Write Request in this context):
>
> A Posted Request must not pass another Posted Request unless A2b applies.
>
> A2b: A Posted Request with RO Set is permitted to pass another Posted Request.
>
>
> Thxs, Håkon
Ok, good - a non-RO write (for example, to a CQE), or an interrupt
(which would be similarly non-RO), will "get behind" all prior writes.
So the issue is only in testing all the providers and platforms,
to be sure this new behavior isn't tickling anything that went
unnoticed all along, because no RDMA provider ever issued RO.
Honestly, the Haswell sounds like a great first candidate, because
if it has a known-broken RO behavior, verifying that it works with
this change is highly important. I'd have greater confidence in newer
platforms, in other words. They *all* have to work, proveably.
Tom.
>> I remember trying to read the relevant section of the PCIe spec.
>> (Possibly in a book that was trying to make it easier to understand!)
>> It is about as clear as mud.
>>
>> I presume this is all about allowing PCIe targets (eg ethernet cards)
>> to use relaxed ordering on write requests to host memory.
>> And that such writes can be completed out of order?
>>
>> It isn't entirely clear that you aren't talking of letting the
>> cpu do 'relaxed order' writes to PCIe targets!
>>
>> For a typical ethernet driver the receive interrupt just means
>> 'go and look at the receive descriptor ring'.
>> So there is an absolute requirement that the writes for data
>> buffer complete before the write to the receive descriptor.
>> There is no requirement for the interrupt (requested after the
>> descriptor write) to have been seen by the cpu.
>>
>> Quite often the driver will find the 'receive complete'
>> descriptor when processing frames from an earlier interrupt
>> (and nothing to do in response to the interrupt itself).
>>
>> So the write to the receive descriptor would have to have RO clear
>> to ensure that all the buffer writes complete first.
>>
>> (The furthest I've got into PCIe internals was fixing the bug
>> in some vendor-supplied FPGA logic that failed to correctly
>> handle multiple data TLP responses to a single read TLP.
>> Fortunately it wasn't in the hard-IP bit.)
>>
>> David
>>
>> -
>> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
>> Registration No: 1397386 (Wales)
>
On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
> So the issue is only in testing all the providers and platforms,
> to be sure this new behavior isn't tickling anything that went
> unnoticed all along, because no RDMA provider ever issued RO.
The mlx5 ethernet driver has run in RO mode for a long time, and it
operates in basically the same way as RDMA. The issues with Haswell
have been worked out there already.
The only open question is if the ULPs have errors in their
implementation, which I don't think we can find out until we apply
this series and people start running their tests aggressively.
Jason
On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:
> On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
>
>> So the issue is only in testing all the providers and platforms,
>> to be sure this new behavior isn't tickling anything that went
>> unnoticed all along, because no RDMA provider ever issued RO.
>
> The mlx5 ethernet driver has run in RO mode for a long time, and it
> operates in basically the same way as RDMA. The issues with Haswell
> have been worked out there already.
>
> The only open question is if the ULPs have errors in their
> implementation, which I don't think we can find out until we apply
> this series and people start running their tests aggressively.
I agree that the core RO support should go in. But turning it on
by default for a ULP should be the decision of each ULP maintainer.
It's a huge risk to shift all the storage drivers overnight. How
do you propose to ensure the aggressive testing happens?
One thing that worries me is the patch02 on-by-default for the dma_lkey.
There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
from being set in __ib_alloc_pd().
Tom.
From: Tom Talpey
> Sent: 14 April 2021 15:16
>
> On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:
> > On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
> >
> >> So the issue is only in testing all the providers and platforms,
> >> to be sure this new behavior isn't tickling anything that went
> >> unnoticed all along, because no RDMA provider ever issued RO.
> >
> > The mlx5 ethernet driver has run in RO mode for a long time, and it
> > operates in basically the same way as RDMA. The issues with Haswell
> > have been worked out there already.
> >
> > The only open question is if the ULPs have errors in their
> > implementation, which I don't think we can find out until we apply
> > this series and people start running their tests aggressively.
>
> I agree that the core RO support should go in. But turning it on
> by default for a ULP should be the decision of each ULP maintainer.
> It's a huge risk to shift all the storage drivers overnight. How
> do you propose to ensure the aggressive testing happens?
>
> One thing that worries me is the patch02 on-by-default for the dma_lkey.
> There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
> from being set in __ib_alloc_pd().
What is a ULP in this context?
I've presumed that this is all about getting PCIe targets (ie cards)
to set the RO (relaxed ordering) bit in some of the write TLP they
generate for writing to host memory?
So whatever driver initialises the target needs to configure whatever
target-specific register enables the RO transfers themselves.
After that there could be flags in the PCIe config space of the target
and any bridges that clear the RO flag.
There could also be flags in the bridges and root complex to ignore
the RO flag even if it is set.
Then the Linux kernel can have option(s) to tell the driver not
to enable RO - even though the driver believes it should all work.
This could be a single global flag, or fin-grained in some way.
So what exactly is this patch series doing?
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Wed, Apr 14, 2021 at 10:16:28AM -0400, Tom Talpey wrote:
> On 4/12/2021 6:48 PM, Jason Gunthorpe wrote:
> > On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote:
> >
> > > So the issue is only in testing all the providers and platforms,
> > > to be sure this new behavior isn't tickling anything that went
> > > unnoticed all along, because no RDMA provider ever issued RO.
> >
> > The mlx5 ethernet driver has run in RO mode for a long time, and it
> > operates in basically the same way as RDMA. The issues with Haswell
> > have been worked out there already.
> >
> > The only open question is if the ULPs have errors in their
> > implementation, which I don't think we can find out until we apply
> > this series and people start running their tests aggressively.
>
> I agree that the core RO support should go in. But turning it on
> by default for a ULP should be the decision of each ULP maintainer.
> It's a huge risk to shift all the storage drivers overnight. How
> do you propose to ensure the aggressive testing happens?
Realistically we do test most of the RDMA storage ULPs at NVIDIA over
mlx5 which is the only HW that will enable this for now.
I disagree it is a "huge risk".
Additional wider testing is welcomed and can happen over the 16 week
release cycle for a kernel. I would aim to get the relaxed ordering
changed merged to linux-next a week or so after the merge window.
Further testing happens before these changes would get picked up in a
distro on something like MLNX_OFED.
I don't think we need to make the patch design worse or over think the
submission process for something that, so far, hasn't discovered any
issues and alread has a proven track record in other ULPs.
Any storage ULP that has a problem here is mis-using verbs and the DMA
API and thus has an existing data-corruption bug that they are simply
lucky to have not yet discovered.
> One thing that worries me is the patch02 on-by-default for the dma_lkey.
> There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING
> from being set in __ib_alloc_pd().
The ULPs are being forced into relaxed_ordering. They don't get to
turn it off one by one. The v2 will be more explicit about this as
there will be no ULP patches, just the verbs core code being updated.
Jason
On Wed, Apr 14, 2021 at 02:41:52PM +0000, David Laight wrote:
> So whatever driver initialises the target needs to configure whatever
> target-specific register enables the RO transfers themselves.
RDMA in general, and mlx5 in particular, is a layered design:
mlx5_core <- owns the PCI function, should turn on RO at the PCI
function level
mlx5_en <- Commands the chip to use RO for queues used in ethernet
ib_core
ib_uverbs
mlx5_ib <- Commands the chip to use RO for some queues used in
userspace
ib_srp* <- A ULP driver built on RDMA - this patch commands the chip
to use RO on SRP queues
nvme-rdma <- Ditto
ib_iser <- Ditto
rds <- Ditto
So this series is about expanding the set of queues running on mlx5
that have RO turned when the PCI function is already running with RO
enabled.
We want as many queues as possible RO enabled because it brings big
performance wins
Jason