2016-06-15 03:15:19

by Chuck Lever III

Subject: [PATCH v2 00/24] NFS/RDMA client patches proposed for v4.8

This series implements the following:

- Fixes to FMR disconnect recovery
- Removal of the insecure ALLPHYSICAL memory registration mode
- Significant reductions in per-transport memory consumption
- Support for sec=krb5, sec=krb5i, and sec=krb5p with NFS/RDMA
(with no performance impact on sec=sys)
- Pre-requisites for device removal support

Kerberos with NFS/RDMA is useful mainly for secure authentication of
each RPC transaction (sec=krb5). The Kerberos integrity and privacy
services are also available, providing feature parity with TCP in
environments where the use of sec=krb5i or sec=krb5p is mandated by
IT policy.

Sagi's proposed fix for mlx4's new FRWR API is included. I'll drop
it from my series once the official version of this fix is merged.


Available in the "nfs-rdma-for-4.8" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.8


Changes since v1:
- Rebased on v4.7-rc3
- Re-ordered series so FMR fixes come first
- Fixed ib_unmap_fmr list handling
- Replaced the bogus 256-MRs-per-QP patch with dynamic MR allocation
- Added performance counters to expose MR allocation and recovery behavior
- Included Sagi's proposed mlx4 priv pages fix
- Made it easier to merge my datatouch patch with Scott's bug fix
- Split some patches for readability
- Dropped some clean-up patches to keep the series patch count down

---

Chuck Lever (23):
xprtrdma: Remove FMRs from the unmap list after unmapping
xprtrdma: Create common scatterlist fields in rpcrdma_mw
xprtrdma: Move init and release helpers
xprtrdma: Rename fields in rpcrdma_fmr
xprtrdma: Use scatterlist for DMA mapping and unmapping under FMR
xprtrdma: Refactor MR recovery work queues
xprtrdma: Do not leak an MW during a DMA map failure
xprtrdma: Remove ALLPHYSICAL memory registration mode
xprtrdma: Remove rpcrdma_map_one() and friends
xprtrdma: Reply buffer exhaustion can be catastrophic
xprtrdma: Honor ->send_request API contract
xprtrdma: Chunk list encoders must not return zero
xprtrdma: Allocate MRs on demand
xprtrdma: Release orphaned MRs immediately
xprtrdma: Place registered MWs on a per-req list
xprtrdma: Chunk list encoders no longer share one rl_segments array
xprtrdma: rpcrdma_inline_fixup() overruns the receive page list
xprtrdma: Do not update {head,tail}.iov_len in rpcrdma_inline_fixup()
xprtrdma: Update only specific fields in private receive buffer
xprtrdma: Clean up fixup_copy_count accounting
xprtrdma: No direct data placement with krb5i and krb5p
svc: Avoid garbage replies when pc_func() returns rpc_drop_reply
NFS: Don't drop CB requests with invalid principals

Sagi Grimberg (1):
mlx4-ib: Use coherent memory for priv pages


drivers/infiniband/hw/mlx4/mlx4_ib.h | 1
drivers/infiniband/hw/mlx4/mr.c | 42 +---
fs/nfs/callback_xdr.c | 6 -
include/linux/sunrpc/auth.h | 3
include/linux/sunrpc/gss_api.h | 2
net/sunrpc/auth_gss/auth_gss.c | 2
net/sunrpc/auth_gss/gss_krb5_mech.c | 2
net/sunrpc/auth_gss/gss_mech_switch.c | 12 +
net/sunrpc/svc.c | 8 +
net/sunrpc/xprtrdma/Makefile | 2
net/sunrpc/xprtrdma/fmr_ops.c | 372 +++++++++++++++------------------
net/sunrpc/xprtrdma/frwr_ops.c | 352 +++++++++++--------------------
net/sunrpc/xprtrdma/physical_ops.c | 122 -----------
net/sunrpc/xprtrdma/rpc_rdma.c | 274 +++++++++++++-----------
net/sunrpc/xprtrdma/transport.c | 40 ++--
net/sunrpc/xprtrdma/verbs.c | 189 +++++++++++++----
net/sunrpc/xprtrdma/xprt_rdma.h | 116 ++++------
17 files changed, 684 insertions(+), 861 deletions(-)
delete mode 100644 net/sunrpc/xprtrdma/physical_ops.c

--
Chuck Lever


2016-06-15 03:15:27

by Chuck Lever III

Subject: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

From: Sagi Grimberg <[email protected]>

kmalloc doesn't guarantee the returned memory is all on one page.
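
For context, the change below swaps the kmalloc/dma_map_single pattern
for a single coherent DMA allocation. A rough before/after sketch,
condensed from the hunks that follow:

	/* before: kmalloc plus dma_map_single, aligned by hand */
	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
	mr->page_map = dma_map_single(device->dma_device, mr->pages,
				      size, DMA_TO_DEVICE);

	/* after: one DMA-coherent allocation holds the page list */
	mr->pages = dma_alloc_coherent(device->dma_device, size,
				       &mr->page_map, GFP_KERNEL);

	/* and on release */
	dma_free_coherent(device->dma_device, size, mr->pages, mr->page_map);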

Fixes: 1b2cd0fc673c ("IB/mlx4: Support the new memory ... ")
Signed-off-by: Sagi Grimberg <[email protected]>
---
drivers/infiniband/hw/mlx4/mlx4_ib.h | 1 -
drivers/infiniband/hw/mlx4/mr.c | 42 +++++-----------------------------
2 files changed, 6 insertions(+), 37 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 6c5ac5d..4a8bbe4 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -139,7 +139,6 @@ struct mlx4_ib_mr {
u32 max_pages;
struct mlx4_mr mmr;
struct ib_umem *umem;
- void *pages_alloc;
};

struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 6312721..c4c2044 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -278,30 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
int max_pages)
{
int size = max_pages * sizeof(u64);
- int add_size;
- int ret;
-
- add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);

- mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
- if (!mr->pages_alloc)
+ mr->pages = dma_alloc_coherent(device->dma_device, size,
+ &mr->page_map, GFP_KERNEL);
+ if (!mr->pages)
return -ENOMEM;

- mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
-
- mr->page_map = dma_map_single(device->dma_device, mr->pages,
- size, DMA_TO_DEVICE);
-
- if (dma_mapping_error(device->dma_device, mr->page_map)) {
- ret = -ENOMEM;
- goto err;
- }
-
return 0;
-err:
- kfree(mr->pages_alloc);
-
- return ret;
}

static void
@@ -311,9 +294,8 @@ mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
struct ib_device *device = mr->ibmr.device;
int size = mr->max_pages * sizeof(u64);

- dma_unmap_single(device->dma_device, mr->page_map,
- size, DMA_TO_DEVICE);
- kfree(mr->pages_alloc);
+ dma_free_coherent(device->dma_device, size,
+ mr->pages, mr->page_map);
mr->pages = NULL;
}
}
@@ -532,19 +514,7 @@ int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
unsigned int *sg_offset)
{
struct mlx4_ib_mr *mr = to_mmr(ibmr);
- int rc;

mr->npages = 0;
-
- ib_dma_sync_single_for_cpu(ibmr->device, mr->page_map,
- sizeof(u64) * mr->max_pages,
- DMA_TO_DEVICE);
-
- rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
-
- ib_dma_sync_single_for_device(ibmr->device, mr->page_map,
- sizeof(u64) * mr->max_pages,
- DMA_TO_DEVICE);
-
- return rc;
+ return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
}


2016-06-15 03:15:36

by Chuck Lever III

Subject: [PATCH v2 02/24] xprtrdma: Remove FMRs from the unmap list after unmapping

ib_unmap_fmr() takes a list of FMRs to unmap. However, it does not
remove the FMRs from this list as it processes them. Other
ib_unmap_fmr() call sites are careful to remove FMRs from the list
after ib_unmap_fmr() returns.
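
A minimal sketch of that careful call-site pattern (it matches the
helper updated below; error handling elided):

	LIST_HEAD(l);
	int rc;

	list_add(&fmr->list, &l);	/* put the FMR on a local list */
	rc = ib_unmap_fmr(&l);		/* invalidate it */
	list_del_init(&fmr->list);	/* then take it back off the list */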

Since commit 7c7a5390dc6c8 ("xprtrdma: Add ro_unmap_sync method for FMR"),
fmr_op_unmap_sync() passes more than one FMR to ib_unmap_fmr(), but it
does not remove the FMRs from that list once the call completes.

I've noticed some instability that could be related to list
tangling by the new fmr_op_unmap_sync() logic. In an abundance
of caution, add some defensive logic to clean up properly after
ib_unmap_fmr().

Fixes: 7c7a5390dc6c8 ("xprtrdma: Add ro_unmap_sync method for FMR")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 6326ebe..958c792 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -63,9 +63,12 @@ static int
__fmr_unmap(struct rpcrdma_mw *mw)
{
LIST_HEAD(l);
+ int rc;

list_add(&mw->fmr.fmr->list, &l);
- return ib_unmap_fmr(&l);
+ rc = ib_unmap_fmr(&l);
+ list_del_init(&mw->fmr.fmr->list);
+ return rc;
}

/* Deferred reset of a single FMR. Generate a fresh rkey by
@@ -149,6 +152,7 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
r->fmr.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
if (IS_ERR(r->fmr.fmr))
goto out_fmr_err;
+ INIT_LIST_HEAD(&r->fmr.fmr->list);

r->mw_xprt = r_xprt;
list_add(&r->mw_list, &buf->rb_mws);
@@ -267,7 +271,7 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
seg = &req->rl_segments[i];
mw = seg->rl_mw;

- list_add(&mw->fmr.fmr->list, &unmap_list);
+ list_add_tail(&mw->fmr.fmr->list, &unmap_list);

i += seg->mr_nsegs;
}
@@ -280,7 +284,9 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
*/
for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
seg = &req->rl_segments[i];
+ mw = seg->rl_mw;

+ list_del_init(&mw->fmr.fmr->list);
__fmr_dma_unmap(r_xprt, seg);
rpcrdma_put_mw(r_xprt, seg->rl_mw);



2016-06-15 03:15:52

by Chuck Lever III

Subject: [PATCH v2 04/24] xprtrdma: Move init and release helpers

Clean up: Moving these helpers in a separate patch makes later
patches more readable.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 120 +++++++++++++++++++++++++---------------
net/sunrpc/xprtrdma/frwr_ops.c | 90 +++++++++++++++---------------
2 files changed, 120 insertions(+), 90 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 958c792..b3f8699 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -35,6 +35,12 @@
/* Maximum scatter/gather per FMR */
#define RPCRDMA_MAX_FMR_SGES (64)

+/* Access mode of externally registered pages */
+enum {
+ RPCRDMA_FMR_ACCESS_FLAGS = IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_REMOTE_READ,
+};
+
static struct workqueue_struct *fmr_recovery_wq;

#define FMR_RECOVERY_WQ_FLAGS (WQ_UNBOUND)
@@ -60,6 +66,45 @@ fmr_destroy_recovery_wq(void)
}

static int
+__fmr_init(struct rpcrdma_mw *mw, struct ib_pd *pd)
+{
+ static struct ib_fmr_attr fmr_attr = {
+ .max_pages = RPCRDMA_MAX_FMR_SGES,
+ .max_maps = 1,
+ .page_shift = PAGE_SHIFT
+ };
+
+ mw->fmr.physaddrs = kcalloc(RPCRDMA_MAX_FMR_SGES,
+ sizeof(u64), GFP_KERNEL);
+ if (!mw->fmr.physaddrs)
+ goto out_free;
+
+ mw->mw_sg = kcalloc(RPCRDMA_MAX_FMR_SGES,
+ sizeof(*mw->mw_sg), GFP_KERNEL);
+ if (!mw->mw_sg)
+ goto out_free;
+
+ sg_init_table(mw->mw_sg, RPCRDMA_MAX_FMR_SGES);
+
+ mw->fmr.fmr = ib_alloc_fmr(pd, RPCRDMA_FMR_ACCESS_FLAGS,
+ &fmr_attr);
+ if (IS_ERR(mw->fmr.fmr))
+ goto out_fmr_err;
+
+ INIT_LIST_HEAD(&mw->fmr.fmr->list);
+ return 0;
+
+out_fmr_err:
+ dprintk("RPC: %s: ib_alloc_fmr returned %ld\n", __func__,
+ PTR_ERR(mw->fmr.fmr));
+
+out_free:
+ kfree(mw->mw_sg);
+ kfree(mw->fmr.physaddrs);
+ return -ENOMEM;
+}
+
+static int
__fmr_unmap(struct rpcrdma_mw *mw)
{
LIST_HEAD(l);
@@ -71,6 +116,30 @@ __fmr_unmap(struct rpcrdma_mw *mw)
return rc;
}

+static void
+__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+ struct ib_device *device = r_xprt->rx_ia.ri_device;
+ int nsegs = seg->mr_nsegs;
+
+ while (nsegs--)
+ rpcrdma_unmap_one(device, seg++);
+}
+
+static void
+__fmr_release(struct rpcrdma_mw *r)
+{
+ int rc;
+
+ kfree(r->fmr.physaddrs);
+ kfree(r->mw_sg);
+
+ rc = ib_dealloc_fmr(r->fmr.fmr);
+ if (rc)
+ pr_err("rpcrdma: final ib_dealloc_fmr for %p returned %i\n",
+ r, rc);
+}
+
/* Deferred reset of a single FMR. Generate a fresh rkey by
* replacing the MR. There's no recovery if this fails.
*/
@@ -119,12 +188,6 @@ static int
fmr_op_init(struct rpcrdma_xprt *r_xprt)
{
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
- int mr_access_flags = IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ;
- struct ib_fmr_attr fmr_attr = {
- .max_pages = RPCRDMA_MAX_FMR_SGES,
- .max_maps = 1,
- .page_shift = PAGE_SHIFT
- };
struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
struct rpcrdma_mw *r;
int i, rc;
@@ -138,36 +201,22 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
i *= buf->rb_max_requests; /* one set for each RPC slot */
dprintk("RPC: %s: initalizing %d FMRs\n", __func__, i);

- rc = -ENOMEM;
while (i--) {
r = kzalloc(sizeof(*r), GFP_KERNEL);
if (!r)
- goto out;
-
- r->fmr.physaddrs = kmalloc(RPCRDMA_MAX_FMR_SGES *
- sizeof(u64), GFP_KERNEL);
- if (!r->fmr.physaddrs)
- goto out_free;
+ return -ENOMEM;

- r->fmr.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
- if (IS_ERR(r->fmr.fmr))
- goto out_fmr_err;
- INIT_LIST_HEAD(&r->fmr.fmr->list);
+ rc = __fmr_init(r, pd);
+ if (rc) {
+ kfree(r);
+ return rc;
+ }

r->mw_xprt = r_xprt;
list_add(&r->mw_list, &buf->rb_mws);
list_add(&r->mw_all, &buf->rb_all);
}
return 0;
-
-out_fmr_err:
- rc = PTR_ERR(r->fmr.fmr);
- dprintk("RPC: %s: ib_alloc_fmr status %i\n", __func__, rc);
- kfree(r->fmr.physaddrs);
-out_free:
- kfree(r);
-out:
- return rc;
}

/* Use the ib_map_phys_fmr() verb to register a memory region
@@ -236,16 +285,6 @@ out_maperr:
return rc;
}

-static void
-__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
-{
- struct ib_device *device = r_xprt->rx_ia.ri_device;
- int nsegs = seg->mr_nsegs;
-
- while (nsegs--)
- rpcrdma_unmap_one(device, seg++);
-}
-
/* Invalidate all memory regions that were registered for "req".
*
* Sleeps until it is safe for the host CPU to access the
@@ -338,18 +377,11 @@ static void
fmr_op_destroy(struct rpcrdma_buffer *buf)
{
struct rpcrdma_mw *r;
- int rc;

while (!list_empty(&buf->rb_all)) {
r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
list_del(&r->mw_all);
- kfree(r->fmr.physaddrs);
-
- rc = ib_dealloc_fmr(r->fmr.fmr);
- if (rc)
- dprintk("RPC: %s: ib_dealloc_fmr failed %i\n",
- __func__, rc);
-
+ __fmr_release(r);
kfree(r);
}
}
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index f02ab80..9cd60bf0 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -99,6 +99,50 @@ frwr_destroy_recovery_wq(void)
}

static int
+__frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, unsigned int depth)
+{
+ struct rpcrdma_frmr *f = &r->frmr;
+ int rc;
+
+ f->fr_mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, depth);
+ if (IS_ERR(f->fr_mr))
+ goto out_mr_err;
+
+ r->mw_sg = kcalloc(depth, sizeof(*r->mw_sg), GFP_KERNEL);
+ if (!r->mw_sg)
+ goto out_list_err;
+
+ sg_init_table(r->mw_sg, depth);
+ init_completion(&f->fr_linv_done);
+ return 0;
+
+out_mr_err:
+ rc = PTR_ERR(f->fr_mr);
+ dprintk("RPC: %s: ib_alloc_mr status %i\n",
+ __func__, rc);
+ return rc;
+
+out_list_err:
+ rc = -ENOMEM;
+ dprintk("RPC: %s: sg allocation failure\n",
+ __func__);
+ ib_dereg_mr(f->fr_mr);
+ return rc;
+}
+
+static void
+__frwr_release(struct rpcrdma_mw *r)
+{
+ int rc;
+
+ rc = ib_dereg_mr(r->frmr.fr_mr);
+ if (rc)
+ pr_err("rpcrdma: final ib_dereg_mr for %p returned %i\n",
+ r, rc);
+ kfree(r->mw_sg);
+}
+
+static int
__frwr_reset_mr(struct rpcrdma_ia *ia, struct rpcrdma_mw *r)
{
struct rpcrdma_frmr *f = &r->frmr;
@@ -165,52 +209,6 @@ __frwr_queue_recovery(struct rpcrdma_mw *r)
}

static int
-__frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, unsigned int depth)
-{
- struct rpcrdma_frmr *f = &r->frmr;
- int rc;
-
- f->fr_mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, depth);
- if (IS_ERR(f->fr_mr))
- goto out_mr_err;
-
- r->mw_sg = kcalloc(depth, sizeof(*r->mw_sg), GFP_KERNEL);
- if (!r->mw_sg)
- goto out_list_err;
-
- sg_init_table(r->mw_sg, depth);
-
- init_completion(&f->fr_linv_done);
-
- return 0;
-
-out_mr_err:
- rc = PTR_ERR(f->fr_mr);
- dprintk("RPC: %s: ib_alloc_mr status %i\n",
- __func__, rc);
- return rc;
-
-out_list_err:
- rc = -ENOMEM;
- dprintk("RPC: %s: sg allocation failure\n",
- __func__);
- ib_dereg_mr(f->fr_mr);
- return rc;
-}
-
-static void
-__frwr_release(struct rpcrdma_mw *r)
-{
- int rc;
-
- rc = ib_dereg_mr(r->frmr.fr_mr);
- if (rc)
- dprintk("RPC: %s: ib_dereg_mr status %i\n",
- __func__, rc);
- kfree(r->mw_sg);
-}
-
-static int
frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
struct rpcrdma_create_data_internal *cdata)
{


2016-06-15 03:15:44

by Chuck Lever III

Subject: [PATCH v2 03/24] xprtrdma: Create common scatterlist fields in rpcrdma_mw

Clean up: FMR is about to replace the rpcrdma_map_one code with
scatterlists. Move the scatterlist fields out of the FRWR-specific
union and into the generic part of rpcrdma_mw.
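
For reference, the generic part of rpcrdma_mw ends up looking like
this (a condensed view of the xprt_rdma.h hunk at the end of this
patch):

	struct rpcrdma_mw {
		struct list_head	mw_list;
		struct scatterlist	*mw_sg;
		int			mw_nents;
		enum dma_data_direction	mw_dir;
		union {
			struct rpcrdma_fmr	fmr;
			struct rpcrdma_frmr	frmr;
		};
		/* ... */
	};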

One minor change: -EIO is now returned if FRWR registration fails.
The RPC is terminated immediately, since the problem is likely due
to a software bug and retrying is unlikely to help.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/frwr_ops.c | 85 +++++++++++++++++++--------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 8 ++--
2 files changed, 46 insertions(+), 47 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index c094754..f02ab80 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -125,17 +125,16 @@ __frwr_reset_mr(struct rpcrdma_ia *ia, struct rpcrdma_mw *r)
}

static void
-__frwr_reset_and_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
+__frwr_reset_and_unmap(struct rpcrdma_mw *mw)
{
+ struct rpcrdma_xprt *r_xprt = mw->mw_xprt;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
- struct rpcrdma_frmr *f = &mw->frmr;
int rc;

rc = __frwr_reset_mr(ia, mw);
- ib_dma_unmap_sg(ia->ri_device, f->fr_sg, f->fr_nents, f->fr_dir);
+ ib_dma_unmap_sg(ia->ri_device, mw->mw_sg, mw->mw_nents, mw->mw_dir);
if (rc)
return;
-
rpcrdma_put_mw(r_xprt, mw);
}

@@ -152,8 +151,7 @@ __frwr_recovery_worker(struct work_struct *work)
struct rpcrdma_mw *r = container_of(work, struct rpcrdma_mw,
mw_work);

- __frwr_reset_and_unmap(r->mw_xprt, r);
- return;
+ __frwr_reset_and_unmap(r);
}

/* A broken MR was discovered in a context that can't sleep.
@@ -167,8 +165,7 @@ __frwr_queue_recovery(struct rpcrdma_mw *r)
}

static int
-__frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct ib_device *device,
- unsigned int depth)
+__frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, unsigned int depth)
{
struct rpcrdma_frmr *f = &r->frmr;
int rc;
@@ -177,11 +174,11 @@ __frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct ib_device *device,
if (IS_ERR(f->fr_mr))
goto out_mr_err;

- f->fr_sg = kcalloc(depth, sizeof(*f->fr_sg), GFP_KERNEL);
- if (!f->fr_sg)
+ r->mw_sg = kcalloc(depth, sizeof(*r->mw_sg), GFP_KERNEL);
+ if (!r->mw_sg)
goto out_list_err;

- sg_init_table(f->fr_sg, depth);
+ sg_init_table(r->mw_sg, depth);

init_completion(&f->fr_linv_done);

@@ -210,7 +207,7 @@ __frwr_release(struct rpcrdma_mw *r)
if (rc)
dprintk("RPC: %s: ib_dereg_mr status %i\n",
__func__, rc);
- kfree(r->frmr.fr_sg);
+ kfree(r->mw_sg);
}

static int
@@ -350,7 +347,6 @@ static int
frwr_op_init(struct rpcrdma_xprt *r_xprt)
{
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
- struct ib_device *device = r_xprt->rx_ia.ri_device;
unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
int i;
@@ -372,7 +368,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
if (!r)
return -ENOMEM;

- rc = __frwr_init(r, pd, device, depth);
+ rc = __frwr_init(r, pd, depth);
if (rc) {
kfree(r);
return rc;
@@ -386,7 +382,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
return 0;
}

-/* Post a FAST_REG Work Request to register a memory region
+/* Post a REG_MR Work Request to register a memory region
* for remote access via RDMA READ or RDMA WRITE.
*/
static int
@@ -394,8 +390,6 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
int nsegs, bool writing)
{
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
- struct ib_device *device = ia->ri_device;
- enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw;
struct rpcrdma_frmr *frmr;
@@ -421,15 +415,14 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,

if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
-
for (i = 0; i < nsegs;) {
if (seg->mr_page)
- sg_set_page(&frmr->fr_sg[i],
+ sg_set_page(&mw->mw_sg[i],
seg->mr_page,
seg->mr_len,
offset_in_page(seg->mr_offset));
else
- sg_set_buf(&frmr->fr_sg[i], seg->mr_offset,
+ sg_set_buf(&mw->mw_sg[i], seg->mr_offset,
seg->mr_len);

++seg;
@@ -440,26 +433,20 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
break;
}
- frmr->fr_nents = i;
- frmr->fr_dir = direction;
-
- dma_nents = ib_dma_map_sg(device, frmr->fr_sg, frmr->fr_nents, direction);
- if (!dma_nents) {
- pr_err("RPC: %s: failed to dma map sg %p sg_nents %u\n",
- __func__, frmr->fr_sg, frmr->fr_nents);
- return -ENOMEM;
- }
+ mw->mw_nents = i;
+ mw->mw_dir = rpcrdma_data_dir(writing);

- n = ib_map_mr_sg(mr, frmr->fr_sg, frmr->fr_nents, NULL, PAGE_SIZE);
- if (unlikely(n != frmr->fr_nents)) {
- pr_err("RPC: %s: failed to map mr %p (%u/%u)\n",
- __func__, frmr->fr_mr, n, frmr->fr_nents);
- rc = n < 0 ? n : -EINVAL;
- goto out_senderr;
- }
+ dma_nents = ib_dma_map_sg(ia->ri_device,
+ mw->mw_sg, mw->mw_nents, mw->mw_dir);
+ if (!dma_nents)
+ goto out_dmamap_err;
+
+ n = ib_map_mr_sg(mr, mw->mw_sg, mw->mw_nents, NULL, PAGE_SIZE);
+ if (unlikely(n != mw->mw_nents))
+ goto out_mapmr_err;

dprintk("RPC: %s: Using frmr %p to map %u segments (%u bytes)\n",
- __func__, mw, frmr->fr_nents, mr->length);
+ __func__, mw, mw->mw_nents, mr->length);

key = (u8)(mr->rkey & 0x000000FF);
ib_update_fast_reg_key(mr, ++key);
@@ -484,13 +471,25 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
seg1->rl_mw = mw;
seg1->mr_rkey = mr->rkey;
seg1->mr_base = mr->iova;
- seg1->mr_nsegs = frmr->fr_nents;
+ seg1->mr_nsegs = mw->mw_nents;
seg1->mr_len = mr->length;

- return frmr->fr_nents;
+ return mw->mw_nents;
+
+out_dmamap_err:
+ pr_err("rpcrdma: failed to dma map sg %p sg_nents %u\n",
+ mw->mw_sg, mw->mw_nents);
+ return -ENOMEM;
+
+out_mapmr_err:
+ pr_err("rpcrdma: failed to map mr %p (%u/%u)\n",
+ frmr->fr_mr, n, mw->mw_nents);
+ rc = n < 0 ? n : -EIO;
+ __frwr_queue_recovery(mw);
+ return rc;

out_senderr:
- dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
+ pr_err("rpcrdma: ib_post_send status %i\n", rc);
__frwr_queue_recovery(mw);
return rc;
}
@@ -582,8 +581,8 @@ unmap:
mw = seg->rl_mw;
seg->rl_mw = NULL;

- ib_dma_unmap_sg(ia->ri_device, f->fr_sg, f->fr_nents,
- f->fr_dir);
+ ib_dma_unmap_sg(ia->ri_device,
+ mw->mw_sg, mw->mw_nents, mw->mw_dir);
rpcrdma_put_mw(r_xprt, mw);

i += seg->mr_nsegs;
@@ -630,7 +629,7 @@ frwr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
mw = seg->rl_mw;

if (sync)
- __frwr_reset_and_unmap(r_xprt, mw);
+ __frwr_reset_and_unmap(mw);
else
__frwr_queue_recovery(mw);

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 95cdc66..c53abd1 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -221,9 +221,6 @@ enum rpcrdma_frmr_state {
};

struct rpcrdma_frmr {
- struct scatterlist *fr_sg;
- int fr_nents;
- enum dma_data_direction fr_dir;
struct ib_mr *fr_mr;
struct ib_cqe fr_cqe;
enum rpcrdma_frmr_state fr_state;
@@ -240,13 +237,16 @@ struct rpcrdma_fmr {
};

struct rpcrdma_mw {
+ struct list_head mw_list;
+ struct scatterlist *mw_sg;
+ int mw_nents;
+ enum dma_data_direction mw_dir;
union {
struct rpcrdma_fmr fmr;
struct rpcrdma_frmr frmr;
};
struct work_struct mw_work;
struct rpcrdma_xprt *mw_xprt;
- struct list_head mw_list;
struct list_head mw_all;
};



2016-06-15 03:16:08

by Chuck Lever III

Subject: [PATCH v2 06/24] xprtrdma: Use scatterlist for DMA mapping and unmapping under FMR

The use of a scatterlist for handling DMA mapping and unmapping
was recently introduced in frwr_ops.c in commit 4143f34e01e9
("xprtrdma: Port to new memory registration API"). That commit did
not make a similar update to xprtrdma's FMR support because the
core ib_map_phys_fmr() and ib_unmap_fmr() APIs have not been changed
to take a scatterlist argument.

However, FMR still needs to do DMA mapping and unmapping. RDS, for
example, appears to use a scatterlist for this, then builds the DMA
address array for the ib_map_phys_fmr() call separately. SRP likewise
uses a scatterlist for DMA mapping. xprtrdma can do something similar.
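
In outline, that looks like the following (a sketch of the
fmr_op_map() hunk below; dma_pages points at the fm_physaddrs array
introduced earlier in this series):

	if (!ib_dma_map_sg(r_xprt->rx_ia.ri_device,
			   mw->mw_sg, mw->mw_nents, mw->mw_dir))
		goto out_dmamap_err;

	/* build the DMA address array that ib_map_phys_fmr() expects */
	for (i = 0; i < mw->mw_nents; i++)
		dma_pages[i] = sg_dma_address(&mw->mw_sg[i]);

	rc = ib_map_phys_fmr(mw->fmr.fm_mr, dma_pages, mw->mw_nents,
			     dma_pages[0]);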

This modernization is used immediately to properly defer DMA
unmapping during fmr_unmap_safe (addressing an existing FIXME). It
separates the DMA unmapping coordinates from the rl_segments array.
This array, being part of an rpcrdma_req, is always re-used
immediately when an RPC exits. A scatterlist is allocated in memory
independent of the rl_segments array, so it can be preserved
indefinitely (i.e., until the MR invalidation and DMA unmapping can
actually be done by a worker thread).

The FRWR and FMR DMA mapping code are slightly different from each
other now, and will diverge further when the "Check for holes" logic
can be removed from FRWR (support for SG_GAP MRs). So I chose not to
create helpers for the common-looking code.

Fixes: ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method")
Suggested-by: Sagi Grimberg <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 96 ++++++++++++++++++++++++-----------------
1 file changed, 57 insertions(+), 39 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index a6a67b4..3044593 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -117,13 +117,28 @@ __fmr_unmap(struct rpcrdma_mw *mw)
}

static void
-__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+__fmr_dma_unmap(struct rpcrdma_mw *mw)
{
- struct ib_device *device = r_xprt->rx_ia.ri_device;
- int nsegs = seg->mr_nsegs;
+ struct rpcrdma_xprt *r_xprt = mw->mw_xprt;

- while (nsegs--)
- rpcrdma_unmap_one(device, seg++);
+ ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
+ mw->mw_sg, mw->mw_nents, mw->mw_dir);
+ rpcrdma_put_mw(r_xprt, mw);
+}
+
+static void
+__fmr_reset_and_unmap(struct rpcrdma_mw *mw)
+{
+ int rc;
+
+ /* ORDER */
+ rc = __fmr_unmap(mw);
+ if (rc) {
+ pr_warn("rpcrdma: ib_unmap_fmr status %d, fmr %p orphaned\n",
+ rc, mw);
+ return;
+ }
+ __fmr_dma_unmap(mw);
}

static void
@@ -147,11 +162,9 @@ static void
__fmr_recovery_worker(struct work_struct *work)
{
struct rpcrdma_mw *mw = container_of(work, struct rpcrdma_mw,
- mw_work);
- struct rpcrdma_xprt *r_xprt = mw->mw_xprt;
+ mw_work);

- __fmr_unmap(mw);
- rpcrdma_put_mw(r_xprt, mw);
+ __fmr_reset_and_unmap(mw);
return;
}

@@ -226,12 +239,10 @@ static int
fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
int nsegs, bool writing)
{
- struct rpcrdma_ia *ia = &r_xprt->rx_ia;
- struct ib_device *device = ia->ri_device;
- enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
int len, pageoff, i, rc;
struct rpcrdma_mw *mw;
+ u64 *dma_pages;

mw = seg1->rl_mw;
seg1->rl_mw = NULL;
@@ -253,8 +264,14 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (nsegs > RPCRDMA_MAX_FMR_SGES)
nsegs = RPCRDMA_MAX_FMR_SGES;
for (i = 0; i < nsegs;) {
- rpcrdma_map_one(device, seg, direction);
- mw->fmr.fm_physaddrs[i] = seg->mr_dma;
+ if (seg->mr_page)
+ sg_set_page(&mw->mw_sg[i],
+ seg->mr_page,
+ seg->mr_len,
+ offset_in_page(seg->mr_offset));
+ else
+ sg_set_buf(&mw->mw_sg[i], seg->mr_offset,
+ seg->mr_len);
len += seg->mr_len;
++seg;
++i;
@@ -263,25 +280,37 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
break;
}
+ mw->mw_nents = i;
+ mw->mw_dir = rpcrdma_data_dir(writing);
+
+ if (!ib_dma_map_sg(r_xprt->rx_ia.ri_device,
+ mw->mw_sg, mw->mw_nents, mw->mw_dir))
+ goto out_dmamap_err;

- rc = ib_map_phys_fmr(mw->fmr.fm_mr, mw->fmr.fm_physaddrs,
- i, seg1->mr_dma);
+ for (i = 0, dma_pages = mw->fmr.fm_physaddrs; i < mw->mw_nents; i++)
+ dma_pages[i] = sg_dma_address(&mw->mw_sg[i]);
+ rc = ib_map_phys_fmr(mw->fmr.fm_mr, dma_pages, mw->mw_nents,
+ dma_pages[0]);
if (rc)
goto out_maperr;

seg1->rl_mw = mw;
seg1->mr_rkey = mw->fmr.fm_mr->rkey;
- seg1->mr_base = seg1->mr_dma + pageoff;
- seg1->mr_nsegs = i;
+ seg1->mr_base = dma_pages[0] + pageoff;
+ seg1->mr_nsegs = mw->mw_nents;
seg1->mr_len = len;
- return i;
+ return mw->mw_nents;
+
+out_dmamap_err:
+ pr_err("rpcrdma: failed to dma map sg %p sg_nents %u\n",
+ mw->mw_sg, mw->mw_nents);
+ return -ENOMEM;

out_maperr:
- dprintk("RPC: %s: ib_map_phys_fmr %u@0x%llx+%i (%d) status %i\n",
- __func__, len, (unsigned long long)seg1->mr_dma,
- pageoff, i, rc);
- while (i--)
- rpcrdma_unmap_one(device, --seg);
+ pr_err("rpcrdma: ib_map_phys_fmr %u@0x%llx+%i (%d) status %i\n",
+ len, (unsigned long long)dma_pages[0],
+ pageoff, mw->mw_nents, rc);
+ __fmr_dma_unmap(mw);
return rc;
}

@@ -326,8 +355,7 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
mw = seg->rl_mw;

list_del_init(&mw->fmr.fm_mr->list);
- __fmr_dma_unmap(r_xprt, seg);
- rpcrdma_put_mw(r_xprt, seg->rl_mw);
+ __fmr_dma_unmap(mw);

i += seg->mr_nsegs;
seg->mr_nsegs = 0;
@@ -339,11 +367,6 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)

/* Use a slow, safe mechanism to invalidate all memory regions
* that were registered for "req".
- *
- * In the asynchronous case, DMA unmapping occurs first here
- * because the rpcrdma_mr_seg is released immediately after this
- * call. It's contents won't be available in __fmr_dma_unmap later.
- * FIXME.
*/
static void
fmr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
@@ -357,15 +380,10 @@ fmr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
seg = &req->rl_segments[i];
mw = seg->rl_mw;

- if (sync) {
- /* ORDER */
- __fmr_unmap(mw);
- __fmr_dma_unmap(r_xprt, seg);
- rpcrdma_put_mw(r_xprt, mw);
- } else {
- __fmr_dma_unmap(r_xprt, seg);
+ if (sync)
+ __fmr_reset_and_unmap(mw);
+ else
__fmr_queue_recovery(mw);
- }

i += seg->mr_nsegs;
seg->mr_nsegs = 0;


2016-06-15 03:16:10

by Chuck Lever III

Subject: [PATCH v2 05/24] xprtrdma: Rename fields in rpcrdma_fmr

Clean up: Use the same naming convention used in other
RPC/RDMA-related data structures.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 36 ++++++++++++++++++------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 4 ++--
2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index b3f8699..a6a67b4 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -74,9 +74,9 @@ __fmr_init(struct rpcrdma_mw *mw, struct ib_pd *pd)
.page_shift = PAGE_SHIFT
};

- mw->fmr.physaddrs = kcalloc(RPCRDMA_MAX_FMR_SGES,
- sizeof(u64), GFP_KERNEL);
- if (!mw->fmr.physaddrs)
+ mw->fmr.fm_physaddrs = kcalloc(RPCRDMA_MAX_FMR_SGES,
+ sizeof(u64), GFP_KERNEL);
+ if (!mw->fmr.fm_physaddrs)
goto out_free;

mw->mw_sg = kcalloc(RPCRDMA_MAX_FMR_SGES,
@@ -86,21 +86,21 @@ __fmr_init(struct rpcrdma_mw *mw, struct ib_pd *pd)

sg_init_table(mw->mw_sg, RPCRDMA_MAX_FMR_SGES);

- mw->fmr.fmr = ib_alloc_fmr(pd, RPCRDMA_FMR_ACCESS_FLAGS,
- &fmr_attr);
- if (IS_ERR(mw->fmr.fmr))
+ mw->fmr.fm_mr = ib_alloc_fmr(pd, RPCRDMA_FMR_ACCESS_FLAGS,
+ &fmr_attr);
+ if (IS_ERR(mw->fmr.fm_mr))
goto out_fmr_err;

- INIT_LIST_HEAD(&mw->fmr.fmr->list);
+ INIT_LIST_HEAD(&mw->fmr.fm_mr->list);
return 0;

out_fmr_err:
dprintk("RPC: %s: ib_alloc_fmr returned %ld\n", __func__,
- PTR_ERR(mw->fmr.fmr));
+ PTR_ERR(mw->fmr.fm_mr));

out_free:
kfree(mw->mw_sg);
- kfree(mw->fmr.physaddrs);
+ kfree(mw->fmr.fm_physaddrs);
return -ENOMEM;
}

@@ -110,9 +110,9 @@ __fmr_unmap(struct rpcrdma_mw *mw)
LIST_HEAD(l);
int rc;

- list_add(&mw->fmr.fmr->list, &l);
+ list_add(&mw->fmr.fm_mr->list, &l);
rc = ib_unmap_fmr(&l);
- list_del_init(&mw->fmr.fmr->list);
+ list_del_init(&mw->fmr.fm_mr->list);
return rc;
}

@@ -131,10 +131,10 @@ __fmr_release(struct rpcrdma_mw *r)
{
int rc;

- kfree(r->fmr.physaddrs);
+ kfree(r->fmr.fm_physaddrs);
kfree(r->mw_sg);

- rc = ib_dealloc_fmr(r->fmr.fmr);
+ rc = ib_dealloc_fmr(r->fmr.fm_mr);
if (rc)
pr_err("rpcrdma: final ib_dealloc_fmr for %p returned %i\n",
r, rc);
@@ -254,7 +254,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
nsegs = RPCRDMA_MAX_FMR_SGES;
for (i = 0; i < nsegs;) {
rpcrdma_map_one(device, seg, direction);
- mw->fmr.physaddrs[i] = seg->mr_dma;
+ mw->fmr.fm_physaddrs[i] = seg->mr_dma;
len += seg->mr_len;
++seg;
++i;
@@ -264,13 +264,13 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
break;
}

- rc = ib_map_phys_fmr(mw->fmr.fmr, mw->fmr.physaddrs,
+ rc = ib_map_phys_fmr(mw->fmr.fm_mr, mw->fmr.fm_physaddrs,
i, seg1->mr_dma);
if (rc)
goto out_maperr;

seg1->rl_mw = mw;
- seg1->mr_rkey = mw->fmr.fmr->rkey;
+ seg1->mr_rkey = mw->fmr.fm_mr->rkey;
seg1->mr_base = seg1->mr_dma + pageoff;
seg1->mr_nsegs = i;
seg1->mr_len = len;
@@ -310,7 +310,7 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
seg = &req->rl_segments[i];
mw = seg->rl_mw;

- list_add_tail(&mw->fmr.fmr->list, &unmap_list);
+ list_add_tail(&mw->fmr.fm_mr->list, &unmap_list);

i += seg->mr_nsegs;
}
@@ -325,7 +325,7 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
seg = &req->rl_segments[i];
mw = seg->rl_mw;

- list_del_init(&mw->fmr.fmr->list);
+ list_del_init(&mw->fmr.fm_mr->list);
__fmr_dma_unmap(r_xprt, seg);
rpcrdma_put_mw(r_xprt, seg->rl_mw);

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index c53abd1..04696c0 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -232,8 +232,8 @@ struct rpcrdma_frmr {
};

struct rpcrdma_fmr {
- struct ib_fmr *fmr;
- u64 *physaddrs;
+ struct ib_fmr *fm_mr;
+ u64 *fm_physaddrs;
};

struct rpcrdma_mw {


2016-06-15 03:16:17

by Chuck Lever III

Subject: [PATCH v2 07/24] xprtrdma: Refactor MR recovery work queues

I found that commit ead3f26e359e ("xprtrdma: Add ro_unmap_safe
memreg method"), which introduces ro_unmap_safe, never wired up the
FMR recovery worker.

The FMR and FRWR recovery work queues both do the same thing.
Instead of setting up separate work queues for this, schedule a
delayed worker to deal with stale MRs, since recovering MRs is not
performance-critical.
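
The deferral path itself is small (a sketch of
rpcrdma_defer_mr_recovery() as added in verbs.c below):

	/* the caller may not sleep: park the MW and kick the worker */
	spin_lock(&buf->rb_recovery_lock);
	list_add(&mw->mw_list, &buf->rb_stale_mrs);
	spin_unlock(&buf->rb_recovery_lock);

	schedule_delayed_work(&buf->rb_recovery_worker, 0);

	/* rpcrdma_mr_recovery_worker() then drains rb_stale_mrs and
	 * invokes ->ro_recover_mr on each MW in process context.
	 */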

Fixes: ead3f26e359e ("xprtrdma: Add ro_unmap_safe memreg method")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 147 ++++++++++++++++-----------------------
net/sunrpc/xprtrdma/frwr_ops.c | 82 +++++-----------------
net/sunrpc/xprtrdma/transport.c | 16 +---
net/sunrpc/xprtrdma/verbs.c | 43 +++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h | 13 ++-
5 files changed, 135 insertions(+), 166 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 3044593..43bfb5d 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -19,13 +19,6 @@
* verb (fmr_op_unmap).
*/

-/* Transport recovery
- *
- * After a transport reconnect, fmr_op_map re-uses the MR already
- * allocated for the RPC, but generates a fresh rkey then maps the
- * MR again. This process is synchronous.
- */
-
#include "xprt_rdma.h"

#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
@@ -41,30 +34,6 @@ enum {
IB_ACCESS_REMOTE_READ,
};

-static struct workqueue_struct *fmr_recovery_wq;
-
-#define FMR_RECOVERY_WQ_FLAGS (WQ_UNBOUND)
-
-int
-fmr_alloc_recovery_wq(void)
-{
- fmr_recovery_wq = alloc_workqueue("fmr_recovery", WQ_UNBOUND, 0);
- return !fmr_recovery_wq ? -ENOMEM : 0;
-}
-
-void
-fmr_destroy_recovery_wq(void)
-{
- struct workqueue_struct *wq;
-
- if (!fmr_recovery_wq)
- return;
-
- wq = fmr_recovery_wq;
- fmr_recovery_wq = NULL;
- destroy_workqueue(wq);
-}
-
static int
__fmr_init(struct rpcrdma_mw *mw, struct ib_pd *pd)
{
@@ -117,65 +86,55 @@ __fmr_unmap(struct rpcrdma_mw *mw)
}

static void
-__fmr_dma_unmap(struct rpcrdma_mw *mw)
-{
- struct rpcrdma_xprt *r_xprt = mw->mw_xprt;
-
- ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
- mw->mw_sg, mw->mw_nents, mw->mw_dir);
- rpcrdma_put_mw(r_xprt, mw);
-}
-
-static void
-__fmr_reset_and_unmap(struct rpcrdma_mw *mw)
-{
- int rc;
-
- /* ORDER */
- rc = __fmr_unmap(mw);
- if (rc) {
- pr_warn("rpcrdma: ib_unmap_fmr status %d, fmr %p orphaned\n",
- rc, mw);
- return;
- }
- __fmr_dma_unmap(mw);
-}
-
-static void
__fmr_release(struct rpcrdma_mw *r)
{
+ LIST_HEAD(unmap_list);
int rc;

kfree(r->fmr.fm_physaddrs);
kfree(r->mw_sg);

+ /* In case this one was left mapped, try to unmap it
+ * to prevent dealloc_fmr from failing with EBUSY
+ */
+ rc = __fmr_unmap(r);
+ if (rc)
+ pr_err("rpcrdma: final ib_unmap_fmr for %p failed %i\n",
+ r, rc);
+
rc = ib_dealloc_fmr(r->fmr.fm_mr);
if (rc)
pr_err("rpcrdma: final ib_dealloc_fmr for %p returned %i\n",
r, rc);
}

-/* Deferred reset of a single FMR. Generate a fresh rkey by
- * replacing the MR. There's no recovery if this fails.
+/* Reset of a single FMR.
+ *
+ * There's no recovery if this fails. The FMR is abandoned, but
+ * remains in rb_all. It will be cleaned up when the transport is
+ * destroyed.
*/
static void
-__fmr_recovery_worker(struct work_struct *work)
+fmr_op_recover_mr(struct rpcrdma_mw *mw)
{
- struct rpcrdma_mw *mw = container_of(work, struct rpcrdma_mw,
- mw_work);
+ struct rpcrdma_xprt *r_xprt = mw->mw_xprt;
+ int rc;

- __fmr_reset_and_unmap(mw);
- return;
-}
+ /* ORDER: invalidate first */
+ rc = __fmr_unmap(mw);

-/* A broken MR was discovered in a context that can't sleep.
- * Defer recovery to the recovery worker.
- */
-static void
-__fmr_queue_recovery(struct rpcrdma_mw *mw)
-{
- INIT_WORK(&mw->mw_work, __fmr_recovery_worker);
- queue_work(fmr_recovery_wq, &mw->mw_work);
+ /* ORDER: then DMA unmap */
+ ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
+ mw->mw_sg, mw->mw_nents, mw->mw_dir);
+ if (rc) {
+ pr_err("rpcrdma: FMR reset status %d, %p orphaned\n",
+ rc, mw);
+ r_xprt->rx_stats.mrs_orphaned++;
+ return;
+ }
+
+ rpcrdma_put_mw(r_xprt, mw);
+ r_xprt->rx_stats.mrs_recovered++;
}

static int
@@ -246,16 +205,11 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,

mw = seg1->rl_mw;
seg1->rl_mw = NULL;
- if (!mw) {
- mw = rpcrdma_get_mw(r_xprt);
- if (!mw)
- return -ENOMEM;
- } else {
- /* this is a retransmit; generate a fresh rkey */
- rc = __fmr_unmap(mw);
- if (rc)
- return rc;
- }
+ if (mw)
+ rpcrdma_defer_mr_recovery(mw);
+ mw = rpcrdma_get_mw(r_xprt);
+ if (!mw)
+ return -ENOMEM;

pageoff = offset_in_page(seg1->mr_offset);
seg1->mr_offset -= pageoff; /* start of page */
@@ -310,7 +264,7 @@ out_maperr:
pr_err("rpcrdma: ib_map_phys_fmr %u@0x%llx+%i (%d) status %i\n",
len, (unsigned long long)dma_pages[0],
pageoff, mw->mw_nents, rc);
- __fmr_dma_unmap(mw);
+ rpcrdma_defer_mr_recovery(mw);
return rc;
}

@@ -333,7 +287,7 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
/* ORDER: Invalidate all of the req's MRs first
*
* ib_unmap_fmr() is slow, so use a single call instead
- * of one call per mapped MR.
+ * of one call per mapped FMR.
*/
for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
seg = &req->rl_segments[i];
@@ -345,7 +299,7 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
}
rc = ib_unmap_fmr(&unmap_list);
if (rc)
- pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc);
+ goto out_reset;

/* ORDER: Now DMA unmap all of the req's MRs, and return
* them to the free MW list.
@@ -355,7 +309,9 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
mw = seg->rl_mw;

list_del_init(&mw->fmr.fm_mr->list);
- __fmr_dma_unmap(mw);
+ ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
+ mw->mw_sg, mw->mw_nents, mw->mw_dir);
+ rpcrdma_put_mw(r_xprt, mw);

i += seg->mr_nsegs;
seg->mr_nsegs = 0;
@@ -363,6 +319,20 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
}

req->rl_nchunks = 0;
+ return;
+
+out_reset:
+ pr_err("rpcrdma: ib_unmap_fmr failed (%i)\n", rc);
+
+ for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+ seg = &req->rl_segments[i];
+ mw = seg->rl_mw;
+
+ list_del_init(&mw->fmr.fm_mr->list);
+ fmr_op_recover_mr(mw);
+
+ i += seg->mr_nsegs;
+ }
}

/* Use a slow, safe mechanism to invalidate all memory regions
@@ -381,9 +351,9 @@ fmr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
mw = seg->rl_mw;

if (sync)
- __fmr_reset_and_unmap(mw);
+ fmr_op_recover_mr(mw);
else
- __fmr_queue_recovery(mw);
+ rpcrdma_defer_mr_recovery(mw);

i += seg->mr_nsegs;
seg->mr_nsegs = 0;
@@ -408,6 +378,7 @@ const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap_sync = fmr_op_unmap_sync,
.ro_unmap_safe = fmr_op_unmap_safe,
+ .ro_recover_mr = fmr_op_recover_mr,
.ro_open = fmr_op_open,
.ro_maxpages = fmr_op_maxpages,
.ro_init = fmr_op_init,
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 9cd60bf0..cbb2d05 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -73,31 +73,6 @@
# define RPCDBG_FACILITY RPCDBG_TRANS
#endif

-static struct workqueue_struct *frwr_recovery_wq;
-
-#define FRWR_RECOVERY_WQ_FLAGS (WQ_UNBOUND | WQ_MEM_RECLAIM)
-
-int
-frwr_alloc_recovery_wq(void)
-{
- frwr_recovery_wq = alloc_workqueue("frwr_recovery",
- FRWR_RECOVERY_WQ_FLAGS, 0);
- return !frwr_recovery_wq ? -ENOMEM : 0;
-}
-
-void
-frwr_destroy_recovery_wq(void)
-{
- struct workqueue_struct *wq;
-
- if (!frwr_recovery_wq)
- return;
-
- wq = frwr_recovery_wq;
- frwr_recovery_wq = NULL;
- destroy_workqueue(wq);
-}
-
static int
__frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, unsigned int depth)
{
@@ -168,8 +143,14 @@ __frwr_reset_mr(struct rpcrdma_ia *ia, struct rpcrdma_mw *r)
return 0;
}

+/* Reset of a single FRMR. Generate a fresh rkey by replacing the MR.
+ *
+ * There's no recovery if this fails. The FRMR is abandoned, but
+ * remains in rb_all. It will be cleaned up when the transport is
+ * destroyed.
+ */
static void
-__frwr_reset_and_unmap(struct rpcrdma_mw *mw)
+frwr_op_recover_mr(struct rpcrdma_mw *mw)
{
struct rpcrdma_xprt *r_xprt = mw->mw_xprt;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
@@ -177,35 +158,15 @@ __frwr_reset_and_unmap(struct rpcrdma_mw *mw)

rc = __frwr_reset_mr(ia, mw);
ib_dma_unmap_sg(ia->ri_device, mw->mw_sg, mw->mw_nents, mw->mw_dir);
- if (rc)
+ if (rc) {
+ pr_err("rpcrdma: FRMR reset status %d, %p orphaned\n",
+ rc, mw);
+ r_xprt->rx_stats.mrs_orphaned++;
return;
- rpcrdma_put_mw(r_xprt, mw);
-}
-
-/* Deferred reset of a single FRMR. Generate a fresh rkey by
- * replacing the MR.
- *
- * There's no recovery if this fails. The FRMR is abandoned, but
- * remains in rb_all. It will be cleaned up when the transport is
- * destroyed.
- */
-static void
-__frwr_recovery_worker(struct work_struct *work)
-{
- struct rpcrdma_mw *r = container_of(work, struct rpcrdma_mw,
- mw_work);
-
- __frwr_reset_and_unmap(r);
-}
+ }

-/* A broken MR was discovered in a context that can't sleep.
- * Defer recovery to the recovery worker.
- */
-static void
-__frwr_queue_recovery(struct rpcrdma_mw *r)
-{
- INIT_WORK(&r->mw_work, __frwr_recovery_worker);
- queue_work(frwr_recovery_wq, &r->mw_work);
+ rpcrdma_put_mw(r_xprt, mw);
+ r_xprt->rx_stats.mrs_recovered++;
}

static int
@@ -401,7 +362,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
seg1->rl_mw = NULL;
do {
if (mw)
- __frwr_queue_recovery(mw);
+ rpcrdma_defer_mr_recovery(mw);
mw = rpcrdma_get_mw(r_xprt);
if (!mw)
return -ENOMEM;
@@ -483,12 +444,11 @@ out_mapmr_err:
pr_err("rpcrdma: failed to map mr %p (%u/%u)\n",
frmr->fr_mr, n, mw->mw_nents);
rc = n < 0 ? n : -EIO;
- __frwr_queue_recovery(mw);
+ rpcrdma_defer_mr_recovery(mw);
return rc;

out_senderr:
- pr_err("rpcrdma: ib_post_send status %i\n", rc);
- __frwr_queue_recovery(mw);
+ rpcrdma_defer_mr_recovery(mw);
return rc;
}

@@ -627,9 +587,9 @@ frwr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
mw = seg->rl_mw;

if (sync)
- __frwr_reset_and_unmap(mw);
+ frwr_op_recover_mr(mw);
else
- __frwr_queue_recovery(mw);
+ rpcrdma_defer_mr_recovery(mw);

i += seg->mr_nsegs;
seg->mr_nsegs = 0;
@@ -642,9 +602,6 @@ frwr_op_destroy(struct rpcrdma_buffer *buf)
{
struct rpcrdma_mw *r;

- /* Ensure stale MWs for "buf" are no longer in flight */
- flush_workqueue(frwr_recovery_wq);
-
while (!list_empty(&buf->rb_all)) {
r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
list_del(&r->mw_all);
@@ -657,6 +614,7 @@ const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap_sync = frwr_op_unmap_sync,
.ro_unmap_safe = frwr_op_unmap_safe,
+ .ro_recover_mr = frwr_op_recover_mr,
.ro_open = frwr_op_open,
.ro_maxpages = frwr_op_maxpages,
.ro_init = frwr_op_init,
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 99d2e5b..4c8e7f1 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -660,7 +660,7 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
xprt->stat.bad_xids,
xprt->stat.req_u,
xprt->stat.bklog_u);
- seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %lu %lu %lu %lu\n",
+ seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %lu %lu %lu %lu ",
r_xprt->rx_stats.read_chunk_count,
r_xprt->rx_stats.write_chunk_count,
r_xprt->rx_stats.reply_chunk_count,
@@ -672,6 +672,9 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
r_xprt->rx_stats.failed_marshal_count,
r_xprt->rx_stats.bad_reply_count,
r_xprt->rx_stats.nomsg_call_count);
+ seq_printf(seq, "%lu %lu\n",
+ r_xprt->rx_stats.mrs_recovered,
+ r_xprt->rx_stats.mrs_orphaned);
}

static int
@@ -741,7 +744,6 @@ void xprt_rdma_cleanup(void)
__func__, rc);

rpcrdma_destroy_wq();
- frwr_destroy_recovery_wq();

rc = xprt_unregister_transport(&xprt_rdma_bc);
if (rc)
@@ -753,20 +755,13 @@ int xprt_rdma_init(void)
{
int rc;

- rc = frwr_alloc_recovery_wq();
- if (rc)
- return rc;
-
rc = rpcrdma_alloc_wq();
- if (rc) {
- frwr_destroy_recovery_wq();
+ if (rc)
return rc;
- }

rc = xprt_register_transport(&xprt_rdma);
if (rc) {
rpcrdma_destroy_wq();
- frwr_destroy_recovery_wq();
return rc;
}

@@ -774,7 +769,6 @@ int xprt_rdma_init(void)
if (rc) {
xprt_unregister_transport(&xprt_rdma);
rpcrdma_destroy_wq();
- frwr_destroy_recovery_wq();
return rc;
}

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b044d98a..77a371d 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -777,6 +777,41 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
ib_drain_qp(ia->ri_id->qp);
}

+static void
+rpcrdma_mr_recovery_worker(struct work_struct *work)
+{
+ struct rpcrdma_buffer *buf = container_of(work, struct rpcrdma_buffer,
+ rb_recovery_worker.work);
+ struct rpcrdma_mw *mw;
+
+ spin_lock(&buf->rb_recovery_lock);
+ while (!list_empty(&buf->rb_stale_mrs)) {
+ mw = list_first_entry(&buf->rb_stale_mrs,
+ struct rpcrdma_mw, mw_list);
+ list_del_init(&mw->mw_list);
+ spin_unlock(&buf->rb_recovery_lock);
+
+ dprintk("RPC: %s: recovering MR %p\n", __func__, mw);
+ mw->mw_xprt->rx_ia.ri_ops->ro_recover_mr(mw);
+
+ spin_lock(&buf->rb_recovery_lock);
+ };
+ spin_unlock(&buf->rb_recovery_lock);
+}
+
+void
+rpcrdma_defer_mr_recovery(struct rpcrdma_mw *mw)
+{
+ struct rpcrdma_xprt *r_xprt = mw->mw_xprt;
+ struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+
+ spin_lock(&buf->rb_recovery_lock);
+ list_add(&mw->mw_list, &buf->rb_stale_mrs);
+ spin_unlock(&buf->rb_recovery_lock);
+
+ schedule_delayed_work(&buf->rb_recovery_worker, 0);
+}
+
struct rpcrdma_req *
rpcrdma_create_req(struct rpcrdma_xprt *r_xprt)
{
@@ -837,8 +872,12 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)

buf->rb_max_requests = r_xprt->rx_data.max_requests;
buf->rb_bc_srv_max_requests = 0;
- spin_lock_init(&buf->rb_lock);
atomic_set(&buf->rb_credits, 1);
+ spin_lock_init(&buf->rb_lock);
+ spin_lock_init(&buf->rb_recovery_lock);
+ INIT_LIST_HEAD(&buf->rb_stale_mrs);
+ INIT_DELAYED_WORK(&buf->rb_recovery_worker,
+ rpcrdma_mr_recovery_worker);

rc = ia->ri_ops->ro_init(r_xprt);
if (rc)
@@ -923,6 +962,8 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
{
struct rpcrdma_ia *ia = rdmab_to_ia(buf);

+ cancel_delayed_work_sync(&buf->rb_recovery_worker);
+
while (!list_empty(&buf->rb_recv_bufs)) {
struct rpcrdma_rep *rep;

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 04696c0..4e03037 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -245,7 +245,6 @@ struct rpcrdma_mw {
struct rpcrdma_fmr fmr;
struct rpcrdma_frmr frmr;
};
- struct work_struct mw_work;
struct rpcrdma_xprt *mw_xprt;
struct list_head mw_all;
};
@@ -341,6 +340,10 @@ struct rpcrdma_buffer {
struct list_head rb_allreqs;

u32 rb_bc_max_requests;
+
+ spinlock_t rb_recovery_lock; /* protect rb_stale_mrs */
+ struct list_head rb_stale_mrs;
+ struct delayed_work rb_recovery_worker;
};
#define rdmab_to_ia(b) (&container_of((b), struct rpcrdma_xprt, rx_buf)->rx_ia)

@@ -387,6 +390,8 @@ struct rpcrdma_stats {
unsigned long bad_reply_count;
unsigned long nomsg_call_count;
unsigned long bcall_count;
+ unsigned long mrs_recovered;
+ unsigned long mrs_orphaned;
};

/*
@@ -400,6 +405,7 @@ struct rpcrdma_memreg_ops {
struct rpcrdma_req *);
void (*ro_unmap_safe)(struct rpcrdma_xprt *,
struct rpcrdma_req *, bool);
+ void (*ro_recover_mr)(struct rpcrdma_mw *);
int (*ro_open)(struct rpcrdma_ia *,
struct rpcrdma_ep *,
struct rpcrdma_create_data_internal *);
@@ -477,6 +483,8 @@ void rpcrdma_buffer_put(struct rpcrdma_req *);
void rpcrdma_recv_buffer_get(struct rpcrdma_req *);
void rpcrdma_recv_buffer_put(struct rpcrdma_rep *);

+void rpcrdma_defer_mr_recovery(struct rpcrdma_mw *);
+
struct rpcrdma_regbuf *rpcrdma_alloc_regbuf(struct rpcrdma_ia *,
size_t, gfp_t);
void rpcrdma_free_regbuf(struct rpcrdma_ia *,
@@ -484,9 +492,6 @@ void rpcrdma_free_regbuf(struct rpcrdma_ia *,

int rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *, unsigned int);

-int frwr_alloc_recovery_wq(void);
-void frwr_destroy_recovery_wq(void);
-
int rpcrdma_alloc_wq(void);
void rpcrdma_destroy_wq(void);



2016-06-15 03:16:24

by Chuck Lever III

Subject: [PATCH v2 08/24] xprtrdma: Do not leak an MW during a DMA map failure

Based on code audit.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 1 +
net/sunrpc/xprtrdma/frwr_ops.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 43bfb5d..eb42d7f 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -258,6 +258,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
out_dmamap_err:
pr_err("rpcrdma: failed to dma map sg %p sg_nents %u\n",
mw->mw_sg, mw->mw_nents);
+ rpcrdma_defer_mr_recovery(mw);
return -ENOMEM;

out_maperr:
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index cbb2d05..c9ead2b 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -438,6 +438,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
out_dmamap_err:
pr_err("rpcrdma: failed to dma map sg %p sg_nents %u\n",
mw->mw_sg, mw->mw_nents);
+ rpcrdma_defer_mr_recovery(mw);
return -ENOMEM;

out_mapmr_err:


2016-06-15 03:16:32

by Chuck Lever III

Subject: [PATCH v2 09/24] xprtrdma: Remove ALLPHYSICAL memory registration mode

No HCA or RNIC in the kernel tree requires the use of ALLPHYSICAL.

ALLPHYSICAL advertises in the clear on the network fabric an R_key
that is good for all of the client's memory. No known exploit
exists, but theoretically any user on the server can use that R_key
on the client's QP to read or update any part of the client's memory.

ALLPHYSICAL exposes the client to server bugs, including:
o base/bounds errors causing data outside the I/O buffer to be
accessed
o RDMA access after reply causing data corruption and/or integrity
failure

ALLPHYSICAL can't protect application memory regions from server
update after a local signal or soft timeout has terminated an RPC.

ALLPHYSICAL chunks are no larger than a page. Special cases to
handle small chunks and long chunk lists have been a source of
implementation complexity and bugs.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/Makefile | 2 -
net/sunrpc/xprtrdma/physical_ops.c | 122 ------------------------------------
net/sunrpc/xprtrdma/verbs.c | 15 ----
net/sunrpc/xprtrdma/xprt_rdma.h | 5 -
4 files changed, 2 insertions(+), 142 deletions(-)
delete mode 100644 net/sunrpc/xprtrdma/physical_ops.c

diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index dc9f3b5..ef19fa4 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -1,7 +1,7 @@
obj-$(CONFIG_SUNRPC_XPRT_RDMA) += rpcrdma.o

rpcrdma-y := transport.o rpc_rdma.o verbs.o \
- fmr_ops.o frwr_ops.o physical_ops.o \
+ fmr_ops.o frwr_ops.o \
svc_rdma.o svc_rdma_backchannel.o svc_rdma_transport.o \
svc_rdma_marshal.o svc_rdma_sendto.o svc_rdma_recvfrom.o \
module.o
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
deleted file mode 100644
index 3750596..0000000
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ /dev/null
@@ -1,122 +0,0 @@
-/*
- * Copyright (c) 2015 Oracle. All rights reserved.
- * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
- */
-
-/* No-op chunk preparation. All client memory is pre-registered.
- * Sometimes referred to as ALLPHYSICAL mode.
- *
- * Physical registration is simple because all client memory is
- * pre-registered and never deregistered. This mode is good for
- * adapter bring up, but is considered not safe: the server is
- * trusted not to abuse its access to client memory not involved
- * in RDMA I/O.
- */
-
-#include "xprt_rdma.h"
-
-#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
-# define RPCDBG_FACILITY RPCDBG_TRANS
-#endif
-
-static int
-physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
- struct rpcrdma_create_data_internal *cdata)
-{
- struct ib_mr *mr;
-
- /* Obtain an rkey to use for RPC data payloads.
- */
- mr = ib_get_dma_mr(ia->ri_pd,
- IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE |
- IB_ACCESS_REMOTE_READ);
- if (IS_ERR(mr)) {
- pr_err("%s: ib_get_dma_mr for failed with %lX\n",
- __func__, PTR_ERR(mr));
- return -ENOMEM;
- }
- ia->ri_dma_mr = mr;
-
- rpcrdma_set_max_header_sizes(ia, cdata, min_t(unsigned int,
- RPCRDMA_MAX_DATA_SEGS,
- RPCRDMA_MAX_HDR_SEGS));
- return 0;
-}
-
-/* PHYSICAL memory registration conveys one page per chunk segment.
- */
-static size_t
-physical_op_maxpages(struct rpcrdma_xprt *r_xprt)
-{
- return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
- RPCRDMA_MAX_HDR_SEGS);
-}
-
-static int
-physical_op_init(struct rpcrdma_xprt *r_xprt)
-{
- return 0;
-}
-
-/* The client's physical memory is already exposed for
- * remote access via RDMA READ or RDMA WRITE.
- */
-static int
-physical_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
- int nsegs, bool writing)
-{
- struct rpcrdma_ia *ia = &r_xprt->rx_ia;
-
- rpcrdma_map_one(ia->ri_device, seg, rpcrdma_data_dir(writing));
- seg->mr_rkey = ia->ri_dma_mr->rkey;
- seg->mr_base = seg->mr_dma;
- return 1;
-}
-
-/* DMA unmap all memory regions that were mapped for "req".
- */
-static void
-physical_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
-{
- struct ib_device *device = r_xprt->rx_ia.ri_device;
- unsigned int i;
-
- for (i = 0; req->rl_nchunks; --req->rl_nchunks)
- rpcrdma_unmap_one(device, &req->rl_segments[i++]);
-}
-
-/* Use a slow, safe mechanism to invalidate all memory regions
- * that were registered for "req".
- *
- * For physical memory registration, there is no good way to
- * fence a single MR that has been advertised to the server. The
- * client has already handed the server an R_key that cannot be
- * invalidated and is shared by all MRs on this connection.
- * Tearing down the PD might be the only safe choice, but it's
- * not clear that a freshly acquired DMA R_key would be different
- * than the one used by the PD that was just destroyed.
- * FIXME.
- */
-static void
-physical_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
- bool sync)
-{
- physical_op_unmap_sync(r_xprt, req);
-}
-
-static void
-physical_op_destroy(struct rpcrdma_buffer *buf)
-{
-}
-
-const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
- .ro_map = physical_op_map,
- .ro_unmap_sync = physical_op_unmap_sync,
- .ro_unmap_safe = physical_op_unmap_safe,
- .ro_open = physical_op_open,
- .ro_maxpages = physical_op_maxpages,
- .ro_init = physical_op_init,
- .ro_destroy = physical_op_destroy,
- .ro_displayname = "physical",
-};
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 77a371d..5ee98e9 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -379,8 +379,6 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
struct rpcrdma_ia *ia = &xprt->rx_ia;
int rc;

- ia->ri_dma_mr = NULL;
-
ia->ri_id = rpcrdma_create_id(xprt, ia, addr);
if (IS_ERR(ia->ri_id)) {
rc = PTR_ERR(ia->ri_id);
@@ -418,9 +416,6 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
case RPCRDMA_FRMR:
ia->ri_ops = &rpcrdma_frwr_memreg_ops;
break;
- case RPCRDMA_ALLPHYSICAL:
- ia->ri_ops = &rpcrdma_physical_memreg_ops;
- break;
case RPCRDMA_MTHCAFMR:
ia->ri_ops = &rpcrdma_fmr_memreg_ops;
break;
@@ -585,8 +580,6 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
out2:
ib_free_cq(sendcq);
out1:
- if (ia->ri_dma_mr)
- ib_dereg_mr(ia->ri_dma_mr);
return rc;
}

@@ -600,8 +593,6 @@ out1:
void
rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
{
- int rc;
-
dprintk("RPC: %s: entering, connected is %d\n",
__func__, ep->rep_connected);

@@ -615,12 +606,6 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)

ib_free_cq(ep->rep_attr.recv_cq);
ib_free_cq(ep->rep_attr.send_cq);
-
- if (ia->ri_dma_mr) {
- rc = ib_dereg_mr(ia->ri_dma_mr);
- dprintk("RPC: %s: ib_dereg_mr returned %i\n",
- __func__, rc);
- }
}

/*
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 4e03037..bcb168e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -68,7 +68,6 @@ struct rpcrdma_ia {
struct ib_device *ri_device;
struct rdma_cm_id *ri_id;
struct ib_pd *ri_pd;
- struct ib_mr *ri_dma_mr;
struct completion ri_done;
int ri_async_rc;
unsigned int ri_max_frmr_depth;
@@ -269,8 +268,7 @@ struct rpcrdma_mw {
* NOTES:
* o RPCRDMA_MAX_SEGS is the max number of addressible chunk elements we
* marshal. The number needed varies depending on the iov lists that
- * are passed to us, the memory registration mode we are in, and if
- * physical addressing is used, the layout.
+ * are passed to us and the memory registration mode we are in.
*/

struct rpcrdma_mr_seg { /* chunk descriptors */
@@ -417,7 +415,6 @@ struct rpcrdma_memreg_ops {

extern const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops;
extern const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops;
-extern const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops;

/*
* RPCRDMA transport -- encapsulates the structures above for


2016-06-15 03:16:40

by Chuck Lever III


Clean up: ALLPHYSICAL is gone and FMR has been converted to use
scatterlists. There are no more users of these functions.

This patch shrinks the size of struct rpcrdma_req by about 3500
bytes on x86_64. There is one of these structs for each RPC credit
(128 credits per transport connection).
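
Back-of-the-envelope, those two figures combine as follows (a rough
check only; the constants are taken straight from the paragraph
above):

#include <stdio.h>

int main(void)
{
        const unsigned long per_req_saving = 3500;  /* bytes, from above */
        const unsigned long credits = 128;      /* RPC credits per connection */

        /* 3500 bytes x 128 credits is roughly 448 kB per transport */
        printf("per-transport saving: ~%lu bytes\n",
               per_req_saving * credits);
        return 0;
}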

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 8 --------
net/sunrpc/xprtrdma/xprt_rdma.h | 36 ------------------------------------
2 files changed, 44 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 5ee98e9..b80e767f 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1086,14 +1086,6 @@ rpcrdma_recv_buffer_put(struct rpcrdma_rep *rep)
* Wrappers for internal-use kmalloc memory registration, used by buffer code.
*/

-void
-rpcrdma_mapping_error(struct rpcrdma_mr_seg *seg)
-{
- dprintk("RPC: map_one: offset %p iova %llx len %zu\n",
- seg->mr_offset,
- (unsigned long long)seg->mr_dma, seg->mr_dmalen);
-}
-
/**
* rpcrdma_alloc_regbuf - kmalloc and register memory for SEND/RECV buffers
* @ia: controlling rpcrdma_ia
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index bcb168e..f1b6f2f 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -277,9 +277,6 @@ struct rpcrdma_mr_seg { /* chunk descriptors */
u32 mr_rkey; /* registration result */
u32 mr_len; /* length of chunk or segment */
int mr_nsegs; /* number of segments in chunk or 0 */
- enum dma_data_direction mr_dir; /* segment mapping direction */
- dma_addr_t mr_dma; /* segment mapping address */
- size_t mr_dmalen; /* segment mapping length */
struct page *mr_page; /* owning page, if any */
char *mr_offset; /* kva if no page, else offset */
};
@@ -496,45 +493,12 @@ void rpcrdma_destroy_wq(void);
* Wrappers for chunk registration, shared by read/write chunk code.
*/

-void rpcrdma_mapping_error(struct rpcrdma_mr_seg *);
-
static inline enum dma_data_direction
rpcrdma_data_dir(bool writing)
{
return writing ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
}

-static inline void
-rpcrdma_map_one(struct ib_device *device, struct rpcrdma_mr_seg *seg,
- enum dma_data_direction direction)
-{
- seg->mr_dir = direction;
- seg->mr_dmalen = seg->mr_len;
-
- if (seg->mr_page)
- seg->mr_dma = ib_dma_map_page(device,
- seg->mr_page, offset_in_page(seg->mr_offset),
- seg->mr_dmalen, seg->mr_dir);
- else
- seg->mr_dma = ib_dma_map_single(device,
- seg->mr_offset,
- seg->mr_dmalen, seg->mr_dir);
-
- if (ib_dma_mapping_error(device, seg->mr_dma))
- rpcrdma_mapping_error(seg);
-}
-
-static inline void
-rpcrdma_unmap_one(struct ib_device *device, struct rpcrdma_mr_seg *seg)
-{
- if (seg->mr_page)
- ib_dma_unmap_page(device,
- seg->mr_dma, seg->mr_dmalen, seg->mr_dir);
- else
- ib_dma_unmap_single(device,
- seg->mr_dma, seg->mr_dmalen, seg->mr_dir);
-}
-
/*
* RPC/RDMA connection management calls - xprtrdma/rpc_rdma.c
*/


2016-06-15 03:16:48

by Chuck Lever III


Not having an rpcrdma_rep at call_allocate time can be a problem.
It means that send_request can't post a receive buffer to catch
the RPC's reply. Possible consequences are RPC timeouts or even
transport deadlock.

Instead of allowing an RPC to proceed if an rpcrdma_rep is
not available, return NULL to force call_allocate to wait and
try again.
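
As a rough user-space sketch of the "all or nothing" policy (the
names buf_pool and get_pair are invented; the real logic lives in
rpcrdma_buffer_get()), a request is handed out only when a reply
buffer is also available, otherwise the caller simply gets NULL and
retries later:

#include <stddef.h>
#include <pthread.h>

struct entry { struct entry *next; };

struct buf_pool {
        pthread_mutex_t lock;
        struct entry *free_reqs;
        struct entry *free_reps;
};

/* Hand out a request only if a reply buffer is also available;
 * otherwise leave both free lists untouched and return NULL so
 * the caller waits and tries again.
 */
static struct entry *get_pair(struct buf_pool *p, struct entry **rep)
{
        struct entry *req = NULL;

        pthread_mutex_lock(&p->lock);
        if (p->free_reqs && p->free_reps) {
                req = p->free_reqs;
                p->free_reqs = req->next;
                *rep = p->free_reps;
                p->free_reps = (*rep)->next;
        }
        pthread_mutex_unlock(&p->lock);
        return req;
}

int main(void)
{
        struct buf_pool pool = { .lock = PTHREAD_MUTEX_INITIALIZER };
        struct entry *rep = NULL;

        /* empty pool: the caller gets NULL and must wait, then retry */
        return get_pair(&pool, &rep) == NULL ? 0 : 1;
}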

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/verbs.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b80e767f..8b8abd6 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1004,8 +1004,6 @@ rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)

/*
* Get a set of request/reply buffers.
- *
- * Reply buffer (if available) is attached to send buffer upon return.
*/
struct rpcrdma_req *
rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
@@ -1024,13 +1022,13 @@ rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)

out_reqbuf:
spin_unlock(&buffers->rb_lock);
- pr_warn("RPC: %s: out of request buffers\n", __func__);
+ pr_warn("rpcrdma: out of request buffers (%p)\n", buffers);
return NULL;
out_repbuf:
+ list_add(&req->rl_free, &buffers->rb_send_bufs);
spin_unlock(&buffers->rb_lock);
- pr_warn("RPC: %s: out of reply buffers\n", __func__);
- req->rl_reply = NULL;
- return req;
+ pr_warn("rpcrdma: out of reply buffers (%p)\n", buffers);
+ return NULL;
}

/*


2016-06-15 03:16:56

by Chuck Lever III


Commit c93c62231cf5 ("xprtrdma: Disconnect on registration failure")
added a disconnect for some RPC marshaling failures. This is needed
only in a handful of cases, but it was triggering for simple stuff
like temporary resource shortages. Try to straighten this out.

Fix up the lower layers so they don't return -ENOMEM or other error
codes that the RPC client's FSM doesn't explicitly recognize.

Also fix up the places in the send_request path that do want a
disconnect. For example, when ib_post_send or ib_post_recv fails,
that indicates a send or receive queue resource miscalculation. Such
a miscalculation should be rare and points to a software bug, but
xprtrdma can recover: disconnect to reset the transport and start
over.
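
A hypothetical caller-side view of that contract (illustrative only;
the real dispatch is done by the generic RPC client state machine,
and the function name after_send_request is invented):

#include <errno.h>

enum next_step { SEND_DONE, RETRY_LATER, RECONNECT, FAIL_RPC };

/* Map a ->send_request result to the caller's recovery action */
static enum next_step after_send_request(int rc)
{
        switch (rc) {
        case 0:
                return SEND_DONE;       /* request is on the wire */
        case -ENOBUFS:
                return RETRY_LATER;     /* transient resource shortage */
        case -ENOTCONN:
                return RECONNECT;       /* run connect logic, then resend */
        case -EIO:
        default:
                return FAIL_RPC;        /* permanent failure, don't retransmit */
        }
}

int main(void)
{
        /* e.g. a transient -ENOBUFS maps to a later retry, not an error */
        return after_send_request(-ENOBUFS) == RETRY_LATER ? 0 : 1;
}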

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 6 +++---
net/sunrpc/xprtrdma/frwr_ops.c | 13 +++++++------
net/sunrpc/xprtrdma/rpc_rdma.c | 2 +-
net/sunrpc/xprtrdma/transport.c | 20 +++++++++++++++-----
net/sunrpc/xprtrdma/verbs.c | 22 +++++++++++++---------
5 files changed, 39 insertions(+), 24 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index eb42d7f..1ee2b10 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -209,7 +209,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
rpcrdma_defer_mr_recovery(mw);
mw = rpcrdma_get_mw(r_xprt);
if (!mw)
- return -ENOMEM;
+ return -ENOBUFS;

pageoff = offset_in_page(seg1->mr_offset);
seg1->mr_offset -= pageoff; /* start of page */
@@ -259,14 +259,14 @@ out_dmamap_err:
pr_err("rpcrdma: failed to dma map sg %p sg_nents %u\n",
mw->mw_sg, mw->mw_nents);
rpcrdma_defer_mr_recovery(mw);
- return -ENOMEM;
+ return -EIO;

out_maperr:
pr_err("rpcrdma: ib_map_phys_fmr %u@0x%llx+%i (%d) status %i\n",
len, (unsigned long long)dma_pages[0],
pageoff, mw->mw_nents, rc);
rpcrdma_defer_mr_recovery(mw);
- return rc;
+ return -EIO;
}

/* Invalidate all memory regions that were registered for "req".
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index c9ead2b..e77e40a 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -365,7 +365,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
rpcrdma_defer_mr_recovery(mw);
mw = rpcrdma_get_mw(r_xprt);
if (!mw)
- return -ENOMEM;
+ return -ENOBUFS;
} while (mw->frmr.fr_state != FRMR_IS_INVALID);
frmr = &mw->frmr;
frmr->fr_state = FRMR_IS_VALID;
@@ -439,18 +439,18 @@ out_dmamap_err:
pr_err("rpcrdma: failed to dma map sg %p sg_nents %u\n",
mw->mw_sg, mw->mw_nents);
rpcrdma_defer_mr_recovery(mw);
- return -ENOMEM;
+ return -EIO;

out_mapmr_err:
pr_err("rpcrdma: failed to map mr %p (%u/%u)\n",
frmr->fr_mr, n, mw->mw_nents);
- rc = n < 0 ? n : -EIO;
rpcrdma_defer_mr_recovery(mw);
- return rc;
+ return -EIO;

out_senderr:
+ pr_err("rpcrdma: FRMR registration ib_post_send returned %i\n", rc);
rpcrdma_defer_mr_recovery(mw);
- return rc;
+ return -ENOTCONN;
}

static struct ib_send_wr *
@@ -552,7 +552,8 @@ unmap:
return;

reset_mrs:
- pr_warn("%s: ib_post_send failed %i\n", __func__, rc);
+ pr_err("rpcrdma: FRMR invalidate ib_post_send returned %i\n", rc);
+ rdma_disconnect(ia->ri_id);

/* Find and reset the MRs in the LOCAL_INV WRs that did not
* get posted. This is synchronous, and slow.
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 35a8109..77e002f 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -251,7 +251,7 @@ rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
/* alloc the pagelist for receiving buffer */
ppages[p] = alloc_page(GFP_ATOMIC);
if (!ppages[p])
- return -ENOMEM;
+ return -EAGAIN;
}
seg[n].mr_page = ppages[p];
seg[n].mr_offset = (void *)(unsigned long) page_base;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 4c8e7f1..be4dd2c 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -558,7 +558,6 @@ out_sendbuf:

out_fail:
rpcrdma_buffer_put(req);
- r_xprt->rx_stats.failed_marshal_count++;
return NULL;
}

@@ -590,8 +589,19 @@ xprt_rdma_free(void *buffer)
rpcrdma_buffer_put(req);
}

-/*
+/**
+ * xprt_rdma_send_request - marshal and send an RPC request
+ * @task: RPC task with an RPC message in rq_snd_buf
+ *
+ * Return values:
+ * 0: The request has been sent
+ * ENOTCONN: Caller needs to invoke connect logic then call again
+ * ENOBUFS: Call again later to send the request
+ * EIO: A permanent error occurred. The request was not sent,
+ * and don't try it again
+ *
* send_request invokes the meat of RPC RDMA. It must do the following:
+ *
* 1. Marshal the RPC request into an RPC RDMA request, which means
* putting a header in front of data, and creating IOVs for RDMA
* from those in the request.
@@ -600,7 +610,6 @@ xprt_rdma_free(void *buffer)
* the request (rpcrdma_ep_post).
* 4. No partial sends are possible in the RPC-RDMA protocol (as in UDP).
*/
-
static int
xprt_rdma_send_request(struct rpc_task *task)
{
@@ -630,11 +639,12 @@ xprt_rdma_send_request(struct rpc_task *task)
return 0;

failed_marshal:
- r_xprt->rx_stats.failed_marshal_count++;
dprintk("RPC: %s: rpcrdma_marshal_req failed, status %i\n",
__func__, rc);
if (rc == -EIO)
- return -EIO;
+ r_xprt->rx_stats.failed_marshal_count++;
+ if (rc != -ENOTCONN)
+ return rc;
drop_connection:
xprt_disconnect_done(xprt);
return -ENOTCONN; /* implies disconnect */
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 8b8abd6..35f2176 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1166,7 +1166,7 @@ rpcrdma_ep_post(struct rpcrdma_ia *ia,
if (rep) {
rc = rpcrdma_ep_post_recv(ia, ep, rep);
if (rc)
- goto out;
+ return rc;
req->rl_reply = NULL;
}

@@ -1191,10 +1191,12 @@ rpcrdma_ep_post(struct rpcrdma_ia *ia,

rc = ib_post_send(ia->ri_id->qp, &send_wr, &send_wr_fail);
if (rc)
- dprintk("RPC: %s: ib_post_send returned %i\n", __func__,
- rc);
-out:
- return rc;
+ goto out_postsend_err;
+ return 0;
+
+out_postsend_err:
+ pr_err("rpcrdma: RDMA Send ib_post_send returned %i\n", rc);
+ return -ENOTCONN;
}

/*
@@ -1219,11 +1221,13 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
DMA_BIDIRECTIONAL);

rc = ib_post_recv(ia->ri_id->qp, &recv_wr, &recv_wr_fail);
-
if (rc)
- dprintk("RPC: %s: ib_post_recv returned %i\n", __func__,
- rc);
- return rc;
+ goto out_postrecv;
+ return 0;
+
+out_postrecv:
+ pr_err("rpcrdma: ib_post_recv returned %i\n", rc);
+ return -ENOTCONN;
}

/**


2016-06-15 03:17:04

by Chuck Lever III


Clean up, based on code audit: Remove the possibility that the
chunk list XDR encoders can return zero, which would be interpreted
as a NULL.
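
The hazard is easiest to see with the ERR_PTR()/IS_ERR() convention
the chunk encoders use. This stand-alone sketch uses simplified
user-space copies of those macros (assuming the usual MAX_ERRNO of
4095): a zero return becomes ERR_PTR(0), which is just NULL and is
not caught by IS_ERR():

#include <assert.h>
#include <stdio.h>

#define MAX_ERRNO       4095
#define ERR_PTR(err)    ((void *)(long)(err))
#define IS_ERR(ptr)     ((unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)

int main(void)
{
        void *p = ERR_PTR(0);   /* what a zero return from ro_map becomes */

        assert(p == NULL);      /* looks like an ordinary NULL pointer ... */
        assert(!IS_ERR(p));     /* ... and IS_ERR() does not flag it */
        printf("ERR_PTR(0) is NULL and passes IS_ERR(); the caller would\n"
               "dereference it later instead of failing cleanly\n");
        return 0;
}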

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 2 ++
net/sunrpc/xprtrdma/frwr_ops.c | 2 ++
net/sunrpc/xprtrdma/rpc_rdma.c | 6 +++---
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 1ee2b10..e86fe83 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -236,6 +236,8 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
}
mw->mw_nents = i;
mw->mw_dir = rpcrdma_data_dir(writing);
+ if (i == 0)
+ goto out_dmamap_err;

if (!ib_dma_map_sg(r_xprt->rx_ia.ri_device,
mw->mw_sg, mw->mw_nents, mw->mw_dir))
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index e77e40a..d7e63bd 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -394,6 +394,8 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
}
mw->mw_nents = i;
mw->mw_dir = rpcrdma_data_dir(writing);
+ if (i == 0)
+ goto out_dmamap_err;

dma_nents = ib_dma_map_sg(ia->ri_device,
mw->mw_sg, mw->mw_nents, mw->mw_dir);
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 77e002f..8fde0ab 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -329,7 +329,7 @@ rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt,

do {
n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs, false);
- if (n <= 0)
+ if (n < 0)
return ERR_PTR(n);

*iptr++ = xdr_one; /* item present */
@@ -397,7 +397,7 @@ rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
nchunks = 0;
do {
n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs, true);
- if (n <= 0)
+ if (n < 0)
return ERR_PTR(n);

iptr = xdr_encode_rdma_segment(iptr, seg);
@@ -462,7 +462,7 @@ rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt,
nchunks = 0;
do {
n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs, true);
- if (n <= 0)
+ if (n < 0)
return ERR_PTR(n);

iptr = xdr_encode_rdma_segment(iptr, seg);


2016-06-15 03:17:12

by Chuck Lever III


Frequent MR list exhaustion can impact I/O throughput, so enough MRs
are always created during transport set-up to prevent running out.
This means more MRs are created than most workloads need.

Commit 94f58c58c0b4 ("xprtrdma: Allow Read list and Reply chunk
simultaneously") introduced support for sending two chunk lists per
RPC, which consumes more MRs per RPC.

Instead of trying to provision more MRs, introduce a mechanism for
allocating MRs on demand. A few MRs are allocated during transport
set-up to kick things off.

This significantly reduces the average number of MRs per transport
while allowing the MR count to grow for workloads or devices that
need more MRs.

FRWR with mlx4 allocated almost 400 MRs per transport before this
patch. Now it starts with 16.
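
As a rough illustration of the growth policy (not the kernel code:
the patch defers the refill to a workqueue and makes the caller
retry, while this sketch refills synchronously; mr_pool, pool_get,
and pool_refill are invented names), a pool that starts empty and
grows in batches of 16 only when it is found exhausted looks like:

#include <stdlib.h>

struct mr { struct mr *next; };

struct mr_pool {
        struct mr *free_list;
        unsigned long allocated;        /* mirrors the new mrs_allocated stat */
};

static void pool_refill(struct mr_pool *pool, unsigned int count)
{
        while (count--) {
                struct mr *mr = calloc(1, sizeof(*mr));

                if (!mr)
                        break;          /* a partial refill is acceptable */
                mr->next = pool->free_list;
                pool->free_list = mr;
                pool->allocated++;
        }
}

static struct mr *pool_get(struct mr_pool *pool)
{
        struct mr *mr = pool->free_list;

        if (!mr) {
                pool_refill(pool, 16);  /* grow only under pressure */
                mr = pool->free_list;
        }
        if (mr)
                pool->free_list = mr->next;
        return mr;                      /* NULL: caller must retry later */
}

int main(void)
{
        struct mr_pool pool = { NULL, 0 };

        /* the first request finds the pool empty and triggers a refill */
        return pool_get(&pool) ? 0 : 1;
}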

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 64 +++-------------------------
net/sunrpc/xprtrdma/frwr_ops.c | 64 +++-------------------------
net/sunrpc/xprtrdma/transport.c | 5 +-
net/sunrpc/xprtrdma/verbs.c | 90 ++++++++++++++++++++++++++++++++++++---
net/sunrpc/xprtrdma/xprt_rdma.h | 7 ++-
5 files changed, 106 insertions(+), 124 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index e86fe83..840479f 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -35,7 +35,7 @@ enum {
};

static int
-__fmr_init(struct rpcrdma_mw *mw, struct ib_pd *pd)
+fmr_op_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mw *mw)
{
static struct ib_fmr_attr fmr_attr = {
.max_pages = RPCRDMA_MAX_FMR_SGES,
@@ -55,7 +55,7 @@ __fmr_init(struct rpcrdma_mw *mw, struct ib_pd *pd)

sg_init_table(mw->mw_sg, RPCRDMA_MAX_FMR_SGES);

- mw->fmr.fm_mr = ib_alloc_fmr(pd, RPCRDMA_FMR_ACCESS_FLAGS,
+ mw->fmr.fm_mr = ib_alloc_fmr(ia->ri_pd, RPCRDMA_FMR_ACCESS_FLAGS,
&fmr_attr);
if (IS_ERR(mw->fmr.fm_mr))
goto out_fmr_err;
@@ -86,7 +86,7 @@ __fmr_unmap(struct rpcrdma_mw *mw)
}

static void
-__fmr_release(struct rpcrdma_mw *r)
+fmr_op_release_mr(struct rpcrdma_mw *r)
{
LIST_HEAD(unmap_list);
int rc;
@@ -106,13 +106,11 @@ __fmr_release(struct rpcrdma_mw *r)
if (rc)
pr_err("rpcrdma: final ib_dealloc_fmr for %p returned %i\n",
r, rc);
+
+ kfree(r);
}

/* Reset of a single FMR.
- *
- * There's no recovery if this fails. The FMR is abandoned, but
- * remains in rb_all. It will be cleaned up when the transport is
- * destroyed.
*/
static void
fmr_op_recover_mr(struct rpcrdma_mw *mw)
@@ -156,41 +154,6 @@ fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
RPCRDMA_MAX_HDR_SEGS * RPCRDMA_MAX_FMR_SGES);
}

-static int
-fmr_op_init(struct rpcrdma_xprt *r_xprt)
-{
- struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
- struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
- struct rpcrdma_mw *r;
- int i, rc;
-
- spin_lock_init(&buf->rb_mwlock);
- INIT_LIST_HEAD(&buf->rb_mws);
- INIT_LIST_HEAD(&buf->rb_all);
-
- i = max_t(int, RPCRDMA_MAX_DATA_SEGS / RPCRDMA_MAX_FMR_SGES, 1);
- i += 2; /* head + tail */
- i *= buf->rb_max_requests; /* one set for each RPC slot */
- dprintk("RPC: %s: initalizing %d FMRs\n", __func__, i);
-
- while (i--) {
- r = kzalloc(sizeof(*r), GFP_KERNEL);
- if (!r)
- return -ENOMEM;
-
- rc = __fmr_init(r, pd);
- if (rc) {
- kfree(r);
- return rc;
- }
-
- r->mw_xprt = r_xprt;
- list_add(&r->mw_list, &buf->rb_mws);
- list_add(&r->mw_all, &buf->rb_all);
- }
- return 0;
-}
-
/* Use the ib_map_phys_fmr() verb to register a memory region
* for remote access via RDMA READ or RDMA WRITE.
*/
@@ -364,19 +327,6 @@ fmr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
}
}

-static void
-fmr_op_destroy(struct rpcrdma_buffer *buf)
-{
- struct rpcrdma_mw *r;
-
- while (!list_empty(&buf->rb_all)) {
- r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
- list_del(&r->mw_all);
- __fmr_release(r);
- kfree(r);
- }
-}
-
const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap_sync = fmr_op_unmap_sync,
@@ -384,7 +334,7 @@ const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_recover_mr = fmr_op_recover_mr,
.ro_open = fmr_op_open,
.ro_maxpages = fmr_op_maxpages,
- .ro_init = fmr_op_init,
- .ro_destroy = fmr_op_destroy,
+ .ro_init_mr = fmr_op_init_mr,
+ .ro_release_mr = fmr_op_release_mr,
.ro_displayname = "fmr",
};
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index d7e63bd..f603c3a 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -74,12 +74,13 @@
#endif

static int
-__frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, unsigned int depth)
+frwr_op_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mw *r)
{
+ unsigned int depth = ia->ri_max_frmr_depth;
struct rpcrdma_frmr *f = &r->frmr;
int rc;

- f->fr_mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, depth);
+ f->fr_mr = ib_alloc_mr(ia->ri_pd, IB_MR_TYPE_MEM_REG, depth);
if (IS_ERR(f->fr_mr))
goto out_mr_err;

@@ -106,7 +107,7 @@ out_list_err:
}

static void
-__frwr_release(struct rpcrdma_mw *r)
+frwr_op_release_mr(struct rpcrdma_mw *r)
{
int rc;

@@ -115,6 +116,7 @@ __frwr_release(struct rpcrdma_mw *r)
pr_err("rpcrdma: final ib_dereg_mr for %p returned %i\n",
r, rc);
kfree(r->mw_sg);
+ kfree(r);
}

static int
@@ -302,45 +304,6 @@ frwr_wc_localinv_wake(struct ib_cq *cq, struct ib_wc *wc)
complete_all(&frmr->fr_linv_done);
}

-static int
-frwr_op_init(struct rpcrdma_xprt *r_xprt)
-{
- struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
- unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
- struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
- int i;
-
- spin_lock_init(&buf->rb_mwlock);
- INIT_LIST_HEAD(&buf->rb_mws);
- INIT_LIST_HEAD(&buf->rb_all);
-
- i = max_t(int, RPCRDMA_MAX_DATA_SEGS / depth, 1);
- i += 2; /* head + tail */
- i *= buf->rb_max_requests; /* one set for each RPC slot */
- dprintk("RPC: %s: initalizing %d FRMRs\n", __func__, i);
-
- while (i--) {
- struct rpcrdma_mw *r;
- int rc;
-
- r = kzalloc(sizeof(*r), GFP_KERNEL);
- if (!r)
- return -ENOMEM;
-
- rc = __frwr_init(r, pd, depth);
- if (rc) {
- kfree(r);
- return rc;
- }
-
- r->mw_xprt = r_xprt;
- list_add(&r->mw_list, &buf->rb_mws);
- list_add(&r->mw_all, &buf->rb_all);
- }
-
- return 0;
-}
-
/* Post a REG_MR Work Request to register a memory region
* for remote access via RDMA READ or RDMA WRITE.
*/
@@ -601,19 +564,6 @@ frwr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
}
}

-static void
-frwr_op_destroy(struct rpcrdma_buffer *buf)
-{
- struct rpcrdma_mw *r;
-
- while (!list_empty(&buf->rb_all)) {
- r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
- list_del(&r->mw_all);
- __frwr_release(r);
- kfree(r);
- }
-}
-
const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap_sync = frwr_op_unmap_sync,
@@ -621,7 +571,7 @@ const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_recover_mr = frwr_op_recover_mr,
.ro_open = frwr_op_open,
.ro_maxpages = frwr_op_maxpages,
- .ro_init = frwr_op_init,
- .ro_destroy = frwr_op_destroy,
+ .ro_init_mr = frwr_op_init_mr,
+ .ro_release_mr = frwr_op_release_mr,
.ro_displayname = "frwr",
};
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index be4dd2c..b1dd42a 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -682,9 +682,10 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
r_xprt->rx_stats.failed_marshal_count,
r_xprt->rx_stats.bad_reply_count,
r_xprt->rx_stats.nomsg_call_count);
- seq_printf(seq, "%lu %lu\n",
+ seq_printf(seq, "%lu %lu %lu\n",
r_xprt->rx_stats.mrs_recovered,
- r_xprt->rx_stats.mrs_orphaned);
+ r_xprt->rx_stats.mrs_orphaned,
+ r_xprt->rx_stats.mrs_allocated);
}

static int
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 35f2176..4a7a712 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -797,6 +797,55 @@ rpcrdma_defer_mr_recovery(struct rpcrdma_mw *mw)
schedule_delayed_work(&buf->rb_recovery_worker, 0);
}

+static void
+rpcrdma_create_mrs(struct rpcrdma_xprt *r_xprt)
+{
+ struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+ struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+ unsigned int count;
+ LIST_HEAD(free);
+ LIST_HEAD(all);
+
+ for (count = 0; count < 16; count++) {
+ struct rpcrdma_mw *mw;
+ int rc;
+
+ mw = kzalloc(sizeof(*mw), GFP_KERNEL);
+ if (!mw)
+ break;
+
+ rc = ia->ri_ops->ro_init_mr(ia, mw);
+ if (rc) {
+ kfree(mw);
+ break;
+ }
+
+ mw->mw_xprt = r_xprt;
+
+ list_add(&mw->mw_list, &free);
+ list_add(&mw->mw_all, &all);
+ }
+
+ spin_lock(&buf->rb_mwlock);
+ list_splice(&free, &buf->rb_mws);
+ list_splice(&all, &buf->rb_all);
+ r_xprt->rx_stats.mrs_allocated += count;
+ spin_unlock(&buf->rb_mwlock);
+
+ dprintk("RPC: %s: created %u MRs\n", __func__, count);
+}
+
+static void
+rpcrdma_mr_refresh_worker(struct work_struct *work)
+{
+ struct rpcrdma_buffer *buf = container_of(work, struct rpcrdma_buffer,
+ rb_refresh_worker.work);
+ struct rpcrdma_xprt *r_xprt = container_of(buf, struct rpcrdma_xprt,
+ rx_buf);
+
+ rpcrdma_create_mrs(r_xprt);
+}
+
struct rpcrdma_req *
rpcrdma_create_req(struct rpcrdma_xprt *r_xprt)
{
@@ -852,21 +901,23 @@ int
rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
{
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
- struct rpcrdma_ia *ia = &r_xprt->rx_ia;
int i, rc;

buf->rb_max_requests = r_xprt->rx_data.max_requests;
buf->rb_bc_srv_max_requests = 0;
atomic_set(&buf->rb_credits, 1);
+ spin_lock_init(&buf->rb_mwlock);
spin_lock_init(&buf->rb_lock);
spin_lock_init(&buf->rb_recovery_lock);
+ INIT_LIST_HEAD(&buf->rb_mws);
+ INIT_LIST_HEAD(&buf->rb_all);
INIT_LIST_HEAD(&buf->rb_stale_mrs);
+ INIT_DELAYED_WORK(&buf->rb_refresh_worker,
+ rpcrdma_mr_refresh_worker);
INIT_DELAYED_WORK(&buf->rb_recovery_worker,
rpcrdma_mr_recovery_worker);

- rc = ia->ri_ops->ro_init(r_xprt);
- if (rc)
- goto out;
+ rpcrdma_create_mrs(r_xprt);

INIT_LIST_HEAD(&buf->rb_send_bufs);
INIT_LIST_HEAD(&buf->rb_allreqs);
@@ -946,6 +997,10 @@ void
rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
{
struct rpcrdma_ia *ia = rdmab_to_ia(buf);
+ struct rpcrdma_xprt *r_xprt = container_of(buf, struct rpcrdma_xprt,
+ rx_buf);
+ struct rpcrdma_mw *mw;
+ unsigned int count;

cancel_delayed_work_sync(&buf->rb_recovery_worker);

@@ -970,7 +1025,21 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
}
spin_unlock(&buf->rb_reqslock);

- ia->ri_ops->ro_destroy(buf);
+ count = 0;
+ spin_lock(&buf->rb_mwlock);
+ while (!list_empty(&buf->rb_all)) {
+ mw = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
+ list_del(&mw->mw_all);
+
+ spin_unlock(&buf->rb_mwlock);
+ ia->ri_ops->ro_release_mr(mw);
+ count++;
+ spin_lock(&buf->rb_mwlock);
+ }
+ spin_unlock(&buf->rb_mwlock);
+ r_xprt->rx_stats.mrs_allocated = 0;
+
+ dprintk("RPC: %s: released %u MRs\n", __func__, count);
}

struct rpcrdma_mw *
@@ -988,8 +1057,17 @@ rpcrdma_get_mw(struct rpcrdma_xprt *r_xprt)
spin_unlock(&buf->rb_mwlock);

if (!mw)
- pr_err("RPC: %s: no MWs available\n", __func__);
+ goto out_nomws;
return mw;
+
+out_nomws:
+ dprintk("RPC: %s: no MWs available\n", __func__);
+ schedule_delayed_work(&buf->rb_refresh_worker, 0);
+
+ /* Allow the reply handler and refresh worker to run */
+ cond_resched();
+
+ return NULL;
}

void
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index f1b6f2f..0bde4c0 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -339,6 +339,7 @@ struct rpcrdma_buffer {
spinlock_t rb_recovery_lock; /* protect rb_stale_mrs */
struct list_head rb_stale_mrs;
struct delayed_work rb_recovery_worker;
+ struct delayed_work rb_refresh_worker;
};
#define rdmab_to_ia(b) (&container_of((b), struct rpcrdma_xprt, rx_buf)->rx_ia)

@@ -387,6 +388,7 @@ struct rpcrdma_stats {
unsigned long bcall_count;
unsigned long mrs_recovered;
unsigned long mrs_orphaned;
+ unsigned long mrs_allocated;
};

/*
@@ -405,8 +407,9 @@ struct rpcrdma_memreg_ops {
struct rpcrdma_ep *,
struct rpcrdma_create_data_internal *);
size_t (*ro_maxpages)(struct rpcrdma_xprt *);
- int (*ro_init)(struct rpcrdma_xprt *);
- void (*ro_destroy)(struct rpcrdma_buffer *);
+ int (*ro_init_mr)(struct rpcrdma_ia *,
+ struct rpcrdma_mw *);
+ void (*ro_release_mr)(struct rpcrdma_mw *);
const char *ro_displayname;
};



2016-06-15 03:17:20

by Chuck Lever III


Instead of leaving orphaned MRs to be released when the transport
is destroyed, release them immediately. The MR free list can now be
replenished if it becomes exhausted.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 19 +++++++++++++------
net/sunrpc/xprtrdma/frwr_ops.c | 19 +++++++++++++------
2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 840479f..2ad117f 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -124,15 +124,22 @@ fmr_op_recover_mr(struct rpcrdma_mw *mw)
/* ORDER: then DMA unmap */
ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
mw->mw_sg, mw->mw_nents, mw->mw_dir);
- if (rc) {
- pr_err("rpcrdma: FMR reset status %d, %p orphaned\n",
- rc, mw);
- r_xprt->rx_stats.mrs_orphaned++;
- return;
- }
+ if (rc)
+ goto out_release;

rpcrdma_put_mw(r_xprt, mw);
r_xprt->rx_stats.mrs_recovered++;
+ return;
+
+out_release:
+ pr_err("rpcrdma: FMR reset failed (%d), %p released\n", rc, mw);
+ r_xprt->rx_stats.mrs_orphaned++;
+
+ spin_lock(&r_xprt->rx_buf.rb_mwlock);
+ list_del(&mw->mw_all);
+ spin_unlock(&r_xprt->rx_buf.rb_mwlock);
+
+ fmr_op_release_mr(mw);
}

static int
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index f603c3a..38de90c 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -160,15 +160,22 @@ frwr_op_recover_mr(struct rpcrdma_mw *mw)

rc = __frwr_reset_mr(ia, mw);
ib_dma_unmap_sg(ia->ri_device, mw->mw_sg, mw->mw_nents, mw->mw_dir);
- if (rc) {
- pr_err("rpcrdma: FRMR reset status %d, %p orphaned\n",
- rc, mw);
- r_xprt->rx_stats.mrs_orphaned++;
- return;
- }
+ if (rc)
+ goto out_release;

rpcrdma_put_mw(r_xprt, mw);
r_xprt->rx_stats.mrs_recovered++;
+ return;
+
+out_release:
+ pr_err("rpcrdma: FRMR reset failed %d, %p release\n", rc, mw);
+ r_xprt->rx_stats.mrs_orphaned++;
+
+ spin_lock(&r_xprt->rx_buf.rb_mwlock);
+ list_del(&mw->mw_all);
+ spin_unlock(&r_xprt->rx_buf.rb_mwlock);
+
+ frwr_op_release_mr(mw);
}

static int


2016-06-15 03:17:29

by Chuck Lever III


Instead of placing registered MWs sparsely into the rl_segments
array, place these MWs on a per-req list.

ro_unmap_{sync,safe} can then simply pull those MWs off the list
instead of walking through the array.

This change significantly reduces the size of struct rpcrdma_req
by removing the rl_mw, mr_rkey, mr_base, and mr_nsegs fields from
every array element.

As an additional clean-up, chunk co-ordinates are returned in the
"*mw" output argument so they are no longer needed in every
array element.
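
A simplified sketch of the new bookkeeping, using a plain
singly-linked list in place of the kernel's list_head (req_track and
req_untrack_all are invented names): registration records the MW on
the request, and invalidation walks only that list, so no array scan
and no per-segment rl_mw pointer are needed.

#include <stddef.h>

struct mw {
        struct mw *next;        /* stand-in for mw_list linkage */
        unsigned int handle;    /* stand-in for mw_handle */
};

struct req {
        struct mw *registered;  /* stand-in for rl_registered */
};

/* ro_map time: record the MW on the request as it is registered */
static void req_track(struct req *req, struct mw *mw)
{
        mw->next = req->registered;
        req->registered = mw;
}

/* ro_unmap time: hand every tracked MW to "put" and empty the list */
static void req_untrack_all(struct req *req, void (*put)(struct mw *))
{
        struct mw *mw, *next;

        for (mw = req->registered; mw; mw = next) {
                next = mw->next;
                put(mw);
        }
        req->registered = NULL;
}

static void put_noop(struct mw *mw)
{
        (void)mw;               /* a real put would return the MW to its pool */
}

int main(void)
{
        struct req req = { NULL };
        struct mw a = { NULL, 1 }, b = { NULL, 2 };

        req_track(&req, &a);
        req_track(&req, &b);
        req_untrack_all(&req, put_noop);
        return req.registered != NULL;
}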

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/fmr_ops.c | 65 ++++++++++---------------------
net/sunrpc/xprtrdma/frwr_ops.c | 72 ++++++++++++-----------------------
net/sunrpc/xprtrdma/rpc_rdma.c | 81 ++++++++++++++++++---------------------
net/sunrpc/xprtrdma/transport.c | 3 +
net/sunrpc/xprtrdma/verbs.c | 1
net/sunrpc/xprtrdma/xprt_rdma.h | 11 +++--
6 files changed, 94 insertions(+), 139 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 2ad117f..8d23317 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -91,6 +91,10 @@ fmr_op_release_mr(struct rpcrdma_mw *r)
LIST_HEAD(unmap_list);
int rc;

+ /* Ensure MW is not on any rl_registered list */
+ if (!list_empty(&r->mw_list))
+ list_del(&r->mw_list);
+
kfree(r->fmr.fm_physaddrs);
kfree(r->mw_sg);

@@ -166,17 +170,13 @@ fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
*/
static int
fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
- int nsegs, bool writing)
+ int nsegs, bool writing, struct rpcrdma_mw **out)
{
struct rpcrdma_mr_seg *seg1 = seg;
int len, pageoff, i, rc;
struct rpcrdma_mw *mw;
u64 *dma_pages;

- mw = seg1->rl_mw;
- seg1->rl_mw = NULL;
- if (mw)
- rpcrdma_defer_mr_recovery(mw);
mw = rpcrdma_get_mw(r_xprt);
if (!mw)
return -ENOBUFS;
@@ -220,11 +220,11 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (rc)
goto out_maperr;

- seg1->rl_mw = mw;
- seg1->mr_rkey = mw->fmr.fm_mr->rkey;
- seg1->mr_base = dma_pages[0] + pageoff;
- seg1->mr_nsegs = mw->mw_nents;
- seg1->mr_len = len;
+ mw->mw_handle = mw->fmr.fm_mr->rkey;
+ mw->mw_length = len;
+ mw->mw_offset = dma_pages[0] + pageoff;
+
+ *out = mw;
return mw->mw_nents;

out_dmamap_err:
@@ -245,13 +245,13 @@ out_maperr:
*
* Sleeps until it is safe for the host CPU to access the
* previously mapped memory regions.
+ *
+ * Caller ensures that req->rl_registered is not empty.
*/
static void
fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
{
- struct rpcrdma_mr_seg *seg;
- unsigned int i, nchunks;
- struct rpcrdma_mw *mw;
+ struct rpcrdma_mw *mw, *tmp;
LIST_HEAD(unmap_list);
int rc;

@@ -262,14 +262,8 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
* ib_unmap_fmr() is slow, so use a single call instead
* of one call per mapped FMR.
*/
- for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
- seg = &req->rl_segments[i];
- mw = seg->rl_mw;
-
+ list_for_each_entry(mw, &req->rl_registered, mw_list)
list_add_tail(&mw->fmr.fm_mr->list, &unmap_list);
-
- i += seg->mr_nsegs;
- }
rc = ib_unmap_fmr(&unmap_list);
if (rc)
goto out_reset;
@@ -277,34 +271,22 @@ fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
/* ORDER: Now DMA unmap all of the req's MRs, and return
* them to the free MW list.
*/
- for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
- seg = &req->rl_segments[i];
- mw = seg->rl_mw;
-
+ list_for_each_entry_safe(mw, tmp, &req->rl_registered, mw_list) {
+ list_del_init(&mw->mw_list);
list_del_init(&mw->fmr.fm_mr->list);
ib_dma_unmap_sg(r_xprt->rx_ia.ri_device,
mw->mw_sg, mw->mw_nents, mw->mw_dir);
rpcrdma_put_mw(r_xprt, mw);
-
- i += seg->mr_nsegs;
- seg->mr_nsegs = 0;
- seg->rl_mw = NULL;
}

- req->rl_nchunks = 0;
return;

out_reset:
pr_err("rpcrdma: ib_unmap_fmr failed (%i)\n", rc);

- for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
- seg = &req->rl_segments[i];
- mw = seg->rl_mw;
-
+ list_for_each_entry_safe(mw, tmp, &req->rl_registered, mw_list) {
list_del_init(&mw->fmr.fm_mr->list);
fmr_op_recover_mr(mw);
-
- i += seg->mr_nsegs;
}
}

@@ -315,22 +297,17 @@ static void
fmr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
bool sync)
{
- struct rpcrdma_mr_seg *seg;
struct rpcrdma_mw *mw;
- unsigned int i;

- for (i = 0; req->rl_nchunks; req->rl_nchunks--) {
- seg = &req->rl_segments[i];
- mw = seg->rl_mw;
+ while (!list_empty(&req->rl_registered)) {
+ mw = list_first_entry(&req->rl_registered,
+ struct rpcrdma_mw, mw_list);
+ list_del_init(&mw->mw_list);

if (sync)
fmr_op_recover_mr(mw);
else
rpcrdma_defer_mr_recovery(mw);
-
- i += seg->mr_nsegs;
- seg->mr_nsegs = 0;
- seg->rl_mw = NULL;
}
}

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 38de90c..bacb859 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -111,6 +111,10 @@ frwr_op_release_mr(struct rpcrdma_mw *r)
{
int rc;

+ /* Ensure MW is not on any rl_registered list */
+ if (!list_empty(&r->mw_list))
+ list_del(&r->mw_list);
+
rc = ib_dereg_mr(r->frmr.fr_mr);
if (rc)
pr_err("rpcrdma: final ib_dereg_mr for %p returned %i\n",
@@ -316,10 +320,9 @@ frwr_wc_localinv_wake(struct ib_cq *cq, struct ib_wc *wc)
*/
static int
frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
- int nsegs, bool writing)
+ int nsegs, bool writing, struct rpcrdma_mw **out)
{
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
- struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw;
struct rpcrdma_frmr *frmr;
struct ib_mr *mr;
@@ -328,8 +331,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
int rc, i, n, dma_nents;
u8 key;

- mw = seg1->rl_mw;
- seg1->rl_mw = NULL;
+ mw = NULL;
do {
if (mw)
rpcrdma_defer_mr_recovery(mw);
@@ -399,12 +401,11 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (rc)
goto out_senderr;

- seg1->rl_mw = mw;
- seg1->mr_rkey = mr->rkey;
- seg1->mr_base = mr->iova;
- seg1->mr_nsegs = mw->mw_nents;
- seg1->mr_len = mr->length;
+ mw->mw_handle = mr->rkey;
+ mw->mw_length = mr->length;
+ mw->mw_offset = mr->iova;

+ *out = mw;
return mw->mw_nents;

out_dmamap_err:
@@ -426,9 +427,8 @@ out_senderr:
}

static struct ib_send_wr *
-__frwr_prepare_linv_wr(struct rpcrdma_mr_seg *seg)
+__frwr_prepare_linv_wr(struct rpcrdma_mw *mw)
{
- struct rpcrdma_mw *mw = seg->rl_mw;
struct rpcrdma_frmr *f = &mw->frmr;
struct ib_send_wr *invalidate_wr;

@@ -448,16 +448,16 @@ __frwr_prepare_linv_wr(struct rpcrdma_mr_seg *seg)
*
* Sleeps until it is safe for the host CPU to access the
* previously mapped memory regions.
+ *
+ * Caller ensures that req->rl_registered is not empty.
*/
static void
frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
{
struct ib_send_wr *invalidate_wrs, *pos, *prev, *bad_wr;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
- struct rpcrdma_mr_seg *seg;
- unsigned int i, nchunks;
+ struct rpcrdma_mw *mw, *tmp;
struct rpcrdma_frmr *f;
- struct rpcrdma_mw *mw;
int rc;

dprintk("RPC: %s: req %p\n", __func__, req);
@@ -467,22 +467,18 @@ frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
* Chain the LOCAL_INV Work Requests and post them with
* a single ib_post_send() call.
*/
+ f = NULL;
invalidate_wrs = pos = prev = NULL;
- seg = NULL;
- for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
- seg = &req->rl_segments[i];
-
- pos = __frwr_prepare_linv_wr(seg);
+ list_for_each_entry(mw, &req->rl_registered, mw_list) {
+ pos = __frwr_prepare_linv_wr(mw);

if (!invalidate_wrs)
invalidate_wrs = pos;
else
prev->next = pos;
prev = pos;
-
- i += seg->mr_nsegs;
+ f = &mw->frmr;
}
- f = &seg->rl_mw->frmr;

/* Strong send queue ordering guarantees that when the
* last WR in the chain completes, all WRs in the chain
@@ -507,20 +503,12 @@ frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
* them to the free MW list.
*/
unmap:
- for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
- seg = &req->rl_segments[i];
- mw = seg->rl_mw;
- seg->rl_mw = NULL;
-
+ list_for_each_entry_safe(mw, tmp, &req->rl_registered, mw_list) {
+ list_del_init(&mw->mw_list);
ib_dma_unmap_sg(ia->ri_device,
mw->mw_sg, mw->mw_nents, mw->mw_dir);
rpcrdma_put_mw(r_xprt, mw);
-
- i += seg->mr_nsegs;
- seg->mr_nsegs = 0;
}
-
- req->rl_nchunks = 0;
return;

reset_mrs:
@@ -530,17 +518,12 @@ reset_mrs:
/* Find and reset the MRs in the LOCAL_INV WRs that did not
* get posted. This is synchronous, and slow.
*/
- for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
- seg = &req->rl_segments[i];
- mw = seg->rl_mw;
+ list_for_each_entry(mw, &req->rl_registered, mw_list) {
f = &mw->frmr;
-
if (mw->frmr.fr_mr->rkey == bad_wr->ex.invalidate_rkey) {
__frwr_reset_mr(ia, mw);
bad_wr = bad_wr->next;
}
-
- i += seg->mr_nsegs;
}
goto unmap;
}
@@ -552,22 +535,17 @@ static void
frwr_op_unmap_safe(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
bool sync)
{
- struct rpcrdma_mr_seg *seg;
struct rpcrdma_mw *mw;
- unsigned int i;

- for (i = 0; req->rl_nchunks; req->rl_nchunks--) {
- seg = &req->rl_segments[i];
- mw = seg->rl_mw;
+ while (!list_empty(&req->rl_registered)) {
+ mw = list_first_entry(&req->rl_registered,
+ struct rpcrdma_mw, mw_list);
+ list_del_init(&mw->mw_list);

if (sync)
frwr_op_recover_mr(mw);
else
rpcrdma_defer_mr_recovery(mw);
-
- i += seg->mr_nsegs;
- seg->mr_nsegs = 0;
- seg->rl_mw = NULL;
}
}

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 8fde0ab..6d34c1f 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -286,11 +286,11 @@ rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
}

static inline __be32 *
-xdr_encode_rdma_segment(__be32 *iptr, struct rpcrdma_mr_seg *seg)
+xdr_encode_rdma_segment(__be32 *iptr, struct rpcrdma_mw *mw)
{
- *iptr++ = cpu_to_be32(seg->mr_rkey);
- *iptr++ = cpu_to_be32(seg->mr_len);
- return xdr_encode_hyper(iptr, seg->mr_base);
+ *iptr++ = cpu_to_be32(mw->mw_handle);
+ *iptr++ = cpu_to_be32(mw->mw_length);
+ return xdr_encode_hyper(iptr, mw->mw_offset);
}

/* XDR-encode the Read list. Supports encoding a list of read
@@ -311,6 +311,7 @@ rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt,
__be32 *iptr, enum rpcrdma_chunktype rtype)
{
struct rpcrdma_mr_seg *seg = req->rl_nextseg;
+ struct rpcrdma_mw *mw;
unsigned int pos;
int n, nsegs;

@@ -328,9 +329,11 @@ rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt,
return ERR_PTR(nsegs);

do {
- n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs, false);
+ n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
+ false, &mw);
if (n < 0)
return ERR_PTR(n);
+ list_add(&mw->mw_list, &req->rl_registered);

*iptr++ = xdr_one; /* item present */

@@ -338,13 +341,12 @@ rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt,
* have the same "position".
*/
*iptr++ = cpu_to_be32(pos);
- iptr = xdr_encode_rdma_segment(iptr, seg);
+ iptr = xdr_encode_rdma_segment(iptr, mw);

- dprintk("RPC: %5u %s: read segment pos %u "
- "%d@0x%016llx:0x%08x (%s)\n",
+ dprintk("RPC: %5u %s: pos %u %u@0x%016llx:0x%08x (%s)\n",
rqst->rq_task->tk_pid, __func__, pos,
- seg->mr_len, (unsigned long long)seg->mr_base,
- seg->mr_rkey, n < nsegs ? "more" : "last");
+ mw->mw_length, (unsigned long long)mw->mw_offset,
+ mw->mw_handle, n < nsegs ? "more" : "last");

r_xprt->rx_stats.read_chunk_count++;
req->rl_nchunks++;
@@ -376,6 +378,7 @@ rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
enum rpcrdma_chunktype wtype)
{
struct rpcrdma_mr_seg *seg = req->rl_nextseg;
+ struct rpcrdma_mw *mw;
int n, nsegs, nchunks;
__be32 *segcount;

@@ -396,17 +399,18 @@ rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,

nchunks = 0;
do {
- n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs, true);
+ n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
+ true, &mw);
if (n < 0)
return ERR_PTR(n);
+ list_add(&mw->mw_list, &req->rl_registered);

- iptr = xdr_encode_rdma_segment(iptr, seg);
+ iptr = xdr_encode_rdma_segment(iptr, mw);

- dprintk("RPC: %5u %s: write segment "
- "%d@0x016%llx:0x%08x (%s)\n",
+ dprintk("RPC: %5u %s: %u@0x016%llx:0x%08x (%s)\n",
rqst->rq_task->tk_pid, __func__,
- seg->mr_len, (unsigned long long)seg->mr_base,
- seg->mr_rkey, n < nsegs ? "more" : "last");
+ mw->mw_length, (unsigned long long)mw->mw_offset,
+ mw->mw_handle, n < nsegs ? "more" : "last");

r_xprt->rx_stats.write_chunk_count++;
r_xprt->rx_stats.total_rdma_request += seg->mr_len;
@@ -443,6 +447,7 @@ rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt,
__be32 *iptr, enum rpcrdma_chunktype wtype)
{
struct rpcrdma_mr_seg *seg = req->rl_nextseg;
+ struct rpcrdma_mw *mw;
int n, nsegs, nchunks;
__be32 *segcount;

@@ -461,17 +466,18 @@ rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt,

nchunks = 0;
do {
- n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs, true);
+ n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs,
+ true, &mw);
if (n < 0)
return ERR_PTR(n);
+ list_add(&mw->mw_list, &req->rl_registered);

- iptr = xdr_encode_rdma_segment(iptr, seg);
+ iptr = xdr_encode_rdma_segment(iptr, mw);

- dprintk("RPC: %5u %s: reply segment "
- "%d@0x%016llx:0x%08x (%s)\n",
+ dprintk("RPC: %5u %s: %u@0x%016llx:0x%08x (%s)\n",
rqst->rq_task->tk_pid, __func__,
- seg->mr_len, (unsigned long long)seg->mr_base,
- seg->mr_rkey, n < nsegs ? "more" : "last");
+ mw->mw_length, (unsigned long long)mw->mw_offset,
+ mw->mw_handle, n < nsegs ? "more" : "last");

r_xprt->rx_stats.reply_chunk_count++;
r_xprt->rx_stats.total_rdma_request += seg->mr_len;
@@ -690,10 +696,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
out_overflow:
pr_err("rpcrdma: send overflow: hdrlen %zd rpclen %zu %s/%s\n",
hdrlen, rpclen, transfertypes[rtype], transfertypes[wtype]);
- /* Terminate this RPC. Chunks registered above will be
- * released by xprt_release -> xprt_rmda_free .
- */
- return -EIO;
+ iptr = ERR_PTR(-EIO);

out_unmap:
r_xprt->rx_ia.ri_ops->ro_unmap_safe(r_xprt, req, false);
@@ -705,15 +708,13 @@ out_unmap:
* RDMA'd by server. See map at rpcrdma_create_chunks()! :-)
*/
static int
-rpcrdma_count_chunks(struct rpcrdma_rep *rep, unsigned int max, int wrchunk, __be32 **iptrp)
+rpcrdma_count_chunks(struct rpcrdma_rep *rep, int wrchunk, __be32 **iptrp)
{
unsigned int i, total_len;
struct rpcrdma_write_chunk *cur_wchunk;
char *base = (char *)rdmab_to_msg(rep->rr_rdmabuf);

i = be32_to_cpu(**iptrp);
- if (i > max)
- return -1;
cur_wchunk = (struct rpcrdma_write_chunk *) (*iptrp + 1);
total_len = 0;
while (i--) {
@@ -960,14 +961,13 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
(headerp->rm_body.rm_chunks[1] == xdr_zero &&
headerp->rm_body.rm_chunks[2] != xdr_zero) ||
(headerp->rm_body.rm_chunks[1] != xdr_zero &&
- req->rl_nchunks == 0))
+ list_empty(&req->rl_registered)))
goto badheader;
if (headerp->rm_body.rm_chunks[1] != xdr_zero) {
/* count any expected write chunks in read reply */
/* start at write chunk array count */
iptr = &headerp->rm_body.rm_chunks[2];
- rdmalen = rpcrdma_count_chunks(rep,
- req->rl_nchunks, 1, &iptr);
+ rdmalen = rpcrdma_count_chunks(rep, 1, &iptr);
/* check for validity, and no reply chunk after */
if (rdmalen < 0 || *iptr++ != xdr_zero)
goto badheader;
@@ -997,11 +997,11 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
if (headerp->rm_body.rm_chunks[0] != xdr_zero ||
headerp->rm_body.rm_chunks[1] != xdr_zero ||
headerp->rm_body.rm_chunks[2] != xdr_one ||
- req->rl_nchunks == 0)
+ list_empty(&req->rl_registered))
goto badheader;
iptr = (__be32 *)((unsigned char *)headerp +
RPCRDMA_HDRLEN_MIN);
- rdmalen = rpcrdma_count_chunks(rep, req->rl_nchunks, 0, &iptr);
+ rdmalen = rpcrdma_count_chunks(rep, 0, &iptr);
if (rdmalen < 0)
goto badheader;
r_xprt->rx_stats.total_rdma_reply += rdmalen;
@@ -1014,14 +1014,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)

badheader:
default:
- dprintk("%s: invalid rpcrdma reply header (type %d):"
- " chunks[012] == %d %d %d"
- " expected chunks <= %d\n",
- __func__, be32_to_cpu(headerp->rm_type),
- headerp->rm_body.rm_chunks[0],
- headerp->rm_body.rm_chunks[1],
- headerp->rm_body.rm_chunks[2],
- req->rl_nchunks);
+ dprintk("RPC: %5u %s: invalid rpcrdma reply (type %u)\n",
+ rqst->rq_task->tk_pid, __func__,
+ be32_to_cpu(headerp->rm_type));
status = -EIO;
r_xprt->rx_stats.bad_reply_count++;
break;
@@ -1035,7 +1030,7 @@ out:
* control: waking the next RPC waits until this RPC has
* relinquished all its Send Queue entries.
*/
- if (req->rl_nchunks)
+ if (!list_empty(&req->rl_registered))
r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);

spin_lock_bh(&xprt->transport_lock);
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index b1dd42a..81f0e87 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -619,6 +619,9 @@ xprt_rdma_send_request(struct rpc_task *task)
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
int rc = 0;

+ /* On retransmit, remove any previously registered chunks */
+ r_xprt->rx_ia.ri_ops->ro_unmap_safe(r_xprt, req, false);
+
rc = rpcrdma_marshal_req(rqst);
if (rc < 0)
goto failed_marshal;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4a7a712..ae99c04 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -862,6 +862,7 @@ rpcrdma_create_req(struct rpcrdma_xprt *r_xprt)
spin_unlock(&buffer->rb_reqslock);
req->rl_cqe.done = rpcrdma_wc_send;
req->rl_buffer = &r_xprt->rx_buf;
+ INIT_LIST_HEAD(&req->rl_registered);
return req;
}

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 0bde4c0..025365c 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -245,6 +245,9 @@ struct rpcrdma_mw {
struct rpcrdma_frmr frmr;
};
struct rpcrdma_xprt *mw_xprt;
+ u32 mw_handle;
+ u32 mw_length;
+ u64 mw_offset;
struct list_head mw_all;
};

@@ -272,11 +275,7 @@ struct rpcrdma_mw {
*/

struct rpcrdma_mr_seg { /* chunk descriptors */
- struct rpcrdma_mw *rl_mw; /* registered MR */
- u64 mr_base; /* registration result */
- u32 mr_rkey; /* registration result */
u32 mr_len; /* length of chunk or segment */
- int mr_nsegs; /* number of segments in chunk or 0 */
struct page *mr_page; /* owning page, if any */
char *mr_offset; /* kva if no page, else offset */
};
@@ -294,6 +293,7 @@ struct rpcrdma_req {
struct ib_sge rl_send_iov[RPCRDMA_MAX_IOVS];
struct rpcrdma_regbuf *rl_rdmabuf;
struct rpcrdma_regbuf *rl_sendbuf;
+ struct list_head rl_registered; /* registered segments */
struct rpcrdma_mr_seg rl_segments[RPCRDMA_MAX_SEGS];
struct rpcrdma_mr_seg *rl_nextseg;

@@ -397,7 +397,8 @@ struct rpcrdma_stats {
struct rpcrdma_xprt;
struct rpcrdma_memreg_ops {
int (*ro_map)(struct rpcrdma_xprt *,
- struct rpcrdma_mr_seg *, int, bool);
+ struct rpcrdma_mr_seg *, int, bool,
+ struct rpcrdma_mw **);
void (*ro_unmap_sync)(struct rpcrdma_xprt *,
struct rpcrdma_req *);
void (*ro_unmap_safe)(struct rpcrdma_xprt *,


2016-06-15 03:17:37

by Chuck Lever III


Currently, all three chunk list encoders each use a portion of the
one rl_segments array in rpcrdma_req. This is because the MWs for
each chunk list were preserved in rl_segments so that ro_unmap could
find and invalidate them after the RPC was complete.

However, now that MWs are placed on a per-req linked list as they
are registered, there is no longer any information in rpcrdma_mr_seg
that is shared between ro_map and ro_unmap_{sync,safe}, and thus
nothing in rl_segments needs to be preserved after
rpcrdma_marshal_req is complete.

Thus the rl_segments array can now serve just the needs of each
rpcrdma_convert_iovs call. Once a chunk list is encoded, the next
chunk list encoder is free to re-use all of rl_segments.

This means all three chunk lists in one RPC request can now each
encode a full size data payload with no increase in the size of
rl_segments.

This is a key requirement for Kerberos support, since both the Call
and Reply for a single RPC transaction are conveyed via Long
messages (RDMA Read/Write). Both can be large.
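
For a feel of the numbers, assuming 4KB pages (other page sizes give
different values), the new per-chunk-list segment budget works out as
below; because each encoder now starts from rl_segments[0], the whole
budget is available to one chunk list at a time rather than being
split across three:

#include <stdio.h>

int main(void)
{
        const unsigned int page_size = 4096;    /* assumption: x86_64 */
        const unsigned int max_iov_segs = 3;    /* RPCRDMA_MAX_IOV_SEGS */
        const unsigned int max_data_segs = (1024 * 1024) / page_size + 1;
        const unsigned int max_segs = max_data_segs + max_iov_segs;

        /* 257 data segments + 3 iov segments = 260 on 4KB pages */
        printf("segments available to each chunk list: %u\n", max_segs);
        return 0;
}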

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 61 ++++++++++++++++++---------------------
net/sunrpc/xprtrdma/xprt_rdma.h | 36 ++++++++++-------------
2 files changed, 44 insertions(+), 53 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 6d34c1f..f60d229 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -196,8 +196,7 @@ rpcrdma_tail_pullup(struct xdr_buf *buf)
* MR when they can.
*/
static int
-rpcrdma_convert_kvec(struct kvec *vec, struct rpcrdma_mr_seg *seg,
- int n, int nsegs)
+rpcrdma_convert_kvec(struct kvec *vec, struct rpcrdma_mr_seg *seg, int n)
{
size_t page_offset;
u32 remaining;
@@ -206,7 +205,7 @@ rpcrdma_convert_kvec(struct kvec *vec, struct rpcrdma_mr_seg *seg,
base = vec->iov_base;
page_offset = offset_in_page(base);
remaining = vec->iov_len;
- while (remaining && n < nsegs) {
+ while (remaining && n < RPCRDMA_MAX_SEGS) {
seg[n].mr_page = NULL;
seg[n].mr_offset = base;
seg[n].mr_len = min_t(u32, PAGE_SIZE - page_offset, remaining);
@@ -230,23 +229,23 @@ rpcrdma_convert_kvec(struct kvec *vec, struct rpcrdma_mr_seg *seg,

static int
rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
- enum rpcrdma_chunktype type, struct rpcrdma_mr_seg *seg, int nsegs)
+ enum rpcrdma_chunktype type, struct rpcrdma_mr_seg *seg)
{
- int len, n = 0, p;
- int page_base;
+ int len, n, p, page_base;
struct page **ppages;

+ n = 0;
if (pos == 0) {
- n = rpcrdma_convert_kvec(&xdrbuf->head[0], seg, n, nsegs);
- if (n == nsegs)
- return -EIO;
+ n = rpcrdma_convert_kvec(&xdrbuf->head[0], seg, n);
+ if (n == RPCRDMA_MAX_SEGS)
+ goto out_overflow;
}

len = xdrbuf->page_len;
ppages = xdrbuf->pages + (xdrbuf->page_base >> PAGE_SHIFT);
page_base = xdrbuf->page_base & ~PAGE_MASK;
p = 0;
- while (len && n < nsegs) {
+ while (len && n < RPCRDMA_MAX_SEGS) {
if (!ppages[p]) {
/* alloc the pagelist for receiving buffer */
ppages[p] = alloc_page(GFP_ATOMIC);
@@ -257,7 +256,7 @@ rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
seg[n].mr_offset = (void *)(unsigned long) page_base;
seg[n].mr_len = min_t(u32, PAGE_SIZE - page_base, len);
if (seg[n].mr_len > PAGE_SIZE)
- return -EIO;
+ goto out_overflow;
len -= seg[n].mr_len;
++n;
++p;
@@ -265,8 +264,8 @@ rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
}

/* Message overflows the seg array */
- if (len && n == nsegs)
- return -EIO;
+ if (len && n == RPCRDMA_MAX_SEGS)
+ goto out_overflow;

/* When encoding the read list, the tail is always sent inline */
if (type == rpcrdma_readch)
@@ -277,12 +276,16 @@ rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
* xdr pad bytes, saving the server an RDMA operation. */
if (xdrbuf->tail[0].iov_len < 4 && xprt_rdma_pad_optimize)
return n;
- n = rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, n, nsegs);
- if (n == nsegs)
- return -EIO;
+ n = rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, n);
+ if (n == RPCRDMA_MAX_SEGS)
+ goto out_overflow;
}

return n;
+
+out_overflow:
+ pr_err("rpcrdma: segment array overflow\n");
+ return -EIO;
}

static inline __be32 *
@@ -310,7 +313,7 @@ rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt,
struct rpcrdma_req *req, struct rpc_rqst *rqst,
__be32 *iptr, enum rpcrdma_chunktype rtype)
{
- struct rpcrdma_mr_seg *seg = req->rl_nextseg;
+ struct rpcrdma_mr_seg *seg;
struct rpcrdma_mw *mw;
unsigned int pos;
int n, nsegs;
@@ -323,8 +326,8 @@ rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt,
pos = rqst->rq_snd_buf.head[0].iov_len;
if (rtype == rpcrdma_areadch)
pos = 0;
- nsegs = rpcrdma_convert_iovs(&rqst->rq_snd_buf, pos, rtype, seg,
- RPCRDMA_MAX_SEGS - req->rl_nchunks);
+ seg = req->rl_segments;
+ nsegs = rpcrdma_convert_iovs(&rqst->rq_snd_buf, pos, rtype, seg);
if (nsegs < 0)
return ERR_PTR(nsegs);

@@ -349,11 +352,9 @@ rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt,
mw->mw_handle, n < nsegs ? "more" : "last");

r_xprt->rx_stats.read_chunk_count++;
- req->rl_nchunks++;
seg += n;
nsegs -= n;
} while (nsegs);
- req->rl_nextseg = seg;

/* Finish Read list */
*iptr++ = xdr_zero; /* Next item not present */
@@ -377,7 +378,7 @@ rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
struct rpc_rqst *rqst, __be32 *iptr,
enum rpcrdma_chunktype wtype)
{
- struct rpcrdma_mr_seg *seg = req->rl_nextseg;
+ struct rpcrdma_mr_seg *seg;
struct rpcrdma_mw *mw;
int n, nsegs, nchunks;
__be32 *segcount;
@@ -387,10 +388,10 @@ rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
return iptr;
}

+ seg = req->rl_segments;
nsegs = rpcrdma_convert_iovs(&rqst->rq_rcv_buf,
rqst->rq_rcv_buf.head[0].iov_len,
- wtype, seg,
- RPCRDMA_MAX_SEGS - req->rl_nchunks);
+ wtype, seg);
if (nsegs < 0)
return ERR_PTR(nsegs);

@@ -414,12 +415,10 @@ rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,

r_xprt->rx_stats.write_chunk_count++;
r_xprt->rx_stats.total_rdma_request += seg->mr_len;
- req->rl_nchunks++;
nchunks++;
seg += n;
nsegs -= n;
} while (nsegs);
- req->rl_nextseg = seg;

/* Update count of segments in this Write chunk */
*segcount = cpu_to_be32(nchunks);
@@ -446,7 +445,7 @@ rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt,
struct rpcrdma_req *req, struct rpc_rqst *rqst,
__be32 *iptr, enum rpcrdma_chunktype wtype)
{
- struct rpcrdma_mr_seg *seg = req->rl_nextseg;
+ struct rpcrdma_mr_seg *seg;
struct rpcrdma_mw *mw;
int n, nsegs, nchunks;
__be32 *segcount;
@@ -456,8 +455,8 @@ rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt,
return iptr;
}

- nsegs = rpcrdma_convert_iovs(&rqst->rq_rcv_buf, 0, wtype, seg,
- RPCRDMA_MAX_SEGS - req->rl_nchunks);
+ seg = req->rl_segments;
+ nsegs = rpcrdma_convert_iovs(&rqst->rq_rcv_buf, 0, wtype, seg);
if (nsegs < 0)
return ERR_PTR(nsegs);

@@ -481,12 +480,10 @@ rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt,

r_xprt->rx_stats.reply_chunk_count++;
r_xprt->rx_stats.total_rdma_request += seg->mr_len;
- req->rl_nchunks++;
nchunks++;
seg += n;
nsegs -= n;
} while (nsegs);
- req->rl_nextseg = seg;

/* Update count of segments in the Reply chunk */
*segcount = cpu_to_be32(nchunks);
@@ -656,8 +653,6 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
* send a Call message with a Position Zero Read chunk and a
* regular Read chunk at the same time.
*/
- req->rl_nchunks = 0;
- req->rl_nextseg = req->rl_segments;
iptr = headerp->rm_body.rm_chunks;
iptr = rpcrdma_encode_read_list(r_xprt, req, rqst, iptr, rtype);
if (IS_ERR(iptr))
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 025365c..c713483 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -171,23 +171,14 @@ rdmab_to_msg(struct rpcrdma_regbuf *rb)
* o recv buffer (posted to provider)
* o ib_sge (also donated to provider)
* o status of reply (length, success or not)
- * o bookkeeping state to get run by tasklet (list, etc)
+ * o bookkeeping state to get run by reply handler (list, etc)
*
- * These are allocated during initialization, per-transport instance;
- * however, the tasklet execution list itself is global, as it should
- * always be pretty short.
+ * These are allocated during initialization, per-transport instance.
*
* N of these are associated with a transport instance, and stored in
* struct rpcrdma_buffer. N is the max number of outstanding requests.
*/

-#define RPCRDMA_MAX_DATA_SEGS ((1 * 1024 * 1024) / PAGE_SIZE)
-
-/* data segments + head/tail for Call + head/tail for Reply */
-#define RPCRDMA_MAX_SEGS (RPCRDMA_MAX_DATA_SEGS + 4)
-
-struct rpcrdma_buffer;
-
struct rpcrdma_rep {
struct ib_cqe rr_cqe;
unsigned int rr_len;
@@ -267,13 +258,18 @@ struct rpcrdma_mw {
* of iovs for send operations. The reason is that the iovs passed to
* ib_post_{send,recv} must not be modified until the work request
* completes.
- *
- * NOTES:
- * o RPCRDMA_MAX_SEGS is the max number of addressible chunk elements we
- * marshal. The number needed varies depending on the iov lists that
- * are passed to us and the memory registration mode we are in.
*/

+/* Maximum number of page-sized "segments" per chunk list to be
+ * registered or invalidated. Must handle a Reply chunk:
+ */
+enum {
+ RPCRDMA_MAX_IOV_SEGS = 3,
+ RPCRDMA_MAX_DATA_SEGS = ((1 * 1024 * 1024) / PAGE_SIZE) + 1,
+ RPCRDMA_MAX_SEGS = RPCRDMA_MAX_DATA_SEGS +
+ RPCRDMA_MAX_IOV_SEGS,
+};
+
struct rpcrdma_mr_seg { /* chunk descriptors */
u32 mr_len; /* length of chunk or segment */
struct page *mr_page; /* owning page, if any */
@@ -282,10 +278,10 @@ struct rpcrdma_mr_seg { /* chunk descriptors */

#define RPCRDMA_MAX_IOVS (2)

+struct rpcrdma_buffer;
struct rpcrdma_req {
struct list_head rl_free;
unsigned int rl_niovs;
- unsigned int rl_nchunks;
unsigned int rl_connect_cookie;
struct rpc_task *rl_task;
struct rpcrdma_buffer *rl_buffer;
@@ -293,13 +289,13 @@ struct rpcrdma_req {
struct ib_sge rl_send_iov[RPCRDMA_MAX_IOVS];
struct rpcrdma_regbuf *rl_rdmabuf;
struct rpcrdma_regbuf *rl_sendbuf;
- struct list_head rl_registered; /* registered segments */
- struct rpcrdma_mr_seg rl_segments[RPCRDMA_MAX_SEGS];
- struct rpcrdma_mr_seg *rl_nextseg;

struct ib_cqe rl_cqe;
struct list_head rl_all;
bool rl_backchannel;
+
+ struct list_head rl_registered; /* registered segments */
+ struct rpcrdma_mr_seg rl_segments[RPCRDMA_MAX_SEGS];
};

static inline struct rpcrdma_req *


2016-06-15 03:17:44

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 18/24] xprtrdma: rpcrdma_inline_fixup() overruns the receive page list

When the remaining length of an incoming reply is longer than the
XDR buf's page_len, switch over to the tail iovec instead of
copying more than page_len bytes into the page list.
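
For example (hypothetical numbers): if the reply's page_len is 4096
bytes but 4200 bytes of payload remain, only the first 4096 bytes are
copied into the page list; the remaining 104 bytes are left for the
existing tail iovec handling.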

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index f60d229..e3560c2 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -773,12 +773,17 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
page_base &= ~PAGE_MASK;

if (copy_len && rqst->rq_rcv_buf.page_len) {
- npages = PAGE_ALIGN(page_base +
- rqst->rq_rcv_buf.page_len) >> PAGE_SHIFT;
+ int pagelist_len;
+
+ pagelist_len = rqst->rq_rcv_buf.page_len;
+ if (pagelist_len > copy_len)
+ pagelist_len = copy_len;
+ npages = PAGE_ALIGN(page_base + pagelist_len) >> PAGE_SHIFT;
for (; i < npages; i++) {
curlen = PAGE_SIZE - page_base;
- if (curlen > copy_len)
- curlen = copy_len;
+ if (curlen > pagelist_len)
+ curlen = pagelist_len;
+
dprintk("RPC: %s: page %d"
" srcp 0x%p len %d curlen %d\n",
__func__, i, srcp, copy_len, curlen);
@@ -788,7 +793,8 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
kunmap_atomic(destp);
srcp += curlen;
copy_len -= curlen;
- if (copy_len == 0)
+ pagelist_len -= curlen;
+ if (!pagelist_len)
break;
page_base = 0;
}


2016-06-15 03:17:58

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 19/24] xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup()

While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.

The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.

As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:

- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same

Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.

To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.

While I remember all this, write down the conclusion in documenting
comments.
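
For reference, the receive buffer described above is the generic
struct xdr_buf. A rough, abbreviated sketch of its fields (my
paraphrase of include/linux/sunrpc/xdr.h, not a verbatim copy;
comments are mine):

struct xdr_buf {
	struct kvec	head[1];	/* RPC header and inline data */
	struct kvec	tail[1];	/* data following the page list */
	struct page	**pages;	/* page list for bulk payload */
	unsigned int	page_base;	/* offset of payload in first page */
	unsigned int	page_len;	/* length of payload in page list */
	unsigned int	flags;		/* e.g. XDRBUF_READ, XDRBUF_WRITE */
	unsigned int	buflen;		/* total capacity of the buffer */
	unsigned int	len;		/* length of XDR-encoded message */
};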

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 61 ++++++++++++++++++++++------------------
1 file changed, 33 insertions(+), 28 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index e3560c2..d018eb7 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -740,8 +740,16 @@ rpcrdma_count_chunks(struct rpcrdma_rep *rep, int wrchunk, __be32 **iptrp)
return total_len;
}

-/*
- * Scatter inline received data back into provided iov's.
+/**
+ * rpcrdma_inline_fixup - Scatter inline received data into rqst's iovecs
+ * @rqst: controlling RPC request
+ * @srcp: points to RPC message payload in receive buffer
+ * @copy_len: remaining length of receive buffer content
+ * @pad: Write chunk pad bytes needed (zero for pure inline)
+ *
+ * The upper layer has set the maximum number of bytes it can
+ * receive in each component of rq_rcv_buf. These values are set in
+ * the head.iov_len, page_len, tail.iov_len, and buflen fields.
*/
static void
rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
@@ -751,17 +759,19 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
struct page **ppages;
int page_base;

+ /* The head iovec is redirected to the RPC reply message
+ * in the receive buffer, to avoid a memcopy.
+ */
+ rqst->rq_rcv_buf.head[0].iov_base = srcp;
+
+ /* The contents of the receive buffer that follow
+ * head.iov_len bytes are copied into the page list.
+ */
curlen = rqst->rq_rcv_buf.head[0].iov_len;
- if (curlen > copy_len) { /* write chunk header fixup */
+ if (curlen > copy_len)
curlen = copy_len;
- rqst->rq_rcv_buf.head[0].iov_len = curlen;
- }
-
dprintk("RPC: %s: srcp 0x%p len %d hdrlen %d\n",
__func__, srcp, copy_len, curlen);
-
- /* Shift pointer for first receive segment only */
- rqst->rq_rcv_buf.head[0].iov_base = srcp;
srcp += curlen;
copy_len -= curlen;

@@ -798,28 +808,23 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
break;
page_base = 0;
}
- }

- if (copy_len && rqst->rq_rcv_buf.tail[0].iov_len) {
- curlen = copy_len;
- if (curlen > rqst->rq_rcv_buf.tail[0].iov_len)
- curlen = rqst->rq_rcv_buf.tail[0].iov_len;
- if (rqst->rq_rcv_buf.tail[0].iov_base != srcp)
- memmove(rqst->rq_rcv_buf.tail[0].iov_base, srcp, curlen);
- dprintk("RPC: %s: tail srcp 0x%p len %d curlen %d\n",
- __func__, srcp, copy_len, curlen);
- rqst->rq_rcv_buf.tail[0].iov_len = curlen;
- copy_len -= curlen; ++i;
- } else
- rqst->rq_rcv_buf.tail[0].iov_len = 0;
-
- if (pad) {
- /* implicit padding on terminal chunk */
- unsigned char *p = rqst->rq_rcv_buf.tail[0].iov_base;
- while (pad--)
- p[rqst->rq_rcv_buf.tail[0].iov_len++] = 0;
+ /* Implicit padding for the last segment in a Write
+ * chunk is inserted inline at the front of the tail
+ * iovec. The upper layer ignores the content of
+ * the pad. Simply ensure inline content in the tail
+ * that follows the Write chunk is properly aligned.
+ */
+ if (pad)
+ srcp -= pad;
}

+ /* The tail iovec is redirected to the remaining data
+ * in the receive buffer, to avoid a memcopy.
+ */
+ if (copy_len || pad)
+ rqst->rq_rcv_buf.tail[0].iov_base = srcp;
+
if (copy_len)
dprintk("RPC: %s: %d bytes in"
" %d extra segments (%d lost)\n",


2016-06-15 03:18:02

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 20/24] xprtrdma: Update only specific fields in private receive buffer

Now that rpcrdma_inline_fixup() updates only two fields in
rq_rcv_buf, a full memcpy of that structure to rq_private_buf is
unwarranted. Updating rq_private_buf fields only where needed also
better documents what is going on.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index d018eb7..a0e811d 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -750,6 +750,11 @@ rpcrdma_count_chunks(struct rpcrdma_rep *rep, int wrchunk, __be32 **iptrp)
* The upper layer has set the maximum number of bytes it can
* receive in each component of rq_rcv_buf. These values are set in
* the head.iov_len, page_len, tail.iov_len, and buflen fields.
+ *
+ * Unlike the TCP equivalent (xdr_partial_copy_from_skb), in
+ * many cases this function simply updates iov_base pointers in
+ * rq_rcv_buf to point directly to the received reply data, to
+ * avoid copying reply data.
*/
static void
rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
@@ -763,6 +768,7 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
* in the receive buffer, to avoid a memcopy.
*/
rqst->rq_rcv_buf.head[0].iov_base = srcp;
+ rqst->rq_private_buf.head[0].iov_base = srcp;

/* The contents of the receive buffer that follow
* head.iov_len bytes are copied into the page list.
@@ -822,16 +828,15 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
/* The tail iovec is redirected to the remaining data
* in the receive buffer, to avoid a memcopy.
*/
- if (copy_len || pad)
+ if (copy_len || pad) {
rqst->rq_rcv_buf.tail[0].iov_base = srcp;
+ rqst->rq_private_buf.tail[0].iov_base = srcp;
+ }

if (copy_len)
dprintk("RPC: %s: %d bytes in"
" %d extra segments (%d lost)\n",
__func__, olen, i, copy_len);
-
- /* TBD avoid a warning from call_decode() */
- rqst->rq_private_buf = rqst->rq_rcv_buf;
}

void


2016-06-15 03:18:10

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 21/24] xprtrdma: Clean up fixup_copy_count accounting

fixup_copy_count should count only the number of bytes copied to the
page list. The head and tail are now always handled without a data
copy.

And the debugging at the end of rpcrdma_inline_fixup() is also no
longer necessary, since copy_len will be non-zero when there is reply
data in the tail (a normal and valid case).

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index a0e811d..dac2990 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -755,11 +755,14 @@ rpcrdma_count_chunks(struct rpcrdma_rep *rep, int wrchunk, __be32 **iptrp)
* many cases this function simply updates iov_base pointers in
* rq_rcv_buf to point directly to the received reply data, to
* avoid copying reply data.
+ *
+ * Returns the count of bytes which had to be memcopied.
*/
-static void
+static unsigned long
rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
{
- int i, npages, curlen, olen;
+ unsigned long fixup_copy_count;
+ int i, npages, curlen;
char *destp;
struct page **ppages;
int page_base;
@@ -781,13 +784,10 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
srcp += curlen;
copy_len -= curlen;

- olen = copy_len;
- i = 0;
- rpcx_to_rdmax(rqst->rq_xprt)->rx_stats.fixup_copy_count += olen;
page_base = rqst->rq_rcv_buf.page_base;
ppages = rqst->rq_rcv_buf.pages + (page_base >> PAGE_SHIFT);
page_base &= ~PAGE_MASK;
-
+ fixup_copy_count = 0;
if (copy_len && rqst->rq_rcv_buf.page_len) {
int pagelist_len;

@@ -795,7 +795,7 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
if (pagelist_len > copy_len)
pagelist_len = copy_len;
npages = PAGE_ALIGN(page_base + pagelist_len) >> PAGE_SHIFT;
- for (; i < npages; i++) {
+ for (i = 0; i < npages; i++) {
curlen = PAGE_SIZE - page_base;
if (curlen > pagelist_len)
curlen = pagelist_len;
@@ -809,6 +809,7 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
kunmap_atomic(destp);
srcp += curlen;
copy_len -= curlen;
+ fixup_copy_count += curlen;
pagelist_len -= curlen;
if (!pagelist_len)
break;
@@ -833,10 +834,7 @@ rpcrdma_inline_fixup(struct rpc_rqst *rqst, char *srcp, int copy_len, int pad)
rqst->rq_private_buf.tail[0].iov_base = srcp;
}

- if (copy_len)
- dprintk("RPC: %s: %d bytes in"
- " %d extra segments (%d lost)\n",
- __func__, olen, i, copy_len);
+ return fixup_copy_count;
}

void
@@ -999,8 +997,10 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
rep->rr_len -= RPCRDMA_HDRLEN_MIN;
status = rep->rr_len;
}
- /* Fix up the rpc results for upper layer */
- rpcrdma_inline_fixup(rqst, (char *)iptr, rep->rr_len, rdmalen);
+
+ r_xprt->rx_stats.fixup_copy_count +=
+ rpcrdma_inline_fixup(rqst, (char *)iptr, rep->rr_len,
+ rdmalen);
break;

case rdma_nomsg:


2016-06-15 03:18:18

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 22/24] xprtrdma: No direct data placement with krb5i and krb5p

Direct data placement is not allowed when using flavors that
guarantee integrity or privacy. When such security flavors are in
effect, don't allow the use of Read and Write chunks for moving
individual data items. All messages larger than the inline threshold
are sent via Long Call or Long Reply.

On my systems (CX-3 Pro on FDR), for small I/O operations, the use
of Long messages adds only around 5 usecs of latency in each
direction.

Note that when integrity or encryption is used, the host CPU touches
every byte in these messages. Even if it could be used, data
movement offload doesn't buy much in this case.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/auth.h | 3 +++
include/linux/sunrpc/gss_api.h | 2 ++
net/sunrpc/auth_gss/auth_gss.c | 2 ++
net/sunrpc/auth_gss/gss_krb5_mech.c | 2 ++
net/sunrpc/auth_gss/gss_mech_switch.c | 12 ++++++++++++
net/sunrpc/xprtrdma/rpc_rdma.c | 12 ++++++++++--
6 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/sunrpc/auth.h b/include/linux/sunrpc/auth.h
index 8997915..3a40287 100644
--- a/include/linux/sunrpc/auth.h
+++ b/include/linux/sunrpc/auth.h
@@ -107,6 +107,9 @@ struct rpc_auth {
/* per-flavor data */
};

+/* rpc_auth au_flags */
+#define RPCAUTH_AUTH_DATATOUCH 0x00000002
+
struct rpc_auth_create_args {
rpc_authflavor_t pseudoflavor;
const char *target_name;
diff --git a/include/linux/sunrpc/gss_api.h b/include/linux/sunrpc/gss_api.h
index 1f911cc..68ec78c 100644
--- a/include/linux/sunrpc/gss_api.h
+++ b/include/linux/sunrpc/gss_api.h
@@ -73,6 +73,7 @@ u32 gss_delete_sec_context(
rpc_authflavor_t gss_svc_to_pseudoflavor(struct gss_api_mech *, u32 qop,
u32 service);
u32 gss_pseudoflavor_to_service(struct gss_api_mech *, u32 pseudoflavor);
+bool gss_pseudoflavor_to_datatouch(struct gss_api_mech *, u32 pseudoflavor);
char *gss_service_to_auth_domain_name(struct gss_api_mech *, u32 service);

struct pf_desc {
@@ -81,6 +82,7 @@ struct pf_desc {
u32 service;
char *name;
char *auth_domain_name;
+ bool datatouch;
};

/* Different mechanisms (e.g., krb5 or spkm3) may implement gss-api, and
diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index e64ae93..bca3537 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1017,6 +1017,8 @@ gss_create_new(struct rpc_auth_create_args *args, struct rpc_clnt *clnt)
auth->au_rslack = GSS_VERF_SLACK >> 2;
auth->au_ops = &authgss_ops;
auth->au_flavor = flavor;
+ if (gss_pseudoflavor_to_datatouch(gss_auth->mech, flavor))
+ auth->au_flags |= RPCAUTH_AUTH_DATATOUCH;
atomic_set(&auth->au_count, 1);
kref_init(&gss_auth->kref);

diff --git a/net/sunrpc/auth_gss/gss_krb5_mech.c b/net/sunrpc/auth_gss/gss_krb5_mech.c
index 6542749..6059583 100644
--- a/net/sunrpc/auth_gss/gss_krb5_mech.c
+++ b/net/sunrpc/auth_gss/gss_krb5_mech.c
@@ -745,12 +745,14 @@ static struct pf_desc gss_kerberos_pfs[] = {
.qop = GSS_C_QOP_DEFAULT,
.service = RPC_GSS_SVC_INTEGRITY,
.name = "krb5i",
+ .datatouch = true,
},
[2] = {
.pseudoflavor = RPC_AUTH_GSS_KRB5P,
.qop = GSS_C_QOP_DEFAULT,
.service = RPC_GSS_SVC_PRIVACY,
.name = "krb5p",
+ .datatouch = true,
},
};

diff --git a/net/sunrpc/auth_gss/gss_mech_switch.c b/net/sunrpc/auth_gss/gss_mech_switch.c
index 7063d85..5fec3ab 100644
--- a/net/sunrpc/auth_gss/gss_mech_switch.c
+++ b/net/sunrpc/auth_gss/gss_mech_switch.c
@@ -361,6 +361,18 @@ gss_pseudoflavor_to_service(struct gss_api_mech *gm, u32 pseudoflavor)
}
EXPORT_SYMBOL(gss_pseudoflavor_to_service);

+bool
+gss_pseudoflavor_to_datatouch(struct gss_api_mech *gm, u32 pseudoflavor)
+{
+ int i;
+
+ for (i = 0; i < gm->gm_pf_num; i++) {
+ if (gm->gm_pfs[i].pseudoflavor == pseudoflavor)
+ return gm->gm_pfs[i].datatouch;
+ }
+ return false;
+}
+
char *
gss_service_to_auth_domain_name(struct gss_api_mech *gm, u32 service)
{
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index dac2990..a47f170 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -570,6 +570,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
enum rpcrdma_chunktype rtype, wtype;
struct rpcrdma_msg *headerp;
+ bool ddp_allowed;
ssize_t hdrlen;
size_t rpclen;
__be32 *iptr;
@@ -586,6 +587,13 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
headerp->rm_credit = cpu_to_be32(r_xprt->rx_buf.rb_max_requests);
headerp->rm_type = rdma_msg;

+ /* When the ULP employs a GSS flavor that guarantees integrity
+ * or privacy, direct data placement of individual data items
+ * is not allowed.
+ */
+ ddp_allowed = !(rqst->rq_cred->cr_auth->au_flags &
+ RPCAUTH_AUTH_DATATOUCH);
+
/*
* Chunks needed for results?
*
@@ -597,7 +605,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
*/
if (rpcrdma_results_inline(r_xprt, rqst))
wtype = rpcrdma_noch;
- else if (rqst->rq_rcv_buf.flags & XDRBUF_READ)
+ else if (ddp_allowed && rqst->rq_rcv_buf.flags & XDRBUF_READ)
wtype = rpcrdma_writech;
else
wtype = rpcrdma_replych;
@@ -620,7 +628,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
rtype = rpcrdma_noch;
rpcrdma_inline_pullup(rqst);
rpclen = rqst->rq_svec[0].iov_len;
- } else if (rqst->rq_snd_buf.flags & XDRBUF_WRITE) {
+ } else if (ddp_allowed && rqst->rq_snd_buf.flags & XDRBUF_WRITE) {
rtype = rpcrdma_readch;
rpclen = rqst->rq_svec[0].iov_len;
rpclen += rpcrdma_tail_pullup(&rqst->rq_snd_buf);


2016-06-15 03:18:29

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 23/24] svc: Avoid garbage replies when pc_func() returns rpc_drop_reply

If an RPC program does not set vs_dispatch and pc_func() returns
rpc_drop_reply, the server sends a reply anyway, consisting of a
single word that contains the value RPC_DROP_REPLY (in network
byte-order, of course). This is a nonsense RPC message.

Fixes: 9e701c610923 ("svcrpc: simpler request dropping")
Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/svc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index cc98528..87290a5 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1188,7 +1188,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
*statp = procp->pc_func(rqstp, rqstp->rq_argp, rqstp->rq_resp);

/* Encode reply */
- if (test_bit(RQ_DROPME, &rqstp->rq_flags)) {
+ if (*statp == rpc_drop_reply ||
+ test_bit(RQ_DROPME, &rqstp->rq_flags)) {
if (procp->pc_release)
procp->pc_release(rqstp, NULL, rqstp->rq_resp);
goto dropit;


2016-06-15 03:18:34

by Chuck Lever III

[permalink] [raw]
Subject: [PATCH v2 24/24] NFS: Don't drop CB requests with invalid principals

Before commit 778be232a207 ("NFS do not find client in NFSv4
pg_authenticate"), the Linux callback server replied with
RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB
request. Let's restore that behavior so the server has a chance to
do something useful about it, and provide a warning that helps
admins correct the problem.

Fixes: 778be232a207 ("NFS do not find client in NFSv4 ...")
Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfs/callback_xdr.c | 6 +++++-
net/sunrpc/svc.c | 5 +++++
2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/callback_xdr.c b/fs/nfs/callback_xdr.c
index d81f96a..656f68f 100644
--- a/fs/nfs/callback_xdr.c
+++ b/fs/nfs/callback_xdr.c
@@ -925,7 +925,7 @@ static __be32 nfs4_callback_compound(struct svc_rqst *rqstp, void *argp, void *r
if (hdr_arg.minorversion == 0) {
cps.clp = nfs4_find_client_ident(SVC_NET(rqstp), hdr_arg.cb_ident);
if (!cps.clp || !check_gss_callback_principal(cps.clp, rqstp))
- return rpc_drop_reply;
+ goto out_invalidcred;
}

cps.minorversion = hdr_arg.minorversion;
@@ -953,6 +953,10 @@ static __be32 nfs4_callback_compound(struct svc_rqst *rqstp, void *argp, void *r
nfs_put_client(cps.clp);
dprintk("%s: done, status = %u\n", __func__, ntohl(status));
return rpc_success;
+
+out_invalidcred:
+ pr_warn_ratelimited("NFS: NFSv4 callback contains invalid cred\n");
+ return rpc_autherr_badcred;
}

/*
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 87290a5..c5b0cb4 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1194,6 +1194,11 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
procp->pc_release(rqstp, NULL, rqstp->rq_resp);
goto dropit;
}
+ if (*statp == rpc_autherr_badcred) {
+ if (procp->pc_release)
+ procp->pc_release(rqstp, NULL, rqstp->rq_resp);
+ goto err_bad_auth;
+ }
if (*statp == rpc_success &&
(xdr = procp->pc_encode) &&
!xdr(rqstp, resv->iov_base+resv->iov_len, rqstp->rq_resp)) {


2016-06-15 04:28:55

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
> From: Sagi Grimberg <[email protected]>
>
> kmalloc doesn't guarantee the returned memory is all on one page.

IMHO, the patch posted by Christoph in that thread is the best way to go,
because you changed streaming DMA mappings to be coherent DMA mappings [1].

"The kernel developers recommend the use of streaming mappings over
coherent mappings whenever possible" [1].

[1] http://www.makelinux.net/ldd3/chp-15-sect-4



2016-06-15 16:40:03

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


> On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <[email protected]> wrote:
>
> On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
>> From: Sagi Grimberg <[email protected]>
>>
>> kmalloc doesn't guarantee the returned memory is all on one page.
>
> IMHO, the patch posted by Christoph at that thread is best way to go,
> because you changed streaming DMA mappings to be coherent DMA mappings [1].
>
> "The kernel developers recommend the use of streaming mappings over
> coherent mappings whenever possible" [1].
>
> [1] http://www.makelinux.net/ldd3/chp-15-sect-4

Hi Leon-

I'll happily drop this patch from my 4.8 series as soon
as an official mlx4/mlx5 fix is merged.

Meanwhile, I notice some unexplained instability (driver
resets, list corruption, and so on) when I test NFS/RDMA
without this patch included. So it is attached to the
series for anyone with mlx4 who wants to pull my topic
branch and try it out.


--
Chuck Lever




2016-06-16 14:35:48

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

On Wed, Jun 15, 2016 at 12:40:07PM -0400, Chuck Lever wrote:
>
> > On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <[email protected]> wrote:
> >
> > On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
> >> From: Sagi Grimberg <[email protected]>
> >>
> >> kmalloc doesn't guarantee the returned memory is all on one page.
> >
> > IMHO, the patch posted by Christoph at that thread is best way to go,
> > because you changed streaming DMA mappings to be coherent DMA mappings [1].
> >
> > "The kernel developers recommend the use of streaming mappings over
> > coherent mappings whenever possible" [1].
> >
> > [1] http://www.makelinux.net/ldd3/chp-15-sect-4
>
> Hi Leon-
>
> I'll happily drop this patch from my 4.8 series as soon
> as an official mlx4/mlx5 fix is merged.
>
> Meanwhile, I notice some unexplained instability (driver
> resets, list corruption, and so on) when I test NFS/RDMA
> without this patch included. So it is attached to the
> series for anyone with mlx4 who wants to pull my topic
> branch and try it out.

hi Chuck,

We plan to send the attached patch during our second round of fixes for
mlx4/mlx5, and would be grateful if you could provide your Tested-by
tag beforehand.

Thanks



2016-06-16 21:10:37

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


>>> On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <[email protected]> wrote:
>>>
>>> On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
>>>> From: Sagi Grimberg <[email protected]>
>>>>
>>>> kmalloc doesn't guarantee the returned memory is all on one page.
>>>
>>> IMHO, the patch posted by Christoph at that thread is best way to go,
>>> because you changed streaming DMA mappings to be coherent DMA mappings [1].
>>>
>>> "The kernel developers recommend the use of streaming mappings over
>>> coherent mappings whenever possible" [1].
>>>
>>> [1] http://www.makelinux.net/ldd3/chp-15-sect-4
>>
>> Hi Leon-
>>
>> I'll happily drop this patch from my 4.8 series as soon
>> as an official mlx4/mlx5 fix is merged.
>>
>> Meanwhile, I notice some unexplained instability (driver
>> resets, list corruption, and so on) when I test NFS/RDMA
>> without this patch included. So it is attached to the
>> series for anyone with mlx4 who wants to pull my topic
>> branch and try it out.
>
> hi Chuck,
>
> We plan to send attached patch during our second round of fixes for
> mlx4/mlx5 and would be grateful to you if you could provide your
> Tested-by tag before.

First of all, IIRC the patch author was Christoph, wasn't he?

Plus, you do realize that this patch makes the pages allocation
page-granular. On systems with a large page size this is completely
wasteful, and it might even be harmful, as the storage ULPs need
lots of MRs.

Also, I don't see how that solves the issue; I'm not sure I even
understand the issue. Do you? Were you able to reproduce it?

IFF the pages buffer end not being aligned to a cacheline is problematic,
then why not extend it to end on a cacheline? Why extend to the next full page?

2016-06-16 21:58:37

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


> On Jun 16, 2016, at 5:10 PM, Sagi Grimberg <[email protected]> wrote:
>
>
>>>> On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <[email protected]> wrote:
>>>>
>>>> On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
>>>>> From: Sagi Grimberg <[email protected]>
>>>>>
>>>>> kmalloc doesn't guarantee the returned memory is all on one page.
>>>>
>>>> IMHO, the patch posted by Christoph at that thread is best way to go,
>>>> because you changed streaming DMA mappings to be coherent DMA mappings [1].
>>>>
>>>> "The kernel developers recommend the use of streaming mappings over
>>>> coherent mappings whenever possible" [1].
>>>>
>>>> [1] http://www.makelinux.net/ldd3/chp-15-sect-4
>>>
>>> Hi Leon-
>>>
>>> I'll happily drop this patch from my 4.8 series as soon
>>> as an official mlx4/mlx5 fix is merged.
>>>
>>> Meanwhile, I notice some unexplained instability (driver
>>> resets, list corruption, and so on) when I test NFS/RDMA
>>> without this patch included. So it is attached to the
>>> series for anyone with mlx4 who wants to pull my topic
>>> branch and try it out.
>>
>> hi Chuck,
>>
>> We plan to send attached patch during our second round of fixes for
>> mlx4/mlx5 and would be grateful to you if you could provide your
>> Tested-by tag before.

Fwiw, Tested-by: Chuck Lever <[email protected]>


> First of all, IIRC the patch author was Christoph wasn't he.
>
> Plus, you do realize that this patch makes the pages allocation
> in granularity of pages. In systems with a large page size this
> is completely redundant, it might even be harmful as the storage
> ULPs need lots of MRs.

I agree that the official fix should take a conservative
approach to allocating this resource; there will be lots
of MRs in an active system. This fix doesn't seem too
careful.


> Also, I don't see how that solves the issue, I'm not sure I even
> understand the issue. Do you? Were you able to reproduce it?

The issue is that dma_map_single() does not seem to DMA map
portions of a memory region that are past the end of the first
page of that region. Maybe that's a bug?

This patch works around that behavior by guaranteeing that

a) the memory region starts at the beginning of a page, and
b) the memory region is never larger than a page

This patch is not sufficient to repair mlx5, because b)
cannot be satisfied in that case; the array of __be64's can
be larger than 512 entries.
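
To make a) and b) concrete, here is a rough sketch only (not the
attached patch; the function name, field names, and error handling
are illustrative assumptions) of an allocation that satisfies both
constraints before DMA-mapping the translation array:

/* Sketch: confine the whole translation array to a single page */
static int alloc_priv_pages_sketch(struct ib_device *device,
				   struct mlx4_ib_mr *mr, int max_pages)
{
	int size = max_pages * sizeof(u64);

	if (size > PAGE_SIZE)		/* b) never larger than a page */
		return -EINVAL;

	/* a) starts at the beginning of a page */
	mr->pages = (__be64 *)get_zeroed_page(GFP_KERNEL);
	if (!mr->pages)
		return -ENOMEM;

	mr->page_map = dma_map_single(device->dma_device, mr->pages,
				      size, DMA_TO_DEVICE);
	if (dma_mapping_error(device->dma_device, mr->page_map)) {
		free_page((unsigned long)mr->pages);
		return -ENOMEM;
	}
	return 0;
}

With 4KB pages, 512 eight-byte entries fill a page exactly; anything
larger necessarily violates b).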


> IFF the pages buffer end not being aligned to a cacheline is problematic
> then why not extent it to end in a cacheline? Why in the next full page?

I think the patch description justifies the choice of
solution, but does not describe the original issue at
all. The original issue had nothing to do with cacheline
alignment.

Lastly, this patch should remove the definition of
MLX4_MR_PAGES_ALIGN.


--
Chuck Lever




2016-06-17 09:05:55

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

On Fri, Jun 17, 2016 at 12:10:33AM +0300, Sagi Grimberg wrote:
>
> >>>On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <[email protected]> wrote:
> >>>
> >>>On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
> >>>>From: Sagi Grimberg <[email protected]>
> >>>>
> >>>>kmalloc doesn't guarantee the returned memory is all on one page.
> >>>
> >>>IMHO, the patch posted by Christoph at that thread is best way to go,
> >>>because you changed streaming DMA mappings to be coherent DMA mappings [1].
> >>>
> >>>"The kernel developers recommend the use of streaming mappings over
> >>>coherent mappings whenever possible" [1].
> >>>
> >>>[1] http://www.makelinux.net/ldd3/chp-15-sect-4
> >>
> >>Hi Leon-
> >>
> >>I'll happily drop this patch from my 4.8 series as soon
> >>as an official mlx4/mlx5 fix is merged.
> >>
> >>Meanwhile, I notice some unexplained instability (driver
> >>resets, list corruption, and so on) when I test NFS/RDMA
> >>without this patch included. So it is attached to the
> >>series for anyone with mlx4 who wants to pull my topic
> >>branch and try it out.
> >
> >hi Chuck,
> >
> >We plan to send attached patch during our second round of fixes for
> >mlx4/mlx5 and would be grateful to you if you could provide your
> >Tested-by tag before.
>
> First of all, IIRC the patch author was Christoph wasn't he.

Do you think that the author's name can produce different results in
verification/bug reproduction? We haven't sent this patch officially
yet, and all relevant authors will be acknowledged and honored when
the official version (second round of IB fixes) comes.

>
> Plus, you do realize that this patch makes the pages allocation
> in granularity of pages. In systems with a large page size this
> is completely redundant, it might even be harmful as the storage
> ULPs need lots of MRs.

I see proper execution of the driver as an important goal, one that takes
precedence over the various micro-optimizations, which can come afterwards.

Did you ask yourself why so few users actually use ARCH_KMALLOC_MINALIGN?

➜ linux-rdma git:(master) grep -rI ARCH_KMALLOC_MINALIGN drivers/*
drivers/infiniband/hw/mlx4/mr.c: add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
drivers/infiniband/hw/mlx5/mr.c: add_size = max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
drivers/md/dm-crypt.c: ARCH_KMALLOC_MINALIGN);
drivers/usb/core/buffer.c: * ARCH_KMALLOC_MINALIGN.
drivers/usb/core/buffer.c: if (ARCH_KMALLOC_MINALIGN <= 32)
drivers/usb/core/buffer.c: else if (ARCH_KMALLOC_MINALIGN <= 64)
drivers/usb/core/buffer.c: else if (ARCH_KMALLOC_MINALIGN <= 128)
drivers/usb/misc/usbtest.c: return (unsigned long)buf & (ARCH_KMALLOC_MINALIGN - 1);

>
> Also, I don't see how that solves the issue, I'm not sure I even
> understand the issue. Do you? Were you able to reproduce it?

Yes, the issue is that the address supplied to dma_map_single wasn't
aligned to the DMA cacheline size.

And as the one who wrote this code for mlx5, it would be great if you
could give me a pointer. Why did you choose to use MLX5_UMR_ALIGN in that
function? This will add 2048 bytes instead of 64 for mlx4.

>
> IFF the pages buffer end not being aligned to a cacheline is problematic
> then why not extent it to end in a cacheline? Why in the next full page?

It will fit in one page.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



2016-06-17 09:20:34

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

On Thu, Jun 16, 2016 at 05:58:29PM -0400, Chuck Lever wrote:
>
> > On Jun 16, 2016, at 5:10 PM, Sagi Grimberg <[email protected]> wrote:
> >
> >
> >>>> On Jun 15, 2016, at 12:28 AM, Leon Romanovsky <[email protected]> wrote:
> >>>>
> >>>> On Tue, Jun 14, 2016 at 11:15:25PM -0400, Chuck Lever wrote:
> >>>>> From: Sagi Grimberg <[email protected]>
> >>>>>
> >>>>> kmalloc doesn't guarantee the returned memory is all on one page.
> >>>>
> >>>> IMHO, the patch posted by Christoph at that thread is best way to go,
> >>>> because you changed streaming DMA mappings to be coherent DMA mappings [1].
> >>>>
> >>>> "The kernel developers recommend the use of streaming mappings over
> >>>> coherent mappings whenever possible" [1].
> >>>>
> >>>> [1] http://www.makelinux.net/ldd3/chp-15-sect-4
> >>>
> >>> Hi Leon-
> >>>
> >>> I'll happily drop this patch from my 4.8 series as soon
> >>> as an official mlx4/mlx5 fix is merged.
> >>>
> >>> Meanwhile, I notice some unexplained instability (driver
> >>> resets, list corruption, and so on) when I test NFS/RDMA
> >>> without this patch included. So it is attached to the
> >>> series for anyone with mlx4 who wants to pull my topic
> >>> branch and try it out.
> >>
> >> hi Chuck,
> >>
> >> We plan to send attached patch during our second round of fixes for
> >> mlx4/mlx5 and would be grateful to you if you could provide your
> >> Tested-by tag before.
>
> Fwiw, Tested-by: Chuck Lever <[email protected]>

Thanks, I appreciate it.

>
>
> > First of all, IIRC the patch author was Christoph wasn't he.
> >
> > Plus, you do realize that this patch makes the pages allocation
> > in granularity of pages. In systems with a large page size this
> > is completely redundant, it might even be harmful as the storage
> > ULPs need lots of MRs.
>
> I agree that the official fix should take a conservative
> approach to allocating this resource; there will be lots
> of MRs in an active system. This fix doesn't seem too
> careful.

In the mlx5 system, we always added 2048 bytes to such allocations, for
reasons unknown to me. And it doesn't seem like a conservative approach
either.

>
>
> > Also, I don't see how that solves the issue, I'm not sure I even
> > understand the issue. Do you? Were you able to reproduce it?
>
> The issue is that dma_map_single() does not seem to DMA map
> portions of a memory region that are past the end of the first
> page of that region. Maybe that's a bug?

No, I didn't find support for that. The function dma_map_single expects
contiguous memory aligned to a cache line; there is no requirement that
it be bounded by a single page.

>
> This patch works around that behavior by guaranteeing that
>
> a) the memory region starts at the beginning of a page, and
> b) the memory region is never larger than a page

b) the memory region ends on a cache line.

>
> This patch is not sufficient to repair mlx5, because b)
> cannot be satisfied in that case; the array of __be64's can
> be larger than 512 entries.
>
>
> > IFF the pages buffer end not being aligned to a cacheline is problematic
> > then why not extent it to end in a cacheline? Why in the next full page?
>
> I think the patch description justifies the choice of
> solution, but does not describe the original issue at
> all. The original issue had nothing to do with cacheline
> alignment.

I disagree. kmalloc with the supplied flags will return contiguous memory,
which is enough for dma_map_single. The issue is cache line alignment.

>
> Lastly, this patch should remove the definition of
> MLX4_MR_PAGES_ALIGN.

Thanks, I missed it.

>
>
> --
> Chuck Lever
>
>
>



2016-06-17 19:56:07

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


> On Jun 17, 2016, at 5:20 AM, Leon Romanovsky <[email protected]> wrote:
>
> On Thu, Jun 16, 2016 at 05:58:29PM -0400, Chuck Lever wrote:
>>
>>> On Jun 16, 2016, at 5:10 PM, Sagi Grimberg <[email protected]> wrote:
>>
>>> First of all, IIRC the patch author was Christoph wasn't he.
>>>
>>> Plus, you do realize that this patch makes the pages allocation
>>> in granularity of pages. In systems with a large page size this
>>> is completely redundant, it might even be harmful as the storage
>>> ULPs need lots of MRs.
>>
>> I agree that the official fix should take a conservative
>> approach to allocating this resource; there will be lots
>> of MRs in an active system. This fix doesn't seem too
>> careful.
>
> In mlx5 system, we always added 2048 bytes to such allocations, for
> reasons unknown to me. And it doesn't seem as a conservative approach
> either.

The mlx5 approach is much better than allocating a whole
page, when you consider platforms with 64KB pages.

A 1MB payload (for NFS) on such a platform comprises just
16 pages. So xprtrdma will allocate MRs with support for
16 pages. That's a priv pages array of 128 bytes, and you
just put it in a 64KB page all by itself.

So maybe adding 2048 bytes is not optimal either. But I
think sticking with kmalloc here is a more optimal choice.


>>> Also, I don't see how that solves the issue, I'm not sure I even
>>> understand the issue. Do you? Were you able to reproduce it?
>>
>> The issue is that dma_map_single() does not seem to DMA map
>> portions of a memory region that are past the end of the first
>> page of that region. Maybe that's a bug?
>
> No, I didn't find support for that. Function dma_map_single expects
> contiguous memory aligned to cache line, there is no limitation to be
> page bounded.

There certainly isn't, but that doesn't mean there can't
be a bug somewhere ;-) and maybe not in dma_map_single.
It could be that the "array on one page only" limitation
is somewhere else in the mlx4 driver, or even in the HCA
firmware.


>> This patch works around that behavior by guaranteeing that
>>
>> a) the memory region starts at the beginning of a page, and
>> b) the memory region is never larger than a page
>
> b) the memory region ends on cache line.

I think we demonstrated pretty clearly that the issue
occurs only when the end of the priv pages array crosses
into a new page.

We didn't see any problem otherwise.


>> This patch is not sufficient to repair mlx5, because b)
>> cannot be satisfied in that case; the array of __be64's can
>> be larger than 512 entries.
>>
>>
>>> IFF the pages buffer end not being aligned to a cacheline is problematic
>>> then why not extent it to end in a cacheline? Why in the next full page?
>>
>> I think the patch description justifies the choice of
>> solution, but does not describe the original issue at
>> all. The original issue had nothing to do with cacheline
>> alignment.
>
> I disagree, kmalloc with supplied flags will return contiguous memory
> which is enough for dma_map_single. It is cache line alignment.

The reason I find this hard to believe is that there is
no end alignment guarantee at all in this code, but it
works without issue when SLUB debugging is not enabled.

xprtrdma allocates 256 elements in this array on x86.
The code makes the array start on an 0x40 byte boundary.
I'm pretty sure that means the end of that array will
also be on at least an 0x40 byte boundary, and thus
aligned to the DMA cacheline, whether or not SLUB
debugging is enabled.

Notice that in the current code, if the consumer requests
an odd number of SGs, that array can't possibly end on
an alignment boundary. But we've never had a complaint.
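
(Spelling out the arithmetic, assuming 8-byte entries and a 64-byte
DMA cacheline: 256 x 8 = 2048 bytes, a multiple of 64, so an array
that starts on a 64-byte boundary also ends on one; an odd count such
as 255 gives 2040 bytes, which ends mid-cacheline.)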

SLUB debugging changes the alignment of lots of things,
but mlx4_alloc_priv_pages is the only breakage that has
been reported.

DMA-API.txt says:

> [T]he mapped region must begin exactly on a cache line
> boundary and end exactly on one (to prevent two separately
> mapped regions from sharing a single cache line)

The way I read this, cacheline alignment shouldn't be
an issue at all, as long as DMA cachelines aren't
shared between mappings.

If I simply increase the memory allocation size a little
and ensure the end of the mapping is aligned, that should
be enough to prevent DMA cacheline sharing with another
memory allocation on the same page. But I still see Local
Protection Errors when SLUB debugging is enabled, on my
system (with patches to allocate more pages per MR).

I'm not convinced this has anything to do with DMA
cacheline alignment. The reason your patch fixes this
issue is because it keeps the entire array on one page.


--
Chuck Lever




2016-06-18 10:56:56

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

On Fri, Jun 17, 2016 at 03:55:56PM -0400, Chuck Lever wrote:
>
> > On Jun 17, 2016, at 5:20 AM, Leon Romanovsky <[email protected]> wrote:
> >
> > On Thu, Jun 16, 2016 at 05:58:29PM -0400, Chuck Lever wrote:
> >>
> >>> On Jun 16, 2016, at 5:10 PM, Sagi Grimberg <[email protected]> wrote:
> >>
> >>> First of all, IIRC the patch author was Christoph wasn't he.
> >>>
> >>> Plus, you do realize that this patch makes the pages allocation
> >>> in granularity of pages. In systems with a large page size this
> >>> is completely redundant, it might even be harmful as the storage
> >>> ULPs need lots of MRs.
> >>
> >> I agree that the official fix should take a conservative
> >> approach to allocating this resource; there will be lots
> >> of MRs in an active system. This fix doesn't seem too
> >> careful.
> >
> > In mlx5 system, we always added 2048 bytes to such allocations, for
> > reasons unknown to me. And it doesn't seem as a conservative approach
> > either.
>
> The mlx5 approach is much better than allocating a whole
> page, when you consider platforms with 64KB pages.
>
> A 1MB payload (for NFS) on such a platform comprises just
> 16 pages. So xprtrdma will allocate MRs with support for
> 16 pages. That's a priv pages array of 128 bytes, and you
> just put it in a 64KB page all by itself.
>
> So maybe adding 2048 bytes is not optimal either. But I
> think sticking with kmalloc here is a more optimal choice.

I agree with yours and Sagi's points; I just preferred a working solution
over an optimal one. I'll send an optimal version.

>
>
> >>> Also, I don't see how that solves the issue, I'm not sure I even
> >>> understand the issue. Do you? Were you able to reproduce it?
> >>
> >> The issue is that dma_map_single() does not seem to DMA map
> >> portions of a memory region that are past the end of the first
> >> page of that region. Maybe that's a bug?
> >
> > No, I didn't find support for that. Function dma_map_single expects
> > contiguous memory aligned to cache line, there is no limitation to be
> > page bounded.
>
> There certainly isn't, but that doesn't mean there can't
> be a bug somewhere ;-) and maybe not in dma_map_single.
> It could be that the "array on one page only" limitation
> is somewhere else in the mlx4 driver, or even in the HCA
> firmware.

We checked with the HW/FW/arch teams before responding.

>
>
> >> This patch works around that behavior by guaranteeing that
> >>
> >> a) the memory region starts at the beginning of a page, and
> >> b) the memory region is never larger than a page
> >
> > b) the memory region ends on cache line.
>
> I think we demonstrated pretty clearly that the issue
> occurs only when the end of the priv pages array crosses
> into a new page.
>
> We didn't see any problem otherwise.

SLUB debug does exactly one thing: it changes alignment. That is why no
issue was observed there before.

>
> >> This patch is not sufficient to repair mlx5, because b)
> >> cannot be satisfied in that case; the array of __be64's can
> >> be larger than 512 entries.
> >>
> >>
> >>> IFF the pages buffer end not being aligned to a cacheline is problematic
> >>> then why not extent it to end in a cacheline? Why in the next full page?
> >>
> >> I think the patch description justifies the choice of
> >> solution, but does not describe the original issue at
> >> all. The original issue had nothing to do with cacheline
> >> alignment.
> >
> > I disagree, kmalloc with supplied flags will return contiguous memory
> > which is enough for dma_map_single. It is cache line alignment.
>
> The reason I find this hard to believe is that there is
> no end alignment guarantee at all in this code, but it
> works without issue when SLUB debugging is not enabled.
>
> xprtrdma allocates 256 elements in this array on x86.
> The code makes the array start on an 0x40 byte boundary.
> I'm pretty sure that means the end of that array will
> also be on at least an 0x40 byte boundary, and thus
> aligned to the DMA cacheline, whether or not SLUB
> debugging is enabled.
>
> Notice that in the current code, if the consumer requests
> an odd number of SGs, that array can't possibly end on
> an alignment boundary. But we've never had a complaint.
>
> SLUB debugging changes the alignment of lots of things,
> but mlx4_alloc_priv_pages is the only breakage that has
> been reported.

I think it is related to the custom logic that exists only in this function.
I posted the grep output earlier to emphasize that.

For example, adding a 2K region after the actual data will ensure
alignment.

>
> DMA-API.txt says:
>
> > [T]he mapped region must begin exactly on a cache line
> > boundary and end exactly on one (to prevent two separately
> > mapped regions from sharing a single cache line)
>
> The way I read this, cacheline alignment shouldn't be
> an issue at all, as long as DMA cachelines aren't
> shared between mappings.
>
> If I simply increase the memory allocation size a little
> and ensure the end of the mapping is aligned, that should
> be enough to prevent DMA cacheline sharing with another
> memory allocation on the same page. But I still see Local
> Protection Errors when SLUB debugging is enabled, on my
> system (with patches to allocate more pages per MR).
>
> I'm not convinced this has anything to do with DMA
> cacheline alignment. The reason your patch fixes this
> issue is because it keeps the entire array on one page.

If you don't mind, we can do an experiment.
Let's add padding which will prevent the alignment issue and
will definitely cross the page boundary.

diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 6312721..41e277e 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -280,8 +280,10 @@ mlx4_alloc_priv_pages(struct ib_device *device,
int size = max_pages * sizeof(u64);
int add_size;
int ret;
-
+/*
add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
+*/
+ add_size = 2048;

mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
if (!mr->pages_alloc)



2016-06-19 07:05:27

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


>> First of all, IIRC the patch author was Christoph wasn't he.
>
> Do you think that author's name can provide different results in
> verification/bug reproduction? We didn't send this patch officially
> yet and all relevant authors will be acknowledged and honored when the
> official (second round of IB fixes) will come.

Not sure what you mean here. What different results?

>> Plus, you do realize that this patch makes the pages allocation
>> in granularity of pages. In systems with a large page size this
>> is completely redundant, it might even be harmful as the storage
>> ULPs need lots of MRs.
>
> I see proper execution of the driver as an important goal which goes
> before various micro optimizations, which will come after.

I still don't understand how this fixes the original bug report from
Chuck. I sent a patch to make the pages allocation DMA-coherent,
which fixes the issue. Yishai is the driver maintainer; he should
decide how this issue should be addressed.

In any event, if we end up aligning to page size, I would expect to
see a FIXME comment saying we can do better...

> id you ask yourself, why are not so many users use that ARCH_KMALLOC_MINALIGN?
>
> ➜ linux-rdma git:(master) grep -rI ARCH_KMALLOC_MINALIGN drivers/*
> drivers/infiniband/hw/mlx4/mr.c: add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> drivers/infiniband/hw/mlx5/mr.c: add_size = max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> drivers/md/dm-crypt.c: ARCH_KMALLOC_MINALIGN);
> drivers/usb/core/buffer.c: * ARCH_KMALLOC_MINALIGN.
> drivers/usb/core/buffer.c: if (ARCH_KMALLOC_MINALIGN <= 32)
> drivers/usb/core/buffer.c: else if (ARCH_KMALLOC_MINALIGN <= 64)
> drivers/usb/core/buffer.c: else if (ARCH_KMALLOC_MINALIGN <= 128)
> drivers/usb/misc/usbtest.c: return (unsigned long)buf & (ARCH_KMALLOC_MINALIGN - 1);
>

Not sure how to answer that. It is used to make the allocation padding
just as large as we need it to be in order to align the pointer.

>>
>> Also, I don't see how that solves the issue, I'm not sure I even
>> understand the issue. Do you? Were you able to reproduce it?
>
> Yes, the issue is that address supplied to dma_map_single wasn't aligned
> to DMA cacheline size.

That's not true, Leon; the address is always aligned to
MLX4_MR_PAGES_ALIGN, which is 64B.

>
> And as the one, who wrote this code for mlx5, it will be great if you
> can give me a pointer. Why did you chose to use MLX5_UMR_ALIGN in that
> function? This will add 2048 bytes instead of 64 for mlx4.

The PRM states that the data descriptors array (MTT/KLM) pointer
must be aligned to 2K. I wish it didn't but it does. You can see in the
code that each registration that goes via UMR does the exact same thing.

>> IFF the pages buffer end not being aligned to a cacheline is problematic
>> then why not extent it to end in a cacheline? Why in the next full page?
>
> It will fit in one page.

Yeah, but that single page can be 64K for a 128-pointer array...

2016-06-19 09:48:54

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


>> First of all, IIRC the patch author was Christoph wasn't he.
>>
>> Plus, you do realize that this patch makes the pages allocation
>> in granularity of pages. In systems with a large page size this
>> is completely redundant, it might even be harmful as the storage
>> ULPs need lots of MRs.
>
> I agree that the official fix should take a conservative
> approach to allocating this resource; there will be lots
> of MRs in an active system. This fix doesn't seem too
> careful.
>
>
>> Also, I don't see how that solves the issue, I'm not sure I even
>> understand the issue. Do you? Were you able to reproduce it?
>
> The issue is that dma_map_single() does not seem to DMA map
> portions of a memory region that are past the end of the first
> page of that region. Maybe that's a bug?

That seems weird to me; from looking at the code I didn't see
any indication that such a mapping would fail. Maybe we are seeing
an mlx4-specific issue? If this is some kind of generic dma-mapping
bug, mlx5 would suffer from the same problem, right? Does it?

> This patch works around that behavior by guaranteeing that
>
> a) the memory region starts at the beginning of a page, and
> b) the memory region is never larger than a page
>
> This patch is not sufficient to repair mlx5, because b)
> cannot be satisfied in that case; the array of __be64's can
> be larger than 512 entries.

If a single page boundary is indeed the root-cause then I agree
this would not solve the problem for mlx5.

>> IFF the pages buffer end not being aligned to a cacheline is problematic
>> then why not extent it to end in a cacheline? Why in the next full page?
>
> I think the patch description justifies the choice of
> solution, but does not describe the original issue at
> all. The original issue had nothing to do with cacheline
> alignment.
>
> Lastly, this patch should remove the definition of
> MLX4_MR_PAGES_ALIGN.

The mlx4 PRM explicitly states that the translation (pages) vector
should align to 64 bytes and this is where this define comes from,
hence I don't think it should be removed from the code.

2016-06-19 09:59:05

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


>> In mlx5 system, we always added 2048 bytes to such allocations, for
>> reasons unknown to me. And it doesn't seem as a conservative approach
>> either.
>
> The mlx5 approach is much better than allocating a whole
> page, when you consider platforms with 64KB pages.
>
> A 1MB payload (for NFS) on such a platform comprises just
> 16 pages. So xprtrdma will allocate MRs with support for
> 16 pages. That's a priv pages array of 128 bytes, and you
> just put it in a 64KB page all by itself.
>
> So maybe adding 2048 bytes is not optimal either. But I
> think sticking with kmalloc here is a more optimal choice.

Again, the 2K constraint does not come from any sort of dma mapping
alignment consideration; it comes from the _device_ limitation requiring
the translation vector to be aligned to 2K.

>>>> Also, I don't see how that solves the issue, I'm not sure I even
>>>> understand the issue. Do you? Were you able to reproduce it?
>>>
>>> The issue is that dma_map_single() does not seem to DMA map
>>> portions of a memory region that are past the end of the first
>>> page of that region. Maybe that's a bug?
>>
>> No, I didn't find support for that. dma_map_single() expects
>> contiguous memory aligned to a cache line; there is no requirement
>> that it be bounded to a single page.
>
> There certainly isn't, but that doesn't mean there can't
> be a bug somewhere ;-) and maybe not in dma_map_single.
> It could be that the "array on one page only" limitation
> is somewhere else in the mlx4 driver, or even in the HCA
> firmware.

I'm starting to think this is the case. Leon, I think it's time
to get the FW/HW guys involved...

>> I disagree; kmalloc with the supplied flags will return contiguous memory,
>> which is enough for dma_map_single. It is cache-line aligned.
>
> The reason I find this hard to believe is that there is
> no end alignment guarantee at all in this code, but it
> works without issue when SLUB debugging is not enabled.
>
> xprtrdma allocates 256 elements in this array on x86.
> The code makes the array start on an 0x40 byte boundary.
> I'm pretty sure that means the end of that array will
> also be on at least an 0x40 byte boundary, and thus
> aligned to the DMA cacheline, whether or not SLUB
> debugging is enabled.
>
> Notice that in the current code, if the consumer requests
> an odd number of SGs, that array can't possibly end on
> an alignment boundary. But we've never had a complaint.
>
> SLUB debugging changes the alignment of lots of things,
> but mlx4_alloc_priv_pages is the only breakage that has
> been reported.

I tend to agree; I even have a feeling that this won't happen
on mlx5.

> DMA-API.txt says:
>
>> [T]he mapped region must begin exactly on a cache line
>> boundary and end exactly on one (to prevent two separately
>> mapped regions from sharing a single cache line)
>
> The way I read this, cacheline alignment shouldn't be
> an issue at all, as long as DMA cachelines aren't
> shared between mappings.
>
> If I simply increase the memory allocation size a little
> and ensure the end of the mapping is aligned, that should
> be enough to prevent DMA cacheline sharing with another
> memory allocation on the same page. But I still see Local
> Protection Errors when SLUB debugging is enabled, on my
> system (with patches to allocate more pages per MR).
>
> I'm not convinced this has anything to do with DMA
> cacheline alignment. The reason your patch fixes this
> issue is because it keeps the entire array on one page.

I share this feeling. I have written several times that I don't understand
how this patch solves the issue, and I would appreciate it if someone
could explain it to me (preferably with evidence).
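
To make the end-alignment arithmetic quoted above concrete, here is a
throwaway userspace check (purely illustrative, not part of any patch):
the array occupies nentries * 8 bytes, so with a 64-byte-aligned start
the end is also 64-byte aligned exactly when nentries is a multiple of 8,
true for the 256-entry case and false for an odd count such as 255.

#include <stdio.h>

int main(void)
{
	unsigned int cases[] = { 256, 255, 16 };

	for (unsigned int i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
		unsigned int bytes = cases[i] * 8;	/* sizeof(__be64) */

		printf("%3u entries -> %4u bytes; end %saligned to 64\n",
		       cases[i], bytes, bytes % 64 ? "NOT " : "");
	}
	return 0;
}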

2016-06-19 10:04:08

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


> Hi Leon-
>
> I created a different patch, see attachment. It aligns
> the start _and_ end of the DMA mapped region, places
> large arrays so they encounter a page boundary, and
> leaves slack space around each array so there is no
> possibility of a shared DMA cacheline or other activity
> in that memory.
>
> I am able to reproduce the Local Protection Errors with
> this patch applied and SLUB debugging disabled.

Thanks Chuck for proving that the dma alignment is not the issue here.

I suggest that we go with my dma coherent patch for now until Leon and
the Mellanox team can debug this one with the HW/FW folks and find out
what is going on.

Leon, I have had my share of debugging this area on mlx4/mlx5. If you
want, I can help with debugging this one.
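
For context, a rough sketch of the kind of isolated allocation Chuck's
test patch describes: align both ends of the mapped region and surround
it with slack. The helper name is hypothetical, and the part that
deliberately forces large arrays across a page boundary is omitted:

/* Hypothetical helper, not the actual test patch: return a region of
 * 'size' bytes whose start sits on a DMA-cacheline boundary and whose
 * mapped length is padded so the end does too, with at least one
 * cacheline of slack on either side so no other allocation can share
 * a cacheline with it.  The raw pointer is kept for kfree(). */
static void *alloc_isolated_sketch(size_t size, size_t cacheline, void **raw)
{
	size_t padded = ALIGN(size, cacheline);
	char *buf = kzalloc(padded + 3 * cacheline, GFP_KERNEL);

	if (!buf)
		return NULL;
	*raw = buf;
	return PTR_ALIGN(buf + cacheline, cacheline);
}

A dma_map_single() over 'padded' bytes of such a region satisfies the
DMA-API.txt rule quoted earlier in the thread, yet the Local Protection
Errors still reproduce, which is what rules out cacheline sharing as the
culprit.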

2016-06-19 19:39:02

by Or Gerlitz

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

On Sun, Jun 19, 2016 at 1:04 PM, Sagi Grimberg <[email protected]> wrote:

>> I am able to reproduce the Local Protection Errors with
>> this patch applied and SLUB debugging disabled.

> Thanks Chuck for proving that the dma alignment is not the issue here.
>
> I suggest that we go with my dma coherent patch for now until Leon and
> the Mellanox team can debug this one with the HW/FW folks and find out
> what is going on.
>
> Leon, I had my share of debugging this area on mlx4/mlx5 areas. If you
> want I can help with debugging this one.

Hi Sagi, Leon and Co,

From quick reading of the patch I got the impression that some scheme
which used to work is now broken; did we get a bisection result
pointing to the upstream commit which introduced the regression? I
didn't see such a note along the thread. Basically, I think this is
where we should be starting, thoughts? I also added the mlx4 core/IB
maintainer.

2016-06-19 19:43:53

by Or Gerlitz

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

On Sun, Jun 19, 2016 at 10:38 PM, Or Gerlitz <[email protected]> wrote:

> From quick reading of the patch I got the impression that some scheme
> which used to work is now broken; did we get a bisection result
> pointing to the upstream commit which introduced the regression? I
> didn't see such a note along the thread. Basically, I think this is
> where we should be starting, thoughts? I also added the mlx4 core/IB
> maintainer.

Oh, I missed the 1st post of the thread pointing to commit
1b2cd0fc673c ('IB/mlx4: Support the new memory [...]') -- looking at
the patch, the only thing which is explicitly visible to upper layers
is the setting of the ib_dev.map_mr_sg API call. So is there NFS code which
depends on this verb being exported, and if yes, does X, else does Y?

2016-06-19 20:02:37

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


> On Jun 19, 2016, at 3:38 PM, Or Gerlitz <[email protected]> wrote:
>
> On Sun, Jun 19, 2016 at 1:04 PM, Sagi Grimberg <[email protected]> wrote:
>
>>> I am able to reproduce the Local Protection Errors with
>>> this patch applied and SLUB debugging disabled.
>
>> Thanks Chuck for proving that the dma alignment is not the issue here.
>>
>> I suggest that we go with my dma coherent patch for now until Leon and
>> the Mellanox team can debug this one with the HW/FW folks and find out
>> what is going on.
>>
>> Leon, I had my share of debugging this area on mlx4/mlx5 areas. If you
>> want I can help with debugging this one.
>
> Hi Sagi, Leon and Co,
>
> From quick reading of the patch I got the impression that some scheme
> which used to work is now broken; did we get a bisection result
> pointing to the upstream commit which introduced the regression?

Fixes: 1b2cd0fc673c ('IB/mlx4: Support the new memory registration API')

The problem was introduced by the new FR API. I reported
this issue back in late April:

http://marc.info/?l=linux-rdma&m=146194706501705&w=2

and bisected the appearance of symptoms to:

commit d86bd1bece6fc41d59253002db5441fe960a37f6
Author: Joonsoo Kim <[email protected]>
Date: Tue Mar 15 14:55:12 2016 -0700

mm/slub: support left redzone

The left redzone changes the alignment characteristics
of regions returned by kmalloc. Further diagnosis showed
the problem was with mlx4_alloc_priv_pages(), and the
WR flush occurred only when mr->pages happened to contain
a page boundary.

What we don't understand is why a page boundary in that
array is a problem.
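
A page-boundary check like the following (a hypothetical diagnostic, not
driver code) captures the condition the failures were correlated with:

static bool pages_span_page_boundary(const __be64 *pages, int npages)
{
	unsigned long start = (unsigned long)pages;
	unsigned long end = start + npages * sizeof(__be64) - 1;

	/* true when the first and last bytes of the translation array
	 * fall in different pages */
	return (start >> PAGE_SHIFT) != (end >> PAGE_SHIFT);
}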


> I
> didn't see such a note along the thread. Basically, I think this is
> where we should be starting, thoughts? I also added the mlx4 core/IB
> maintainer.

Yishai was notified about this issue on May 25:

http://marc.info/?l=linux-rdma&m=146419192913960&w=2


--
Chuck Lever




2016-06-20 05:44:59

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

On Sun, Jun 19, 2016 at 04:02:23PM -0400, Chuck Lever wrote:
>

Thanks Chuck and Sagi for the help.

<...>

>
> > I
> > didn't see such note along the thread, basically, I think this is
> > where we should be starting, thoughts? I also added the mlx4 core/IB
> > maintainer.
>
> Yishai was notified about this issue on May 25:
>
> http://marc.info/?l=linux-rdma&m=146419192913960&w=2

Yishai and I are following this thread closely and we are working on
finding the root cause of this issue.

Thanks

>
>
> --
> Chuck Lever
>
>
>



2016-06-20 06:35:11

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


> Yishai and I are following this thread closely and we are working on
> finding the root cause of this issue.

Thanks Leon and Yishai, let me know if you need any help with this.

Do you agree we should move forward with the original patch until
we get this resolved?

Also, did anyone find out if this is happening in mlx5 as well?
(Chuck?) If not, that would narrow the root cause down to an
mlx4-specific issue.

2016-06-20 07:02:03

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

On Mon, Jun 20, 2016 at 09:34:28AM +0300, Sagi Grimberg wrote:
>
> >Yishai and me follow this thread closely and we work on finding the
> >root cause of this issue.
>
> Thanks Leon and Yishai, let me know if you need any help with this.
>
> Do you agree we should move forward with the original patch until
> we get this resolved?

We will do our best to meet our fixes submission target, which is
planned to be this week.

And for sure we will submit one of the two: your proposed patch
or Yishai's fix, if any.

Right now, your patch is running in our verification.

Thanks



2016-06-20 08:35:53

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


> We will do our best to meet our fixes submission target, which is
> planned to be this week.
>
> And for sure we will submit one of the two: your proposed patch
> or Yishai's fix, if any.
>
> Right now, your patch is running in our verification.

Sure, thanks Leon.

2016-06-20 13:42:18

by Yishai Hadas

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages

On 6/20/2016 8:44 AM, Leon Romanovsky wrote:
> On Sun, Jun 19, 2016 at 04:02:23PM -0400, Chuck Lever wrote:
>>
>
> Thanks Chuck and Sagi for the help.
>
> <...>
>
>>
>>> I
>>> didn't see such note along the thread, basically, I think this is
>>> where we should be starting, thoughts? I also added the mlx4 core/IB
>>> maintainer.
>>
>> Yishai was notified about this issue on May 25:
>>
>> http://marc.info/?l=linux-rdma&m=146419192913960&w=2
>
> Yishai and I are following this thread closely and we are working on
> finding the root cause of this issue.
>

Just found the root cause of the problem; it turns out to be a hardware
limitation that is described as part of the PRM. The driver code had to
be written accordingly; I confirmed that internally with the relevant people.

From PRM:
"The PBL should be physically contiguous, must reside in a
64-byte-aligned address, and must not include the last 8 bytes of a page."

The last sentence means that only one page can be used, since the last 8
bytes must not be included. That's why there is a hard limit of 511
entries in the code.

Re the candidate fix that you sent: from an initial review it makes sense;
we'll formally confirm it soon after finalizing the regression testing
on our side.

Thanks Chuck and Sagi for evaluating and working on a solution.
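
For reference, the 511-entry limit follows directly from the PRM rule
quoted above. A small sketch of the arithmetic and of a placement check
(the names are illustrative, not taken from the driver):

#define PBL_ENTRY_SIZE		sizeof(__be64)	/* 8 bytes per translation entry */

/* With the last 8 bytes of a page off-limits, a PBL confined to one
 * 4 KB page holds at most (4096 / 8) - 1 = 511 entries, matching the
 * hard limit in the driver. */
#define PBL_MAX_PER_PAGE	((PAGE_SIZE / PBL_ENTRY_SIZE) - 1)

/* Hypothetical check of a candidate PBL placement against the PRM
 * rule: 64-byte-aligned start, and an end that stays clear of the
 * last 8 bytes of the page (which also keeps the whole vector on a
 * single page). */
static bool pbl_placement_ok(const void *pbl, int npages)
{
	unsigned long start = (unsigned long)pbl;
	unsigned long end = start + npages * PBL_ENTRY_SIZE;
	unsigned long page_end = round_down(start, PAGE_SIZE) + PAGE_SIZE;

	return IS_ALIGNED(start, 64) && end <= page_end - PBL_ENTRY_SIZE;
}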

2016-06-21 13:56:49

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages


> Just found the root cause of the problem, it was found to be a hardware
> limitation that is described as part of the PRM. The driver code had to
> be written accordingly, confirmed that internally with the relevant people.
>
> From PRM:
> "The PBL should be physically contiguous, must reside in a
> 64-byte-aligned address, and must not include the last 8 bytes of a page."
>
> The last sentence pointed that only one page can be used as the last 8
> bytes should not be included. That's why there is a hard limit in the
> code for 511 entries.
>
> Re the candidate fix that you sent, from initial review it makes sense,
> we'll formally confirm it soon after finalizing the regression testing
> in our side.
>
> Thanks Chuck and Sagi for evaluating and working on a solution.

Thanks Yishai,

That clears up the root cause.

Does the same hold for mlx5? Or can we leave it alone?

2016-06-21 14:35:17

by Laurence Oberman

[permalink] [raw]
Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages



----- Original Message -----
> From: "Sagi Grimberg" <[email protected]>
> To: "Yishai Hadas" <[email protected]>, "Chuck Lever" <[email protected]>
> Cc: [email protected], "Or Gerlitz" <[email protected]>, "Yishai Hadas" <[email protected]>, "linux-rdma"
> <[email protected]>, "Linux NFS Mailing List" <[email protected]>, "Majd Dibbiny"
> <[email protected]>
> Sent: Tuesday, June 21, 2016 9:56:44 AM
> Subject: Re: [PATCH v2 01/24] mlx4-ib: Use coherent memory for priv pages
>
>
> > Just found the root cause of the problem, it was found to be a hardware
> > limitation that is described as part of the PRM. The driver code had to
> > be written accordingly, confirmed that internally with the relevant people.
> >
> > From PRM:
> > "The PBL should be physically contiguous, must reside in a
> > 64-byte-aligned address, and must not include the last 8 bytes of a page."
> >
> > The last sentence pointed that only one page can be used as the last 8
> > bytes should not be included. That's why there is a hard limit in the
> > code for 511 entries.
> >
> > Re the candidate fix that you sent, from initial review it makes sense,
> > we'll formally confirm it soon after finalizing the regression testing
> > in our side.
> >
> > Thanks Chuck and Sagi for evaluating and working on a solution.
>
> Thanks Yishai,
>
> That clears up the root-cause.
>
> Does the same holds for mlx5? or we can leave it alone?

Also wondering about mlx5, because the default there is coherent and increasing the allowed queue depth got me into the swiotlb error situation.
Backing the queue depth down to 32, per Bart's suggestion, avoids the swiotlb errors.
Likely 128 is too high anyway, but the weird part of my testing, as already mentioned, is that it's only seen during reconnect activity.

Thanks
Laurence