2008-07-03 02:27:44

by Tom Tucker

Subject: [PATCH 00/11] svcrdma: WR context management bug fixes and cleanup

Bruce:

This is version 2 of the WR context cleanup patchset. It includes modifications
per your review. It has been tested with iozone and Connectathon over IB
and iWARP on x86_64. The x86_32 platform has only been compile tested for
warnings.

This set of patches fixes bugs related to DMA mapping and unmapping and
in the process redesigns and cleans up the way that WR contexts are managed.

In addition to the bug fixes, there are three major changes:

- The data structure that maps client side memory to server side
memory has been separated from the WR context. This makes for cleaner
type checking and a smaller memory footprint for the RDMA transport.

- DMA mapping and unmapping are performed independently of the
allocation and deallocation of the WR context. This allows the
unmapping to occur early in the receive path, where the DMA must be
unmapped before the data is examined and therefore before the WR
context is deallocated.

- The WR context cache now uses a kmem_cache and is shared across
all transport instances. This allows contexts to be shared between
mounts, and idle mounts no longer consume context memory.
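
In code, the new scheme boils down to roughly the following (condensed from
patches 04, 10, and 11 below; all names are taken from those patches, but
error handling and page release are elided):

    /* One cache shared by every svcxprt_rdma instance (patch 10). */
    struct kmem_cache *svc_rdma_ctxt_cachep;

    struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
    {
            struct svc_rdma_op_ctxt *ctxt;

            /* Wait for memory rather than failing the RPC. */
            while (!(ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, GFP_KERNEL)))
                    schedule_timeout_uninterruptible(msecs_to_jiffies(500));
            ctxt->xprt = xprt;
            ctxt->count = 0;
            atomic_inc(&xprt->sc_ctxt_used);
            return ctxt;
    }

    void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
    {
            /* DMA is now unmapped in the CQ reap path (patch 04), before
             * the data is examined, rather than here at context-free time. */
            atomic_dec(&ctxt->xprt->sc_ctxt_used);
            kmem_cache_free(svc_rdma_ctxt_cachep, ctxt);
    }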

This patchset is intended for 2.6.27 and is based on 2.6.26-rc8. These patches
are also available at the following git url:

git://git.linux-nfs.org/projects/tomtucker/xprt-switch-2.6.git

[PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping

include/linux/sunrpc/svc_rdma.h | 27 +++++++++++++++++++++++++++
net/sunrpc/xprtrdma/svc_rdma.c | 19 +++++++++++++++++++
net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 ++++++++++++++++++++++++++
3 files changed, 72 insertions(+), 0 deletions(-)

[PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing

net/sunrpc/xprtrdma/svc_rdma_sendto.c | 163 ++++++++++++++----------------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +-
2 files changed, 80 insertions(+), 88 deletions(-)

[PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing

include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 83 ++++++++++++++-----------------
2 files changed, 39 insertions(+), 45 deletions(-)

[PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler

net/sunrpc/xprtrdma/svc_rdma_transport.c | 20 ++++++++++++++------
1 files changed, 14 insertions(+), 6 deletions(-)

[PATCH 05/11] svcrdma: Add dma map count and WARN_ON

include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 +
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 3 +++
net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +++++
4 files changed, 10 insertions(+), 0 deletions(-)

[PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free

net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 ----
1 files changed, 0 insertions(+), 4 deletions(-)

[PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure

include/linux/sunrpc/svc_rdma.h | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

[PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD

net/sunrpc/xprtrdma/svc_rdma_transport.c | 16 ++++++++++++----
1 files changed, 12 insertions(+), 4 deletions(-)

[PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function

net/sunrpc/xprtrdma/svc_rdma.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

[PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts

net/sunrpc/xprtrdma/svc_rdma.c | 23 +++++++++++++++++++----
1 files changed, 19 insertions(+), 4 deletions(-)

[PATCH 11/11] svcrdma: Change WR context get/put to use the kmem cache

include/linux/sunrpc/svc_rdma.h | 6 --
net/sunrpc/xprtrdma/svc_rdma_transport.c | 121 +++---------------------------
2 files changed, 12 insertions(+), 115 deletions(-)

Signed-off-by: Tom Tucker <[email protected]>


2008-07-03 02:27:44

by Tom Tucker

Subject: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping

Create a new data structure to hold the mapping from the remote client
address space to the local server address space.
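
For illustration, here is a minimal sketch of how a map is meant to be used
on the send side (the wrapper function name is made up; the real call sites
are added by patches 02 and 03, and error handling is elided):

    /* Sketch only -- see svc_rdma_sendto() in patch 02 for the real code. */
    static int example_send_path(struct svcxprt_rdma *rdma, struct svc_rqst *rqstp)
    {
            struct svc_rdma_req_map *vec;

            vec = svc_rdma_get_req_map();          /* from the shared map cache */
            xdr_to_sge(rdma, &rqstp->rq_res, vec); /* fill vec->sge[], vec->count */
            /* ... build and post the RDMA_WRITE/SEND WRs from vec->sge ... */
            svc_rdma_put_req_map(vec);             /* return the map to the cache */
            return 0;
    }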

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 27 +++++++++++++++++++++++++++
net/sunrpc/xprtrdma/svc_rdma.c | 19 +++++++++++++++++++
net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 ++++++++++++++++++++++++++
3 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 05eb466..bd8749c 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -86,6 +86,31 @@ struct svc_rdma_op_ctxt {
struct page *pages[RPCSVC_MAXPAGES];
};

+/*
+ * NFS requests are mapped on the client side by the chunk lists in
+ * the RPCRDMA header. During the fetching of the RPC from the client
+ * and the writing of the reply to the client, the memory in the
+ * client and the memory in the server must be mapped as contiguous
+ * vaddr/len for access by the hardware. These data structures keep
+ * these mappings.
+ *
+ * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the
+ * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the
+ * 'ch' field maps the read-list of the RPCRDMA header to the 'sge'
+ * mapping of the reply.
+ */
+struct svc_rdma_chunk_sge {
+ int start; /* sge no for this chunk */
+ int count; /* sge count for this chunk */
+};
+struct svc_rdma_req_map {
+ unsigned long count;
+ union {
+ struct kvec sge[RPCSVC_MAXPAGES];
+ struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
+ };
+};
+
#define RDMACTXT_F_LAST_CTXT 2

struct svcxprt_rdma {
@@ -173,6 +198,8 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *);
extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
+extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
+extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
extern void svc_sq_reap(struct svcxprt_rdma *);
extern void svc_rq_reap(struct svcxprt_rdma *);
extern struct svc_xprt_class svc_rdma_class;
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 88c0ca2..171f205 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod;
atomic_t rdma_stat_sq_poll;
atomic_t rdma_stat_sq_prod;

+/* Temporary NFS request map cache */
+struct kmem_cache *svc_rdma_map_cachep;
+
/*
* This function implements reading and resetting an atomic_t stat
* variable through read/write to a proc file. Any write to the file
@@ -241,6 +244,7 @@ void svc_rdma_cleanup(void)
svcrdma_table_header = NULL;
}
svc_unreg_xprt_class(&svc_rdma_class);
+ kmem_cache_destroy(svc_rdma_map_cachep);
}

int svc_rdma_init(void)
@@ -255,9 +259,24 @@ int svc_rdma_init(void)
svcrdma_table_header =
register_sysctl_table(svcrdma_root_table);

+ /* Create the temporary map cache */
+ svc_rdma_map_cachep = kmem_cache_create("svc_rdma_map_cache",
+ sizeof(struct svc_rdma_req_map),
+ 0,
+ SLAB_HWCACHE_ALIGN,
+ NULL);
+ if (!svc_rdma_map_cachep) {
+ printk(KERN_INFO "Could not allocate map cache.\n");
+ goto err;
+ }
+
/* Register RDMA with the SVC transport switch */
svc_reg_xprt_class(&svc_rdma_class);
return 0;
+
+ err:
+ unregister_sysctl_table(svcrdma_table_header);
+ return -ENOMEM;
}
MODULE_AUTHOR("Tom Tucker <[email protected]>");
MODULE_DESCRIPTION("SVC RDMA Transport");
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index e132509..ae90758 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -173,6 +173,32 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
atomic_dec(&xprt->sc_ctxt_used);
}

+/* Temporary NFS request map cache. Created in svc_rdma.c */
+extern struct kmem_cache *svc_rdma_map_cachep;
+
+/*
+ * Temporary NFS req mappings are shared across all transport
+ * instances. These are short lived and should be bounded by the number
+ * of concurrent server threads * depth of the SQ.
+ */
+struct svc_rdma_req_map *svc_rdma_get_req_map(void)
+{
+ struct svc_rdma_req_map *map;
+ while (1) {
+ map = kmem_cache_alloc(svc_rdma_map_cachep, GFP_KERNEL);
+ if (map)
+ break;
+ schedule_timeout_uninterruptible(msecs_to_jiffies(500));
+ }
+ map->count = 0;
+ return map;
+}
+
+void svc_rdma_put_req_map(struct svc_rdma_req_map *map)
+{
+ kmem_cache_free(svc_rdma_map_cachep, map);
+}
+
/* ib_cq event handler */
static void cq_event_handler(struct ib_event *event, void *context)
{

2008-07-03 02:27:45

by Tom Tucker

Subject: [PATCH 05/11] svcrdma: Add dma map count and WARN_ON

Add a DMA map count in order to verify that all DMA mapping resources
have been freed when the transport is closed.

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 +
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 3 +++
net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +++++
4 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index fd5e8a1..ab93afc 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -130,6 +130,7 @@ struct svcxprt_rdma {

struct ib_pd *sc_pd;

+ atomic_t sc_dma_used;
atomic_t sc_ctxt_used;
struct list_head sc_ctxt_free;
int sc_ctxt_cnt;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index d25971b..b4b17f4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -222,6 +222,7 @@ static void rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
ctxt->count = count;
ctxt->direction = DMA_FROM_DEVICE;
for (i = 0; i < count; i++) {
+ atomic_inc(&xprt->sc_dma_used);
ctxt->sge[i].addr =
ib_dma_map_single(xprt->sc_cm_id->device,
vec[i].iov_base, vec[i].iov_len,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index bdc11a3..a19b22b 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -163,6 +163,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
sge_bytes = min((size_t)bc,
(size_t)(vec->sge[xdr_sge_no].iov_len-sge_off));
sge[sge_no].length = sge_bytes;
+ atomic_inc(&xprt->sc_dma_used);
sge[sge_no].addr =
ib_dma_map_single(xprt->sc_cm_id->device,
(void *)
@@ -385,6 +386,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
ctxt->count = 1;

/* Prepare the SGE for the RPCRDMA Header */
+ atomic_inc(&rdma->sc_dma_used);
ctxt->sge[0].addr =
ib_dma_map_page(rdma->sc_cm_id->device,
page, 0, PAGE_SIZE, DMA_TO_DEVICE);
@@ -396,6 +398,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) {
sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count);
byte_count -= sge_bytes;
+ atomic_inc(&rdma->sc_dma_used);
ctxt->sge[sge_no].addr =
ib_dma_map_single(rdma->sc_cm_id->device,
vec->sge[sge_no].iov_base,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 7e8ee66..6fddd58 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -155,6 +155,7 @@ static void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
struct svcxprt_rdma *xprt = ctxt->xprt;
int i;
for (i = 0; i < ctxt->count && ctxt->sge[i].length; i++) {
+ atomic_dec(&xprt->sc_dma_used);
ib_dma_unmap_single(xprt->sc_cm_id->device,
ctxt->sge[i].addr,
ctxt->sge[i].length,
@@ -519,6 +520,7 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
cma_xprt->sc_max_requests = svcrdma_max_requests;
cma_xprt->sc_sq_depth = svcrdma_max_requests * RPCRDMA_SQ_DEPTH_MULT;
atomic_set(&cma_xprt->sc_sq_count, 0);
+ atomic_set(&cma_xprt->sc_ctxt_used, 0);

if (!listener) {
int reqs = cma_xprt->sc_max_requests;
@@ -569,6 +571,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
BUG_ON(sge_no >= xprt->sc_max_sge);
page = svc_rdma_get_page();
ctxt->pages[sge_no] = page;
+ atomic_inc(&xprt->sc_dma_used);
pa = ib_dma_map_page(xprt->sc_cm_id->device,
page, 0, PAGE_SIZE,
DMA_FROM_DEVICE);
@@ -1049,6 +1052,7 @@ static void __svc_rdma_free(struct work_struct *work)

/* Warn if we leaked a resource or under-referenced */
WARN_ON(atomic_read(&rdma->sc_ctxt_used) != 0);
+ WARN_ON(atomic_read(&rdma->sc_dma_used) != 0);

/* Destroy the QP if present (not a listener) */
if (rdma->sc_qp && !IS_ERR(rdma->sc_qp))
@@ -1169,6 +1173,7 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);

/* Prepare SGE for local address */
+ atomic_inc(&xprt->sc_dma_used);
sge.addr = ib_dma_map_page(xprt->sc_cm_id->device,
p, 0, PAGE_SIZE, DMA_FROM_DEVICE);
sge.lkey = xprt->sc_phys_mr->lkey;

2008-07-03 02:27:45

by Tom Tucker

Subject: [PATCH 11/11] svcrdma: Change WR context get/put to use the kmem cache

Change the WR context pool to be shared across mount points. This
reduces the RDMA transport memory footprint significantly since
idle mounts don't consume WR context memory.

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 6 --
net/sunrpc/xprtrdma/svc_rdma_transport.c | 121 +++---------------------------
2 files changed, 12 insertions(+), 115 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index d8d74c4..ef2e3a2 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -73,7 +73,6 @@ extern atomic_t rdma_stat_sq_prod;
struct svc_rdma_op_ctxt {
struct svc_rdma_op_ctxt *read_hdr;
int hdr_count;
- struct list_head free_list;
struct xdr_buf arg;
struct list_head dto_q;
enum ib_wr_opcode wr_op;
@@ -131,11 +130,6 @@ struct svcxprt_rdma {

atomic_t sc_dma_used;
atomic_t sc_ctxt_used;
- struct list_head sc_ctxt_free;
- int sc_ctxt_cnt;
- int sc_ctxt_bump;
- int sc_ctxt_max;
- spinlock_t sc_ctxt_lock;
struct list_head sc_rq_dto_q;
spinlock_t sc_rq_dto_lock;
struct ib_qp *sc_qp;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 80104f4..19ddc38 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -84,69 +84,23 @@ struct svc_xprt_class svc_rdma_class = {
.xcl_max_payload = RPCSVC_MAXPAYLOAD_TCP,
};

-static int rdma_bump_context_cache(struct svcxprt_rdma *xprt)
-{
- int target;
- int at_least_one = 0;
- struct svc_rdma_op_ctxt *ctxt;
-
- target = min(xprt->sc_ctxt_cnt + xprt->sc_ctxt_bump,
- xprt->sc_ctxt_max);
-
- spin_lock_bh(&xprt->sc_ctxt_lock);
- while (xprt->sc_ctxt_cnt < target) {
- xprt->sc_ctxt_cnt++;
- spin_unlock_bh(&xprt->sc_ctxt_lock);
-
- ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
-
- spin_lock_bh(&xprt->sc_ctxt_lock);
- if (ctxt) {
- at_least_one = 1;
- INIT_LIST_HEAD(&ctxt->free_list);
- list_add(&ctxt->free_list, &xprt->sc_ctxt_free);
- } else {
- /* kmalloc failed...give up for now */
- xprt->sc_ctxt_cnt--;
- break;
- }
- }
- spin_unlock_bh(&xprt->sc_ctxt_lock);
- dprintk("svcrdma: sc_ctxt_max=%d, sc_ctxt_cnt=%d\n",
- xprt->sc_ctxt_max, xprt->sc_ctxt_cnt);
- return at_least_one;
-}
+/* WR context cache. Created in svc_rdma.c */
+extern struct kmem_cache *svc_rdma_ctxt_cachep;

struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
{
struct svc_rdma_op_ctxt *ctxt;

while (1) {
- spin_lock_bh(&xprt->sc_ctxt_lock);
- if (unlikely(list_empty(&xprt->sc_ctxt_free))) {
- /* Try to bump my cache. */
- spin_unlock_bh(&xprt->sc_ctxt_lock);
-
- if (rdma_bump_context_cache(xprt))
- continue;
-
- printk(KERN_INFO "svcrdma: sleeping waiting for "
- "context memory on xprt=%p\n",
- xprt);
- schedule_timeout_uninterruptible(msecs_to_jiffies(500));
- continue;
- }
- ctxt = list_entry(xprt->sc_ctxt_free.next,
- struct svc_rdma_op_ctxt,
- free_list);
- list_del_init(&ctxt->free_list);
- spin_unlock_bh(&xprt->sc_ctxt_lock);
- ctxt->xprt = xprt;
- INIT_LIST_HEAD(&ctxt->dto_q);
- ctxt->count = 0;
- atomic_inc(&xprt->sc_ctxt_used);
- break;
+ ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, GFP_KERNEL);
+ if (ctxt)
+ break;
+ schedule_timeout_uninterruptible(msecs_to_jiffies(500));
}
+ ctxt->xprt = xprt;
+ INIT_LIST_HEAD(&ctxt->dto_q);
+ ctxt->count = 0;
+ atomic_inc(&xprt->sc_ctxt_used);
return ctxt;
}

@@ -174,9 +128,7 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
for (i = 0; i < ctxt->count; i++)
put_page(ctxt->pages[i]);

- spin_lock_bh(&xprt->sc_ctxt_lock);
- list_add(&ctxt->free_list, &xprt->sc_ctxt_free);
- spin_unlock_bh(&xprt->sc_ctxt_lock);
+ kmem_cache_free(svc_rdma_ctxt_cachep, ctxt);
atomic_dec(&xprt->sc_ctxt_used);
}

@@ -461,40 +413,6 @@ static void sq_comp_handler(struct ib_cq *cq, void *cq_context)
tasklet_schedule(&dto_tasklet);
}

-static void create_context_cache(struct svcxprt_rdma *xprt,
- int ctxt_count, int ctxt_bump, int ctxt_max)
-{
- struct svc_rdma_op_ctxt *ctxt;
- int i;
-
- xprt->sc_ctxt_max = ctxt_max;
- xprt->sc_ctxt_bump = ctxt_bump;
- xprt->sc_ctxt_cnt = 0;
- atomic_set(&xprt->sc_ctxt_used, 0);
-
- INIT_LIST_HEAD(&xprt->sc_ctxt_free);
- for (i = 0; i < ctxt_count; i++) {
- ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
- if (ctxt) {
- INIT_LIST_HEAD(&ctxt->free_list);
- list_add(&ctxt->free_list, &xprt->sc_ctxt_free);
- xprt->sc_ctxt_cnt++;
- }
- }
-}
-
-static void destroy_context_cache(struct svcxprt_rdma *xprt)
-{
- while (!list_empty(&xprt->sc_ctxt_free)) {
- struct svc_rdma_op_ctxt *ctxt;
- ctxt = list_entry(xprt->sc_ctxt_free.next,
- struct svc_rdma_op_ctxt,
- free_list);
- list_del_init(&ctxt->free_list);
- kfree(ctxt);
- }
-}
-
static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
int listener)
{
@@ -511,7 +429,6 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,

spin_lock_init(&cma_xprt->sc_lock);
spin_lock_init(&cma_xprt->sc_read_complete_lock);
- spin_lock_init(&cma_xprt->sc_ctxt_lock);
spin_lock_init(&cma_xprt->sc_rq_dto_lock);

cma_xprt->sc_ord = svcrdma_ord;
@@ -522,20 +439,7 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
atomic_set(&cma_xprt->sc_sq_count, 0);
atomic_set(&cma_xprt->sc_ctxt_used, 0);

- if (!listener) {
- int reqs = cma_xprt->sc_max_requests;
- create_context_cache(cma_xprt,
- reqs << 1, /* starting size */
- reqs, /* bump amount */
- reqs +
- cma_xprt->sc_sq_depth +
- RPCRDMA_MAX_THREADS + 1); /* max */
- if (list_empty(&cma_xprt->sc_ctxt_free)) {
- kfree(cma_xprt);
- return NULL;
- }
- clear_bit(XPT_LISTENER, &cma_xprt->sc_xprt.xpt_flags);
- } else
+ if (listener)
set_bit(XPT_LISTENER, &cma_xprt->sc_xprt.xpt_flags);

return cma_xprt;
@@ -1077,7 +981,6 @@ static void __svc_rdma_free(struct work_struct *work)
/* Destroy the CM ID */
rdma_destroy_id(rdma->sc_cm_id);

- destroy_context_cache(rdma);
kfree(rdma);
}


2008-07-03 02:27:45

by Tom Tucker

Subject: [PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure

The sc_read_wait queue head is no longer used. Remove it.

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index ab93afc..d8d74c4 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -119,7 +119,6 @@ struct svcxprt_rdma {
struct rdma_cm_id *sc_cm_id; /* RDMA connection id */
struct list_head sc_accept_q; /* Conn. waiting accept */
int sc_ord; /* RDMA read limit */
- wait_queue_head_t sc_read_wait;
int sc_max_sge;

int sc_sq_depth; /* Depth of SQ */

2008-07-03 02:27:45

by Tom Tucker

Subject: [PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD

When adapters have differing IRD limits, the RDMA transport will fail to
connect properly. The RDMA transport should use the client's advertised
inbound read limit when computing its outbound read limit. For iWARP
transports, there is currently no standard for exchanging IRD/ORD
during connection establishment, so the 'responder_resources' field in the
connect event is the local device's limit. The ORD used for the connection
is the minimum of the client's advertised IRD, the local device limit, and
the configured svcrdma limit. The RDMA transport can be configured to use
a smaller ORD by writing the desired number to the
/proc/sys/sunrpc/svc_rdma/max_outbound_read_requests file.

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 16 ++++++++++++----
1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 7647789..80104f4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -606,7 +606,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
* will call the recvfrom method on the listen xprt which will accept the new
* connection.
*/
-static void handle_connect_req(struct rdma_cm_id *new_cma_id)
+static void handle_connect_req(struct rdma_cm_id *new_cma_id, size_t client_ird)
{
struct svcxprt_rdma *listen_xprt = new_cma_id->context;
struct svcxprt_rdma *newxprt;
@@ -623,6 +623,9 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id)
dprintk("svcrdma: Creating newxprt=%p, cm_id=%p, listenxprt=%p\n",
newxprt, newxprt->sc_cm_id, listen_xprt);

+ /* Save client advertised inbound read limit for use later in accept. */
+ newxprt->sc_ord = client_ird;
+
/* Set the local and remote addresses in the transport */
sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
@@ -659,7 +662,8 @@ static int rdma_listen_handler(struct rdma_cm_id *cma_id,
case RDMA_CM_EVENT_CONNECT_REQUEST:
dprintk("svcrdma: Connect request on cma_id=%p, xprt = %p, "
"event=%d\n", cma_id, cma_id->context, event->event);
- handle_connect_req(cma_id);
+ handle_connect_req(cma_id,
+ event->param.conn.responder_resources);
break;

case RDMA_CM_EVENT_ESTABLISHED:
@@ -833,8 +837,12 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
(size_t)svcrdma_max_requests);
newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;

- newxprt->sc_ord = min((size_t)devattr.max_qp_rd_atom,
- (size_t)svcrdma_ord);
+ /*
+ * Limit ORD based on client limit, local device limit, and
+ * configured svcrdma limit.
+ */
+ newxprt->sc_ord = min_t(size_t, devattr.max_qp_rd_atom, newxprt->sc_ord);
+ newxprt->sc_ord = min_t(size_t, svcrdma_ord, newxprt->sc_ord);

newxprt->sc_pd = ib_alloc_pd(newxprt->sc_cm_id->device);
if (IS_ERR(newxprt->sc_pd)) {

2008-07-03 02:27:45

by Tom Tucker

Subject: [PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free

At the time __svc_rdma_free is called, we are guaranteed that all references
to this transport are gone. There is, therefore, no need to protect the
resource lists with a spin lock.

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 ----
1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 6fddd58..7647789 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1027,7 +1027,6 @@ static void __svc_rdma_free(struct work_struct *work)
* cm_id because the device ptr is needed to unmap the dma in
* svc_rdma_put_context.
*/
- spin_lock_bh(&rdma->sc_read_complete_lock);
while (!list_empty(&rdma->sc_read_complete_q)) {
struct svc_rdma_op_ctxt *ctxt;
ctxt = list_entry(rdma->sc_read_complete_q.next,
@@ -1036,10 +1035,8 @@ static void __svc_rdma_free(struct work_struct *work)
list_del_init(&ctxt->dto_q);
svc_rdma_put_context(ctxt, 1);
}
- spin_unlock_bh(&rdma->sc_read_complete_lock);

/* Destroy queued, but not processed recv completions */
- spin_lock_bh(&rdma->sc_rq_dto_lock);
while (!list_empty(&rdma->sc_rq_dto_q)) {
struct svc_rdma_op_ctxt *ctxt;
ctxt = list_entry(rdma->sc_rq_dto_q.next,
@@ -1048,7 +1045,6 @@ static void __svc_rdma_free(struct work_struct *work)
list_del_init(&ctxt->dto_q);
svc_rdma_put_context(ctxt, 1);
}
- spin_unlock_bh(&rdma->sc_rq_dto_lock);

/* Warn if we leaked a resource or under-referenced */
WARN_ON(atomic_read(&rdma->sc_ctxt_used) != 0);

2008-07-03 02:27:45

by Tom Tucker

Subject: [PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function

Make certain that all transports pending free are flushed from the
workqueue before unloading the module.

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 171f205..527acfd 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -239,6 +239,7 @@ static ctl_table svcrdma_root_table[] = {
void svc_rdma_cleanup(void)
{
dprintk("SVCRDMA Module Removed, deregister RPC RDMA transport\n");
+ flush_scheduled_work();
if (svcrdma_table_header) {
unregister_sysctl_table(svcrdma_table_header);
svcrdma_table_header = NULL;

2008-07-03 02:27:45

by Tom Tucker

Subject: [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts

Create a kmem cache to hold WR contexts. The next patch converts
the WR context get and put functions to use this kmem cache.

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma.c | 23 +++++++++++++++++++----
1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 527acfd..8710117 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -69,8 +69,9 @@ atomic_t rdma_stat_rq_prod;
atomic_t rdma_stat_sq_poll;
atomic_t rdma_stat_sq_prod;

-/* Temporary NFS request map cache */
+/* Temporary NFS request map and context caches */
struct kmem_cache *svc_rdma_map_cachep;
+struct kmem_cache *svc_rdma_ctxt_cachep;

/*
* This function implements reading and resetting an atomic_t stat
@@ -246,6 +247,7 @@ void svc_rdma_cleanup(void)
}
svc_unreg_xprt_class(&svc_rdma_class);
kmem_cache_destroy(svc_rdma_map_cachep);
+ kmem_cache_destroy(svc_rdma_ctxt_cachep);
}

int svc_rdma_init(void)
@@ -268,14 +270,27 @@ int svc_rdma_init(void)
NULL);
if (!svc_rdma_map_cachep) {
printk(KERN_INFO "Could not allocate map cache.\n");
- goto err;
+ goto err0;
+ }
+
+ /* Create the temporary context cache */
+ svc_rdma_ctxt_cachep =
+ kmem_cache_create("svc_rdma_ctxt_cache",
+ sizeof(struct svc_rdma_op_ctxt),
+ 0,
+ SLAB_HWCACHE_ALIGN,
+ NULL);
+ if (!svc_rdma_ctxt_cachep) {
+ printk(KERN_INFO "Could not allocate WR ctxt cache.\n");
+ goto err1;
}

/* Register RDMA with the SVC transport switch */
svc_reg_xprt_class(&svc_rdma_class);
return 0;
-
- err:
+ err1:
+ kmem_cache_destroy(svc_rdma_map_cachep);
+ err0:
unregister_sysctl_table(svcrdma_table_header);
return -ENOMEM;
}

2008-07-03 02:27:44

by Tom Tucker

Subject: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing

Use the new svc_rdma_req_map data type for mapping client-side memory
to server-side memory. Move the DMA mapping into the context pointed to
by each individual WR so that it is unmapped when that WR completes.

Signed-off-by: Tom Tucker <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 163 ++++++++++++++----------------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +-
2 files changed, 80 insertions(+), 88 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index fb82b1b..bdc11a3 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -63,52 +63,44 @@
* SGE[2..sge_count-2] data from xdr->pages[]
* SGE[sge_count-1] data from xdr->tail.
*
+ * The max SGE we need is the length of the XDR / pagesize + one for
+ * head + one for tail + one for RPCRDMA header. Since RPCSVC_MAXPAGES
+ * reserves a page for both the request and the reply header, and this
+ * array is only concerned with the reply, we are assured that we have
+ * one extra page for the RPCRDMA header.
*/
-static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
- struct xdr_buf *xdr,
- struct ib_sge *sge,
- int *sge_count)
+static void xdr_to_sge(struct svcxprt_rdma *xprt,
+ struct xdr_buf *xdr,
+ struct svc_rdma_req_map *vec)
{
- /* Max we need is the length of the XDR / pagesize + one for
- * head + one for tail + one for RPCRDMA header
- */
int sge_max = (xdr->len+PAGE_SIZE-1) / PAGE_SIZE + 3;
int sge_no;
- u32 byte_count = xdr->len;
u32 sge_bytes;
u32 page_bytes;
- int page_off;
+ u32 page_off;
int page_no;

+ BUG_ON(xdr->len !=
+ (xdr->head[0].iov_len + xdr->page_len + xdr->tail[0].iov_len));
+
/* Skip the first sge, this is for the RPCRDMA header */
sge_no = 1;

/* Head SGE */
- sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device,
- xdr->head[0].iov_base,
- xdr->head[0].iov_len,
- DMA_TO_DEVICE);
- sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len);
- byte_count -= sge_bytes;
- sge[sge_no].length = sge_bytes;
- sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
+ vec->sge[sge_no].iov_base = xdr->head[0].iov_base;
+ vec->sge[sge_no].iov_len = xdr->head[0].iov_len;
sge_no++;

/* pages SGE */
page_no = 0;
page_bytes = xdr->page_len;
page_off = xdr->page_base;
- while (byte_count && page_bytes) {
- sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off));
- sge[sge_no].addr =
- ib_dma_map_page(xprt->sc_cm_id->device,
- xdr->pages[page_no], page_off,
- sge_bytes, DMA_TO_DEVICE);
- sge_bytes = min(sge_bytes, page_bytes);
- byte_count -= sge_bytes;
+ while (page_bytes) {
+ vec->sge[sge_no].iov_base =
+ page_address(xdr->pages[page_no]) + page_off;
+ sge_bytes = min_t(u32, page_bytes, (PAGE_SIZE - page_off));
page_bytes -= sge_bytes;
- sge[sge_no].length = sge_bytes;
- sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
+ vec->sge[sge_no].iov_len = sge_bytes;

sge_no++;
page_no++;
@@ -116,36 +108,24 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
}

/* Tail SGE */
- if (byte_count && xdr->tail[0].iov_len) {
- sge[sge_no].addr =
- ib_dma_map_single(xprt->sc_cm_id->device,
- xdr->tail[0].iov_base,
- xdr->tail[0].iov_len,
- DMA_TO_DEVICE);
- sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len);
- byte_count -= sge_bytes;
- sge[sge_no].length = sge_bytes;
- sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
+ if (xdr->tail[0].iov_len) {
+ vec->sge[sge_no].iov_base = xdr->tail[0].iov_base;
+ vec->sge[sge_no].iov_len = xdr->tail[0].iov_len;
sge_no++;
}

BUG_ON(sge_no > sge_max);
- BUG_ON(byte_count != 0);
-
- *sge_count = sge_no;
- return sge;
+ vec->count = sge_no;
}

-
/* Assumptions:
* - The specified write_len can be represented in sc_max_sge * PAGE_SIZE
*/
static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
u32 rmr, u64 to,
u32 xdr_off, int write_len,
- struct ib_sge *xdr_sge, int sge_count)
+ struct svc_rdma_req_map *vec)
{
- struct svc_rdma_op_ctxt *tmp_sge_ctxt;
struct ib_send_wr write_wr;
struct ib_sge *sge;
int xdr_sge_no;
@@ -154,25 +134,23 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
int sge_off;
int bc;
struct svc_rdma_op_ctxt *ctxt;
- int ret = 0;

- BUG_ON(sge_count > RPCSVC_MAXPAGES);
+ BUG_ON(vec->count > RPCSVC_MAXPAGES);
dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, "
- "write_len=%d, xdr_sge=%p, sge_count=%d\n",
+ "write_len=%d, vec->sge=%p, vec->count=%lu\n",
rmr, (unsigned long long)to, xdr_off,
- write_len, xdr_sge, sge_count);
+ write_len, vec->sge, vec->count);

ctxt = svc_rdma_get_context(xprt);
- ctxt->count = 0;
- tmp_sge_ctxt = svc_rdma_get_context(xprt);
- sge = tmp_sge_ctxt->sge;
+ ctxt->direction = DMA_TO_DEVICE;
+ sge = ctxt->sge;

/* Find the SGE associated with xdr_off */
- for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < sge_count;
+ for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < vec->count;
xdr_sge_no++) {
- if (xdr_sge[xdr_sge_no].length > bc)
+ if (vec->sge[xdr_sge_no].iov_len > bc)
break;
- bc -= xdr_sge[xdr_sge_no].length;
+ bc -= vec->sge[xdr_sge_no].iov_len;
}

sge_off = bc;
@@ -180,21 +158,27 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
sge_no = 0;

/* Copy the remaining SGE */
- while (bc != 0 && xdr_sge_no < sge_count) {
- sge[sge_no].addr = xdr_sge[xdr_sge_no].addr + sge_off;
- sge[sge_no].lkey = xdr_sge[xdr_sge_no].lkey;
+ while (bc != 0 && xdr_sge_no < vec->count) {
+ sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
sge_bytes = min((size_t)bc,
- (size_t)(xdr_sge[xdr_sge_no].length-sge_off));
+ (size_t)(vec->sge[xdr_sge_no].iov_len-sge_off));
sge[sge_no].length = sge_bytes;
-
+ sge[sge_no].addr =
+ ib_dma_map_single(xprt->sc_cm_id->device,
+ (void *)
+ vec->sge[xdr_sge_no].iov_base + sge_off,
+ sge_bytes, DMA_TO_DEVICE);
+ if (dma_mapping_error(sge[sge_no].addr))
+ goto err;
sge_off = 0;
sge_no++;
+ ctxt->count++;
xdr_sge_no++;
bc -= sge_bytes;
}

BUG_ON(bc != 0);
- BUG_ON(xdr_sge_no > sge_count);
+ BUG_ON(xdr_sge_no > vec->count);

/* Prepare WRITE WR */
memset(&write_wr, 0, sizeof write_wr);
@@ -209,21 +193,20 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,

/* Post It */
atomic_inc(&rdma_stat_write);
- if (svc_rdma_send(xprt, &write_wr)) {
- svc_rdma_put_context(ctxt, 1);
- /* Fatal error, close transport */
- ret = -EIO;
- }
- svc_rdma_put_context(tmp_sge_ctxt, 0);
- return ret;
+ if (svc_rdma_send(xprt, &write_wr))
+ goto err;
+ return 0;
+ err:
+ svc_rdma_put_context(ctxt, 0);
+ /* Fatal error, close transport */
+ return -EIO;
}

static int send_write_chunks(struct svcxprt_rdma *xprt,
struct rpcrdma_msg *rdma_argp,
struct rpcrdma_msg *rdma_resp,
struct svc_rqst *rqstp,
- struct ib_sge *sge,
- int sge_count)
+ struct svc_rdma_req_map *vec)
{
u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len;
int write_len;
@@ -269,8 +252,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
rs_offset + chunk_off,
xdr_off,
this_write,
- sge,
- sge_count);
+ vec);
if (ret) {
dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
ret);
@@ -292,8 +274,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
struct rpcrdma_msg *rdma_argp,
struct rpcrdma_msg *rdma_resp,
struct svc_rqst *rqstp,
- struct ib_sge *sge,
- int sge_count)
+ struct svc_rdma_req_map *vec)
{
u32 xfer_len = rqstp->rq_res.len;
int write_len;
@@ -341,8 +322,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
rs_offset + chunk_off,
xdr_off,
this_write,
- sge,
- sge_count);
+ vec);
if (ret) {
dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
ret);
@@ -380,7 +360,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
struct page *page,
struct rpcrdma_msg *rdma_resp,
struct svc_rdma_op_ctxt *ctxt,
- int sge_count,
+ struct svc_rdma_req_map *vec,
int byte_count)
{
struct ib_send_wr send_wr;
@@ -413,10 +393,15 @@ static int send_reply(struct svcxprt_rdma *rdma,
ctxt->sge[0].lkey = rdma->sc_phys_mr->lkey;

/* Determine how many of our SGE are to be transmitted */
- for (sge_no = 1; byte_count && sge_no < sge_count; sge_no++) {
- sge_bytes = min((size_t)ctxt->sge[sge_no].length,
- (size_t)byte_count);
+ for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) {
+ sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count);
byte_count -= sge_bytes;
+ ctxt->sge[sge_no].addr =
+ ib_dma_map_single(rdma->sc_cm_id->device,
+ vec->sge[sge_no].iov_base,
+ sge_bytes, DMA_TO_DEVICE);
+ ctxt->sge[sge_no].length = sge_bytes;
+ ctxt->sge[sge_no].lkey = rdma->sc_phys_mr->lkey;
}
BUG_ON(byte_count != 0);

@@ -428,8 +413,10 @@ static int send_reply(struct svcxprt_rdma *rdma,
ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
ctxt->count++;
rqstp->rq_respages[page_no] = NULL;
+ /* If there are more pages than SGE, terminate SGE list */
+ if (page_no+1 >= sge_no)
+ ctxt->sge[page_no+1].length = 0;
}
-
BUG_ON(sge_no > rdma->sc_max_sge);
memset(&send_wr, 0, sizeof send_wr);
ctxt->wr_op = IB_WR_SEND;
@@ -473,20 +460,20 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
enum rpcrdma_proc reply_type;
int ret;
int inline_bytes;
- struct ib_sge *sge;
- int sge_count = 0;
struct page *res_page;
struct svc_rdma_op_ctxt *ctxt;
+ struct svc_rdma_req_map *vec;

dprintk("svcrdma: sending response for rqstp=%p\n", rqstp);

/* Get the RDMA request header. */
rdma_argp = xdr_start(&rqstp->rq_arg);

- /* Build an SGE for the XDR */
+ /* Build an req vec for the XDR */
ctxt = svc_rdma_get_context(rdma);
ctxt->direction = DMA_TO_DEVICE;
- sge = xdr_to_sge(rdma, &rqstp->rq_res, ctxt->sge, &sge_count);
+ vec = svc_rdma_get_req_map();
+ xdr_to_sge(rdma, &rqstp->rq_res, vec);

inline_bytes = rqstp->rq_res.len;

@@ -503,7 +490,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)

/* Send any write-chunk data and build resp write-list */
ret = send_write_chunks(rdma, rdma_argp, rdma_resp,
- rqstp, sge, sge_count);
+ rqstp, vec);
if (ret < 0) {
printk(KERN_ERR "svcrdma: failed to send write chunks, rc=%d\n",
ret);
@@ -513,7 +500,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)

/* Send any reply-list data and update resp reply-list */
ret = send_reply_chunks(rdma, rdma_argp, rdma_resp,
- rqstp, sge, sge_count);
+ rqstp, vec);
if (ret < 0) {
printk(KERN_ERR "svcrdma: failed to send reply chunks, rc=%d\n",
ret);
@@ -521,11 +508,13 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
}
inline_bytes -= ret;

- ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, sge_count,
+ ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, vec,
inline_bytes);
+ svc_rdma_put_req_map(vec);
dprintk("svcrdma: send_reply returns %d\n", ret);
return ret;
error:
+ svc_rdma_put_req_map(vec);
svc_rdma_put_context(ctxt, 0);
put_page(res_page);
return ret;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index ae90758..fc86338 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -387,10 +387,13 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt)

switch (ctxt->wr_op) {
case IB_WR_SEND:
- case IB_WR_RDMA_WRITE:
svc_rdma_put_context(ctxt, 1);
break;

+ case IB_WR_RDMA_WRITE:
+ svc_rdma_put_context(ctxt, 0);
+ break;
+
case IB_WR_RDMA_READ:
if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;

2008-07-03 02:27:44

by Tom Tucker

Subject: [PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler

Separate DMA unmap from context destruction and perform DMA unmapping
in the SQ/RQ CQ reap functions. This is necessary to support software-based
RDMA implementations that actually copy the data in their ib_dma_unmap
callback functions, and architectures that don't have cache-coherent
I/O buses.

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 20 ++++++++++++++------
1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index fc86338..7e8ee66 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -150,6 +150,18 @@ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
return ctxt;
}

+static void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
+{
+ struct svcxprt_rdma *xprt = ctxt->xprt;
+ int i;
+ for (i = 0; i < ctxt->count && ctxt->sge[i].length; i++) {
+ ib_dma_unmap_single(xprt->sc_cm_id->device,
+ ctxt->sge[i].addr,
+ ctxt->sge[i].length,
+ ctxt->direction);
+ }
+}
+
void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
{
struct svcxprt_rdma *xprt;
@@ -161,12 +173,6 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
for (i = 0; i < ctxt->count; i++)
put_page(ctxt->pages[i]);

- for (i = 0; i < ctxt->count; i++)
- ib_dma_unmap_single(xprt->sc_cm_id->device,
- ctxt->sge[i].addr,
- ctxt->sge[i].length,
- ctxt->direction);
-
spin_lock_bh(&xprt->sc_ctxt_lock);
list_add(&ctxt->free_list, &xprt->sc_ctxt_free);
spin_unlock_bh(&xprt->sc_ctxt_lock);
@@ -328,6 +334,7 @@ static void rq_cq_reap(struct svcxprt_rdma *xprt)
ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
ctxt->wc_status = wc.status;
ctxt->byte_len = wc.byte_len;
+ svc_rdma_unmap_dma(ctxt);
if (wc.status != IB_WC_SUCCESS) {
/* Close the transport */
dprintk("svcrdma: transport closing putting ctxt %p\n", ctxt);
@@ -377,6 +384,7 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt)
ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
xprt = ctxt->xprt;

+ svc_rdma_unmap_dma(ctxt);
if (wc.status != IB_WC_SUCCESS)
/* Close the transport */
set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);

2008-07-03 02:27:44

by Tom Tucker

Subject: [PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing

Modify the RDMA_READ processing to use the reply and chunk list mapping data
types. Also add a special-purpose 'hdr_count' field in the context to hold
the header page count instead of overloading the SGE length field and
corrupting the DMA map length.

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 83 ++++++++++++++-----------------
2 files changed, 39 insertions(+), 45 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index bd8749c..fd5e8a1 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -72,6 +72,7 @@ extern atomic_t rdma_stat_sq_prod;
*/
struct svc_rdma_op_ctxt {
struct svc_rdma_op_ctxt *read_hdr;
+ int hdr_count;
struct list_head free_list;
struct xdr_buf arg;
struct list_head dto_q;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 06ab484..d25971b 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -112,11 +112,6 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
rqstp->rq_arg.tail[0].iov_len = 0;
}

-struct chunk_sge {
- int start; /* sge no for this chunk */
- int count; /* sge count for this chunk */
-};
-
/* Encode a read-chunk-list as an array of IB SGE
*
* Assumptions:
@@ -134,8 +129,8 @@ static int rdma_rcl_to_sge(struct svcxprt_rdma *xprt,
struct svc_rqst *rqstp,
struct svc_rdma_op_ctxt *head,
struct rpcrdma_msg *rmsgp,
- struct ib_sge *sge,
- struct chunk_sge *ch_sge_ary,
+ struct svc_rdma_req_map *rpl_map,
+ struct svc_rdma_req_map *chl_map,
int ch_count,
int byte_count)
{
@@ -156,22 +151,18 @@ static int rdma_rcl_to_sge(struct svcxprt_rdma *xprt,
head->arg.head[0] = rqstp->rq_arg.head[0];
head->arg.tail[0] = rqstp->rq_arg.tail[0];
head->arg.pages = &head->pages[head->count];
- head->sge[0].length = head->count; /* save count of hdr pages */
+ head->hdr_count = head->count; /* save count of hdr pages */
head->arg.page_base = 0;
head->arg.page_len = ch_bytes;
head->arg.len = rqstp->rq_arg.len + ch_bytes;
head->arg.buflen = rqstp->rq_arg.buflen + ch_bytes;
head->count++;
- ch_sge_ary[0].start = 0;
+ chl_map->ch[0].start = 0;
while (byte_count) {
+ rpl_map->sge[sge_no].iov_base =
+ page_address(rqstp->rq_arg.pages[page_no]) + page_off;
sge_bytes = min_t(int, PAGE_SIZE-page_off, ch_bytes);
- sge[sge_no].addr =
- ib_dma_map_page(xprt->sc_cm_id->device,
- rqstp->rq_arg.pages[page_no],
- page_off, sge_bytes,
- DMA_FROM_DEVICE);
- sge[sge_no].length = sge_bytes;
- sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
+ rpl_map->sge[sge_no].iov_len = sge_bytes;
/*
* Don't bump head->count here because the same page
* may be used by multiple SGE.
@@ -187,11 +178,11 @@ static int rdma_rcl_to_sge(struct svcxprt_rdma *xprt,
* SGE, move to the next SGE
*/
if (ch_bytes == 0) {
- ch_sge_ary[ch_no].count =
- sge_no - ch_sge_ary[ch_no].start;
+ chl_map->ch[ch_no].count =
+ sge_no - chl_map->ch[ch_no].start;
ch_no++;
ch++;
- ch_sge_ary[ch_no].start = sge_no;
+ chl_map->ch[ch_no].start = sge_no;
ch_bytes = ch->rc_target.rs_length;
/* If bytes remaining account for next chunk */
if (byte_count) {
@@ -220,18 +211,24 @@ static int rdma_rcl_to_sge(struct svcxprt_rdma *xprt,
return sge_no;
}

-static void rdma_set_ctxt_sge(struct svc_rdma_op_ctxt *ctxt,
- struct ib_sge *sge,
+static void rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
+ struct svc_rdma_op_ctxt *ctxt,
+ struct kvec *vec,
u64 *sgl_offset,
int count)
{
int i;

ctxt->count = count;
+ ctxt->direction = DMA_FROM_DEVICE;
for (i = 0; i < count; i++) {
- ctxt->sge[i].addr = sge[i].addr;
- ctxt->sge[i].length = sge[i].length;
- *sgl_offset = *sgl_offset + sge[i].length;
+ ctxt->sge[i].addr =
+ ib_dma_map_single(xprt->sc_cm_id->device,
+ vec[i].iov_base, vec[i].iov_len,
+ DMA_FROM_DEVICE);
+ ctxt->sge[i].length = vec[i].iov_len;
+ ctxt->sge[i].lkey = xprt->sc_phys_mr->lkey;
+ *sgl_offset = *sgl_offset + vec[i].iov_len;
}
}

@@ -282,34 +279,29 @@ static int rdma_read_xdr(struct svcxprt_rdma *xprt,
struct ib_send_wr read_wr;
int err = 0;
int ch_no;
- struct ib_sge *sge;
int ch_count;
int byte_count;
int sge_count;
u64 sgl_offset;
struct rpcrdma_read_chunk *ch;
struct svc_rdma_op_ctxt *ctxt = NULL;
- struct svc_rdma_op_ctxt *tmp_sge_ctxt;
- struct svc_rdma_op_ctxt *tmp_ch_ctxt;
- struct chunk_sge *ch_sge_ary;
+ struct svc_rdma_req_map *rpl_map;
+ struct svc_rdma_req_map *chl_map;

/* If no read list is present, return 0 */
ch = svc_rdma_get_read_chunk(rmsgp);
if (!ch)
return 0;

- /* Allocate temporary contexts to keep SGE */
- BUG_ON(sizeof(struct ib_sge) < sizeof(struct chunk_sge));
- tmp_sge_ctxt = svc_rdma_get_context(xprt);
- sge = tmp_sge_ctxt->sge;
- tmp_ch_ctxt = svc_rdma_get_context(xprt);
- ch_sge_ary = (struct chunk_sge *)tmp_ch_ctxt->sge;
+ /* Allocate temporary reply and chunk maps */
+ rpl_map = svc_rdma_get_req_map();
+ chl_map = svc_rdma_get_req_map();

svc_rdma_rcl_chunk_counts(ch, &ch_count, &byte_count);
if (ch_count > RPCSVC_MAXPAGES)
return -EINVAL;
sge_count = rdma_rcl_to_sge(xprt, rqstp, hdr_ctxt, rmsgp,
- sge, ch_sge_ary,
+ rpl_map, chl_map,
ch_count, byte_count);
sgl_offset = 0;
ch_no = 0;
@@ -331,14 +323,15 @@ next_sge:
read_wr.wr.rdma.remote_addr =
get_unaligned(&(ch->rc_target.rs_offset)) +
sgl_offset;
- read_wr.sg_list = &sge[ch_sge_ary[ch_no].start];
+ read_wr.sg_list = ctxt->sge;
read_wr.num_sge =
- rdma_read_max_sge(xprt, ch_sge_ary[ch_no].count);
- rdma_set_ctxt_sge(ctxt, &sge[ch_sge_ary[ch_no].start],
+ rdma_read_max_sge(xprt, chl_map->ch[ch_no].count);
+ rdma_set_ctxt_sge(xprt, ctxt,
+ &rpl_map->sge[chl_map->ch[ch_no].start],
&sgl_offset,
read_wr.num_sge);
if (((ch+1)->rc_discrim == 0) &&
- (read_wr.num_sge == ch_sge_ary[ch_no].count)) {
+ (read_wr.num_sge == chl_map->ch[ch_no].count)) {
/*
* Mark the last RDMA_READ with a bit to
* indicate all RPC data has been fetched from
@@ -358,9 +351,9 @@ next_sge:
}
atomic_inc(&rdma_stat_read);

- if (read_wr.num_sge < ch_sge_ary[ch_no].count) {
- ch_sge_ary[ch_no].count -= read_wr.num_sge;
- ch_sge_ary[ch_no].start += read_wr.num_sge;
+ if (read_wr.num_sge < chl_map->ch[ch_no].count) {
+ chl_map->ch[ch_no].count -= read_wr.num_sge;
+ chl_map->ch[ch_no].start += read_wr.num_sge;
goto next_sge;
}
sgl_offset = 0;
@@ -368,8 +361,8 @@ next_sge:
}

out:
- svc_rdma_put_context(tmp_sge_ctxt, 0);
- svc_rdma_put_context(tmp_ch_ctxt, 0);
+ svc_rdma_put_req_map(rpl_map);
+ svc_rdma_put_req_map(chl_map);

/* Detach arg pages. svc_recv will replenish them */
for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
@@ -399,7 +392,7 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
rqstp->rq_pages[page_no] = head->pages[page_no];
}
/* Point rq_arg.pages past header */
- rqstp->rq_arg.pages = &rqstp->rq_pages[head->sge[0].length];
+ rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
rqstp->rq_arg.page_len = head->arg.page_len;
rqstp->rq_arg.page_base = head->arg.page_base;


2008-07-03 20:27:57

by J. Bruce Fields

Subject: Re: [PATCH 00/11] svcrdma: WR context management bug fixes and cleanup

On Wed, Jul 02, 2008 at 09:27:32PM -0500, Tom Tucker wrote:
> Bruce:
>
> This is version 2 of the WR context cleanup patchset. It includes modifications
> per your review. It has been tested with iozone and Connectathon over IB
> and iWARP on x86_64. The x86_32 platform has only been compile tested for
> warnings.
>
> This set of patches fixes bugs related to DMA mapping and unmapping and
> in the process redesigns and cleans up the way that WR contexts are managed.
>
> In addition to the bug fixes, there are three major changes:
>
> - The data structure that maps client side memory to server side
> memory has been separated from the WR context. This makes for cleaner
> type checking and a smaller memory footprint for the RDMA transport.
>
> - DMA mapping and unmapping are performed independently of the
> allocation and deallocation of the WR context. This allows the
> unmapping to occur early in the receive path, where the DMA must be
> unmapped before the data is examined and therefore before the WR
> context is deallocated.
>
> - The WR context cache now uses a kmem_cache and is shared across
> all transport instances. This allows contexts to be shared between
> mounts, and idle mounts no longer consume context memory.
>
> This patchset is intended for 2.6.27 and is based on 2.6.26-rc8. These patches
> are also available at the following git url:
>
> git://git.linux-nfs.org/projects/tomtucker/xprt-switch-2.6.git

Thanks!--I've merged this and pushed out the result to

git://linux-nfs.org/~bfields/linux.git for-2.6.27

--b.
