2021-07-26 14:47:06

by Chuck Lever III

Subject: [PATCH v1 0/3] Optimize NFSD Send completion processing

The following series improves the efficiency of NFSD's Send
completion processing by removing costly operations from the svcrdma
Send completion handlers. Each of these patches reduces the CPU
utilized per RPC by Send completion by an average of 2-3%.

The goal is to improve the rate of RPCs that can be retired for a
single-transport workload, thus increasing the server's scalability.

These patches are also available for testing:

https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/log/?h=for-next

---

Chuck Lever (3):
svcrdma: Fewer calls to wake_up() in Send completion handler
svcrdma: Relieve contention on sc_send_lock.
svcrdma: Convert rdma->sc_rw_ctxts to llist


include/linux/sunrpc/svc_rdma.h | 7 +--
net/sunrpc/xprtrdma/svc_rdma_rw.c | 56 ++++++++++++++++--------
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 41 +++++++++--------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 +-
4 files changed, 66 insertions(+), 42 deletions(-)

--
Chuck Lever


2021-07-26 14:47:18

by Chuck Lever III

Subject: [PATCH v1 1/3] svcrdma: Fewer calls to wake_up() in Send completion handler

Because wake_up() takes an IRQ-safe lock, it can be expensive,
especially to call inside a single-threaded completion handler.
What's more, the Send wait queue almost never has waiters, so
most of the time, this is an expensive no-op.

As always, the goal is to reduce the average overhead of each
completion, because a transport's completion handlers are single-
threaded on one CPU core. This change reduces CPU utilization of
the Send completion thread by 2-3% on my server.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma_rw.c | 7 ++-----
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 18 +++++++++++++++---
3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 3184465de3a0..57c60ffe76dd 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -207,6 +207,7 @@ extern void svc_rdma_send_error_msg(struct svcxprt_rdma *rdma,
struct svc_rdma_send_ctxt *sctxt,
struct svc_rdma_recv_ctxt *rctxt,
int status);
+extern void svc_rdma_wake_send_waiters(struct svcxprt_rdma *rdma, int avail);
extern int svc_rdma_sendto(struct svc_rqst *);
extern int svc_rdma_result_payload(struct svc_rqst *rqstp, unsigned int offset,
unsigned int length);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 1e651447dc4e..3d1b119f6e3e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -248,8 +248,7 @@ static void svc_rdma_write_done(struct ib_cq *cq, struct ib_wc *wc)

trace_svcrdma_wc_write(wc, &cc->cc_cid);

- atomic_add(cc->cc_sqecount, &rdma->sc_sq_avail);
- wake_up(&rdma->sc_send_wait);
+ svc_rdma_wake_send_waiters(rdma, cc->cc_sqecount);

if (unlikely(wc->status != IB_WC_SUCCESS))
svc_xprt_deferred_close(&rdma->sc_xprt);
@@ -304,9 +303,7 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc)

trace_svcrdma_wc_read(wc, &cc->cc_cid);

- atomic_add(cc->cc_sqecount, &rdma->sc_sq_avail);
- wake_up(&rdma->sc_send_wait);
-
+ svc_rdma_wake_send_waiters(rdma, cc->cc_sqecount);
cc->cc_status = wc->status;
complete(&cc->cc_done);
return;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index d6bbafb773e1..fba2ee4eb607 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -258,6 +258,20 @@ void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
spin_unlock(&rdma->sc_send_lock);
}

+/**
+ * svc_rdma_wake_send_waiters - manage Send Queue accounting
+ * @rdma: controlling transport
+ * @avail: Number of additional SQEs that are now available
+ *
+ */
+void svc_rdma_wake_send_waiters(struct svcxprt_rdma *rdma, int avail)
+{
+ atomic_add(avail, &rdma->sc_sq_avail);
+ smp_mb__after_atomic();
+ if (unlikely(waitqueue_active(&rdma->sc_send_wait)))
+ wake_up(&rdma->sc_send_wait);
+}
+
/**
* svc_rdma_wc_send - Invoked by RDMA provider for each polled Send WC
* @cq: Completion Queue context
@@ -275,11 +289,9 @@ static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)

trace_svcrdma_wc_send(wc, &ctxt->sc_cid);

+ svc_rdma_wake_send_waiters(rdma, 1);
complete(&ctxt->sc_done);

- atomic_inc(&rdma->sc_sq_avail);
- wake_up(&rdma->sc_send_wait);
-
if (unlikely(wc->status != IB_WC_SUCCESS))
svc_xprt_deferred_close(&rdma->sc_xprt);
}
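
For readers less familiar with the waitqueue_active() idiom this patch adopts,
here is a minimal standalone sketch of the waker/waiter pairing. The demo_sq
names are hypothetical and only illustrate the pattern; the real code lives in
svc_rdma_sendto.c and svc_rdma_rw.c. The waiter re-checks the counter inside
wait_event(), which is what makes the unlocked waitqueue_active() test on the
waker side safe, as long as the waker orders its atomic_add() before the test:

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/wait.h>

struct demo_sq {
        atomic_t avail;                 /* free Send Queue entries */
        wait_queue_head_t wait;         /* senders sleep here when avail is 0 */
};

static void demo_sq_init(struct demo_sq *sq, int depth)
{
        atomic_set(&sq->avail, depth);
        init_waitqueue_head(&sq->wait);
}

/* Waiter side: runs in nfsd thread context and may sleep. */
static void demo_sq_reserve(struct demo_sq *sq)
{
        /* Claim an entry; if the count went negative, back out and
         * sleep until the completion handler releases entries.
         */
        while (atomic_dec_return(&sq->avail) < 0) {
                atomic_inc(&sq->avail);
                wait_event(sq->wait, atomic_read(&sq->avail) > 0);
        }
}

/* Waker side: runs in the single-threaded Send completion handler. */
static void demo_sq_release(struct demo_sq *sq, int freed)
{
        atomic_add(freed, &sq->avail);
        /* Publish the new avail value before testing for sleepers;
         * pairs with the barrier in prepare_to_wait().
         */
        smp_mb__after_atomic();
        if (unlikely(waitqueue_active(&sq->wait)))
                wake_up(&sq->wait);
}

In the common case the wait queue is empty, so demo_sq_release() costs one
atomic add, one barrier, and one read, with no IRQ-safe lock taken.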


2021-07-26 14:47:49

by Chuck Lever III

Subject: [PATCH v1 3/3] svcrdma: Convert rdma->sc_rw_ctxts to llist

Relieve contention on sc_rw_ctxt_lock by converting rdma->sc_rw_ctxts
to an llist.

The goal is to reduce the average overhead of Send completions,
because a transport's completion handlers are single-threaded on
one CPU core. This change reduces CPU utilization of each Send
completion by 2-3% on my server.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 2 +
net/sunrpc/xprtrdma/svc_rdma_rw.c | 49 +++++++++++++++++++++---------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +
3 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 5f8d5af6556c..24aa159d29a7 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -92,7 +92,7 @@ struct svcxprt_rdma {
spinlock_t sc_send_lock;
struct llist_head sc_send_ctxts;
spinlock_t sc_rw_ctxt_lock;
- struct list_head sc_rw_ctxts;
+ struct llist_head sc_rw_ctxts;

u32 sc_pending_recvs;
u32 sc_recv_batch;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 3d1b119f6e3e..e27433f08ca7 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -35,6 +35,7 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc);
* controlling svcxprt_rdma is destroyed.
*/
struct svc_rdma_rw_ctxt {
+ struct llist_node rw_node;
struct list_head rw_list;
struct rdma_rw_ctx rw_ctx;
unsigned int rw_nents;
@@ -53,19 +54,19 @@ static struct svc_rdma_rw_ctxt *
svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int sges)
{
struct svc_rdma_rw_ctxt *ctxt;
+ struct llist_node *node;

spin_lock(&rdma->sc_rw_ctxt_lock);
-
- ctxt = svc_rdma_next_ctxt(&rdma->sc_rw_ctxts);
- if (ctxt) {
- list_del(&ctxt->rw_list);
- spin_unlock(&rdma->sc_rw_ctxt_lock);
+ node = llist_del_first(&rdma->sc_rw_ctxts);
+ spin_unlock(&rdma->sc_rw_ctxt_lock);
+ if (node) {
+ ctxt = llist_entry(node, struct svc_rdma_rw_ctxt, rw_node);
} else {
- spin_unlock(&rdma->sc_rw_ctxt_lock);
ctxt = kmalloc(struct_size(ctxt, rw_first_sgl, SG_CHUNK_SIZE),
GFP_KERNEL);
if (!ctxt)
goto out_noctx;
+
INIT_LIST_HEAD(&ctxt->rw_list);
}

@@ -83,14 +84,18 @@ svc_rdma_get_rw_ctxt(struct svcxprt_rdma *rdma, unsigned int sges)
return NULL;
}

-static void svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
- struct svc_rdma_rw_ctxt *ctxt)
+static void __svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
+ struct svc_rdma_rw_ctxt *ctxt,
+ struct llist_head *list)
{
sg_free_table_chained(&ctxt->rw_sg_table, SG_CHUNK_SIZE);
+ llist_add(&ctxt->rw_node, list);
+}

- spin_lock(&rdma->sc_rw_ctxt_lock);
- list_add(&ctxt->rw_list, &rdma->sc_rw_ctxts);
- spin_unlock(&rdma->sc_rw_ctxt_lock);
+static void svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
+ struct svc_rdma_rw_ctxt *ctxt)
+{
+ __svc_rdma_put_rw_ctxt(rdma, ctxt, &rdma->sc_rw_ctxts);
}

/**
@@ -101,9 +106,10 @@ static void svc_rdma_put_rw_ctxt(struct svcxprt_rdma *rdma,
void svc_rdma_destroy_rw_ctxts(struct svcxprt_rdma *rdma)
{
struct svc_rdma_rw_ctxt *ctxt;
+ struct llist_node *node;

- while ((ctxt = svc_rdma_next_ctxt(&rdma->sc_rw_ctxts)) != NULL) {
- list_del(&ctxt->rw_list);
+ while ((node = llist_del_first(&rdma->sc_rw_ctxts)) != NULL) {
+ ctxt = llist_entry(node, struct svc_rdma_rw_ctxt, rw_node);
kfree(ctxt);
}
}
@@ -171,20 +177,35 @@ static void svc_rdma_cc_init(struct svcxprt_rdma *rdma,
cc->cc_sqecount = 0;
}

+/*
+ * The consumed rw_ctx's are cleaned and placed on a local llist so
+ * that only one atomic llist operation is needed to put them all
+ * back on the free list.
+ */
static void svc_rdma_cc_release(struct svc_rdma_chunk_ctxt *cc,
enum dma_data_direction dir)
{
struct svcxprt_rdma *rdma = cc->cc_rdma;
+ struct llist_node *first, *last;
struct svc_rdma_rw_ctxt *ctxt;
+ LLIST_HEAD(free);

+ first = last = NULL;
while ((ctxt = svc_rdma_next_ctxt(&cc->cc_rwctxts)) != NULL) {
list_del(&ctxt->rw_list);

rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
rdma->sc_port_num, ctxt->rw_sg_table.sgl,
ctxt->rw_nents, dir);
- svc_rdma_put_rw_ctxt(rdma, ctxt);
+ __svc_rdma_put_rw_ctxt(rdma, ctxt, &free);
+
+ ctxt->rw_node.next = first;
+ first = &ctxt->rw_node;
+ if (!last)
+ last = first;
}
+ if (first)
+ llist_add_batch(first, last, &rdma->sc_rw_ctxts);
}

/* State for sending a Write or Reply chunk.
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 99474078c304..d1faa522c3dd 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -138,7 +138,7 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
init_llist_head(&cma_xprt->sc_send_ctxts);
init_llist_head(&cma_xprt->sc_recv_ctxts);
- INIT_LIST_HEAD(&cma_xprt->sc_rw_ctxts);
+ init_llist_head(&cma_xprt->sc_rw_ctxts);
init_waitqueue_head(&cma_xprt->sc_send_wait);

spin_lock_init(&cma_xprt->sc_lock);
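
As a side note on the batching described in the comment above
svc_rdma_cc_release(): chaining the nodes locally and splicing them with one
llist_add_batch() call replaces one atomic operation per context with a single
atomic operation per release. A simplified sketch of that idea, using
hypothetical demo_* names rather than the svcrdma structures:

#include <linux/llist.h>

struct demo_item {
        struct llist_node node;
};

/* Return a batch of items to a shared free list with one atomic op. */
static void demo_release_batch(struct demo_item **items, int n,
                               struct llist_head *free_list)
{
        struct llist_node *first = NULL, *last = NULL;
        int i;

        for (i = 0; i < n; i++) {
                struct demo_item *item = items[i];

                /* Build a private chain; no atomics are needed because
                 * nothing else can see these nodes yet.
                 */
                item->node.next = first;
                first = &item->node;
                if (!last)
                        last = first;
        }
        if (first)
                llist_add_batch(first, last, free_list);
}

llist_add_batch() publishes the whole chain with one successful cmpxchg, so
the cost of releasing N contexts no longer scales with N atomic operations.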


2021-07-26 14:48:09

by Chuck Lever III

Subject: [PATCH v1 2/3] svcrdma: Relieve contention on sc_send_lock.

/proc/lock_stat indicates that the sc_send_lock is heavily
contended when the server is under load from a single client.

To address this, convert the send_ctxt free list to an llist.
Returning an item to the send_ctxt cache is now waitless, which
reduces the instruction path length in the single-threaded Send
handler (svc_rdma_wc_send).

The goal is to enable the ib_comp_wq worker to handle a higher
RPC/RDMA Send completion rate given the same CPU resources. This
change reduces CPU utilization of Send completion by 2-3% on my
server.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 4 ++--
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 23 ++++++++---------------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +-
3 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 57c60ffe76dd..5f8d5af6556c 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -90,7 +90,7 @@ struct svcxprt_rdma {
struct ib_pd *sc_pd;

spinlock_t sc_send_lock;
- struct list_head sc_send_ctxts;
+ struct llist_head sc_send_ctxts;
spinlock_t sc_rw_ctxt_lock;
struct list_head sc_rw_ctxts;

@@ -150,7 +150,7 @@ struct svc_rdma_recv_ctxt {
};

struct svc_rdma_send_ctxt {
- struct list_head sc_list;
+ struct llist_node sc_node;
struct rpc_rdma_cid sc_cid;

struct ib_send_wr sc_send_wr;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index fba2ee4eb607..599021b2391d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -113,13 +113,6 @@

static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc);

-static inline struct svc_rdma_send_ctxt *
-svc_rdma_next_send_ctxt(struct list_head *list)
-{
- return list_first_entry_or_null(list, struct svc_rdma_send_ctxt,
- sc_list);
-}
-
static void svc_rdma_send_cid_init(struct svcxprt_rdma *rdma,
struct rpc_rdma_cid *cid)
{
@@ -182,9 +175,10 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma)
{
struct svc_rdma_send_ctxt *ctxt;
+ struct llist_node *node;

- while ((ctxt = svc_rdma_next_send_ctxt(&rdma->sc_send_ctxts))) {
- list_del(&ctxt->sc_list);
+ while ((node = llist_del_first(&rdma->sc_send_ctxts)) != NULL) {
+ ctxt = llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
ib_dma_unmap_single(rdma->sc_pd->device,
ctxt->sc_sges[0].addr,
rdma->sc_max_req_size,
@@ -204,12 +198,13 @@ void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma)
struct svc_rdma_send_ctxt *svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
{
struct svc_rdma_send_ctxt *ctxt;
+ struct llist_node *node;

spin_lock(&rdma->sc_send_lock);
- ctxt = svc_rdma_next_send_ctxt(&rdma->sc_send_ctxts);
- if (!ctxt)
+ node = llist_del_first(&rdma->sc_send_ctxts);
+ if (!node)
goto out_empty;
- list_del(&ctxt->sc_list);
+ ctxt = llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
spin_unlock(&rdma->sc_send_lock);

out:
@@ -253,9 +248,7 @@ void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
ctxt->sc_sges[i].length);
}

- spin_lock(&rdma->sc_send_lock);
- list_add(&ctxt->sc_list, &rdma->sc_send_ctxts);
- spin_unlock(&rdma->sc_send_lock);
+ llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
}

/**
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index d94b7759ada1..99474078c304 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -136,7 +136,7 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
svc_xprt_init(net, &svc_rdma_class, &cma_xprt->sc_xprt, serv);
INIT_LIST_HEAD(&cma_xprt->sc_accept_q);
INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
- INIT_LIST_HEAD(&cma_xprt->sc_send_ctxts);
+ init_llist_head(&cma_xprt->sc_send_ctxts);
init_llist_head(&cma_xprt->sc_recv_ctxts);
INIT_LIST_HEAD(&cma_xprt->sc_rw_ctxts);
init_waitqueue_head(&cma_xprt->sc_send_wait);
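
For background on why the get path keeps sc_send_lock while the put path
becomes waitless: llist_add() is safe against any concurrent users, but
concurrent llist_del_first() callers must serialize among themselves to avoid
ABA problems. A small sketch of that split, with hypothetical demo_* names
rather than the svcrdma structures:

#include <linux/llist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct demo_ctxt {
        struct llist_node free_node;
        /* per-send resources would live here */
};

struct demo_pool {
        spinlock_t lock;                /* serializes llist_del_first() callers */
        struct llist_head free;
};

static void demo_pool_init(struct demo_pool *pool)
{
        spin_lock_init(&pool->lock);
        init_llist_head(&pool->free);
}

/* Hot path (completion handler): return a context without locking. */
static void demo_ctxt_put(struct demo_pool *pool, struct demo_ctxt *ctxt)
{
        llist_add(&ctxt->free_node, &pool->free);
}

/* Get path (nfsd thread): deleters must not race with each other. */
static struct demo_ctxt *demo_ctxt_get(struct demo_pool *pool)
{
        struct llist_node *node;

        spin_lock(&pool->lock);
        node = llist_del_first(&pool->free);
        spin_unlock(&pool->lock);
        if (!node)
                return kzalloc(sizeof(struct demo_ctxt), GFP_KERNEL);
        return llist_entry(node, struct demo_ctxt, free_node);
}

The svcrdma code follows the same split: svc_rdma_send_ctxt_put() now calls
llist_add() with no lock, while svc_rdma_send_ctxt_get() still takes
sc_send_lock around llist_del_first().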


2021-07-26 17:09:47

by Tom Talpey

Subject: Re: [PATCH v1 0/3] Optimize NFSD Send completion processing

On 7/26/2021 10:46 AM, Chuck Lever wrote:
> The following series improves the efficiency of NFSD's Send
> completion processing by removing costly operations from the svcrdma
> Send completion handlers. Each of these patches reduces the CPU
> utilized per RPC by Send completion by an average of 2-3%.

Nice gain. Are the improvements additive, i.e. 5-10% CPU reduction
for all three together?

Tom

2021-07-26 17:11:24

by Tom Talpey

Subject: Re: [PATCH v1 1/3] svcrdma: Fewer calls to wake_up() in Send completion handler

On 7/26/2021 10:46 AM, Chuck Lever wrote:
> /**
> * svc_rdma_wc_send - Invoked by RDMA provider for each polled Send WC
> * @cq: Completion Queue context
> @@ -275,11 +289,9 @@ static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
>
> trace_svcrdma_wc_send(wc, &ctxt->sc_cid);
>
> + svc_rdma_wake_send_waiters(rdma, 1);
> complete(&ctxt->sc_done);
>
> - atomic_inc(&rdma->sc_sq_avail);
> - wake_up(&rdma->sc_send_wait);

This appears to change the order of wake_up() vs complete().
Is that intentional? Is there any possibility of a false
scheduler activation, later leading to a second wakeup or poll?

Tom.

2021-07-26 17:53:51

by Chuck Lever III

Subject: Re: [PATCH v1 1/3] svcrdma: Fewer calls to wake_up() in Send completion handler



> On Jul 26, 2021, at 12:50 PM, Tom Talpey <[email protected]> wrote:
>
> On 7/26/2021 10:46 AM, Chuck Lever wrote:
>> /**
>> * svc_rdma_wc_send - Invoked by RDMA provider for each polled Send WC
>> * @cq: Completion Queue context
>> @@ -275,11 +289,9 @@ static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
>> trace_svcrdma_wc_send(wc, &ctxt->sc_cid);
>> + svc_rdma_wake_send_waiters(rdma, 1);
>> complete(&ctxt->sc_done);
>> - atomic_inc(&rdma->sc_sq_avail);
>> - wake_up(&rdma->sc_send_wait);
>
> This appears to change the order of wake_up() vs complete().
> Is that intentional?

IIRC I reversed the order here to be consistent with the other
Send completion handlers.


> Is there any possibility of a false
> scheduler activation, later leading to a second wakeup or poll?

The two "wake-ups" here are not related to each other, and RPC
Replies are transmitted already so this shouldn't have a direct
impact on server latency. But I might not understand your
question.


--
Chuck Lever



2021-07-26 17:53:51

by Chuck Lever III

Subject: Re: [PATCH v1 0/3] Optimize NFSD Send completion processing



> On Jul 26, 2021, at 12:52 PM, Tom Talpey <[email protected]> wrote:
>
> On 7/26/2021 10:46 AM, Chuck Lever wrote:
>> The following series improves the efficiency of NFSD's Send
>> completion processing by removing costly operations from the svcrdma
>> Send completion handlers. Each of these patches reduces the CPU
>> utilized per RPC by Send completion by an average of 2-3%.
>
> Nice gain. Are the improvements additive, i.e. 5-10% CPU reduction
> for all three together?

Yes, the improvements are additive.

--
Chuck Lever



2021-07-26 20:30:58

by Tom Talpey

Subject: Re: [PATCH v1 1/3] svcrdma: Fewer calls to wake_up() in Send completion handler

On 7/26/2021 1:53 PM, Chuck Lever III wrote:
>
>
>> On Jul 26, 2021, at 12:50 PM, Tom Talpey <[email protected]> wrote:
>>
>> On 7/26/2021 10:46 AM, Chuck Lever wrote:
>>> /**
>>> * svc_rdma_wc_send - Invoked by RDMA provider for each polled Send WC
>>> * @cq: Completion Queue context
>>> @@ -275,11 +289,9 @@ static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
>>> trace_svcrdma_wc_send(wc, &ctxt->sc_cid);
>>> + svc_rdma_wake_send_waiters(rdma, 1);
>>> complete(&ctxt->sc_done);
>>> - atomic_inc(&rdma->sc_sq_avail);
>>> - wake_up(&rdma->sc_send_wait);
>>
>> This appears to change the order of wake_up() vs complete().
>> Is that intentional?
>
> IIRC I reversed the order here to be consistent with the other
> Send completion handlers.
>
>
>> Is there any possibility of a false
>> scheduler activation, later leading to a second wakeup or poll?
>
> The two "wake-ups" here are not related to each other, and RPC
> Replies are transmitted already so this shouldn't have a direct
> impact on server latency. But I might not understand your
> question.

IIUC, you're saying that the thread which is awaiting the
completion of ctxt->sc_done is not also waiting to send
anything, therefore no thread is going to experience a
fire drill. Ok.

Feel free to add my
Reviewed-By: Tom Talpey <[email protected]>
to the series.

Tom.