2023-07-04 00:13:31

by Chuck Lever

Subject: [PATCH v2 0/9] SUNRPC service thread scheduler optimizations

Walking a linked list to find an idle thread is not CPU cache-
friendly, and in fact I've noted palpable per-request latency
impacts as the number of nfsd threads on the server increases.

After discussing some possible improvements with Jeff at LSF/MM,
I've been experimenting with the following series. I've measured an
order of magnitude latency improvement in the thread lookup time,
and have managed to keep the whole thing lockless.

After some offline discussion with Neil, I tried out using just
the xarray plus its spinlock to mark threads idle or busy. This
worked as well as the bitmap for lower thread counts, but got
predictably bad as the thread count went past several hundred. For
the moment I'm sticking with the wait-free bitmap lookup.

Also, the maximum thread count is now 4096. I'm still willing to try
an RCU-based bitmap resizing mechanism if we believe this is still
too small a maximum.


Changes since RFC:
* Add a counter for ingress RPC messages
* Add a few documenting comments
* Move the more controversial patches to the end of the series
* Clarify the refactoring of svc_wake_up()
* Increase the value of RPCSVC_MAXPOOLTHREADS to 4096 (and tested with that many threads)
* Optimize the loop in svc_pool_wake_idle_thread()
* Optimize marking a thread "idle" in svc_get_next_xprt()

---

Chuck Lever (9):
SUNRPC: Deduplicate thread wake-up code
SUNRPC: Report when no service thread is available.
SUNRPC: Split the svc_xprt_dequeue tracepoint
SUNRPC: Count ingress RPC messages per svc_pool
SUNRPC: Clean up svc_set_num_threads
SUNRPC: Replace dprintk() call site in __svc_create()
SUNRPC: Don't disable BH's when taking sp_lock
SUNRPC: Replace sp_threads_all with an xarray
SUNRPC: Convert RQ_BUSY into a per-pool bitmap


fs/nfsd/nfssvc.c | 3 +-
include/linux/sunrpc/svc.h | 18 ++--
include/trace/events/sunrpc.h | 159 +++++++++++++++++++++++++++----
net/sunrpc/svc.c | 174 ++++++++++++++++++++++------------
net/sunrpc/svc_xprt.c | 114 +++++++++++-----------
5 files changed, 328 insertions(+), 140 deletions(-)

--
Chuck Lever



2023-07-04 00:13:45

by Chuck Lever

Subject: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

From: Chuck Lever <[email protected]>

I've noticed that client-observed server request latency goes up
simply when the nfsd thread count is increased.

List walking is known to be memory-inefficient. On a busy server
with many threads, enqueuing a transport will walk the "all threads"
list quite frequently. This also pulls in the cache lines for some
hot fields in each svc_rqst (namely, rq_flags).

The svc_xprt_enqueue() call that concerns me most is the one in
svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
down completion handling limits the total throughput per RDMA
connection.

So, avoid walking the "all threads" list to find an idle thread to
wake. Instead, set up an idle bitmap and use find_next_bit, which
should work the same way as RQ_BUSY but it will touch only the
cachelines that the bitmap is in. Stick with atomic bit operations
to avoid taking the pool lock.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc.h | 6 ++++--
include/trace/events/sunrpc.h | 1 -
net/sunrpc/svc.c | 27 +++++++++++++++++++++------
net/sunrpc/svc_xprt.c | 30 ++++++++++++++++++++++++------
4 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 6f8bfcd44250..27ffcf7371d0 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -35,6 +35,7 @@ struct svc_pool {
spinlock_t sp_lock; /* protects sp_sockets */
struct list_head sp_sockets; /* pending sockets */
unsigned int sp_nrthreads; /* # of threads in pool */
+ unsigned long *sp_idle_map; /* idle threads */
struct xarray sp_thread_xa;

/* statistics on pool operation */
@@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
#define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
+ 2 + 1)

+#define RPCSVC_MAXPOOLTHREADS (4096)
+
/*
* The context of a single thread, including the request currently being
* processed.
@@ -239,8 +242,7 @@ struct svc_rqst {
#define RQ_SPLICE_OK (4) /* turned off in gss privacy
* to prevent encrypting page
* cache pages */
-#define RQ_BUSY (5) /* request is busy */
-#define RQ_DATA (6) /* request has data */
+#define RQ_DATA (5) /* request has data */
unsigned long rq_flags; /* flags field */
u32 rq_thread_id; /* xarray index */
ktime_t rq_qtime; /* enqueue time */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index ea43c6059bdb..c07824a254bf 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
svc_rqst_flag(USEDEFERRAL) \
svc_rqst_flag(DROPME) \
svc_rqst_flag(SPLICE_OK) \
- svc_rqst_flag(BUSY) \
svc_rqst_flag_end(DATA)

#undef svc_rqst_flag
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ef350f0d8925..d0278e5190ba 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
INIT_LIST_HEAD(&pool->sp_sockets);
spin_lock_init(&pool->sp_lock);
xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
+ /* All threads initially marked "busy" */
+ pool->sp_idle_map =
+ bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
+ svc_pool_map_get_node(i));
+ if (!pool->sp_idle_map)
+ return NULL;

percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
@@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
percpu_counter_destroy(&pool->sp_threads_starved);

xa_destroy(&pool->sp_thread_xa);
+ bitmap_free(pool->sp_idle_map);
+ pool->sp_idle_map = NULL;
}
kfree(serv->sv_pools);
kfree(serv);
@@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)

folio_batch_init(&rqstp->rq_fbatch);

- __set_bit(RQ_BUSY, &rqstp->rq_flags);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;

@@ -677,7 +684,7 @@ static struct svc_rqst *
svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
{
static const struct xa_limit limit = {
- .max = U32_MAX,
+ .max = RPCSVC_MAXPOOLTHREADS,
};
struct svc_rqst *rqstp;
int ret;
@@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
struct svc_pool *pool)
{
struct svc_rqst *rqstp;
- unsigned long index;
+ unsigned long bit;

- xa_for_each(&pool->sp_thread_xa, index, rqstp) {
- if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
+ /* Check the pool's idle bitmap locklessly so that multiple
+ * idle searches can proceed concurrently.
+ */
+ for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
+ if (!test_and_clear_bit(bit, pool->sp_idle_map))
continue;

+ rqstp = xa_load(&pool->sp_thread_xa, bit);
+ if (!rqstp)
+ break;
+
WRITE_ONCE(rqstp->rq_qtime, ktime_get());
wake_up_process(rqstp->rq_task);
percpu_counter_inc(&pool->sp_threads_woken);
@@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
}

found_pool:
- rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
+ rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
+ XA_PRESENT);
if (rqstp) {
__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
task = rqstp->rq_task;
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 7709120b45c1..2844b32c16ea 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
return true;
}

+static void svc_pool_thread_mark_idle(struct svc_pool *pool,
+ struct svc_rqst *rqstp)
+{
+ smp_mb__before_atomic();
+ set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
+ smp_mb__after_atomic();
+}
+
+/*
+ * Note: If we were awoken, then this rqstp has already been marked busy.
+ */
+static void svc_pool_thread_mark_busy(struct svc_pool *pool,
+ struct svc_rqst *rqstp)
+{
+ smp_mb__before_atomic();
+ clear_bit(rqstp->rq_thread_id, pool->sp_idle_map);
+ smp_mb__after_atomic();
+}
+
static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
{
struct svc_pool *pool = rqstp->rq_pool;
@@ -756,18 +775,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
set_current_state(TASK_INTERRUPTIBLE);
smp_mb__before_atomic();
clear_bit(SP_CONGESTED, &pool->sp_flags);
- clear_bit(RQ_BUSY, &rqstp->rq_flags);
- smp_mb__after_atomic();

- if (likely(rqst_should_sleep(rqstp)))
+ if (likely(rqst_should_sleep(rqstp))) {
+ svc_pool_thread_mark_idle(pool, rqstp);
time_left = schedule_timeout(timeout);
- else
+ } else
__set_current_state(TASK_RUNNING);

try_to_freeze();

- set_bit(RQ_BUSY, &rqstp->rq_flags);
- smp_mb__after_atomic();
+ svc_pool_thread_mark_busy(pool, rqstp);
+
rqstp->rq_xprt = svc_xprt_dequeue(pool);
if (rqstp->rq_xprt) {
trace_svc_pool_awoken(rqstp);



2023-07-04 01:40:54

by NeilBrown

Subject: Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

On Tue, 04 Jul 2023, Chuck Lever wrote:
> From: Chuck Lever <[email protected]>
>
> I've noticed that client-observed server request latency goes up
> simply when the nfsd thread count is increased.
>
> List walking is known to be memory-inefficient. On a busy server
> with many threads, enqueuing a transport will walk the "all threads"
> list quite frequently. This also pulls in the cache lines for some
> hot fields in each svc_rqst (namely, rq_flags).

I think this text could usefully be re-written. By this point in the
series we aren't list walking.

I'd also be curious to know what latency difference you get for just this
change.

>
> The svc_xprt_enqueue() call that concerns me most is the one in
> svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> down completion handling limits the total throughput per RDMA
> connection.
>
> So, avoid walking the "all threads" list to find an idle thread to
> wake. Instead, set up an idle bitmap and use find_next_bit, which
> should work the same way as RQ_BUSY but it will touch only the
> cachelines that the bitmap is in. Stick with atomic bit operations
> to avoid taking the pool lock.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc.h | 6 ++++--
> include/trace/events/sunrpc.h | 1 -
> net/sunrpc/svc.c | 27 +++++++++++++++++++++------
> net/sunrpc/svc_xprt.c | 30 ++++++++++++++++++++++++------
> 4 files changed, 49 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 6f8bfcd44250..27ffcf7371d0 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -35,6 +35,7 @@ struct svc_pool {
> spinlock_t sp_lock; /* protects sp_sockets */
> struct list_head sp_sockets; /* pending sockets */
> unsigned int sp_nrthreads; /* # of threads in pool */
> + unsigned long *sp_idle_map; /* idle threads */
> struct xarray sp_thread_xa;
>
> /* statistics on pool operation */
> @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> + 2 + 1)
>
> +#define RPCSVC_MAXPOOLTHREADS (4096)
> +
> /*
> * The context of a single thread, including the request currently being
> * processed.
> @@ -239,8 +242,7 @@ struct svc_rqst {
> #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> * to prevent encrypting page
> * cache pages */
> -#define RQ_BUSY (5) /* request is busy */
> -#define RQ_DATA (6) /* request has data */
> +#define RQ_DATA (5) /* request has data */

Might this be a good opportunity to convert this to an enum ??

> unsigned long rq_flags; /* flags field */
> u32 rq_thread_id; /* xarray index */
> ktime_t rq_qtime; /* enqueue time */
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index ea43c6059bdb..c07824a254bf 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> svc_rqst_flag(USEDEFERRAL) \
> svc_rqst_flag(DROPME) \
> svc_rqst_flag(SPLICE_OK) \
> - svc_rqst_flag(BUSY) \
> svc_rqst_flag_end(DATA)
>
> #undef svc_rqst_flag
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index ef350f0d8925..d0278e5190ba 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> INIT_LIST_HEAD(&pool->sp_sockets);
> spin_lock_init(&pool->sp_lock);
> xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> + /* All threads initially marked "busy" */
> + pool->sp_idle_map =
> + bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> + svc_pool_map_get_node(i));
> + if (!pool->sp_idle_map)
> + return NULL;
>
> percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
> percpu_counter_destroy(&pool->sp_threads_starved);
>
> xa_destroy(&pool->sp_thread_xa);
> + bitmap_free(pool->sp_idle_map);
> + pool->sp_idle_map = NULL;
> }
> kfree(serv->sv_pools);
> kfree(serv);
> @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
>
> folio_batch_init(&rqstp->rq_fbatch);
>
> - __set_bit(RQ_BUSY, &rqstp->rq_flags);
> rqstp->rq_server = serv;
> rqstp->rq_pool = pool;
>
> @@ -677,7 +684,7 @@ static struct svc_rqst *
> svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> {
> static const struct xa_limit limit = {
> - .max = U32_MAX,
> + .max = RPCSVC_MAXPOOLTHREADS,
> };
> struct svc_rqst *rqstp;
> int ret;
> @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> struct svc_pool *pool)
> {
> struct svc_rqst *rqstp;
> - unsigned long index;
> + unsigned long bit;
>
> - xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> - if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> + /* Check the pool's idle bitmap locklessly so that multiple
> + * idle searches can proceed concurrently.
> + */
> + for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
> + if (!test_and_clear_bit(bit, pool->sp_idle_map))
> continue;

I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
and so get all the required memory barriers.
What we are doing here is locking a particular thread for a task, so
"lock" is an appropriate description of what is happening.
See also svc_pool_thread_mark_* below.
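
Roughly, the lookup loop would then be something like (an untested
sketch, just to show the shape):

	for_each_clear_bit(bit, pool->sp_busy_map, pool->sp_nrthreads) {
		/* a non-zero return means another waker claimed
		 * this thread first */
		if (test_and_set_bit_lock(bit, pool->sp_busy_map))
			continue;
		rqstp = xa_load(&pool->sp_thread_xa, bit);
		...
	}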

>
> + rqstp = xa_load(&pool->sp_thread_xa, bit);
> + if (!rqstp)
> + break;
> +
> WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> wake_up_process(rqstp->rq_task);
> percpu_counter_inc(&pool->sp_threads_woken);
> @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
> }
>
> found_pool:
> - rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> + rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
> + XA_PRESENT);
> if (rqstp) {
> __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> task = rqstp->rq_task;
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 7709120b45c1..2844b32c16ea 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> return true;
> }
>
> +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> + struct svc_rqst *rqstp)
> +{
> + smp_mb__before_atomic();
> + set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> + smp_mb__after_atomic();
> +}

The memory barriers above and below bother me. There is no comment
telling me what they are protecting against.
I would rather svc_pool_thread_mark_idle - which unlocks the thread -
were

clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);

and that svc_pool_thread_mark_busy were

test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);

Then it would be more obvious what was happening.
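
i.e. something like this (a sketch only, reusing the sp_busy_map idea
from above):

static void svc_pool_thread_mark_idle(struct svc_pool *pool,
				      struct svc_rqst *rqstp)
{
	/* release: everything this thread wrote while busy is
	 * visible before it can be found idle */
	clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
}

static void svc_pool_thread_mark_busy(struct svc_pool *pool,
				      struct svc_rqst *rqstp)
{
	/* acquire: nothing the thread does while busy can be
	 * reordered before this point; if we were woken, the waker
	 * already set the bit and this is a no-op */
	test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
}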

Thanks,
NeilBrown

> +
> +/*
> + * Note: If we were awoken, then this rqstp has already been marked busy.
> + */
> +static void svc_pool_thread_mark_busy(struct svc_pool *pool,
> + struct svc_rqst *rqstp)
> +{
> + smp_mb__before_atomic();
> + clear_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> + smp_mb__after_atomic();
> +}
> +
> static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> {
> struct svc_pool *pool = rqstp->rq_pool;
> @@ -756,18 +775,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> set_current_state(TASK_INTERRUPTIBLE);
> smp_mb__before_atomic();
> clear_bit(SP_CONGESTED, &pool->sp_flags);
> - clear_bit(RQ_BUSY, &rqstp->rq_flags);
> - smp_mb__after_atomic();
>
> - if (likely(rqst_should_sleep(rqstp)))
> + if (likely(rqst_should_sleep(rqstp))) {
> + svc_pool_thread_mark_idle(pool, rqstp);
> time_left = schedule_timeout(timeout);
> - else
> + } else
> __set_current_state(TASK_RUNNING);
>
> try_to_freeze();
>
> - set_bit(RQ_BUSY, &rqstp->rq_flags);
> - smp_mb__after_atomic();
> + svc_pool_thread_mark_busy(pool, rqstp);
> +
> rqstp->rq_xprt = svc_xprt_dequeue(pool);
> if (rqstp->rq_xprt) {
> trace_svc_pool_awoken(rqstp);
>
>
>


2023-07-04 02:03:57

by Chuck Lever

Subject: Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> On Tue, 04 Jul 2023, Chuck Lever wrote:
> > From: Chuck Lever <[email protected]>
> >
> > I've noticed that client-observed server request latency goes up
> > simply when the nfsd thread count is increased.
> >
> > List walking is known to be memory-inefficient. On a busy server
> > with many threads, enqueuing a transport will walk the "all threads"
> > list quite frequently. This also pulls in the cache lines for some
> > hot fields in each svc_rqst (namely, rq_flags).
>
> I think this text could usefully be re-written. By this point in the
> series we aren't list walking.
>
> I'd also be curious to know what latency different you get for just this
> change.

Not much of a latency difference at lower thread counts.

The difference I notice is that with the spinlock version of
pool_wake_idle_thread, there is significant lock contention as
the thread count increases, and the throughput result of my fio
test is lower (outside the result variance).


> > The svc_xprt_enqueue() call that concerns me most is the one in
> > svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> > down completion handling limits the total throughput per RDMA
> > connection.
> >
> > So, avoid walking the "all threads" list to find an idle thread to
> > wake. Instead, set up an idle bitmap and use find_next_bit, which
> > should work the same way as RQ_BUSY but it will touch only the
> > cachelines that the bitmap is in. Stick with atomic bit operations
> > to avoid taking the pool lock.
> >
> > Signed-off-by: Chuck Lever <[email protected]>
> > ---
> > include/linux/sunrpc/svc.h | 6 ++++--
> > include/trace/events/sunrpc.h | 1 -
> > net/sunrpc/svc.c | 27 +++++++++++++++++++++------
> > net/sunrpc/svc_xprt.c | 30 ++++++++++++++++++++++++------
> > 4 files changed, 49 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 6f8bfcd44250..27ffcf7371d0 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -35,6 +35,7 @@ struct svc_pool {
> > spinlock_t sp_lock; /* protects sp_sockets */
> > struct list_head sp_sockets; /* pending sockets */
> > unsigned int sp_nrthreads; /* # of threads in pool */
> > + unsigned long *sp_idle_map; /* idle threads */
> > struct xarray sp_thread_xa;
> >
> > /* statistics on pool operation */
> > @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> > + 2 + 1)
> >
> > +#define RPCSVC_MAXPOOLTHREADS (4096)
> > +
> > /*
> > * The context of a single thread, including the request currently being
> > * processed.
> > @@ -239,8 +242,7 @@ struct svc_rqst {
> > #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> > * to prevent encrypting page
> > * cache pages */
> > -#define RQ_BUSY (5) /* request is busy */
> > -#define RQ_DATA (6) /* request has data */
> > +#define RQ_DATA (5) /* request has data */
>
> Might this be a good opportunity to convert this to an enum ??
>
> > unsigned long rq_flags; /* flags field */
> > u32 rq_thread_id; /* xarray index */
> > ktime_t rq_qtime; /* enqueue time */
> > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > index ea43c6059bdb..c07824a254bf 100644
> > --- a/include/trace/events/sunrpc.h
> > +++ b/include/trace/events/sunrpc.h
> > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> > svc_rqst_flag(USEDEFERRAL) \
> > svc_rqst_flag(DROPME) \
> > svc_rqst_flag(SPLICE_OK) \
> > - svc_rqst_flag(BUSY) \
> > svc_rqst_flag_end(DATA)
> >
> > #undef svc_rqst_flag
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index ef350f0d8925..d0278e5190ba 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> > INIT_LIST_HEAD(&pool->sp_sockets);
> > spin_lock_init(&pool->sp_lock);
> > xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > + /* All threads initially marked "busy" */
> > + pool->sp_idle_map =
> > + bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> > + svc_pool_map_get_node(i));
> > + if (!pool->sp_idle_map)
> > + return NULL;
> >
> > percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
> > percpu_counter_destroy(&pool->sp_threads_starved);
> >
> > xa_destroy(&pool->sp_thread_xa);
> > + bitmap_free(pool->sp_idle_map);
> > + pool->sp_idle_map = NULL;
> > }
> > kfree(serv->sv_pools);
> > kfree(serv);
> > @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
> >
> > folio_batch_init(&rqstp->rq_fbatch);
> >
> > - __set_bit(RQ_BUSY, &rqstp->rq_flags);
> > rqstp->rq_server = serv;
> > rqstp->rq_pool = pool;
> >
> > @@ -677,7 +684,7 @@ static struct svc_rqst *
> > svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > {
> > static const struct xa_limit limit = {
> > - .max = U32_MAX,
> > + .max = RPCSVC_MAXPOOLTHREADS,
> > };
> > struct svc_rqst *rqstp;
> > int ret;
> > @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > struct svc_pool *pool)
> > {
> > struct svc_rqst *rqstp;
> > - unsigned long index;
> > + unsigned long bit;
> >
> > - xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > - if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > + /* Check the pool's idle bitmap locklessly so that multiple
> > + * idle searches can proceed concurrently.
> > + */
> > + for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
> > + if (!test_and_clear_bit(bit, pool->sp_idle_map))
> > continue;
>
> I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
> Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
> and so get all the required memory barriers.
> What we are doing here is locking a particular thread for a task, so
> "lock" is an appropriate description of what is happening.
> See also svc_pool_thread_mark_* below.
>
> >
> > + rqstp = xa_load(&pool->sp_thread_xa, bit);
> > + if (!rqstp)
> > + break;
> > +
> > WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> > wake_up_process(rqstp->rq_task);
> > percpu_counter_inc(&pool->sp_threads_woken);
> > @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
> > }
> >
> > found_pool:
> > - rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> > + rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
> > + XA_PRESENT);
> > if (rqstp) {
> > __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > task = rqstp->rq_task;
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index 7709120b45c1..2844b32c16ea 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> > return true;
> > }
> >
> > +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> > + struct svc_rqst *rqstp)
> > +{
> > + smp_mb__before_atomic();
> > + set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> > + smp_mb__after_atomic();
> > +}
>
> There memory barriers above and below bother me. There is no comment
> telling me what they are protecting against.
> I would rather svc_pool_thread_mark_idle - which unlocks the thread -
> were
>
> clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
>
> and that svc_pool_thread_mark_busy were
>
> test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
>
> Then it would be more obvious what was happening.

Not obvious to me, but that's very likely because I'm not clear what
clear_bit_unlock() does. :-)

I'll try this change for the next version of the series.


> Thanks,
> NeilBrown
>
> > +
> > +/*
> > + * Note: If we were awoken, then this rqstp has already been marked busy.
> > + */
> > +static void svc_pool_thread_mark_busy(struct svc_pool *pool,
> > + struct svc_rqst *rqstp)
> > +{
> > + smp_mb__before_atomic();
> > + clear_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> > + smp_mb__after_atomic();
> > +}
> > +
> > static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> > {
> > struct svc_pool *pool = rqstp->rq_pool;
> > @@ -756,18 +775,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> > set_current_state(TASK_INTERRUPTIBLE);
> > smp_mb__before_atomic();
> > clear_bit(SP_CONGESTED, &pool->sp_flags);
> > - clear_bit(RQ_BUSY, &rqstp->rq_flags);
> > - smp_mb__after_atomic();
> >
> > - if (likely(rqst_should_sleep(rqstp)))
> > + if (likely(rqst_should_sleep(rqstp))) {
> > + svc_pool_thread_mark_idle(pool, rqstp);
> > time_left = schedule_timeout(timeout);
> > - else
> > + } else
> > __set_current_state(TASK_RUNNING);
> >
> > try_to_freeze();
> >
> > - set_bit(RQ_BUSY, &rqstp->rq_flags);
> > - smp_mb__after_atomic();
> > + svc_pool_thread_mark_busy(pool, rqstp);
> > +
> > rqstp->rq_xprt = svc_xprt_dequeue(pool);
> > if (rqstp->rq_xprt) {
> > trace_svc_pool_awoken(rqstp);
> >
> >
> >
>

2023-07-04 02:22:41

by NeilBrown

Subject: Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

On Tue, 04 Jul 2023, Chuck Lever wrote:
> On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> > On Tue, 04 Jul 2023, Chuck Lever wrote:
> > > From: Chuck Lever <[email protected]>
> > >
> > > I've noticed that client-observed server request latency goes up
> > > simply when the nfsd thread count is increased.
> > >
> > > List walking is known to be memory-inefficient. On a busy server
> > > with many threads, enqueuing a transport will walk the "all threads"
> > > list quite frequently. This also pulls in the cache lines for some
> > > hot fields in each svc_rqst (namely, rq_flags).
> >
> > I think this text could usefully be re-written. By this point in the
> > series we aren't list walking.
> >
> > I'd also be curious to know what latency different you get for just this
> > change.
>
> Not much of a latency difference at lower thread counts.
>
> The difference I notice is that with the spinlock version of
> pool_wake_idle_thread, there is significant lock contention as
> the thread count increases, and the throughput result of my fio
> test is lower (outside the result variance).
>
>
> > > The svc_xprt_enqueue() call that concerns me most is the one in
> > > svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> > > down completion handling limits the total throughput per RDMA
> > > connection.
> > >
> > > So, avoid walking the "all threads" list to find an idle thread to
> > > wake. Instead, set up an idle bitmap and use find_next_bit, which
> > > should work the same way as RQ_BUSY but it will touch only the
> > > cachelines that the bitmap is in. Stick with atomic bit operations
> > > to avoid taking the pool lock.
> > >
> > > Signed-off-by: Chuck Lever <[email protected]>
> > > ---
> > > include/linux/sunrpc/svc.h | 6 ++++--
> > > include/trace/events/sunrpc.h | 1 -
> > > net/sunrpc/svc.c | 27 +++++++++++++++++++++------
> > > net/sunrpc/svc_xprt.c | 30 ++++++++++++++++++++++++------
> > > 4 files changed, 49 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > index 6f8bfcd44250..27ffcf7371d0 100644
> > > --- a/include/linux/sunrpc/svc.h
> > > +++ b/include/linux/sunrpc/svc.h
> > > @@ -35,6 +35,7 @@ struct svc_pool {
> > > spinlock_t sp_lock; /* protects sp_sockets */
> > > struct list_head sp_sockets; /* pending sockets */
> > > unsigned int sp_nrthreads; /* # of threads in pool */
> > > + unsigned long *sp_idle_map; /* idle threads */
> > > struct xarray sp_thread_xa;
> > >
> > > /* statistics on pool operation */
> > > @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > > #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> > > + 2 + 1)
> > >
> > > +#define RPCSVC_MAXPOOLTHREADS (4096)
> > > +
> > > /*
> > > * The context of a single thread, including the request currently being
> > > * processed.
> > > @@ -239,8 +242,7 @@ struct svc_rqst {
> > > #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> > > * to prevent encrypting page
> > > * cache pages */
> > > -#define RQ_BUSY (5) /* request is busy */
> > > -#define RQ_DATA (6) /* request has data */
> > > +#define RQ_DATA (5) /* request has data */
> >
> > Might this be a good opportunity to convert this to an enum ??
> >
> > > unsigned long rq_flags; /* flags field */
> > > u32 rq_thread_id; /* xarray index */
> > > ktime_t rq_qtime; /* enqueue time */
> > > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > > index ea43c6059bdb..c07824a254bf 100644
> > > --- a/include/trace/events/sunrpc.h
> > > +++ b/include/trace/events/sunrpc.h
> > > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> > > svc_rqst_flag(USEDEFERRAL) \
> > > svc_rqst_flag(DROPME) \
> > > svc_rqst_flag(SPLICE_OK) \
> > > - svc_rqst_flag(BUSY) \
> > > svc_rqst_flag_end(DATA)
> > >
> > > #undef svc_rqst_flag
> > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > > index ef350f0d8925..d0278e5190ba 100644
> > > --- a/net/sunrpc/svc.c
> > > +++ b/net/sunrpc/svc.c
> > > @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> > > INIT_LIST_HEAD(&pool->sp_sockets);
> > > spin_lock_init(&pool->sp_lock);
> > > xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > > + /* All threads initially marked "busy" */
> > > + pool->sp_idle_map =
> > > + bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> > > + svc_pool_map_get_node(i));
> > > + if (!pool->sp_idle_map)
> > > + return NULL;
> > >
> > > percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > > percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > > @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
> > > percpu_counter_destroy(&pool->sp_threads_starved);
> > >
> > > xa_destroy(&pool->sp_thread_xa);
> > > + bitmap_free(pool->sp_idle_map);
> > > + pool->sp_idle_map = NULL;
> > > }
> > > kfree(serv->sv_pools);
> > > kfree(serv);
> > > @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
> > >
> > > folio_batch_init(&rqstp->rq_fbatch);
> > >
> > > - __set_bit(RQ_BUSY, &rqstp->rq_flags);
> > > rqstp->rq_server = serv;
> > > rqstp->rq_pool = pool;
> > >
> > > @@ -677,7 +684,7 @@ static struct svc_rqst *
> > > svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > {
> > > static const struct xa_limit limit = {
> > > - .max = U32_MAX,
> > > + .max = RPCSVC_MAXPOOLTHREADS,
> > > };
> > > struct svc_rqst *rqstp;
> > > int ret;
> > > @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > > struct svc_pool *pool)
> > > {
> > > struct svc_rqst *rqstp;
> > > - unsigned long index;
> > > + unsigned long bit;
> > >
> > > - xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > > - if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > > + /* Check the pool's idle bitmap locklessly so that multiple
> > > + * idle searches can proceed concurrently.
> > > + */
> > > + for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
> > > + if (!test_and_clear_bit(bit, pool->sp_idle_map))
> > > continue;
> >
> > I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
> > Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
> > and so get all the required memory barriers.
> > What we are doing here is locking a particular thread for a task, so
> > "lock" is an appropriate description of what is happening.
> > See also svc_pool_thread_mark_* below.
> >
> > >
> > > + rqstp = xa_load(&pool->sp_thread_xa, bit);
> > > + if (!rqstp)
> > > + break;
> > > +
> > > WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> > > wake_up_process(rqstp->rq_task);
> > > percpu_counter_inc(&pool->sp_threads_woken);
> > > @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
> > > }
> > >
> > > found_pool:
> > > - rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> > > + rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
> > > + XA_PRESENT);
> > > if (rqstp) {
> > > __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > > task = rqstp->rq_task;
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index 7709120b45c1..2844b32c16ea 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> > > return true;
> > > }
> > >
> > > +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> > > + struct svc_rqst *rqstp)
> > > +{
> > > + smp_mb__before_atomic();
> > > + set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> > > + smp_mb__after_atomic();
> > > +}
> >
> > There memory barriers above and below bother me. There is no comment
> > telling me what they are protecting against.
> > I would rather svc_pool_thread_mark_idle - which unlocks the thread -
> > were
> >
> > clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
> >
> > and that svc_pool_thread_mark_busy were
> >
> > test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
> >
> > Then it would be more obvious what was happening.
>
> Not obvious to me, but that's very likely because I'm not clear what
> clear_bit_unlock() does. :-)

In general, any "lock" operation (mutex, spin, whatever) is (and must
be) and "acquire" type operations which imposes a memory barrier so that
read requests *after* the lock cannot be satisfied with data from
*before* the lock. The read must access data after the lock.
Conversely any "unlock" operations is a "release" type operation which
imposes a memory barrier so that any write request *before* the unlock
must not be delayed until *after* the unlock. The write must complete
before the unlock.

This is exactly what you would expect of locking - it creates a closed
code region that is properly ordered w.r.t comparable closed regions.

test_and_set_bit_lock() and clear_bit_unlock() provide these expected
semantics for bit operations.
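
For example (the names here are made up, just to show the pairing):

	/* thread A: publish the data, then release the bit */
	shared->result = compute_result();
	clear_bit_unlock(MY_BIT, &shared->flags);

	/* thread B: acquire the bit, then read the data */
	if (!test_and_set_bit_lock(MY_BIT, &shared->flags))
		use(shared->result);	/* guaranteed to see A's store */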

New code should (almost?) never have explicit memory barriers like
smp_mb__after_atomic().
It should use one of the many APIs with _acquire or _release suffixes,
or with the more explicit _lock or _unlock.

>
> I'll try this change for the next version of the series.
>

Thanks.

NeilBrown

2023-07-04 15:18:54

by Chuck Lever III

Subject: Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap



> On Jul 3, 2023, at 10:17 PM, NeilBrown <[email protected]> wrote:
>
> On Tue, 04 Jul 2023, Chuck Lever wrote:
>> On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
>>> On Tue, 04 Jul 2023, Chuck Lever wrote:
>>>> From: Chuck Lever <[email protected]>
>>>>
>>>> I've noticed that client-observed server request latency goes up
>>>> simply when the nfsd thread count is increased.
>>>>
>>>> List walking is known to be memory-inefficient. On a busy server
>>>> with many threads, enqueuing a transport will walk the "all threads"
>>>> list quite frequently. This also pulls in the cache lines for some
>>>> hot fields in each svc_rqst (namely, rq_flags).
>>>
>>> I think this text could usefully be re-written. By this point in the
>>> series we aren't list walking.
>>>
>>> I'd also be curious to know what latency different you get for just this
>>> change.
>>
>> Not much of a latency difference at lower thread counts.
>>
>> The difference I notice is that with the spinlock version of
>> pool_wake_idle_thread, there is significant lock contention as
>> the thread count increases, and the throughput result of my fio
>> test is lower (outside the result variance).
>>
>>
>>>> The svc_xprt_enqueue() call that concerns me most is the one in
>>>> svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
>>>> down completion handling limits the total throughput per RDMA
>>>> connection.
>>>>
>>>> So, avoid walking the "all threads" list to find an idle thread to
>>>> wake. Instead, set up an idle bitmap and use find_next_bit, which
>>>> should work the same way as RQ_BUSY but it will touch only the
>>>> cachelines that the bitmap is in. Stick with atomic bit operations
>>>> to avoid taking the pool lock.
>>>>
>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>> ---
>>>> include/linux/sunrpc/svc.h | 6 ++++--
>>>> include/trace/events/sunrpc.h | 1 -
>>>> net/sunrpc/svc.c | 27 +++++++++++++++++++++------
>>>> net/sunrpc/svc_xprt.c | 30 ++++++++++++++++++++++++------
>>>> 4 files changed, 49 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>>> index 6f8bfcd44250..27ffcf7371d0 100644
>>>> --- a/include/linux/sunrpc/svc.h
>>>> +++ b/include/linux/sunrpc/svc.h
>>>> @@ -35,6 +35,7 @@ struct svc_pool {
>>>> spinlock_t sp_lock; /* protects sp_sockets */
>>>> struct list_head sp_sockets; /* pending sockets */
>>>> unsigned int sp_nrthreads; /* # of threads in pool */
>>>> + unsigned long *sp_idle_map; /* idle threads */
>>>> struct xarray sp_thread_xa;
>>>>
>>>> /* statistics on pool operation */
>>>> @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>>>> #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
>>>> + 2 + 1)
>>>>
>>>> +#define RPCSVC_MAXPOOLTHREADS (4096)
>>>> +
>>>> /*
>>>> * The context of a single thread, including the request currently being
>>>> * processed.
>>>> @@ -239,8 +242,7 @@ struct svc_rqst {
>>>> #define RQ_SPLICE_OK (4) /* turned off in gss privacy
>>>> * to prevent encrypting page
>>>> * cache pages */
>>>> -#define RQ_BUSY (5) /* request is busy */
>>>> -#define RQ_DATA (6) /* request has data */
>>>> +#define RQ_DATA (5) /* request has data */
>>>
>>> Might this be a good opportunity to convert this to an enum ??
>>>
>>>> unsigned long rq_flags; /* flags field */
>>>> u32 rq_thread_id; /* xarray index */
>>>> ktime_t rq_qtime; /* enqueue time */
>>>> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
>>>> index ea43c6059bdb..c07824a254bf 100644
>>>> --- a/include/trace/events/sunrpc.h
>>>> +++ b/include/trace/events/sunrpc.h
>>>> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
>>>> svc_rqst_flag(USEDEFERRAL) \
>>>> svc_rqst_flag(DROPME) \
>>>> svc_rqst_flag(SPLICE_OK) \
>>>> - svc_rqst_flag(BUSY) \
>>>> svc_rqst_flag_end(DATA)
>>>>
>>>> #undef svc_rqst_flag
>>>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>>>> index ef350f0d8925..d0278e5190ba 100644
>>>> --- a/net/sunrpc/svc.c
>>>> +++ b/net/sunrpc/svc.c
>>>> @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>>>> INIT_LIST_HEAD(&pool->sp_sockets);
>>>> spin_lock_init(&pool->sp_lock);
>>>> xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
>>>> + /* All threads initially marked "busy" */
>>>> + pool->sp_idle_map =
>>>> + bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
>>>> + svc_pool_map_get_node(i));
>>>> + if (!pool->sp_idle_map)
>>>> + return NULL;
>>>>
>>>> percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
>>>> percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
>>>> @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
>>>> percpu_counter_destroy(&pool->sp_threads_starved);
>>>>
>>>> xa_destroy(&pool->sp_thread_xa);
>>>> + bitmap_free(pool->sp_idle_map);
>>>> + pool->sp_idle_map = NULL;
>>>> }
>>>> kfree(serv->sv_pools);
>>>> kfree(serv);
>>>> @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
>>>>
>>>> folio_batch_init(&rqstp->rq_fbatch);
>>>>
>>>> - __set_bit(RQ_BUSY, &rqstp->rq_flags);
>>>> rqstp->rq_server = serv;
>>>> rqstp->rq_pool = pool;
>>>>
>>>> @@ -677,7 +684,7 @@ static struct svc_rqst *
>>>> svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>>>> {
>>>> static const struct xa_limit limit = {
>>>> - .max = U32_MAX,
>>>> + .max = RPCSVC_MAXPOOLTHREADS,
>>>> };
>>>> struct svc_rqst *rqstp;
>>>> int ret;
>>>> @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
>>>> struct svc_pool *pool)
>>>> {
>>>> struct svc_rqst *rqstp;
>>>> - unsigned long index;
>>>> + unsigned long bit;
>>>>
>>>> - xa_for_each(&pool->sp_thread_xa, index, rqstp) {
>>>> - if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
>>>> + /* Check the pool's idle bitmap locklessly so that multiple
>>>> + * idle searches can proceed concurrently.
>>>> + */
>>>> + for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
>>>> + if (!test_and_clear_bit(bit, pool->sp_idle_map))
>>>> continue;
>>>
>>> I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
>>> Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
>>> and so get all the required memory barriers.
>>> What we are doing here is locking a particular thread for a task, so
>>> "lock" is an appropriate description of what is happening.
>>> See also svc_pool_thread_mark_* below.
>>>
>>>>
>>>> + rqstp = xa_load(&pool->sp_thread_xa, bit);
>>>> + if (!rqstp)
>>>> + break;
>>>> +
>>>> WRITE_ONCE(rqstp->rq_qtime, ktime_get());
>>>> wake_up_process(rqstp->rq_task);
>>>> percpu_counter_inc(&pool->sp_threads_woken);
>>>> @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
>>>> }
>>>>
>>>> found_pool:
>>>> - rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
>>>> + rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
>>>> + XA_PRESENT);
>>>> if (rqstp) {
>>>> __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
>>>> task = rqstp->rq_task;
>>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>>> index 7709120b45c1..2844b32c16ea 100644
>>>> --- a/net/sunrpc/svc_xprt.c
>>>> +++ b/net/sunrpc/svc_xprt.c
>>>> @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
>>>> return true;
>>>> }
>>>>
>>>> +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
>>>> + struct svc_rqst *rqstp)
>>>> +{
>>>> + smp_mb__before_atomic();
>>>> + set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
>>>> + smp_mb__after_atomic();
>>>> +}
>>>
>>> There memory barriers above and below bother me. There is no comment
>>> telling me what they are protecting against.
>>> I would rather svc_pool_thread_mark_idle - which unlocks the thread -
>>> were
>>>
>>> clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
>>>
>>> and that svc_pool_thread_mark_busy were
>>>
>>> test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
>>>
>>> Then it would be more obvious what was happening.
>>
>> Not obvious to me, but that's very likely because I'm not clear what
>> clear_bit_unlock() does. :-)
>
> In general, any "lock" operation (mutex, spin, whatever) is (and must
> be) and "acquire" type operations which imposes a memory barrier so that
> read requests *after* the lock cannot be satisfied with data from
> *before* the lock. The read must access data after the lock.
> Conversely any "unlock" operations is a "release" type operation which
> imposes a memory barrier so that any write request *before* the unlock
> must not be delayed until *after* the unlock. The write must complete
> before the unlock.
>
> This is exactly what you would expect of locking - it creates a closed
> code region that is properly ordered w.r.t comparable closed regions.
>
> test_and_set_bit_lock() and clear_bit_unlock() provide these expected
> semantics for bit operations.

Your explanation is clearer than what I read in Documentation/atomic*,
so thanks. I feel a little better armed to make good use of it.


> New code should (almost?) never have explicit memory barriers like
> smp_mb__after_atomic().
> It should use one of the many APIs with _acquire or _release suffixes,
> or with the more explicit _lock or _unlock.

Out of curiosity, is "should never have explicit memory barriers"
documented somewhere? I've been accused of skimming when I read, so
I might have missed it.


--
Chuck Lever



2023-07-04 16:04:43

by Chuck Lever III

Subject: Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap


> On Jul 3, 2023, at 10:02 PM, Chuck Lever <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
>> On Tue, 04 Jul 2023, Chuck Lever wrote:
>>> From: Chuck Lever <[email protected]>
>>>
>>> I've noticed that client-observed server request latency goes up
>>> simply when the nfsd thread count is increased.
>>>
>>> List walking is known to be memory-inefficient. On a busy server
>>> with many threads, enqueuing a transport will walk the "all threads"
>>> list quite frequently. This also pulls in the cache lines for some
>>> hot fields in each svc_rqst (namely, rq_flags).
>>
>> I think this text could usefully be re-written. By this point in the
>> series we aren't list walking.
>>
>> I'd also be curious to know what latency different you get for just this
>> change.
>
> Not much of a latency difference at lower thread counts.
>
> The difference I notice is that with the spinlock version of
> pool_wake_idle_thread, there is significant lock contention as
> the thread count increases, and the throughput result of my fio
> test is lower (outside the result variance).

I mis-spoke. When I wrote this yesterday I had compared only the
"xarray with bitmap" and the "xarray with spinlock" mechanisms.
I had not tried "xarray only".

Today, while testing review-related fixes, I benchmarked "xarray
only". It behaves like the linked-list implementation it replaces:
performance degrades with anything more than a couple dozen threads
in the pool.


--
Chuck Lever



2023-07-05 00:42:40

by NeilBrown

Subject: Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

On Wed, 05 Jul 2023, Chuck Lever III wrote:
>
> > On Jul 3, 2023, at 10:17 PM, NeilBrown <[email protected]> wrote:
> >
> > On Tue, 04 Jul 2023, Chuck Lever wrote:
> >> On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> >>> On Tue, 04 Jul 2023, Chuck Lever wrote:
> >>>> From: Chuck Lever <[email protected]>
> >>>>
> >>>> I've noticed that client-observed server request latency goes up
> >>>> simply when the nfsd thread count is increased.
> >>>>
> >>>> List walking is known to be memory-inefficient. On a busy server
> >>>> with many threads, enqueuing a transport will walk the "all threads"
> >>>> list quite frequently. This also pulls in the cache lines for some
> >>>> hot fields in each svc_rqst (namely, rq_flags).
> >>>
> >>> I think this text could usefully be re-written. By this point in the
> >>> series we aren't list walking.
> >>>
> >>> I'd also be curious to know what latency different you get for just this
> >>> change.
> >>
> >> Not much of a latency difference at lower thread counts.
> >>
> >> The difference I notice is that with the spinlock version of
> >> pool_wake_idle_thread, there is significant lock contention as
> >> the thread count increases, and the throughput result of my fio
> >> test is lower (outside the result variance).
> >>
> >>
> >>>> The svc_xprt_enqueue() call that concerns me most is the one in
> >>>> svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> >>>> down completion handling limits the total throughput per RDMA
> >>>> connection.
> >>>>
> >>>> So, avoid walking the "all threads" list to find an idle thread to
> >>>> wake. Instead, set up an idle bitmap and use find_next_bit, which
> >>>> should work the same way as RQ_BUSY but it will touch only the
> >>>> cachelines that the bitmap is in. Stick with atomic bit operations
> >>>> to avoid taking the pool lock.
> >>>>
> >>>> Signed-off-by: Chuck Lever <[email protected]>
> >>>> ---
> >>>> include/linux/sunrpc/svc.h | 6 ++++--
> >>>> include/trace/events/sunrpc.h | 1 -
> >>>> net/sunrpc/svc.c | 27 +++++++++++++++++++++------
> >>>> net/sunrpc/svc_xprt.c | 30 ++++++++++++++++++++++++------
> >>>> 4 files changed, 49 insertions(+), 15 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> >>>> index 6f8bfcd44250..27ffcf7371d0 100644
> >>>> --- a/include/linux/sunrpc/svc.h
> >>>> +++ b/include/linux/sunrpc/svc.h
> >>>> @@ -35,6 +35,7 @@ struct svc_pool {
> >>>> spinlock_t sp_lock; /* protects sp_sockets */
> >>>> struct list_head sp_sockets; /* pending sockets */
> >>>> unsigned int sp_nrthreads; /* # of threads in pool */
> >>>> + unsigned long *sp_idle_map; /* idle threads */
> >>>> struct xarray sp_thread_xa;
> >>>>
> >>>> /* statistics on pool operation */
> >>>> @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> >>>> #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> >>>> + 2 + 1)
> >>>>
> >>>> +#define RPCSVC_MAXPOOLTHREADS (4096)
> >>>> +
> >>>> /*
> >>>> * The context of a single thread, including the request currently being
> >>>> * processed.
> >>>> @@ -239,8 +242,7 @@ struct svc_rqst {
> >>>> #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> >>>> * to prevent encrypting page
> >>>> * cache pages */
> >>>> -#define RQ_BUSY (5) /* request is busy */
> >>>> -#define RQ_DATA (6) /* request has data */
> >>>> +#define RQ_DATA (5) /* request has data */
> >>>
> >>> Might this be a good opportunity to convert this to an enum ??
> >>>
> >>>> unsigned long rq_flags; /* flags field */
> >>>> u32 rq_thread_id; /* xarray index */
> >>>> ktime_t rq_qtime; /* enqueue time */
> >>>> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> >>>> index ea43c6059bdb..c07824a254bf 100644
> >>>> --- a/include/trace/events/sunrpc.h
> >>>> +++ b/include/trace/events/sunrpc.h
> >>>> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> >>>> svc_rqst_flag(USEDEFERRAL) \
> >>>> svc_rqst_flag(DROPME) \
> >>>> svc_rqst_flag(SPLICE_OK) \
> >>>> - svc_rqst_flag(BUSY) \
> >>>> svc_rqst_flag_end(DATA)
> >>>>
> >>>> #undef svc_rqst_flag
> >>>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> >>>> index ef350f0d8925..d0278e5190ba 100644
> >>>> --- a/net/sunrpc/svc.c
> >>>> +++ b/net/sunrpc/svc.c
> >>>> @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> >>>> INIT_LIST_HEAD(&pool->sp_sockets);
> >>>> spin_lock_init(&pool->sp_lock);
> >>>> xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> >>>> + /* All threads initially marked "busy" */
> >>>> + pool->sp_idle_map =
> >>>> + bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> >>>> + svc_pool_map_get_node(i));
> >>>> + if (!pool->sp_idle_map)
> >>>> + return NULL;
> >>>>
> >>>> percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> >>>> percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> >>>> @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
> >>>> percpu_counter_destroy(&pool->sp_threads_starved);
> >>>>
> >>>> xa_destroy(&pool->sp_thread_xa);
> >>>> + bitmap_free(pool->sp_idle_map);
> >>>> + pool->sp_idle_map = NULL;
> >>>> }
> >>>> kfree(serv->sv_pools);
> >>>> kfree(serv);
> >>>> @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
> >>>>
> >>>> folio_batch_init(&rqstp->rq_fbatch);
> >>>>
> >>>> - __set_bit(RQ_BUSY, &rqstp->rq_flags);
> >>>> rqstp->rq_server = serv;
> >>>> rqstp->rq_pool = pool;
> >>>>
> >>>> @@ -677,7 +684,7 @@ static struct svc_rqst *
> >>>> svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >>>> {
> >>>> static const struct xa_limit limit = {
> >>>> - .max = U32_MAX,
> >>>> + .max = RPCSVC_MAXPOOLTHREADS,
> >>>> };
> >>>> struct svc_rqst *rqstp;
> >>>> int ret;
> >>>> @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> >>>> struct svc_pool *pool)
> >>>> {
> >>>> struct svc_rqst *rqstp;
> >>>> - unsigned long index;
> >>>> + unsigned long bit;
> >>>>
> >>>> - xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> >>>> - if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> >>>> + /* Check the pool's idle bitmap locklessly so that multiple
> >>>> + * idle searches can proceed concurrently.
> >>>> + */
> >>>> + for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
> >>>> + if (!test_and_clear_bit(bit, pool->sp_idle_map))
> >>>> continue;
> >>>
> >>> I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
> >>> Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
> >>> and so get all the required memory barriers.
> >>> What we are doing here is locking a particular thread for a task, so
> >>> "lock" is an appropriate description of what is happening.
> >>> See also svc_pool_thread_mark_* below.
> >>>
> >>>>
> >>>> + rqstp = xa_load(&pool->sp_thread_xa, bit);
> >>>> + if (!rqstp)
> >>>> + break;
> >>>> +
> >>>> WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> >>>> wake_up_process(rqstp->rq_task);
> >>>> percpu_counter_inc(&pool->sp_threads_woken);
> >>>> @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
> >>>> }
> >>>>
> >>>> found_pool:
> >>>> - rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> >>>> + rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
> >>>> + XA_PRESENT);
> >>>> if (rqstp) {
> >>>> __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> >>>> task = rqstp->rq_task;
> >>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> >>>> index 7709120b45c1..2844b32c16ea 100644
> >>>> --- a/net/sunrpc/svc_xprt.c
> >>>> +++ b/net/sunrpc/svc_xprt.c
> >>>> @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> >>>> return true;
> >>>> }
> >>>>
> >>>> +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> >>>> + struct svc_rqst *rqstp)
> >>>> +{
> >>>> + smp_mb__before_atomic();
> >>>> + set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> >>>> + smp_mb__after_atomic();
> >>>> +}
> >>>
> >>> There memory barriers above and below bother me. There is no comment
> >>> telling me what they are protecting against.
> >>> I would rather svc_pool_thread_mark_idle - which unlocks the thread -
> >>> were
> >>>
> >>> clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
> >>>
> >>> and that svc_pool_thread_mark_busy were
> >>>
> >>> test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
> >>>
> >>> Then it would be more obvious what was happening.
> >>
> >> Not obvious to me, but that's very likely because I'm not clear what
> >> clear_bit_unlock() does. :-)
> >
> > In general, any "lock" operation (mutex, spin, whatever) is (and must
> > be) and "acquire" type operations which imposes a memory barrier so that
> > read requests *after* the lock cannot be satisfied with data from
> > *before* the lock. The read must access data after the lock.
> > Conversely any "unlock" operations is a "release" type operation which
> > imposes a memory barrier so that any write request *before* the unlock
> > must not be delayed until *after* the unlock. The write must complete
> > before the unlock.
> >
> > This is exactly what you would expect of locking - it creates a closed
> > code region that is properly ordered w.r.t comparable closed regions.
> >
> > test_and_set_bit_lock() and clear_bit_unlock() provide these expected
> > semantics for bit operations.
>
> Your explanation is clearer than what I read in Documentation/atomic*,
> so thanks. I feel a little more armed to make good use of it.
>
>
> > New code should (almost?) never have explicit memory barriers like
> > smp_mb__after_atomic().
> > It should use one of the many APIs with _acquire or _release suffixes,
> > or with the more explicit _lock or _unlock.
>
> Out of curiosity, is "should never have explicit memory barriers"
> documented somewhere? I've been accused of skimming when I read, so
> I might have missed it.

My wife says I only read every second word of emails :-)

I don't know that it is documented anywhere (maybe I should submit a
patch). The statement was really my personal rule that seems to be the
natural sequel for the introduction of the many _acquire and _release
interfaces.
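
For concreteness, the two helpers from patch 9 would then reduce to
something like this untested sketch (with sp_idle_map renamed to
sp_busy_map and initialised with bitmap_fill(); I've also made mark_busy
return whether the claim succeeded, which the wake-up path could use but
the current patch may not want):

/* Release the thread: stores made while handling the previous request
 * are ordered before the thread becomes visible as available.
 */
static void svc_pool_thread_mark_idle(struct svc_pool *pool,
                                      struct svc_rqst *rqstp)
{
        clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
}

/* Claim the thread: acquire semantics on success; returns true only
 * if this caller actually won the claim.
 */
static bool svc_pool_thread_mark_busy(struct svc_pool *pool,
                                      struct svc_rqst *rqstp)
{
        return !test_and_set_bit_lock(rqstp->rq_thread_id,
                                      pool->sp_busy_map);
}

No explicit smp_mb__*() calls are needed; the _lock/_unlock variants
supply the ordering.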

NeilBrown

2023-07-05 01:22:54

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

On Wed, 05 Jul 2023, Chuck Lever III wrote:
> > On Jul 3, 2023, at 10:02 PM, Chuck Lever <[email protected]> wrote:
> >
> > On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> >> On Tue, 04 Jul 2023, Chuck Lever wrote:
> >>> From: Chuck Lever <[email protected]>
> >>>
> >>> I've noticed that client-observed server request latency goes up
> >>> simply when the nfsd thread count is increased.
> >>>
> >>> List walking is known to be memory-inefficient. On a busy server
> >>> with many threads, enqueuing a transport will walk the "all threads"
> >>> list quite frequently. This also pulls in the cache lines for some
> >>> hot fields in each svc_rqst (namely, rq_flags).
> >>
> >> I think this text could usefully be re-written. By this point in the
> >> series we aren't list walking.
> >>
> >> I'd also be curious to know what latency difference you get for just this
> >> change.
> >
> > Not much of a latency difference at lower thread counts.
> >
> > The difference I notice is that with the spinlock version of
> > pool_wake_idle_thread, there is significant lock contention as
> > the thread count increases, and the throughput result of my fio
> > test is lower (outside the result variance).
>
> I mis-spoke. When I wrote this yesterday I had compared only the
> "xarray with bitmap" and the "xarray with spinlock" mechanisms.
> I had not tried "xarray only".
>
> Today, while testing review-related fixes, I benchmarked "xarray
> only". It behaves like the linked-list implementation it replaces:
> performance degrades with anything more than a couple dozen threads
> in the pool.

I'm a little surprised it is that bad, but only a little.
The above is good text to include in the justification of that last
patch.

Thanks for the clarification.
NeilBrown

2023-07-05 01:23:40

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2 0/9] SUNRPC service thread scheduler optimizations


I've been pondering this scheduling mechanism in sunrpc/svc some more,
and I wonder if rather than optimising the search, we should eliminate
it.

Instead we could have a linked list of idle threads using llist.h

svc_enqueue_xprt calls llist_del_first() and if the result is not NULL,
that thread is deemed busy (because it isn't on the list) and is woken.

choose_victim() could also use llist_del_first(). If nothing is there
it could set a flag which gets cleared by the next thread to go idle.
That thread exits .. or something. Some interlock would be needed but
it shouldn't be too hard.
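
Very roughly, and with names invented for the sketch (an sp_idle_threads
llist_head added to struct svc_pool and an rq_idle llist_node added to
struct svc_rqst), the two halves would look like:

/* Wake-up side: pop one idle thread, if there is one.  Per the table
 * at the top of llist.h, concurrent llist_del_first() callers still
 * need to serialise against each other - glossed over here.
 */
static struct svc_rqst *svc_pool_pop_idle_thread(struct svc_pool *pool)
{
        struct llist_node *ln;

        ln = llist_del_first(&pool->sp_idle_threads);
        if (!ln)
                return NULL;    /* every thread is currently busy */
        return llist_entry(ln, struct svc_rqst, rq_idle);
}

/* Idle side: a thread with nothing to do puts itself back on the list.
 * "Not on the list" is what busy means, so no flag or bitmap is needed.
 */
static void svc_pool_push_idle_thread(struct svc_pool *pool,
                                      struct svc_rqst *rqstp)
{
        llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
}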

svc_exit_thread would have difficulty removing itself from the idle
list, if it wasn't busy.. Possibly we could disallow that case (I think
sending a signal to a thread can make it exit while idle).
Alternately we could use llist_del_all() to clear the list, then wake
them all up so that they go back on the list if there is nothing to do
and if they aren't trying to exit. That is fairly heavy handed, but
isn't a case that we need to optimise.

If you think that might be worth pursuing, I could have a go at writing
the patch - probably on top of all the cleanups in your series before
the xarray is added.

I also wonder if we should avoid waking too many threads up at once.
If multiple events happen in quick succession, we currently wake up
multiple threads and if there is any scheduling delay (which is expected
based on Commit 22700f3c6df5 ("SUNRPC: Improve ordering of transport processing"))
then by the time the threads wake up, there may no longer be work to do
as another thread might have gone idle and taken the work.

Instead we could have a limit on the number of threads waking up -
possibly 1 or 3. If the counter is maxed out, don't do a wake up.
When a thread wakes up, it decrements the counter, dequeues some work,
and if there is more to do, then it queues another task.
I imagine the same basic protocol would be used for creating new threads
when load is high - start just one at a time, though maybe a new thread
would handle a first request before possibly starting another thread.
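
Purely illustrative, for the wake-up limiting part (the counter, its
name, and the limit of three are all invented, and the actual wake-up
call is elided):

/* Sketch only: sp_nr_pending_wakeups would be a new atomic_t in
 * struct svc_pool, and SP_MAX_PENDING_WAKEUPS is arbitrary.
 */
#define SP_MAX_PENDING_WAKEUPS  3

static void svc_pool_rate_limited_wake(struct svc_pool *pool)
{
        if (atomic_inc_return(&pool->sp_nr_pending_wakeups) >
            SP_MAX_PENDING_WAKEUPS) {
                /* Enough wake-ups already in flight; one of those
                 * threads will find this work when it runs.
                 */
                atomic_dec(&pool->sp_nr_pending_wakeups);
                return;
        }
        /* ... wake one idle thread, as the existing code does ... */
}

/* Called by a thread as soon as it wakes, before it dequeues work. */
static void svc_pool_wake_done(struct svc_pool *pool)
{
        atomic_dec(&pool->sp_nr_pending_wakeups);
}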

But this is a stretch goal - not the main focus.

Thanks,
NeilBrown

2023-07-05 02:58:19

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v2 0/9] SUNRPC service thread scheduler optimizations

On Wed, Jul 05, 2023 at 11:08:32AM +1000, NeilBrown wrote:
>
> I've been pondering this scheduling mechanism in sunrpc/svc some more,
> and I wonder if rather than optimising the search, we should eliminate
> it.
>
> Instead we could have a linked list of idle threads using llist.h
>
> svc_enqueue_xprt calls llist_del_first() and if the result is not NULL,
> that thread is deemed busy (because it isn't on the list) and is woken.
>
> choose_victim() could also use llist_del_first(). If nothing is there
> it could set a flag which gets cleared by the next thread to go idle.
> That thread exits .. or something. Some interlock would be needed but
> it shouldn't be too hard.
>
> svc_exit_thread would have difficulty removing itself from the idle
> list, if it wasn't busy.. Possibly we could disallow that case (I think
> sending a signal to a thread can make it exit while idle).
> Alternately we could use llist_del_all() to clear the list, then wake
> them all up so that they go back on the list if there is nothing to do
> and if they aren't trying to exit. That is fairly heavy handed, but
> isn't a case that we need to optimise.
>
> If you think that might be worth pursuing, I could have a go at writing
> the patch - probably on top of all the cleanups in your series before
> the xarray is added.

The thread pool is effectively a cached resource, so it is a use case
that fits llist well. svcrdma uses llist in a few spots in that very
capacity.

If you think you can meet all of the criteria in the table at the top of
llist.h so thread scheduling works entirely without a lock, that might
be an interesting point of comparison.

My only concern is preserving current behavior: the candidate
mechanisms so far manage to use mostly the first thread, rather than
round-robining through the thread list, and using mostly one process
tends to be more cache-friendly. An llist-based thread scheduler should
try to follow that behavior, IMO.


> I also wonder if we should avoid waking too many threads up at once.
> If multiple events happen in quick succession, we currently wake up
> multiple threads and if there is any scheduling delay (which is expected
> based on Commit 22700f3c6df5 ("SUNRPC: Improve ordering of transport processing"))
> then by the time the threads wake up, there may no longer be work to do
> as another thread might have gone idle and taken the work.

It might be valuable to add some observability of wake-ups that find
nothing to do. I'll look into that.


> Instead we could have a limit on the number of threads waking up -
> possibly 1 or 3. If the counter is maxed out, don't do a wake up.
> When a thread wakes up, it decrements the counter, dequeues some work,
> and if there is more to do, then it queues another task.

I consider reducing the wake-up rate to be the next step for improving
RPC service thread scalability. Any experimentation in that area is
worth looking into.


> I imagine the same basic protocol would be used for creating new threads
> when load is high - start just one at a time, though maybe a new thread
> would handle a first request before possibly starting another thread.

I envision dynamically tuning the pool thread count as something that
should be managed in user space, since it's a policy rather than a
mechanism.

That could be a problem, though, if we wanted to shut down a few pool
threads based on shrinker activity.