2023-07-10 16:46:51

by Chuck Lever

Subject: [PATCH v3 0/9] SUNRPC service thread scheduler optimizations

Walking a linked list to find an idle thread is not CPU cache-
friendly, and in fact I've noted palpable per-request latency
impacts as the number of nfsd threads on the server increases.

After discussing some possible improvements with Jeff at LSF/MM,
I've been experimenting with the following series. I've measured an
order of magnitude latency improvement in the thread lookup time,
and have managed to keep the whole thing lockless.

This version of the series addresses Neil's earlier comments and
is robust under load. The first 7 patches in this series seem
uncontroversial, so I'll push them to the
"topic-sunrpc-thread-scheduling" branch of:

https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

The last two are still under discussion. They are posted as part
of this series for comparison to other proposals, but are not yet
included in the topic branch. But they are tested and working.


Changes since v2:
* Dropped the patch that converts sp_lock to a simple spinlock
* Replaced explicit memory barriers in svc_get_next_xprt()
* Select thread victims from the other end of the bitmap
* Added a metric for wake-ups that find nothing on the transport queue

Changes since RFC:
* Add a counter for ingress RPC messages
* Add a few documenting comments
* Move the more controversial patches to the end of the series
* Clarify the refactoring of svc_wake_up()
* Increase the value of RPCSVC_MAXPOOLTHREADS to 4096 (and tested with that many threads)
* Optimize the loop in svc_pool_wake_idle_thread()
* Optimize marking a thread "idle" in svc_get_next_xprt()

---

Chuck Lever (9):
SUNRPC: Deduplicate thread wake-up code
SUNRPC: Report when no service thread is available.
SUNRPC: Split the svc_xprt_dequeue tracepoint
SUNRPC: Count ingress RPC messages per svc_pool
SUNRPC: Count pool threads that were awoken but found no work to do
SUNRPC: Clean up svc_set_num_threads
SUNRPC: Replace dprintk() call site in __svc_create()
SUNRPC: Replace sp_threads_all with an xarray
SUNRPC: Convert RQ_BUSY into a per-pool bitmap


fs/nfsd/nfssvc.c | 3 +-
include/linux/sunrpc/svc.h | 19 ++--
include/trace/events/sunrpc.h | 159 ++++++++++++++++++++++++++----
net/sunrpc/svc.c | 177 ++++++++++++++++++++++------------
net/sunrpc/svc_xprt.c | 99 ++++++++++---------
5 files changed, 323 insertions(+), 134 deletions(-)

--
Chuck Lever



2023-07-10 16:46:54

by Chuck Lever

Subject: [PATCH v3 1/9] SUNRPC: Deduplicate thread wake-up code

From: Chuck Lever <[email protected]>

Refactor: Extract the loop that finds an idle service thread from
svc_xprt_enqueue() and svc_wake_up(). Both functions do just about
the same thing.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc.h | 1 +
net/sunrpc/svc.c | 28 ++++++++++++++++++++++++++
net/sunrpc/svc_xprt.c | 48 +++++++++++++++-----------------------------
3 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index f8751118c122..dc2d90a655e2 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -427,6 +427,7 @@ int svc_register(const struct svc_serv *, struct net *, const int,

void svc_wake_up(struct svc_serv *);
void svc_reserve(struct svc_rqst *rqstp, int space);
+struct svc_rqst *svc_pool_wake_idle_thread(struct svc_pool *pool);
struct svc_pool *svc_pool_for_cpu(struct svc_serv *serv);
char * svc_print_addr(struct svc_rqst *, char *, size_t);
const char * svc_proc_name(const struct svc_rqst *rqstp);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 587811a002c9..05ee92b5fa1e 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -689,6 +689,34 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
return rqstp;
}

+/**
+ * svc_pool_wake_idle_thread - Awaken an idle thread in @pool
+ * @pool: service thread pool
+ *
+ * Returns an idle service thread (now marked BUSY), or NULL
+ * if no service threads are available. Finding an idle service
+ * thread and marking it BUSY is atomic with respect to other
+ * calls to svc_pool_wake_idle_thread().
+ */
+struct svc_rqst *svc_pool_wake_idle_thread(struct svc_pool *pool)
+{
+ struct svc_rqst *rqstp;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+ if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
+ continue;
+
+ rcu_read_unlock();
+ WRITE_ONCE(rqstp->rq_qtime, ktime_get());
+ wake_up_process(rqstp->rq_task);
+ percpu_counter_inc(&pool->sp_threads_woken);
+ return rqstp;
+ }
+ rcu_read_unlock();
+ return NULL;
+}
+
/*
* Choose a pool in which to create a new thread, for svc_set_num_threads
*/
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 62c7919ea610..89302bf09b77 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -455,8 +455,8 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
*/
void svc_xprt_enqueue(struct svc_xprt *xprt)
{
+ struct svc_rqst *rqstp;
struct svc_pool *pool;
- struct svc_rqst *rqstp = NULL;

if (!svc_xprt_ready(xprt))
return;
@@ -476,20 +476,10 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
spin_unlock_bh(&pool->sp_lock);

- /* find a thread for this xprt */
- rcu_read_lock();
- list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
- if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
- continue;
- percpu_counter_inc(&pool->sp_threads_woken);
- rqstp->rq_qtime = ktime_get();
- wake_up_process(rqstp->rq_task);
- goto out_unlock;
- }
- set_bit(SP_CONGESTED, &pool->sp_flags);
- rqstp = NULL;
-out_unlock:
- rcu_read_unlock();
+ rqstp = svc_pool_wake_idle_thread(pool);
+ if (!rqstp)
+ set_bit(SP_CONGESTED, &pool->sp_flags);
+
trace_svc_xprt_enqueue(xprt, rqstp);
}
EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
@@ -581,7 +571,10 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
svc_xprt_put(xprt);
}

-/*
+/**
+ * svc_wake_up - Wake up a service thread for non-transport work
+ * @serv: RPC service
+ *
* Some svc_serv's will have occasional work to do, even when a xprt is not
* waiting to be serviced. This function is there to "kick" a task in one of
* those services so that it can wake up and do that work. Note that we only
@@ -590,27 +583,18 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
*/
void svc_wake_up(struct svc_serv *serv)
{
+ struct svc_pool *pool = &serv->sv_pools[0];
struct svc_rqst *rqstp;
- struct svc_pool *pool;

- pool = &serv->sv_pools[0];
-
- rcu_read_lock();
- list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
- /* skip any that aren't queued */
- if (test_bit(RQ_BUSY, &rqstp->rq_flags))
- continue;
- rcu_read_unlock();
- wake_up_process(rqstp->rq_task);
- trace_svc_wake_up(rqstp->rq_task->pid);
+ rqstp = svc_pool_wake_idle_thread(pool);
+ if (!rqstp) {
+ set_bit(SP_TASK_PENDING, &pool->sp_flags);
+ smp_wmb();
+ trace_svc_wake_up(0);
return;
}
- rcu_read_unlock();

- /* No free entries available */
- set_bit(SP_TASK_PENDING, &pool->sp_flags);
- smp_wmb();
- trace_svc_wake_up(0);
+ trace_svc_wake_up(rqstp->rq_task->pid);
}
EXPORT_SYMBOL_GPL(svc_wake_up);




2023-07-10 16:46:54

by Chuck Lever

Subject: [PATCH v3 2/9] SUNRPC: Report when no service thread is available.

From: Chuck Lever <[email protected]>

Count and record thread pool starvation. Administrators can take
action by increasing thread count or decreasing workload.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc.h | 5 +++-
include/trace/events/sunrpc.h | 49 ++++++++++++++++++++++++++++++++++-------
net/sunrpc/svc.c | 9 +++++++-
net/sunrpc/svc_xprt.c | 22 ++++++++++--------
4 files changed, 64 insertions(+), 21 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index dc2d90a655e2..fbfe6ea737c8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -22,7 +22,6 @@
#include <linux/pagevec.h>

/*
- *
* RPC service thread pool.
*
* Pool of threads and temporary sockets. Generally there is only
@@ -42,6 +41,7 @@ struct svc_pool {
struct percpu_counter sp_sockets_queued;
struct percpu_counter sp_threads_woken;
struct percpu_counter sp_threads_timedout;
+ struct percpu_counter sp_threads_starved;

#define SP_TASK_PENDING (0) /* still work to do even if no
* xprt is queued. */
@@ -427,7 +427,8 @@ int svc_register(const struct svc_serv *, struct net *, const int,

void svc_wake_up(struct svc_serv *);
void svc_reserve(struct svc_rqst *rqstp, int space);
-struct svc_rqst *svc_pool_wake_idle_thread(struct svc_pool *pool);
+struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
+ struct svc_pool *pool);
struct svc_pool *svc_pool_for_cpu(struct svc_serv *serv);
char * svc_print_addr(struct svc_rqst *, char *, size_t);
const char * svc_proc_name(const struct svc_rqst *rqstp);
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 43711753616a..9b70fc1c698a 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1994,21 +1994,21 @@ TRACE_EVENT(svc_xprt_create_err,
TRACE_EVENT(svc_xprt_enqueue,
TP_PROTO(
const struct svc_xprt *xprt,
- const struct svc_rqst *rqst
+ const struct svc_rqst *wakee
),

- TP_ARGS(xprt, rqst),
+ TP_ARGS(xprt, wakee),

TP_STRUCT__entry(
SVC_XPRT_ENDPOINT_FIELDS(xprt)

- __field(int, pid)
+ __field(pid_t, pid)
),

TP_fast_assign(
SVC_XPRT_ENDPOINT_ASSIGNMENTS(xprt);

- __entry->pid = rqst? rqst->rq_task->pid : 0;
+ __entry->pid = wakee->rq_task->pid;
),

TP_printk(SVC_XPRT_ENDPOINT_FORMAT " pid=%d",
@@ -2039,6 +2039,39 @@ TRACE_EVENT(svc_xprt_dequeue,
SVC_XPRT_ENDPOINT_VARARGS, __entry->wakeup)
);

+#define show_svc_pool_flags(x) \
+ __print_flags(x, "|", \
+ { BIT(SP_TASK_PENDING), "TASK_PENDING" }, \
+ { BIT(SP_CONGESTED), "CONGESTED" })
+
+TRACE_EVENT(svc_pool_starved,
+ TP_PROTO(
+ const struct svc_serv *serv,
+ const struct svc_pool *pool
+ ),
+
+ TP_ARGS(serv, pool),
+
+ TP_STRUCT__entry(
+ __string(name, serv->sv_name)
+ __field(int, pool_id)
+ __field(unsigned int, nrthreads)
+ __field(unsigned long, flags)
+ ),
+
+ TP_fast_assign(
+ __assign_str(name, serv->sv_name);
+ __entry->pool_id = pool->sp_id;
+ __entry->nrthreads = pool->sp_nrthreads;
+ __entry->flags = pool->sp_flags;
+ ),
+
+ TP_printk("service=%s pool=%d flags=%s nrthreads=%u",
+ __get_str(name), __entry->pool_id,
+ show_svc_pool_flags(__entry->flags), __entry->nrthreads
+ )
+);
+
DECLARE_EVENT_CLASS(svc_xprt_event,
TP_PROTO(
const struct svc_xprt *xprt
@@ -2109,16 +2142,16 @@ TRACE_EVENT(svc_xprt_accept,
);

TRACE_EVENT(svc_wake_up,
- TP_PROTO(int pid),
+ TP_PROTO(const struct svc_rqst *wakee),

- TP_ARGS(pid),
+ TP_ARGS(wakee),

TP_STRUCT__entry(
- __field(int, pid)
+ __field(pid_t, pid)
),

TP_fast_assign(
- __entry->pid = pid;
+ __entry->pid = wakee->rq_task->pid;
),

TP_printk("pid=%d", __entry->pid)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 05ee92b5fa1e..b79b8b41905d 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -516,6 +516,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_threads_woken, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_threads_timedout, 0, GFP_KERNEL);
+ percpu_counter_init(&pool->sp_threads_starved, 0, GFP_KERNEL);
}

return serv;
@@ -591,6 +592,7 @@ svc_destroy(struct kref *ref)
percpu_counter_destroy(&pool->sp_sockets_queued);
percpu_counter_destroy(&pool->sp_threads_woken);
percpu_counter_destroy(&pool->sp_threads_timedout);
+ percpu_counter_destroy(&pool->sp_threads_starved);
}
kfree(serv->sv_pools);
kfree(serv);
@@ -691,6 +693,7 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)

/**
* svc_pool_wake_idle_thread - Awaken an idle thread in @pool
+ * @serv: RPC service
* @pool: service thread pool
*
* Returns an idle service thread (now marked BUSY), or NULL
@@ -698,7 +701,8 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
* thread and marking it BUSY is atomic with respect to other
* calls to svc_pool_wake_idle_thread().
*/
-struct svc_rqst *svc_pool_wake_idle_thread(struct svc_pool *pool)
+struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
+ struct svc_pool *pool)
{
struct svc_rqst *rqstp;

@@ -714,6 +718,9 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_pool *pool)
return rqstp;
}
rcu_read_unlock();
+
+ trace_svc_pool_starved(serv, pool);
+ percpu_counter_inc(&pool->sp_threads_starved);
return NULL;
}

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 89302bf09b77..a1ed6fb69793 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -455,7 +455,7 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
*/
void svc_xprt_enqueue(struct svc_xprt *xprt)
{
- struct svc_rqst *rqstp;
+ struct svc_rqst *rqstp;
struct svc_pool *pool;

if (!svc_xprt_ready(xprt))
@@ -476,9 +476,11 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
spin_unlock_bh(&pool->sp_lock);

- rqstp = svc_pool_wake_idle_thread(pool);
- if (!rqstp)
+ rqstp = svc_pool_wake_idle_thread(xprt->xpt_server, pool);
+ if (!rqstp) {
set_bit(SP_CONGESTED, &pool->sp_flags);
+ return;
+ }

trace_svc_xprt_enqueue(xprt, rqstp);
}
@@ -584,17 +586,16 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
void svc_wake_up(struct svc_serv *serv)
{
struct svc_pool *pool = &serv->sv_pools[0];
- struct svc_rqst *rqstp;
+ struct svc_rqst *rqstp;

- rqstp = svc_pool_wake_idle_thread(pool);
+ rqstp = svc_pool_wake_idle_thread(serv, pool);
if (!rqstp) {
set_bit(SP_TASK_PENDING, &pool->sp_flags);
smp_wmb();
- trace_svc_wake_up(0);
return;
}

- trace_svc_wake_up(rqstp->rq_task->pid);
+ trace_svc_wake_up(rqstp);
}
EXPORT_SYMBOL_GPL(svc_wake_up);

@@ -1436,16 +1437,17 @@ static int svc_pool_stats_show(struct seq_file *m, void *p)
struct svc_pool *pool = p;

if (p == SEQ_START_TOKEN) {
- seq_puts(m, "# pool packets-arrived sockets-enqueued threads-woken threads-timedout\n");
+ seq_puts(m, "# pool packets-arrived xprts-enqueued threads-woken threads-timedout starved\n");
return 0;
}

- seq_printf(m, "%u %llu %llu %llu %llu\n",
+ seq_printf(m, "%u %llu %llu %llu %llu %llu\n",
pool->sp_id,
percpu_counter_sum_positive(&pool->sp_sockets_queued),
percpu_counter_sum_positive(&pool->sp_sockets_queued),
percpu_counter_sum_positive(&pool->sp_threads_woken),
- percpu_counter_sum_positive(&pool->sp_threads_timedout));
+ percpu_counter_sum_positive(&pool->sp_threads_timedout),
+ percpu_counter_sum_positive(&pool->sp_threads_starved));

return 0;
}



2023-07-10 16:46:56

by Chuck Lever

Subject: [PATCH v3 3/9] SUNRPC: Split the svc_xprt_dequeue tracepoint

From: Chuck Lever <[email protected]>

Distinguish between the case where new work was picked up just by
looking at the transport queue and the case where the thread had to
be awoken. This gives us better visibility into how well utilized
the thread pool is.

Signed-off-by: Chuck Lever <[email protected]>
---
include/trace/events/sunrpc.h | 47 +++++++++++++++++++++++++++++++----------
net/sunrpc/svc_xprt.c | 9 +++++---
2 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 9b70fc1c698a..2e83887b58cd 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -2015,34 +2015,57 @@ TRACE_EVENT(svc_xprt_enqueue,
SVC_XPRT_ENDPOINT_VARARGS, __entry->pid)
);

-TRACE_EVENT(svc_xprt_dequeue,
+#define show_svc_pool_flags(x) \
+ __print_flags(x, "|", \
+ { BIT(SP_TASK_PENDING), "TASK_PENDING" }, \
+ { BIT(SP_CONGESTED), "CONGESTED" })
+
+DECLARE_EVENT_CLASS(svc_pool_scheduler_class,
TP_PROTO(
- const struct svc_rqst *rqst
+ const struct svc_rqst *rqstp
),

- TP_ARGS(rqst),
+ TP_ARGS(rqstp),

TP_STRUCT__entry(
- SVC_XPRT_ENDPOINT_FIELDS(rqst->rq_xprt)
+ SVC_XPRT_ENDPOINT_FIELDS(rqstp->rq_xprt)

+ __string(name, rqstp->rq_server->sv_name)
+ __field(int, pool_id)
+ __field(unsigned int, nrthreads)
+ __field(unsigned long, pool_flags)
__field(unsigned long, wakeup)
),

TP_fast_assign(
- SVC_XPRT_ENDPOINT_ASSIGNMENTS(rqst->rq_xprt);
+ struct svc_pool *pool = rqstp->rq_pool;
+ SVC_XPRT_ENDPOINT_ASSIGNMENTS(rqstp->rq_xprt);

+ __assign_str(name, rqstp->rq_server->sv_name);
+ __entry->pool_id = pool->sp_id;
+ __entry->nrthreads = pool->sp_nrthreads;
+ __entry->pool_flags = pool->sp_flags;
__entry->wakeup = ktime_to_us(ktime_sub(ktime_get(),
- rqst->rq_qtime));
+ rqstp->rq_qtime));
),

- TP_printk(SVC_XPRT_ENDPOINT_FORMAT " wakeup-us=%lu",
- SVC_XPRT_ENDPOINT_VARARGS, __entry->wakeup)
+ TP_printk(SVC_XPRT_ENDPOINT_FORMAT
+ " service=%s pool=%d pool_flags=%s nrthreads=%u wakeup-us=%lu",
+ SVC_XPRT_ENDPOINT_VARARGS, __get_str(name), __entry->pool_id,
+ show_svc_pool_flags(__entry->pool_flags), __entry->nrthreads,
+ __entry->wakeup
+ )
);

-#define show_svc_pool_flags(x) \
- __print_flags(x, "|", \
- { BIT(SP_TASK_PENDING), "TASK_PENDING" }, \
- { BIT(SP_CONGESTED), "CONGESTED" })
+#define DEFINE_SVC_POOL_SCHEDULER_EVENT(name) \
+ DEFINE_EVENT(svc_pool_scheduler_class, svc_pool_##name, \
+ TP_PROTO( \
+ const struct svc_rqst *rqstp \
+ ), \
+ TP_ARGS(rqstp))
+
+DEFINE_SVC_POOL_SCHEDULER_EVENT(polled);
+DEFINE_SVC_POOL_SCHEDULER_EVENT(awoken);

TRACE_EVENT(svc_pool_starved,
TP_PROTO(
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index a1ed6fb69793..7ee095d03996 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -744,8 +744,10 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
WARN_ON_ONCE(rqstp->rq_xprt);

rqstp->rq_xprt = svc_xprt_dequeue(pool);
- if (rqstp->rq_xprt)
+ if (rqstp->rq_xprt) {
+ trace_svc_pool_polled(rqstp);
goto out_found;
+ }

/*
* We have to be able to interrupt this wait
@@ -767,8 +769,10 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
set_bit(RQ_BUSY, &rqstp->rq_flags);
smp_mb__after_atomic();
rqstp->rq_xprt = svc_xprt_dequeue(pool);
- if (rqstp->rq_xprt)
+ if (rqstp->rq_xprt) {
+ trace_svc_pool_awoken(rqstp);
goto out_found;
+ }

if (!time_left)
percpu_counter_inc(&pool->sp_threads_timedout);
@@ -784,7 +788,6 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
rqstp->rq_chandle.thread_wait = 5*HZ;
else
rqstp->rq_chandle.thread_wait = 1*HZ;
- trace_svc_xprt_dequeue(rqstp);
return rqstp->rq_xprt;
}




2023-07-10 16:47:11

by Chuck Lever

Subject: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do

From: Chuck Lever <[email protected]>

Measure a source of thread scheduling inefficiency -- count threads
that were awoken but found that the transport queue had already been
emptied.

An empty transport queue is possible when other threads, running in
the window between the wake_up_process() call and the woken thread's
return from the scheduler, have already pulled all remaining work off
the transport queue via the first svc_xprt_dequeue() call in
svc_get_next_xprt().
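
To make the window concrete, here is a minimal user-space sketch of
the same race (plain C11 with pthreads and <stdatomic.h>, not kernel
code): a worker is "woken" for one queued item, but a concurrently
running worker may drain the queue before the woken one gets on CPU,
and the wake-up is then counted as having found no work. All names
(work_items, found_no_work, woken_worker) are illustrative stand-ins
for the transport queue and the new sp_threads_no_work counter.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int work_items;      /* stand-in for the transport queue */
static atomic_int found_no_work;   /* stand-in for sp_threads_no_work  */

/* Atomically take one item off the "queue"; returns 1 on success. */
static int dequeue_work(void)
{
        int old = atomic_load(&work_items);

        while (old > 0)
                if (atomic_compare_exchange_weak(&work_items, &old, old - 1))
                        return 1;
        return 0;
}

/* The thread that was just "woken": by the time it runs, the item may
 * already be gone. */
static void *woken_worker(void *arg)
{
        (void)arg;
        if (!dequeue_work())
                atomic_fetch_add(&found_no_work, 1);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        atomic_store(&work_items, 1);                   /* enqueue one item */
        pthread_create(&tid, NULL, woken_worker, NULL); /* ...and wake a thread */
        dequeue_work();                 /* an already-running worker may win */
        pthread_join(tid, NULL);

        /* Prints 0 or 1 depending on which thread won the race. */
        printf("wake-ups that found no work: %d\n",
               atomic_load(&found_no_work));
        return 0;
}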

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc.h | 1 +
net/sunrpc/svc.c | 2 ++
net/sunrpc/svc_xprt.c | 7 ++++---
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 74ea13270679..9dd3b16cc4c2 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -43,6 +43,7 @@ struct svc_pool {
struct percpu_counter sp_threads_woken;
struct percpu_counter sp_threads_timedout;
struct percpu_counter sp_threads_starved;
+ struct percpu_counter sp_threads_no_work;

#define SP_TASK_PENDING (0) /* still work to do even if no
* xprt is queued. */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 88b7b5fb6d75..b7a02309ecb1 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -518,6 +518,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
percpu_counter_init(&pool->sp_threads_woken, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_threads_timedout, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_threads_starved, 0, GFP_KERNEL);
+ percpu_counter_init(&pool->sp_threads_no_work, 0, GFP_KERNEL);
}

return serv;
@@ -595,6 +596,7 @@ svc_destroy(struct kref *ref)
percpu_counter_destroy(&pool->sp_threads_woken);
percpu_counter_destroy(&pool->sp_threads_timedout);
percpu_counter_destroy(&pool->sp_threads_starved);
+ percpu_counter_destroy(&pool->sp_threads_no_work);
}
kfree(serv->sv_pools);
kfree(serv);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index ecbccf0d89b9..6c2a702aa469 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -776,9 +776,9 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)

if (!time_left)
percpu_counter_inc(&pool->sp_threads_timedout);
-
if (signalled() || kthread_should_stop())
return ERR_PTR(-EINTR);
+ percpu_counter_inc(&pool->sp_threads_no_work);
return ERR_PTR(-EAGAIN);
out_found:
/* Normally we will wait up to 5 seconds for any required
@@ -1445,13 +1445,14 @@ static int svc_pool_stats_show(struct seq_file *m, void *p)
return 0;
}

- seq_printf(m, "%u %llu %llu %llu %llu %llu\n",
+ seq_printf(m, "%u %llu %llu %llu %llu %llu %llu\n",
pool->sp_id,
percpu_counter_sum_positive(&pool->sp_messages_arrived),
percpu_counter_sum_positive(&pool->sp_sockets_queued),
percpu_counter_sum_positive(&pool->sp_threads_woken),
percpu_counter_sum_positive(&pool->sp_threads_timedout),
- percpu_counter_sum_positive(&pool->sp_threads_starved));
+ percpu_counter_sum_positive(&pool->sp_threads_starved),
+ percpu_counter_sum_positive(&pool->sp_threads_no_work));

return 0;
}



2023-07-10 16:47:21

by Chuck Lever

Subject: [PATCH v3 8/9] SUNRPC: Replace sp_threads_all with an xarray

From: Chuck Lever <[email protected]>

We want a thread lookup operation that can be done with RCU only,
but also we want to avoid the linked-list walk, which does not scale
well in the number of pool threads.

This patch splits out the use of the sp_lock to protect the set
of threads. Svc thread information is now protected by the xarray's
lock (when making thread count changes) and the RCU read lock (when
only looking up a thread).

Since thread count changes are done only via nfsd filesystem API,
which runs only in process context, we can safely dispense with the
use of a bottom-half-disabled lock.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/nfsd/nfssvc.c | 3 +-
include/linux/sunrpc/svc.h | 11 +++----
include/trace/events/sunrpc.h | 47 ++++++++++++++++++++++++++++-
net/sunrpc/svc.c | 67 +++++++++++++++++++++++++----------------
net/sunrpc/svc_xprt.c | 2 +
5 files changed, 94 insertions(+), 36 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 2154fa63c5f2..d42b2a40c93c 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -62,8 +62,7 @@ static __be32 nfsd_init_request(struct svc_rqst *,
* If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
* properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
* nn->keep_active is set). That number of nfsd threads must
- * exist and each must be listed in ->sp_all_threads in some entry of
- * ->sv_pools[].
+ * exist and each must be listed in some entry of ->sv_pools[].
*
* Each active thread holds a counted reference on nn->nfsd_serv, as does
* the nn->keep_active flag and various transient calls to svc_get().
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 9dd3b16cc4c2..86377506a514 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -32,10 +32,10 @@
*/
struct svc_pool {
unsigned int sp_id; /* pool id; also node id on NUMA */
- spinlock_t sp_lock; /* protects all fields */
+ spinlock_t sp_lock; /* protects sp_sockets */
struct list_head sp_sockets; /* pending sockets */
unsigned int sp_nrthreads; /* # of threads in pool */
- struct list_head sp_all_threads; /* all server threads */
+ struct xarray sp_thread_xa;

/* statistics on pool operation */
struct percpu_counter sp_messages_arrived;
@@ -196,7 +196,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
* processed.
*/
struct svc_rqst {
- struct list_head rq_all; /* all threads list */
struct rcu_head rq_rcu_head; /* for RCU deferred kfree */
struct svc_xprt * rq_xprt; /* transport ptr */

@@ -241,10 +240,10 @@ struct svc_rqst {
#define RQ_SPLICE_OK (4) /* turned off in gss privacy
* to prevent encrypting page
* cache pages */
-#define RQ_VICTIM (5) /* about to be shut down */
-#define RQ_BUSY (6) /* request is busy */
-#define RQ_DATA (7) /* request has data */
+#define RQ_BUSY (5) /* request is busy */
+#define RQ_DATA (6) /* request has data */
unsigned long rq_flags; /* flags field */
+ u32 rq_thread_id; /* xarray index */
ktime_t rq_qtime; /* enqueue time */

void * rq_argp; /* decoded arguments */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 60c8e03268d4..ea43c6059bdb 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
svc_rqst_flag(USEDEFERRAL) \
svc_rqst_flag(DROPME) \
svc_rqst_flag(SPLICE_OK) \
- svc_rqst_flag(VICTIM) \
svc_rqst_flag(BUSY) \
svc_rqst_flag_end(DATA)

@@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
)
);

+DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
+ TP_PROTO(
+ const struct svc_serv *serv,
+ const struct svc_pool *pool,
+ const struct svc_rqst *rqstp
+ ),
+
+ TP_ARGS(serv, pool, rqstp),
+
+ TP_STRUCT__entry(
+ __string(name, serv->sv_name)
+ __field(int, pool_id)
+ __field(unsigned int, nrthreads)
+ __field(unsigned long, pool_flags)
+ __field(u32, thread_id)
+ __field(const void *, rqstp)
+ ),
+
+ TP_fast_assign(
+ __assign_str(name, serv->sv_name);
+ __entry->pool_id = pool->sp_id;
+ __entry->nrthreads = pool->sp_nrthreads;
+ __entry->pool_flags = pool->sp_flags;
+ __entry->thread_id = rqstp->rq_thread_id;
+ __entry->rqstp = rqstp;
+ ),
+
+ TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
+ __get_str(name), __entry->pool_id,
+ show_svc_pool_flags(__entry->pool_flags),
+ __entry->nrthreads, __entry->thread_id
+ )
+);
+
+#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
+ DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
+ TP_PROTO( \
+ const struct svc_serv *serv, \
+ const struct svc_pool *pool, \
+ const struct svc_rqst *rqstp \
+ ), \
+ TP_ARGS(serv, pool, rqstp))
+
+DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
+DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
+
DECLARE_EVENT_CLASS(svc_xprt_event,
TP_PROTO(
const struct svc_xprt *xprt
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ad29df00b454..109d7f047385 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,

pool->sp_id = i;
INIT_LIST_HEAD(&pool->sp_sockets);
- INIT_LIST_HEAD(&pool->sp_all_threads);
spin_lock_init(&pool->sp_lock);
+ xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);

percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
@@ -596,6 +596,8 @@ svc_destroy(struct kref *ref)
percpu_counter_destroy(&pool->sp_threads_timedout);
percpu_counter_destroy(&pool->sp_threads_starved);
percpu_counter_destroy(&pool->sp_threads_no_work);
+
+ xa_destroy(&pool->sp_thread_xa);
}
kfree(serv->sv_pools);
kfree(serv);
@@ -676,7 +678,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
static struct svc_rqst *
svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
{
+ struct xa_limit limit = {
+ .max = U32_MAX,
+ };
struct svc_rqst *rqstp;
+ int ret;

rqstp = svc_rqst_alloc(serv, pool, node);
if (!rqstp)
@@ -687,11 +693,21 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
serv->sv_nrthreads += 1;
spin_unlock_bh(&serv->sv_lock);

- spin_lock_bh(&pool->sp_lock);
+ xa_lock(&pool->sp_thread_xa);
+ ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
+ limit, GFP_KERNEL);
+ if (ret) {
+ xa_unlock(&pool->sp_thread_xa);
+ goto out_free;
+ }
pool->sp_nrthreads++;
- list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
- spin_unlock_bh(&pool->sp_lock);
+ xa_unlock(&pool->sp_thread_xa);
+ trace_svc_pool_thread_init(serv, pool, rqstp);
return rqstp;
+
+out_free:
+ svc_rqst_free(rqstp);
+ return ERR_PTR(ret);
}

/**
@@ -708,19 +724,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
struct svc_pool *pool)
{
struct svc_rqst *rqstp;
+ unsigned long index;

- rcu_read_lock();
- list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+ xa_for_each(&pool->sp_thread_xa, index, rqstp) {
if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
continue;

- rcu_read_unlock();
WRITE_ONCE(rqstp->rq_qtime, ktime_get());
wake_up_process(rqstp->rq_task);
percpu_counter_inc(&pool->sp_threads_woken);
return rqstp;
}
- rcu_read_unlock();

trace_svc_pool_starved(serv, pool);
percpu_counter_inc(&pool->sp_threads_starved);
@@ -736,32 +750,33 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
static struct task_struct *
svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
{
- unsigned int i;
struct task_struct *task = NULL;
+ struct svc_rqst *rqstp;
+ unsigned int i;

if (pool != NULL) {
- spin_lock_bh(&pool->sp_lock);
+ xa_lock(&pool->sp_thread_xa);
+ if (!pool->sp_nrthreads)
+ goto out;
} else {
for (i = 0; i < serv->sv_nrpools; i++) {
pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
- spin_lock_bh(&pool->sp_lock);
- if (!list_empty(&pool->sp_all_threads))
+ xa_lock(&pool->sp_thread_xa);
+ if (pool->sp_nrthreads)
goto found_pool;
- spin_unlock_bh(&pool->sp_lock);
+ xa_unlock(&pool->sp_thread_xa);
}
return NULL;
}

found_pool:
- if (!list_empty(&pool->sp_all_threads)) {
- struct svc_rqst *rqstp;
-
- rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
- set_bit(RQ_VICTIM, &rqstp->rq_flags);
- list_del_rcu(&rqstp->rq_all);
+ rqstp = xa_load(&pool->sp_thread_xa, pool->sp_nrthreads - 1);
+ if (rqstp) {
+ __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
task = rqstp->rq_task;
}
- spin_unlock_bh(&pool->sp_lock);
+out:
+ xa_unlock(&pool->sp_thread_xa);
return task;
}

@@ -843,9 +858,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
if (pool == NULL) {
nrservs -= serv->sv_nrthreads;
} else {
- spin_lock_bh(&pool->sp_lock);
+ xa_lock(&pool->sp_thread_xa);
nrservs -= pool->sp_nrthreads;
- spin_unlock_bh(&pool->sp_lock);
+ xa_unlock(&pool->sp_thread_xa);
}

if (nrservs > 0)
@@ -932,11 +947,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
struct svc_serv *serv = rqstp->rq_server;
struct svc_pool *pool = rqstp->rq_pool;

- spin_lock_bh(&pool->sp_lock);
+ xa_lock(&pool->sp_thread_xa);
pool->sp_nrthreads--;
- if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
- list_del_rcu(&rqstp->rq_all);
- spin_unlock_bh(&pool->sp_lock);
+ __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
+ xa_unlock(&pool->sp_thread_xa);
+ trace_svc_pool_thread_exit(serv, pool, rqstp);

spin_lock_bh(&serv->sv_lock);
serv->sv_nrthreads -= 1;
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 6c2a702aa469..db40f771b60a 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);

/* SMP locking strategy:
*
- * svc_pool->sp_lock protects most of the fields of that pool.
+ * svc_pool->sp_lock protects sp_sockets.
* svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
* when both need to be taken (rare), svc_serv->sv_lock is first.
* The "service mutex" protects svc_serv->sv_nrthread.



2023-07-10 16:47:22

by Chuck Lever

Subject: [PATCH v3 7/9] SUNRPC: Replace dprintk() call site in __svc_create()

From: Chuck Lever <[email protected]>

Done as part of converting SunRPC observability from printk-style to
tracepoints.

Signed-off-by: Chuck Lever <[email protected]>
---
include/trace/events/sunrpc.h | 23 +++++++++++++++++++++++
net/sunrpc/svc.c | 5 ++---
2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 2e83887b58cd..60c8e03268d4 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1918,6 +1918,29 @@ TRACE_EVENT(svc_stats_latency,
__get_str(procedure), __entry->execute)
);

+TRACE_EVENT(svc_pool_init,
+ TP_PROTO(
+ const struct svc_serv *serv,
+ const struct svc_pool *pool
+ ),
+
+ TP_ARGS(serv, pool),
+
+ TP_STRUCT__entry(
+ __string(name, serv->sv_name)
+ __field(int, pool_id)
+ ),
+
+ TP_fast_assign(
+ __assign_str(name, serv->sv_name);
+ __entry->pool_id = pool->sp_id;
+ ),
+
+ TP_printk("service=%s pool=%d",
+ __get_str(name), __entry->pool_id
+ )
+);
+
#define show_svc_xprt_flags(flags) \
__print_flags(flags, "|", \
{ BIT(XPT_BUSY), "BUSY" }, \
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index b02a672aaada..ad29df00b454 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -505,9 +505,6 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
for (i = 0; i < serv->sv_nrpools; i++) {
struct svc_pool *pool = &serv->sv_pools[i];

- dprintk("svc: initialising pool %u for %s\n",
- i, serv->sv_name);
-
pool->sp_id = i;
INIT_LIST_HEAD(&pool->sp_sockets);
INIT_LIST_HEAD(&pool->sp_all_threads);
@@ -519,6 +516,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
percpu_counter_init(&pool->sp_threads_timedout, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_threads_starved, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_threads_no_work, 0, GFP_KERNEL);
+
+ trace_svc_pool_init(serv, pool);
}

return serv;



2023-07-10 16:47:22

by Chuck Lever

Subject: [PATCH v3 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

From: Chuck Lever <[email protected]>

I've noticed that client-observed server request latency goes up
simply when the nfsd thread count is increased.

Walking the whole set of pool threads is memory-inefficient. On a
busy server with many threads, enqueuing a transport will visit all
the threads in the pool quite frequently. This also pulls in the
cache lines for some hot fields in each svc_rqst (namely, rq_flags).

The svc_xprt_enqueue() call that concerns me most is the one in
svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
down completion handling limits the total throughput per RDMA
connection.

Instead, set up a busy bitmap and use find_next_clear_bit, which
should work the same way as RQ_BUSY but will touch only the cache
lines that the bitmap is in. Stick with atomic bit operations to
avoid taking a spinlock during the search.
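
For illustration only, here is a rough user-space translation of that
search-and-claim pattern, using GCC/Clang __atomic builtins rather
than the kernel's bitmap API: scan for a clear bit, then atomically
test-and-set it so that only one caller can claim a given thread
slot. POOL_MAX_THREADS, busy_map and claim_idle_slot() are made-up
names; the real code uses sp_busy_map with for_each_clear_bit() and
test_and_set_bit().

#include <limits.h>
#include <stdio.h>

#define POOL_MAX_THREADS 4096
#define BITS_PER_WORD    (sizeof(unsigned long) * CHAR_BIT)

static unsigned long busy_map[POOL_MAX_THREADS / BITS_PER_WORD];

/* Find and atomically claim the first idle slot below @nr_threads.
 * Returns the claimed slot index, or -1 if every slot is busy. */
static long claim_idle_slot(unsigned int nr_threads)
{
        for (unsigned long bit = 0; bit < nr_threads; bit++) {
                unsigned long *word = &busy_map[bit / BITS_PER_WORD];
                unsigned long mask = 1UL << (bit % BITS_PER_WORD);

                if (__atomic_load_n(word, __ATOMIC_RELAXED) & mask)
                        continue;       /* already busy, keep looking */
                /* test-and-set: only one caller wins the race for this bit */
                if (!(__atomic_fetch_or(word, mask, __ATOMIC_ACQUIRE) & mask))
                        return bit;
        }
        return -1;                      /* pool "starved" */
}

int main(void)
{
        printf("claimed slot %ld\n", claim_idle_slot(8));  /* prints 0 */
        printf("claimed slot %ld\n", claim_idle_slot(8));  /* prints 1 */
        return 0;
}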

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc.h | 6 ++++--
include/trace/events/sunrpc.h | 1 -
net/sunrpc/svc.c | 24 +++++++++++++++++++-----
net/sunrpc/svc_xprt.c | 26 ++++++++++++++++++++------
4 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 86377506a514..6669f3eb9ed4 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -35,6 +35,7 @@ struct svc_pool {
spinlock_t sp_lock; /* protects sp_sockets */
struct list_head sp_sockets; /* pending sockets */
unsigned int sp_nrthreads; /* # of threads in pool */
+ unsigned long *sp_busy_map; /* running threads */
struct xarray sp_thread_xa;

/* statistics on pool operation */
@@ -191,6 +192,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
#define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
+ 2 + 1)

+#define RPCSVC_MAXPOOLTHREADS (4096)
+
/*
* The context of a single thread, including the request currently being
* processed.
@@ -240,8 +243,7 @@ struct svc_rqst {
#define RQ_SPLICE_OK (4) /* turned off in gss privacy
* to prevent encrypting page
* cache pages */
-#define RQ_BUSY (5) /* request is busy */
-#define RQ_DATA (6) /* request has data */
+#define RQ_DATA (5) /* request has data */
unsigned long rq_flags; /* flags field */
u32 rq_thread_id; /* xarray index */
ktime_t rq_qtime; /* enqueue time */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index ea43c6059bdb..c07824a254bf 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
svc_rqst_flag(USEDEFERRAL) \
svc_rqst_flag(DROPME) \
svc_rqst_flag(SPLICE_OK) \
- svc_rqst_flag(BUSY) \
svc_rqst_flag_end(DATA)

#undef svc_rqst_flag
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 109d7f047385..f6305b66fd28 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
INIT_LIST_HEAD(&pool->sp_sockets);
spin_lock_init(&pool->sp_lock);
xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
+ pool->sp_busy_map =
+ bitmap_alloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
+ svc_pool_map_get_node(i));
+ if (!pool->sp_busy_map)
+ return NULL;
+ bitmap_fill(pool->sp_busy_map, RPCSVC_MAXPOOLTHREADS);

percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
@@ -598,6 +604,8 @@ svc_destroy(struct kref *ref)
percpu_counter_destroy(&pool->sp_threads_no_work);

xa_destroy(&pool->sp_thread_xa);
+ bitmap_free(pool->sp_busy_map);
+ pool->sp_busy_map = NULL;
}
kfree(serv->sv_pools);
kfree(serv);
@@ -649,7 +657,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)

folio_batch_init(&rqstp->rq_fbatch);

- __set_bit(RQ_BUSY, &rqstp->rq_flags);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;

@@ -679,7 +686,7 @@ static struct svc_rqst *
svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
{
struct xa_limit limit = {
- .max = U32_MAX,
+ .max = RPCSVC_MAXPOOLTHREADS,
};
struct svc_rqst *rqstp;
int ret;
@@ -724,12 +731,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
struct svc_pool *pool)
{
struct svc_rqst *rqstp;
- unsigned long index;
+ unsigned long bit;

- xa_for_each(&pool->sp_thread_xa, index, rqstp) {
- if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
+ /* Check the pool's idle bitmap locklessly so that multiple
+ * idle searches can proceed concurrently.
+ */
+ for_each_clear_bit(bit, pool->sp_busy_map, pool->sp_nrthreads) {
+ if (test_and_set_bit(bit, pool->sp_busy_map))
continue;

+ rqstp = xa_load(&pool->sp_thread_xa, bit);
+ if (!rqstp)
+ break;
+
WRITE_ONCE(rqstp->rq_qtime, ktime_get());
wake_up_process(rqstp->rq_task);
percpu_counter_inc(&pool->sp_threads_woken);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index db40f771b60a..f9c9babe0cba 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -735,6 +735,21 @@ rqst_should_sleep(struct svc_rqst *rqstp)
return true;
}

+static void svc_pool_thread_mark_idle(struct svc_pool *pool,
+ struct svc_rqst *rqstp)
+{
+ clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
+}
+
+/*
+ * Note: If we were awoken, then this rqstp has already been marked busy.
+ */
+static void svc_pool_thread_mark_busy(struct svc_pool *pool,
+ struct svc_rqst *rqstp)
+{
+ test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
+}
+
static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
{
struct svc_pool *pool = rqstp->rq_pool;
@@ -756,18 +771,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
set_current_state(TASK_INTERRUPTIBLE);
smp_mb__before_atomic();
clear_bit(SP_CONGESTED, &pool->sp_flags);
- clear_bit(RQ_BUSY, &rqstp->rq_flags);
- smp_mb__after_atomic();

- if (likely(rqst_should_sleep(rqstp)))
+ if (likely(rqst_should_sleep(rqstp))) {
+ svc_pool_thread_mark_idle(pool, rqstp);
time_left = schedule_timeout(timeout);
- else
+ } else
__set_current_state(TASK_RUNNING);

try_to_freeze();

- set_bit(RQ_BUSY, &rqstp->rq_flags);
- smp_mb__after_atomic();
+ svc_pool_thread_mark_busy(pool, rqstp);
+
rqstp->rq_xprt = svc_xprt_dequeue(pool);
if (rqstp->rq_xprt) {
trace_svc_pool_awoken(rqstp);



2023-07-10 16:47:37

by Chuck Lever

Subject: [PATCH v3 6/9] SUNRPC: Clean up svc_set_num_threads

From: Chuck Lever <[email protected]>

Document the API contract and remove stale or obvious comments.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/svc.c | 60 +++++++++++++++++++++++-------------------------------
1 file changed, 25 insertions(+), 35 deletions(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index b7a02309ecb1..b02a672aaada 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -728,23 +728,14 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
return NULL;
}

-/*
- * Choose a pool in which to create a new thread, for svc_set_num_threads
- */
-static inline struct svc_pool *
-choose_pool(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
+static struct svc_pool *
+svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
{
- if (pool != NULL)
- return pool;
-
- return &serv->sv_pools[(*state)++ % serv->sv_nrpools];
+ return pool ? pool : &serv->sv_pools[(*state)++ % serv->sv_nrpools];
}

-/*
- * Choose a thread to kill, for svc_set_num_threads
- */
-static inline struct task_struct *
-choose_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
+static struct task_struct *
+svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
{
unsigned int i;
struct task_struct *task = NULL;
@@ -752,7 +743,6 @@ choose_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
if (pool != NULL) {
spin_lock_bh(&pool->sp_lock);
} else {
- /* choose a pool in round-robin fashion */
for (i = 0; i < serv->sv_nrpools; i++) {
pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
spin_lock_bh(&pool->sp_lock);
@@ -767,21 +757,15 @@ choose_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
if (!list_empty(&pool->sp_all_threads)) {
struct svc_rqst *rqstp;

- /*
- * Remove from the pool->sp_all_threads list
- * so we don't try to kill it again.
- */
rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
set_bit(RQ_VICTIM, &rqstp->rq_flags);
list_del_rcu(&rqstp->rq_all);
task = rqstp->rq_task;
}
spin_unlock_bh(&pool->sp_lock);
-
return task;
}

-/* create new threads */
static int
svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
{
@@ -793,13 +777,12 @@ svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)

do {
nrservs--;
- chosen_pool = choose_pool(serv, pool, &state);
-
+ chosen_pool = svc_pool_next(serv, pool, &state);
node = svc_pool_map_get_node(chosen_pool->sp_id);
+
rqstp = svc_prepare_thread(serv, chosen_pool, node);
if (IS_ERR(rqstp))
return PTR_ERR(rqstp);
-
task = kthread_create_on_node(serv->sv_threadfn, rqstp,
node, "%s", serv->sv_name);
if (IS_ERR(task)) {
@@ -818,15 +801,6 @@ svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
return 0;
}

-/*
- * Create or destroy enough new threads to make the number
- * of threads the given number. If `pool' is non-NULL, applies
- * only to threads in that pool, otherwise round-robins between
- * all pools. Caller must ensure that mutual exclusion between this and
- * server startup or shutdown.
- */
-
-/* destroy old threads */
static int
svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
{
@@ -834,9 +808,8 @@ svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
struct task_struct *task;
unsigned int state = serv->sv_nrthreads-1;

- /* destroy old threads */
do {
- task = choose_victim(serv, pool, &state);
+ task = svc_pool_victim(serv, pool, &state);
if (task == NULL)
break;
rqstp = kthread_data(task);
@@ -848,6 +821,23 @@ svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
return 0;
}

+/**
+ * svc_set_num_threads - adjust number of threads per RPC service
+ * @serv: RPC service to adjust
+ * @pool: Specific pool from which to choose threads, or NULL
+ * @nrservs: New number of threads for @serv (0 or less means kill all threads)
+ *
+ * Create or destroy threads to make the number of threads for @serv the
+ * given number. If @pool is non-NULL, change only threads in that pool;
+ * otherwise, round-robin between all pools for @serv. @serv's
+ * sv_nrthreads is adjusted for each thread created or destroyed.
+ *
+ * Caller must ensure mutual exclusion between this and server startup or
+ * shutdown.
+ *
+ * Returns zero on success or a negative errno if an error occurred while
+ * starting a thread.
+ */
int
svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
{



2023-07-10 16:47:59

by Chuck Lever

Subject: [PATCH v3 4/9] SUNRPC: Count ingress RPC messages per svc_pool

From: Chuck Lever <[email protected]>

svc_xprt_enqueue() can be costly, since it involves selecting and
waking up a process.

More than one enqueue is done per incoming RPC. For example,
svc_data_ready() enqueues, and so does svc_xprt_receive(). Also, if
an RPC message requires more than one call to ->recvfrom() to
receive it fully, each one of those calls does an enqueue.

To get a sense of the average number of transport enqueue operations
needed to process an incoming RPC message, re-use the "packets" pool
stat. Track the number of complete RPC messages processed by each
thread pool.
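
As a rough illustration of how the two counters can be combined, the
user-space sketch below (not part of the patch) reads the pool stats
file and divides enqueue operations by arrived messages to estimate
the average number of enqueues per RPC. The /proc/fs/nfsd/pool_stats
path and the column order (pool, packets-arrived, sockets-enqueued,
...) are assumptions based on svc_pool_stats_show() as modified by
this series.

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/fs/nfsd/pool_stats", "r");
        char line[256];

        if (!f) {
                perror("pool_stats");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                unsigned int pool;
                unsigned long long messages, enqueues;

                if (line[0] == '#')
                        continue;       /* skip the header line */
                if (sscanf(line, "%u %llu %llu",
                           &pool, &messages, &enqueues) != 3)
                        continue;
                if (messages)
                        printf("pool %u: %.2f enqueues per RPC message\n",
                               pool, (double)enqueues / messages);
        }
        fclose(f);
        return 0;
}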

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc.h | 1 +
net/sunrpc/svc.c | 2 ++
net/sunrpc/svc_xprt.c | 3 ++-
3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index fbfe6ea737c8..74ea13270679 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -38,6 +38,7 @@ struct svc_pool {
struct list_head sp_all_threads; /* all server threads */

/* statistics on pool operation */
+ struct percpu_counter sp_messages_arrived;
struct percpu_counter sp_sockets_queued;
struct percpu_counter sp_threads_woken;
struct percpu_counter sp_threads_timedout;
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index b79b8b41905d..88b7b5fb6d75 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -513,6 +513,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
INIT_LIST_HEAD(&pool->sp_all_threads);
spin_lock_init(&pool->sp_lock);

+ percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_threads_woken, 0, GFP_KERNEL);
percpu_counter_init(&pool->sp_threads_timedout, 0, GFP_KERNEL);
@@ -589,6 +590,7 @@ svc_destroy(struct kref *ref)
for (i = 0; i < serv->sv_nrpools; i++) {
struct svc_pool *pool = &serv->sv_pools[i];

+ percpu_counter_destroy(&pool->sp_messages_arrived);
percpu_counter_destroy(&pool->sp_sockets_queued);
percpu_counter_destroy(&pool->sp_threads_woken);
percpu_counter_destroy(&pool->sp_threads_timedout);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 7ee095d03996..ecbccf0d89b9 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -897,6 +897,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)

if (serv->sv_stats)
serv->sv_stats->netcnt++;
+ percpu_counter_inc(&rqstp->rq_pool->sp_messages_arrived);
rqstp->rq_stime = ktime_get();
return len;
out_release:
@@ -1446,7 +1447,7 @@ static int svc_pool_stats_show(struct seq_file *m, void *p)

seq_printf(m, "%u %llu %llu %llu %llu %llu\n",
pool->sp_id,
- percpu_counter_sum_positive(&pool->sp_sockets_queued),
+ percpu_counter_sum_positive(&pool->sp_messages_arrived),
percpu_counter_sum_positive(&pool->sp_sockets_queued),
percpu_counter_sum_positive(&pool->sp_threads_woken),
percpu_counter_sum_positive(&pool->sp_threads_timedout),



2023-07-10 18:41:37

by Jeff Layton

Subject: Re: [PATCH v3 0/9] SUNRPC service thread scheduler optimizations

On Mon, 2023-07-10 at 12:41 -0400, Chuck Lever wrote:
> Walking a linked list to find an idle thread is not CPU cache-
> friendly, and in fact I've noted palpable per-request latency
> impacts as the number of nfsd threads on the server increases.
>
> After discussing some possible improvements with Jeff at LSF/MM,
> I've been experimenting with the following series. I've measured an
> order of magnitude latency improvement in the thread lookup time,
> and have managed to keep the whole thing lockless.
>
> This version of the series addresses Neil's earlier comments and
> is robust under load. The first 7 patches in this series seem
> uncontroversial, so I'll push them to the
> "topic-sunrpc-thread-scheduling" branch of:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
>
> The last two are still under discussion. They are posted as part
> of this series for comparison to other proposals, but are not yet
> included in the topic branch. But they are tested and working.
>
>
> Changes since v2:
> * Dropped the patch that converts sp_lock to a simple spinlock
> * Replaced explicit memory barriers in svc_get_next_xprt()
> * Select thread victims from the other end of the bitmap
> * Added a metric for wake-ups that find nothing on the transport queue
>
> Changes since RFC:
> * Add a counter for ingress RPC messages
> * Add a few documenting comments
> * Move the more controversial patches to the end of the series
> * Clarify the refactoring of svc_wake_up()
> * Increase the value of RPCSVC_MAXPOOLTHREADS to 4096 (and tested with that many threads)
> * Optimize the loop in svc_pool_wake_idle_thread()
> * Optimize marking a thread "idle" in svc_get_next_xprt()
>
> ---
>
> Chuck Lever (9):
> SUNRPC: Deduplicate thread wake-up code
> SUNRPC: Report when no service thread is available.
> SUNRPC: Split the svc_xprt_dequeue tracepoint
> SUNRPC: Count ingress RPC messages per svc_pool
> SUNRPC: Count pool threads that were awoken but found no work to do
> SUNRPC: Clean up svc_set_num_threads
> SUNRPC: Replace dprintk() call site in __svc_create()
> SUNRPC: Replace sp_threads_all with an xarray
> SUNRPC: Convert RQ_BUSY into a per-pool bitmap
>
>
> fs/nfsd/nfssvc.c | 3 +-
> include/linux/sunrpc/svc.h | 19 ++--
> include/trace/events/sunrpc.h | 159 ++++++++++++++++++++++++++----
> net/sunrpc/svc.c | 177 ++++++++++++++++++++++------------
> net/sunrpc/svc_xprt.c | 99 ++++++++++---------
> 5 files changed, 323 insertions(+), 134 deletions(-)
>
> --
> Chuck Lever
>

You can add my Reviewed-by to the first 7 patches. The last two are not
quite there yet, I think, but I like how they're looking so far.

Cheers,
--
Jeff Layton <[email protected]>


2023-07-10 18:41:38

by Jeff Layton

Subject: Re: [PATCH v3 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
> From: Chuck Lever <[email protected]>
>
> I've noticed that client-observed server request latency goes up
> simply when the nfsd thread count is increased.
>
> Walking the whole set of pool threads is memory-inefficient. On a
> busy server with many threads, enqueuing a transport will visit all
> the threads in the pool quite frequently. This also pulls in the
> cache lines for some hot fields in each svc_rqst (namely, rq_flags).
>
> The svc_xprt_enqueue() call that concerns me most is the one in
> svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> down completion handling limits the total throughput per RDMA
> connection.
>
> Instead, set up a busy bitmap and use find_next_clear_bit, which
> should work the same way as RQ_BUSY but will touch only the cache
> lines that the bitmap is in. Stick with atomic bit operations to
> avoid taking a spinlock during the search.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc.h | 6 ++++--
> include/trace/events/sunrpc.h | 1 -
> net/sunrpc/svc.c | 24 +++++++++++++++++++-----
> net/sunrpc/svc_xprt.c | 26 ++++++++++++++++++++------
> 4 files changed, 43 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 86377506a514..6669f3eb9ed4 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -35,6 +35,7 @@ struct svc_pool {
> spinlock_t sp_lock; /* protects sp_sockets */
> struct list_head sp_sockets; /* pending sockets */
> unsigned int sp_nrthreads; /* # of threads in pool */
> + unsigned long *sp_busy_map; /* running threads */
> struct xarray sp_thread_xa;
>
> /* statistics on pool operation */
> @@ -191,6 +192,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> + 2 + 1)
>
> +#define RPCSVC_MAXPOOLTHREADS (4096)
> +
> /*
> * The context of a single thread, including the request currently being
> * processed.
> @@ -240,8 +243,7 @@ struct svc_rqst {
> #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> * to prevent encrypting page
> * cache pages */
> -#define RQ_BUSY (5) /* request is busy */
> -#define RQ_DATA (6) /* request has data */
> +#define RQ_DATA (5) /* request has data */
> unsigned long rq_flags; /* flags field */
> u32 rq_thread_id; /* xarray index */
> ktime_t rq_qtime; /* enqueue time */
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index ea43c6059bdb..c07824a254bf 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> svc_rqst_flag(USEDEFERRAL) \
> svc_rqst_flag(DROPME) \
> svc_rqst_flag(SPLICE_OK) \
> - svc_rqst_flag(BUSY) \
> svc_rqst_flag_end(DATA)
>
> #undef svc_rqst_flag
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 109d7f047385..f6305b66fd28 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> INIT_LIST_HEAD(&pool->sp_sockets);
> spin_lock_init(&pool->sp_lock);
> xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> + pool->sp_busy_map =
> + bitmap_alloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> + svc_pool_map_get_node(i));
> + if (!pool->sp_busy_map)
> + return NULL;
> + bitmap_fill(pool->sp_busy_map, RPCSVC_MAXPOOLTHREADS);
>
> percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> @@ -598,6 +604,8 @@ svc_destroy(struct kref *ref)
> percpu_counter_destroy(&pool->sp_threads_no_work);
>
> xa_destroy(&pool->sp_thread_xa);
> + bitmap_free(pool->sp_busy_map);
> + pool->sp_busy_map = NULL;
> }
> kfree(serv->sv_pools);
> kfree(serv);
> @@ -649,7 +657,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
>
> folio_batch_init(&rqstp->rq_fbatch);
>
> - __set_bit(RQ_BUSY, &rqstp->rq_flags);
> rqstp->rq_server = serv;
> rqstp->rq_pool = pool;
>
> @@ -679,7 +686,7 @@ static struct svc_rqst *
> svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> {
> struct xa_limit limit = {
> - .max = U32_MAX,
> + .max = RPCSVC_MAXPOOLTHREADS,
> };
> struct svc_rqst *rqstp;
> int ret;
> @@ -724,12 +731,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> struct svc_pool *pool)
> {
> struct svc_rqst *rqstp;
> - unsigned long index;
> + unsigned long bit;
>
> - xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> - if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> + /* Check the pool's idle bitmap locklessly so that multiple
> + * idle searches can proceed concurrently.
> + */
> + for_each_clear_bit(bit, pool->sp_busy_map, pool->sp_nrthreads) {
> + if (test_and_set_bit(bit, pool->sp_busy_map))
> continue;
>
>

I wonder if we also need to set the sp_busy_map bit when we're trying to
shrink the threadpool? It seems like the code above that is trying to
find a thread to do work could race with the task being killed, such
that we try to take up a task that is no longer functional.

>
> + rqstp = xa_load(&pool->sp_thread_xa, bit);
> + if (!rqstp)
> + break;
> +
> WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> wake_up_process(rqstp->rq_task);
> percpu_counter_inc(&pool->sp_threads_woken);
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index db40f771b60a..f9c9babe0cba 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -735,6 +735,21 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> return true;
> }
>
> +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> + struct svc_rqst *rqstp)
> +{
> + clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
> +}
> +
> +/*
> + * Note: If we were awoken, then this rqstp has already been marked busy.
> + */
> +static void svc_pool_thread_mark_busy(struct svc_pool *pool,
> + struct svc_rqst *rqstp)
> +{
> + test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
> +}
> +
> static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> {
> struct svc_pool *pool = rqstp->rq_pool;
> @@ -756,18 +771,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> set_current_state(TASK_INTERRUPTIBLE);
> smp_mb__before_atomic();
> clear_bit(SP_CONGESTED, &pool->sp_flags);
> - clear_bit(RQ_BUSY, &rqstp->rq_flags);
> - smp_mb__after_atomic();
>
> - if (likely(rqst_should_sleep(rqstp)))
> + if (likely(rqst_should_sleep(rqstp))) {
> + svc_pool_thread_mark_idle(pool, rqstp);
> time_left = schedule_timeout(timeout);
> - else
> + } else
> __set_current_state(TASK_RUNNING);
>
> try_to_freeze();
>
> - set_bit(RQ_BUSY, &rqstp->rq_flags);
> - smp_mb__after_atomic();
> + svc_pool_thread_mark_busy(pool, rqstp);
> +
> rqstp->rq_xprt = svc_xprt_dequeue(pool);
> if (rqstp->rq_xprt) {
> trace_svc_pool_awoken(rqstp);
>
>

--
Jeff Layton <[email protected]>

2023-07-10 18:41:38

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] SUNRPC: Replace sp_threads_all with an xarray

On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
> From: Chuck Lever <[email protected]>
>
> We want a thread lookup operation that can be done with RCU only,
> but also we want to avoid the linked-list walk, which does not scale
> well in the number of pool threads.
>
> This patch splits out the use of the sp_lock to protect the set
> of threads. Svc thread information is now protected by the xarray's
> lock (when making thread count changes) and the RCU read lock (when
> only looking up a thread).
>
> Since thread count changes are done only via nfsd filesystem API,
> which runs only in process context, we can safely dispense with the
> use of a bottom-half-disabled lock.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> fs/nfsd/nfssvc.c | 3 +-
> include/linux/sunrpc/svc.h | 11 +++----
> include/trace/events/sunrpc.h | 47 ++++++++++++++++++++++++++++-
> net/sunrpc/svc.c | 67 +++++++++++++++++++++++++----------------
> net/sunrpc/svc_xprt.c | 2 +
> 5 files changed, 94 insertions(+), 36 deletions(-)
>
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 2154fa63c5f2..d42b2a40c93c 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -62,8 +62,7 @@ static __be32 nfsd_init_request(struct svc_rqst *,
> * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
> * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
> * nn->keep_active is set). That number of nfsd threads must
> - * exist and each must be listed in ->sp_all_threads in some entry of
> - * ->sv_pools[].
> + * exist and each must be listed in some entry of ->sv_pools[].
> *
> * Each active thread holds a counted reference on nn->nfsd_serv, as does
> * the nn->keep_active flag and various transient calls to svc_get().
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 9dd3b16cc4c2..86377506a514 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -32,10 +32,10 @@
> */
> struct svc_pool {
> unsigned int sp_id; /* pool id; also node id on NUMA */
> - spinlock_t sp_lock; /* protects all fields */
> + spinlock_t sp_lock; /* protects sp_sockets */
> struct list_head sp_sockets; /* pending sockets */
> unsigned int sp_nrthreads; /* # of threads in pool */
> - struct list_head sp_all_threads; /* all server threads */
> + struct xarray sp_thread_xa;
>
> /* statistics on pool operation */
> struct percpu_counter sp_messages_arrived;
> @@ -196,7 +196,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> * processed.
> */
> struct svc_rqst {
> - struct list_head rq_all; /* all threads list */
> struct rcu_head rq_rcu_head; /* for RCU deferred kfree */
> struct svc_xprt * rq_xprt; /* transport ptr */
>
> @@ -241,10 +240,10 @@ struct svc_rqst {
> #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> * to prevent encrypting page
> * cache pages */
> -#define RQ_VICTIM (5) /* about to be shut down */
> -#define RQ_BUSY (6) /* request is busy */
> -#define RQ_DATA (7) /* request has data */
> +#define RQ_BUSY (5) /* request is busy */
> +#define RQ_DATA (6) /* request has data */
> unsigned long rq_flags; /* flags field */
> + u32 rq_thread_id; /* xarray index */
> ktime_t rq_qtime; /* enqueue time */
>
> void * rq_argp; /* decoded arguments */
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index 60c8e03268d4..ea43c6059bdb 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> svc_rqst_flag(USEDEFERRAL) \
> svc_rqst_flag(DROPME) \
> svc_rqst_flag(SPLICE_OK) \
> - svc_rqst_flag(VICTIM) \
> svc_rqst_flag(BUSY) \
> svc_rqst_flag_end(DATA)
>
> @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
> )
> );
>
> +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
> + TP_PROTO(
> + const struct svc_serv *serv,
> + const struct svc_pool *pool,
> + const struct svc_rqst *rqstp
> + ),
> +
> + TP_ARGS(serv, pool, rqstp),
> +
> + TP_STRUCT__entry(
> + __string(name, serv->sv_name)
> + __field(int, pool_id)
> + __field(unsigned int, nrthreads)
> + __field(unsigned long, pool_flags)
> + __field(u32, thread_id)
> + __field(const void *, rqstp)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(name, serv->sv_name);
> + __entry->pool_id = pool->sp_id;
> + __entry->nrthreads = pool->sp_nrthreads;
> + __entry->pool_flags = pool->sp_flags;
> + __entry->thread_id = rqstp->rq_thread_id;
> + __entry->rqstp = rqstp;
> + ),
> +
> + TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
> + __get_str(name), __entry->pool_id,
> + show_svc_pool_flags(__entry->pool_flags),
> + __entry->nrthreads, __entry->thread_id
> + )
> +);
> +
> +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
> + DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
> + TP_PROTO( \
> + const struct svc_serv *serv, \
> + const struct svc_pool *pool, \
> + const struct svc_rqst *rqstp \
> + ), \
> + TP_ARGS(serv, pool, rqstp))
> +
> +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
> +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
> +
> DECLARE_EVENT_CLASS(svc_xprt_event,
> TP_PROTO(
> const struct svc_xprt *xprt
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index ad29df00b454..109d7f047385 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>
> pool->sp_id = i;
> INIT_LIST_HEAD(&pool->sp_sockets);
> - INIT_LIST_HEAD(&pool->sp_all_threads);
> spin_lock_init(&pool->sp_lock);
> + xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
>
> percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> @@ -596,6 +596,8 @@ svc_destroy(struct kref *ref)
> percpu_counter_destroy(&pool->sp_threads_timedout);
> percpu_counter_destroy(&pool->sp_threads_starved);
> percpu_counter_destroy(&pool->sp_threads_no_work);
> +
> + xa_destroy(&pool->sp_thread_xa);
> }
> kfree(serv->sv_pools);
> kfree(serv);
> @@ -676,7 +678,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
> static struct svc_rqst *
> svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> {
> + struct xa_limit limit = {
> + .max = U32_MAX,
> + };
> struct svc_rqst *rqstp;
> + int ret;
>
> rqstp = svc_rqst_alloc(serv, pool, node);
> if (!rqstp)
> @@ -687,11 +693,21 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> serv->sv_nrthreads += 1;
> spin_unlock_bh(&serv->sv_lock);
>
> - spin_lock_bh(&pool->sp_lock);
> + xa_lock(&pool->sp_thread_xa);
> + ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
> + limit, GFP_KERNEL);
> + if (ret) {
> + xa_unlock(&pool->sp_thread_xa);
> + goto out_free;
> + }
> pool->sp_nrthreads++;
> - list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
> - spin_unlock_bh(&pool->sp_lock);
> + xa_unlock(&pool->sp_thread_xa);
> + trace_svc_pool_thread_init(serv, pool, rqstp);
> return rqstp;
> +
> +out_free:
> + svc_rqst_free(rqstp);
> + return ERR_PTR(ret);
> }
>
> /**
> @@ -708,19 +724,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> struct svc_pool *pool)
> {
> struct svc_rqst *rqstp;
> + unsigned long index;
>
> - rcu_read_lock();


While it does do its own locking, the resulting object that xa_for_each
returns needs some protection too. Between xa_for_each returning a rqstp
and calling test_and_set_bit, could the rqstp be freed? I suspect so,
and I think you probably need to keep the rcu_read_lock() call above.
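
Something like this is what I have in mind -- untested, just to show
how far I think the rcu_read_lock() needs to extend:

	rcu_read_lock();
	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
			continue;

		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
		wake_up_process(rqstp->rq_task);
		percpu_counter_inc(&pool->sp_threads_woken);
		rcu_read_unlock();
		return rqstp;
	}
	rcu_read_unlock();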


> - list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> + xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> continue;
>
> - rcu_read_unlock();
> WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> wake_up_process(rqstp->rq_task);
> percpu_counter_inc(&pool->sp_threads_woken);
> return rqstp;
> }
> - rcu_read_unlock();
>

I wonder if this can race with svc_pool_victim below? Can we end up
waking a thread that's already on its way out of the pool? Maybe this is
addressed in your next patch though...

> trace_svc_pool_starved(serv, pool);
> percpu_counter_inc(&pool->sp_threads_starved);
> @@ -736,32 +750,33 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> static struct task_struct *
> svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> {
> - unsigned int i;
> struct task_struct *task = NULL;
> + struct svc_rqst *rqstp;
> + unsigned int i;
>
> if (pool != NULL) {
> - spin_lock_bh(&pool->sp_lock);
> + xa_lock(&pool->sp_thread_xa);
> + if (!pool->sp_nrthreads)
> + goto out;
> } else {
> for (i = 0; i < serv->sv_nrpools; i++) {
> pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
> - spin_lock_bh(&pool->sp_lock);
> - if (!list_empty(&pool->sp_all_threads))
> + xa_lock(&pool->sp_thread_xa);
> + if (pool->sp_nrthreads)
> goto found_pool;
> - spin_unlock_bh(&pool->sp_lock);
> + xa_unlock(&pool->sp_thread_xa);
> }
> return NULL;
> }
>
> found_pool:
> - if (!list_empty(&pool->sp_all_threads)) {
> - struct svc_rqst *rqstp;
> -
> - rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
> - set_bit(RQ_VICTIM, &rqstp->rq_flags);
> - list_del_rcu(&rqstp->rq_all);
> + rqstp = xa_load(&pool->sp_thread_xa, pool->sp_nrthreads - 1);
> + if (rqstp) {
> + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> task = rqstp->rq_task;
> }
> - spin_unlock_bh(&pool->sp_lock);
> +out:
> + xa_unlock(&pool->sp_thread_xa);
> return task;
> }
>
> @@ -843,9 +858,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> if (pool == NULL) {
> nrservs -= serv->sv_nrthreads;
> } else {
> - spin_lock_bh(&pool->sp_lock);
> + xa_lock(&pool->sp_thread_xa);
> nrservs -= pool->sp_nrthreads;
> - spin_unlock_bh(&pool->sp_lock);
> + xa_unlock(&pool->sp_thread_xa);
> }
>
> if (nrservs > 0)
> @@ -932,11 +947,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
> struct svc_serv *serv = rqstp->rq_server;
> struct svc_pool *pool = rqstp->rq_pool;
>
> - spin_lock_bh(&pool->sp_lock);
> + xa_lock(&pool->sp_thread_xa);
> pool->sp_nrthreads--;
> - if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
> - list_del_rcu(&rqstp->rq_all);
> - spin_unlock_bh(&pool->sp_lock);
> + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> + xa_unlock(&pool->sp_thread_xa);
> + trace_svc_pool_thread_exit(serv, pool, rqstp);
>
> spin_lock_bh(&serv->sv_lock);
> serv->sv_nrthreads -= 1;
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 6c2a702aa469..db40f771b60a 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);
>
> /* SMP locking strategy:
> *
> - * svc_pool->sp_lock protects most of the fields of that pool.
> + * svc_pool->sp_lock protects sp_sockets.
> * svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
> * when both need to be taken (rare), svc_serv->sv_lock is first.
> * The "service mutex" protects svc_serv->sv_nrthread.
>
>

Looks like a nice clean conversion otherwise!
--
Jeff Layton <[email protected]>

2023-07-10 19:09:28

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v3 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

On Mon, Jul 10, 2023 at 02:28:56PM -0400, Jeff Layton wrote:
> On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
> > From: Chuck Lever <[email protected]>
> >
> > I've noticed that client-observed server request latency goes up
> > simply when the nfsd thread count is increased.
> >
> > Walking the whole set of pool threads is memory-inefficient. On a
> > busy server with many threads, enqueuing a transport will visit all
> > the threads in the pool quite frequently. This also pulls in the
> > cache lines for some hot fields in each svc_rqst (namely, rq_flags).
> >
> > The svc_xprt_enqueue() call that concerns me most is the one in
> > svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> > down completion handling limits the total throughput per RDMA
> > connection.
> >
> > Instead, set up a busy bitmap and use find_next_clear_bit, which
> > should work the same way as RQ_BUSY but will touch only the cache
> > lines that the bitmap is in. Stick with atomic bit operations to
> > avoid taking a spinlock during the search.
> >
> > Signed-off-by: Chuck Lever <[email protected]>
> > ---
> > include/linux/sunrpc/svc.h | 6 ++++--
> > include/trace/events/sunrpc.h | 1 -
> > net/sunrpc/svc.c | 24 +++++++++++++++++++-----
> > net/sunrpc/svc_xprt.c | 26 ++++++++++++++++++++------
> > 4 files changed, 43 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 86377506a514..6669f3eb9ed4 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -35,6 +35,7 @@ struct svc_pool {
> > spinlock_t sp_lock; /* protects sp_sockets */
> > struct list_head sp_sockets; /* pending sockets */
> > unsigned int sp_nrthreads; /* # of threads in pool */
> > + unsigned long *sp_busy_map; /* running threads */
> > struct xarray sp_thread_xa;
> >
> > /* statistics on pool operation */
> > @@ -191,6 +192,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> > + 2 + 1)
> >
> > +#define RPCSVC_MAXPOOLTHREADS (4096)
> > +
> > /*
> > * The context of a single thread, including the request currently being
> > * processed.
> > @@ -240,8 +243,7 @@ struct svc_rqst {
> > #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> > * to prevent encrypting page
> > * cache pages */
> > -#define RQ_BUSY (5) /* request is busy */
> > -#define RQ_DATA (6) /* request has data */
> > +#define RQ_DATA (5) /* request has data */
> > unsigned long rq_flags; /* flags field */
> > u32 rq_thread_id; /* xarray index */
> > ktime_t rq_qtime; /* enqueue time */
> > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > index ea43c6059bdb..c07824a254bf 100644
> > --- a/include/trace/events/sunrpc.h
> > +++ b/include/trace/events/sunrpc.h
> > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> > svc_rqst_flag(USEDEFERRAL) \
> > svc_rqst_flag(DROPME) \
> > svc_rqst_flag(SPLICE_OK) \
> > - svc_rqst_flag(BUSY) \
> > svc_rqst_flag_end(DATA)
> >
> > #undef svc_rqst_flag
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index 109d7f047385..f6305b66fd28 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> > INIT_LIST_HEAD(&pool->sp_sockets);
> > spin_lock_init(&pool->sp_lock);
> > xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > + pool->sp_busy_map =
> > + bitmap_alloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> > + svc_pool_map_get_node(i));
> > + if (!pool->sp_busy_map)
> > + return NULL;
> > + bitmap_fill(pool->sp_busy_map, RPCSVC_MAXPOOLTHREADS);
> >
> > percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > @@ -598,6 +604,8 @@ svc_destroy(struct kref *ref)
> > percpu_counter_destroy(&pool->sp_threads_no_work);
> >
> > xa_destroy(&pool->sp_thread_xa);
> > + bitmap_free(pool->sp_busy_map);
> > + pool->sp_busy_map = NULL;
> > }
> > kfree(serv->sv_pools);
> > kfree(serv);
> > @@ -649,7 +657,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
> >
> > folio_batch_init(&rqstp->rq_fbatch);
> >
> > - __set_bit(RQ_BUSY, &rqstp->rq_flags);
> > rqstp->rq_server = serv;
> > rqstp->rq_pool = pool;
> >
> > @@ -679,7 +686,7 @@ static struct svc_rqst *
> > svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > {
> > struct xa_limit limit = {
> > - .max = U32_MAX,
> > + .max = RPCSVC_MAXPOOLTHREADS,
> > };
> > struct svc_rqst *rqstp;
> > int ret;
> > @@ -724,12 +731,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > struct svc_pool *pool)
> > {
> > struct svc_rqst *rqstp;
> > - unsigned long index;
> > + unsigned long bit;
> >
> > - xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > - if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > + /* Check the pool's idle bitmap locklessly so that multiple
> > + * idle searches can proceed concurrently.
> > + */
> > + for_each_clear_bit(bit, pool->sp_busy_map, pool->sp_nrthreads) {
> > + if (test_and_set_bit(bit, pool->sp_busy_map))
> > continue;
> >
> >
>
> I wonder if we also need to set the sp_busy_map bit when we're trying to
> shrink the threadpool? It seems like the code above that is trying to
> find a thread to do work could race with the task being killed, such
> that we try to wake up a task that is no longer functional.

What I /think/ will happen is that while svc_pool_victim() is
holding the xa_lock, it will remove the rqst from the xarray.

Then when svc_pool_wake_idle_thread() calls xa_load(), it should
get a NULL, and treat that like a pool that doesn't have an
available thread to wake.


> > + rqstp = xa_load(&pool->sp_thread_xa, bit);
> > + if (!rqstp)
> > + break;
> > +
> > WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> > wake_up_process(rqstp->rq_task);
> > percpu_counter_inc(&pool->sp_threads_woken);
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index db40f771b60a..f9c9babe0cba 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -735,6 +735,21 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> > return true;
> > }
> >
> > +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> > + struct svc_rqst *rqstp)
> > +{
> > + clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
> > +}
> > +
> > +/*
> > + * Note: If we were awoken, then this rqstp has already been marked busy.
> > + */
> > +static void svc_pool_thread_mark_busy(struct svc_pool *pool,
> > + struct svc_rqst *rqstp)
> > +{
> > + test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
> > +}
> > +
> > static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> > {
> > struct svc_pool *pool = rqstp->rq_pool;
> > @@ -756,18 +771,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> > set_current_state(TASK_INTERRUPTIBLE);
> > smp_mb__before_atomic();
> > clear_bit(SP_CONGESTED, &pool->sp_flags);
> > - clear_bit(RQ_BUSY, &rqstp->rq_flags);
> > - smp_mb__after_atomic();
> >
> > - if (likely(rqst_should_sleep(rqstp)))
> > + if (likely(rqst_should_sleep(rqstp))) {
> > + svc_pool_thread_mark_idle(pool, rqstp);
> > time_left = schedule_timeout(timeout);
> > - else
> > + } else
> > __set_current_state(TASK_RUNNING);
> >
> > try_to_freeze();
> >
> > - set_bit(RQ_BUSY, &rqstp->rq_flags);
> > - smp_mb__after_atomic();
> > + svc_pool_thread_mark_busy(pool, rqstp);
> > +
> > rqstp->rq_xprt = svc_xprt_dequeue(pool);
> > if (rqstp->rq_xprt) {
> > trace_svc_pool_awoken(rqstp);
> >
> >
>
> --
> Jeff Layton <[email protected]>

2023-07-10 20:26:46

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v3 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

On Mon, 2023-07-10 at 14:57 -0400, Chuck Lever wrote:
> On Mon, Jul 10, 2023 at 02:28:56PM -0400, Jeff Layton wrote:
> > On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
> > > From: Chuck Lever <[email protected]>
> > >
> > > I've noticed that client-observed server request latency goes up
> > > simply when the nfsd thread count is increased.
> > >
> > > Walking the whole set of pool threads is memory-inefficient. On a
> > > busy server with many threads, enqueuing a transport will visit all
> > > the threads in the pool quite frequently. This also pulls in the
> > > cache lines for some hot fields in each svc_rqst (namely, rq_flags).
> > >
> > > The svc_xprt_enqueue() call that concerns me most is the one in
> > > svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> > > down completion handling limits the total throughput per RDMA
> > > connection.
> > >
> > > Instead, set up a busy bitmap and use find_next_clear_bit, which
> > > should work the same way as RQ_BUSY but will touch only the cache
> > > lines that the bitmap is in. Stick with atomic bit operations to
> > > avoid taking a spinlock during the search.
> > >
> > > Signed-off-by: Chuck Lever <[email protected]>
> > > ---
> > > include/linux/sunrpc/svc.h | 6 ++++--
> > > include/trace/events/sunrpc.h | 1 -
> > > net/sunrpc/svc.c | 24 +++++++++++++++++++-----
> > > net/sunrpc/svc_xprt.c | 26 ++++++++++++++++++++------
> > > 4 files changed, 43 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > index 86377506a514..6669f3eb9ed4 100644
> > > --- a/include/linux/sunrpc/svc.h
> > > +++ b/include/linux/sunrpc/svc.h
> > > @@ -35,6 +35,7 @@ struct svc_pool {
> > > spinlock_t sp_lock; /* protects sp_sockets */
> > > struct list_head sp_sockets; /* pending sockets */
> > > unsigned int sp_nrthreads; /* # of threads in pool */
> > > + unsigned long *sp_busy_map; /* running threads */
> > > struct xarray sp_thread_xa;
> > >
> > > /* statistics on pool operation */
> > > @@ -191,6 +192,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > > #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> > > + 2 + 1)
> > >
> > > +#define RPCSVC_MAXPOOLTHREADS (4096)
> > > +
> > > /*
> > > * The context of a single thread, including the request currently being
> > > * processed.
> > > @@ -240,8 +243,7 @@ struct svc_rqst {
> > > #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> > > * to prevent encrypting page
> > > * cache pages */
> > > -#define RQ_BUSY (5) /* request is busy */
> > > -#define RQ_DATA (6) /* request has data */
> > > +#define RQ_DATA (5) /* request has data */
> > > unsigned long rq_flags; /* flags field */
> > > u32 rq_thread_id; /* xarray index */
> > > ktime_t rq_qtime; /* enqueue time */
> > > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > > index ea43c6059bdb..c07824a254bf 100644
> > > --- a/include/trace/events/sunrpc.h
> > > +++ b/include/trace/events/sunrpc.h
> > > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> > > svc_rqst_flag(USEDEFERRAL) \
> > > svc_rqst_flag(DROPME) \
> > > svc_rqst_flag(SPLICE_OK) \
> > > - svc_rqst_flag(BUSY) \
> > > svc_rqst_flag_end(DATA)
> > >
> > > #undef svc_rqst_flag
> > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > > index 109d7f047385..f6305b66fd28 100644
> > > --- a/net/sunrpc/svc.c
> > > +++ b/net/sunrpc/svc.c
> > > @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> > > INIT_LIST_HEAD(&pool->sp_sockets);
> > > spin_lock_init(&pool->sp_lock);
> > > xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > > + pool->sp_busy_map =
> > > + bitmap_alloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> > > + svc_pool_map_get_node(i));
> > > + if (!pool->sp_busy_map)
> > > + return NULL;
> > > + bitmap_fill(pool->sp_busy_map, RPCSVC_MAXPOOLTHREADS);
> > >
> > > percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > > percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > > @@ -598,6 +604,8 @@ svc_destroy(struct kref *ref)
> > > percpu_counter_destroy(&pool->sp_threads_no_work);
> > >
> > > xa_destroy(&pool->sp_thread_xa);
> > > + bitmap_free(pool->sp_busy_map);
> > > + pool->sp_busy_map = NULL;
> > > }
> > > kfree(serv->sv_pools);
> > > kfree(serv);
> > > @@ -649,7 +657,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
> > >
> > > folio_batch_init(&rqstp->rq_fbatch);
> > >
> > > - __set_bit(RQ_BUSY, &rqstp->rq_flags);
> > > rqstp->rq_server = serv;
> > > rqstp->rq_pool = pool;
> > >
> > > @@ -679,7 +686,7 @@ static struct svc_rqst *
> > > svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > {
> > > struct xa_limit limit = {
> > > - .max = U32_MAX,
> > > + .max = RPCSVC_MAXPOOLTHREADS,
> > > };
> > > struct svc_rqst *rqstp;
> > > int ret;
> > > @@ -724,12 +731,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > > struct svc_pool *pool)
> > > {
> > > struct svc_rqst *rqstp;
> > > - unsigned long index;
> > > + unsigned long bit;
> > >
> > > - xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > > - if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > > + /* Check the pool's idle bitmap locklessly so that multiple
> > > + * idle searches can proceed concurrently.
> > > + */
> > > + for_each_clear_bit(bit, pool->sp_busy_map, pool->sp_nrthreads) {
> > > + if (test_and_set_bit(bit, pool->sp_busy_map))
> > > continue;
> > >
> > >
> >
> > I wonder if we also need to set the sp_busy_map bit when we're trying to
> > shrink the threadpool? It seems like the code above that is trying to
> > find a thread to do work could race with the task being killed, such
> > that we try to wake up a task that is no longer functional.
>
> What I /think/ will happen is that while svc_pool_victim() is
> holding the xa_lock, it will remove the rqst from the xarray.

> Then when svc_pool_wake_idle_thread() calls xa_load(), it should
> get a NULL, and treat that like a pool that doesn't have an
> available thread to wake.
>

Right, but we're searching under RCU, so a search could find the thing
just before it gets removed. IOW, something like this is possible:

receiving task                  rpc.nfsd                        nfsd thread
---------------------------------------------------------------------------------------
svc_pool_wake_idle_thread
xa_load
                                svc_set_num_threads
                                svc_stop_kthreads
                                (remove from xarray)
                                                                svc_exit_thread
test_and_set_bit
wake_up_process

Granted, I think this is a potential race in the existing code too, so
this patch probably doesn't make anything worse.


>
> > > + rqstp = xa_load(&pool->sp_thread_xa, bit);
> > > + if (!rqstp)
> > > + break;
> > > +
> > > WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> > > wake_up_process(rqstp->rq_task);
> > > percpu_counter_inc(&pool->sp_threads_woken);
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index db40f771b60a..f9c9babe0cba 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -735,6 +735,21 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> > > return true;
> > > }
> > >
> > > +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> > > + struct svc_rqst *rqstp)
> > > +{
> > > + clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
> > > +}
> > > +
> > > +/*
> > > + * Note: If we were awoken, then this rqstp has already been marked busy.
> > > + */
> > > +static void svc_pool_thread_mark_busy(struct svc_pool *pool,
> > > + struct svc_rqst *rqstp)
> > > +{
> > > + test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
> > > +}
> > > +
> > > static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> > > {
> > > struct svc_pool *pool = rqstp->rq_pool;
> > > @@ -756,18 +771,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> > > set_current_state(TASK_INTERRUPTIBLE);
> > > smp_mb__before_atomic();
> > > clear_bit(SP_CONGESTED, &pool->sp_flags);
> > > - clear_bit(RQ_BUSY, &rqstp->rq_flags);
> > > - smp_mb__after_atomic();
> > >
> > > - if (likely(rqst_should_sleep(rqstp)))
> > > + if (likely(rqst_should_sleep(rqstp))) {
> > > + svc_pool_thread_mark_idle(pool, rqstp);
> > > time_left = schedule_timeout(timeout);
> > > - else
> > > + } else
> > > __set_current_state(TASK_RUNNING);
> > >
> > > try_to_freeze();
> > >
> > > - set_bit(RQ_BUSY, &rqstp->rq_flags);
> > > - smp_mb__after_atomic();
> > > + svc_pool_thread_mark_busy(pool, rqstp);
> > > +
> > > rqstp->rq_xprt = svc_xprt_dequeue(pool);
> > > if (rqstp->rq_xprt) {
> > > trace_svc_pool_awoken(rqstp);
> > >
> > >
> >
> > --
> > Jeff Layton <[email protected]>

--
Jeff Layton <[email protected]>

2023-07-10 22:34:37

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v3 0/9] SUNRPC service thread scheduler optimizations

On Tue, 11 Jul 2023, Jeff Layton wrote:
>
> You can add my Reviewed-by to the first 7 patches. The last two are not
> quite there yet, I think, but I like how they're looking so far.

Similarly, here is my r-b for the first 7:

Reviewed-By: NeilBrown <[email protected]>

NeilBrown

2023-07-10 22:34:46

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do

On Tue, 11 Jul 2023, Chuck Lever wrote:
> From: Chuck Lever <[email protected]>
>
> Measure a source of thread scheduling inefficiency -- count threads
> that were awoken but found that the transport queue had already been
> emptied.
>
> An empty transport queue is possible when threads that run between
> the wake_up_process() call and the woken thread returning from the
> scheduler have pulled all remaining work off the transport queue
> using the first svc_xprt_dequeue() in svc_get_next_xprt().

I'm in two minds about this. The data being gathered here is
potentially useful - but who is it useful to?
I think it is primarily useful for us - to understand the behaviour of
the implementation so we can know what needs to be optimised.
It isn't really of any use to a sysadmin who wants to understand how
their system is performing.

But then ... who are tracepoints for? Developers or admins?
I guess the fact that we feel free to modify them whenever we need to
means they are primarily for developers? In which case this is a good
patch, and maybe we'll revert the functionality one day if it turns out
that we can change the implementation so that a thread is never woken
when there is no work to do ....

Thanks,
NeilBrown


>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc.h | 1 +
> net/sunrpc/svc.c | 2 ++
> net/sunrpc/svc_xprt.c | 7 ++++---
> 3 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 74ea13270679..9dd3b16cc4c2 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -43,6 +43,7 @@ struct svc_pool {
> struct percpu_counter sp_threads_woken;
> struct percpu_counter sp_threads_timedout;
> struct percpu_counter sp_threads_starved;
> + struct percpu_counter sp_threads_no_work;
>
> #define SP_TASK_PENDING (0) /* still work to do even if no
> * xprt is queued. */
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 88b7b5fb6d75..b7a02309ecb1 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -518,6 +518,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> percpu_counter_init(&pool->sp_threads_woken, 0, GFP_KERNEL);
> percpu_counter_init(&pool->sp_threads_timedout, 0, GFP_KERNEL);
> percpu_counter_init(&pool->sp_threads_starved, 0, GFP_KERNEL);
> + percpu_counter_init(&pool->sp_threads_no_work, 0, GFP_KERNEL);
> }
>
> return serv;
> @@ -595,6 +596,7 @@ svc_destroy(struct kref *ref)
> percpu_counter_destroy(&pool->sp_threads_woken);
> percpu_counter_destroy(&pool->sp_threads_timedout);
> percpu_counter_destroy(&pool->sp_threads_starved);
> + percpu_counter_destroy(&pool->sp_threads_no_work);
> }
> kfree(serv->sv_pools);
> kfree(serv);
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index ecbccf0d89b9..6c2a702aa469 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -776,9 +776,9 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>
> if (!time_left)
> percpu_counter_inc(&pool->sp_threads_timedout);
> -
> if (signalled() || kthread_should_stop())
> return ERR_PTR(-EINTR);
> + percpu_counter_inc(&pool->sp_threads_no_work);
> return ERR_PTR(-EAGAIN);
> out_found:
> /* Normally we will wait up to 5 seconds for any required
> @@ -1445,13 +1445,14 @@ static int svc_pool_stats_show(struct seq_file *m, void *p)
> return 0;
> }
>
> - seq_printf(m, "%u %llu %llu %llu %llu %llu\n",
> + seq_printf(m, "%u %llu %llu %llu %llu %llu %llu\n",
> pool->sp_id,
> percpu_counter_sum_positive(&pool->sp_messages_arrived),
> percpu_counter_sum_positive(&pool->sp_sockets_queued),
> percpu_counter_sum_positive(&pool->sp_threads_woken),
> percpu_counter_sum_positive(&pool->sp_threads_timedout),
> - percpu_counter_sum_positive(&pool->sp_threads_starved));
> + percpu_counter_sum_positive(&pool->sp_threads_starved),
> + percpu_counter_sum_positive(&pool->sp_threads_no_work));
>
> return 0;
> }
>
>
>


2023-07-10 23:11:01

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do



> On Jul 10, 2023, at 6:29 PM, NeilBrown <[email protected]> wrote:
>
> On Tue, 11 Jul 2023, Chuck Lever wrote:
>> From: Chuck Lever <[email protected]>
>>
>> Measure a source of thread scheduling inefficiency -- count threads
>> that were awoken but found that the transport queue had already been
>> emptied.
>>
>> An empty transport queue is possible when threads that run between
>> the wake_up_process() call and the woken thread returning from the
>> scheduler have pulled all remaining work off the transport queue
>> using the first svc_xprt_dequeue() in svc_get_next_xprt().
>
> I'm in two minds about this. The data being gathered here is
> potentially useful

It's actually pretty shocking: I've measured that more than
15% of thread wake-ups find no work to do.


> - but who is it useful to?
> I think it is primarily useful for us - to understand the behaviour of
> the implementation so we can know what needs to be optimised.
> It isn't really of any use to a sysadmin who wants to understand how
> their system is performing.
>
> But then ... who are tracepoints for? Developers or admins?
> I guess the fact that we feel free to modify them whenever we need to
> means they are primarily for developers? In which case this is a good
> patch, and maybe we'll revert the functionality one day if it turns out
> that we can change the implementation so that a thread is never woken
> when there is no work to do ....

A reasonable question to ask. The new "starved" metric
is similar: possibly useful while we are developing the
code, but not particularly valuable for system
administrators.

How are the pool_stats used by administrators?

(And, why are they in /proc/fs/nfsd/ rather than under
something RPC-related?)


>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/svc.h | 1 +
>> net/sunrpc/svc.c | 2 ++
>> net/sunrpc/svc_xprt.c | 7 ++++---
>> 3 files changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> index 74ea13270679..9dd3b16cc4c2 100644
>> --- a/include/linux/sunrpc/svc.h
>> +++ b/include/linux/sunrpc/svc.h
>> @@ -43,6 +43,7 @@ struct svc_pool {
>> struct percpu_counter sp_threads_woken;
>> struct percpu_counter sp_threads_timedout;
>> struct percpu_counter sp_threads_starved;
>> + struct percpu_counter sp_threads_no_work;
>>
>> #define SP_TASK_PENDING (0) /* still work to do even if no
>> * xprt is queued. */
>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>> index 88b7b5fb6d75..b7a02309ecb1 100644
>> --- a/net/sunrpc/svc.c
>> +++ b/net/sunrpc/svc.c
>> @@ -518,6 +518,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>> percpu_counter_init(&pool->sp_threads_woken, 0, GFP_KERNEL);
>> percpu_counter_init(&pool->sp_threads_timedout, 0, GFP_KERNEL);
>> percpu_counter_init(&pool->sp_threads_starved, 0, GFP_KERNEL);
>> + percpu_counter_init(&pool->sp_threads_no_work, 0, GFP_KERNEL);
>> }
>>
>> return serv;
>> @@ -595,6 +596,7 @@ svc_destroy(struct kref *ref)
>> percpu_counter_destroy(&pool->sp_threads_woken);
>> percpu_counter_destroy(&pool->sp_threads_timedout);
>> percpu_counter_destroy(&pool->sp_threads_starved);
>> + percpu_counter_destroy(&pool->sp_threads_no_work);
>> }
>> kfree(serv->sv_pools);
>> kfree(serv);
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index ecbccf0d89b9..6c2a702aa469 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -776,9 +776,9 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>>
>> if (!time_left)
>> percpu_counter_inc(&pool->sp_threads_timedout);
>> -
>> if (signalled() || kthread_should_stop())
>> return ERR_PTR(-EINTR);
>> + percpu_counter_inc(&pool->sp_threads_no_work);
>> return ERR_PTR(-EAGAIN);
>> out_found:
>> /* Normally we will wait up to 5 seconds for any required
>> @@ -1445,13 +1445,14 @@ static int svc_pool_stats_show(struct seq_file *m, void *p)
>> return 0;
>> }
>>
>> - seq_printf(m, "%u %llu %llu %llu %llu %llu\n",
>> + seq_printf(m, "%u %llu %llu %llu %llu %llu %llu\n",
>> pool->sp_id,
>> percpu_counter_sum_positive(&pool->sp_messages_arrived),
>> percpu_counter_sum_positive(&pool->sp_sockets_queued),
>> percpu_counter_sum_positive(&pool->sp_threads_woken),
>> percpu_counter_sum_positive(&pool->sp_threads_timedout),
>> - percpu_counter_sum_positive(&pool->sp_threads_starved));
>> + percpu_counter_sum_positive(&pool->sp_threads_starved),
>> + percpu_counter_sum_positive(&pool->sp_threads_no_work));
>>
>> return 0;
>> }
>>
>>
>>
>

--
Chuck Lever



2023-07-11 01:09:48

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] SUNRPC: Replace sp_threads_all with an xarray



> On Jul 10, 2023, at 2:24 PM, Jeff Layton <[email protected]> wrote:
>
> On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
>> From: Chuck Lever <[email protected]>
>>
>> We want a thread lookup operation that can be done with RCU only,
>> but also we want to avoid the linked-list walk, which does not scale
>> well in the number of pool threads.
>>
>> This patch splits out the use of the sp_lock to protect the set
>> of threads. Svc thread information is now protected by the xarray's
>> lock (when making thread count changes) and the RCU read lock (when
>> only looking up a thread).
>>
>> Since thread count changes are done only via nfsd filesystem API,
>> which runs only in process context, we can safely dispense with the
>> use of a bottom-half-disabled lock.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> fs/nfsd/nfssvc.c | 3 +-
>> include/linux/sunrpc/svc.h | 11 +++----
>> include/trace/events/sunrpc.h | 47 ++++++++++++++++++++++++++++-
>> net/sunrpc/svc.c | 67 +++++++++++++++++++++++++----------------
>> net/sunrpc/svc_xprt.c | 2 +
>> 5 files changed, 94 insertions(+), 36 deletions(-)
>>
>> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
>> index 2154fa63c5f2..d42b2a40c93c 100644
>> --- a/fs/nfsd/nfssvc.c
>> +++ b/fs/nfsd/nfssvc.c
>> @@ -62,8 +62,7 @@ static __be32 nfsd_init_request(struct svc_rqst *,
>> * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
>> * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
>> * nn->keep_active is set). That number of nfsd threads must
>> - * exist and each must be listed in ->sp_all_threads in some entry of
>> - * ->sv_pools[].
>> + * exist and each must be listed in some entry of ->sv_pools[].
>> *
>> * Each active thread holds a counted reference on nn->nfsd_serv, as does
>> * the nn->keep_active flag and various transient calls to svc_get().
>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> index 9dd3b16cc4c2..86377506a514 100644
>> --- a/include/linux/sunrpc/svc.h
>> +++ b/include/linux/sunrpc/svc.h
>> @@ -32,10 +32,10 @@
>> */
>> struct svc_pool {
>> unsigned int sp_id; /* pool id; also node id on NUMA */
>> - spinlock_t sp_lock; /* protects all fields */
>> + spinlock_t sp_lock; /* protects sp_sockets */
>> struct list_head sp_sockets; /* pending sockets */
>> unsigned int sp_nrthreads; /* # of threads in pool */
>> - struct list_head sp_all_threads; /* all server threads */
>> + struct xarray sp_thread_xa;
>>
>> /* statistics on pool operation */
>> struct percpu_counter sp_messages_arrived;
>> @@ -196,7 +196,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>> * processed.
>> */
>> struct svc_rqst {
>> - struct list_head rq_all; /* all threads list */
>> struct rcu_head rq_rcu_head; /* for RCU deferred kfree */
>> struct svc_xprt * rq_xprt; /* transport ptr */
>>
>> @@ -241,10 +240,10 @@ struct svc_rqst {
>> #define RQ_SPLICE_OK (4) /* turned off in gss privacy
>> * to prevent encrypting page
>> * cache pages */
>> -#define RQ_VICTIM (5) /* about to be shut down */
>> -#define RQ_BUSY (6) /* request is busy */
>> -#define RQ_DATA (7) /* request has data */
>> +#define RQ_BUSY (5) /* request is busy */
>> +#define RQ_DATA (6) /* request has data */
>> unsigned long rq_flags; /* flags field */
>> + u32 rq_thread_id; /* xarray index */
>> ktime_t rq_qtime; /* enqueue time */
>>
>> void * rq_argp; /* decoded arguments */
>> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
>> index 60c8e03268d4..ea43c6059bdb 100644
>> --- a/include/trace/events/sunrpc.h
>> +++ b/include/trace/events/sunrpc.h
>> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
>> svc_rqst_flag(USEDEFERRAL) \
>> svc_rqst_flag(DROPME) \
>> svc_rqst_flag(SPLICE_OK) \
>> - svc_rqst_flag(VICTIM) \
>> svc_rqst_flag(BUSY) \
>> svc_rqst_flag_end(DATA)
>>
>> @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
>> )
>> );
>>
>> +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
>> + TP_PROTO(
>> + const struct svc_serv *serv,
>> + const struct svc_pool *pool,
>> + const struct svc_rqst *rqstp
>> + ),
>> +
>> + TP_ARGS(serv, pool, rqstp),
>> +
>> + TP_STRUCT__entry(
>> + __string(name, serv->sv_name)
>> + __field(int, pool_id)
>> + __field(unsigned int, nrthreads)
>> + __field(unsigned long, pool_flags)
>> + __field(u32, thread_id)
>> + __field(const void *, rqstp)
>> + ),
>> +
>> + TP_fast_assign(
>> + __assign_str(name, serv->sv_name);
>> + __entry->pool_id = pool->sp_id;
>> + __entry->nrthreads = pool->sp_nrthreads;
>> + __entry->pool_flags = pool->sp_flags;
>> + __entry->thread_id = rqstp->rq_thread_id;
>> + __entry->rqstp = rqstp;
>> + ),
>> +
>> + TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
>> + __get_str(name), __entry->pool_id,
>> + show_svc_pool_flags(__entry->pool_flags),
>> + __entry->nrthreads, __entry->thread_id
>> + )
>> +);
>> +
>> +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
>> + DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
>> + TP_PROTO( \
>> + const struct svc_serv *serv, \
>> + const struct svc_pool *pool, \
>> + const struct svc_rqst *rqstp \
>> + ), \
>> + TP_ARGS(serv, pool, rqstp))
>> +
>> +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
>> +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
>> +
>> DECLARE_EVENT_CLASS(svc_xprt_event,
>> TP_PROTO(
>> const struct svc_xprt *xprt
>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>> index ad29df00b454..109d7f047385 100644
>> --- a/net/sunrpc/svc.c
>> +++ b/net/sunrpc/svc.c
>> @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>>
>> pool->sp_id = i;
>> INIT_LIST_HEAD(&pool->sp_sockets);
>> - INIT_LIST_HEAD(&pool->sp_all_threads);
>> spin_lock_init(&pool->sp_lock);
>> + xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
>>
>> percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
>> percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
>> @@ -596,6 +596,8 @@ svc_destroy(struct kref *ref)
>> percpu_counter_destroy(&pool->sp_threads_timedout);
>> percpu_counter_destroy(&pool->sp_threads_starved);
>> percpu_counter_destroy(&pool->sp_threads_no_work);
>> +
>> + xa_destroy(&pool->sp_thread_xa);
>> }
>> kfree(serv->sv_pools);
>> kfree(serv);
>> @@ -676,7 +678,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
>> static struct svc_rqst *
>> svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>> {
>> + struct xa_limit limit = {
>> + .max = U32_MAX,
>> + };
>> struct svc_rqst *rqstp;
>> + int ret;
>>
>> rqstp = svc_rqst_alloc(serv, pool, node);
>> if (!rqstp)
>> @@ -687,11 +693,21 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>> serv->sv_nrthreads += 1;
>> spin_unlock_bh(&serv->sv_lock);
>>
>> - spin_lock_bh(&pool->sp_lock);
>> + xa_lock(&pool->sp_thread_xa);
>> + ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
>> + limit, GFP_KERNEL);
>> + if (ret) {
>> + xa_unlock(&pool->sp_thread_xa);
>> + goto out_free;
>> + }
>> pool->sp_nrthreads++;
>> - list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
>> - spin_unlock_bh(&pool->sp_lock);
>> + xa_unlock(&pool->sp_thread_xa);
>> + trace_svc_pool_thread_init(serv, pool, rqstp);
>> return rqstp;
>> +
>> +out_free:
>> + svc_rqst_free(rqstp);
>> + return ERR_PTR(ret);
>> }
>>
>> /**
>> @@ -708,19 +724,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
>> struct svc_pool *pool)
>> {
>> struct svc_rqst *rqstp;
>> + unsigned long index;
>>
>> - rcu_read_lock();
>
>
> While it does do its own locking, the resulting object that xa_for_each
> returns needs some protection too. Between xa_for_each returning a rqstp
> and calling test_and_set_bit, could the rqstp be freed? I suspect so,
> and I think you probably need to keep the rcu_read_lock() call above.

Should I keep the rcu_read_lock() even with the bitmap/xa_load
version of svc_pool_wake_idle_thread() in 9/9?


>> - list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
>> + xa_for_each(&pool->sp_thread_xa, index, rqstp) {
>> if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
>> continue;
>>
>> - rcu_read_unlock();
>> WRITE_ONCE(rqstp->rq_qtime, ktime_get());
>> wake_up_process(rqstp->rq_task);
>> percpu_counter_inc(&pool->sp_threads_woken);
>> return rqstp;
>> }
>> - rcu_read_unlock();
>>
>
> I wonder if this can race with svc_pool_victim below? Can we end up
> waking a thread that's already on its way out of the pool? Maybe this is
> addressed in your next patch though...
>
>> trace_svc_pool_starved(serv, pool);
>> percpu_counter_inc(&pool->sp_threads_starved);
>> @@ -736,32 +750,33 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
>> static struct task_struct *
>> svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
>> {
>> - unsigned int i;
>> struct task_struct *task = NULL;
>> + struct svc_rqst *rqstp;
>> + unsigned int i;
>>
>> if (pool != NULL) {
>> - spin_lock_bh(&pool->sp_lock);
>> + xa_lock(&pool->sp_thread_xa);
>> + if (!pool->sp_nrthreads)
>> + goto out;
>> } else {
>> for (i = 0; i < serv->sv_nrpools; i++) {
>> pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
>> - spin_lock_bh(&pool->sp_lock);
>> - if (!list_empty(&pool->sp_all_threads))
>> + xa_lock(&pool->sp_thread_xa);
>> + if (pool->sp_nrthreads)
>> goto found_pool;
>> - spin_unlock_bh(&pool->sp_lock);
>> + xa_unlock(&pool->sp_thread_xa);
>> }
>> return NULL;
>> }
>>
>> found_pool:
>> - if (!list_empty(&pool->sp_all_threads)) {
>> - struct svc_rqst *rqstp;
>> -
>> - rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
>> - set_bit(RQ_VICTIM, &rqstp->rq_flags);
>> - list_del_rcu(&rqstp->rq_all);
>> + rqstp = xa_load(&pool->sp_thread_xa, pool->sp_nrthreads - 1);
>> + if (rqstp) {
>> + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
>> task = rqstp->rq_task;
>> }
>> - spin_unlock_bh(&pool->sp_lock);
>> +out:
>> + xa_unlock(&pool->sp_thread_xa);
>> return task;
>> }
>>
>> @@ -843,9 +858,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
>> if (pool == NULL) {
>> nrservs -= serv->sv_nrthreads;
>> } else {
>> - spin_lock_bh(&pool->sp_lock);
>> + xa_lock(&pool->sp_thread_xa);
>> nrservs -= pool->sp_nrthreads;
>> - spin_unlock_bh(&pool->sp_lock);
>> + xa_unlock(&pool->sp_thread_xa);
>> }
>>
>> if (nrservs > 0)
>> @@ -932,11 +947,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
>> struct svc_serv *serv = rqstp->rq_server;
>> struct svc_pool *pool = rqstp->rq_pool;
>>
>> - spin_lock_bh(&pool->sp_lock);
>> + xa_lock(&pool->sp_thread_xa);
>> pool->sp_nrthreads--;
>> - if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
>> - list_del_rcu(&rqstp->rq_all);
>> - spin_unlock_bh(&pool->sp_lock);
>> + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
>> + xa_unlock(&pool->sp_thread_xa);
>> + trace_svc_pool_thread_exit(serv, pool, rqstp);
>>
>> spin_lock_bh(&serv->sv_lock);
>> serv->sv_nrthreads -= 1;
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index 6c2a702aa469..db40f771b60a 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);
>>
>> /* SMP locking strategy:
>> *
>> - * svc_pool->sp_lock protects most of the fields of that pool.
>> + * svc_pool->sp_lock protects sp_sockets.
>> * svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
>> * when both need to be taken (rare), svc_serv->sv_lock is first.
>> * The "service mutex" protects svc_serv->sv_nrthread.
>>
>>
>
> Looks like a nice clean conversion otherwise!
> --
> Jeff Layton <[email protected]>

--
Chuck Lever



2023-07-11 01:09:50

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] SUNRPC: Replace sp_threads_all with an xarray

On Tue, 2023-07-11 at 00:58 +0000, Chuck Lever III wrote:
>
> > On Jul 10, 2023, at 2:24 PM, Jeff Layton <[email protected]> wrote:
> >
> > On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
> > > From: Chuck Lever <[email protected]>
> > >
> > > We want a thread lookup operation that can be done with RCU only,
> > > but also we want to avoid the linked-list walk, which does not scale
> > > well in the number of pool threads.
> > >
> > > This patch splits out the use of the sp_lock to protect the set
> > > of threads. Svc thread information is now protected by the xarray's
> > > lock (when making thread count changes) and the RCU read lock (when
> > > only looking up a thread).
> > >
> > > Since thread count changes are done only via nfsd filesystem API,
> > > which runs only in process context, we can safely dispense with the
> > > use of a bottom-half-disabled lock.
> > >
> > > Signed-off-by: Chuck Lever <[email protected]>
> > > ---
> > > fs/nfsd/nfssvc.c | 3 +-
> > > include/linux/sunrpc/svc.h | 11 +++----
> > > include/trace/events/sunrpc.h | 47 ++++++++++++++++++++++++++++-
> > > net/sunrpc/svc.c | 67 +++++++++++++++++++++++++----------------
> > > net/sunrpc/svc_xprt.c | 2 +
> > > 5 files changed, 94 insertions(+), 36 deletions(-)
> > >
> > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > > index 2154fa63c5f2..d42b2a40c93c 100644
> > > --- a/fs/nfsd/nfssvc.c
> > > +++ b/fs/nfsd/nfssvc.c
> > > @@ -62,8 +62,7 @@ static __be32 nfsd_init_request(struct svc_rqst *,
> > > * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
> > > * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
> > > * nn->keep_active is set). That number of nfsd threads must
> > > - * exist and each must be listed in ->sp_all_threads in some entry of
> > > - * ->sv_pools[].
> > > + * exist and each must be listed in some entry of ->sv_pools[].
> > > *
> > > * Each active thread holds a counted reference on nn->nfsd_serv, as does
> > > * the nn->keep_active flag and various transient calls to svc_get().
> > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > index 9dd3b16cc4c2..86377506a514 100644
> > > --- a/include/linux/sunrpc/svc.h
> > > +++ b/include/linux/sunrpc/svc.h
> > > @@ -32,10 +32,10 @@
> > > */
> > > struct svc_pool {
> > > unsigned int sp_id; /* pool id; also node id on NUMA */
> > > - spinlock_t sp_lock; /* protects all fields */
> > > + spinlock_t sp_lock; /* protects sp_sockets */
> > > struct list_head sp_sockets; /* pending sockets */
> > > unsigned int sp_nrthreads; /* # of threads in pool */
> > > - struct list_head sp_all_threads; /* all server threads */
> > > + struct xarray sp_thread_xa;
> > >
> > > /* statistics on pool operation */
> > > struct percpu_counter sp_messages_arrived;
> > > @@ -196,7 +196,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > > * processed.
> > > */
> > > struct svc_rqst {
> > > - struct list_head rq_all; /* all threads list */
> > > struct rcu_head rq_rcu_head; /* for RCU deferred kfree */
> > > struct svc_xprt * rq_xprt; /* transport ptr */
> > >
> > > @@ -241,10 +240,10 @@ struct svc_rqst {
> > > #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> > > * to prevent encrypting page
> > > * cache pages */
> > > -#define RQ_VICTIM (5) /* about to be shut down */
> > > -#define RQ_BUSY (6) /* request is busy */
> > > -#define RQ_DATA (7) /* request has data */
> > > +#define RQ_BUSY (5) /* request is busy */
> > > +#define RQ_DATA (6) /* request has data */
> > > unsigned long rq_flags; /* flags field */
> > > + u32 rq_thread_id; /* xarray index */
> > > ktime_t rq_qtime; /* enqueue time */
> > >
> > > void * rq_argp; /* decoded arguments */
> > > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > > index 60c8e03268d4..ea43c6059bdb 100644
> > > --- a/include/trace/events/sunrpc.h
> > > +++ b/include/trace/events/sunrpc.h
> > > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> > > svc_rqst_flag(USEDEFERRAL) \
> > > svc_rqst_flag(DROPME) \
> > > svc_rqst_flag(SPLICE_OK) \
> > > - svc_rqst_flag(VICTIM) \
> > > svc_rqst_flag(BUSY) \
> > > svc_rqst_flag_end(DATA)
> > >
> > > @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
> > > )
> > > );
> > >
> > > +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
> > > + TP_PROTO(
> > > + const struct svc_serv *serv,
> > > + const struct svc_pool *pool,
> > > + const struct svc_rqst *rqstp
> > > + ),
> > > +
> > > + TP_ARGS(serv, pool, rqstp),
> > > +
> > > + TP_STRUCT__entry(
> > > + __string(name, serv->sv_name)
> > > + __field(int, pool_id)
> > > + __field(unsigned int, nrthreads)
> > > + __field(unsigned long, pool_flags)
> > > + __field(u32, thread_id)
> > > + __field(const void *, rqstp)
> > > + ),
> > > +
> > > + TP_fast_assign(
> > > + __assign_str(name, serv->sv_name);
> > > + __entry->pool_id = pool->sp_id;
> > > + __entry->nrthreads = pool->sp_nrthreads;
> > > + __entry->pool_flags = pool->sp_flags;
> > > + __entry->thread_id = rqstp->rq_thread_id;
> > > + __entry->rqstp = rqstp;
> > > + ),
> > > +
> > > + TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
> > > + __get_str(name), __entry->pool_id,
> > > + show_svc_pool_flags(__entry->pool_flags),
> > > + __entry->nrthreads, __entry->thread_id
> > > + )
> > > +);
> > > +
> > > +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
> > > + DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
> > > + TP_PROTO( \
> > > + const struct svc_serv *serv, \
> > > + const struct svc_pool *pool, \
> > > + const struct svc_rqst *rqstp \
> > > + ), \
> > > + TP_ARGS(serv, pool, rqstp))
> > > +
> > > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
> > > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
> > > +
> > > DECLARE_EVENT_CLASS(svc_xprt_event,
> > > TP_PROTO(
> > > const struct svc_xprt *xprt
> > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > > index ad29df00b454..109d7f047385 100644
> > > --- a/net/sunrpc/svc.c
> > > +++ b/net/sunrpc/svc.c
> > > @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> > >
> > > pool->sp_id = i;
> > > INIT_LIST_HEAD(&pool->sp_sockets);
> > > - INIT_LIST_HEAD(&pool->sp_all_threads);
> > > spin_lock_init(&pool->sp_lock);
> > > + xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > >
> > > percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > > percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > > @@ -596,6 +596,8 @@ svc_destroy(struct kref *ref)
> > > percpu_counter_destroy(&pool->sp_threads_timedout);
> > > percpu_counter_destroy(&pool->sp_threads_starved);
> > > percpu_counter_destroy(&pool->sp_threads_no_work);
> > > +
> > > + xa_destroy(&pool->sp_thread_xa);
> > > }
> > > kfree(serv->sv_pools);
> > > kfree(serv);
> > > @@ -676,7 +678,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
> > > static struct svc_rqst *
> > > svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > {
> > > + struct xa_limit limit = {
> > > + .max = U32_MAX,
> > > + };
> > > struct svc_rqst *rqstp;
> > > + int ret;
> > >
> > > rqstp = svc_rqst_alloc(serv, pool, node);
> > > if (!rqstp)
> > > @@ -687,11 +693,21 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > serv->sv_nrthreads += 1;
> > > spin_unlock_bh(&serv->sv_lock);
> > >
> > > - spin_lock_bh(&pool->sp_lock);
> > > + xa_lock(&pool->sp_thread_xa);
> > > + ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
> > > + limit, GFP_KERNEL);
> > > + if (ret) {
> > > + xa_unlock(&pool->sp_thread_xa);
> > > + goto out_free;
> > > + }
> > > pool->sp_nrthreads++;
> > > - list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
> > > - spin_unlock_bh(&pool->sp_lock);
> > > + xa_unlock(&pool->sp_thread_xa);
> > > + trace_svc_pool_thread_init(serv, pool, rqstp);
> > > return rqstp;
> > > +
> > > +out_free:
> > > + svc_rqst_free(rqstp);
> > > + return ERR_PTR(ret);
> > > }
> > >
> > > /**
> > > @@ -708,19 +724,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > > struct svc_pool *pool)
> > > {
> > > struct svc_rqst *rqstp;
> > > + unsigned long index;
> > >
> > > - rcu_read_lock();
> >
> >
> > While it does do its own locking, the resulting object that xa_for_each
> > returns needs some protection too. Between xa_for_each returning a rqstp
> > and calling test_and_set_bit, could the rqstp be freed? I suspect so,
> > and I think you probably need to keep the rcu_read_lock() call above.
>
> Should I keep the rcu_read_lock() even with the bitmap/xa_load
> version of svc_pool_wake_idle_thread() in 9/9 ?
>

Yeah, I think you have to. We're not doing real locking on the search or
taking references, so nothing else will ensure that the rqstp will stick
around once you've found it. I think you have to hold it until after
wake_up_process (at least).
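
Roughly what I have in mind -- just a sketch of the pattern (hold the RCU
read section across both the lookup and the wakeup), not the actual hunk:

    rcu_read_lock();
    xa_for_each(&pool->sp_thread_xa, index, rqstp) {
        if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
            continue;
        WRITE_ONCE(rqstp->rq_qtime, ktime_get());
        wake_up_process(rqstp->rq_task);
        percpu_counter_inc(&pool->sp_threads_woken);
        rcu_read_unlock();    /* rqstp stays pinned until after the wakeup */
        return rqstp;
    }
    rcu_read_unlock();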

>
> > > - list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> > > + xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > > if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > > continue;
> > >
> > > - rcu_read_unlock();
> > > WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> > > wake_up_process(rqstp->rq_task);
> > > percpu_counter_inc(&pool->sp_threads_woken);
> > > return rqstp;
> > > }
> > > - rcu_read_unlock();
> > >
> >
> > I wonder if this can race with svc_pool_victim below? Can we end up
> > waking a thread that's already on its way out of the pool? Maybe this is
> > addressed in your next patch though...
> >
> > > trace_svc_pool_starved(serv, pool);
> > > percpu_counter_inc(&pool->sp_threads_starved);
> > > @@ -736,32 +750,33 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> > > static struct task_struct *
> > > svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> > > {
> > > - unsigned int i;
> > > struct task_struct *task = NULL;
> > > + struct svc_rqst *rqstp;
> > > + unsigned int i;
> > >
> > > if (pool != NULL) {
> > > - spin_lock_bh(&pool->sp_lock);
> > > + xa_lock(&pool->sp_thread_xa);
> > > + if (!pool->sp_nrthreads)
> > > + goto out;
> > > } else {
> > > for (i = 0; i < serv->sv_nrpools; i++) {
> > > pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
> > > - spin_lock_bh(&pool->sp_lock);
> > > - if (!list_empty(&pool->sp_all_threads))
> > > + xa_lock(&pool->sp_thread_xa);
> > > + if (pool->sp_nrthreads)
> > > goto found_pool;
> > > - spin_unlock_bh(&pool->sp_lock);
> > > + xa_unlock(&pool->sp_thread_xa);
> > > }
> > > return NULL;
> > > }
> > >
> > > found_pool:
> > > - if (!list_empty(&pool->sp_all_threads)) {
> > > - struct svc_rqst *rqstp;
> > > -
> > > - rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
> > > - set_bit(RQ_VICTIM, &rqstp->rq_flags);
> > > - list_del_rcu(&rqstp->rq_all);
> > > + rqstp = xa_load(&pool->sp_thread_xa, pool->sp_nrthreads - 1);
> > > + if (rqstp) {
> > > + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > > task = rqstp->rq_task;
> > > }
> > > - spin_unlock_bh(&pool->sp_lock);
> > > +out:
> > > + xa_unlock(&pool->sp_thread_xa);
> > > return task;
> > > }
> > >
> > > @@ -843,9 +858,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> > > if (pool == NULL) {
> > > nrservs -= serv->sv_nrthreads;
> > > } else {
> > > - spin_lock_bh(&pool->sp_lock);
> > > + xa_lock(&pool->sp_thread_xa);
> > > nrservs -= pool->sp_nrthreads;
> > > - spin_unlock_bh(&pool->sp_lock);
> > > + xa_unlock(&pool->sp_thread_xa);
> > > }
> > >
> > > if (nrservs > 0)
> > > @@ -932,11 +947,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
> > > struct svc_serv *serv = rqstp->rq_server;
> > > struct svc_pool *pool = rqstp->rq_pool;
> > >
> > > - spin_lock_bh(&pool->sp_lock);
> > > + xa_lock(&pool->sp_thread_xa);
> > > pool->sp_nrthreads--;
> > > - if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
> > > - list_del_rcu(&rqstp->rq_all);
> > > - spin_unlock_bh(&pool->sp_lock);
> > > + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > > + xa_unlock(&pool->sp_thread_xa);
> > > + trace_svc_pool_thread_exit(serv, pool, rqstp);
> > >
> > > spin_lock_bh(&serv->sv_lock);
> > > serv->sv_nrthreads -= 1;
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index 6c2a702aa469..db40f771b60a 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);
> > >
> > > /* SMP locking strategy:
> > > *
> > > - * svc_pool->sp_lock protects most of the fields of that pool.
> > > + * svc_pool->sp_lock protects sp_sockets.
> > > * svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
> > > * when both need to be taken (rare), svc_serv->sv_lock is first.
> > > * The "service mutex" protects svc_serv->sv_nrthread.
> > >
> > >
> >
> > Looks like a nice clean conversion otherwise!
> > --
> > Jeff Layton <[email protected]>
>
> --
> Chuck Lever
>
>

--
Jeff Layton <[email protected]>

2023-07-11 02:35:10

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] SUNRPC: Replace sp_threads_all with an xarray

On Mon, Jul 10, 2023 at 09:06:02PM -0400, Jeff Layton wrote:
> On Tue, 2023-07-11 at 00:58 +0000, Chuck Lever III wrote:
> >
> > > On Jul 10, 2023, at 2:24 PM, Jeff Layton <[email protected]> wrote:
> > >
> > > On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
> > > > From: Chuck Lever <[email protected]>
> > > >
> > > > We want a thread lookup operation that can be done with RCU only,
> > > > but also we want to avoid the linked-list walk, which does not scale
> > > > well in the number of pool threads.
> > > >
> > > > This patch splits out the use of the sp_lock to protect the set
> > > > of threads. Svc thread information is now protected by the xarray's
> > > > lock (when making thread count changes) and the RCU read lock (when
> > > > only looking up a thread).
> > > >
> > > > Since thread count changes are done only via nfsd filesystem API,
> > > > which runs only in process context, we can safely dispense with the
> > > > use of a bottom-half-disabled lock.
> > > >
> > > > Signed-off-by: Chuck Lever <[email protected]>
> > > > ---
> > > > fs/nfsd/nfssvc.c | 3 +-
> > > > include/linux/sunrpc/svc.h | 11 +++----
> > > > include/trace/events/sunrpc.h | 47 ++++++++++++++++++++++++++++-
> > > > net/sunrpc/svc.c | 67 +++++++++++++++++++++++++----------------
> > > > net/sunrpc/svc_xprt.c | 2 +
> > > > 5 files changed, 94 insertions(+), 36 deletions(-)
> > > >
> > > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > > > index 2154fa63c5f2..d42b2a40c93c 100644
> > > > --- a/fs/nfsd/nfssvc.c
> > > > +++ b/fs/nfsd/nfssvc.c
> > > > @@ -62,8 +62,7 @@ static __be32 nfsd_init_request(struct svc_rqst *,
> > > > * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
> > > > * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
> > > > * nn->keep_active is set). That number of nfsd threads must
> > > > - * exist and each must be listed in ->sp_all_threads in some entry of
> > > > - * ->sv_pools[].
> > > > + * exist and each must be listed in some entry of ->sv_pools[].
> > > > *
> > > > * Each active thread holds a counted reference on nn->nfsd_serv, as does
> > > > * the nn->keep_active flag and various transient calls to svc_get().
> > > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > > index 9dd3b16cc4c2..86377506a514 100644
> > > > --- a/include/linux/sunrpc/svc.h
> > > > +++ b/include/linux/sunrpc/svc.h
> > > > @@ -32,10 +32,10 @@
> > > > */
> > > > struct svc_pool {
> > > > unsigned int sp_id; /* pool id; also node id on NUMA */
> > > > - spinlock_t sp_lock; /* protects all fields */
> > > > + spinlock_t sp_lock; /* protects sp_sockets */
> > > > struct list_head sp_sockets; /* pending sockets */
> > > > unsigned int sp_nrthreads; /* # of threads in pool */
> > > > - struct list_head sp_all_threads; /* all server threads */
> > > > + struct xarray sp_thread_xa;
> > > >
> > > > /* statistics on pool operation */
> > > > struct percpu_counter sp_messages_arrived;
> > > > @@ -196,7 +196,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > > > * processed.
> > > > */
> > > > struct svc_rqst {
> > > > - struct list_head rq_all; /* all threads list */
> > > > struct rcu_head rq_rcu_head; /* for RCU deferred kfree */
> > > > struct svc_xprt * rq_xprt; /* transport ptr */
> > > >
> > > > @@ -241,10 +240,10 @@ struct svc_rqst {
> > > > #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> > > > * to prevent encrypting page
> > > > * cache pages */
> > > > -#define RQ_VICTIM (5) /* about to be shut down */
> > > > -#define RQ_BUSY (6) /* request is busy */
> > > > -#define RQ_DATA (7) /* request has data */
> > > > +#define RQ_BUSY (5) /* request is busy */
> > > > +#define RQ_DATA (6) /* request has data */
> > > > unsigned long rq_flags; /* flags field */
> > > > + u32 rq_thread_id; /* xarray index */
> > > > ktime_t rq_qtime; /* enqueue time */
> > > >
> > > > void * rq_argp; /* decoded arguments */
> > > > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > > > index 60c8e03268d4..ea43c6059bdb 100644
> > > > --- a/include/trace/events/sunrpc.h
> > > > +++ b/include/trace/events/sunrpc.h
> > > > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> > > > svc_rqst_flag(USEDEFERRAL) \
> > > > svc_rqst_flag(DROPME) \
> > > > svc_rqst_flag(SPLICE_OK) \
> > > > - svc_rqst_flag(VICTIM) \
> > > > svc_rqst_flag(BUSY) \
> > > > svc_rqst_flag_end(DATA)
> > > >
> > > > @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
> > > > )
> > > > );
> > > >
> > > > +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
> > > > + TP_PROTO(
> > > > + const struct svc_serv *serv,
> > > > + const struct svc_pool *pool,
> > > > + const struct svc_rqst *rqstp
> > > > + ),
> > > > +
> > > > + TP_ARGS(serv, pool, rqstp),
> > > > +
> > > > + TP_STRUCT__entry(
> > > > + __string(name, serv->sv_name)
> > > > + __field(int, pool_id)
> > > > + __field(unsigned int, nrthreads)
> > > > + __field(unsigned long, pool_flags)
> > > > + __field(u32, thread_id)
> > > > + __field(const void *, rqstp)
> > > > + ),
> > > > +
> > > > + TP_fast_assign(
> > > > + __assign_str(name, serv->sv_name);
> > > > + __entry->pool_id = pool->sp_id;
> > > > + __entry->nrthreads = pool->sp_nrthreads;
> > > > + __entry->pool_flags = pool->sp_flags;
> > > > + __entry->thread_id = rqstp->rq_thread_id;
> > > > + __entry->rqstp = rqstp;
> > > > + ),
> > > > +
> > > > + TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
> > > > + __get_str(name), __entry->pool_id,
> > > > + show_svc_pool_flags(__entry->pool_flags),
> > > > + __entry->nrthreads, __entry->thread_id
> > > > + )
> > > > +);
> > > > +
> > > > +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
> > > > + DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
> > > > + TP_PROTO( \
> > > > + const struct svc_serv *serv, \
> > > > + const struct svc_pool *pool, \
> > > > + const struct svc_rqst *rqstp \
> > > > + ), \
> > > > + TP_ARGS(serv, pool, rqstp))
> > > > +
> > > > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
> > > > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
> > > > +
> > > > DECLARE_EVENT_CLASS(svc_xprt_event,
> > > > TP_PROTO(
> > > > const struct svc_xprt *xprt
> > > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > > > index ad29df00b454..109d7f047385 100644
> > > > --- a/net/sunrpc/svc.c
> > > > +++ b/net/sunrpc/svc.c
> > > > @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> > > >
> > > > pool->sp_id = i;
> > > > INIT_LIST_HEAD(&pool->sp_sockets);
> > > > - INIT_LIST_HEAD(&pool->sp_all_threads);
> > > > spin_lock_init(&pool->sp_lock);
> > > > + xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > > >
> > > > percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > > > percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > > > @@ -596,6 +596,8 @@ svc_destroy(struct kref *ref)
> > > > percpu_counter_destroy(&pool->sp_threads_timedout);
> > > > percpu_counter_destroy(&pool->sp_threads_starved);
> > > > percpu_counter_destroy(&pool->sp_threads_no_work);
> > > > +
> > > > + xa_destroy(&pool->sp_thread_xa);
> > > > }
> > > > kfree(serv->sv_pools);
> > > > kfree(serv);
> > > > @@ -676,7 +678,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
> > > > static struct svc_rqst *
> > > > svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > > {
> > > > + struct xa_limit limit = {
> > > > + .max = U32_MAX,
> > > > + };
> > > > struct svc_rqst *rqstp;
> > > > + int ret;
> > > >
> > > > rqstp = svc_rqst_alloc(serv, pool, node);
> > > > if (!rqstp)
> > > > @@ -687,11 +693,21 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > > serv->sv_nrthreads += 1;
> > > > spin_unlock_bh(&serv->sv_lock);
> > > >
> > > > - spin_lock_bh(&pool->sp_lock);
> > > > + xa_lock(&pool->sp_thread_xa);
> > > > + ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
> > > > + limit, GFP_KERNEL);
> > > > + if (ret) {
> > > > + xa_unlock(&pool->sp_thread_xa);
> > > > + goto out_free;
> > > > + }
> > > > pool->sp_nrthreads++;
> > > > - list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
> > > > - spin_unlock_bh(&pool->sp_lock);
> > > > + xa_unlock(&pool->sp_thread_xa);
> > > > + trace_svc_pool_thread_init(serv, pool, rqstp);
> > > > return rqstp;
> > > > +
> > > > +out_free:
> > > > + svc_rqst_free(rqstp);
> > > > + return ERR_PTR(ret);
> > > > }
> > > >
> > > > /**
> > > > @@ -708,19 +724,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > > > struct svc_pool *pool)
> > > > {
> > > > struct svc_rqst *rqstp;
> > > > + unsigned long index;
> > > >
> > > > - rcu_read_lock();
> > >
> > >
> > > While it does do its own locking, the resulting object that xa_for_each
> > > returns needs some protection too. Between xa_for_each returning a rqstp
> > > and calling test_and_set_bit, could the rqstp be freed? I suspect so,
> > > and I think you probably need to keep the rcu_read_lock() call above.
> >
> > Should I keep the rcu_read_lock() even with the bitmap/xa_load
> > version of svc_pool_wake_idle_thread() in 9/9 ?
> >
>
> Yeah, I think you have to. We're not doing real locking on the search or
> taking references, so nothing else will ensure that the rqstp will stick
> around once you've found it. I think you have to hold it until after
> wake_up_process (at least).

I can keep the RCU read lock around the search and xa_load(). But
I notice that the code we're replacing releases the RCU read lock
before calling wake_up_process(). Not saying that's right, but we
haven't had a problem reported.


> > > > - list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> > > > + xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > > > if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > > > continue;
> > > >
> > > > - rcu_read_unlock();
> > > > WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> > > > wake_up_process(rqstp->rq_task);
> > > > percpu_counter_inc(&pool->sp_threads_woken);
> > > > return rqstp;
> > > > }
> > > > - rcu_read_unlock();
> > > >
> > >
> > > I wonder if this can race with svc_pool_victim below? Can we end up
> > > waking a thread that's already on its way out of the pool? Maybe this is
> > > addressed in your next patch though...
> > >
> > > > trace_svc_pool_starved(serv, pool);
> > > > percpu_counter_inc(&pool->sp_threads_starved);
> > > > @@ -736,32 +750,33 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> > > > static struct task_struct *
> > > > svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> > > > {
> > > > - unsigned int i;
> > > > struct task_struct *task = NULL;
> > > > + struct svc_rqst *rqstp;
> > > > + unsigned int i;
> > > >
> > > > if (pool != NULL) {
> > > > - spin_lock_bh(&pool->sp_lock);
> > > > + xa_lock(&pool->sp_thread_xa);
> > > > + if (!pool->sp_nrthreads)
> > > > + goto out;
> > > > } else {
> > > > for (i = 0; i < serv->sv_nrpools; i++) {
> > > > pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
> > > > - spin_lock_bh(&pool->sp_lock);
> > > > - if (!list_empty(&pool->sp_all_threads))
> > > > + xa_lock(&pool->sp_thread_xa);
> > > > + if (pool->sp_nrthreads)
> > > > goto found_pool;
> > > > - spin_unlock_bh(&pool->sp_lock);
> > > > + xa_unlock(&pool->sp_thread_xa);
> > > > }
> > > > return NULL;
> > > > }
> > > >
> > > > found_pool:
> > > > - if (!list_empty(&pool->sp_all_threads)) {
> > > > - struct svc_rqst *rqstp;
> > > > -
> > > > - rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
> > > > - set_bit(RQ_VICTIM, &rqstp->rq_flags);
> > > > - list_del_rcu(&rqstp->rq_all);
> > > > + rqstp = xa_load(&pool->sp_thread_xa, pool->sp_nrthreads - 1);
> > > > + if (rqstp) {
> > > > + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > > > task = rqstp->rq_task;
> > > > }
> > > > - spin_unlock_bh(&pool->sp_lock);
> > > > +out:
> > > > + xa_unlock(&pool->sp_thread_xa);
> > > > return task;
> > > > }
> > > >
> > > > @@ -843,9 +858,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> > > > if (pool == NULL) {
> > > > nrservs -= serv->sv_nrthreads;
> > > > } else {
> > > > - spin_lock_bh(&pool->sp_lock);
> > > > + xa_lock(&pool->sp_thread_xa);
> > > > nrservs -= pool->sp_nrthreads;
> > > > - spin_unlock_bh(&pool->sp_lock);
> > > > + xa_unlock(&pool->sp_thread_xa);
> > > > }
> > > >
> > > > if (nrservs > 0)
> > > > @@ -932,11 +947,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
> > > > struct svc_serv *serv = rqstp->rq_server;
> > > > struct svc_pool *pool = rqstp->rq_pool;
> > > >
> > > > - spin_lock_bh(&pool->sp_lock);
> > > > + xa_lock(&pool->sp_thread_xa);
> > > > pool->sp_nrthreads--;
> > > > - if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
> > > > - list_del_rcu(&rqstp->rq_all);
> > > > - spin_unlock_bh(&pool->sp_lock);
> > > > + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > > > + xa_unlock(&pool->sp_thread_xa);
> > > > + trace_svc_pool_thread_exit(serv, pool, rqstp);
> > > >
> > > > spin_lock_bh(&serv->sv_lock);
> > > > serv->sv_nrthreads -= 1;
> > > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > > index 6c2a702aa469..db40f771b60a 100644
> > > > --- a/net/sunrpc/svc_xprt.c
> > > > +++ b/net/sunrpc/svc_xprt.c
> > > > @@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);
> > > >
> > > > /* SMP locking strategy:
> > > > *
> > > > - * svc_pool->sp_lock protects most of the fields of that pool.
> > > > + * svc_pool->sp_lock protects sp_sockets.
> > > > * svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
> > > > * when both need to be taken (rare), svc_serv->sv_lock is first.
> > > > * The "service mutex" protects svc_serv->sv_nrthread.
> > > >
> > > >
> > >
> > > Looks like a nice clean conversion otherwise!
> > > --
> > > Jeff Layton <[email protected]>
> >
> > --
> > Chuck Lever
> >
> >
>
> --
> Jeff Layton <[email protected]>

2023-07-11 10:06:01

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do

On Tue, 11 Jul 2023, Chuck Lever III wrote:
>
> > On Jul 10, 2023, at 6:29 PM, NeilBrown <[email protected]> wrote:
> >
> > On Tue, 11 Jul 2023, Chuck Lever wrote:
> >> From: Chuck Lever <[email protected]>
> >>
> >> Measure a source of thread scheduling inefficiency -- count threads
> >> that were awoken but found that the transport queue had already been
> >> emptied.
> >>
> >> An empty transport queue is possible when threads that run between
> >> the wake_up_process() call and the woken thread returning from the
> >> scheduler have pulled all remaining work off the transport queue
> >> using the first svc_xprt_dequeue() in svc_get_next_xprt().
> >
> > I'm in two minds about this. The data being gathered here is
> > potentially useful
>
> It's actually pretty shocking: I've measured more than
> 15% of thread wake-ups find no work to do.

That is a bigger number than I would have guessed!

>
>
> > - but who it is useful to?
> > I think it is primarily useful for us - to understand the behaviour of
> > the implementation so we can know what needs to be optimised.
> > It isn't really of any use to a sysadmin who wants to understand how
> > their system is performing.
> >
> > But then .. who are tracepoints for? Developers or admins?
> > I guess that fact that we feel free to modify them whenever we need
> > means they are primarily for developers? In which case this is a good
> > patch, and maybe we'll revert the functionality one day if it turns out
> > that we can change the implementation so that a thread is never woken
> > when there is no work to do ....
>
> A reasonable question to ask. The new "starved" metric
> is similar: possibly useful while we are developing the
> code, but not particularly valuable for system
> administrators.
>
> How are the pool_stats used by administrators?

That is a fair question. Probably not much.
Once upon a time we had stats which could show a histogram of thread
usage. I used that to decide if the load justified more threads.
But we removed it because it was somewhat expensive and it was argued it
may not be all that useful...
I haven't really looked at any other stats in my work. Almost always it
is a packet capture that helps me see what is happening when I have an
issue to address.

Maybe I should just accept that stats are primarily for developers and
they can be incredibly useful for that purpose, and not worry if admins
might ever need them.

>
> (And, why are they in /proc/fs/nfsd/ rather than under
> something RPC-related?)

Maybe because we "owned" /proc/fs/nfsd/, but the RPC-related stuff is
under "net" and we didn't feel so comfortable sticking new stuff there.
Or maybe not.

Thanks,
NeilBrown

2023-07-11 11:47:54

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do

On Tue, 2023-07-11 at 19:54 +1000, NeilBrown wrote:
> On Tue, 11 Jul 2023, Chuck Lever III wrote:
> >
> > > On Jul 10, 2023, at 6:29 PM, NeilBrown <[email protected]> wrote:
> > >
> > > On Tue, 11 Jul 2023, Chuck Lever wrote:
> > > > From: Chuck Lever <[email protected]>
> > > >
> > > > Measure a source of thread scheduling inefficiency -- count threads
> > > > that were awoken but found that the transport queue had already been
> > > > emptied.
> > > >
> > > > An empty transport queue is possible when threads that run between
> > > > the wake_up_process() call and the woken thread returning from the
> > > > scheduler have pulled all remaining work off the transport queue
> > > > using the first svc_xprt_dequeue() in svc_get_next_xprt().
> > >
> > > I'm in two minds about this. The data being gathered here is
> > > potentially useful
> >
> > It's actually pretty shocking: I've measured more than
> > 15% of thread wake-ups find no work to do.
>
> That is a bigger number than I would have guessed!
>

I'm guessing the idea is that the receiver is waking a thread to do the
work, and that races with one that's already running? I'm sure there are
ways we can fix that, but it really does seem like we're trying to
reinvent workqueues here.
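
To spell out the race I'm imagining (a rough sketch using the names from
the patches; the real call chain differs in detail):

    /* receiving side */
    svc_xprt_enqueue(xprt);             /* queue the transport ... */
    wake_up_process(rqstp->rq_task);    /* ... and wake an idle thread */

    /* meanwhile, an nfsd thread that is already running */
    xprt = svc_xprt_dequeue(pool);      /* drains the queue first */

    /* the woken thread, once it finally gets the CPU */
    xprt = svc_xprt_dequeue(pool);      /* finds nothing left */
    if (!xprt)
        percpu_counter_inc(&pool->sp_threads_no_work);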

> >
> >
> > > - but who it is useful to?
> > > I think it is primarily useful for us - to understand the behaviour of
> > > the implementation so we can know what needs to be optimised.
> > > It isn't really of any use to a sysadmin who wants to understand how
> > > their system is performing.
> > >
> > > But then .. who are tracepoints for? Developers or admins?
> > > I guess that fact that we feel free to modify them whenever we need
> > > means they are primarily for developers? In which case this is a good
> > > patch, and maybe we'll revert the functionality one day if it turns out
> > > that we can change the implementation so that a thread is never woken
> > > when there is no work to do ....
> >
> > A reasonable question to ask. The new "starved" metric
> > is similar: possibly useful while we are developing the
> > code, but not particularly valuable for system
> > administrators.
> >
> > How are the pool_stats used by administrators?
>
> That is a fair question. Probably not much.
> Once upon a time we had stats which could show a histogram of thread
> usage. I used that to decide if the load justified more threads.
> But we removed it because it was somewhat expensive and it was argued it
> may not be all that useful...
> I haven't really looked at any other stats in my work. Almost always it
> is a packet capture that helps me see what is happening when I have an
> issue to address.
>
> Maybe I should just accept that stats are primarily for developers and
> they can be incredibly useful for that purpose, and not worry if admins
> might ever need them.
>
> >
> > (And, why are they in /proc/fs/nfsd/ rather than under
> > something RPC-related?)
>
> Maybe because we "owned" /proc/fs/nfsd/, but the RPC-related stuff is
> under "net" and we didn't feel so comfortable sticking new stuff there.
> Or maybe not.
>

--
Jeff Layton <[email protected]>


2023-07-11 12:08:31

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] SUNRPC: Replace sp_threads_all with an xarray

On Mon, 2023-07-10 at 22:24 -0400, Chuck Lever wrote:
> On Mon, Jul 10, 2023 at 09:06:02PM -0400, Jeff Layton wrote:
> > On Tue, 2023-07-11 at 00:58 +0000, Chuck Lever III wrote:
> > >
> > > > On Jul 10, 2023, at 2:24 PM, Jeff Layton <[email protected]> wrote:
> > > >
> > > > On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
> > > > > From: Chuck Lever <[email protected]>
> > > > >
> > > > > We want a thread lookup operation that can be done with RCU only,
> > > > > but also we want to avoid the linked-list walk, which does not scale
> > > > > well in the number of pool threads.
> > > > >
> > > > > This patch splits out the use of the sp_lock to protect the set
> > > > > of threads. Svc thread information is now protected by the xarray's
> > > > > lock (when making thread count changes) and the RCU read lock (when
> > > > > only looking up a thread).
> > > > >
> > > > > Since thread count changes are done only via nfsd filesystem API,
> > > > > which runs only in process context, we can safely dispense with the
> > > > > use of a bottom-half-disabled lock.
> > > > >
> > > > > Signed-off-by: Chuck Lever <[email protected]>
> > > > > ---
> > > > > fs/nfsd/nfssvc.c | 3 +-
> > > > > include/linux/sunrpc/svc.h | 11 +++----
> > > > > include/trace/events/sunrpc.h | 47 ++++++++++++++++++++++++++++-
> > > > > net/sunrpc/svc.c | 67 +++++++++++++++++++++++++----------------
> > > > > net/sunrpc/svc_xprt.c | 2 +
> > > > > 5 files changed, 94 insertions(+), 36 deletions(-)
> > > > >
> > > > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > > > > index 2154fa63c5f2..d42b2a40c93c 100644
> > > > > --- a/fs/nfsd/nfssvc.c
> > > > > +++ b/fs/nfsd/nfssvc.c
> > > > > @@ -62,8 +62,7 @@ static __be32 nfsd_init_request(struct svc_rqst *,
> > > > > * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
> > > > > * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
> > > > > * nn->keep_active is set). That number of nfsd threads must
> > > > > - * exist and each must be listed in ->sp_all_threads in some entry of
> > > > > - * ->sv_pools[].
> > > > > + * exist and each must be listed in some entry of ->sv_pools[].
> > > > > *
> > > > > * Each active thread holds a counted reference on nn->nfsd_serv, as does
> > > > > * the nn->keep_active flag and various transient calls to svc_get().
> > > > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > > > index 9dd3b16cc4c2..86377506a514 100644
> > > > > --- a/include/linux/sunrpc/svc.h
> > > > > +++ b/include/linux/sunrpc/svc.h
> > > > > @@ -32,10 +32,10 @@
> > > > > */
> > > > > struct svc_pool {
> > > > > unsigned int sp_id; /* pool id; also node id on NUMA */
> > > > > - spinlock_t sp_lock; /* protects all fields */
> > > > > + spinlock_t sp_lock; /* protects sp_sockets */
> > > > > struct list_head sp_sockets; /* pending sockets */
> > > > > unsigned int sp_nrthreads; /* # of threads in pool */
> > > > > - struct list_head sp_all_threads; /* all server threads */
> > > > > + struct xarray sp_thread_xa;
> > > > >
> > > > > /* statistics on pool operation */
> > > > > struct percpu_counter sp_messages_arrived;
> > > > > @@ -196,7 +196,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > > > > * processed.
> > > > > */
> > > > > struct svc_rqst {
> > > > > - struct list_head rq_all; /* all threads list */
> > > > > struct rcu_head rq_rcu_head; /* for RCU deferred kfree */
> > > > > struct svc_xprt * rq_xprt; /* transport ptr */
> > > > >
> > > > > @@ -241,10 +240,10 @@ struct svc_rqst {
> > > > > #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> > > > > * to prevent encrypting page
> > > > > * cache pages */
> > > > > -#define RQ_VICTIM (5) /* about to be shut down */
> > > > > -#define RQ_BUSY (6) /* request is busy */
> > > > > -#define RQ_DATA (7) /* request has data */
> > > > > +#define RQ_BUSY (5) /* request is busy */
> > > > > +#define RQ_DATA (6) /* request has data */
> > > > > unsigned long rq_flags; /* flags field */
> > > > > + u32 rq_thread_id; /* xarray index */
> > > > > ktime_t rq_qtime; /* enqueue time */
> > > > >
> > > > > void * rq_argp; /* decoded arguments */
> > > > > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > > > > index 60c8e03268d4..ea43c6059bdb 100644
> > > > > --- a/include/trace/events/sunrpc.h
> > > > > +++ b/include/trace/events/sunrpc.h
> > > > > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> > > > > svc_rqst_flag(USEDEFERRAL) \
> > > > > svc_rqst_flag(DROPME) \
> > > > > svc_rqst_flag(SPLICE_OK) \
> > > > > - svc_rqst_flag(VICTIM) \
> > > > > svc_rqst_flag(BUSY) \
> > > > > svc_rqst_flag_end(DATA)
> > > > >
> > > > > @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
> > > > > )
> > > > > );
> > > > >
> > > > > +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
> > > > > + TP_PROTO(
> > > > > + const struct svc_serv *serv,
> > > > > + const struct svc_pool *pool,
> > > > > + const struct svc_rqst *rqstp
> > > > > + ),
> > > > > +
> > > > > + TP_ARGS(serv, pool, rqstp),
> > > > > +
> > > > > + TP_STRUCT__entry(
> > > > > + __string(name, serv->sv_name)
> > > > > + __field(int, pool_id)
> > > > > + __field(unsigned int, nrthreads)
> > > > > + __field(unsigned long, pool_flags)
> > > > > + __field(u32, thread_id)
> > > > > + __field(const void *, rqstp)
> > > > > + ),
> > > > > +
> > > > > + TP_fast_assign(
> > > > > + __assign_str(name, serv->sv_name);
> > > > > + __entry->pool_id = pool->sp_id;
> > > > > + __entry->nrthreads = pool->sp_nrthreads;
> > > > > + __entry->pool_flags = pool->sp_flags;
> > > > > + __entry->thread_id = rqstp->rq_thread_id;
> > > > > + __entry->rqstp = rqstp;
> > > > > + ),
> > > > > +
> > > > > + TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
> > > > > + __get_str(name), __entry->pool_id,
> > > > > + show_svc_pool_flags(__entry->pool_flags),
> > > > > + __entry->nrthreads, __entry->thread_id
> > > > > + )
> > > > > +);
> > > > > +
> > > > > +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
> > > > > + DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
> > > > > + TP_PROTO( \
> > > > > + const struct svc_serv *serv, \
> > > > > + const struct svc_pool *pool, \
> > > > > + const struct svc_rqst *rqstp \
> > > > > + ), \
> > > > > + TP_ARGS(serv, pool, rqstp))
> > > > > +
> > > > > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
> > > > > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
> > > > > +
> > > > > DECLARE_EVENT_CLASS(svc_xprt_event,
> > > > > TP_PROTO(
> > > > > const struct svc_xprt *xprt
> > > > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > > > > index ad29df00b454..109d7f047385 100644
> > > > > --- a/net/sunrpc/svc.c
> > > > > +++ b/net/sunrpc/svc.c
> > > > > @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> > > > >
> > > > > pool->sp_id = i;
> > > > > INIT_LIST_HEAD(&pool->sp_sockets);
> > > > > - INIT_LIST_HEAD(&pool->sp_all_threads);
> > > > > spin_lock_init(&pool->sp_lock);
> > > > > + xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > > > >
> > > > > percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > > > > percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > > > > @@ -596,6 +596,8 @@ svc_destroy(struct kref *ref)
> > > > > percpu_counter_destroy(&pool->sp_threads_timedout);
> > > > > percpu_counter_destroy(&pool->sp_threads_starved);
> > > > > percpu_counter_destroy(&pool->sp_threads_no_work);
> > > > > +
> > > > > + xa_destroy(&pool->sp_thread_xa);
> > > > > }
> > > > > kfree(serv->sv_pools);
> > > > > kfree(serv);
> > > > > @@ -676,7 +678,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
> > > > > static struct svc_rqst *
> > > > > svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > > > {
> > > > > + struct xa_limit limit = {
> > > > > + .max = U32_MAX,
> > > > > + };
> > > > > struct svc_rqst *rqstp;
> > > > > + int ret;
> > > > >
> > > > > rqstp = svc_rqst_alloc(serv, pool, node);
> > > > > if (!rqstp)
> > > > > @@ -687,11 +693,21 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > > > serv->sv_nrthreads += 1;
> > > > > spin_unlock_bh(&serv->sv_lock);
> > > > >
> > > > > - spin_lock_bh(&pool->sp_lock);
> > > > > + xa_lock(&pool->sp_thread_xa);
> > > > > + ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
> > > > > + limit, GFP_KERNEL);
> > > > > + if (ret) {
> > > > > + xa_unlock(&pool->sp_thread_xa);
> > > > > + goto out_free;
> > > > > + }
> > > > > pool->sp_nrthreads++;
> > > > > - list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
> > > > > - spin_unlock_bh(&pool->sp_lock);
> > > > > + xa_unlock(&pool->sp_thread_xa);
> > > > > + trace_svc_pool_thread_init(serv, pool, rqstp);
> > > > > return rqstp;
> > > > > +
> > > > > +out_free:
> > > > > + svc_rqst_free(rqstp);
> > > > > + return ERR_PTR(ret);
> > > > > }
> > > > >
> > > > > /**
> > > > > @@ -708,19 +724,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > > > > struct svc_pool *pool)
> > > > > {
> > > > > struct svc_rqst *rqstp;
> > > > > + unsigned long index;
> > > > >
> > > > > - rcu_read_lock();
> > > >
> > > >
> > > > While it does do its own locking, the resulting object that xa_for_each
> > > > returns needs some protection too. Between xa_for_each returning a rqstp
> > > > and calling test_and_set_bit, could the rqstp be freed? I suspect so,
> > > > and I think you probably need to keep the rcu_read_lock() call above.
> > >
> > > Should I keep the rcu_read_lock() even with the bitmap/xa_load
> > > version of svc_pool_wake_idle_thread() in 9/9 ?
> > >
> >
> > Yeah, I think you have to. We're not doing real locking on the search or
> > taking references, so nothing else will ensure that the rqstp will stick
> > around once you've found it. I think you have to hold it until after
> > wake_up_process (at least).
>
> I can keep the RCU read lock around the search and xa_load(). But
> I notice that the code we're replacing releases the RCU read lock
> before calling wake_up_process(). Not saying that's right, but we
> haven't had a problem reported.
>
>

Understood. Given that we're not sleeping in that section, it's quite
possible that the RCU callbacks just never have a chance to run before
we wake the thing up and so you never hit the problem.

Still, I think it'd be best to just keep the rcu_read_lock around that
whole block. It's relatively cheap and safe to take it recursively, and
that makes it explicit that the found rqst mustn't vanish before the
wakeup is done.
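
To make the hazard concrete (sketch only; "id" is just a stand-in for the
looked-up thread id, and the free happens via the rq_rcu_head deferred
kfree):

    /* waker, with no rcu_read_lock() held */
    rqstp = xa_load(&pool->sp_thread_xa, id);
        /* svc_exit_thread() runs on another CPU:
         *   __xa_erase(&pool->sp_thread_xa, ...);
         *   rqstp freed once a grace period elapses -- and nothing
         *   here is holding that grace period open
         */
    wake_up_process(rqstp->rq_task);    /* may touch freed memory */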

> > > > > - list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> > > > > + xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > > > > if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > > > > continue;
> > > > >
> > > > > - rcu_read_unlock();
> > > > > WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> > > > > wake_up_process(rqstp->rq_task);
> > > > > percpu_counter_inc(&pool->sp_threads_woken);
> > > > > return rqstp;
> > > > > }
> > > > > - rcu_read_unlock();
> > > > >
> > > >
> > > > I wonder if this can race with svc_pool_victim below? Can we end up
> > > > waking a thread that's already on its way out of the pool? Maybe this is
> > > > addressed in your next patch though...
> > > >
> > > > > trace_svc_pool_starved(serv, pool);
> > > > > percpu_counter_inc(&pool->sp_threads_starved);
> > > > > @@ -736,32 +750,33 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> > > > > static struct task_struct *
> > > > > svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> > > > > {
> > > > > - unsigned int i;
> > > > > struct task_struct *task = NULL;
> > > > > + struct svc_rqst *rqstp;
> > > > > + unsigned int i;
> > > > >
> > > > > if (pool != NULL) {
> > > > > - spin_lock_bh(&pool->sp_lock);
> > > > > + xa_lock(&pool->sp_thread_xa);
> > > > > + if (!pool->sp_nrthreads)
> > > > > + goto out;
> > > > > } else {
> > > > > for (i = 0; i < serv->sv_nrpools; i++) {
> > > > > pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
> > > > > - spin_lock_bh(&pool->sp_lock);
> > > > > - if (!list_empty(&pool->sp_all_threads))
> > > > > + xa_lock(&pool->sp_thread_xa);
> > > > > + if (pool->sp_nrthreads)
> > > > > goto found_pool;
> > > > > - spin_unlock_bh(&pool->sp_lock);
> > > > > + xa_unlock(&pool->sp_thread_xa);
> > > > > }
> > > > > return NULL;
> > > > > }
> > > > >
> > > > > found_pool:
> > > > > - if (!list_empty(&pool->sp_all_threads)) {
> > > > > - struct svc_rqst *rqstp;
> > > > > -
> > > > > - rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
> > > > > - set_bit(RQ_VICTIM, &rqstp->rq_flags);
> > > > > - list_del_rcu(&rqstp->rq_all);
> > > > > + rqstp = xa_load(&pool->sp_thread_xa, pool->sp_nrthreads - 1);
> > > > > + if (rqstp) {
> > > > > + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > > > > task = rqstp->rq_task;
> > > > > }
> > > > > - spin_unlock_bh(&pool->sp_lock);
> > > > > +out:
> > > > > + xa_unlock(&pool->sp_thread_xa);
> > > > > return task;
> > > > > }
> > > > >
> > > > > @@ -843,9 +858,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> > > > > if (pool == NULL) {
> > > > > nrservs -= serv->sv_nrthreads;
> > > > > } else {
> > > > > - spin_lock_bh(&pool->sp_lock);
> > > > > + xa_lock(&pool->sp_thread_xa);
> > > > > nrservs -= pool->sp_nrthreads;
> > > > > - spin_unlock_bh(&pool->sp_lock);
> > > > > + xa_unlock(&pool->sp_thread_xa);
> > > > > }
> > > > >
> > > > > if (nrservs > 0)
> > > > > @@ -932,11 +947,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
> > > > > struct svc_serv *serv = rqstp->rq_server;
> > > > > struct svc_pool *pool = rqstp->rq_pool;
> > > > >
> > > > > - spin_lock_bh(&pool->sp_lock);
> > > > > + xa_lock(&pool->sp_thread_xa);
> > > > > pool->sp_nrthreads--;
> > > > > - if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
> > > > > - list_del_rcu(&rqstp->rq_all);
> > > > > - spin_unlock_bh(&pool->sp_lock);
> > > > > + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > > > > + xa_unlock(&pool->sp_thread_xa);
> > > > > + trace_svc_pool_thread_exit(serv, pool, rqstp);
> > > > >
> > > > > spin_lock_bh(&serv->sv_lock);
> > > > > serv->sv_nrthreads -= 1;
> > > > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > > > index 6c2a702aa469..db40f771b60a 100644
> > > > > --- a/net/sunrpc/svc_xprt.c
> > > > > +++ b/net/sunrpc/svc_xprt.c
> > > > > @@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);
> > > > >
> > > > > /* SMP locking strategy:
> > > > > *
> > > > > - * svc_pool->sp_lock protects most of the fields of that pool.
> > > > > + * svc_pool->sp_lock protects sp_sockets.
> > > > > * svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
> > > > > * when both need to be taken (rare), svc_serv->sv_lock is first.
> > > > > * The "service mutex" protects svc_serv->sv_nrthread.
> > > > >
> > > > >
> > > >
> > > > Looks like a nice clean conversion otherwise!
> > > > --
> > > > Jeff Layton <[email protected]>
> > >
> > > --
> > > Chuck Lever
> > >
> > >
> >
> > --
> > Jeff Layton <[email protected]>

--
Jeff Layton <[email protected]>

2023-07-11 12:19:17

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do

On Tue, 11 Jul 2023, Jeff Layton wrote:
> On Tue, 2023-07-11 at 19:54 +1000, NeilBrown wrote:
> > On Tue, 11 Jul 2023, Chuck Lever III wrote:
> > >
> > > > On Jul 10, 2023, at 6:29 PM, NeilBrown <[email protected]> wrote:
> > > >
> > > > On Tue, 11 Jul 2023, Chuck Lever wrote:
> > > > > From: Chuck Lever <[email protected]>
> > > > >
> > > > > Measure a source of thread scheduling inefficiency -- count threads
> > > > > that were awoken but found that the transport queue had already been
> > > > > emptied.
> > > > >
> > > > > An empty transport queue is possible when threads that run between
> > > > > the wake_up_process() call and the woken thread returning from the
> > > > > scheduler have pulled all remaining work off the transport queue
> > > > > using the first svc_xprt_dequeue() in svc_get_next_xprt().
> > > >
> > > > I'm in two minds about this. The data being gathered here is
> > > > potentially useful
> > >
> > > It's actually pretty shocking: I've measured more than
> > > 15% of thread wake-ups find no work to do.
> >
> > That is a bigger number than I would have guessed!
> >
>
> I'm guessing the idea is that the receiver is waking a thread to do the
> work, and that races with one that's already running? I'm sure there are
> ways we can fix that, but it really does seem like we're trying to
> reinvent workqueues here.

True. But then workqueues seem to reinvent themselves every so often
too. One gets the impression they are trying to meet an enormous
variety of needs.
I'm not against trying to see if nfsd could work well in a workqueue
environment, but I'm not certain it is a good idea. Maintaining control
of our own thread pools might be safer.

I have a vague memory of looking into this in more detail once and
deciding that it wasn't a good fit, but I cannot recall or easily deduce
the reason. Obviously we would have to give up SIGKILL, but we want to
do that anyway.

Would we want unbound work queues - so they can be scheduled across
different CPUs? Are NFSD threads cpu-intensive or not? I'm not sure.

I would be happy to explore a credible attempt at a conversion.
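
Just to make that concrete, I imagine a conversion would start from
something like the sketch below -- sv_wq, xpt_work and svc_xprt_work()
are made-up names, not anything that exists today:

    /* one unbound workqueue per svc_serv, instead of a thread pool */
    serv->sv_wq = alloc_workqueue("nfsd", WQ_UNBOUND | WQ_SYSFS, 0);

    /* when a transport has work, queue it rather than waking a thread */
    INIT_WORK(&xprt->xpt_work, svc_xprt_work);
    queue_work(serv->sv_wq, &xprt->xpt_work);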

NeilBrown


>
> > >
> > >
> > > > - but who it is useful to?
> > > > I think it is primarily useful for us - to understand the behaviour of
> > > > the implementation so we can know what needs to be optimised.
> > > > It isn't really of any use to a sysadmin who wants to understand how
> > > > their system is performing.
> > > >
> > > > But then .. who are tracepoints for? Developers or admins?
> > > > I guess that fact that we feel free to modify them whenever we need
> > > > means they are primarily for developers? In which case this is a good
> > > > patch, and maybe we'll revert the functionality one day if it turns out
> > > > that we can change the implementation so that a thread is never woken
> > > > when there is no work to do ....
> > >
> > > A reasonable question to ask. The new "starved" metric
> > > is similar: possibly useful while we are developing the
> > > code, but not particularly valuable for system
> > > administrators.
> > >
> > > How are the pool_stats used by administrators?
> >
> > That is a fair question. Probably not much.
> > Once upon a time we had stats which could show a histogram of thread
> > usage. I used that to decide if the load justified more threads.
> > But we removed it because it was somewhat expensive and it was argued it
> > may not be all that useful...
> > I haven't really looked at any other stats in my work. Almost always it
> > is a packet capture that helps me see what is happening when I have an
> > issue to address.
> >
> > Maybe I should just accept that stats are primarily for developers and
> > they can be incredibly useful for that purpose, and not worry if admins
> > might ever need them.
> >
> > >
> > > (And, why are they in /proc/fs/nfsd/ rather than under
> > > something RPC-related?)
> >
> > Maybe because we "owned" /proc/fs/nfsd/, but the RPC-related stuff is
> > under "net" and we didn't feel so comfortable sticking new stuff there.
> > Or maybe not.
> >
>
> --
> Jeff Layton <[email protected]>
>
>


2023-07-11 12:36:53

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do



> On Jul 11, 2023, at 8:17 AM, NeilBrown <[email protected]> wrote:
>
> On Tue, 11 Jul 2023, Jeff Layton wrote:
>> On Tue, 2023-07-11 at 19:54 +1000, NeilBrown wrote:
>>> On Tue, 11 Jul 2023, Chuck Lever III wrote:
>>>>
>>>>> On Jul 10, 2023, at 6:29 PM, NeilBrown <[email protected]> wrote:
>>>>>
>>>>> On Tue, 11 Jul 2023, Chuck Lever wrote:
>>>>>> From: Chuck Lever <[email protected]>
>>>>>>
>>>>>> Measure a source of thread scheduling inefficiency -- count threads
>>>>>> that were awoken but found that the transport queue had already been
>>>>>> emptied.
>>>>>>
>>>>>> An empty transport queue is possible when threads that run between
>>>>>> the wake_up_process() call and the woken thread returning from the
>>>>>> scheduler have pulled all remaining work off the transport queue
>>>>>> using the first svc_xprt_dequeue() in svc_get_next_xprt().
>>>>>
>>>>> I'm in two minds about this. The data being gathered here is
>>>>> potentially useful
>>>>
>>>> It's actually pretty shocking: I've measured more than
>>>> 15% of thread wake-ups find no work to do.
>>>
>>> That is a bigger number than I would have guessed!
>>>
>>
>> I'm guessing the idea is that the receiver is waking a thread to do the
>> work, and that races with one that's already running? I'm sure there are
>> ways we can fix that, but it really does seem like we're trying to
>> reinvent workqueues here.
>
> True. But then workqueues seem to reinvent themselves every so often
> too. One gets the impression they are trying to meet an enormous
> variety of needs.
> I'm not against trying to see if nfsd could work well in a workqueue
> environment, but I'm not certain it is a good idea. Maintaining control
> of our own thread pools might be safer.
>
> I have a vague memory of looking into this in more detail once and
> deciding that it wasn't a good fit, but I cannot recall or easily deduce
> the reason. Obviously we would have to give up SIGKILL, but we want to
> do that anyway.
>
> Would we want unbound work queues - so they can be scheduled across
> different CPUs? Are NFSD threads cpu-intensive or not? I'm not sure.
>
> I would be happy to explore a credible attempt at a conversion.

Jeff did one a few years back. It had some subtle performance
problems that we couldn't nail down.

My concern with workqueues is they don't seem well suited to
heavy workloads. There seems to be a lot of lock contention
unless the consumer is very very careful.


>>>>> - but who it is useful to?
>>>>> I think it is primarily useful for us - to understand the behaviour of
>>>>> the implementation so we can know what needs to be optimised.
>>>>> It isn't really of any use to a sysadmin who wants to understand how
>>>>> their system is performing.
>>>>>
>>>>> But then .. who are tracepoints for? Developers or admins?
>>>>> I guess that fact that we feel free to modify them whenever we need
>>>>> means they are primarily for developers? In which case this is a good
>>>>> patch, and maybe we'll revert the functionality one day if it turns out
>>>>> that we can change the implementation so that a thread is never woken
>>>>> when there is no work to do ....
>>>>
>>>> A reasonable question to ask. The new "starved" metric
>>>> is similar: possibly useful while we are developing the
>>>> code, but not particularly valuable for system
>>>> administrators.
>>>>
>>>> How are the pool_stats used by administrators?
>>>
>>> That is a fair question. Probably not much.
>>> Once upon a time we had stats which could show a histogram of thread
>>> usage. I used that to decide if the load justified more threads.
>>> But we removed it because it was somewhat expensive and it was argued it
>>> may not be all that useful...
>>> I haven't really looked at any other stats in my work. Almost always it
>>> is a packet capture that helps me see what is happening when I have an
>>> issue to address.
>>>
>>> Maybe I should just accept that stats are primarily for developers and
>>> they can be incredibly useful for that purpose, and not worry if admins
>>> might ever need them.
>>>
>>>>
>>>> (And, why are they in /proc/fs/nfsd/ rather than under
>>>> something RPC-related?)
>>>
>>> Maybe because we "owned" /proc/fs/nfsd/, but the RPC-related stuff is
>>> under "net" and we didn't feel so comfortable sticking new stuff there.
>>> Or maybe not.
>>>
>>
>> --
>> Jeff Layton <[email protected]>


--
Chuck Lever



2023-07-11 12:52:40

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] SUNRPC: Replace sp_threads_all with an xarray



> On Jul 11, 2023, at 7:57 AM, Jeff Layton <[email protected]> wrote:
>
> On Mon, 2023-07-10 at 22:24 -0400, Chuck Lever wrote:
>> On Mon, Jul 10, 2023 at 09:06:02PM -0400, Jeff Layton wrote:
>>> On Tue, 2023-07-11 at 00:58 +0000, Chuck Lever III wrote:
>>>>
>>>>> On Jul 10, 2023, at 2:24 PM, Jeff Layton <[email protected]> wrote:
>>>>>
>>>>> On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
>>>>>> From: Chuck Lever <[email protected]>
>>>>>>
>>>>>> We want a thread lookup operation that can be done with RCU only,
>>>>>> but also we want to avoid the linked-list walk, which does not scale
>>>>>> well in the number of pool threads.
>>>>>>
>>>>>> This patch splits out the use of the sp_lock to protect the set
>>>>>> of threads. Svc thread information is now protected by the xarray's
>>>>>> lock (when making thread count changes) and the RCU read lock (when
>>>>>> only looking up a thread).
>>>>>>
>>>>>> Since thread count changes are done only via nfsd filesystem API,
>>>>>> which runs only in process context, we can safely dispense with the
>>>>>> use of a bottom-half-disabled lock.
>>>>>>
>>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>>> ---
>>>>>> fs/nfsd/nfssvc.c | 3 +-
>>>>>> include/linux/sunrpc/svc.h | 11 +++----
>>>>>> include/trace/events/sunrpc.h | 47 ++++++++++++++++++++++++++++-
>>>>>> net/sunrpc/svc.c | 67 +++++++++++++++++++++++++----------------
>>>>>> net/sunrpc/svc_xprt.c | 2 +
>>>>>> 5 files changed, 94 insertions(+), 36 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
>>>>>> index 2154fa63c5f2..d42b2a40c93c 100644
>>>>>> --- a/fs/nfsd/nfssvc.c
>>>>>> +++ b/fs/nfsd/nfssvc.c
>>>>>> @@ -62,8 +62,7 @@ static __be32 nfsd_init_request(struct svc_rqst *,
>>>>>> * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
>>>>>> * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
>>>>>> * nn->keep_active is set). That number of nfsd threads must
>>>>>> - * exist and each must be listed in ->sp_all_threads in some entry of
>>>>>> - * ->sv_pools[].
>>>>>> + * exist and each must be listed in some entry of ->sv_pools[].
>>>>>> *
>>>>>> * Each active thread holds a counted reference on nn->nfsd_serv, as does
>>>>>> * the nn->keep_active flag and various transient calls to svc_get().
>>>>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>>>>> index 9dd3b16cc4c2..86377506a514 100644
>>>>>> --- a/include/linux/sunrpc/svc.h
>>>>>> +++ b/include/linux/sunrpc/svc.h
>>>>>> @@ -32,10 +32,10 @@
>>>>>> */
>>>>>> struct svc_pool {
>>>>>> unsigned int sp_id; /* pool id; also node id on NUMA */
>>>>>> - spinlock_t sp_lock; /* protects all fields */
>>>>>> + spinlock_t sp_lock; /* protects sp_sockets */
>>>>>> struct list_head sp_sockets; /* pending sockets */
>>>>>> unsigned int sp_nrthreads; /* # of threads in pool */
>>>>>> - struct list_head sp_all_threads; /* all server threads */
>>>>>> + struct xarray sp_thread_xa;
>>>>>>
>>>>>> /* statistics on pool operation */
>>>>>> struct percpu_counter sp_messages_arrived;
>>>>>> @@ -196,7 +196,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>>>>>> * processed.
>>>>>> */
>>>>>> struct svc_rqst {
>>>>>> - struct list_head rq_all; /* all threads list */
>>>>>> struct rcu_head rq_rcu_head; /* for RCU deferred kfree */
>>>>>> struct svc_xprt * rq_xprt; /* transport ptr */
>>>>>>
>>>>>> @@ -241,10 +240,10 @@ struct svc_rqst {
>>>>>> #define RQ_SPLICE_OK (4) /* turned off in gss privacy
>>>>>> * to prevent encrypting page
>>>>>> * cache pages */
>>>>>> -#define RQ_VICTIM (5) /* about to be shut down */
>>>>>> -#define RQ_BUSY (6) /* request is busy */
>>>>>> -#define RQ_DATA (7) /* request has data */
>>>>>> +#define RQ_BUSY (5) /* request is busy */
>>>>>> +#define RQ_DATA (6) /* request has data */
>>>>>> unsigned long rq_flags; /* flags field */
>>>>>> + u32 rq_thread_id; /* xarray index */
>>>>>> ktime_t rq_qtime; /* enqueue time */
>>>>>>
>>>>>> void * rq_argp; /* decoded arguments */
>>>>>> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
>>>>>> index 60c8e03268d4..ea43c6059bdb 100644
>>>>>> --- a/include/trace/events/sunrpc.h
>>>>>> +++ b/include/trace/events/sunrpc.h
>>>>>> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
>>>>>> svc_rqst_flag(USEDEFERRAL) \
>>>>>> svc_rqst_flag(DROPME) \
>>>>>> svc_rqst_flag(SPLICE_OK) \
>>>>>> - svc_rqst_flag(VICTIM) \
>>>>>> svc_rqst_flag(BUSY) \
>>>>>> svc_rqst_flag_end(DATA)
>>>>>>
>>>>>> @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
>>>>>> )
>>>>>> );
>>>>>>
>>>>>> +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
>>>>>> + TP_PROTO(
>>>>>> + const struct svc_serv *serv,
>>>>>> + const struct svc_pool *pool,
>>>>>> + const struct svc_rqst *rqstp
>>>>>> + ),
>>>>>> +
>>>>>> + TP_ARGS(serv, pool, rqstp),
>>>>>> +
>>>>>> + TP_STRUCT__entry(
>>>>>> + __string(name, serv->sv_name)
>>>>>> + __field(int, pool_id)
>>>>>> + __field(unsigned int, nrthreads)
>>>>>> + __field(unsigned long, pool_flags)
>>>>>> + __field(u32, thread_id)
>>>>>> + __field(const void *, rqstp)
>>>>>> + ),
>>>>>> +
>>>>>> + TP_fast_assign(
>>>>>> + __assign_str(name, serv->sv_name);
>>>>>> + __entry->pool_id = pool->sp_id;
>>>>>> + __entry->nrthreads = pool->sp_nrthreads;
>>>>>> + __entry->pool_flags = pool->sp_flags;
>>>>>> + __entry->thread_id = rqstp->rq_thread_id;
>>>>>> + __entry->rqstp = rqstp;
>>>>>> + ),
>>>>>> +
>>>>>> + TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
>>>>>> + __get_str(name), __entry->pool_id,
>>>>>> + show_svc_pool_flags(__entry->pool_flags),
>>>>>> + __entry->nrthreads, __entry->thread_id
>>>>>> + )
>>>>>> +);
>>>>>> +
>>>>>> +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
>>>>>> + DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
>>>>>> + TP_PROTO( \
>>>>>> + const struct svc_serv *serv, \
>>>>>> + const struct svc_pool *pool, \
>>>>>> + const struct svc_rqst *rqstp \
>>>>>> + ), \
>>>>>> + TP_ARGS(serv, pool, rqstp))
>>>>>> +
>>>>>> +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
>>>>>> +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
>>>>>> +
>>>>>> DECLARE_EVENT_CLASS(svc_xprt_event,
>>>>>> TP_PROTO(
>>>>>> const struct svc_xprt *xprt
>>>>>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>>>>>> index ad29df00b454..109d7f047385 100644
>>>>>> --- a/net/sunrpc/svc.c
>>>>>> +++ b/net/sunrpc/svc.c
>>>>>> @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>>>>>>
>>>>>> pool->sp_id = i;
>>>>>> INIT_LIST_HEAD(&pool->sp_sockets);
>>>>>> - INIT_LIST_HEAD(&pool->sp_all_threads);
>>>>>> spin_lock_init(&pool->sp_lock);
>>>>>> + xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
>>>>>>
>>>>>> percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
>>>>>> percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
>>>>>> @@ -596,6 +596,8 @@ svc_destroy(struct kref *ref)
>>>>>> percpu_counter_destroy(&pool->sp_threads_timedout);
>>>>>> percpu_counter_destroy(&pool->sp_threads_starved);
>>>>>> percpu_counter_destroy(&pool->sp_threads_no_work);
>>>>>> +
>>>>>> + xa_destroy(&pool->sp_thread_xa);
>>>>>> }
>>>>>> kfree(serv->sv_pools);
>>>>>> kfree(serv);
>>>>>> @@ -676,7 +678,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
>>>>>> static struct svc_rqst *
>>>>>> svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>>>>>> {
>>>>>> + struct xa_limit limit = {
>>>>>> + .max = U32_MAX,
>>>>>> + };
>>>>>> struct svc_rqst *rqstp;
>>>>>> + int ret;
>>>>>>
>>>>>> rqstp = svc_rqst_alloc(serv, pool, node);
>>>>>> if (!rqstp)
>>>>>> @@ -687,11 +693,21 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>>>>>> serv->sv_nrthreads += 1;
>>>>>> spin_unlock_bh(&serv->sv_lock);
>>>>>>
>>>>>> - spin_lock_bh(&pool->sp_lock);
>>>>>> + xa_lock(&pool->sp_thread_xa);
>>>>>> + ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
>>>>>> + limit, GFP_KERNEL);
>>>>>> + if (ret) {
>>>>>> + xa_unlock(&pool->sp_thread_xa);
>>>>>> + goto out_free;
>>>>>> + }
>>>>>> pool->sp_nrthreads++;
>>>>>> - list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
>>>>>> - spin_unlock_bh(&pool->sp_lock);
>>>>>> + xa_unlock(&pool->sp_thread_xa);
>>>>>> + trace_svc_pool_thread_init(serv, pool, rqstp);
>>>>>> return rqstp;
>>>>>> +
>>>>>> +out_free:
>>>>>> + svc_rqst_free(rqstp);
>>>>>> + return ERR_PTR(ret);
>>>>>> }
>>>>>>
>>>>>> /**
>>>>>> @@ -708,19 +724,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
>>>>>> struct svc_pool *pool)
>>>>>> {
>>>>>> struct svc_rqst *rqstp;
>>>>>> + unsigned long index;
>>>>>>
>>>>>> - rcu_read_lock();
>>>>>
>>>>>
>>>>> While it does do its own locking, the resulting object that xa_for_each
>>>>> returns needs some protection too. Between xa_for_each returning a rqstp
>>>>> and calling test_and_set_bit, could the rqstp be freed? I suspect so,
>>>>> and I think you probably need to keep the rcu_read_lock() call above.
>>>>
>>>> Should I keep the rcu_read_lock() even with the bitmap/xa_load
>>>> version of svc_pool_wake_idle_thread() in 9/9 ?
>>>>
>>>
>>> Yeah, I think you have to. We're not doing real locking on the search or
>>> taking references, so nothing else will ensure that the rqstp will stick
>>> around once you've found it. I think you have to hold it until after
>>> wake_up_process (at least).
>>
>> I can keep the RCU read lock around the search and xa_load(). But
>> I notice that the code we're replacing releases the RCU read lock
>> before calling wake_up_process(). Not saying that's right, but we
>> haven't had a problem reported.
>>
>>
>
> Understood. Given that we're not sleeping in that section, it's quite
> possible that the RCU callbacks just never have a chance to run before
> we wake the thing up and so you never hit the problem.
>
> Still, I think it'd be best to just keep the rcu_read_lock around that
> whole block. It's relatively cheap and safe to take it recursively, and
> that makes it explicit that the found rqst mustn't vanish before the
> wakeup is done.

My point is that since the existing code doesn't hold the
RCU read lock for the wake_up_process() call, either it's
unnecessary or the existing code is broken and needs a
back-portable fix.

Do you think the existing code is broken?


>>>>>> - list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
>>>>>> + xa_for_each(&pool->sp_thread_xa, index, rqstp) {
>>>>>> if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
>>>>>> continue;
>>>>>>
>>>>>> - rcu_read_unlock();
>>>>>> WRITE_ONCE(rqstp->rq_qtime, ktime_get());
>>>>>> wake_up_process(rqstp->rq_task);
>>>>>> percpu_counter_inc(&pool->sp_threads_woken);
>>>>>> return rqstp;
>>>>>> }
>>>>>> - rcu_read_unlock();
>>>>>>
>>>>>
>>>>> I wonder if this can race with svc_pool_victim below? Can we end up
>>>>> waking a thread that's already on its way out of the pool? Maybe this is
>>>>> addressed in your next patch though...
>>>>>
>>>>>> trace_svc_pool_starved(serv, pool);
>>>>>> percpu_counter_inc(&pool->sp_threads_starved);
>>>>>> @@ -736,32 +750,33 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
>>>>>> static struct task_struct *
>>>>>> svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
>>>>>> {
>>>>>> - unsigned int i;
>>>>>> struct task_struct *task = NULL;
>>>>>> + struct svc_rqst *rqstp;
>>>>>> + unsigned int i;
>>>>>>
>>>>>> if (pool != NULL) {
>>>>>> - spin_lock_bh(&pool->sp_lock);
>>>>>> + xa_lock(&pool->sp_thread_xa);
>>>>>> + if (!pool->sp_nrthreads)
>>>>>> + goto out;
>>>>>> } else {
>>>>>> for (i = 0; i < serv->sv_nrpools; i++) {
>>>>>> pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
>>>>>> - spin_lock_bh(&pool->sp_lock);
>>>>>> - if (!list_empty(&pool->sp_all_threads))
>>>>>> + xa_lock(&pool->sp_thread_xa);
>>>>>> + if (pool->sp_nrthreads)
>>>>>> goto found_pool;
>>>>>> - spin_unlock_bh(&pool->sp_lock);
>>>>>> + xa_unlock(&pool->sp_thread_xa);
>>>>>> }
>>>>>> return NULL;
>>>>>> }
>>>>>>
>>>>>> found_pool:
>>>>>> - if (!list_empty(&pool->sp_all_threads)) {
>>>>>> - struct svc_rqst *rqstp;
>>>>>> -
>>>>>> - rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
>>>>>> - set_bit(RQ_VICTIM, &rqstp->rq_flags);
>>>>>> - list_del_rcu(&rqstp->rq_all);
>>>>>> + rqstp = xa_load(&pool->sp_thread_xa, pool->sp_nrthreads - 1);
>>>>>> + if (rqstp) {
>>>>>> + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
>>>>>> task = rqstp->rq_task;
>>>>>> }
>>>>>> - spin_unlock_bh(&pool->sp_lock);
>>>>>> +out:
>>>>>> + xa_unlock(&pool->sp_thread_xa);
>>>>>> return task;
>>>>>> }
>>>>>>
>>>>>> @@ -843,9 +858,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
>>>>>> if (pool == NULL) {
>>>>>> nrservs -= serv->sv_nrthreads;
>>>>>> } else {
>>>>>> - spin_lock_bh(&pool->sp_lock);
>>>>>> + xa_lock(&pool->sp_thread_xa);
>>>>>> nrservs -= pool->sp_nrthreads;
>>>>>> - spin_unlock_bh(&pool->sp_lock);
>>>>>> + xa_unlock(&pool->sp_thread_xa);
>>>>>> }
>>>>>>
>>>>>> if (nrservs > 0)
>>>>>> @@ -932,11 +947,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
>>>>>> struct svc_serv *serv = rqstp->rq_server;
>>>>>> struct svc_pool *pool = rqstp->rq_pool;
>>>>>>
>>>>>> - spin_lock_bh(&pool->sp_lock);
>>>>>> + xa_lock(&pool->sp_thread_xa);
>>>>>> pool->sp_nrthreads--;
>>>>>> - if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
>>>>>> - list_del_rcu(&rqstp->rq_all);
>>>>>> - spin_unlock_bh(&pool->sp_lock);
>>>>>> + __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
>>>>>> + xa_unlock(&pool->sp_thread_xa);
>>>>>> + trace_svc_pool_thread_exit(serv, pool, rqstp);
>>>>>>
>>>>>> spin_lock_bh(&serv->sv_lock);
>>>>>> serv->sv_nrthreads -= 1;
>>>>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>>>>> index 6c2a702aa469..db40f771b60a 100644
>>>>>> --- a/net/sunrpc/svc_xprt.c
>>>>>> +++ b/net/sunrpc/svc_xprt.c
>>>>>> @@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);
>>>>>>
>>>>>> /* SMP locking strategy:
>>>>>> *
>>>>>> - * svc_pool->sp_lock protects most of the fields of that pool.
>>>>>> + * svc_pool->sp_lock protects sp_sockets.
>>>>>> * svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
>>>>>> * when both need to be taken (rare), svc_serv->sv_lock is first.
>>>>>> * The "service mutex" protects svc_serv->sv_nrthread.
>>>>>>
>>>>>>
>>>>>
>>>>> Looks like a nice clean conversion otherwise!
>>>>> --
>>>>> Jeff Layton <[email protected]>
>>>>
>>>> --
>>>> Chuck Lever
>>>>
>>>>
>>>
>>> --
>>> Jeff Layton <[email protected]>
>
> --
> Jeff Layton <[email protected]>

--
Chuck Lever



2023-07-11 14:23:07

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do

On Tue, 2023-07-11 at 22:17 +1000, NeilBrown wrote:
> On Tue, 11 Jul 2023, Jeff Layton wrote:
> > On Tue, 2023-07-11 at 19:54 +1000, NeilBrown wrote:
> > > On Tue, 11 Jul 2023, Chuck Lever III wrote:
> > > >
> > > > > On Jul 10, 2023, at 6:29 PM, NeilBrown <[email protected]> wrote:
> > > > >
> > > > > On Tue, 11 Jul 2023, Chuck Lever wrote:
> > > > > > From: Chuck Lever <[email protected]>
> > > > > >
> > > > > > Measure a source of thread scheduling inefficiency -- count threads
> > > > > > that were awoken but found that the transport queue had already been
> > > > > > emptied.
> > > > > >
> > > > > > An empty transport queue is possible when threads that run between
> > > > > > the wake_up_process() call and the woken thread returning from the
> > > > > > scheduler have pulled all remaining work off the transport queue
> > > > > > using the first svc_xprt_dequeue() in svc_get_next_xprt().
> > > > >
> > > > > I'm in two minds about this. The data being gathered here is
> > > > > potentially useful
> > > >
> > > > It's actually pretty shocking: I've measured more than
> > > > 15% of thread wake-ups find no work to do.
> > >
> > > That is a bigger number than I would have guessed!
> > >
> >
> > I'm guessing the idea is that the receiver is waking a thread to do the
> > work, and that races with one that's already running? I'm sure there are
> > ways we can fix that, but it really does seem like we're trying to
> > reinvent workqueues here.
>
> True. But then workqueues seem to reinvent themselves every so often
> too.  One gets the impression they are trying to meet an enormous
> variety of needs.
> I'm not against trying to see if nfsd could work well in a workqueue
> environment, but I'm not certain it is a good idea. Maintaining control
> of our own thread pools might be safer.
>
> I have a vague memory of looking into this in more detail once and
> deciding that it wasn't a good fit, but I cannot recall or easily deduce
> the reason. Obviously we would have to give up SIGKILL, but we want to
> do that anyway.
>
> Would we want unbound work queues - so they can be scheduled across
> different CPUs? Are NFSD threads cpu-intensive or not? I'm not sure.
>
> I would be happy to explore a credible attempt at a conversion.
>

I had some patches several years ago that did a conversion from nfsd
threads to workqueues.

It worked reasonably well, but under heavy loads it didn't perform as
well as having a dedicated threadpool...which is not too surprising,
really. nfsd has been tuned for performance over the years and it's fairly
"greedy" about squatting on system resources even when it's idle.

If we wanted to look again at doing this with workqueues, we'd need the
workqueue infrastructure to allow for long-running jobs that may block
for a long time. That means it might need to be more cavalier about
spinning up new workqueue threads and keeping them running when there
are a lot of concurrent, but sleeping workqueue jobs.

We probably would want unbound workqueues so we can have more jobs in
flight at any given time than cpus, given that a lot of them will end up
being blocked in some way or another.
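
For illustration only, a rough sketch of the sort of unbound workqueue setup
being described (the names nfsd_wq, rq_work, and nfsd_process_rpc are
hypothetical, not taken from any posted patch):

	/* hypothetical: one unbound workqueue for nfsd request processing */
	struct workqueue_struct *nfsd_wq;

	nfsd_wq = alloc_workqueue("nfsd", WQ_UNBOUND | WQ_SYSFS, 0);
	if (!nfsd_wq)
		return -ENOMEM;

	/* each ingress RPC message would become its own work item */
	INIT_WORK(&rqstp->rq_work, nfsd_process_rpc);	/* rq_work: hypothetical field */
	queue_work(nfsd_wq, &rqstp->rq_work);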

CPU utilization is a good question. Mostly we just call down into the
filesystem to do things, and the encoding and decoding is not _that_ cpu
intensive. For most storage stacks, I suspect we don't use a lot of CPU
aside from normal copying of data. There might be some exceptions
though, like when the underlying storage is using encryption, etc.

The main upside to switching to workqueues is that it would allow us to
get rid of a lot of fiddly, hand-tuned thread handling code. It might
also make it simpler to convert to a more asynchronous model.

For instance, it would be nice to be able to not have to put a thread to
sleep while waiting on writeback for a COMMIT. I think that'd be easier
to handle with a workqueue-like model.
--
Jeff Layton <[email protected]>


2023-07-11 15:11:57

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] SUNRPC: Replace sp_threads_all with an xarray

On Tue, 2023-07-11 at 12:39 +0000, Chuck Lever III wrote:
>
> > On Jul 11, 2023, at 7:57 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Mon, 2023-07-10 at 22:24 -0400, Chuck Lever wrote:
> > > On Mon, Jul 10, 2023 at 09:06:02PM -0400, Jeff Layton wrote:
> > > > On Tue, 2023-07-11 at 00:58 +0000, Chuck Lever III wrote:
> > > > >
> > > > > > On Jul 10, 2023, at 2:24 PM, Jeff Layton <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
> > > > > > > From: Chuck Lever <[email protected]>
> > > > > > >
> > > > > > > We want a thread lookup operation that can be done with RCU only,
> > > > > > > but also we want to avoid the linked-list walk, which does not scale
> > > > > > > well in the number of pool threads.
> > > > > > >
> > > > > > > This patch splits out the use of the sp_lock to protect the set
> > > > > > > of threads. Svc thread information is now protected by the xarray's
> > > > > > > lock (when making thread count changes) and the RCU read lock (when
> > > > > > > only looking up a thread).
> > > > > > >
> > > > > > > Since thread count changes are done only via nfsd filesystem API,
> > > > > > > which runs only in process context, we can safely dispense with the
> > > > > > > use of a bottom-half-disabled lock.
> > > > > > >
> > > > > > > Signed-off-by: Chuck Lever <[email protected]>
> > > > > > > ---
> > > > > > > fs/nfsd/nfssvc.c | 3 +-
> > > > > > > include/linux/sunrpc/svc.h | 11 +++----
> > > > > > > include/trace/events/sunrpc.h | 47 ++++++++++++++++++++++++++++-
> > > > > > > net/sunrpc/svc.c | 67 +++++++++++++++++++++++++----------------
> > > > > > > net/sunrpc/svc_xprt.c | 2 +
> > > > > > > 5 files changed, 94 insertions(+), 36 deletions(-)
> > > > > > >
> > > > > > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > > > > > > index 2154fa63c5f2..d42b2a40c93c 100644
> > > > > > > --- a/fs/nfsd/nfssvc.c
> > > > > > > +++ b/fs/nfsd/nfssvc.c
> > > > > > > @@ -62,8 +62,7 @@ static __be32 nfsd_init_request(struct svc_rqst *,
> > > > > > > * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
> > > > > > > * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
> > > > > > > * nn->keep_active is set). That number of nfsd threads must
> > > > > > > - * exist and each must be listed in ->sp_all_threads in some entry of
> > > > > > > - * ->sv_pools[].
> > > > > > > + * exist and each must be listed in some entry of ->sv_pools[].
> > > > > > > *
> > > > > > > * Each active thread holds a counted reference on nn->nfsd_serv, as does
> > > > > > > * the nn->keep_active flag and various transient calls to svc_get().
> > > > > > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > > > > > index 9dd3b16cc4c2..86377506a514 100644
> > > > > > > --- a/include/linux/sunrpc/svc.h
> > > > > > > +++ b/include/linux/sunrpc/svc.h
> > > > > > > @@ -32,10 +32,10 @@
> > > > > > > */
> > > > > > > struct svc_pool {
> > > > > > > unsigned int sp_id; /* pool id; also node id on NUMA */
> > > > > > > - spinlock_t sp_lock; /* protects all fields */
> > > > > > > + spinlock_t sp_lock; /* protects sp_sockets */
> > > > > > > struct list_head sp_sockets; /* pending sockets */
> > > > > > > unsigned int sp_nrthreads; /* # of threads in pool */
> > > > > > > - struct list_head sp_all_threads; /* all server threads */
> > > > > > > + struct xarray sp_thread_xa;
> > > > > > >
> > > > > > > /* statistics on pool operation */
> > > > > > > struct percpu_counter sp_messages_arrived;
> > > > > > > @@ -196,7 +196,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > > > > > > * processed.
> > > > > > > */
> > > > > > > struct svc_rqst {
> > > > > > > - struct list_head rq_all; /* all threads list */
> > > > > > > struct rcu_head rq_rcu_head; /* for RCU deferred kfree */
> > > > > > > struct svc_xprt * rq_xprt; /* transport ptr */
> > > > > > >
> > > > > > > @@ -241,10 +240,10 @@ struct svc_rqst {
> > > > > > > #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> > > > > > > * to prevent encrypting page
> > > > > > > * cache pages */
> > > > > > > -#define RQ_VICTIM (5) /* about to be shut down */
> > > > > > > -#define RQ_BUSY (6) /* request is busy */
> > > > > > > -#define RQ_DATA (7) /* request has data */
> > > > > > > +#define RQ_BUSY (5) /* request is busy */
> > > > > > > +#define RQ_DATA (6) /* request has data */
> > > > > > > unsigned long rq_flags; /* flags field */
> > > > > > > + u32 rq_thread_id; /* xarray index */
> > > > > > > ktime_t rq_qtime; /* enqueue time */
> > > > > > >
> > > > > > > void * rq_argp; /* decoded arguments */
> > > > > > > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > > > > > > index 60c8e03268d4..ea43c6059bdb 100644
> > > > > > > --- a/include/trace/events/sunrpc.h
> > > > > > > +++ b/include/trace/events/sunrpc.h
> > > > > > > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> > > > > > > svc_rqst_flag(USEDEFERRAL) \
> > > > > > > svc_rqst_flag(DROPME) \
> > > > > > > svc_rqst_flag(SPLICE_OK) \
> > > > > > > - svc_rqst_flag(VICTIM) \
> > > > > > > svc_rqst_flag(BUSY) \
> > > > > > > svc_rqst_flag_end(DATA)
> > > > > > >
> > > > > > > @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
> > > > > > > )
> > > > > > > );
> > > > > > >
> > > > > > > +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
> > > > > > > + TP_PROTO(
> > > > > > > + const struct svc_serv *serv,
> > > > > > > + const struct svc_pool *pool,
> > > > > > > + const struct svc_rqst *rqstp
> > > > > > > + ),
> > > > > > > +
> > > > > > > + TP_ARGS(serv, pool, rqstp),
> > > > > > > +
> > > > > > > + TP_STRUCT__entry(
> > > > > > > + __string(name, serv->sv_name)
> > > > > > > + __field(int, pool_id)
> > > > > > > + __field(unsigned int, nrthreads)
> > > > > > > + __field(unsigned long, pool_flags)
> > > > > > > + __field(u32, thread_id)
> > > > > > > + __field(const void *, rqstp)
> > > > > > > + ),
> > > > > > > +
> > > > > > > + TP_fast_assign(
> > > > > > > + __assign_str(name, serv->sv_name);
> > > > > > > + __entry->pool_id = pool->sp_id;
> > > > > > > + __entry->nrthreads = pool->sp_nrthreads;
> > > > > > > + __entry->pool_flags = pool->sp_flags;
> > > > > > > + __entry->thread_id = rqstp->rq_thread_id;
> > > > > > > + __entry->rqstp = rqstp;
> > > > > > > + ),
> > > > > > > +
> > > > > > > + TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
> > > > > > > + __get_str(name), __entry->pool_id,
> > > > > > > + show_svc_pool_flags(__entry->pool_flags),
> > > > > > > + __entry->nrthreads, __entry->thread_id
> > > > > > > + )
> > > > > > > +);
> > > > > > > +
> > > > > > > +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
> > > > > > > + DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
> > > > > > > + TP_PROTO( \
> > > > > > > + const struct svc_serv *serv, \
> > > > > > > + const struct svc_pool *pool, \
> > > > > > > + const struct svc_rqst *rqstp \
> > > > > > > + ), \
> > > > > > > + TP_ARGS(serv, pool, rqstp))
> > > > > > > +
> > > > > > > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
> > > > > > > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
> > > > > > > +
> > > > > > > DECLARE_EVENT_CLASS(svc_xprt_event,
> > > > > > > TP_PROTO(
> > > > > > > const struct svc_xprt *xprt
> > > > > > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > > > > > > index ad29df00b454..109d7f047385 100644
> > > > > > > --- a/net/sunrpc/svc.c
> > > > > > > +++ b/net/sunrpc/svc.c
> > > > > > > @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> > > > > > >
> > > > > > > pool->sp_id = i;
> > > > > > > INIT_LIST_HEAD(&pool->sp_sockets);
> > > > > > > - INIT_LIST_HEAD(&pool->sp_all_threads);
> > > > > > > spin_lock_init(&pool->sp_lock);
> > > > > > > + xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > > > > > >
> > > > > > > percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > > > > > > percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > > > > > > @@ -596,6 +596,8 @@ svc_destroy(struct kref *ref)
> > > > > > > percpu_counter_destroy(&pool->sp_threads_timedout);
> > > > > > > percpu_counter_destroy(&pool->sp_threads_starved);
> > > > > > > percpu_counter_destroy(&pool->sp_threads_no_work);
> > > > > > > +
> > > > > > > + xa_destroy(&pool->sp_thread_xa);
> > > > > > > }
> > > > > > > kfree(serv->sv_pools);
> > > > > > > kfree(serv);
> > > > > > > @@ -676,7 +678,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
> > > > > > > static struct svc_rqst *
> > > > > > > svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > > > > > {
> > > > > > > + struct xa_limit limit = {
> > > > > > > + .max = U32_MAX,
> > > > > > > + };
> > > > > > > struct svc_rqst *rqstp;
> > > > > > > + int ret;
> > > > > > >
> > > > > > > rqstp = svc_rqst_alloc(serv, pool, node);
> > > > > > > if (!rqstp)
> > > > > > > @@ -687,11 +693,21 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > > > > > > serv->sv_nrthreads += 1;
> > > > > > > spin_unlock_bh(&serv->sv_lock);
> > > > > > >
> > > > > > > - spin_lock_bh(&pool->sp_lock);
> > > > > > > + xa_lock(&pool->sp_thread_xa);
> > > > > > > + ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
> > > > > > > + limit, GFP_KERNEL);
> > > > > > > + if (ret) {
> > > > > > > + xa_unlock(&pool->sp_thread_xa);
> > > > > > > + goto out_free;
> > > > > > > + }
> > > > > > > pool->sp_nrthreads++;
> > > > > > > - list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
> > > > > > > - spin_unlock_bh(&pool->sp_lock);
> > > > > > > + xa_unlock(&pool->sp_thread_xa);
> > > > > > > + trace_svc_pool_thread_init(serv, pool, rqstp);
> > > > > > > return rqstp;
> > > > > > > +
> > > > > > > +out_free:
> > > > > > > + svc_rqst_free(rqstp);
> > > > > > > + return ERR_PTR(ret);
> > > > > > > }
> > > > > > >
> > > > > > > /**
> > > > > > > @@ -708,19 +724,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > > > > > > struct svc_pool *pool)
> > > > > > > {
> > > > > > > struct svc_rqst *rqstp;
> > > > > > > + unsigned long index;
> > > > > > >
> > > > > > > - rcu_read_lock();
> > > > > >
> > > > > >
> > > > > > While it does do its own locking, the resulting object that xa_for_each
> > > > > > returns needs some protection too. Between xa_for_each returning a rqstp
> > > > > > and calling test_and_set_bit, could the rqstp be freed? I suspect so,
> > > > > > and I think you probably need to keep the rcu_read_lock() call above.
> > > > >
> > > > > Should I keep the rcu_read_lock() even with the bitmap/xa_load
> > > > > version of svc_pool_wake_idle_thread() in 9/9 ?
> > > > >
> > > >
> > > > Yeah, I think you have to. We're not doing real locking on the search or
> > > > taking references, so nothing else will ensure that the rqstp will stick
> > > > around once you've found it. I think you have to hold it until after
> > > > wake_up_process (at least).
> > >
> > > I can keep the RCU read lock around the search and xa_load(). But
> > > I notice that the code we're replacing releases the RCU read lock
> > > before calling wake_up_process(). Not saying that's right, but we
> > > haven't had a problem reported.
> > >
> > >
> >
> > Understood. Given that we're not sleeping in that section, it's quite
> > possible that the RCU callbacks just never have a chance to run before
> > we wake the thing up and so you never hit the problem.
> >
> > Still, I think it'd be best to just keep the rcu_read_lock around that
> > whole block. It's relatively cheap and safe to take it recursively, and
> > that makes it explicit that the found rqst mustn't vanish before the
> > wakeup is done.
>
> My point is that since the existing code doesn't hold the
> RCU read lock for the wake_up_process() call, either it's
> unnecessary or the existing code is broken and needs a
> back-portable fix.
>
> Do you think the existing code is broken?
>

Actually, it varies. svc_xprt_enqueue calls wake_up_process with the
rcu_read_lock held. svc_wake_up doesn't though.

So yes, I think svc_wake_up is broken, though I imagine this is hard to
hit outside of some very specific circumstances. Here is the existing
svc_wake_up block:

	rcu_read_lock();
	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
		/* skip any that aren't queued */
		if (test_bit(RQ_BUSY, &rqstp->rq_flags))
			continue;
		rcu_read_unlock();
		wake_up_process(rqstp->rq_task);
		trace_svc_wake_up(rqstp->rq_task->pid);
		return;
	}
	rcu_read_unlock();

Once rcu_read_unlock is called, rqstp could technically be freed before
we go to dereference the pointer. I don't see anything that prevents
that from occurring, though you'd have to be doing this while the
service's thread count is being reduced (which would cause entries to be
freed).

Worth fixing, IMO, but probably not worth sending to stable.
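
A minimal sketch of that fix, just releasing the RCU read lock after the
wakeup (the same code as above, reordered; not a posted patch):

	rcu_read_lock();
	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
		/* skip any that aren't queued */
		if (test_bit(RQ_BUSY, &rqstp->rq_flags))
			continue;
		wake_up_process(rqstp->rq_task);
		trace_svc_wake_up(rqstp->rq_task->pid);
		rcu_read_unlock();
		return;
	}
	rcu_read_unlock();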
--
Jeff Layton <[email protected]>

2023-07-11 16:29:41

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] SUNRPC: Replace sp_threads_all with an xarray



> On Jul 11, 2023, at 11:05 AM, Jeff Layton <[email protected]> wrote:
>
> On Tue, 2023-07-11 at 12:39 +0000, Chuck Lever III wrote:
>>
>>> On Jul 11, 2023, at 7:57 AM, Jeff Layton <[email protected]> wrote:
>>>
>>> On Mon, 2023-07-10 at 22:24 -0400, Chuck Lever wrote:
>>>> On Mon, Jul 10, 2023 at 09:06:02PM -0400, Jeff Layton wrote:
>>>>> On Tue, 2023-07-11 at 00:58 +0000, Chuck Lever III wrote:
>>>>>>
>>>>>>> On Jul 10, 2023, at 2:24 PM, Jeff Layton <[email protected]> wrote:
>>>>>>>
>>>>>>> On Mon, 2023-07-10 at 12:42 -0400, Chuck Lever wrote:
>>>>>>>> From: Chuck Lever <[email protected]>
>>>>>>>>
>>>>>>>> We want a thread lookup operation that can be done with RCU only,
>>>>>>>> but also we want to avoid the linked-list walk, which does not scale
>>>>>>>> well in the number of pool threads.
>>>>>>>>
>>>>>>>> This patch splits out the use of the sp_lock to protect the set
>>>>>>>> of threads. Svc thread information is now protected by the xarray's
>>>>>>>> lock (when making thread count changes) and the RCU read lock (when
>>>>>>>> only looking up a thread).
>>>>>>>>
>>>>>>>> Since thread count changes are done only via nfsd filesystem API,
>>>>>>>> which runs only in process context, we can safely dispense with the
>>>>>>>> use of a bottom-half-disabled lock.
>>>>>>>>
>>>>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>>>>> ---
>>>>>>>> fs/nfsd/nfssvc.c | 3 +-
>>>>>>>> include/linux/sunrpc/svc.h | 11 +++----
>>>>>>>> include/trace/events/sunrpc.h | 47 ++++++++++++++++++++++++++++-
>>>>>>>> net/sunrpc/svc.c | 67 +++++++++++++++++++++++++----------------
>>>>>>>> net/sunrpc/svc_xprt.c | 2 +
>>>>>>>> 5 files changed, 94 insertions(+), 36 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
>>>>>>>> index 2154fa63c5f2..d42b2a40c93c 100644
>>>>>>>> --- a/fs/nfsd/nfssvc.c
>>>>>>>> +++ b/fs/nfsd/nfssvc.c
>>>>>>>> @@ -62,8 +62,7 @@ static __be32 nfsd_init_request(struct svc_rqst *,
>>>>>>>> * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
>>>>>>>> * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
>>>>>>>> * nn->keep_active is set). That number of nfsd threads must
>>>>>>>> - * exist and each must be listed in ->sp_all_threads in some entry of
>>>>>>>> - * ->sv_pools[].
>>>>>>>> + * exist and each must be listed in some entry of ->sv_pools[].
>>>>>>>> *
>>>>>>>> * Each active thread holds a counted reference on nn->nfsd_serv, as does
>>>>>>>> * the nn->keep_active flag and various transient calls to svc_get().
>>>>>>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>>>>>>> index 9dd3b16cc4c2..86377506a514 100644
>>>>>>>> --- a/include/linux/sunrpc/svc.h
>>>>>>>> +++ b/include/linux/sunrpc/svc.h
>>>>>>>> @@ -32,10 +32,10 @@
>>>>>>>> */
>>>>>>>> struct svc_pool {
>>>>>>>> unsigned int sp_id; /* pool id; also node id on NUMA */
>>>>>>>> - spinlock_t sp_lock; /* protects all fields */
>>>>>>>> + spinlock_t sp_lock; /* protects sp_sockets */
>>>>>>>> struct list_head sp_sockets; /* pending sockets */
>>>>>>>> unsigned int sp_nrthreads; /* # of threads in pool */
>>>>>>>> - struct list_head sp_all_threads; /* all server threads */
>>>>>>>> + struct xarray sp_thread_xa;
>>>>>>>>
>>>>>>>> /* statistics on pool operation */
>>>>>>>> struct percpu_counter sp_messages_arrived;
>>>>>>>> @@ -196,7 +196,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>>>>>>>> * processed.
>>>>>>>> */
>>>>>>>> struct svc_rqst {
>>>>>>>> - struct list_head rq_all; /* all threads list */
>>>>>>>> struct rcu_head rq_rcu_head; /* for RCU deferred kfree */
>>>>>>>> struct svc_xprt * rq_xprt; /* transport ptr */
>>>>>>>>
>>>>>>>> @@ -241,10 +240,10 @@ struct svc_rqst {
>>>>>>>> #define RQ_SPLICE_OK (4) /* turned off in gss privacy
>>>>>>>> * to prevent encrypting page
>>>>>>>> * cache pages */
>>>>>>>> -#define RQ_VICTIM (5) /* about to be shut down */
>>>>>>>> -#define RQ_BUSY (6) /* request is busy */
>>>>>>>> -#define RQ_DATA (7) /* request has data */
>>>>>>>> +#define RQ_BUSY (5) /* request is busy */
>>>>>>>> +#define RQ_DATA (6) /* request has data */
>>>>>>>> unsigned long rq_flags; /* flags field */
>>>>>>>> + u32 rq_thread_id; /* xarray index */
>>>>>>>> ktime_t rq_qtime; /* enqueue time */
>>>>>>>>
>>>>>>>> void * rq_argp; /* decoded arguments */
>>>>>>>> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
>>>>>>>> index 60c8e03268d4..ea43c6059bdb 100644
>>>>>>>> --- a/include/trace/events/sunrpc.h
>>>>>>>> +++ b/include/trace/events/sunrpc.h
>>>>>>>> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
>>>>>>>> svc_rqst_flag(USEDEFERRAL) \
>>>>>>>> svc_rqst_flag(DROPME) \
>>>>>>>> svc_rqst_flag(SPLICE_OK) \
>>>>>>>> - svc_rqst_flag(VICTIM) \
>>>>>>>> svc_rqst_flag(BUSY) \
>>>>>>>> svc_rqst_flag_end(DATA)
>>>>>>>>
>>>>>>>> @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
>>>>>>>> )
>>>>>>>> );
>>>>>>>>
>>>>>>>> +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
>>>>>>>> + TP_PROTO(
>>>>>>>> + const struct svc_serv *serv,
>>>>>>>> + const struct svc_pool *pool,
>>>>>>>> + const struct svc_rqst *rqstp
>>>>>>>> + ),
>>>>>>>> +
>>>>>>>> + TP_ARGS(serv, pool, rqstp),
>>>>>>>> +
>>>>>>>> + TP_STRUCT__entry(
>>>>>>>> + __string(name, serv->sv_name)
>>>>>>>> + __field(int, pool_id)
>>>>>>>> + __field(unsigned int, nrthreads)
>>>>>>>> + __field(unsigned long, pool_flags)
>>>>>>>> + __field(u32, thread_id)
>>>>>>>> + __field(const void *, rqstp)
>>>>>>>> + ),
>>>>>>>> +
>>>>>>>> + TP_fast_assign(
>>>>>>>> + __assign_str(name, serv->sv_name);
>>>>>>>> + __entry->pool_id = pool->sp_id;
>>>>>>>> + __entry->nrthreads = pool->sp_nrthreads;
>>>>>>>> + __entry->pool_flags = pool->sp_flags;
>>>>>>>> + __entry->thread_id = rqstp->rq_thread_id;
>>>>>>>> + __entry->rqstp = rqstp;
>>>>>>>> + ),
>>>>>>>> +
>>>>>>>> + TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
>>>>>>>> + __get_str(name), __entry->pool_id,
>>>>>>>> + show_svc_pool_flags(__entry->pool_flags),
>>>>>>>> + __entry->nrthreads, __entry->thread_id
>>>>>>>> + )
>>>>>>>> +);
>>>>>>>> +
>>>>>>>> +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
>>>>>>>> + DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
>>>>>>>> + TP_PROTO( \
>>>>>>>> + const struct svc_serv *serv, \
>>>>>>>> + const struct svc_pool *pool, \
>>>>>>>> + const struct svc_rqst *rqstp \
>>>>>>>> + ), \
>>>>>>>> + TP_ARGS(serv, pool, rqstp))
>>>>>>>> +
>>>>>>>> +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
>>>>>>>> +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
>>>>>>>> +
>>>>>>>> DECLARE_EVENT_CLASS(svc_xprt_event,
>>>>>>>> TP_PROTO(
>>>>>>>> const struct svc_xprt *xprt
>>>>>>>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>>>>>>>> index ad29df00b454..109d7f047385 100644
>>>>>>>> --- a/net/sunrpc/svc.c
>>>>>>>> +++ b/net/sunrpc/svc.c
>>>>>>>> @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>>>>>>>>
>>>>>>>> pool->sp_id = i;
>>>>>>>> INIT_LIST_HEAD(&pool->sp_sockets);
>>>>>>>> - INIT_LIST_HEAD(&pool->sp_all_threads);
>>>>>>>> spin_lock_init(&pool->sp_lock);
>>>>>>>> + xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
>>>>>>>>
>>>>>>>> percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
>>>>>>>> percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
>>>>>>>> @@ -596,6 +596,8 @@ svc_destroy(struct kref *ref)
>>>>>>>> percpu_counter_destroy(&pool->sp_threads_timedout);
>>>>>>>> percpu_counter_destroy(&pool->sp_threads_starved);
>>>>>>>> percpu_counter_destroy(&pool->sp_threads_no_work);
>>>>>>>> +
>>>>>>>> + xa_destroy(&pool->sp_thread_xa);
>>>>>>>> }
>>>>>>>> kfree(serv->sv_pools);
>>>>>>>> kfree(serv);
>>>>>>>> @@ -676,7 +678,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
>>>>>>>> static struct svc_rqst *
>>>>>>>> svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>>>>>>>> {
>>>>>>>> + struct xa_limit limit = {
>>>>>>>> + .max = U32_MAX,
>>>>>>>> + };
>>>>>>>> struct svc_rqst *rqstp;
>>>>>>>> + int ret;
>>>>>>>>
>>>>>>>> rqstp = svc_rqst_alloc(serv, pool, node);
>>>>>>>> if (!rqstp)
>>>>>>>> @@ -687,11 +693,21 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>>>>>>>> serv->sv_nrthreads += 1;
>>>>>>>> spin_unlock_bh(&serv->sv_lock);
>>>>>>>>
>>>>>>>> - spin_lock_bh(&pool->sp_lock);
>>>>>>>> + xa_lock(&pool->sp_thread_xa);
>>>>>>>> + ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
>>>>>>>> + limit, GFP_KERNEL);
>>>>>>>> + if (ret) {
>>>>>>>> + xa_unlock(&pool->sp_thread_xa);
>>>>>>>> + goto out_free;
>>>>>>>> + }
>>>>>>>> pool->sp_nrthreads++;
>>>>>>>> - list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
>>>>>>>> - spin_unlock_bh(&pool->sp_lock);
>>>>>>>> + xa_unlock(&pool->sp_thread_xa);
>>>>>>>> + trace_svc_pool_thread_init(serv, pool, rqstp);
>>>>>>>> return rqstp;
>>>>>>>> +
>>>>>>>> +out_free:
>>>>>>>> + svc_rqst_free(rqstp);
>>>>>>>> + return ERR_PTR(ret);
>>>>>>>> }
>>>>>>>>
>>>>>>>> /**
>>>>>>>> @@ -708,19 +724,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
>>>>>>>> struct svc_pool *pool)
>>>>>>>> {
>>>>>>>> struct svc_rqst *rqstp;
>>>>>>>> + unsigned long index;
>>>>>>>>
>>>>>>>> - rcu_read_lock();
>>>>>>>
>>>>>>>
>>>>>>> While it does do its own locking, the resulting object that xa_for_each
>>>>>>> returns needs some protection too. Between xa_for_each returning a rqstp
>>>>>>> and calling test_and_set_bit, could the rqstp be freed? I suspect so,
>>>>>>> and I think you probably need to keep the rcu_read_lock() call above.
>>>>>>
>>>>>> Should I keep the rcu_read_lock() even with the bitmap/xa_load
>>>>>> version of svc_pool_wake_idle_thread() in 9/9 ?
>>>>>>
>>>>>
>>>>> Yeah, I think you have to. We're not doing real locking on the search or
>>>>> taking references, so nothing else will ensure that the rqstp will stick
>>>>> around once you've found it. I think you have to hold it until after
>>>>> wake_up_process (at least).
>>>>
>>>> I can keep the RCU read lock around the search and xa_load(). But
>>>> I notice that the code we're replacing releases the RCU read lock
>>>> before calling wake_up_process(). Not saying that's right, but we
>>>> haven't had a problem reported.
>>>>
>>>>
>>>
>>> Understood. Given that we're not sleeping in that section, it's quite
>>> possible that the RCU callbacks just never have a chance to run before
>>> we wake the thing up and so you never hit the problem.
>>>
>>> Still, I think it'd be best to just keep the rcu_read_lock around that
>>> whole block. It's relatively cheap and safe to take it recursively, and
>>> that makes it explicit that the found rqst mustn't vanish before the
>>> wakeup is done.
>>
>> My point is that since the existing code doesn't hold the
>> RCU read lock for the wake_up_process() call, either it's
>> unnecessary or the existing code is broken and needs a
>> back-portable fix.
>>
>> Do you think the existing code is broken?
>>
>
> Actually, it varies. svc_xprt_enqueue calls wake_up_process with the
> rcu_read_lock held. svc_wake_up doesn't though.

Good catch. I'll change 1/9 to hold the RCU read lock until
after the wake_up_process() call, and then leave that pattern
in place through the rest of the series.

1/9 will be straightforward to backport if that should be
necessary.
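
For reference, a sketch of what the xarray-based svc_pool_wake_idle_thread()
from this patch looks like with the RCU read lock held across the wakeup
(illustrative only, not the final respin):

	struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
						   struct svc_pool *pool)
	{
		struct svc_rqst *rqstp;
		unsigned long index;

		rcu_read_lock();
		xa_for_each(&pool->sp_thread_xa, index, rqstp) {
			if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
				continue;

			WRITE_ONCE(rqstp->rq_qtime, ktime_get());
			wake_up_process(rqstp->rq_task);
			rcu_read_unlock();
			percpu_counter_inc(&pool->sp_threads_woken);
			return rqstp;
		}
		rcu_read_unlock();

		trace_svc_pool_starved(serv, pool);
		percpu_counter_inc(&pool->sp_threads_starved);
		return NULL;
	}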


> So yes, I think svc_wake_up is broken, though I imagine this is hard to
> hit outside of some very specific circumstances. Here is the existing
> svc_wake_up block:
>
> 	rcu_read_lock();
> 	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> 		/* skip any that aren't queued */
> 		if (test_bit(RQ_BUSY, &rqstp->rq_flags))
> 			continue;
> 		rcu_read_unlock();
> 		wake_up_process(rqstp->rq_task);
> 		trace_svc_wake_up(rqstp->rq_task->pid);
> 		return;
> 	}
> 	rcu_read_unlock();
>
> Once rcu_read_unlock is called, rqstp could technically be freed before
> we go to dereference the pointer. I don't see anything that prevents
> that from occurring, though you'd have to be doing this while the
> service's thread count is being reduced (which would cause entries to be
> freed).
>
> Worth fixing, IMO, but probably not worth sending to stable.
> --
> Jeff Layton <[email protected]>


--
Chuck Lever



2023-07-19 00:51:41

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do

On Tue, 11 Jul 2023, Chuck Lever III wrote:
>
> > On Jul 10, 2023, at 6:29 PM, NeilBrown <[email protected]> wrote:
> >
> > On Tue, 11 Jul 2023, Chuck Lever wrote:
> >> From: Chuck Lever <[email protected]>
> >>
> >> Measure a source of thread scheduling inefficiency -- count threads
> >> that were awoken but found that the transport queue had already been
> >> emptied.
> >>
> >> An empty transport queue is possible when threads that run between
> >> the wake_up_process() call and the woken thread returning from the
> >> scheduler have pulled all remaining work off the transport queue
> >> using the first svc_xprt_dequeue() in svc_get_next_xprt().
> >
> > I'm in two minds about this. The data being gathered here is
> > potentially useful
>
> It's actually pretty shocking: I've measured more than
> 15% of thread wake-ups find no work to do.

I'm now wondering if that is a reliable statistic.

This counter as implemented doesn't count "pool threads that were awoken
but found no work to do". Rather, it counts "pool threads that found no
work to do, either after having been awoken, or having just completed
some other request".

And it doesn't even really count that, as it doesn't notice that lockd
"retry blocked" work (or the NFSv4.1 callback work, but we don't count
that at all I think).

Maybe we should only update the count if we had actually been woken up
recently.

NeilBrown

2023-07-19 13:18:02

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do



> On Jul 18, 2023, at 8:39 PM, NeilBrown <[email protected]> wrote:
>
> On Tue, 11 Jul 2023, Chuck Lever III wrote:
>>
>>> On Jul 10, 2023, at 6:29 PM, NeilBrown <[email protected]> wrote:
>>>
>>> On Tue, 11 Jul 2023, Chuck Lever wrote:
>>>> From: Chuck Lever <[email protected]>
>>>>
>>>> Measure a source of thread scheduling inefficiency -- count threads
>>>> that were awoken but found that the transport queue had already been
>>>> emptied.
>>>>
>>>> An empty transport queue is possible when threads that run between
>>>> the wake_up_process() call and the woken thread returning from the
>>>> scheduler have pulled all remaining work off the transport queue
>>>> using the first svc_xprt_dequeue() in svc_get_next_xprt().
>>>
>>> I'm in two minds about this. The data being gathered here is
>>> potentially useful
>>
>> It's actually pretty shocking: I've measured more than
>> 15% of thread wake-ups find no work to do.
>
> I'm now wondering if that is a reliable statistic.
>
> This counter as implemented doesn't count "pool threads that were awoken
> but found no work to do". Rather, it counts "pool threads that found no
> work to do, either after having been awoken, or having just completed
> some other request".

In the current code, the only way to get to "return -EAGAIN;" is if the
thread calls schedule_timeout() (ie, it actually sleeps), then the
svc_rqst was specifically selected and awoken, and the schedule_timeout()
did not time out.

I don't see a problem.


> And it doesn't even really count that, as it doesn't notice that lockd
> "retry blocked" work (or the NFSv4.1 callback work, but we don't count
> that at all I think).
>
> Maybe we should only update the count if we had actually been woken up
> recently.

So this one can be dropped for now since it's currently of value only
for working on the scheduler changes. If the thread scheduler were to
change such that a work item was actually assigned to a thread before
it is awoken (a la, a work queue model) then this counter would be
truly meaningless. I think we can wait for a bit.


--
Chuck Lever



2023-07-19 23:50:31

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v3 5/9] SUNRPC: Count pool threads that were awoken but found no work to do

On Wed, 19 Jul 2023, Chuck Lever III wrote:
>
> > On Jul 18, 2023, at 8:39 PM, NeilBrown <[email protected]> wrote:
> >
> > On Tue, 11 Jul 2023, Chuck Lever III wrote:
> >>
> >>> On Jul 10, 2023, at 6:29 PM, NeilBrown <[email protected]> wrote:
> >>>
> >>> On Tue, 11 Jul 2023, Chuck Lever wrote:
> >>>> From: Chuck Lever <[email protected]>
> >>>>
> >>>> Measure a source of thread scheduling inefficiency -- count threads
> >>>> that were awoken but found that the transport queue had already been
> >>>> emptied.
> >>>>
> >>>> An empty transport queue is possible when threads that run between
> >>>> the wake_up_process() call and the woken thread returning from the
> >>>> scheduler have pulled all remaining work off the transport queue
> >>>> using the first svc_xprt_dequeue() in svc_get_next_xprt().
> >>>
> >>> I'm in two minds about this. The data being gathered here is
> >>> potentially useful
> >>
> >> It's actually pretty shocking: I've measured more than
> >> 15% of thread wake-ups find no work to do.
> >
> > I'm now wondering if that is a reliable statistic.
> >
> > This counter as implemented doesn't count "pool threads that were awoken
> > but found no work to do". Rather, it counts "pool threads that found no
> > work to do, either after having been awoken, or having just completed
> > some other request".
>
> In the current code, the only way to get to "return -EAGAIN;" is if the
> thread calls schedule_timeout() (ie, it actually sleeps), then the
> svc_rqst was specifically selected and awoken, and the schedule_timeout()
> did not time out.
>
> I don't see a problem.
>

Yeah - I don't either any more. Sorry for the noise.


>
> > And it doesn't even really count that, as it doesn't notice that lockd
> > "retry blocked" work (or the NFSv4.1 callback work, but we don't count
> > that at all I think).
> >
> > Maybe we should only update the count if we had actually been woken up
> > recently.
>
> So this one can be dropped for now since it's currently of value only
> for working on the scheduler changes. If the thread scheduler were to
> change such that a work item was actually assigned to a thread before
> it is awoken (a la, a work queue model) then this counter would be
> truly meaningless. I think we can wait for a bit.
>

We used to assign a workitem to a thread before it was woken. I find
that model to be aesthetically pleasing.
Trond changed that in
Commit: 22700f3c6df5 ("SUNRPC: Improve ordering of transport processing")

claiming that the wake-up time for a sleeping thread could result in
poorer throughput. No data given but I find the reasoning quite
credible.

Thanks,
NeilBrown

>
> --
> Chuck Lever
>
>
>