2014-12-10 19:08:11

by Jeff Layton

Subject: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

This is the second revision of the workqueue-based nfsd. Changes from
the RFC version I posted earlier:

v2:
- found and fixed the problem causing the delays between work items. It
was a bug in the design of the new code: I was queueing all of the
work via the svc_xprt's work_struct, which necessarily serialized it.
This version adds a work_struct to the svc_rqst as well, and that's
where the bulk of the work is now done.

- no longer tries to use the thread count as a max_active setting for
the workqueue. The workqueue max_active settings are left to the
default. This means that /proc/fs/nfsd/threads is basically a binary
switch that's either zero (not running) or non-zero (running). The
actual value has no meaning.

- the handling of the fs_struct has been reworked. It's now allocated
as part of the struct svc_rqst. Each svc_rqst has its own fs_struct
now.

- CONFIG_SUNRPC_SVC_WORKQUEUE has been removed. Since the code must
be enabled at runtime anyway, we don't need a switch to disable this
code at build time.

- the new tracepoints have been reworked.

This new code is still a little slower (2-3%) than the older, thread-
based code in my testing, though my hope is that it may perform better
than the older code on a NUMA machine. I don't have one at the moment
to verify that, however.

For background, here's an excerpt from the original posting:

This patchset is a little skunkworks project that I've been poking at
for the last few weeks. Currently nfsd uses a dedicated thread pool to
handle RPCs, but that requires maintaining a rather large swath of
"fiddly" code to handle the threads and transports.

This patchset represents an alternative approach, which makes nfsd use
workqueues to do its bidding rather than a dedicated thread pool. When a
transport needs to do work, we simply queue it to the workqueue in
softirq context and let it service the transport.

Performance:
------------
As far as numbers go, I ran a modified version of the fio
tiobench-example.fio test that just reduces the file size to 128M:

threads:
----------------------------------------------------------------------
Run status group 0 (all jobs):
WRITE: io=13740KB, aggrb=228KB/s, minb=57KB/s, maxb=57KB/s,
mint=60034msec, maxt=60049msec

Run status group 1 (all jobs):
WRITE: io=14652KB, aggrb=244KB/s, minb=60KB/s, maxb=61KB/s,
mint=60002msec, maxt=60028msec

Run status group 2 (all jobs):
READ: io=524288KB, aggrb=49639KB/s, minb=12409KB/s, maxb=12433KB/s,
mint=10542msec, maxt=10562msec

Run status group 3 (all jobs):
READ: io=524288KB, aggrb=50670KB/s, minb=12667KB/s, maxb=12687KB/s,
mint=10331msec, maxt=10347msec

workqueue:
----------------------------------------------------------------------
Run status group 0 (all jobs):
WRITE: io=13632KB, aggrb=226KB/s, minb=56KB/s, maxb=56KB/s,
mint=60010msec, maxt=60061msec

Run status group 1 (all jobs):
WRITE: io=14328KB, aggrb=238KB/s, minb=59KB/s, maxb=59KB/s,
mint=60023msec, maxt=60033msec

Run status group 2 (all jobs):
READ: io=524288KB, aggrb=47779KB/s, minb=11944KB/s, maxb=11955KB/s,
mint=10963msec, maxt=10973msec

Run status group 3 (all jobs):
READ: io=524288KB, aggrb=48379KB/s, minb=12094KB/s, maxb=12114KB/s,
mint=10819msec, maxt=10837msec

...a small performance decrease using workqueues (about 5% in the worst
case). It varies a little between runs, but the decrease is fairly
consistent over several runs.

In an attempt to ascertain where that comes from, I added some
tracepoints and wrote a perl script to scrape the results and break the
latency down. The numbers here are from the runs that produced the
above results:

[jlayton@palma nfsd-wq]$ ./analyze.pl < threads.txt
-------------------------------------------------------------
rpcs=272437
queue=3.79952429455322
queued=10.0558330900775
receive=10.6196625266701
process=2325.8173229043
total=2350.2923428156

[jlayton@palma nfsd-wq]$ ./analyze.pl < workqueue.txt
-------------------------------------------------------------
rpcs=272329
queue=4.41979737859012
queued=8.124893049967
receive=11.3421082590608
process=2310.15945786277
total=2334.04625655039

Here's a legend: the numbers represent a delta between two
tracepoints for a particular xprt or rqst. I can provide the script if
it's helpful, but it's definitely hacked together:

rpcs: total number of rpcs processed (just a count)
queue: time it took to queue the xprt
trace_svc_xprt_enqueued - trace_svc_xprt_enqueue (in usecs)

queued: time between being queued and going "active"
trace_svc_xprt_active - trace_svc_xprt_enqueued (in usecs)

receive: time between going "active" and completing the receive
trace_svc_xprt_received - trace_svc_xprt_active (in usecs)

process: time between completing the receive and finishing processing
trace_svc_process - trace_svc_xprt_received (in usecs)

total: total time from enqueueing to finishing the processing
trace_svc_process - trace_svc_xprt_enqueued (in usecs)

The interesting bit is that according to these numbers, the workqueue
RPCs ran more quickly on average (though most of that is in svc_process,
for which I have zero explanation). So, why are we getting slower
results?

One theory is that it's getting bitten by the fact that the workqueue
queueing/dequeueing code uses spinlocks that disable bottom halves.
Perhaps that adds up and causes more latency to softirq processing? I'm
not sure how best to nail that down...

It's probably worth it to start considering this for v3.20, but I'd like
to get some time on a larger scale test rig to see how it does first.
We'll also need Al's ACK on the fs_struct stuff.

Thoughts or comments are welcome.

Jeff Layton (16):
sunrpc: add a new svc_serv_ops struct and move sv_shutdown into it
sunrpc: move sv_function into sv_ops
sunrpc: move sv_module parm into sv_ops
sunrpc: turn enqueueing a svc_xprt into a svc_serv operation
sunrpc: abstract out svc_set_num_threads to sv_ops
sunrpc: move pool_mode definitions into svc.h
sunrpc: factor svc_rqst allocation and freeing from sv_nrthreads
refcounting
sunrpc: set up workqueue function in svc_xprt
sunrpc: set up svc_rqst work if it's defined
sunrpc: add basic support for workqueue-based services
nfsd: keep a reference to the fs_struct in svc_rqst
nfsd: add support for workqueue based service processing
sunrpc: keep a cache of svc_rqsts for each NUMA node
sunrpc: add more tracepoints around svc_xprt handling
sunrpc: print the svc_rqst pointer value in svc_process tracepoint
sunrpc: add tracepoints around svc_sock handling

fs/fs_struct.c | 60 ++++++--
fs/lockd/svc.c | 7 +-
fs/nfs/callback.c | 6 +-
fs/nfsd/nfssvc.c | 100 +++++++++++--
include/linux/fs_struct.h | 4 +
include/linux/sunrpc/svc.h | 93 ++++++++++---
include/linux/sunrpc/svc_xprt.h | 3 +
include/linux/sunrpc/svcsock.h | 1 +
include/trace/events/sunrpc.h | 88 ++++++++++--
net/sunrpc/Makefile | 2 +-
net/sunrpc/svc.c | 141 +++++++++++--------
net/sunrpc/svc_wq.c | 302 ++++++++++++++++++++++++++++++++++++++++
net/sunrpc/svc_xprt.c | 68 ++++++++-
net/sunrpc/svcsock.c | 6 +
14 files changed, 758 insertions(+), 123 deletions(-)
create mode 100644 net/sunrpc/svc_wq.c

--
2.1.0


2014-12-10 19:08:22

by Jeff Layton

Subject: [PATCH v2 06/16] sunrpc: move pool_mode definitions into svc.h

In later patches, we're going to need to allow code external to svc.c
to figure out what pool_mode is in use. Move these definitions into
svc.h to prepare for that.

Also, make the svc_pool_map object available and exported so that other
modules can peek in there to get insight into what pool mode is in use.
Likewise, export the svc_pool_map_get/put functions to make it safe to
do so.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 25 +++++++++++++++++++++++++
net/sunrpc/svc.c | 31 +++++++------------------------
2 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 5e172c8329f8..28e5f5716a87 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -428,6 +428,29 @@ struct svc_procedure {
};

/*
+ * Mode for mapping cpus to pools.
+ */
+enum {
+ SVC_POOL_AUTO = -1, /* choose one of the others */
+ SVC_POOL_GLOBAL, /* no mapping, just a single global pool
+ * (legacy & UP mode) */
+ SVC_POOL_PERCPU, /* one pool per cpu */
+ SVC_POOL_PERNODE /* one pool per numa node */
+};
+
+struct svc_pool_map {
+ int count; /* How many svc_servs use us */
+ int mode; /* Note: int not enum to avoid
+ * warnings about "enumeration value
+ * not handled in switch" */
+ unsigned int npools;
+ unsigned int *pool_to; /* maps pool id to cpu or node */
+ unsigned int *to_pool; /* maps cpu or node to pool id */
+};
+
+extern struct svc_pool_map svc_pool_map;
+
+/*
* Function prototypes.
*/
int svc_rpcb_setup(struct svc_serv *serv, struct net *net);
@@ -438,6 +461,8 @@ struct svc_serv *svc_create(struct svc_program *, unsigned int,
struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool, int node);
void svc_exit_thread(struct svc_rqst *);
+unsigned int svc_pool_map_get(void);
+void svc_pool_map_put(void);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
struct svc_serv_ops *);
int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 9b4b4d8d109a..04c083a53121 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -36,34 +36,17 @@ static void svc_unregister(const struct svc_serv *serv, struct net *net);

#define svc_serv_is_pooled(serv) ((serv)->sv_ops->svo_function)

-/*
- * Mode for mapping cpus to pools.
- */
-enum {
- SVC_POOL_AUTO = -1, /* choose one of the others */
- SVC_POOL_GLOBAL, /* no mapping, just a single global pool
- * (legacy & UP mode) */
- SVC_POOL_PERCPU, /* one pool per cpu */
- SVC_POOL_PERNODE /* one pool per numa node */
-};
#define SVC_POOL_DEFAULT SVC_POOL_GLOBAL

/*
* Structure for mapping cpus to pools and vice versa.
* Setup once during sunrpc initialisation.
*/
-static struct svc_pool_map {
- int count; /* How many svc_servs use us */
- int mode; /* Note: int not enum to avoid
- * warnings about "enumeration value
- * not handled in switch" */
- unsigned int npools;
- unsigned int *pool_to; /* maps pool id to cpu or node */
- unsigned int *to_pool; /* maps cpu or node to pool id */
-} svc_pool_map = {
- .count = 0,
+struct svc_pool_map svc_pool_map = {
.mode = SVC_POOL_DEFAULT
};
+EXPORT_SYMBOL_GPL(svc_pool_map);
+
static DEFINE_MUTEX(svc_pool_map_mutex);/* protects svc_pool_map.count only */

static int
@@ -236,7 +219,7 @@ svc_pool_map_init_pernode(struct svc_pool_map *m)
* vice versa). Initialise the map if we're the first user.
* Returns the number of pools.
*/
-static unsigned int
+unsigned int
svc_pool_map_get(void)
{
struct svc_pool_map *m = &svc_pool_map;
@@ -271,7 +254,7 @@ svc_pool_map_get(void)
mutex_unlock(&svc_pool_map_mutex);
return m->npools;
}
-
+EXPORT_SYMBOL_GPL(svc_pool_map_get);

/*
* Drop a reference to the global map of cpus to pools.
@@ -280,7 +263,7 @@ svc_pool_map_get(void)
* mode using the pool_mode module option without
* rebooting or re-loading sunrpc.ko.
*/
-static void
+void
svc_pool_map_put(void)
{
struct svc_pool_map *m = &svc_pool_map;
@@ -297,7 +280,7 @@ svc_pool_map_put(void)

mutex_unlock(&svc_pool_map_mutex);
}
-
+EXPORT_SYMBOL_GPL(svc_pool_map_put);

static int svc_pool_map_get_node(unsigned int pidx)
{
--
2.1.0

2014-12-10 19:08:34

by Jeff Layton

Subject: [PATCH v2 14/16] sunrpc: add more tracepoints around svc_xprt handling

Turn the svc_xprt_dequeue and svc_xprt_do_enqueue tracepoints into
event classes, and use them to add events for svc_xprt_enqueued,
svc_xprt_active, svc_xprt_received, and svc_xprt_enqueue.

Signed-off-by: Jeff Layton <[email protected]>
---
include/trace/events/sunrpc.h | 36 +++++++++++++++++++++++++++---------
net/sunrpc/svc_wq.c | 4 ++++
net/sunrpc/svc_xprt.c | 9 ++++++++-
3 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index b9c1dc6c825a..41e09a1b962d 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -496,28 +496,40 @@ DEFINE_EVENT(svc_rqst_status, svc_send,
{ (1UL << XPT_CACHE_AUTH), "XPT_CACHE_AUTH"}, \
{ (1UL << XPT_LOCAL), "XPT_LOCAL"})

-TRACE_EVENT(svc_xprt_do_enqueue,
- TP_PROTO(struct svc_xprt *xprt, struct svc_rqst *rqst),
+/* events associated with both a rqst and a xprt */
+DECLARE_EVENT_CLASS(svc_xprt_rqst_event,
+ TP_PROTO(struct svc_rqst *rqst, struct svc_xprt *xprt),

- TP_ARGS(xprt, rqst),
+ TP_ARGS(rqst, xprt),

TP_STRUCT__entry(
- __field(struct svc_xprt *, xprt)
__field(struct svc_rqst *, rqst)
+ __field(struct svc_xprt *, xprt)
),

TP_fast_assign(
- __entry->xprt = xprt;
__entry->rqst = rqst;
+ __entry->xprt = xprt;
),

- TP_printk("xprt=0x%p addr=%pIScp pid=%d flags=%s", __entry->xprt,
- (struct sockaddr *)&__entry->xprt->xpt_remote,
- __entry->rqst ? __entry->rqst->rq_task->pid : 0,
+ TP_printk("rqst=0x%p xprt=0x%p addr=%pIScp flags=%s", __entry->rqst,
+ __entry->xprt, (struct sockaddr *)&__entry->xprt->xpt_remote,
show_svc_xprt_flags(__entry->xprt->xpt_flags))
);

-TRACE_EVENT(svc_xprt_dequeue,
+DEFINE_EVENT(svc_xprt_rqst_event, svc_xprt_enqueued,
+ TP_PROTO(struct svc_rqst *rqst, struct svc_xprt *xprt),
+ TP_ARGS(rqst, xprt));
+
+DEFINE_EVENT(svc_xprt_rqst_event, svc_xprt_active,
+ TP_PROTO(struct svc_rqst *rqst, struct svc_xprt *xprt),
+ TP_ARGS(rqst, xprt));
+
+DEFINE_EVENT(svc_xprt_rqst_event, svc_xprt_received,
+ TP_PROTO(struct svc_rqst *rqst, struct svc_xprt *xprt),
+ TP_ARGS(rqst, xprt));
+
+DECLARE_EVENT_CLASS(svc_xprt_event,
TP_PROTO(struct svc_xprt *xprt),

TP_ARGS(xprt),
@@ -539,6 +551,12 @@ TRACE_EVENT(svc_xprt_dequeue,
show_svc_xprt_flags(__entry->flags))
);

+DEFINE_EVENT(svc_xprt_event, svc_xprt_enqueue,
+ TP_PROTO(struct svc_xprt *xprt), TP_ARGS(xprt));
+
+DEFINE_EVENT(svc_xprt_event, svc_xprt_dequeue,
+ TP_PROTO(struct svc_xprt *xprt), TP_ARGS(xprt));
+
TRACE_EVENT(svc_wake_up,
TP_PROTO(int pid),

diff --git a/net/sunrpc/svc_wq.c b/net/sunrpc/svc_wq.c
index 1ca26d51b8ec..779eb51e4bbb 100644
--- a/net/sunrpc/svc_wq.c
+++ b/net/sunrpc/svc_wq.c
@@ -257,6 +257,8 @@ svc_wq_enqueue_xprt(struct svc_xprt *xprt)
struct svc_serv *serv = xprt->xpt_server;
struct svc_rqst *rqstp;

+ trace_svc_xprt_enqueue(xprt);
+
if (!svc_xprt_has_something_to_do(xprt))
return;

@@ -290,9 +292,11 @@ out:
rqstp = find_svc_rqst(serv);
if (!rqstp) {
queue_work(serv->sv_wq, &xprt->xpt_work);
+ trace_svc_xprt_enqueued(NULL, xprt);
return;
}
rqstp->rq_xprt = xprt;
queue_work(serv->sv_wq, &rqstp->rq_work);
+ trace_svc_xprt_enqueued(rqstp, xprt);
}
EXPORT_SYMBOL_GPL(svc_wq_enqueue_xprt);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 17398eb9f38f..c759f04c9ee9 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -242,6 +242,7 @@ void svc_add_new_perm_xprt(struct svc_serv *serv, struct svc_xprt *new)
spin_lock_bh(&serv->sv_lock);
list_add(&new->xpt_list, &serv->sv_permsocks);
spin_unlock_bh(&serv->sv_lock);
+ trace_svc_xprt_received(NULL, new);
svc_xprt_received(new);
}

@@ -329,6 +330,8 @@ void svc_xprt_do_enqueue(struct svc_xprt *xprt)
int cpu;
bool queued = false;

+ trace_svc_xprt_enqueue(xprt);
+
if (!svc_xprt_has_something_to_do(xprt))
goto out;

@@ -402,7 +405,7 @@ redo_search:
rqstp = NULL;
put_cpu();
out:
- trace_svc_xprt_do_enqueue(xprt, rqstp);
+ trace_svc_xprt_enqueued(rqstp, xprt);
}
EXPORT_SYMBOL_GPL(svc_xprt_do_enqueue);

@@ -738,6 +741,7 @@ static void svc_add_new_temp_xprt(struct svc_serv *serv, struct svc_xprt *newxpt
jiffies + svc_conn_age_period * HZ);
}
spin_unlock_bh(&serv->sv_lock);
+ trace_svc_xprt_received(NULL, newxpt);
svc_xprt_received(newxpt);
}

@@ -746,6 +750,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
struct svc_serv *serv = rqstp->rq_server;
int len = 0;

+ trace_svc_xprt_active(rqstp, xprt);
+
if (test_bit(XPT_CLOSE, &xprt->xpt_flags)) {
dprintk("svc_recv: found XPT_CLOSE\n");
svc_delete_xprt(xprt);
@@ -780,6 +786,7 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
}
/* clear XPT_BUSY: */
+ trace_svc_xprt_received(rqstp, xprt);
svc_xprt_received(xprt);
out:
trace_svc_handle_xprt(xprt, len);
--
2.1.0

2014-12-10 19:08:44

by Jeff Layton

Subject: [PATCH v2 16/16] sunrpc: add tracepoints around svc_sock handling

Signed-off-by: Jeff Layton <[email protected]>
---
include/trace/events/sunrpc.h | 46 +++++++++++++++++++++++++++++++++++++++++++
net/sunrpc/svcsock.c | 6 ++++++
2 files changed, 52 insertions(+)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 0d54473d9b37..8bcc38c3f974 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -9,6 +9,7 @@
#include <linux/sunrpc/svc.h>
#include <linux/sunrpc/xprtsock.h>
#include <linux/sunrpc/svc_xprt.h>
+#include <linux/sunrpc/svcsock.h>
#include <net/tcp_states.h>
#include <linux/net.h>
#include <linux/tracepoint.h>
@@ -594,6 +595,51 @@ TRACE_EVENT(svc_handle_xprt,
(struct sockaddr *)&__entry->xprt->xpt_remote, __entry->len,
show_svc_xprt_flags(__entry->xprt->xpt_flags))
);
+
+DECLARE_EVENT_CLASS(svc_socket_event,
+ TP_PROTO(struct sock *sk),
+
+ TP_ARGS(sk),
+
+ TP_STRUCT__entry(
+ __field(struct sock *, sk)
+ __field(struct svc_sock *, svsk)
+ __field_struct(struct sockaddr_storage, ss)
+ __field(unsigned long, xpt_flags)
+ ),
+
+ TP_fast_assign(
+ __entry->sk = sk;
+ __entry->svsk = sk->sk_user_data;
+ if (__entry->svsk) {
+ struct svc_xprt *xprt = &__entry->svsk->sk_xprt;
+
+ memcpy(&__entry->ss, &xprt->xpt_remote,
+ sizeof(__entry->ss));
+ __entry->xpt_flags = xprt->xpt_flags;
+ } else {
+ memset(&__entry->ss, 0, sizeof(__entry->ss));
+ __entry->xpt_flags = 0;
+ }
+ ),
+
+ TP_printk("sk=0x%p peer=%pIScp sk_state=%s xpt_flags=%s", __entry->sk,
+ (struct sockaddr *)&__entry->ss,
+ rpc_show_sock_state(__entry->sk->sk_state),
+ show_svc_xprt_flags(__entry->xpt_flags))
+);
+
+DEFINE_EVENT(svc_socket_event, svc_tcp_listen_data_ready,
+ TP_PROTO(struct sock *sk), TP_ARGS(sk));
+
+DEFINE_EVENT(svc_socket_event, svc_tcp_state_change,
+ TP_PROTO(struct sock *sk), TP_ARGS(sk));
+
+DEFINE_EVENT(svc_socket_event, svc_tcp_data_ready,
+ TP_PROTO(struct sock *sk), TP_ARGS(sk));
+
+DEFINE_EVENT(svc_socket_event, svc_tcp_accept,
+ TP_PROTO(struct sock *sk), TP_ARGS(sk));
#endif /* _TRACE_SUNRPC_H */

#include <trace/define_trace.h>
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index cc331b6cf573..2334ebeda814 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -44,6 +44,7 @@
#include <asm/uaccess.h>
#include <asm/ioctls.h>
#include <trace/events/skb.h>
+#include <trace/events/sunrpc.h>

#include <linux/sunrpc/types.h>
#include <linux/sunrpc/clnt.h>
@@ -765,6 +766,7 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
struct svc_sock *svsk = (struct svc_sock *)sk->sk_user_data;
wait_queue_head_t *wq;

+ trace_svc_tcp_listen_data_ready(sk);
dprintk("svc: socket %p TCP (listen) state change %d\n",
sk, sk->sk_state);

@@ -799,6 +801,7 @@ static void svc_tcp_state_change(struct sock *sk)
struct svc_sock *svsk = (struct svc_sock *)sk->sk_user_data;
wait_queue_head_t *wq = sk_sleep(sk);

+ trace_svc_tcp_state_change(sk);
dprintk("svc: socket %p TCP (connected) state change %d (svsk %p)\n",
sk, sk->sk_state, sk->sk_user_data);

@@ -817,6 +820,7 @@ static void svc_tcp_data_ready(struct sock *sk)
struct svc_sock *svsk = (struct svc_sock *)sk->sk_user_data;
wait_queue_head_t *wq = sk_sleep(sk);

+ trace_svc_tcp_data_ready(sk);
dprintk("svc: socket %p TCP data ready (svsk %p)\n",
sk, sk->sk_user_data);
if (svsk) {
@@ -842,6 +846,8 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
int err, slen;
RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);

+ trace_svc_tcp_accept(svsk->sk_sk);
+
dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
if (!sock)
return NULL;
--
2.1.0

2014-12-10 19:08:32

by Jeff Layton

Subject: [PATCH v2 13/16] sunrpc: keep a cache of svc_rqsts for each NUMA node

Allocating an entire svc_rqst (including all of the pages, etc...) for
each workqueue request is pretty expensive. Keep a cache of allocated
svc_rqst structures for each NUMA node, stored in the node's svc_pool.

When an xprt needs servicing we look for an existing svc_rqst if
possible, attach the xprt to it and then queue it to do the work. If
one isn't currently available, we queue the svc_xprt work to allocate
one, add it to the cache and then queue the svc_rqst's work to handle
the rest.

In order to keep the cache from growing without bound, we register a
shrinker. Since the cache is already NUMA-aware, we can use a NUMA-aware
shrinker as well.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfssvc.c | 11 ++--
include/linux/sunrpc/svc.h | 11 ++++
net/sunrpc/svc.c | 1 +
net/sunrpc/svc_wq.c | 158 ++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 172 insertions(+), 9 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 7e22068bdad4..416faf9a77f0 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -669,26 +669,25 @@ nfsd_rqst_work(struct work_struct *work)
rqstp->rq_server->sv_maxconn = nn->max_connections;

if (svc_wq_recv(rqstp) < 0) {
- svc_rqst_free(rqstp);
+ put_svc_rqst(rqstp);
return;
}

saved_fs = swap_fs_struct(rqstp->rq_fs);
svc_process(rqstp);
- saved_fs = swap_fs_struct(saved_fs);
- svc_rqst_free(rqstp);
+ swap_fs_struct(saved_fs);
+ put_svc_rqst(rqstp);
}

/* work function for workqueue-based nfsd */
static void
nfsd_xprt_work(struct work_struct *work)
{
- int node = numa_node_id();
struct svc_xprt *xprt = container_of(work, struct svc_xprt, xpt_work);
- struct svc_rqst *rqstp;
struct svc_serv *serv = xprt->xpt_server;
+ struct svc_rqst *rqstp;

- rqstp = svc_rqst_alloc(serv, &serv->sv_pools[node], node);
+ rqstp = find_or_alloc_svc_rqst(serv);
if (!rqstp) {
/* Alloc failure. Give up for now, and requeue the work */
queue_work(serv->sv_wq, &xprt->xpt_work);
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 695bc989c007..4a71436efb1f 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -108,6 +108,7 @@ struct svc_serv {
struct svc_pool * sv_pools; /* array of thread pools */
struct svc_serv_ops * sv_ops; /* server operations */
struct workqueue_struct *sv_wq; /* workqueue for wq-based services */
+ struct shrinker sv_shrinker; /* for shrinking svc_rqst caches */
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
struct list_head sv_cb_list; /* queue for callback requests
* that arrive over the same
@@ -277,6 +278,7 @@ struct svc_rqst {
#define RQ_VICTIM (5) /* about to be shut down */
#define RQ_BUSY (6) /* request is busy */
unsigned long rq_flags; /* flags field */
+ unsigned long rq_time; /* when rqstp was last put */

void * rq_argp; /* decoded arguments */
void * rq_resp; /* xdr'd results */
@@ -496,6 +498,15 @@ char * svc_print_addr(struct svc_rqst *, char *, size_t);

int svc_wq_setup(struct svc_serv *, struct svc_pool *, int);
void svc_wq_enqueue_xprt(struct svc_xprt *);
+struct svc_rqst * find_or_alloc_svc_rqst(struct svc_serv *serv);
+void exit_svc_rqst_cache(struct svc_serv *serv);
+
+static inline void
+put_svc_rqst(struct svc_rqst *rqstp)
+{
+ rqstp->rq_time = jiffies;
+ clear_bit(RQ_BUSY, &rqstp->rq_flags);
+}

#define RPC_MAX_ADDRBUFLEN (63U)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 78395f790b54..32018951928e 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -545,6 +545,7 @@ svc_destroy(struct svc_serv *serv)

if (serv->sv_wq) {
destroy_workqueue(serv->sv_wq);
+ exit_svc_rqst_cache(serv);
module_put(serv->sv_ops->svo_module);
}

diff --git a/net/sunrpc/svc_wq.c b/net/sunrpc/svc_wq.c
index d1778373249e..1ca26d51b8ec 100644
--- a/net/sunrpc/svc_wq.c
+++ b/net/sunrpc/svc_wq.c
@@ -11,6 +11,143 @@
#include <linux/workqueue.h>
#include <trace/events/sunrpc.h>

+static struct svc_rqst *
+find_svc_rqst(struct svc_serv *serv)
+{
+ int node = numa_node_id();
+ struct svc_rqst *rqstp;
+ struct svc_pool *pool = &serv->sv_pools[node];
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+ if (!test_and_set_bit(RQ_BUSY, &rqstp->rq_flags)) {
+ rcu_read_unlock();
+ return rqstp;
+ }
+ }
+ rcu_read_unlock();
+ return NULL;
+}
+
+/*
+ * Find a svc_rqst to use. Try to find an already-allocated one on the list
+ * first, and then allocate if there isn't one available.
+ */
+struct svc_rqst *
+find_or_alloc_svc_rqst(struct svc_serv *serv)
+{
+ int node = numa_node_id();
+ struct svc_rqst *rqstp;
+ struct svc_pool *pool = &serv->sv_pools[node];
+
+ rqstp = find_svc_rqst(serv);
+ if (likely(rqstp))
+ return rqstp;
+
+ rqstp = svc_rqst_alloc(serv, pool, node);
+ if (rqstp) {
+ spin_lock_bh(&pool->sp_lock);
+ list_add_tail_rcu(&rqstp->rq_all, &pool->sp_all_threads);
+ ++pool->sp_nrthreads;
+ spin_unlock_bh(&pool->sp_lock);
+ }
+ return rqstp;
+}
+EXPORT_SYMBOL_GPL(find_or_alloc_svc_rqst);
+
+static unsigned long
+count_svc_rqst_objects(struct shrinker *shrinker, struct shrink_control *sc)
+{
+ struct svc_serv *serv = container_of(shrinker, struct svc_serv,
+ sv_shrinker);
+ struct svc_pool *pool = &serv->sv_pools[sc->nid];
+ struct svc_rqst *rqstp;
+ unsigned long count = 0;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+ /* Don't count it if it's busy */
+ if (test_bit(RQ_BUSY, &rqstp->rq_flags))
+ continue;
+
+ /* Don't count it if it was used within the last second */
+ if (time_before(jiffies, rqstp->rq_time + HZ))
+ continue;
+
+ ++count;
+ }
+ rcu_read_unlock();
+
+ return count;
+}
+
+static unsigned long
+scan_svc_rqst_objects(struct shrinker *shrinker, struct shrink_control *sc)
+{
+ struct svc_serv *serv = container_of(shrinker, struct svc_serv,
+ sv_shrinker);
+ struct svc_pool *pool = &serv->sv_pools[sc->nid];
+ struct svc_rqst *rqstp;
+ unsigned long count = 0;
+
+ spin_lock(&pool->sp_lock);
+ list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+ /* Don't free it if it's busy */
+ if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
+ continue;
+
+ list_del_rcu(&rqstp->rq_all);
+ svc_rqst_free(rqstp);
+ --pool->sp_nrthreads;
+ ++count;
+ if (sc->nr_to_scan-- == 0)
+ break;
+ }
+ spin_unlock(&pool->sp_lock);
+
+ return count;
+}
+
+static int
+init_svc_rqst_cache(struct svc_serv *serv)
+{
+ struct shrinker *shrinker = &serv->sv_shrinker;
+
+ memset(shrinker, 0, sizeof(*shrinker));
+
+ shrinker->count_objects = count_svc_rqst_objects;
+ shrinker->scan_objects = scan_svc_rqst_objects;
+ shrinker->seeks = DEFAULT_SEEKS;
+ shrinker->flags = SHRINKER_NUMA_AWARE;
+
+ return register_shrinker(shrinker);
+}
+
+void
+exit_svc_rqst_cache(struct svc_serv *serv)
+{
+ int node;
+
+ unregister_shrinker(&serv->sv_shrinker);
+
+ for (node = 0; node < serv->sv_nrpools; node++) {
+ struct svc_pool *pool = &serv->sv_pools[node];
+
+ spin_lock_bh(&pool->sp_lock);
+ while (!list_empty(&pool->sp_all_threads)) {
+ struct svc_rqst *rqstp = list_first_entry(
+ &pool->sp_all_threads, struct svc_rqst,
+ rq_all);
+
+ WARN_ON_ONCE(test_bit(RQ_BUSY, &rqstp->rq_flags));
+ list_del_rcu(&rqstp->rq_all);
+ svc_rqst_free(rqstp);
+ }
+ pool->sp_nrthreads = 0;
+ spin_unlock_bh(&pool->sp_lock);
+ }
+}
+
/*
* This workqueue job should run on each node when the workqueue is created. It
* walks the list of xprts for its node, and queues the workqueue job for each.
@@ -58,8 +195,8 @@ process_queued_xprts(struct svc_serv *serv)

/*
* Start up or shut down a workqueue-based RPC service. Basically, we use this
- * to allocate the workqueue. The function assumes that the caller holds one
- * serv->sv_nrthreads reference.
+ * to allocate the workqueue and set up the shrinker for the svc_rqst cache.
+ * This function assumes that the caller holds one serv->sv_nrthreads reference.
*
* The "active" parm is treated as a boolean here. The only meaningful values
* are non-zero which means that we're starting the service up, or zero which
@@ -68,6 +205,7 @@ process_queued_xprts(struct svc_serv *serv)
int
svc_wq_setup(struct svc_serv *serv, struct svc_pool *pool, int active)
{
+ int err;
int nrthreads = serv->sv_nrthreads - 1; /* -1 for caller's reference */

WARN_ON_ONCE(nrthreads < 0);
@@ -85,14 +223,20 @@ svc_wq_setup(struct svc_serv *serv, struct svc_pool *pool, int active)
* down the workqueue until the closing of the xprts is done.
*/
if (!nrthreads && active) {
+ err = init_svc_rqst_cache(serv);
+ if (err)
+ return err;
+
__module_get(serv->sv_ops->svo_module);
serv->sv_wq = alloc_workqueue("%s",
WQ_UNBOUND|WQ_FREEZABLE|WQ_SYSFS,
0, serv->sv_name);
if (!serv->sv_wq) {
+ exit_svc_rqst_cache(serv);
module_put(serv->sv_ops->svo_module);
return -ENOMEM;
}
+
process_queued_xprts(serv);
}

@@ -111,6 +255,7 @@ void
svc_wq_enqueue_xprt(struct svc_xprt *xprt)
{
struct svc_serv *serv = xprt->xpt_server;
+ struct svc_rqst *rqstp;

if (!svc_xprt_has_something_to_do(xprt))
return;
@@ -139,8 +284,15 @@ svc_wq_enqueue_xprt(struct svc_xprt *xprt)
spin_unlock_bh(&pool->sp_lock);
return;
}
+
out:
svc_xprt_get(xprt);
- queue_work(serv->sv_wq, &xprt->xpt_work);
+ rqstp = find_svc_rqst(serv);
+ if (!rqstp) {
+ queue_work(serv->sv_wq, &xprt->xpt_work);
+ return;
+ }
+ rqstp->rq_xprt = xprt;
+ queue_work(serv->sv_wq, &rqstp->rq_work);
}
EXPORT_SYMBOL_GPL(svc_wq_enqueue_xprt);
--
2.1.0

2014-12-10 19:09:05

by Jeff Layton

Subject: [PATCH v2 15/16] sunrpc: print the svc_rqst pointer value in svc_process tracepoint

Signed-off-by: Jeff Layton <[email protected]>
---
include/trace/events/sunrpc.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 41e09a1b962d..0d54473d9b37 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -454,6 +454,7 @@ DECLARE_EVENT_CLASS(svc_rqst_status,
TP_ARGS(rqst, status),

TP_STRUCT__entry(
+ __field(struct svc_rqst *, rqst)
__field(struct sockaddr *, addr)
__field(__be32, xid)
__field(int, dropme)
@@ -462,14 +463,15 @@ DECLARE_EVENT_CLASS(svc_rqst_status,
),

TP_fast_assign(
+ __entry->rqst = rqst;
__entry->addr = (struct sockaddr *)&rqst->rq_addr;
__entry->xid = rqst->rq_xid;
__entry->status = status;
__entry->flags = rqst->rq_flags;
),

- TP_printk("addr=%pIScp rq_xid=0x%x status=%d flags=%s",
- __entry->addr, be32_to_cpu(__entry->xid),
+ TP_printk("rqst=0x%p addr=%pIScp rq_xid=0x%x status=%d flags=%s",
+ __entry->rqst, __entry->addr, be32_to_cpu(__entry->xid),
__entry->status, show_rqstp_flags(__entry->flags))
);

--
2.1.0

2014-12-10 19:08:28

by Jeff Layton

Subject: [PATCH v2 10/16] sunrpc: add basic support for workqueue-based services

Add a new "workqueue" pool mode setting. When that is configured, we'll
set up a svc_pool for each NUMA node, but won't bother with the
pool <=> cpu mapping arrays.

We use an unbound workqueue, which should naturally make the work be
queued to a CPU within the current NUMA node.

The first iteration of this is quite simple. When a svc_xprt needs to be
serviced we queue its work and return. In later patches, we'll optimize
this a bit more.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 8 ++-
include/linux/sunrpc/svc_xprt.h | 1 +
include/linux/sunrpc/svcsock.h | 1 +
net/sunrpc/Makefile | 2 +-
net/sunrpc/svc.c | 13 ++++
net/sunrpc/svc_wq.c | 146 ++++++++++++++++++++++++++++++++++++++++
net/sunrpc/svc_xprt.c | 47 ++++++++++++-
7 files changed, 215 insertions(+), 3 deletions(-)
create mode 100644 net/sunrpc/svc_wq.c

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 70bee4e86a9f..43efdaae943a 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -47,6 +47,7 @@ struct svc_pool {
#define SP_TASK_PENDING (0) /* still work to do even if no
* xprt is queued. */
unsigned long sp_flags;
+ struct work_struct sp_work; /* per-pool work struct */
} ____cacheline_aligned_in_smp;

struct svc_serv;
@@ -106,6 +107,7 @@ struct svc_serv {
unsigned int sv_nrpools; /* number of thread pools */
struct svc_pool * sv_pools; /* array of thread pools */
struct svc_serv_ops * sv_ops; /* server operations */
+ struct workqueue_struct *sv_wq; /* workqueue for wq-based services */
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
struct list_head sv_cb_list; /* queue for callback requests
* that arrive over the same
@@ -442,7 +444,8 @@ enum {
SVC_POOL_GLOBAL, /* no mapping, just a single global pool
* (legacy & UP mode) */
SVC_POOL_PERCPU, /* one pool per cpu */
- SVC_POOL_PERNODE /* one pool per numa node */
+ SVC_POOL_PERNODE, /* one pool per numa node */
+ SVC_POOL_WORKQUEUE, /* workqueue-based service */
};

struct svc_pool_map {
@@ -490,6 +493,9 @@ void svc_reserve(struct svc_rqst *rqstp, int space);
struct svc_pool * svc_pool_for_cpu(struct svc_serv *serv, int cpu);
char * svc_print_addr(struct svc_rqst *, char *, size_t);

+int svc_wq_setup(struct svc_serv *, struct svc_pool *, int);
+void svc_wq_enqueue_xprt(struct svc_xprt *);
+
#define RPC_MAX_ADDRBUFLEN (63U)

/*
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 096937871cda..ce7fd68a905e 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -117,6 +117,7 @@ void svc_xprt_init(struct net *, struct svc_xprt_class *, struct svc_xprt *,
struct svc_serv *);
int svc_create_xprt(struct svc_serv *, const char *, struct net *,
const int, const unsigned short, int);
+bool svc_xprt_has_something_to_do(struct svc_xprt *xprt);
void svc_xprt_do_enqueue(struct svc_xprt *xprt);
void svc_xprt_enqueue(struct svc_xprt *xprt);
void svc_xprt_put(struct svc_xprt *xprt);
diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 2e780134f449..3ce0a640605d 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -53,6 +53,7 @@ static inline u32 svc_sock_final_rec(struct svc_sock *svsk)
*/
void svc_close_net(struct svc_serv *, struct net *);
int svc_recv(struct svc_rqst *, long);
+int svc_wq_recv(struct svc_rqst *);
int svc_send(struct svc_rqst *);
void svc_drop(struct svc_rqst *);
void svc_sock_update_bufs(struct svc_serv *serv);
diff --git a/net/sunrpc/Makefile b/net/sunrpc/Makefile
index 15e6f6c23c5d..e40d7fb89ef4 100644
--- a/net/sunrpc/Makefile
+++ b/net/sunrpc/Makefile
@@ -13,7 +13,7 @@ sunrpc-y := clnt.o xprt.o socklib.o xprtsock.o sched.o \
svc.o svcsock.o svcauth.o svcauth_unix.o \
addr.o rpcb_clnt.o timer.o xdr.o \
sunrpc_syms.o cache.o rpc_pipe.o \
- svc_xprt.o
+ svc_wq.o svc_xprt.o
sunrpc-$(CONFIG_SUNRPC_DEBUG) += debugfs.o
sunrpc-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel_rqst.o bc_svc.o
sunrpc-$(CONFIG_PROC_FS) += stats.o
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ed243eb80e5b..9aad6619aa56 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -71,6 +71,8 @@ param_set_pool_mode(const char *val, struct kernel_param *kp)
*ip = SVC_POOL_PERCPU;
else if (!strncmp(val, "pernode", 7))
*ip = SVC_POOL_PERNODE;
+ else if (!strncmp(val, "workqueue", 9))
+ *ip = SVC_POOL_WORKQUEUE;
else
err = -EINVAL;

@@ -94,6 +96,8 @@ param_get_pool_mode(char *buf, struct kernel_param *kp)
return strlcpy(buf, "percpu", 20);
case SVC_POOL_PERNODE:
return strlcpy(buf, "pernode", 20);
+ case SVC_POOL_WORKQUEUE:
+ return strlcpy(buf, "workqueue", 20);
default:
return sprintf(buf, "%d", *ip);
}
@@ -242,6 +246,10 @@ svc_pool_map_get(void)
case SVC_POOL_PERNODE:
npools = svc_pool_map_init_pernode(m);
break;
+ case SVC_POOL_WORKQUEUE:
+ /* workqueues get a pool per numa node, but don't need a map */
+ npools = nr_node_ids;
+ break;
}

if (npools < 0) {
@@ -534,6 +542,11 @@ svc_destroy(struct svc_serv *serv)
if (svc_serv_is_pooled(serv))
svc_pool_map_put();

+ if (serv->sv_wq) {
+ destroy_workqueue(serv->sv_wq);
+ module_put(serv->sv_ops->svo_module);
+ }
+
kfree(serv->sv_pools);
kfree(serv);
}
diff --git a/net/sunrpc/svc_wq.c b/net/sunrpc/svc_wq.c
new file mode 100644
index 000000000000..d1778373249e
--- /dev/null
+++ b/net/sunrpc/svc_wq.c
@@ -0,0 +1,146 @@
+/*
+ * svc_wq - support for workqueue-based rpc svcs
+ */
+
+#include <linux/sched.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/sunrpc/stats.h>
+#include <linux/sunrpc/svc_xprt.h>
+#include <linux/module.h>
+#include <linux/workqueue.h>
+#include <trace/events/sunrpc.h>
+
+/*
+ * This workqueue job should run on each node when the workqueue is created. It
+ * walks the list of xprts for its node and queues a workqueue job for each.
+ */
+static void
+process_queued_xprt_work(struct work_struct *work)
+{
+ struct svc_pool *pool = container_of(work, struct svc_pool, sp_work);
+
+ spin_lock_bh(&pool->sp_lock);
+ while (!list_empty(&pool->sp_sockets)) {
+ struct svc_xprt *xprt = list_first_entry(&pool->sp_sockets,
+ struct svc_xprt, xpt_ready);
+
+ list_del_init(&xprt->xpt_ready);
+ svc_xprt_get(xprt);
+ queue_work(xprt->xpt_server->sv_wq, &xprt->xpt_work);
+ }
+ spin_unlock_bh(&pool->sp_lock);
+}
+
+/*
+ * If any svc_xprts are enqueued before the workqueue is available, they get
+ * added to the pool->sp_sockets list. When the workqueue becomes available,
+ * we must walk the list for each pool and queue each xprt to the workqueue.
+ *
+ * In order to minimize inter-node communication, we queue a separate job for
+ * each node to walk its own list. We queue this job to any cpu in the node.
+ * Since the workqueues are unbound they'll end up queued to the pool_workqueue
+ * for their corresponding node, and not necessarily to the given CPU.
+ */
+static void
+process_queued_xprts(struct svc_serv *serv)
+{
+ int node;
+
+ for (node = 0; node < serv->sv_nrpools; ++node) {
+ int cpu = any_online_cpu(*cpumask_of_node(node));
+ struct svc_pool *pool = &serv->sv_pools[node];
+
+ INIT_WORK(&pool->sp_work, process_queued_xprt_work);
+ queue_work_on(cpu, serv->sv_wq, &pool->sp_work);
+ }
+}
+
+/*
+ * Start up or shut down a workqueue-based RPC service. Basically, we use this
+ * to allocate the workqueue. The function assumes that the caller holds one
+ * serv->sv_nrthreads reference.
+ *
+ * The "active" parm is treated as a boolean here. The only meaningful values
+ * are non-zero which means that we're starting the service up, or zero which
+ * means that we're taking it down.
+ */
+int
+svc_wq_setup(struct svc_serv *serv, struct svc_pool *pool, int active)
+{
+ int nrthreads = serv->sv_nrthreads - 1; /* -1 for caller's reference */
+
+ WARN_ON_ONCE(nrthreads < 0);
+
+ /*
+ * We don't allow startup or shutdown on a per-node basis. If we got
+ * here via the pool_threads interface, then just return an error.
+ */
+ if (pool)
+ return -EINVAL;
+
+ /*
+ * A zero "active" value is essentially ignored. If the service isn't
+ * up then we don't need to do anything. If it is, then we can't take
+ * down the workqueue until the closing of the xprts is done.
+ */
+ if (!nrthreads && active) {
+ __module_get(serv->sv_ops->svo_module);
+ serv->sv_wq = alloc_workqueue("%s",
+ WQ_UNBOUND|WQ_FREEZABLE|WQ_SYSFS,
+ 0, serv->sv_name);
+ if (!serv->sv_wq) {
+ module_put(serv->sv_ops->svo_module);
+ return -ENOMEM;
+ }
+ process_queued_xprts(serv);
+ }
+
+ /* +1 for caller's reference */
+ serv->sv_nrthreads = active + 1;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(svc_wq_setup);
+
+/*
+ * A svc_xprt needs to be serviced. Queue its workqueue job and return. In the
+ * event that the workqueue isn't available yet, add it to the sp_sockets list
+ * so that it can be processed when it does become available.
+ */
+void
+svc_wq_enqueue_xprt(struct svc_xprt *xprt)
+{
+ struct svc_serv *serv = xprt->xpt_server;
+
+ if (!svc_xprt_has_something_to_do(xprt))
+ return;
+
+ /* Don't enqueue transport while already enqueued */
+ if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
+ return;
+
+ /* No workqueue yet? Queue the socket until there is one. */
+ if (!serv->sv_wq) {
+ struct svc_pool *pool = &serv->sv_pools[numa_node_id()];
+
+ spin_lock_bh(&pool->sp_lock);
+
+ /*
+ * It's possible for the workqueue to have been started up between
+ * the check above and the taking of the lock. Check again while
+ * holding the lock to avoid that potential race.
+ */
+ if (serv->sv_wq) {
+ spin_unlock_bh(&pool->sp_lock);
+ goto out;
+ }
+
+ list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
+ spin_unlock_bh(&pool->sp_lock);
+ return;
+ }
+out:
+ svc_xprt_get(xprt);
+ queue_work(serv->sv_wq, &xprt->xpt_work);
+}
+EXPORT_SYMBOL_GPL(svc_wq_enqueue_xprt);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 63b42a8578c0..17398eb9f38f 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -313,7 +313,7 @@ char *svc_print_addr(struct svc_rqst *rqstp, char *buf, size_t len)
}
EXPORT_SYMBOL_GPL(svc_print_addr);

-static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
+bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
{
if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
return true;
@@ -851,6 +851,51 @@ out:
EXPORT_SYMBOL_GPL(svc_recv);

/*
+ * Perform a receive off the rqstp->rq_xprt socket.
+ *
+ * This function is a bit different from the standard svc_recv function as it
+ * assumes that the xprt is already provided in rqstp->rq_xprt, and so it
+ * does not sleep when there is no more work to be done.
+ */
+int
+svc_wq_recv(struct svc_rqst *rqstp)
+{
+ int len, err;
+ struct svc_xprt *xprt = rqstp->rq_xprt;
+ struct svc_serv *serv = xprt->xpt_server;
+
+ err = svc_alloc_arg(rqstp);
+ if (err)
+ goto out;
+
+ len = svc_handle_xprt(rqstp, xprt);
+ if (len <= 0) {
+ err = -EAGAIN;
+ goto out_release;
+ }
+
+ clear_bit(XPT_OLD, &xprt->xpt_flags);
+
+ if (xprt->xpt_ops->xpo_secure_port(rqstp))
+ set_bit(RQ_SECURE, &rqstp->rq_flags);
+ else
+ clear_bit(RQ_SECURE, &rqstp->rq_flags);
+ rqstp->rq_chandle.defer = svc_defer;
+ rqstp->rq_xid = svc_getu32(&rqstp->rq_arg.head[0]);
+
+ if (serv->sv_stats)
+ serv->sv_stats->netcnt++;
+ trace_svc_recv(rqstp, len);
+ return len;
+out_release:
+ rqstp->rq_res.len = 0;
+ svc_xprt_release(rqstp);
+out:
+ return err;
+}
+EXPORT_SYMBOL_GPL(svc_wq_recv);
+
+/*
* Drop request
*/
void svc_drop(struct svc_rqst *rqstp)
--
2.1.0

2014-12-10 19:09:42

by Jeff Layton

Subject: [PATCH v2 12/16] nfsd: add support for workqueue based service processing

Allow nfsd to create a workqueue based nfsd service if the sunrpc
pool_mode is set to "workqueue".

The first iteration here is fairly simple. The xprt work simply
allocates a svc_rqst object and then queues that rqst's work.
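
The two-stage dispatch in this patch (nfsd_xprt_work allocates a rqst and
queues nfsd_rqst_work; on allocation failure it requeues itself) can be
modeled in userspace C. This is only a sketch under stated assumptions: the
counters and the `alloc_ok` toggle are invented for illustration, and the
real code defers via queue_work() rather than a counter:

```c
#include <stdbool.h>
#include <stdlib.h>

static bool alloc_ok = true;    /* toggle to simulate ENOMEM */
static int xprt_requeued;       /* times the xprt work was requeued */
static int rqsts_processed;     /* times the rqst work ran */

/* Stands in for nfsd_rqst_work(): receive and process the request. */
static void rqst_work(void)
{
    rqsts_processed++;
}

/* Stands in for nfsd_xprt_work(): allocate a request context; on
 * failure give up for now and requeue; on success hand off to the
 * request-level work. Returns 0 on success, -1 when requeued. */
int xprt_work(void)
{
    void *r = alloc_ok ? malloc(16) : NULL;

    if (!r) {
        xprt_requeued++;        /* queue_work(serv->sv_wq, &xpt_work) */
        return -1;
    }
    free(r);
    rqst_work();                /* queue_work(serv->sv_wq, &rq_work) */
    return 0;
}
```

The design point the sketch preserves is that an allocation failure never
loses the transport: the xprt-level work simply runs again later instead of
dropping the event.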

Signed-off-by: Jeff Layton <[email protected]>
---
fs/fs_struct.c | 3 ++-
fs/nfsd/nfssvc.c | 69 +++++++++++++++++++++++++++++++++++++++++++++--
include/linux/fs_struct.h | 1 +
3 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 9bc08ea2f433..68985c76cd75 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -130,7 +130,7 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
EXPORT_SYMBOL_GPL(copy_fs_struct);

/* Replace current fs struct with one given. Return a pointer to old one. */
-static struct fs_struct *
+struct fs_struct *
swap_fs_struct(struct fs_struct *new_fs)
{
struct fs_struct *old_fs;
@@ -142,6 +142,7 @@ swap_fs_struct(struct fs_struct *new_fs)

return old_fs;
}
+EXPORT_SYMBOL_GPL(swap_fs_struct);

/* Put a reference to a fs_struct. */
void put_fs_struct(struct fs_struct *fs)
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index f37bd7db2176..7e22068bdad4 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -28,6 +28,8 @@
extern struct svc_program nfsd_program;
static int nfsd(void *vrqstp);

+static struct svc_serv_ops nfsd_wq_sv_ops;
+
/*
* nfsd_mutex protects nn->nfsd_serv -- both the pointer itself and the members
* of the svc_serv struct. In particular, ->sv_nrthreads but also to some
@@ -398,10 +400,19 @@ static struct svc_serv_ops nfsd_thread_sv_ops = {
.svo_module = THIS_MODULE,
};

+static struct svc_serv_ops *
+select_svc_serv_ops(void)
+{
+ if (svc_pool_map.mode == SVC_POOL_WORKQUEUE)
+ return &nfsd_wq_sv_ops;
+ return &nfsd_thread_sv_ops;
+}
+
int nfsd_create_serv(struct net *net)
{
int error;
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+ struct svc_serv_ops *ops;

WARN_ON(!mutex_is_locked(&nfsd_mutex));
if (nn->nfsd_serv) {
@@ -411,8 +422,11 @@ int nfsd_create_serv(struct net *net)
if (nfsd_max_blksize == 0)
nfsd_max_blksize = nfsd_get_default_max_blksize();
nfsd_reset_versions();
- nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- &nfsd_thread_sv_ops);
+
+ svc_pool_map_get();
+ ops = select_svc_serv_ops();
+ nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize, ops);
+ svc_pool_map_put();
if (nn->nfsd_serv == NULL)
return -ENOMEM;

@@ -643,6 +657,57 @@ nfsd(void *vrqstp)
return 0;
}

+static void
+nfsd_rqst_work(struct work_struct *work)
+{
+ struct svc_rqst *rqstp = container_of(work, struct svc_rqst, rq_work);
+ struct svc_xprt *xprt = rqstp->rq_xprt;
+ struct net *net = xprt->xpt_net;
+ struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+ struct fs_struct *saved_fs;
+
+ rqstp->rq_server->sv_maxconn = nn->max_connections;
+
+ if (svc_wq_recv(rqstp) < 0) {
+ svc_rqst_free(rqstp);
+ return;
+ }
+
+ saved_fs = swap_fs_struct(rqstp->rq_fs);
+ svc_process(rqstp);
+ saved_fs = swap_fs_struct(saved_fs);
+ svc_rqst_free(rqstp);
+}
+
+/* work function for workqueue-based nfsd */
+static void
+nfsd_xprt_work(struct work_struct *work)
+{
+ int node = numa_node_id();
+ struct svc_xprt *xprt = container_of(work, struct svc_xprt, xpt_work);
+ struct svc_rqst *rqstp;
+ struct svc_serv *serv = xprt->xpt_server;
+
+ rqstp = svc_rqst_alloc(serv, &serv->sv_pools[node], node);
+ if (!rqstp) {
+ /* Alloc failure. Give up for now, and requeue the work */
+ queue_work(serv->sv_wq, &xprt->xpt_work);
+ return;
+ }
+
+ rqstp->rq_xprt = xprt;
+ queue_work(serv->sv_wq, &rqstp->rq_work);
+}
+
+static struct svc_serv_ops nfsd_wq_sv_ops = {
+ .svo_shutdown = nfsd_last_thread,
+ .svo_enqueue_xprt = svc_wq_enqueue_xprt,
+ .svo_xprt_work = nfsd_xprt_work,
+ .svo_rqst_work = nfsd_rqst_work,
+ .svo_setup = svc_wq_setup,
+ .svo_module = THIS_MODULE,
+};
+
static __be32 map_new_errors(u32 vers, __be32 nfserr)
{
if (nfserr == nfserr_jukebox && vers == 2)
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index d2b7a1942790..671c49f13396 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -20,6 +20,7 @@ extern void exit_fs(struct task_struct *);
extern void set_fs_root(struct fs_struct *, const struct path *);
extern void set_fs_pwd(struct fs_struct *, const struct path *);
extern struct fs_struct *copy_fs_struct(struct fs_struct *);
+extern struct fs_struct *swap_fs_struct(struct fs_struct *);
extern void free_fs_struct(struct fs_struct *);
extern void replace_fs_struct(struct fs_struct *);
extern int unshare_fs_struct(void);
--
2.1.0

2014-12-10 19:10:07

by Jeff Layton

Subject: [PATCH v2 11/16] nfsd: keep a reference to the fs_struct in svc_rqst

When we convert this code to use a workqueue, we won't want to allocate
a new fs_struct to handle each RPC. Doing so might also be problematic
since we'd be swapping out the ->fs value on a "public" workqueue
kthread.

Change the code to allocate an fs_struct when allocating a svc_rqst
and then switch to using that in the "nfsd" function. Once we add
workqueue support, we'll have the work function switch to the new fs
struct when doing the work and then switch it back.
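
The reference-counting and swap pattern this patch builds (get_fs_struct,
put_fs_struct, and swap_fs_struct around the work) can be sketched in
userspace C. This is a simplified model, not the kernel implementation: the
`current_fs` global stands in for current->fs, and the real code protects
the users count with fs->lock and the swap with task_lock():

```c
#include <stdlib.h>

struct fs_struct {
    int users;
    int umask;
};

static struct fs_struct *current_fs;    /* stands in for current->fs */

/* Take an extra reference; the caller must already hold one. */
struct fs_struct *get_fs_struct(struct fs_struct *fs)
{
    fs->users++;
    return fs;
}

/* Drop a reference; free on the last put. Returns 1 if freed. */
int put_fs_struct(struct fs_struct *fs)
{
    if (--fs->users == 0) {
        free(fs);
        return 1;
    }
    return 0;
}

/* Install new_fs as "current" and hand back the old one, as the
 * workqueue job does around svc_process(). */
struct fs_struct *swap_fs_struct(struct fs_struct *new_fs)
{
    struct fs_struct *old = current_fs;

    current_fs = new_fs;
    return old;
}
```

The swap-in/swap-out pairing matters because the work runs on a shared
kworker: the job must restore the thread's original ->fs before returning,
or it would leak the rqst's fs_struct into unrelated work items.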

Cc: Al Viro <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/fs_struct.c | 59 +++++++++++++++++++++++++++++++++++++++-------
fs/nfsd/nfssvc.c | 17 ++++++-------
include/linux/fs_struct.h | 3 +++
include/linux/sunrpc/svc.h | 1 +
net/sunrpc/svc.c | 8 +++++++
5 files changed, 69 insertions(+), 19 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 7dca743b2ce1..9bc08ea2f433 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -127,26 +127,67 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
}
return fs;
}
+EXPORT_SYMBOL_GPL(copy_fs_struct);

-int unshare_fs_struct(void)
+/* Replace current fs struct with one given. Return a pointer to old one. */
+static struct fs_struct *
+swap_fs_struct(struct fs_struct *new_fs)
{
- struct fs_struct *fs = current->fs;
- struct fs_struct *new_fs = copy_fs_struct(fs);
- int kill;
-
- if (!new_fs)
- return -ENOMEM;
+ struct fs_struct *old_fs;

task_lock(current);
+ old_fs = current->fs;
+ current->fs = new_fs;
+ task_unlock(current);
+
+ return old_fs;
+}
+
+/* Put a reference to a fs_struct. */
+void put_fs_struct(struct fs_struct *fs)
+{
+ bool kill;
+
spin_lock(&fs->lock);
kill = !--fs->users;
- current->fs = new_fs;
spin_unlock(&fs->lock);
- task_unlock(current);

if (kill)
free_fs_struct(fs);
+}
+EXPORT_SYMBOL_GPL(put_fs_struct);
+
+/* Take an extra reference to a fs_struct. Caller must already hold one! */
+struct fs_struct *
+get_fs_struct(struct fs_struct *fs)
+{
+ spin_lock(&fs->lock);
+ ++fs->users;
+ spin_unlock(&fs->lock);
+ return fs;
+}
+EXPORT_SYMBOL_GPL(get_fs_struct);
+
+/*
+ * Swap in a new fs_struct and drop the reference on the old one.
+ * Caller must have already taken the reference to the new one.
+ */
+void replace_fs_struct(struct fs_struct *new_fs)
+{
+ struct fs_struct *old_fs = swap_fs_struct(new_fs);
+
+ put_fs_struct(old_fs);
+}
+EXPORT_SYMBOL_GPL(replace_fs_struct);
+
+int unshare_fs_struct(void)
+{
+ struct fs_struct *new_fs = copy_fs_struct(current->fs);
+
+ if (!new_fs)
+ return -ENOMEM;

+ replace_fs_struct(new_fs);
return 0;
}
EXPORT_SYMBOL_GPL(unshare_fs_struct);
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 71e7b180c0d9..f37bd7db2176 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -582,15 +582,13 @@ nfsd(void *vrqstp)
/* Lock module and set up kernel thread */
mutex_lock(&nfsd_mutex);

- /* At this point, the thread shares current->fs
- * with the init process. We need to create files with a
- * umask of 0 instead of init's umask. */
- if (unshare_fs_struct() < 0) {
- printk("Unable to start nfsd thread: out of memory\n");
- goto out;
- }
-
- current->fs->umask = 0;
+ /*
+ * At this point, the thread shares current->fs with the init process.
+ * We need to create files with a umask of 0 instead of init's umask,
+ * so switch to the fs_struct associated with the rqstp.
+ */
+ get_fs_struct(rqstp->rq_fs);
+ replace_fs_struct(rqstp->rq_fs);

/*
* thread is spawned with all signals set to SIG_IGN, re-enable
@@ -632,7 +630,6 @@ nfsd(void *vrqstp)
mutex_lock(&nfsd_mutex);
nfsdstats.th_cnt --;

-out:
rqstp->rq_server = NULL;

/* Release the thread */
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 0efc3e62843a..d2b7a1942790 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -21,7 +21,10 @@ extern void set_fs_root(struct fs_struct *, const struct path *);
extern void set_fs_pwd(struct fs_struct *, const struct path *);
extern struct fs_struct *copy_fs_struct(struct fs_struct *);
extern void free_fs_struct(struct fs_struct *);
+extern void replace_fs_struct(struct fs_struct *);
extern int unshare_fs_struct(void);
+struct fs_struct *get_fs_struct(struct fs_struct *);
+void put_fs_struct(struct fs_struct *);

static inline void get_fs_root(struct fs_struct *fs, struct path *root)
{
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 43efdaae943a..695bc989c007 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -298,6 +298,7 @@ struct svc_rqst {
struct svc_cacherep * rq_cacherep; /* cache info */
struct task_struct *rq_task; /* service thread */
spinlock_t rq_lock; /* per-request lock */
+ struct fs_struct *rq_fs;
struct work_struct rq_work; /* per-request work */
};

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 9aad6619aa56..78395f790b54 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -20,6 +20,7 @@
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/slab.h>
+#include <linux/fs_struct.h>

#include <linux/sunrpc/types.h>
#include <linux/sunrpc/xdr.h>
@@ -622,6 +623,11 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
if (!svc_init_buffer(rqstp, serv->sv_max_mesg, node))
goto out_enomem;

+ rqstp->rq_fs = copy_fs_struct(current->fs);
+ if (!rqstp->rq_fs)
+ goto out_enomem;
+
+ rqstp->rq_fs->umask = 0;
return rqstp;
out_enomem:
svc_rqst_free(rqstp);
@@ -784,6 +790,8 @@ svc_rqst_free(struct svc_rqst *rqstp)
kfree(rqstp->rq_resp);
kfree(rqstp->rq_argp);
kfree(rqstp->rq_auth_data);
+ if (rqstp->rq_fs)
+ put_fs_struct(rqstp->rq_fs);
kfree_rcu(rqstp, rq_rcu_head);
}
EXPORT_SYMBOL_GPL(svc_rqst_free);
--
2.1.0

2014-12-10 19:10:42

by Jeff Layton

Subject: [PATCH v2 09/16] sunrpc: set up svc_rqst work if it's defined

We want to queue most of the work in the context of the svc_rqst. Add a
work_struct to the svc_rqst and a new svo_rqst_work operation. If that
operation is defined, initialize the work when allocating the svc_rqst.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 4 ++++
net/sunrpc/svc.c | 2 ++
2 files changed, 6 insertions(+)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 8bd53f723485..70bee4e86a9f 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -61,6 +61,9 @@ struct svc_serv_ops {
/* xprt work function */
work_func_t svo_xprt_work;

+ /* rqst work function */
+ work_func_t svo_rqst_work;
+
/* queue up a transport for servicing */
void (*svo_enqueue_xprt)(struct svc_xprt *);

@@ -293,6 +296,7 @@ struct svc_rqst {
struct svc_cacherep * rq_cacherep; /* cache info */
struct task_struct *rq_task; /* service thread */
spinlock_t rq_lock; /* per-request lock */
+ struct work_struct rq_work; /* per-request work */
};

#define SVC_NET(svc_rqst) (svc_rqst->rq_xprt->xpt_net)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 135ffbe9d983..ed243eb80e5b 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -595,6 +595,8 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
spin_lock_init(&rqstp->rq_lock);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;
+ if (serv->sv_ops->svo_rqst_work)
+ INIT_WORK(&rqstp->rq_work, serv->sv_ops->svo_rqst_work);

rqstp->rq_argp = kmalloc_node(serv->sv_xdrsize, GFP_KERNEL, node);
if (!rqstp->rq_argp)
--
2.1.0

2014-12-10 19:08:20

by Jeff Layton

Subject: [PATCH v2 05/16] sunrpc: abstract out svc_set_num_threads to sv_ops

Add an operation that will do setup of the service. In the case of a
classic thread-based service that means starting up threads. In the case
of a workqueue-based service, the setup will do something different.
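
The indirection this patch introduces can be sketched in userspace C: callers
go through sv_ops->svo_setup instead of calling svc_set_num_threads()
directly, so a workqueue-based service can plug in a different setup routine.
Everything below is an illustrative stand-in (the struct layouts and the two
setup functions are invented for the sketch, not kernel code), but it mirrors
the later svc_wq_setup() behavior where the count is just an on/off switch
and per-pool setup is rejected:

```c
#include <stddef.h>

struct svc_serv;

struct svc_serv_ops {
    int (*svo_setup)(struct svc_serv *, void *pool, int n);
};

struct svc_serv {
    const struct svc_serv_ops *sv_ops;
    int nr;                 /* thread count, or 0/1 in wq mode */
};

/* Thread mode: n is a real thread count (like svc_set_num_threads). */
static int thread_setup(struct svc_serv *s, void *pool, int n)
{
    (void)pool;
    s->nr = n;
    return 0;
}

/* Workqueue mode: per-pool setup is rejected and n is boolean. */
static int wq_setup(struct svc_serv *s, void *pool, int n)
{
    if (pool)
        return -1;          /* like svc_wq_setup returning -EINVAL */
    s->nr = n ? 1 : 0;
    return 0;
}

static const struct svc_serv_ops thread_ops = { .svo_setup = thread_setup };
static const struct svc_serv_ops wq_ops = { .svo_setup = wq_setup };

/* What nfsd_svc()/nfsd_set_nrthreads() now do: call through the ops. */
int set_threads(struct svc_serv *s, void *pool, int n)
{
    return s->sv_ops->svo_setup(s, pool, n);
}
```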

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfssvc.c | 8 +++++---
include/linux/sunrpc/svc.h | 3 +++
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 5904c06cfd32..71e7b180c0d9 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -394,6 +394,7 @@ static struct svc_serv_ops nfsd_thread_sv_ops = {
.svo_shutdown = nfsd_last_thread,
.svo_function = nfsd,
.svo_enqueue_xprt = svc_xprt_do_enqueue,
+ .svo_setup = svc_set_num_threads,
.svo_module = THIS_MODULE,
};

@@ -506,8 +507,8 @@ int nfsd_set_nrthreads(int n, int *nthreads, struct net *net)
/* apply the new numbers */
svc_get(nn->nfsd_serv);
for (i = 0; i < n; i++) {
- err = svc_set_num_threads(nn->nfsd_serv, &nn->nfsd_serv->sv_pools[i],
- nthreads[i]);
+ err = nn->nfsd_serv->sv_ops->svo_setup(nn->nfsd_serv,
+ &nn->nfsd_serv->sv_pools[i], nthreads[i]);
if (err)
break;
}
@@ -546,7 +547,8 @@ nfsd_svc(int nrservs, struct net *net)
error = nfsd_startup_net(nrservs, net);
if (error)
goto out_destroy;
- error = svc_set_num_threads(nn->nfsd_serv, NULL, nrservs);
+ error = nn->nfsd_serv->sv_ops->svo_setup(nn->nfsd_serv,
+ NULL, nrservs);
if (error)
goto out_shutdown;
/* We are holding a reference to nn->nfsd_serv which
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 7429cdcb48b5..5e172c8329f8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -61,6 +61,9 @@ struct svc_serv_ops {
/* queue up a transport for servicing */
void (*svo_enqueue_xprt)(struct svc_xprt *);

+ /* set up thread (or whatever) execution context */
+ int (*svo_setup)(struct svc_serv *, struct svc_pool *, int);
+
/* optional module to count when adding threads (pooled svcs only) */
struct module *svo_module;
};
--
2.1.0

2014-12-10 19:11:03

by Jeff Layton

Subject: [PATCH v2 08/16] sunrpc: set up workqueue function in svc_xprt

For workqueue-based services we'll want to queue a workqueue job
whenever an xprt needs to be serviced. Add a work_struct to svc_xprt
and initialize it from a new work_func_t field in the svc_serv_ops.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 3 +++
include/linux/sunrpc/svc_xprt.h | 1 +
net/sunrpc/svc.c | 2 +-
net/sunrpc/svc_xprt.c | 2 ++
4 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 3d2113222dda..8bd53f723485 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -58,6 +58,9 @@ struct svc_serv_ops {
/* function for service threads to run */
int (*svo_function)(void *);

+ /* xprt work function */
+ work_func_t svo_xprt_work;
+
/* queue up a transport for servicing */
void (*svo_enqueue_xprt)(struct svc_xprt *);

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 78512cfe1fe6..096937871cda 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -53,6 +53,7 @@ struct svc_xprt {
struct kref xpt_ref;
struct list_head xpt_list;
struct list_head xpt_ready;
+ struct work_struct xpt_work;
unsigned long xpt_flags;
#define XPT_BUSY 0 /* enqueued/receiving */
#define XPT_CONN 1 /* conn pending */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 165fa1803a0a..135ffbe9d983 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -34,7 +34,7 @@

static void svc_unregister(const struct svc_serv *serv, struct net *net);

-#define svc_serv_is_pooled(serv) ((serv)->sv_ops->svo_function)
+#define svc_serv_is_pooled(serv) ((serv)->sv_ops->svo_function || (serv)->sv_ops->svo_xprt_work)

#define SVC_POOL_DEFAULT SVC_POOL_GLOBAL

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 4954339315b7..63b42a8578c0 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -157,6 +157,8 @@ void svc_xprt_init(struct net *net, struct svc_xprt_class *xcl,
xprt->xpt_ops = xcl->xcl_ops;
kref_init(&xprt->xpt_ref);
xprt->xpt_server = serv;
+ if (serv->sv_ops->svo_xprt_work)
+ INIT_WORK(&xprt->xpt_work, serv->sv_ops->svo_xprt_work);
INIT_LIST_HEAD(&xprt->xpt_list);
INIT_LIST_HEAD(&xprt->xpt_ready);
INIT_LIST_HEAD(&xprt->xpt_deferred);
--
2.1.0

2014-12-10 19:11:49

by Jeff Layton

Subject: [PATCH v2 07/16] sunrpc: factor svc_rqst allocation and freeing from sv_nrthreads refcounting

In later patches, we'll want to be able to allocate and free svc_rqst
structures without monkeying with the serv->sv_nrthreads refcount.

Factor those pieces out of their respective functions.
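
One detail of the split is the return convention: svc_rqst_alloc() returns
NULL on failure, while the svc_prepare_thread() wrapper preserves its old
ERR_PTR(-ENOMEM) contract for existing callers. A userspace sketch of that
pattern, with minimal stand-ins for the kernel's ERR_PTR helpers (the
`force_fail` toggle and the two allocation functions are invented for
illustration):

```c
#include <errno.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Minimal userspace versions of the kernel's err.h helpers: encode a
 * negative errno in the top page of the pointer space. */
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
    return (unsigned long)p >= (unsigned long)-MAX_ERRNO;
}

static int force_fail;          /* simulate allocation failure */

/* Like the new svc_rqst_alloc(): plain NULL on failure. */
void *rqst_alloc(void)
{
    return force_fail ? NULL : calloc(1, 16);
}

/* Like svc_prepare_thread(): keep the historical ERR_PTR contract. */
void *prepare_thread(void)
{
    void *r = rqst_alloc();

    return r ? r : ERR_PTR(-ENOMEM);
}
```

Keeping the NULL convention on the inner function lets the workqueue code
(which just retries on failure) avoid IS_ERR checks, while thread-based
callers keep seeing the error pointer they always did.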

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 3 +++
net/sunrpc/svc.c | 58 ++++++++++++++++++++++++++++++----------------
2 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 28e5f5716a87..3d2113222dda 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -458,8 +458,11 @@ void svc_rpcb_cleanup(struct svc_serv *serv, struct net *net);
int svc_bind(struct svc_serv *serv, struct net *net);
struct svc_serv *svc_create(struct svc_program *, unsigned int,
struct svc_serv_ops *);
+struct svc_rqst *svc_rqst_alloc(struct svc_serv *serv,
+ struct svc_pool *pool, int node);
struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool, int node);
+void svc_rqst_free(struct svc_rqst *);
void svc_exit_thread(struct svc_rqst *);
unsigned int svc_pool_map_get(void);
void svc_pool_map_put(void);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 04c083a53121..165fa1803a0a 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -583,40 +583,52 @@ svc_release_buffer(struct svc_rqst *rqstp)
}

struct svc_rqst *
-svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
+svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
{
struct svc_rqst *rqstp;

rqstp = kzalloc_node(sizeof(*rqstp), GFP_KERNEL, node);
if (!rqstp)
- goto out_enomem;
+ return rqstp;

- serv->sv_nrthreads++;
__set_bit(RQ_BUSY, &rqstp->rq_flags);
spin_lock_init(&rqstp->rq_lock);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;
- spin_lock_bh(&pool->sp_lock);
- pool->sp_nrthreads++;
- list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
- spin_unlock_bh(&pool->sp_lock);

rqstp->rq_argp = kmalloc_node(serv->sv_xdrsize, GFP_KERNEL, node);
if (!rqstp->rq_argp)
- goto out_thread;
+ goto out_enomem;

rqstp->rq_resp = kmalloc_node(serv->sv_xdrsize, GFP_KERNEL, node);
if (!rqstp->rq_resp)
- goto out_thread;
+ goto out_enomem;

if (!svc_init_buffer(rqstp, serv->sv_max_mesg, node))
- goto out_thread;
+ goto out_enomem;

return rqstp;
-out_thread:
- svc_exit_thread(rqstp);
out_enomem:
- return ERR_PTR(-ENOMEM);
+ svc_rqst_free(rqstp);
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(svc_rqst_alloc);
+
+struct svc_rqst *
+svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
+{
+ struct svc_rqst *rqstp;
+
+ rqstp = svc_rqst_alloc(serv, pool, node);
+ if (!rqstp)
+ return ERR_PTR(-ENOMEM);
+
+ serv->sv_nrthreads++;
+ spin_lock_bh(&pool->sp_lock);
+ pool->sp_nrthreads++;
+ list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
+ spin_unlock_bh(&pool->sp_lock);
+ return rqstp;
}
EXPORT_SYMBOL_GPL(svc_prepare_thread);

@@ -747,19 +759,25 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
EXPORT_SYMBOL_GPL(svc_set_num_threads);

/*
- * Called from a server thread as it's exiting. Caller must hold the BKL or
- * the "service mutex", whichever is appropriate for the service.
+ * Called from a server thread as it's exiting. Caller must hold the service's
+ * mutex.
*/
void
-svc_exit_thread(struct svc_rqst *rqstp)
+svc_rqst_free(struct svc_rqst *rqstp)
{
- struct svc_serv *serv = rqstp->rq_server;
- struct svc_pool *pool = rqstp->rq_pool;
-
svc_release_buffer(rqstp);
kfree(rqstp->rq_resp);
kfree(rqstp->rq_argp);
kfree(rqstp->rq_auth_data);
+ kfree_rcu(rqstp, rq_rcu_head);
+}
+EXPORT_SYMBOL_GPL(svc_rqst_free);
+
+void
+svc_exit_thread(struct svc_rqst *rqstp)
+{
+ struct svc_serv *serv = rqstp->rq_server;
+ struct svc_pool *pool = rqstp->rq_pool;

spin_lock_bh(&pool->sp_lock);
pool->sp_nrthreads--;
@@ -767,7 +785,7 @@ svc_exit_thread(struct svc_rqst *rqstp)
list_del_rcu(&rqstp->rq_all);
spin_unlock_bh(&pool->sp_lock);

- kfree_rcu(rqstp, rq_rcu_head);
+ svc_rqst_free(rqstp);

/* Release the server */
if (serv)
--
2.1.0

2014-12-10 19:08:16

by Jeff Layton

Subject: [PATCH v2 02/16] sunrpc: move sv_function into sv_ops

Since we now have a container for holding svc_serv operations, move the
sv_function into it as well.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfssvc.c | 3 ++-
include/linux/sunrpc/svc.h | 11 +++--------
net/sunrpc/svc.c | 8 +++-----
3 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index cc5dd1227732..5017af3fe882 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -392,6 +392,7 @@ static int nfsd_get_default_max_blksize(void)

static struct svc_serv_ops nfsd_sv_ops = {
.svo_shutdown = nfsd_last_thread,
+ .svo_function = nfsd,
};

int nfsd_create_serv(struct net *net)
@@ -408,7 +409,7 @@ int nfsd_create_serv(struct net *net)
nfsd_max_blksize = nfsd_get_default_max_blksize();
nfsd_reset_versions();
nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- &nfsd_sv_ops, nfsd, THIS_MODULE);
+ &nfsd_sv_ops, THIS_MODULE);
if (nn->nfsd_serv == NULL)
return -ENOMEM;

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 35faea625d9f..7e27894bc065 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -19,11 +19,6 @@
#include <linux/wait.h>
#include <linux/mm.h>

-/*
- * This is the RPC server thread function prototype
- */
-typedef int (*svc_thread_fn)(void *);
-
/* statistics for svc_pool structures */
struct svc_pool_stats {
atomic_long_t packets;
@@ -58,7 +53,8 @@ struct svc_serv;

struct svc_serv_ops {
/* Callback to use when last thread exits. */
- void (*svo_shutdown)(struct svc_serv *serv, struct net *net);
+ void (*svo_shutdown)(struct svc_serv *, struct net *);
+ int (*svo_function)(void *);
};

/*
@@ -95,7 +91,6 @@ struct svc_serv {
struct svc_serv_ops * sv_ops; /* server operations */
struct module * sv_module; /* optional module to count when
* adding threads */
- svc_thread_fn sv_function; /* main function for threads */
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
struct list_head sv_cb_list; /* queue for callback requests
* that arrive over the same
@@ -435,7 +430,7 @@ struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool, int node);
void svc_exit_thread(struct svc_rqst *);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
- struct svc_serv_ops *, svc_thread_fn, struct module *);
+ struct svc_serv_ops *, struct module *);
int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
int svc_pool_stats_open(struct svc_serv *serv, struct file *file);
void svc_destroy(struct svc_serv *);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 057185e11261..897c755c6186 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -34,7 +34,7 @@

static void svc_unregister(const struct svc_serv *serv, struct net *net);

-#define svc_serv_is_pooled(serv) ((serv)->sv_function)
+#define svc_serv_is_pooled(serv) ((serv)->sv_ops->svo_function)

/*
* Mode for mapping cpus to pools.
@@ -494,8 +494,7 @@ EXPORT_SYMBOL_GPL(svc_create);

struct svc_serv *
svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
- struct svc_serv_ops *ops, svc_thread_fn func,
- struct module *mod)
+ struct svc_serv_ops *ops, struct module *mod)
{
struct svc_serv *serv;
unsigned int npools = svc_pool_map_get();
@@ -504,7 +503,6 @@ svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
if (!serv)
goto out_err;

- serv->sv_function = func;
serv->sv_module = mod;
return serv;
out_err:
@@ -740,7 +738,7 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
}

__module_get(serv->sv_module);
- task = kthread_create_on_node(serv->sv_function, rqstp,
+ task = kthread_create_on_node(serv->sv_ops->svo_function, rqstp,
node, "%s", serv->sv_name);
if (IS_ERR(task)) {
error = PTR_ERR(task);
--
2.1.0

2014-12-10 19:12:48

by Jeff Layton

Subject: [PATCH v2 04/16] sunrpc: turn enqueueing a svc_xprt into a svc_serv operation

For now, all services use svc_xprt_do_enqueue, but once we add
workqueue-based service support, we'll need to do something different.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 3 ++-
fs/nfs/callback.c | 1 +
fs/nfsd/nfssvc.c | 11 ++++++-----
include/linux/sunrpc/svc.h | 3 +++
include/linux/sunrpc/svc_xprt.h | 1 +
net/sunrpc/svc_xprt.c | 10 +++++-----
6 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index d93a8d0a04bb..b6bdba25d4fb 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -327,7 +327,8 @@ out_rqst:
}

static struct svc_serv_ops lockd_sv_ops = {
- .svo_shutdown = svc_rpcb_cleanup,
+ .svo_shutdown = svc_rpcb_cleanup,
+ .svo_enqueue_xprt = svc_xprt_do_enqueue,
};

static struct svc_serv *lockd_create_svc(void)
diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index ace6c70f807b..1800a918f244 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -307,6 +307,7 @@ err_bind:
}

static struct svc_serv_ops nfs_cb_sv_ops = {
+ .svo_enqueue_xprt = svc_xprt_do_enqueue,
};

static struct svc_serv *nfs_callback_create_svc(int minorversion)
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 94e7e173aa10..5904c06cfd32 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -390,10 +390,11 @@ static int nfsd_get_default_max_blksize(void)
return ret;
}

-static struct svc_serv_ops nfsd_sv_ops = {
- .svo_shutdown = nfsd_last_thread,
- .svo_function = nfsd,
- .svo_module = THIS_MODULE,
+static struct svc_serv_ops nfsd_thread_sv_ops = {
+ .svo_shutdown = nfsd_last_thread,
+ .svo_function = nfsd,
+ .svo_enqueue_xprt = svc_xprt_do_enqueue,
+ .svo_module = THIS_MODULE,
};

int nfsd_create_serv(struct net *net)
@@ -410,7 +411,7 @@ int nfsd_create_serv(struct net *net)
nfsd_max_blksize = nfsd_get_default_max_blksize();
nfsd_reset_versions();
nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- &nfsd_sv_ops);
+ &nfsd_thread_sv_ops);
if (nn->nfsd_serv == NULL)
return -ENOMEM;

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 377be73b736f..7429cdcb48b5 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -58,6 +58,9 @@ struct svc_serv_ops {
/* function for service threads to run */
int (*svo_function)(void *);

+ /* queue up a transport for servicing */
+ void (*svo_enqueue_xprt)(struct svc_xprt *);
+
/* optional module to count when adding threads (pooled svcs only) */
struct module *svo_module;
};
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 79f6f8f3dc0a..78512cfe1fe6 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -116,6 +116,7 @@ void svc_xprt_init(struct net *, struct svc_xprt_class *, struct svc_xprt *,
struct svc_serv *);
int svc_create_xprt(struct svc_serv *, const char *, struct net *,
const int, const unsigned short, int);
+void svc_xprt_do_enqueue(struct svc_xprt *xprt);
void svc_xprt_enqueue(struct svc_xprt *xprt);
void svc_xprt_put(struct svc_xprt *xprt);
void svc_xprt_copy_addrs(struct svc_rqst *rqstp, struct svc_xprt *xprt);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index c69358b3cf7f..4954339315b7 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -24,7 +24,6 @@ static int svc_deferred_recv(struct svc_rqst *rqstp);
static struct cache_deferred_req *svc_defer(struct cache_req *req);
static void svc_age_temp_xprts(unsigned long closure);
static void svc_delete_xprt(struct svc_xprt *xprt);
-static void svc_xprt_do_enqueue(struct svc_xprt *xprt);

/* apparently the "standard" is that clients close
* idle connections after 5 minutes, servers after
@@ -226,12 +225,12 @@ static void svc_xprt_received(struct svc_xprt *xprt)
}

/* As soon as we clear busy, the xprt could be closed and
- * 'put', so we need a reference to call svc_xprt_do_enqueue with:
+ * 'put', so we need a reference to call svc_enqueue_xprt with:
*/
svc_xprt_get(xprt);
smp_mb__before_atomic();
clear_bit(XPT_BUSY, &xprt->xpt_flags);
- svc_xprt_do_enqueue(xprt);
+ xprt->xpt_server->sv_ops->svo_enqueue_xprt(xprt);
svc_xprt_put(xprt);
}

@@ -321,7 +320,7 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
return false;
}

-static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
+void svc_xprt_do_enqueue(struct svc_xprt *xprt)
{
struct svc_pool *pool;
struct svc_rqst *rqstp = NULL;
@@ -403,6 +402,7 @@ redo_search:
out:
trace_svc_xprt_do_enqueue(xprt, rqstp);
}
+EXPORT_SYMBOL_GPL(svc_xprt_do_enqueue);

/*
* Queue up a transport with data pending. If there are idle nfsd
@@ -413,7 +413,7 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
{
if (test_bit(XPT_BUSY, &xprt->xpt_flags))
return;
- svc_xprt_do_enqueue(xprt);
+ xprt->xpt_server->sv_ops->svo_enqueue_xprt(xprt);
}
EXPORT_SYMBOL_GPL(svc_xprt_enqueue);

--
2.1.0
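
The effect of patch 04 is that enqueueing a transport becomes a per-service strategy: thread-based services keep `svc_xprt_do_enqueue`, while a workqueue-based service can plug in its own routine. A standalone sketch of that dispatch, using hypothetical simplified structs rather than the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified analogues of svc_serv_ops / svc_xprt. */
struct xprt;

struct serv_ops {
    void (*enqueue_xprt)(struct xprt *);
};

struct xprt {
    const struct serv_ops *ops;  /* xprt->xpt_server->sv_ops in the kernel */
    int busy;
    const char *handled_by;      /* records which strategy ran (for demo) */
};

/* Thread-pool style: hand the xprt to an idle service thread (stubbed). */
static void thread_enqueue(struct xprt *x) { x->handled_by = "thread"; }

/* Workqueue style: queue a work item for the xprt instead (stubbed). */
static void wq_enqueue(struct xprt *x) { x->handled_by = "workqueue"; }

static const struct serv_ops thread_ops = { .enqueue_xprt = thread_enqueue };
static const struct serv_ops wq_ops     = { .enqueue_xprt = wq_enqueue };

/* Mirrors svc_xprt_enqueue(): skip busy transports, then dispatch
 * through the per-service hook. */
static void xprt_enqueue(struct xprt *x)
{
    if (x->busy)
        return;
    x->ops->enqueue_xprt(x);
}
```

Callers never need to know which service model is in use; they always go through the ops pointer, which is exactly what the `svo_enqueue_xprt` conversion buys the later workqueue patches.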

2014-12-10 19:13:35

by Jeff Layton

Subject: [PATCH v2 03/16] sunrpc: move sv_module parm into sv_ops

...not technically an operation, but it's more convenient and cleaner
to pass the module pointer in this struct.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfssvc.c | 3 ++-
include/linux/sunrpc/svc.h | 9 ++++++---
net/sunrpc/svc.c | 8 +++-----
3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 5017af3fe882..94e7e173aa10 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -393,6 +393,7 @@ static int nfsd_get_default_max_blksize(void)
static struct svc_serv_ops nfsd_sv_ops = {
.svo_shutdown = nfsd_last_thread,
.svo_function = nfsd,
+ .svo_module = THIS_MODULE,
};

int nfsd_create_serv(struct net *net)
@@ -409,7 +410,7 @@ int nfsd_create_serv(struct net *net)
nfsd_max_blksize = nfsd_get_default_max_blksize();
nfsd_reset_versions();
nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- &nfsd_sv_ops, THIS_MODULE);
+ &nfsd_sv_ops);
if (nn->nfsd_serv == NULL)
return -ENOMEM;

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 7e27894bc065..377be73b736f 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -54,7 +54,12 @@ struct svc_serv;
struct svc_serv_ops {
/* Callback to use when last thread exits. */
void (*svo_shutdown)(struct svc_serv *, struct net *);
+
+ /* function for service threads to run */
int (*svo_function)(void *);
+
+ /* optional module to count when adding threads (pooled svcs only) */
+ struct module *svo_module;
};

/*
@@ -89,8 +94,6 @@ struct svc_serv {
unsigned int sv_nrpools; /* number of thread pools */
struct svc_pool * sv_pools; /* array of thread pools */
struct svc_serv_ops * sv_ops; /* server operations */
- struct module * sv_module; /* optional module to count when
- * adding threads */
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
struct list_head sv_cb_list; /* queue for callback requests
* that arrive over the same
@@ -430,7 +433,7 @@ struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool, int node);
void svc_exit_thread(struct svc_rqst *);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
- struct svc_serv_ops *, struct module *);
+ struct svc_serv_ops *);
int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
int svc_pool_stats_open(struct svc_serv *serv, struct file *file);
void svc_destroy(struct svc_serv *);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 897c755c6186..9b4b4d8d109a 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -494,7 +494,7 @@ EXPORT_SYMBOL_GPL(svc_create);

struct svc_serv *
svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
- struct svc_serv_ops *ops, struct module *mod)
+ struct svc_serv_ops *ops)
{
struct svc_serv *serv;
unsigned int npools = svc_pool_map_get();
@@ -502,8 +502,6 @@ svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
serv = __svc_create(prog, bufsize, npools, ops);
if (!serv)
goto out_err;
-
- serv->sv_module = mod;
return serv;
out_err:
svc_pool_map_put();
@@ -737,12 +735,12 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
break;
}

- __module_get(serv->sv_module);
+ __module_get(serv->sv_ops->svo_module);
task = kthread_create_on_node(serv->sv_ops->svo_function, rqstp,
node, "%s", serv->sv_name);
if (IS_ERR(task)) {
error = PTR_ERR(task);
- module_put(serv->sv_module);
+ module_put(serv->sv_ops->svo_module);
svc_exit_thread(rqstp);
break;
}
--
2.1.0

2014-12-10 19:13:57

by Jeff Layton

Subject: [PATCH v2 01/16] sunrpc: add a new svc_serv_ops struct and move sv_shutdown into it

In later patches we'll need to abstract out more operations on a
per-service level, besides sv_shutdown and sv_function.

Declare a new svc_serv_ops struct to hold these operations, and move
sv_shutdown into this struct.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 6 +++++-
fs/nfs/callback.c | 5 ++++-
fs/nfsd/nfssvc.c | 6 +++++-
include/linux/sunrpc/svc.h | 20 ++++++++++----------
net/sunrpc/svc.c | 18 +++++++++---------
5 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index e94c887da2d7..d93a8d0a04bb 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -326,6 +326,10 @@ out_rqst:
return error;
}

+static struct svc_serv_ops lockd_sv_ops = {
+ .svo_shutdown = svc_rpcb_cleanup,
+};
+
static struct svc_serv *lockd_create_svc(void)
{
struct svc_serv *serv;
@@ -350,7 +354,7 @@ static struct svc_serv *lockd_create_svc(void)
printk(KERN_WARNING
"lockd_up: no pid, %d users??\n", nlmsvc_users);

- serv = svc_create(&nlmsvc_program, LOCKD_BUFSIZE, svc_rpcb_cleanup);
+ serv = svc_create(&nlmsvc_program, LOCKD_BUFSIZE, &lockd_sv_ops);
if (!serv) {
printk(KERN_WARNING "lockd_up: create service failed\n");
return ERR_PTR(-ENOMEM);
diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index b8fb3a4ef649..ace6c70f807b 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -306,6 +306,9 @@ err_bind:
return ret;
}

+static struct svc_serv_ops nfs_cb_sv_ops = {
+};
+
static struct svc_serv *nfs_callback_create_svc(int minorversion)
{
struct nfs_callback_data *cb_info = &nfs_callback_info[minorversion];
@@ -331,7 +334,7 @@ static struct svc_serv *nfs_callback_create_svc(int minorversion)
printk(KERN_WARNING "nfs_callback_create_svc: no kthread, %d users??\n",
cb_info->users);

- serv = svc_create(&nfs4_callback_program, NFS4_CALLBACK_BUFSIZE, NULL);
+ serv = svc_create(&nfs4_callback_program, NFS4_CALLBACK_BUFSIZE, &nfs_cb_sv_ops);
if (!serv) {
printk(KERN_ERR "nfs_callback_create_svc: create service failed\n");
return ERR_PTR(-ENOMEM);
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 314f5c8f8f1a..cc5dd1227732 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -390,6 +390,10 @@ static int nfsd_get_default_max_blksize(void)
return ret;
}

+static struct svc_serv_ops nfsd_sv_ops = {
+ .svo_shutdown = nfsd_last_thread,
+};
+
int nfsd_create_serv(struct net *net)
{
int error;
@@ -404,7 +408,7 @@ int nfsd_create_serv(struct net *net)
nfsd_max_blksize = nfsd_get_default_max_blksize();
nfsd_reset_versions();
nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- nfsd_last_thread, nfsd, THIS_MODULE);
+ &nfsd_sv_ops, nfsd, THIS_MODULE);
if (nn->nfsd_serv == NULL)
return -ENOMEM;

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 6f22cfeef5e3..35faea625d9f 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -54,6 +54,13 @@ struct svc_pool {
unsigned long sp_flags;
} ____cacheline_aligned_in_smp;

+struct svc_serv;
+
+struct svc_serv_ops {
+ /* Callback to use when last thread exits. */
+ void (*svo_shutdown)(struct svc_serv *serv, struct net *net);
+};
+
/*
* RPC service.
*
@@ -85,13 +92,7 @@ struct svc_serv {

unsigned int sv_nrpools; /* number of thread pools */
struct svc_pool * sv_pools; /* array of thread pools */
-
- void (*sv_shutdown)(struct svc_serv *serv,
- struct net *net);
- /* Callback to use when last thread
- * exits.
- */
-
+ struct svc_serv_ops * sv_ops; /* server operations */
struct module * sv_module; /* optional module to count when
* adding threads */
svc_thread_fn sv_function; /* main function for threads */
@@ -429,13 +430,12 @@ int svc_rpcb_setup(struct svc_serv *serv, struct net *net);
void svc_rpcb_cleanup(struct svc_serv *serv, struct net *net);
int svc_bind(struct svc_serv *serv, struct net *net);
struct svc_serv *svc_create(struct svc_program *, unsigned int,
- void (*shutdown)(struct svc_serv *, struct net *net));
+ struct svc_serv_ops *);
struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool, int node);
void svc_exit_thread(struct svc_rqst *);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
- void (*shutdown)(struct svc_serv *, struct net *net),
- svc_thread_fn, struct module *);
+ struct svc_serv_ops *, svc_thread_fn, struct module *);
int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
int svc_pool_stats_open(struct svc_serv *serv, struct file *file);
void svc_destroy(struct svc_serv *);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 4308881d9d0a..057185e11261 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -423,7 +423,7 @@ EXPORT_SYMBOL_GPL(svc_bind);
*/
static struct svc_serv *
__svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
- void (*shutdown)(struct svc_serv *serv, struct net *net))
+ struct svc_serv_ops *ops)
{
struct svc_serv *serv;
unsigned int vers;
@@ -440,7 +440,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
bufsize = RPCSVC_MAXPAYLOAD;
serv->sv_max_payload = bufsize? bufsize : 4096;
serv->sv_max_mesg = roundup(serv->sv_max_payload + PAGE_SIZE, PAGE_SIZE);
- serv->sv_shutdown = shutdown;
+ serv->sv_ops = ops;
xdrsize = 0;
while (prog) {
prog->pg_lovers = prog->pg_nvers-1;
@@ -486,21 +486,21 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,

struct svc_serv *
svc_create(struct svc_program *prog, unsigned int bufsize,
- void (*shutdown)(struct svc_serv *serv, struct net *net))
+ struct svc_serv_ops *ops)
{
- return __svc_create(prog, bufsize, /*npools*/1, shutdown);
+ return __svc_create(prog, bufsize, /*npools*/1, ops);
}
EXPORT_SYMBOL_GPL(svc_create);

struct svc_serv *
svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
- void (*shutdown)(struct svc_serv *serv, struct net *net),
- svc_thread_fn func, struct module *mod)
+ struct svc_serv_ops *ops, svc_thread_fn func,
+ struct module *mod)
{
struct svc_serv *serv;
unsigned int npools = svc_pool_map_get();

- serv = __svc_create(prog, bufsize, npools, shutdown);
+ serv = __svc_create(prog, bufsize, npools, ops);
if (!serv)
goto out_err;

@@ -517,8 +517,8 @@ void svc_shutdown_net(struct svc_serv *serv, struct net *net)
{
svc_close_net(serv, net);

- if (serv->sv_shutdown)
- serv->sv_shutdown(serv, net);
+ if (serv->sv_ops->svo_shutdown)
+ serv->sv_ops->svo_shutdown(serv, net);
}
EXPORT_SYMBOL_GPL(svc_shutdown_net);

--
2.1.0
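
The `svc_serv_ops` structure introduced above is the standard C ops-table idiom: per-service behavior lives in a struct of function pointers filled in with designated initializers, and callers check optional entries for NULL before dispatching, as `svc_shutdown_net()` does. A self-contained sketch of the pattern, with hypothetical names rather than the kernel structs:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified analogue of svc_serv_ops / svc_serv. */
struct serv;

struct serv_ops {
    void (*shutdown)(struct serv *);    /* optional; may be NULL */
    int  (*function)(void *);           /* thread main, pooled services */
};

struct serv {
    const struct serv_ops *ops;
    int shut_down;
};

static void my_shutdown(struct serv *s) { s->shut_down = 1; }

/* Only the hooks a service needs are set; .function stays NULL here,
 * much like the empty nfs_cb_sv_ops in the patch. */
static const struct serv_ops my_ops = {
    .shutdown = my_shutdown,
};

/* Mirrors svc_shutdown_net(): dispatch only if the hook is present. */
static void shutdown_net(struct serv *s)
{
    if (s->ops->shutdown)
        s->ops->shutdown(s);
}
```

Collecting the callbacks in one struct means later patches can add hooks (`svo_function`, `svo_enqueue_xprt`, `svo_module`) without touching the `svc_create*()` signatures again.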

2014-12-10 22:32:14

by Chuck Lever

Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

Hi Jeff-

On Dec 10, 2014, at 11:07 AM, Jeff Layton <[email protected]> wrote:

> This is the second revision of the workqueue-based nfsd. Changes from
> the RFC version I posted earlier:
>
> v2:
> - found and fixed problem causing the delays between work. It was a bug
> in the design of the new code. I was queueing the svc_xprt's work to
> do everything and that necessarily serialized that work. This version
> adds a work_struct to the svc_rqst as well, and that's where the
> bulk of the work is now done.
>
> - no longer tries to use the thread count as a max_active setting for
> the workqueue. The workqueue max_active settings are left to the
> default. This means that /proc/fs/nfsd/threads is basically a binary
> switch that's either zero (not running) or non-zero (running). The
> actual value has no meaning.
>
> - the handling of the fs_struct has been reworked. It's now allocated
> as part of the struct svc_rqst. Each svc_rqst has its own fs_struct
> now.
>
> - CONFIG_SUNRPC_SVC_WORKQUEUE has been removed. Since the code must
> be enabled at runtime anyway, we don't need a switch to disable this
> code at build time.
>
> - the new tracepoints have been reworked.
>
> This new code is still a little slower (2-3%) than the older, thread
> based code in my testing, though my hope is that it may perform better
> than the older code on a NUMA machine. I don't have one at the moment
> to verify that, however.
>
> For background, here's an excerpt from the original posting:
>
> This patchset is a little skunkworks project that I've been poking at
> for the last few weeks. Currently nfsd uses a dedicated thread pool to
> handle RPCs, but that requires maintaining a rather large swath of
> "fiddly" code to handle the threads and transports.
>
> This patchset represents an alternative approach, which makes nfsd use
> workqueues to do its bidding rather than a dedicated thread pool. When a
> transport needs to do work, we simply queue it to the workqueue in
> softirq context and let it service the transport.
>
> Performance:
> ------------
> As far as numbers go, I ran a modified version of the fio
> tiobench-example.fio test that just reduces the file size to 128M:
>
> threads:
> ----------------------------------------------------------------------
> Run status group 0 (all jobs):
> WRITE: io=13740KB, aggrb=228KB/s, minb=57KB/s, maxb=57KB/s,
> mint=60034msec, maxt=60049msec
>
> Run status group 1 (all jobs):
> WRITE: io=14652KB, aggrb=244KB/s, minb=60KB/s, maxb=61KB/s,
> mint=60002msec, maxt=60028msec
>
> Run status group 2 (all jobs):
> READ: io=524288KB, aggrb=49639KB/s, minb=12409KB/s, maxb=12433KB/s,
> mint=10542msec, maxt=10562msec
>
> Run status group 3 (all jobs):
> READ: io=524288KB, aggrb=50670KB/s, minb=12667KB/s, maxb=12687KB/s,
> mint=10331msec, maxt=10347msec
>
> workqueue:
> ----------------------------------------------------------------------
> Run status group 0 (all jobs):
> WRITE: io=13632KB, aggrb=226KB/s, minb=56KB/s, maxb=56KB/s,
> mint=60010msec, maxt=60061msec
>
> Run status group 1 (all jobs):
> WRITE: io=14328KB, aggrb=238KB/s, minb=59KB/s, maxb=59KB/s,
> mint=60023msec, maxt=60033msec
>
> Run status group 2 (all jobs):
> READ: io=524288KB, aggrb=47779KB/s, minb=11944KB/s, maxb=11955KB/s,
> mint=10963msec, maxt=10973msec
>
> Run status group 3 (all jobs):
> READ: io=524288KB, aggrb=48379KB/s, minb=12094KB/s, maxb=12114KB/s,
> mint=10819msec, maxt=10837msec
>
> ...a small performance decrease using workqueues (about 5% in the worst
> case). It varies a little between runs but the results over several runs
> are fairly consistent in showing a small perf decrease.
>
> In an attempt to ascertain where that comes from, I added some
> tracepoints and wrote a perl script to scrape the results and figure out
> where the latency comes from. The numbers here are from the ones that
> produced the above results:
>
> [jlayton@palma nfsd-wq]$ ./analyze.pl < threads.txt
> -------------------------------------------------------------
> rpcs=272437
> queue=3.79952429455322
> queued=10.0558330900775
> receive=10.6196625266701
> process=2325.8173229043
> total=2350.2923428156
>
> [jlayton@palma nfsd-wq]$ ./analyze.pl < workqueue.txt
> -------------------------------------------------------------
> rpcs=272329
> queue=4.41979737859012
> queued=8.124893049967
> receive=11.3421082590608
> process=2310.15945786277
> total=2334.04625655039
>
> Here's a legend, the numbers represent a delta between two
> tracepoints for a particular xprt or rqst. I can provide the script if
> it's helpful but it's definitely hacked together:
>
> rpcs: total number of rpcs processed (just a count)
> queue: time it took to queue the xprt
> trace_svc_xprt_enqueued - trace_svc_xprt_enqueue (in usecs)
>
> queued: time between being queued and going "active"
> trace_svc_xprt_active - trace_svc_xprt_enqueued (in usecs)
>
> receive: time between going "active" and completing the receive
> trace_svc_xprt_received - trace_svc_xprt_active (in usecs)
>
> process: time between completing the receive and finishing processing
> trace_svc_process - trace_svc_xprt_received (in usecs)
>
> total: total time from enqueueing to finishing the processing
> trace_svc_process - trace_svc_xprt_enqueued (in usecs)
>
> The interesting bit is that according to these numbers, the workqueue
> RPCs ran more quickly on average (though most of that is in svc_process,
> for which I have zero explanation). So, why are we getting slower
> results?
>
> One theory is that it's getting bitten by the fact that the workqueue
> queueing/dequeueing code uses spinlocks that disable bottom halves.
> Perhaps that adds up and causes more latency to softirq processing? I'm
> not sure how best to nail that down...
>
> It's probably worth it to start considering this for v3.20, but I'd like
> to get some time on a larger scale test rig to see how it does first.

I think there needs to be a guarantee that this does not break
badly with NFS/RDMA, with xfs / btrfs / ext4, or with 40 and
56GbE. (Yes, we are volunteering to help with some of that).

How does it scale in number of individual transport sockets?

Does the new scheduling model use more or less CPU per byte
and per RPC?

Also, running these performance tests with a CPU soaker running
on the NFS server might be interesting. Something like a large
software build. That could better expose issues with workqueue
scheduling.

That feels like more work than any of us can do before 3.20.


> We'll also need Al's ACK on the fs_struct stuff.
>
> Thoughts or comments are welcome.
>
> Jeff Layton (16):
> sunrpc: add a new svc_serv_ops struct and move sv_shutdown into it
> sunrpc: move sv_function into sv_ops
> sunrpc: move sv_module parm into sv_ops
> sunrpc: turn enqueueing a svc_xprt into a svc_serv operation
> sunrpc: abstract out svc_set_num_threads to sv_ops
> sunrpc: move pool_mode definitions into svc.h
> sunrpc: factor svc_rqst allocation and freeing from sv_nrthreads
> refcounting
> sunrpc: set up workqueue function in svc_xprt
> sunrpc: set up svc_rqst work if it's defined
> sunrpc: add basic support for workqueue-based services
> nfsd: keep a reference to the fs_struct in svc_rqst
> nfsd: add support for workqueue based service processing
> sunrpc: keep a cache of svc_rqsts for each NUMA node
> sunrpc: add more tracepoints around svc_xprt handling
> sunrpc: print the svc_rqst pointer value in svc_process tracepoint
> sunrpc: add tracepoints around svc_sock handling
>
> fs/fs_struct.c | 60 ++++++--
> fs/lockd/svc.c | 7 +-
> fs/nfs/callback.c | 6 +-
> fs/nfsd/nfssvc.c | 100 +++++++++++--
> include/linux/fs_struct.h | 4 +
> include/linux/sunrpc/svc.h | 93 ++++++++++---
> include/linux/sunrpc/svc_xprt.h | 3 +
> include/linux/sunrpc/svcsock.h | 1 +
> include/trace/events/sunrpc.h | 88 ++++++++++--
> net/sunrpc/Makefile | 2 +-
> net/sunrpc/svc.c | 141 +++++++++++--------
> net/sunrpc/svc_wq.c | 302 ++++++++++++++++++++++++++++++++++++++++
> net/sunrpc/svc_xprt.c | 68 ++++++++-
> net/sunrpc/svcsock.c | 6 +
> 14 files changed, 758 insertions(+), 123 deletions(-)
> create mode 100644 net/sunrpc/svc_wq.c
>
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
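
The latency legend quoted above pairs timestamps from two tracepoints per stage and reports the delta. The actual analyze.pl script was not posted; the following is a rough sketch of the pairing logic for a single RPC's event stream, with a hypothetical event format:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical trace record; the real trace format differs. This only
 * illustrates how the per-stage deltas in the posting are computed. */
struct event {
    double ts;            /* timestamp in usecs */
    const char *name;     /* tracepoint name */
};

/* Stage boundaries taken from the legend in the posting. */
static const struct { const char *label, *start, *end; } stages[] = {
    { "queue",   "svc_xprt_enqueue",  "svc_xprt_enqueued" },
    { "queued",  "svc_xprt_enqueued", "svc_xprt_active"   },
    { "receive", "svc_xprt_active",   "svc_xprt_received" },
    { "process", "svc_xprt_received", "svc_process"       },
    { "total",   "svc_xprt_enqueued", "svc_process"       },
};

/* Compute the five stage deltas for one RPC's events (one xprt, in
 * time order); results land in out[0..4], 0 if a stage is missing. */
static void analyze_one(const struct event *ev, int n, double *out)
{
    int i, j;

    for (i = 0; i < 5; i++) {
        double start = -1, end = -1;

        for (j = 0; j < n; j++) {
            if (!strcmp(ev[j].name, stages[i].start))
                start = ev[j].ts;
            if (!strcmp(ev[j].name, stages[i].end))
                end = ev[j].ts;
        }
        out[i] = (start >= 0 && end >= 0) ? end - start : 0;
    }
}
```

The reported numbers are these deltas averaged over all RPCs in the trace, which is why "process" dominates "total" in both the threads and workqueue runs.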


2014-12-10 23:13:58

by Jeff Layton

Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, 10 Dec 2014 14:31:55 -0800
Chuck Lever <[email protected]> wrote:

> Hi Jeff-
>
> On Dec 10, 2014, at 11:07 AM, Jeff Layton <[email protected]> wrote:
>
> > This is the second revision of the workqueue-based nfsd. Changes from
> > the RFC version I posted earlier:
> >
> > v2:
> > - found and fixed problem causing the delays between work. It was a bug
> > in the design of the new code. I was queueing the svc_xprt's work to
> > do everything and that necessarily serialized that work. This version
> > adds a work_struct to the svc_rqst as well, and that's where the
> > bulk of the work.
> >
> > - no longer tries to use the thread count as a max_active setting for
> > the workqueue. The workqueue max_active settings are left to the
> > default. This means that /proc/fs/nfsd/threads is basically a binary
> > switch that's either zero (not running) or non-zero (running). The
> > actual value has no meaning.
> >
> > - the handling of the fs_struct has been reworked. It's now allocated
> > as part of the struct svc_rqst. Each svc_rqst has its own fs_struct
> > now.
> >
> > - CONFIG_SUNRPC_SVC_WORKQUEUE has been removed. Since the code must
> > be enabled at runtime anyway, we don't need a switch to disable this
> > code at build time.
> >
> > - the new tracepoints have been reworked.
> >
> > This new code is still a little slower (2-3%) than the older, thread
> > based code in my testing, though my hope is that it may perform better
> > than the older code on a NUMA machine. I don't have one at the moment
> > to verify that, however.
> >
> > For background, here's an excerpt from the original posting:
> >
> > This patchset is a little skunkworks project that I've been poking at
> > for the last few weeks. Currently nfsd uses a dedicated thread pool to
> > handle RPCs, but that requires maintaining a rather large swath of
> > "fiddly" code to handle the threads and transports.
> >
> > This patchset represents an alternative approach, which makes nfsd use
> > workqueues to do its bidding rather than a dedicated thread pool. When a
> > transport needs to do work, we simply queue it to the workqueue in
> > softirq context and let it service the transport.
> >
> > Performance:
> > ------------
> > As far as numbers go, I ran a modified version of the fio
> > tiobench-example.fio test that just reduces the file size to 128M:
> >
> > threads:
> > ----------------------------------------------------------------------
> > Run status group 0 (all jobs):
> > WRITE: io=13740KB, aggrb=228KB/s, minb=57KB/s, maxb=57KB/s,
> > mint=60034msec, maxt=60049msec
> >
> > Run status group 1 (all jobs):
> > WRITE: io=14652KB, aggrb=244KB/s, minb=60KB/s, maxb=61KB/s,
> > mint=60002msec, maxt=60028msec
> >
> > Run status group 2 (all jobs):
> > READ: io=524288KB, aggrb=49639KB/s, minb=12409KB/s, maxb=12433KB/s,
> > mint=10542msec, maxt=10562msec
> >
> > Run status group 3 (all jobs):
> > READ: io=524288KB, aggrb=50670KB/s, minb=12667KB/s, maxb=12687KB/s,
> > mint=10331msec, maxt=10347msec
> >
> > workqueue:
> > ----------------------------------------------------------------------
> > Run status group 0 (all jobs):
> > WRITE: io=13632KB, aggrb=226KB/s, minb=56KB/s, maxb=56KB/s,
> > mint=60010msec, maxt=60061msec
> >
> > Run status group 1 (all jobs):
> > WRITE: io=14328KB, aggrb=238KB/s, minb=59KB/s, maxb=59KB/s,
> > mint=60023msec, maxt=60033msec
> >
> > Run status group 2 (all jobs):
> > READ: io=524288KB, aggrb=47779KB/s, minb=11944KB/s, maxb=11955KB/s,
> > mint=10963msec, maxt=10973msec
> >
> > Run status group 3 (all jobs):
> > READ: io=524288KB, aggrb=48379KB/s, minb=12094KB/s, maxb=12114KB/s,
> > mint=10819msec, maxt=10837msec
> >
> > ...a small performance decrease using workqueues (about 5% in the worst
> > case). It varies a little between runs but the results over several runs
> > are fairly consistent in showing a small perf decrease.
> >
> > In an attempt to ascertain where that comes from, I added some
> > tracepoints and wrote a perl script to scrape the results and figure out
> > where the latency comes from. The numbers here are from the ones that
> > produced the above results:
> >
> > [jlayton@palma nfsd-wq]$ ./analyze.pl < threads.txt
> > -------------------------------------------------------------
> > rpcs=272437
> > queue=3.79952429455322
> > queued=10.0558330900775
> > receive=10.6196625266701
> > process=2325.8173229043
> > total=2350.2923428156
> >
> > [jlayton@palma nfsd-wq]$ ./analyze.pl < workqueue.txt
> > -------------------------------------------------------------
> > rpcs=272329
> > queue=4.41979737859012
> > queued=8.124893049967
> > receive=11.3421082590608
> > process=2310.15945786277
> > total=2334.04625655039
> >
> > Here's a legend, the numbers represent a delta between two
> > tracepoints for a particular xprt or rqst. I can provide the script if
> > it's helpful but it's definitely hacked together:
> >
> > rpcs: total number of rpcs processed (just a count)
> > queue: time it took to queue the xprt.xi
> > trace_svc_xprt_enqueued - trace_svc_xprt_enqueue (in usecs)
> >
> > queued: time between being queued and going "active"
> > trace_svc_xprt_active - trace_svc_xprt_enqueued (in usecs)
> >
> > receive: time between going "active" and completing the receive
> > trace_svc_xprt_received - trace_svc_xprt_active (in usecs)
> >
> > process: time between completing the receive and finishing processing
> > trace_svc_process - trace_svc_xprt_received (in usecs)
> >
> > total: total time from enqueueing to finishing the processing
> > trace_svc_process - trace_svc_xprt_enqueued (in usecs)
> >
> > The interesting bit is that according to these numbers, the workqueue
> > RPCs ran more quickly on average (though most of that is in svc_process,
> > for which I have zero explanation). So, why are we getting slower
> > results?
> >
> > One theory is that it's getting bitten by the fact that the workqueue
> > queueing/dequeueing code uses spinlocks that disable bottom halves.
> > Perhaps that adds up and causes more latency to softirq processing? I'm
> > not sure how best to nail that down...
> >
> > It's probably worth it to start considering this for v3.20, but I'd like
> > to get some time on a larger scale test rig to see how it does first.
>
> I think there needs to be a guarantee that this does not break
> badly with NFS/RDMA, with xfs / btrfs / ext4, or with 40 and
> 56GbE. (Yes, we are volunteering to help with some of that).
>

That would be helpful. I wouldn't expect any breakage with any of that
given where I added the operations, but you never know.

> How does it scale in number of individual transport sockets?
>

I'd expect that it'll do well given that we use RCU when searching for
a svc_rqst, but I don't have a test rig that has a ton of clients so I
can't guarantee that.

> Does the new scheduling model use more or less CPU per byte
> and per RPC?
>

Good question. My expectation would be that it wouldn't make much
difference, but again I don't know for sure.

I'd also very much like to account for the drop in throughput I'm seeing
with this code, but I'm not yet sure of the best way to chase that down.

> Also, running these performance tests with a CPU soaker running
> on the NFS server might be interesting. Something like a large
> software build. That could better expose issues with workqueue
> scheduling.
>
> That feels like more work than any of us can do before 3.20.
>

Quite likely.

v3.20 isn't a firm goal. Heck, I may drop this altogether if it turns
out to be slower and we can't figure out any way to make it better.

A primary goal here is to avoid breaking the existing code (of course).
One thing I'm really trying to do is avoid substantially altering the
code in the case where you don't opt to use workqueues.

I'm adding some function pointers that will need to be chased, of course,
but I don't think that'll add much overhead in the context of all of
the other stuff you have to do to handle an RPC.

>
> > We'll also need Al's ACK on the fs_struct stuff.
> >
> > Thoughts or comments are welcome.
> >
> > Jeff Layton (16):
> > sunrpc: add a new svc_serv_ops struct and move sv_shutdown into it
> > sunrpc: move sv_function into sv_ops
> > sunrpc: move sv_module parm into sv_ops
> > sunrpc: turn enqueueing a svc_xprt into a svc_serv operation
> > sunrpc: abstract out svc_set_num_threads to sv_ops
> > sunrpc: move pool_mode definitions into svc.h
> > sunrpc: factor svc_rqst allocation and freeing from sv_nrthreads
> > refcounting
> > sunrpc: set up workqueue function in svc_xprt
> > sunrpc: set up svc_rqst work if it's defined
> > sunrpc: add basic support for workqueue-based services
> > nfsd: keep a reference to the fs_struct in svc_rqst
> > nfsd: add support for workqueue based service processing
> > sunrpc: keep a cache of svc_rqsts for each NUMA node
> > sunrpc: add more tracepoints around svc_xprt handling
> > sunrpc: print the svc_rqst pointer value in svc_process tracepoint
> > sunrpc: add tracepoints around svc_sock handling
> >
> > fs/fs_struct.c | 60 ++++++--
> > fs/lockd/svc.c | 7 +-
> > fs/nfs/callback.c | 6 +-
> > fs/nfsd/nfssvc.c | 100 +++++++++++--
> > include/linux/fs_struct.h | 4 +
> > include/linux/sunrpc/svc.h | 93 ++++++++++---
> > include/linux/sunrpc/svc_xprt.h | 3 +
> > include/linux/sunrpc/svcsock.h | 1 +
> > include/trace/events/sunrpc.h | 88 ++++++++++--
> > net/sunrpc/Makefile | 2 +-
> > net/sunrpc/svc.c | 141 +++++++++++--------
> > net/sunrpc/svc_wq.c | 302 ++++++++++++++++++++++++++++++++++++++++
> > net/sunrpc/svc_xprt.c | 68 ++++++++-
> > net/sunrpc/svcsock.c | 6 +
> > 14 files changed, 758 insertions(+), 123 deletions(-)
> > create mode 100644 net/sunrpc/svc_wq.c
> >
> > --
> > 2.1.0
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> > linux-nfs" in the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>


--
Jeff Layton <[email protected]>

2014-12-12 02:12:50

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, Dec 10, 2014 at 02:07:44PM -0500, Jeff Layton wrote:

> We'll also need Al's ACK on the fs_struct stuff.

... and I'm not happy with it. First of all, ditch those EXPORT_SYMBOL_GPL();
if it's too low-level for general export (and many of those are), tacking
_GPL on it doesn't make it any better. unshare_fs_struct() misbegotten export
is a historical accident - somebody (agruen, perhaps?) slapped a _GPL export
on its precursors and I had taken a lazy approach back then. Shouldn't have
done that... Please, don't add more of that shit.

More to the point, though, it *is* far too low-level, and for no visible
reason. AFAICS, what you want to achieve is zero umask in that fucker.
TBH, I really wonder if any of that is needed at all. Why do we want kernel
threads to get umask shared with init(8), anyway? It's very easy to change -
all it takes is
* make init_fs.umask zero
* make kernel_init cloned without CLONE_FS and have it immediately
set its ->fs->umask to 0022
* make ____call_usermodehelper() (always called very early in the
life of a thread that had been cloned without CLONE_FS) do the same thing.
Voila - all kernel threads have umask 0 and it's not affected by whatever
init(8) might be pulling off. And that includes all workqueue workers, etc.

With that I'm not sure we need to have unshare_fs_struct() at all; at least
not unless some thread wants to play with chdir() and chroot() and
I don't see anything of that sort in nfsd and lustre (the only callers of
unshare_fs_struct() in the tree). Note that set_fs_pwd() and set_fs_root()
are *not* exported, and neither are sys_{chdir,fchdir,chroot}, so nfsd and
lustre would have a hard time trying to do something of that sort anyway.
There is open-coded crap trying to implement chdir in lustre_compat25.h, but
it has no callers...

Linus, do you see any problems with the following patch (against the mainline)?
If not, I'll put it into the next vfs.git pull request, along with removal of
all mentions of ->fs-> in lustre (aside from the aforementioned dead code,
there's an open-coded current_umask() in one place and a broken attempt to
reimplement dentry_path() in another).

Signed-off-by: Al Viro <[email protected]>
---
diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs.h b/drivers/staging/lustre/include/linux/libcfs/libcfs.h
index a6b2f90..e097489 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs.h
@@ -136,7 +136,6 @@ void cfs_enter_debugger(void);
/*
* Defined by platform
*/
-int unshare_fs_struct(void);
sigset_t cfs_get_blocked_sigs(void);
sigset_t cfs_block_allsigs(void);
sigset_t cfs_block_sigs(unsigned long sigs);
diff --git a/drivers/staging/lustre/lustre/obdclass/genops.c b/drivers/staging/lustre/lustre/obdclass/genops.c
index c314e9c..53876f9 100644
--- a/drivers/staging/lustre/lustre/obdclass/genops.c
+++ b/drivers/staging/lustre/lustre/obdclass/genops.c
@@ -1713,7 +1713,6 @@ EXPORT_SYMBOL(obd_zombie_barrier);
*/
static int obd_zombie_impexp_thread(void *unused)
{
- unshare_fs_struct();
complete(&obd_zombie_start);

obd_zombie_pid = current_pid();
diff --git a/drivers/staging/lustre/lustre/obdclass/llog.c b/drivers/staging/lustre/lustre/obdclass/llog.c
index 3ab0529..6130b23 100644
--- a/drivers/staging/lustre/lustre/obdclass/llog.c
+++ b/drivers/staging/lustre/lustre/obdclass/llog.c
@@ -411,8 +411,6 @@ static int llog_process_thread_daemonize(void *arg)
struct lu_env env;
int rc;

- unshare_fs_struct();
-
/* client env has no keys, tags is just 0 */
rc = lu_env_init(&env, LCT_LOCAL | LCT_MG_THREAD);
if (rc)
diff --git a/drivers/staging/lustre/lustre/ptlrpc/import.c b/drivers/staging/lustre/lustre/ptlrpc/import.c
index 2e7e717..d395e06 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/import.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/import.c
@@ -1290,8 +1290,6 @@ static int ptlrpc_invalidate_import_thread(void *data)
{
struct obd_import *imp = data;

- unshare_fs_struct();
-
CDEBUG(D_HA, "thread invalidate import %s to %s@%s\n",
imp->imp_obd->obd_name, obd2cli_tgt(imp->imp_obd),
imp->imp_connection->c_remote_uuid.uuid);
diff --git a/drivers/staging/lustre/lustre/ptlrpc/pinger.c b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
index 20341b2..9f426ea 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/pinger.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
@@ -586,8 +586,6 @@ static int ping_evictor_main(void *arg)
struct l_wait_info lwi = { 0 };
time_t expire_time;

- unshare_fs_struct();
-
CDEBUG(D_HA, "Starting Ping Evictor\n");
pet_state = PET_READY;
while (1) {
diff --git a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
index 357ea9f..a2a1574 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
@@ -382,7 +382,6 @@ static int ptlrpcd(void *arg)
struct lu_env env = { .le_ses = NULL };
int rc, exit = 0;

- unshare_fs_struct();
#if defined(CONFIG_SMP)
if (test_bit(LIOD_BIND, &pc->pc_flags)) {
int index = pc->pc_index;
diff --git a/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c b/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c
index c500aff..9e33781 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c
@@ -164,8 +164,6 @@ static int sec_gc_main(void *arg)
struct ptlrpc_thread *thread = (struct ptlrpc_thread *) arg;
struct l_wait_info lwi;

- unshare_fs_struct();
-
/* Record that the thread is running */
thread_set_flags(thread, SVC_RUNNING);
wake_up(&thread->t_ctl_waitq);
diff --git a/drivers/staging/lustre/lustre/ptlrpc/service.c b/drivers/staging/lustre/lustre/ptlrpc/service.c
index a8df8a7..149c65c 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/service.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/service.c
@@ -2277,7 +2277,6 @@ static int ptlrpc_main(void *arg)
int counter = 0, rc = 0;

thread->t_pid = current_pid();
- unshare_fs_struct();

/* NB: we will call cfs_cpt_bind() for all threads, because we
* might want to run lustre server only on a subset of system CPUs,
@@ -2478,7 +2477,6 @@ static int ptlrpc_hr_main(void *arg)

snprintf(threadname, sizeof(threadname), "ptlrpc_hr%02d_%03d",
hrp->hrp_cpt, hrt->hrt_id);
- unshare_fs_struct();

rc = cfs_cpt_bind(ptlrpc_hr.hr_cpt_table, hrp->hrp_cpt);
if (rc != 0) {
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 7dca743..401fd2e 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -128,29 +128,6 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
return fs;
}

-int unshare_fs_struct(void)
-{
- struct fs_struct *fs = current->fs;
- struct fs_struct *new_fs = copy_fs_struct(fs);
- int kill;
-
- if (!new_fs)
- return -ENOMEM;
-
- task_lock(current);
- spin_lock(&fs->lock);
- kill = !--fs->users;
- current->fs = new_fs;
- spin_unlock(&fs->lock);
- task_unlock(current);
-
- if (kill)
- free_fs_struct(fs);
-
- return 0;
-}
-EXPORT_SYMBOL_GPL(unshare_fs_struct);
-
int current_umask(void)
{
return current->fs->umask;
@@ -162,5 +139,5 @@ struct fs_struct init_fs = {
.users = 1,
.lock = __SPIN_LOCK_UNLOCKED(init_fs.lock),
.seq = SEQCNT_ZERO(init_fs.seq),
- .umask = 0022,
+ .umask = 0,
};
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 752d56b..5c03928 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -573,16 +573,6 @@ nfsd(void *vrqstp)
/* Lock module and set up kernel thread */
mutex_lock(&nfsd_mutex);

- /* At this point, the thread shares current->fs
- * with the init process. We need to create files with a
- * umask of 0 instead of init's umask. */
- if (unshare_fs_struct() < 0) {
- printk("Unable to start nfsd thread: out of memory\n");
- goto out;
- }
-
- current->fs->umask = 0;
-
/*
* thread is spawned with all signals set to SIG_IGN, re-enable
* the ones that will bring down the thread
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 0efc3e6..18d369c 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -21,7 +21,6 @@ extern void set_fs_root(struct fs_struct *, const struct path *);
extern void set_fs_pwd(struct fs_struct *, const struct path *);
extern struct fs_struct *copy_fs_struct(struct fs_struct *);
extern void free_fs_struct(struct fs_struct *);
-extern int unshare_fs_struct(void);

static inline void get_fs_root(struct fs_struct *fs, struct path *root)
{
diff --git a/init/main.c b/init/main.c
index ca380ec..d1a4acf 100644
--- a/init/main.c
+++ b/init/main.c
@@ -399,7 +399,7 @@ static noinline void __init_refok rest_init(void)
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
- kernel_thread(kernel_init, NULL, CLONE_FS);
+ kernel_thread(kernel_init, NULL, 0);
numa_default_policy();
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
rcu_read_lock();
@@ -924,6 +924,7 @@ static int __ref kernel_init(void *unused)
{
int ret;

+ current->fs->umask = 0022;
kernel_init_freeable();
/* need to finish all async __init code before freeing the memory */
async_synchronize_full();
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 2777f40..0686840 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -219,6 +219,7 @@ static int ____call_usermodehelper(void *data)
struct cred *new;
int retval;

+ current->fs->umask = 0022;
spin_lock_irq(&current->sighand->siglock);
flush_signal_handlers(current, 1);
spin_unlock_irq(&current->sighand->siglock);

2014-12-12 02:29:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

On Thu, Dec 11, 2014 at 6:12 PM, Al Viro <[email protected]> wrote:
>
> Linus, do you see any problems with the following patch (against the mainline)?

Not conceptually, but create_kthread() uses CLONE_FS, and I don't
think it's just umask that things like nfsd want to avoid sharing.
What about all the *other* fields?

Just as an example: even if all the threads actually end up all having
the same global root, what about contention on 'fs->lock'?

I have *not* looked at the details, and maybe there's some reason I'm
completely off, but it worries me.

Linus

2014-12-12 02:52:56

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

On Fri, Dec 12, 2014 at 02:12:41AM +0000, Al Viro wrote:
> On Wed, Dec 10, 2014 at 02:07:44PM -0500, Jeff Layton wrote:
>
> > We'll also need Al's ACK on the fs_struct stuff.
>
> ... and I'm not happy with it. First of all, ditch those EXPORT_SYMBOL_GPL();
> if it's too low-level for general export (and many of those are), tacking
> _GPL on it doesn't make it any better. unshare_fs_struct() misbegotten export
> is a historical accident - somebody (agruen, perhaps?) slapped a _GPL export
> on its precursors and I had taken a lazy approach back then. Shouldn't have
> done that... Please, don't add more of that shit.
>
> More to the point, though, it *is* far too low-level, and for no visible
> reason. AFAICS, what you want to achieve is zero umask in that fucker.
> TBH, I really wonder if any of that is needed at all. Why do we want kernel
> threads to get umask shared with init(8), anyway? It's very easy to change -
> all it takes is
> * make init_fs.umask zero
> * make kernel_init cloned without CLONE_FS and have it immediately
> set its ->fs->umask to 0022
> * make ____call_usermodehelper() (always called very early in the
> life of a thread that had been cloned without CLONE_FS) do the same thing.
> Voila - all kernel threads have umask 0 and it's not affected by whatever
> init(8) might be pulling off. And that includes all workqueue workers, etc.
>
> With that I'm not sure we need to have unshare_fs_struct() at all; at least
> not unless some thread wants to play with chdir() and chroot() and
> I don't see anything of that sort in nfsd and lustre (the only callers of
> unshare_fs_struct() in the tree). Note that set_fs_pwd() and set_fs_root()
> are *not* exported, and neither are sys_{chdir,fchdir,chroot}, so nfsd and
> lustre would have a hard time trying to do something of that sort anyway.
> There is open-coded crap trying to implement chdir in lustre_compat25.h, but
> it has no callers...
>
> Linus, do you see any problems with the following patch (against the mainline)?
> If not, I'll put it into the next vfs.git pull request, along with removal of
> all mentions of ->fs-> in lustre (aside from the aforementioned dead code,
> there's an open-coded current_umask() in one place and a broken attempt to
> reimplement dentry_path() in another).

Grr... With includes of <linux/fs_struct.h> added in init/main.c and
kernel/kmod.c. Sorry. That way it builds and, AFAICT, works... I think
it ought to be 3 commits -
* giving PID 1 an fs_struct of its own, making umask for all kernel
threads zero, while keeping the same value (0022) for PID 1 and all userland
processes spawned by call_usermodehelper().
* removal of unshare_fs_struct() - it becomes unneeded after the
previous commit
* removal of assorted junk in lustre.

All three combined yield this:

.../staging/lustre/include/linux/libcfs/libcfs.h | 1 -
.../lustre/lustre/include/linux/lustre_compat25.h | 24 ---------------------
drivers/staging/lustre/lustre/llite/dir.c | 2 +-
drivers/staging/lustre/lustre/llite/llite_lib.c | 17 +--------------
drivers/staging/lustre/lustre/obdclass/genops.c | 1 -
drivers/staging/lustre/lustre/obdclass/llog.c | 2 --
drivers/staging/lustre/lustre/ptlrpc/import.c | 2 --
drivers/staging/lustre/lustre/ptlrpc/pinger.c | 2 --
drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c | 1 -
drivers/staging/lustre/lustre/ptlrpc/sec_gc.c | 2 --
drivers/staging/lustre/lustre/ptlrpc/service.c | 2 --
fs/fs_struct.c | 25 +---------------------
fs/nfsd/nfssvc.c | 11 ----------
include/linux/fs_struct.h | 1 -
init/main.c | 4 +++-
kernel/kmod.c | 2 ++
16 files changed, 8 insertions(+), 91 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs.h b/drivers/staging/lustre/include/linux/libcfs/libcfs.h
index a6b2f90..e097489 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs.h
@@ -136,7 +136,6 @@ void cfs_enter_debugger(void);
/*
* Defined by platform
*/
-int unshare_fs_struct(void);
sigset_t cfs_get_blocked_sigs(void);
sigset_t cfs_block_allsigs(void);
sigset_t cfs_block_sigs(unsigned long sigs);
diff --git a/drivers/staging/lustre/lustre/include/linux/lustre_compat25.h b/drivers/staging/lustre/lustre/include/linux/lustre_compat25.h
index e94ab34..f10e061 100644
--- a/drivers/staging/lustre/lustre/include/linux/lustre_compat25.h
+++ b/drivers/staging/lustre/lustre/include/linux/lustre_compat25.h
@@ -42,28 +42,6 @@

#include "lustre_patchless_compat.h"

-# define LOCK_FS_STRUCT(fs) spin_lock(&(fs)->lock)
-# define UNLOCK_FS_STRUCT(fs) spin_unlock(&(fs)->lock)
-
-static inline void ll_set_fs_pwd(struct fs_struct *fs, struct vfsmount *mnt,
- struct dentry *dentry)
-{
- struct path path;
- struct path old_pwd;
-
- path.mnt = mnt;
- path.dentry = dentry;
- LOCK_FS_STRUCT(fs);
- old_pwd = fs->pwd;
- path_get(&path);
- fs->pwd = path;
- UNLOCK_FS_STRUCT(fs);
-
- if (old_pwd.dentry)
- path_put(&old_pwd);
-}
-
-
/*
* set ATTR_BLOCKS to a high value to avoid any risk of collision with other
* ATTR_* attributes (see bug 13828)
@@ -112,8 +90,6 @@ static inline void ll_set_fs_pwd(struct fs_struct *fs, struct vfsmount *mnt,
#define cfs_bio_io_error(a, b) bio_io_error((a))
#define cfs_bio_endio(a, b, c) bio_endio((a), (c))

-#define cfs_fs_pwd(fs) ((fs)->pwd.dentry)
-#define cfs_fs_mnt(fs) ((fs)->pwd.mnt)
#define cfs_path_put(nd) path_put(&(nd)->path)


diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
index a79fd65..fa40474 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -661,7 +661,7 @@ int ll_dir_setdirstripe(struct inode *dir, struct lmv_user_md *lump,
int mode;
int err;

- mode = (0755 & (S_IRWXUGO|S_ISVTX) & ~current->fs->umask) | S_IFDIR;
+ mode = (0755 & ~current_umask()) | S_IFDIR;
op_data = ll_prep_md_op_data(NULL, dir, NULL, filename,
strlen(filename), mode, LUSTRE_OPC_MKDIR,
lump);
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index 7b6b9e2..accba4f 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -2396,21 +2396,6 @@ char *ll_get_fsname(struct super_block *sb, char *buf, int buflen)
return buf;
}

-static char* ll_d_path(struct dentry *dentry, char *buf, int bufsize)
-{
- char *path = NULL;
-
- struct path p;
-
- p.dentry = dentry;
- p.mnt = current->fs->root.mnt;
- path_get(&p);
- path = d_path(&p, buf, bufsize);
- path_put(&p);
-
- return path;
-}
-
void ll_dirty_page_discard_warn(struct page *page, int ioret)
{
char *buf, *path = NULL;
@@ -2422,7 +2407,7 @@ void ll_dirty_page_discard_warn(struct page *page, int ioret)
if (buf != NULL) {
dentry = d_find_alias(page->mapping->host);
if (dentry != NULL)
- path = ll_d_path(dentry, buf, PAGE_SIZE);
+ path = dentry_path_raw(dentry, buf, PAGE_SIZE);
}

CDEBUG(D_WARNING,
diff --git a/drivers/staging/lustre/lustre/obdclass/genops.c b/drivers/staging/lustre/lustre/obdclass/genops.c
index c314e9c..53876f9 100644
--- a/drivers/staging/lustre/lustre/obdclass/genops.c
+++ b/drivers/staging/lustre/lustre/obdclass/genops.c
@@ -1713,7 +1713,6 @@ EXPORT_SYMBOL(obd_zombie_barrier);
*/
static int obd_zombie_impexp_thread(void *unused)
{
- unshare_fs_struct();
complete(&obd_zombie_start);

obd_zombie_pid = current_pid();
diff --git a/drivers/staging/lustre/lustre/obdclass/llog.c b/drivers/staging/lustre/lustre/obdclass/llog.c
index 3ab0529..6130b23 100644
--- a/drivers/staging/lustre/lustre/obdclass/llog.c
+++ b/drivers/staging/lustre/lustre/obdclass/llog.c
@@ -411,8 +411,6 @@ static int llog_process_thread_daemonize(void *arg)
struct lu_env env;
int rc;

- unshare_fs_struct();
-
/* client env has no keys, tags is just 0 */
rc = lu_env_init(&env, LCT_LOCAL | LCT_MG_THREAD);
if (rc)
diff --git a/drivers/staging/lustre/lustre/ptlrpc/import.c b/drivers/staging/lustre/lustre/ptlrpc/import.c
index 2e7e717..d395e06 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/import.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/import.c
@@ -1290,8 +1290,6 @@ static int ptlrpc_invalidate_import_thread(void *data)
{
struct obd_import *imp = data;

- unshare_fs_struct();
-
CDEBUG(D_HA, "thread invalidate import %s to %s@%s\n",
imp->imp_obd->obd_name, obd2cli_tgt(imp->imp_obd),
imp->imp_connection->c_remote_uuid.uuid);
diff --git a/drivers/staging/lustre/lustre/ptlrpc/pinger.c b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
index 20341b2..9f426ea 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/pinger.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
@@ -586,8 +586,6 @@ static int ping_evictor_main(void *arg)
struct l_wait_info lwi = { 0 };
time_t expire_time;

- unshare_fs_struct();
-
CDEBUG(D_HA, "Starting Ping Evictor\n");
pet_state = PET_READY;
while (1) {
diff --git a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
index 357ea9f..a2a1574 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
@@ -382,7 +382,6 @@ static int ptlrpcd(void *arg)
struct lu_env env = { .le_ses = NULL };
int rc, exit = 0;

- unshare_fs_struct();
#if defined(CONFIG_SMP)
if (test_bit(LIOD_BIND, &pc->pc_flags)) {
int index = pc->pc_index;
diff --git a/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c b/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c
index c500aff..9e33781 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c
@@ -164,8 +164,6 @@ static int sec_gc_main(void *arg)
struct ptlrpc_thread *thread = (struct ptlrpc_thread *) arg;
struct l_wait_info lwi;

- unshare_fs_struct();
-
/* Record that the thread is running */
thread_set_flags(thread, SVC_RUNNING);
wake_up(&thread->t_ctl_waitq);
diff --git a/drivers/staging/lustre/lustre/ptlrpc/service.c b/drivers/staging/lustre/lustre/ptlrpc/service.c
index a8df8a7..149c65c 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/service.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/service.c
@@ -2277,7 +2277,6 @@ static int ptlrpc_main(void *arg)
int counter = 0, rc = 0;

thread->t_pid = current_pid();
- unshare_fs_struct();

/* NB: we will call cfs_cpt_bind() for all threads, because we
* might want to run lustre server only on a subset of system CPUs,
@@ -2478,7 +2477,6 @@ static int ptlrpc_hr_main(void *arg)

snprintf(threadname, sizeof(threadname), "ptlrpc_hr%02d_%03d",
hrp->hrp_cpt, hrt->hrt_id);
- unshare_fs_struct();

rc = cfs_cpt_bind(ptlrpc_hr.hr_cpt_table, hrp->hrp_cpt);
if (rc != 0) {
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 7dca743..401fd2e 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -128,29 +128,6 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
return fs;
}

-int unshare_fs_struct(void)
-{
- struct fs_struct *fs = current->fs;
- struct fs_struct *new_fs = copy_fs_struct(fs);
- int kill;
-
- if (!new_fs)
- return -ENOMEM;
-
- task_lock(current);
- spin_lock(&fs->lock);
- kill = !--fs->users;
- current->fs = new_fs;
- spin_unlock(&fs->lock);
- task_unlock(current);
-
- if (kill)
- free_fs_struct(fs);
-
- return 0;
-}
-EXPORT_SYMBOL_GPL(unshare_fs_struct);
-
int current_umask(void)
{
return current->fs->umask;
@@ -162,5 +139,5 @@ struct fs_struct init_fs = {
.users = 1,
.lock = __SPIN_LOCK_UNLOCKED(init_fs.lock),
.seq = SEQCNT_ZERO(init_fs.seq),
- .umask = 0022,
+ .umask = 0,
};
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 752d56b..357a73b 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -573,16 +573,6 @@ nfsd(void *vrqstp)
/* Lock module and set up kernel thread */
mutex_lock(&nfsd_mutex);

- /* At this point, the thread shares current->fs
- * with the init process. We need to create files with a
- * umask of 0 instead of init's umask. */
- if (unshare_fs_struct() < 0) {
- printk("Unable to start nfsd thread: out of memory\n");
- goto out;
- }
-
- current->fs->umask = 0;
-
/*
* thread is spawned with all signals set to SIG_IGN, re-enable
* the ones that will bring down the thread
@@ -623,7 +613,6 @@ nfsd(void *vrqstp)
mutex_lock(&nfsd_mutex);
nfsdstats.th_cnt --;

-out:
rqstp->rq_server = NULL;

/* Release the thread */
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 0efc3e6..18d369c 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -21,7 +21,6 @@ extern void set_fs_root(struct fs_struct *, const struct path *);
extern void set_fs_pwd(struct fs_struct *, const struct path *);
extern struct fs_struct *copy_fs_struct(struct fs_struct *);
extern void free_fs_struct(struct fs_struct *);
-extern int unshare_fs_struct(void);

static inline void get_fs_root(struct fs_struct *fs, struct path *root)
{
diff --git a/init/main.c b/init/main.c
index ca380ec..53aea3b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -77,6 +77,7 @@
#include <linux/context_tracking.h>
#include <linux/random.h>
#include <linux/list.h>
+#include <linux/fs_struct.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -399,7 +400,7 @@ static noinline void __init_refok rest_init(void)
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
- kernel_thread(kernel_init, NULL, CLONE_FS);
+ kernel_thread(kernel_init, NULL, 0);
numa_default_policy();
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
rcu_read_lock();
@@ -924,6 +925,7 @@ static int __ref kernel_init(void *unused)
{
int ret;

+ current->fs->umask = 0022;
kernel_init_freeable();
/* need to finish all async __init code before freeing the memory */
async_synchronize_full();
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 2777f40..4d775e7 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -39,6 +39,7 @@
#include <linux/rwsem.h>
#include <linux/ptrace.h>
#include <linux/async.h>
+#include <linux/fs_struct.h>
#include <asm/uaccess.h>

#include <trace/events/module.h>
@@ -219,6 +220,7 @@ static int ____call_usermodehelper(void *data)
struct cred *new;
int retval;

+ current->fs->umask = 0022;
spin_lock_irq(&current->sighand->siglock);
flush_signal_handlers(current, 1);
spin_unlock_irq(&current->sighand->siglock);

2014-12-12 03:02:11

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

On Thu, Dec 11, 2014 at 06:29:37PM -0800, Linus Torvalds wrote:
> On Thu, Dec 11, 2014 at 6:12 PM, Al Viro <[email protected]> wrote:
> >
> > Linus, do you see any problems with the following patch (against the mainline)?
>
> Not conceptually, but create_kthread() uses CLONE_FS, and I don't
> think it's just umask that things like nfsd want to avoid sharing.
> What about all the *other* fields?
>
> Just as an example: even if all the threads actually end up all having
> the same global root, what about contention on 'fs->lock'?
>
> I have *not* looked at the details, and maybe there's some reason I'm
> completely off, but it worries me.

Umm... I would be very surprised if it turned out to be a problem.
nfsd really doesn't give a fuck about its cwd and root - not in the
thread side of things. And (un)exporting is (a) not on a hot path
and (b) not done from a kernel thread anyway. fh_to_dentry and friends
don't care about root/cwd, etc.

I don't see anything that could cause that kind of issues.

2014-12-12 03:06:39

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

On Fri, Dec 12, 2014 at 03:02:06AM +0000, Al Viro wrote:
> On Thu, Dec 11, 2014 at 06:29:37PM -0800, Linus Torvalds wrote:
> > On Thu, Dec 11, 2014 at 6:12 PM, Al Viro <[email protected]> wrote:
> > >
> > > Linus, do you see any problems with the following patch (against the mainline)?
> >
> > Not conceptually, but create_kthread() uses CLONE_FS, and I don't
> > think it's just umask that things like nfsd want to avoid sharing.
> > What about all the *other* fields?
> >
> > Just as an example: even if all the threads actually end up all having
> > the same global root, what about contention on 'fs->lock'?
> >
> > I have *not* looked at the details, and maybe there's some reason I'm
> > completely off, but it worries me.
>
> Umm... I would be very surprised if it turned out to be a problem.
> nfsd really doesn't give a fuck about its cwd and root - not in the
> thread side of things. And (un)exporting is (a) not on a hot path
> and (b) not done from a kernel thread anyway. fh_to_dentry and friends
> don't care about root/cwd, etc.
>
> I don't see anything that could cause that kind of issues.

PS: I haven't checked if lustre is trying to avoid that kind of contention;
it might be, but I would consider having those threads resolve _any_ kind
of pathnames from root or cwd as a serious bug - after all, we are not guaranteed
that the filesystem in question is reachable from the namespace PID 1 is running
in.

2014-12-12 11:54:13

by Jeff Layton

Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

On Fri, 12 Dec 2014 03:02:06 +0000
Al Viro <[email protected]> wrote:

> On Thu, Dec 11, 2014 at 06:29:37PM -0800, Linus Torvalds wrote:
> > On Thu, Dec 11, 2014 at 6:12 PM, Al Viro <[email protected]> wrote:
> > >
> > > Linus, do you see any problems with the following patch (against the mainline)?
> >
> > Not conceptually, but create_kthread() uses CLONE_FS, and I don't
> > think it's just umask that things like nfsd want to avoid sharing.
> > What about all the *other* fields?
> >
> > Just as an example: even if all the threads actually end up all having
> > the same global root, what about contention on 'fs->lock'?
> >
> > I have *not* looked at the details, and maybe there's some reason I'm
> > completely off, but it worries me.
>
> Umm... I would be very surprised if it turned out to be a problem.
> nfsd really doesn't give a fuck about its cwd and root - not in the
> thread side of things. And (un)exporting is (a) not on a hot path
> and (b) not done from a kernel thread anyway. fh_to_dentry and friends
> don't care about root/cwd, etc.
>
> I don't see anything that could cause that kind of issue.

I like the change overall -- it would certainly make my patch series
simpler, but what about pathwalking? We do take the fs->lock in
unlazy_walk. Is it possible we'd end up with more contention there?

--
Jeff Layton <[email protected]>

2014-12-12 16:59:59

by Al Viro

Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

On Fri, Dec 12, 2014 at 06:54:08AM -0500, Jeff Layton wrote:

> > Umm... I would be very surprised if it turned out to be a problem.
> > nfsd really doesn't give a fuck about its cwd and root - not in the
> > thread side of things. And (un)exporting is (a) not on a hot path
> > and (b) not done from a kernel thread anyway. fh_to_dentry and friends
> > don't care about root/cwd, etc.
> >
> > I don't see anything that could cause that kind of issue.
>
> I like the change overall -- it would certainly make my patch series
> simpler, but what about pathwalking? We do take the fs->lock in
> unlazy_walk. Is it possible we'd end up with more contention there?

That would take a pathname lookup on the kernel-thread side of nfsd that
* isn't single-component
* isn't a LOOKUP_ROOT one (i.e. vfs_path_lookup() or file_open_root())
and I would really hope we don't have such things. Any such beast would
allow probing the tree layout outside of what we export, to start with...

AFAICS, we really don't have anything of that sort. Note that e.g.
lookup_one_len() doesn't go anywhere near ->fs->lock...

2014-12-13 14:06:52

by Jeff Layton

Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

On Fri, 12 Dec 2014 16:59:52 +0000
Al Viro <[email protected]> wrote:

> On Fri, Dec 12, 2014 at 06:54:08AM -0500, Jeff Layton wrote:
>
> > > Umm... I would be very surprised if it turned out to be a problem.
> > > nfsd really doesn't give a fuck about its cwd and root - not in the
> > > thread side of things. And (un)exporting is (a) not on a hot path
> > > and (b) not done from a kernel thread anyway. fh_to_dentry and friends
> > > don't care about root/cwd, etc.
> > >
> > > I don't see anything that could cause that kind of issue.
> >
> > I like the change overall -- it would certainly make my patch series
> > simpler, but what about pathwalking? We do take the fs->lock in
> > unlazy_walk. Is it possible we'd end up with more contention there?
>
> That would take a pathname lookup on the kernel-thread side of nfsd that
> * isn't single-component
> * isn't a LOOKUP_ROOT one (i.e. vfs_path_lookup() or file_open_root())
> and I would really hope we don't have such things. Any such beast would
> allow probing the tree layout outside of what we export, to start with...
>
> AFAICS, we really don't have anything of that sort. Note that e.g.
> lookup_one_len() doesn't go anywhere near ->fs->lock...

Ahh right. Ok, then I don't see any issue with this so far. Maybe worth
letting it stew in -next once -rc1 ships? Thanks!

Acked-by: Jeff Layton <[email protected]>

2014-12-13 17:30:00

by Al Viro

Subject: Re: [PATCH v2 00/16] nfsd/sunrpc: add support for a workqueue-based nfsd

On Sat, Dec 13, 2014 at 09:06:45AM -0500, Jeff Layton wrote:
> On Fri, 12 Dec 2014 16:59:52 +0000
> Al Viro <[email protected]> wrote:
>
> > On Fri, Dec 12, 2014 at 06:54:08AM -0500, Jeff Layton wrote:
> >
> > > > Umm... I would be very surprised if it turned out to be a problem.
> > > > nfsd really doesn't give a fuck about its cwd and root - not in the
> > > > thread side of things. And (un)exporting is (a) not on a hot path
> > > > and (b) not done from a kernel thread anyway. fh_to_dentry and friends
> > > > don't care about root/cwd, etc.
> > > >
> > > > I don't see anything that could cause that kind of issue.
> > >
> > > I like the change overall -- it would certainly make my patch series
> > > simpler, but what about pathwalking? We do take the fs->lock in
> > > unlazy_walk. Is it possible we'd end up with more contention there?
> >
> > That would take a pathname lookup on the kernel-thread side of nfsd that
> > * isn't single-component
> > * isn't a LOOKUP_ROOT one (i.e. vfs_path_lookup() or file_open_root())
> > and I would really hope we don't have such things. Any such beast would
> > allow probing the tree layout outside of what we export, to start with...
> >
> > AFAICS, we really don't have anything of that sort. Note that e.g.
> > lookup_one_len() doesn't go anywhere near ->fs->lock...
>
> Ahh right. Ok, then I don't see any issue with this so far. Maybe worth
> letting it stew in -next once -rc1 ships? Thanks!

FWIW, right now I think that out of those 3 commits #1 (separating PID 1 from
init_fs + making all kernel threads get umask 0) and #3 (assorted crapectomy
in lustre, getting rid of odd games with fs_struct there) are OK for mainline,
with #2 (removal of unshare_fs_struct()) being -next fodder, to see if we get
anything like unexpected contention, etc.