2014-12-02 18:24:32

by Jeff Layton

Subject: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

tl;dr: this code works and is much simpler than the dedicated thread
pool, but there are some latencies in the workqueue code that
seem to keep it from being as fast as it could be.

This patchset is a little skunkworks project that I've been poking at
for the last few weeks. Currently nfsd uses a dedicated thread pool to
handle RPCs, but that requires maintaining a rather large swath of
"fiddly" code to handle the threads and transports.

This patchset represents an alternative approach, which makes nfsd use
workqueues to do its bidding rather than a dedicated thread pool. When a
transport needs to do work, we simply queue it to the workqueue in
softirq context and let it service the transport.
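
In rough outline, the enqueue path looks like this (a condensed sketch
of svc_wq_enqueue_xprt() from patch #9; the "no workqueue yet" case is
elided here):

static void
svc_wq_enqueue_xprt(struct svc_xprt *xprt)
{
	struct svc_serv *serv = xprt->xpt_server;

	if (!svc_xprt_has_something_to_do(xprt))
		return;

	/* don't enqueue a transport that's already enqueued */
	if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
		return;

	svc_xprt_get(xprt);
	queue_work(serv->sv_wq, &xprt->xpt_work);
}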

The current draft is runtime-switchable via a new setting for the sunrpc
pool_mode module parameter. When that's set to "workqueue", nfsd will use
a workqueue-based service. One of the goals of this patchset was to
*not* need to change any userland code, so starting it up using rpc.nfsd
still works as expected. The only real difference is that the nfsdfs
"threads" file is reinterpreted as the "max_active" value for the
workqueue.
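
Concretely, under pool_mode=workqueue a write to the "threads" file ends
up in svc_wq_setup() (patch #9), which does roughly this (condensed,
error handling elided):

	if (!serv->sv_wq)
		serv->sv_wq = alloc_workqueue("%s",
				WQ_UNBOUND|WQ_FREEZABLE|WQ_SYSFS,
				max_active, serv->sv_name);
	else if (max_active)
		workqueue_set_max_active(serv->sv_wq, max_active);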

This code has a lot of potential to simplify nfsd significantly and I
think it may also scale better on larger machines. When testing with an
exported tmpfs on my craptacular test machine, the workqueue-based code
seems to be a little faster than a dedicated thread pool.

Currently though, performance takes a nose dive (~40%) when I'm writing
to (relatively slow) SATA disks. With the help of some tracepoints, I
think this is mostly due to some significant latency in the workqueue
code.

With the legacy dedicated thread pool, I see ~0.2ms of latency between
the softirq function queueing work to a given thread and that thread
picking the work up. When I queue the same work to a workqueue,
however, that latency jumps to ~30ms (average).

My current theory is that this latency interferes with the ability to
batch up requests to the disks and that is what accounts for the massive
slowdown.

So, I have several goals here in posting this:

1) to get some early feedback on this code. Does this seem reasonable,
assuming that we can address the workqueue latency problems?

2) get some insight about the latency from those with a better
understanding of the CMWQ code. Any thoughts as to why we might be
seeing such high latency here? Any ideas of what we can do about it?

3) I'm also cc'ing Al due to some changes in patch #10 to allow nfsd
to manage its fs_structs a little differently. Does that approach seem
reasonable?

Jeff Layton (14):
sunrpc: add a new svc_serv_ops struct and move sv_shutdown into it
sunrpc: move sv_function into sv_ops
sunrpc: move sv_module parm into sv_ops
sunrpc: turn enqueueing a svc_xprt into a svc_serv operation
sunrpc: abstract out svc_set_num_threads to sv_ops
sunrpc: move pool_mode definitions into svc.h
sunrpc: factor svc_rqst allocation and freeing from sv_nrthreads
refcounting
sunrpc: set up workqueue function in svc_xprt
sunrpc: add basic support for workqueue-based services
nfsd: keep a reference to the fs_struct in svc_rqst
nfsd: add support for workqueue based service processing
sunrpc: keep a cache of svc_rqsts for each NUMA node
sunrpc: add more tracepoints around svc_xprt handling
sunrpc: add tracepoints around svc_sock handling

fs/fs_struct.c | 60 +++++++--
fs/lockd/svc.c | 7 +-
fs/nfs/callback.c | 6 +-
fs/nfsd/nfssvc.c | 107 ++++++++++++---
include/linux/fs_struct.h | 4 +
include/linux/sunrpc/svc.h | 97 +++++++++++---
include/linux/sunrpc/svc_xprt.h | 3 +
include/linux/sunrpc/svcsock.h | 1 +
include/trace/events/sunrpc.h | 60 ++++++++-
net/sunrpc/Kconfig | 10 ++
net/sunrpc/Makefile | 1 +
net/sunrpc/svc.c | 141 +++++++++++---------
net/sunrpc/svc_wq.c | 281 ++++++++++++++++++++++++++++++++++++++++
net/sunrpc/svc_xprt.c | 66 +++++++++-
net/sunrpc/svcsock.c | 6 +
15 files changed, 737 insertions(+), 113 deletions(-)
create mode 100644 net/sunrpc/svc_wq.c

--
2.1.0


2014-12-02 18:24:35

by Jeff Layton

Subject: [RFC PATCH 02/14] sunrpc: move sv_function into sv_ops

Since we now have a container for holding svc_serv operations, move the
sv_function into it as well.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfssvc.c | 3 ++-
include/linux/sunrpc/svc.h | 11 +++--------
net/sunrpc/svc.c | 8 +++-----
3 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index cc5dd1227732..5017af3fe882 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -392,6 +392,7 @@ static int nfsd_get_default_max_blksize(void)

static struct svc_serv_ops nfsd_sv_ops = {
.svo_shutdown = nfsd_last_thread,
+ .svo_function = nfsd,
};

int nfsd_create_serv(struct net *net)
@@ -408,7 +409,7 @@ int nfsd_create_serv(struct net *net)
nfsd_max_blksize = nfsd_get_default_max_blksize();
nfsd_reset_versions();
nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- &nfsd_sv_ops, nfsd, THIS_MODULE);
+ &nfsd_sv_ops, THIS_MODULE);
if (nn->nfsd_serv == NULL)
return -ENOMEM;

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 35faea625d9f..7e27894bc065 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -19,11 +19,6 @@
#include <linux/wait.h>
#include <linux/mm.h>

-/*
- * This is the RPC server thread function prototype
- */
-typedef int (*svc_thread_fn)(void *);
-
/* statistics for svc_pool structures */
struct svc_pool_stats {
atomic_long_t packets;
@@ -58,7 +53,8 @@ struct svc_serv;

struct svc_serv_ops {
/* Callback to use when last thread exits. */
- void (*svo_shutdown)(struct svc_serv *serv, struct net *net);
+ void (*svo_shutdown)(struct svc_serv *, struct net *);
+ int (*svo_function)(void *);
};

/*
@@ -95,7 +91,6 @@ struct svc_serv {
struct svc_serv_ops * sv_ops; /* server operations */
struct module * sv_module; /* optional module to count when
* adding threads */
- svc_thread_fn sv_function; /* main function for threads */
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
struct list_head sv_cb_list; /* queue for callback requests
* that arrive over the same
@@ -435,7 +430,7 @@ struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool, int node);
void svc_exit_thread(struct svc_rqst *);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
- struct svc_serv_ops *, svc_thread_fn, struct module *);
+ struct svc_serv_ops *, struct module *);
int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
int svc_pool_stats_open(struct svc_serv *serv, struct file *file);
void svc_destroy(struct svc_serv *);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 057185e11261..897c755c6186 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -34,7 +34,7 @@

static void svc_unregister(const struct svc_serv *serv, struct net *net);

-#define svc_serv_is_pooled(serv) ((serv)->sv_function)
+#define svc_serv_is_pooled(serv) ((serv)->sv_ops->svo_function)

/*
* Mode for mapping cpus to pools.
@@ -494,8 +494,7 @@ EXPORT_SYMBOL_GPL(svc_create);

struct svc_serv *
svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
- struct svc_serv_ops *ops, svc_thread_fn func,
- struct module *mod)
+ struct svc_serv_ops *ops, struct module *mod)
{
struct svc_serv *serv;
unsigned int npools = svc_pool_map_get();
@@ -504,7 +503,6 @@ svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
if (!serv)
goto out_err;

- serv->sv_function = func;
serv->sv_module = mod;
return serv;
out_err:
@@ -740,7 +738,7 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
}

__module_get(serv->sv_module);
- task = kthread_create_on_node(serv->sv_function, rqstp,
+ task = kthread_create_on_node(serv->sv_ops->svo_function, rqstp,
node, "%s", serv->sv_name);
if (IS_ERR(task)) {
error = PTR_ERR(task);
--
2.1.0

2014-12-02 18:24:43

by Jeff Layton

Subject: [RFC PATCH 06/14] sunrpc: move pool_mode definitions into svc.h

In later patches, we're going to need to allow code external to svc.c
to figure out what pool_mode is in use. Move these definitions into
svc.h to prepare for that.

Also, make the svc_pool_map object available and exported so that other
modules can peek in there to get insight into what pool mode is in use.
Likewise, export the svc_pool_map_get/put functions to make it safe to
do so.
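
For illustration, an external consumer can now do something like this
(hypothetical caller; the real users arrive later in the series):

/* Pin the pool map, check which mode is in use, then unpin. */
static bool svc_using_pernode_pools(void)
{
	bool ret;

	svc_pool_map_get();	/* takes a ref, initializing if needed */
	ret = (svc_pool_map.mode == SVC_POOL_PERNODE);
	svc_pool_map_put();	/* drops the ref */
	return ret;
}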

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 25 +++++++++++++++++++++++++
net/sunrpc/svc.c | 31 +++++++------------------------
2 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 5e172c8329f8..28e5f5716a87 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -428,6 +428,29 @@ struct svc_procedure {
};

/*
+ * Mode for mapping cpus to pools.
+ */
+enum {
+ SVC_POOL_AUTO = -1, /* choose one of the others */
+ SVC_POOL_GLOBAL, /* no mapping, just a single global pool
+ * (legacy & UP mode) */
+ SVC_POOL_PERCPU, /* one pool per cpu */
+ SVC_POOL_PERNODE /* one pool per numa node */
+};
+
+struct svc_pool_map {
+ int count; /* How many svc_servs use us */
+ int mode; /* Note: int not enum to avoid
+ * warnings about "enumeration value
+ * not handled in switch" */
+ unsigned int npools;
+ unsigned int *pool_to; /* maps pool id to cpu or node */
+ unsigned int *to_pool; /* maps cpu or node to pool id */
+};
+
+extern struct svc_pool_map svc_pool_map;
+
+/*
* Function prototypes.
*/
int svc_rpcb_setup(struct svc_serv *serv, struct net *net);
@@ -438,6 +461,8 @@ struct svc_serv *svc_create(struct svc_program *, unsigned int,
struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool, int node);
void svc_exit_thread(struct svc_rqst *);
+unsigned int svc_pool_map_get(void);
+void svc_pool_map_put(void);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
struct svc_serv_ops *);
int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 9b4b4d8d109a..04c083a53121 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -36,34 +36,17 @@ static void svc_unregister(const struct svc_serv *serv, struct net *net);

#define svc_serv_is_pooled(serv) ((serv)->sv_ops->svo_function)

-/*
- * Mode for mapping cpus to pools.
- */
-enum {
- SVC_POOL_AUTO = -1, /* choose one of the others */
- SVC_POOL_GLOBAL, /* no mapping, just a single global pool
- * (legacy & UP mode) */
- SVC_POOL_PERCPU, /* one pool per cpu */
- SVC_POOL_PERNODE /* one pool per numa node */
-};
#define SVC_POOL_DEFAULT SVC_POOL_GLOBAL

/*
* Structure for mapping cpus to pools and vice versa.
* Setup once during sunrpc initialisation.
*/
-static struct svc_pool_map {
- int count; /* How many svc_servs use us */
- int mode; /* Note: int not enum to avoid
- * warnings about "enumeration value
- * not handled in switch" */
- unsigned int npools;
- unsigned int *pool_to; /* maps pool id to cpu or node */
- unsigned int *to_pool; /* maps cpu or node to pool id */
-} svc_pool_map = {
- .count = 0,
+struct svc_pool_map svc_pool_map = {
.mode = SVC_POOL_DEFAULT
};
+EXPORT_SYMBOL_GPL(svc_pool_map);
+
static DEFINE_MUTEX(svc_pool_map_mutex);/* protects svc_pool_map.count only */

static int
@@ -236,7 +219,7 @@ svc_pool_map_init_pernode(struct svc_pool_map *m)
* vice versa). Initialise the map if we're the first user.
* Returns the number of pools.
*/
-static unsigned int
+unsigned int
svc_pool_map_get(void)
{
struct svc_pool_map *m = &svc_pool_map;
@@ -271,7 +254,7 @@ svc_pool_map_get(void)
mutex_unlock(&svc_pool_map_mutex);
return m->npools;
}
-
+EXPORT_SYMBOL_GPL(svc_pool_map_get);

/*
* Drop a reference to the global map of cpus to pools.
@@ -280,7 +263,7 @@ svc_pool_map_get(void)
* mode using the pool_mode module option without
* rebooting or re-loading sunrpc.ko.
*/
-static void
+void
svc_pool_map_put(void)
{
struct svc_pool_map *m = &svc_pool_map;
@@ -297,7 +280,7 @@ svc_pool_map_put(void)

mutex_unlock(&svc_pool_map_mutex);
}
-
+EXPORT_SYMBOL_GPL(svc_pool_map_put);

static int svc_pool_map_get_node(unsigned int pidx)
{
--
2.1.0

2014-12-02 18:24:37

by Jeff Layton

Subject: [RFC PATCH 03/14] sunrpc: move sv_module parm into sv_ops

...not technically an operation, but it's more convenient and cleaner
to pass the module pointer in this struct.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfssvc.c | 3 ++-
include/linux/sunrpc/svc.h | 9 ++++++---
net/sunrpc/svc.c | 8 +++-----
3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 5017af3fe882..94e7e173aa10 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -393,6 +393,7 @@ static int nfsd_get_default_max_blksize(void)
static struct svc_serv_ops nfsd_sv_ops = {
.svo_shutdown = nfsd_last_thread,
.svo_function = nfsd,
+ .svo_module = THIS_MODULE,
};

int nfsd_create_serv(struct net *net)
@@ -409,7 +410,7 @@ int nfsd_create_serv(struct net *net)
nfsd_max_blksize = nfsd_get_default_max_blksize();
nfsd_reset_versions();
nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- &nfsd_sv_ops, THIS_MODULE);
+ &nfsd_sv_ops);
if (nn->nfsd_serv == NULL)
return -ENOMEM;

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 7e27894bc065..377be73b736f 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -54,7 +54,12 @@ struct svc_serv;
struct svc_serv_ops {
/* Callback to use when last thread exits. */
void (*svo_shutdown)(struct svc_serv *, struct net *);
+
+ /* function for service threads to run */
int (*svo_function)(void *);
+
+ /* optional module to count when adding threads (pooled svcs only) */
+ struct module *svo_module;
};

/*
@@ -89,8 +94,6 @@ struct svc_serv {
unsigned int sv_nrpools; /* number of thread pools */
struct svc_pool * sv_pools; /* array of thread pools */
struct svc_serv_ops * sv_ops; /* server operations */
- struct module * sv_module; /* optional module to count when
- * adding threads */
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
struct list_head sv_cb_list; /* queue for callback requests
* that arrive over the same
@@ -430,7 +433,7 @@ struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool, int node);
void svc_exit_thread(struct svc_rqst *);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
- struct svc_serv_ops *, struct module *);
+ struct svc_serv_ops *);
int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
int svc_pool_stats_open(struct svc_serv *serv, struct file *file);
void svc_destroy(struct svc_serv *);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 897c755c6186..9b4b4d8d109a 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -494,7 +494,7 @@ EXPORT_SYMBOL_GPL(svc_create);

struct svc_serv *
svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
- struct svc_serv_ops *ops, struct module *mod)
+ struct svc_serv_ops *ops)
{
struct svc_serv *serv;
unsigned int npools = svc_pool_map_get();
@@ -502,8 +502,6 @@ svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
serv = __svc_create(prog, bufsize, npools, ops);
if (!serv)
goto out_err;
-
- serv->sv_module = mod;
return serv;
out_err:
svc_pool_map_put();
@@ -737,12 +735,12 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
break;
}

- __module_get(serv->sv_module);
+ __module_get(serv->sv_ops->svo_module);
task = kthread_create_on_node(serv->sv_ops->svo_function, rqstp,
node, "%s", serv->sv_name);
if (IS_ERR(task)) {
error = PTR_ERR(task);
- module_put(serv->sv_module);
+ module_put(serv->sv_ops->svo_module);
svc_exit_thread(rqstp);
break;
}
--
2.1.0

2014-12-02 18:24:48

by Jeff Layton

Subject: [RFC PATCH 10/14] nfsd: keep a reference to the fs_struct in svc_rqst

When we convert this code to use a workqueue, we won't want to allocate
a new fs_struct to handle each RPC. Doing so might also be problematic
since we'd be swapping out the ->fs value on a "public" workqueue
kthread.

Change the code to allocate an fs_struct when allocating a svc_rqst
and then switch to using that in the "nfsd" function.
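
A worker then follows the pattern below (a condensed sketch drawn from
this patch and patch #11; swap_fs_struct() is made non-static in the
later patch, and rqstp->rq_fs is allocated in svc_rqst_alloc() with its
umask preset to 0):

static void handle_one_rpc(struct svc_rqst *rqstp)
{
	struct fs_struct *saved_fs;

	get_fs_struct(rqstp->rq_fs);		 /* extra ref for the swap */
	saved_fs = swap_fs_struct(rqstp->rq_fs); /* install rqstp's fs */

	/* ...receive and process the request... */

	saved_fs = swap_fs_struct(saved_fs);	 /* restore the original */
	put_fs_struct(saved_fs);		 /* drop the extra ref */
}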

Cc: Al Viro <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/fs_struct.c | 59 +++++++++++++++++++++++++++++++++++++++-------
fs/nfsd/nfssvc.c | 17 ++++++-------
include/linux/fs_struct.h | 3 +++
include/linux/sunrpc/svc.h | 1 +
net/sunrpc/svc.c | 8 +++++++
5 files changed, 69 insertions(+), 19 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 7dca743b2ce1..9bc08ea2f433 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -127,26 +127,67 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
}
return fs;
}
+EXPORT_SYMBOL_GPL(copy_fs_struct);

-int unshare_fs_struct(void)
+/* Replace current fs struct with one given. Return a pointer to old one. */
+static struct fs_struct *
+swap_fs_struct(struct fs_struct *new_fs)
{
- struct fs_struct *fs = current->fs;
- struct fs_struct *new_fs = copy_fs_struct(fs);
- int kill;
-
- if (!new_fs)
- return -ENOMEM;
+ struct fs_struct *old_fs;

task_lock(current);
+ old_fs = current->fs;
+ current->fs = new_fs;
+ task_unlock(current);
+
+ return old_fs;
+}
+
+/* Put a reference to a fs_struct. */
+void put_fs_struct(struct fs_struct *fs)
+{
+ bool kill;
+
spin_lock(&fs->lock);
kill = !--fs->users;
- current->fs = new_fs;
spin_unlock(&fs->lock);
- task_unlock(current);

if (kill)
free_fs_struct(fs);
+}
+EXPORT_SYMBOL_GPL(put_fs_struct);
+
+/* Take an extra reference to a fs_struct. Caller must already hold one! */
+struct fs_struct *
+get_fs_struct(struct fs_struct *fs)
+{
+ spin_lock(&fs->lock);
+ ++fs->users;
+ spin_unlock(&fs->lock);
+ return fs;
+}
+EXPORT_SYMBOL_GPL(get_fs_struct);
+
+/*
+ * Swap in a new fs_struct and drop the reference on the old one.
+ * Caller must have already taken the reference to the new one.
+ */
+void replace_fs_struct(struct fs_struct *new_fs)
+{
+ struct fs_struct *old_fs = swap_fs_struct(new_fs);
+
+ put_fs_struct(old_fs);
+}
+EXPORT_SYMBOL_GPL(replace_fs_struct);
+
+int unshare_fs_struct(void)
+{
+ struct fs_struct *new_fs = copy_fs_struct(current->fs);
+
+ if (!new_fs)
+ return -ENOMEM;

+ replace_fs_struct(new_fs);
return 0;
}
EXPORT_SYMBOL_GPL(unshare_fs_struct);
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 71e7b180c0d9..f37bd7db2176 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -582,15 +582,13 @@ nfsd(void *vrqstp)
/* Lock module and set up kernel thread */
mutex_lock(&nfsd_mutex);

- /* At this point, the thread shares current->fs
- * with the init process. We need to create files with a
- * umask of 0 instead of init's umask. */
- if (unshare_fs_struct() < 0) {
- printk("Unable to start nfsd thread: out of memory\n");
- goto out;
- }
-
- current->fs->umask = 0;
+ /*
+ * At this point, the thread shares current->fs with the init process.
+ * We need to create files with a umask of 0 instead of init's umask,
+ * so switch to the fs_struct associated with the rqstp.
+ */
+ get_fs_struct(rqstp->rq_fs);
+ replace_fs_struct(rqstp->rq_fs);

/*
* thread is spawned with all signals set to SIG_IGN, re-enable
@@ -632,7 +630,6 @@ nfsd(void *vrqstp)
mutex_lock(&nfsd_mutex);
nfsdstats.th_cnt --;

-out:
rqstp->rq_server = NULL;

/* Release the thread */
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 0efc3e62843a..d2b7a1942790 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -21,7 +21,10 @@ extern void set_fs_root(struct fs_struct *, const struct path *);
extern void set_fs_pwd(struct fs_struct *, const struct path *);
extern struct fs_struct *copy_fs_struct(struct fs_struct *);
extern void free_fs_struct(struct fs_struct *);
+extern void replace_fs_struct(struct fs_struct *);
extern int unshare_fs_struct(void);
+struct fs_struct *get_fs_struct(struct fs_struct *);
+void put_fs_struct(struct fs_struct *);

static inline void get_fs_root(struct fs_struct *fs, struct path *root)
{
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 81e723220346..f47de87660b4 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -295,6 +295,7 @@ struct svc_rqst {
struct svc_cacherep * rq_cacherep; /* cache info */
struct task_struct *rq_task; /* service thread */
spinlock_t rq_lock; /* per-request lock */
+ struct fs_struct *rq_fs;
};

#define SVC_NET(svc_rqst) (svc_rqst->rq_xprt->xpt_net)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 7c8e33923210..4300bc852f6e 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -20,6 +20,7 @@
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/slab.h>
+#include <linux/fs_struct.h>

#include <linux/sunrpc/types.h>
#include <linux/sunrpc/xdr.h>
@@ -622,6 +623,11 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
if (!svc_init_buffer(rqstp, serv->sv_max_mesg, node))
goto out_enomem;

+ rqstp->rq_fs = copy_fs_struct(current->fs);
+ if (!rqstp->rq_fs)
+ goto out_enomem;
+
+ rqstp->rq_fs->umask = 0;
return rqstp;
out_enomem:
svc_rqst_free(rqstp);
@@ -784,6 +790,8 @@ svc_rqst_free(struct svc_rqst *rqstp)
kfree(rqstp->rq_resp);
kfree(rqstp->rq_argp);
kfree(rqstp->rq_auth_data);
+ if (rqstp->rq_fs)
+ put_fs_struct(rqstp->rq_fs);
kfree_rcu(rqstp, rq_rcu_head);
}
EXPORT_SYMBOL_GPL(svc_rqst_free);
--
2.1.0

2014-12-02 18:25:00

by Jeff Layton

Subject: [RFC PATCH 14/14] sunrpc: add tracepoints around svc_sock handling

Signed-off-by: Jeff Layton <[email protected]>
---
include/trace/events/sunrpc.h | 46 +++++++++++++++++++++++++++++++++++++++++++
net/sunrpc/svcsock.c | 6 ++++++
2 files changed, 52 insertions(+)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 434b46e79053..711ebd9179b1 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -9,6 +9,7 @@
#include <linux/sunrpc/svc.h>
#include <linux/sunrpc/xprtsock.h>
#include <linux/sunrpc/svc_xprt.h>
+#include <linux/sunrpc/svcsock.h>
#include <net/tcp_states.h>
#include <linux/net.h>
#include <linux/tracepoint.h>
@@ -584,6 +585,51 @@ TRACE_EVENT(svc_handle_xprt,
show_svc_xprt_flags(__entry->xprt->xpt_flags))
);

+DECLARE_EVENT_CLASS(svc_socket_event,
+ TP_PROTO(struct sock *sk),
+
+ TP_ARGS(sk),
+
+ TP_STRUCT__entry(
+ __field(struct sock *, sk)
+ __field(struct svc_sock *, svsk)
+ __field_struct(struct sockaddr_storage, ss)
+ __field(unsigned long, xpt_flags)
+ ),
+
+ TP_fast_assign(
+ __entry->sk = sk;
+ __entry->svsk = sk->sk_user_data;
+ if (__entry->svsk) {
+ struct svc_xprt *xprt = &__entry->svsk->sk_xprt;
+
+ memcpy(&__entry->ss, &xprt->xpt_remote,
+ sizeof(__entry->ss));
+ __entry->xpt_flags = xprt->xpt_flags;
+ } else {
+ memset(&__entry->ss, 0, sizeof(__entry->ss));
+ __entry->xpt_flags = 0;
+ }
+ ),
+
+ TP_printk("sk=0x%p peer=%pIScp sk_state=%s xpt_flags=%s", __entry->sk,
+ (struct sockaddr *)&__entry->ss,
+ rpc_show_sock_state(__entry->sk->sk_state),
+ show_svc_xprt_flags(__entry->xpt_flags))
+);
+
+DEFINE_EVENT(svc_socket_event, svc_tcp_listen_data_ready,
+ TP_PROTO(struct sock *sk), TP_ARGS(sk));
+
+DEFINE_EVENT(svc_socket_event, svc_tcp_state_change,
+ TP_PROTO(struct sock *sk), TP_ARGS(sk));
+
+DEFINE_EVENT(svc_socket_event, svc_tcp_data_ready,
+ TP_PROTO(struct sock *sk), TP_ARGS(sk));
+
+DEFINE_EVENT(svc_socket_event, svc_tcp_accept,
+ TP_PROTO(struct sock *sk), TP_ARGS(sk));
+
#endif /* _TRACE_SUNRPC_H */

#include <trace/define_trace.h>
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 74db5ca60b33..2b09deabac24 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -44,6 +44,7 @@
#include <asm/uaccess.h>
#include <asm/ioctls.h>
#include <trace/events/skb.h>
+#include <trace/events/sunrpc.h>

#include <linux/sunrpc/types.h>
#include <linux/sunrpc/clnt.h>
@@ -765,6 +766,7 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
struct svc_sock *svsk = (struct svc_sock *)sk->sk_user_data;
wait_queue_head_t *wq;

+ trace_svc_tcp_listen_data_ready(sk);
dprintk("svc: socket %p TCP (listen) state change %d\n",
sk, sk->sk_state);

@@ -799,6 +801,7 @@ static void svc_tcp_state_change(struct sock *sk)
struct svc_sock *svsk = (struct svc_sock *)sk->sk_user_data;
wait_queue_head_t *wq = sk_sleep(sk);

+ trace_svc_tcp_state_change(sk);
dprintk("svc: socket %p TCP (connected) state change %d (svsk %p)\n",
sk, sk->sk_state, sk->sk_user_data);

@@ -817,6 +820,7 @@ static void svc_tcp_data_ready(struct sock *sk)
struct svc_sock *svsk = (struct svc_sock *)sk->sk_user_data;
wait_queue_head_t *wq = sk_sleep(sk);

+ trace_svc_tcp_data_ready(sk);
dprintk("svc: socket %p TCP data ready (svsk %p)\n",
sk, sk->sk_user_data);
if (svsk) {
@@ -842,6 +846,8 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
int err, slen;
RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);

+ trace_svc_tcp_accept(svsk->sk_sk);
+
dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
if (!sock)
return NULL;
--
2.1.0

2014-12-02 18:24:52

by Jeff Layton

Subject: [RFC PATCH 12/14] sunrpc: keep a cache of svc_rqsts for each NUMA node

Allocating an entire svc_rqst (including all of its pages, etc.) for
each workqueue request is pretty expensive. Instead, keep a cache of
allocated svc_rqst structures in each NUMA node's svc_pool.

In order to keep the cache from growing without bound, we register a
shrinker. Since the cache is already NUMA-aware, we can use a NUMA-aware
shrinker as well.
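
A workqueue job then claims and releases cache entries like this (a
condensed sketch of the usage in the nfsd patch; find_or_alloc marks
the winner with RQ_BUSY, and put_svc_rqst() timestamps and clears that
bit so the shrinker only reaps entries that have been idle for a
while):

static void handle_with_cached_rqst(struct svc_serv *serv)
{
	struct svc_rqst *rqstp;

	rqstp = find_or_alloc_svc_rqst(serv); /* claims via RQ_BUSY */
	if (!rqstp)
		return;	/* alloc failure; caller requeues the work */

	/* ...receive and process the request... */

	put_svc_rqst(rqstp); /* sets rq_time = jiffies, clears RQ_BUSY */
}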

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfssvc.c | 6 +-
include/linux/sunrpc/svc.h | 17 ++++++
net/sunrpc/svc.c | 1 +
net/sunrpc/svc_wq.c | 136 ++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 154 insertions(+), 6 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 2c7ebced0311..c359e8f77b30 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -672,7 +672,6 @@ nfsd(void *vrqstp)
static void
nfsd_work(struct work_struct *work)
{
- int node = numa_node_id();
struct svc_xprt *xprt = container_of(work, struct svc_xprt, xpt_work);
struct net *net = xprt->xpt_net;
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
@@ -681,7 +680,7 @@ nfsd_work(struct work_struct *work)
struct fs_struct *saved_fs;
int err;

- rqstp = svc_rqst_alloc(serv, &serv->sv_pools[node], node);
+ rqstp = find_or_alloc_svc_rqst(serv);
if (!rqstp) {
/* Alloc failure. Give up for now, and requeue the work */
queue_work(serv->sv_wq, &xprt->xpt_work);
@@ -703,8 +702,7 @@ nfsd_work(struct work_struct *work)

saved_fs = swap_fs_struct(saved_fs);
put_fs_struct(saved_fs);
-
- svc_rqst_free(rqstp);
+ put_svc_rqst(rqstp);
}

static struct svc_serv_ops nfsd_wq_sv_ops = {
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index f47de87660b4..33321ddacfee 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -105,6 +105,7 @@ struct svc_serv {
struct svc_pool * sv_pools; /* array of thread pools */
struct svc_serv_ops * sv_ops; /* server operations */
struct workqueue_struct *sv_wq; /* workqueue for wq-based services */
+ struct shrinker sv_shrinker; /* for shrinking svc_rqst caches */
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
struct list_head sv_cb_list; /* queue for callback requests
* that arrive over the same
@@ -274,6 +275,7 @@ struct svc_rqst {
#define RQ_VICTIM (5) /* about to be shut down */
#define RQ_BUSY (6) /* request is busy */
unsigned long rq_flags; /* flags field */
+ unsigned long rq_time; /* when rqstp was last put */

void * rq_argp; /* decoded arguments */
void * rq_resp; /* xdr'd results */
@@ -493,6 +495,21 @@ char * svc_print_addr(struct svc_rqst *, char *, size_t);
#if IS_ENABLED(CONFIG_SUNRPC_SVC_WORKQUEUE)
int svc_wq_setup(struct svc_serv *, struct svc_pool *, int);
void svc_wq_enqueue_xprt(struct svc_xprt *);
+struct svc_rqst * find_or_alloc_svc_rqst(struct svc_serv *serv);
+void exit_svc_rqst_cache(struct svc_serv *serv);
+
+static inline void
+put_svc_rqst(struct svc_rqst *rqstp)
+{
+ rqstp->rq_time = jiffies;
+ clear_bit(RQ_BUSY, &rqstp->rq_flags);
+}
+#else
+static inline void
+exit_svc_rqst_cache(struct svc_serv *serv)
+{
+ return;
+}
#endif

#define RPC_MAX_ADDRBUFLEN (63U)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 4300bc852f6e..4ebba00b8b27 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -547,6 +547,7 @@ svc_destroy(struct svc_serv *serv)

if (serv->sv_wq) {
destroy_workqueue(serv->sv_wq);
+ exit_svc_rqst_cache(serv);
module_put(serv->sv_ops->svo_module);
}

diff --git a/net/sunrpc/svc_wq.c b/net/sunrpc/svc_wq.c
index d4720ecd0b32..e96bbf49c1a0 100644
--- a/net/sunrpc/svc_wq.c
+++ b/net/sunrpc/svc_wq.c
@@ -12,6 +12,130 @@
#include <trace/events/sunrpc.h>

/*
+ * Find a svc_rqst to use. Try to find an already allocated-one on the list
+ * first, and then allocate if there isn't one already available.
+ */
+struct svc_rqst *
+find_or_alloc_svc_rqst(struct svc_serv *serv)
+{
+ int node = numa_node_id();
+ struct svc_rqst *rqstp;
+ struct svc_pool *pool = &serv->sv_pools[node];
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+ if (!test_and_set_bit(RQ_BUSY, &rqstp->rq_flags)) {
+ rcu_read_unlock();
+ return rqstp;
+ }
+ }
+ rcu_read_unlock();
+
+ rqstp = svc_rqst_alloc(serv, pool, node);
+ if (rqstp) {
+ spin_lock_bh(&pool->sp_lock);
+ list_add_tail_rcu(&rqstp->rq_all, &pool->sp_all_threads);
+ ++pool->sp_nrthreads;
+ spin_unlock_bh(&pool->sp_lock);
+ }
+ return rqstp;
+}
+EXPORT_SYMBOL_GPL(find_or_alloc_svc_rqst);
+
+static unsigned long
+count_svc_rqst_objects(struct shrinker *shrinker, struct shrink_control *sc)
+{
+ struct svc_serv *serv = container_of(shrinker, struct svc_serv,
+ sv_shrinker);
+ struct svc_pool *pool = &serv->sv_pools[sc->nid];
+ struct svc_rqst *rqstp;
+ unsigned long count = 0;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+ /* Don't count it if it's busy */
+ if (test_bit(RQ_BUSY, &rqstp->rq_flags))
+ continue;
+
+ /* Don't count it if it was used within the last second */
+ if (time_before(jiffies, rqstp->rq_time + HZ))
+ continue;
+
+ ++count;
+ }
+ rcu_read_unlock();
+
+ return count;
+}
+
+static unsigned long
+scan_svc_rqst_objects(struct shrinker *shrinker, struct shrink_control *sc)
+{
+ struct svc_serv *serv = container_of(shrinker, struct svc_serv,
+ sv_shrinker);
+ struct svc_pool *pool = &serv->sv_pools[sc->nid];
+ struct svc_rqst *rqstp;
+ unsigned long count = 0;
+
+ spin_lock(&pool->sp_lock);
+ list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+ /* Don't free it if it's busy */
+ if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
+ continue;
+
+ list_del_rcu(&rqstp->rq_all);
+ svc_rqst_free(rqstp);
+ --pool->sp_nrthreads;
+ ++count;
+ if (sc->nr_to_scan-- == 0)
+ break;
+ }
+ spin_unlock(&pool->sp_lock);
+
+ return count;
+}
+
+static int
+init_svc_rqst_cache(struct svc_serv *serv)
+{
+ struct shrinker *shrinker = &serv->sv_shrinker;
+
+ memset(shrinker, 0, sizeof(*shrinker));
+
+ shrinker->count_objects = count_svc_rqst_objects;
+ shrinker->scan_objects = scan_svc_rqst_objects;
+ shrinker->seeks = DEFAULT_SEEKS;
+ shrinker->flags = SHRINKER_NUMA_AWARE;
+
+ return register_shrinker(shrinker);
+}
+
+void
+exit_svc_rqst_cache(struct svc_serv *serv)
+{
+ int node;
+
+ unregister_shrinker(&serv->sv_shrinker);
+
+ for (node = 0; node < serv->sv_nrpools; node++) {
+ struct svc_pool *pool = &serv->sv_pools[node];
+
+ spin_lock_bh(&pool->sp_lock);
+ while (!list_empty(&pool->sp_all_threads)) {
+ struct svc_rqst *rqstp = list_first_entry(
+ &pool->sp_all_threads, struct svc_rqst,
+ rq_all);
+
+ WARN_ON_ONCE(test_bit(RQ_BUSY, &rqstp->rq_flags));
+ list_del_rcu(&rqstp->rq_all);
+ svc_rqst_free(rqstp);
+ }
+ pool->sp_nrthreads = 0;
+ spin_unlock_bh(&pool->sp_lock);
+ }
+}
+
+/*
* This workqueue job should run on each node when the workqueue is created. It
* walks the list of xprts for its node, and queues the workqueue job for each.
*/
@@ -58,12 +182,13 @@ process_queued_xprts(struct svc_serv *serv)

/*
* Start up or shut down a workqueue-based RPC service. Basically, we use this
- * to allocate the workqueue. The function assumes that the caller holds one
- * serv->sv_nrthreads reference.
+ * to allocate the workqueue and set up the shrinker for the svc_rqst cache.
+ * This function assumes that the caller holds one serv->sv_nrthreads reference.
*/
int
svc_wq_setup(struct svc_serv *serv, struct svc_pool *pool, int max_active)
{
+ int err;
int nrthreads = serv->sv_nrthreads - 1; /* -1 for caller's reference */

WARN_ON_ONCE(nrthreads < 0);
@@ -79,14 +204,21 @@ svc_wq_setup(struct svc_serv *serv, struct svc_pool *pool, int max_active)
/* svc is down and none requested? */
if (!max_active)
return 0;
+
+ err = init_svc_rqst_cache(serv);
+ if (err)
+ return err;
+
__module_get(serv->sv_ops->svo_module);
serv->sv_wq = alloc_workqueue("%s",
WQ_UNBOUND|WQ_FREEZABLE|WQ_SYSFS,
max_active, serv->sv_name);
if (!serv->sv_wq) {
+ exit_svc_rqst_cache(serv);
module_put(serv->sv_ops->svo_module);
return -ENOMEM;
}
+
process_queued_xprts(serv);
} else {
/*
--
2.1.0

2014-12-02 18:25:53

by Jeff Layton

Subject: [RFC PATCH 13/14] sunrpc: add more tracepoints around svc_xprt handling

Turn the svc_xprt_dequeue tracepoint into an event class, and use it to
add events for svc_xprt_received and svc_xprt_active. Also, rename the
svc_xprt_do_enqueue tracepoint to svc_xprt_enqueue.

Signed-off-by: Jeff Layton <[email protected]>
---
include/trace/events/sunrpc.h | 14 ++++++++++++--
net/sunrpc/svc_wq.c | 1 +
net/sunrpc/svc_xprt.c | 5 ++++-
3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index b9c1dc6c825a..434b46e79053 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -496,7 +496,7 @@ DEFINE_EVENT(svc_rqst_status, svc_send,
{ (1UL << XPT_CACHE_AUTH), "XPT_CACHE_AUTH"}, \
{ (1UL << XPT_LOCAL), "XPT_LOCAL"})

-TRACE_EVENT(svc_xprt_do_enqueue,
+TRACE_EVENT(svc_xprt_enqueue,
TP_PROTO(struct svc_xprt *xprt, struct svc_rqst *rqst),

TP_ARGS(xprt, rqst),
@@ -517,7 +517,7 @@ TRACE_EVENT(svc_xprt_do_enqueue,
show_svc_xprt_flags(__entry->xprt->xpt_flags))
);

-TRACE_EVENT(svc_xprt_dequeue,
+DECLARE_EVENT_CLASS(svc_xprt_event,
TP_PROTO(struct svc_xprt *xprt),

TP_ARGS(xprt),
@@ -539,6 +539,15 @@ TRACE_EVENT(svc_xprt_dequeue,
show_svc_xprt_flags(__entry->flags))
);

+DEFINE_EVENT(svc_xprt_event, svc_xprt_dequeue,
+ TP_PROTO(struct svc_xprt *xprt), TP_ARGS(xprt));
+
+DEFINE_EVENT(svc_xprt_event, svc_xprt_received,
+ TP_PROTO(struct svc_xprt *xprt), TP_ARGS(xprt));
+
+DEFINE_EVENT(svc_xprt_event, svc_xprt_active,
+ TP_PROTO(struct svc_xprt *xprt), TP_ARGS(xprt));
+
TRACE_EVENT(svc_wake_up,
TP_PROTO(int pid),

@@ -574,6 +583,7 @@ TRACE_EVENT(svc_handle_xprt,
(struct sockaddr *)&__entry->xprt->xpt_remote, __entry->len,
show_svc_xprt_flags(__entry->xprt->xpt_flags))
);
+
#endif /* _TRACE_SUNRPC_H */

#include <trace/define_trace.h>
diff --git a/net/sunrpc/svc_wq.c b/net/sunrpc/svc_wq.c
index e96bbf49c1a0..315a86c7ed10 100644
--- a/net/sunrpc/svc_wq.c
+++ b/net/sunrpc/svc_wq.c
@@ -275,6 +275,7 @@ svc_wq_enqueue_xprt(struct svc_xprt *xprt)
}
out:
svc_xprt_get(xprt);
+ trace_svc_xprt_enqueue(xprt, NULL);
queue_work(serv->sv_wq, &xprt->xpt_work);
}
EXPORT_SYMBOL_GPL(svc_wq_enqueue_xprt);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 30f9fdfbff0c..7dd08f16140f 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -221,6 +221,7 @@ static struct svc_xprt *__svc_xpo_create(struct svc_xprt_class *xcl,
*/
static void svc_xprt_received(struct svc_xprt *xprt)
{
+ trace_svc_xprt_received(xprt);
if (!test_bit(XPT_BUSY, &xprt->xpt_flags)) {
WARN_ONCE(1, "xprt=0x%p already busy!", xprt);
return;
@@ -402,7 +403,7 @@ redo_search:
rqstp = NULL;
put_cpu();
out:
- trace_svc_xprt_do_enqueue(xprt, rqstp);
+ trace_svc_xprt_enqueue(xprt, rqstp);
}
EXPORT_SYMBOL_GPL(svc_xprt_do_enqueue);

@@ -746,6 +747,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
struct svc_serv *serv = rqstp->rq_server;
int len = 0;

+ trace_svc_xprt_active(xprt);
+
if (test_bit(XPT_CLOSE, &xprt->xpt_flags)) {
dprintk("svc_recv: found XPT_CLOSE\n");
svc_delete_xprt(xprt);
--
2.1.0

2014-12-02 18:26:26

by Jeff Layton

Subject: [RFC PATCH 09/14] sunrpc: add basic support for workqueue-based services

Add a new "workqueue" pool mode setting. When that is configured, we'll
set up a svc_pool for each NUMA node, but won't bother with the
pool <=> cpu mapping arrays.

We use an unbound workqueue, which should naturally keep each xprt's
work queued to a CPU within the current NUMA node.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 10 ++-
include/linux/sunrpc/svc_xprt.h | 1 +
include/linux/sunrpc/svcsock.h | 1 +
net/sunrpc/Kconfig | 10 +++
net/sunrpc/Makefile | 1 +
net/sunrpc/svc.c | 15 ++++
net/sunrpc/svc_wq.c | 148 ++++++++++++++++++++++++++++++++++++++++
net/sunrpc/svc_xprt.c | 49 ++++++++++++-
8 files changed, 233 insertions(+), 2 deletions(-)
create mode 100644 net/sunrpc/svc_wq.c

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 8bd53f723485..81e723220346 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -47,6 +47,7 @@ struct svc_pool {
#define SP_TASK_PENDING (0) /* still work to do even if no
* xprt is queued. */
unsigned long sp_flags;
+ struct work_struct sp_work; /* per-pool work struct */
} ____cacheline_aligned_in_smp;

struct svc_serv;
@@ -103,6 +104,7 @@ struct svc_serv {
unsigned int sv_nrpools; /* number of thread pools */
struct svc_pool * sv_pools; /* array of thread pools */
struct svc_serv_ops * sv_ops; /* server operations */
+ struct workqueue_struct *sv_wq; /* workqueue for wq-based services */
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
struct list_head sv_cb_list; /* queue for callback requests
* that arrive over the same
@@ -438,7 +440,8 @@ enum {
SVC_POOL_GLOBAL, /* no mapping, just a single global pool
* (legacy & UP mode) */
SVC_POOL_PERCPU, /* one pool per cpu */
- SVC_POOL_PERNODE /* one pool per numa node */
+ SVC_POOL_PERNODE, /* one pool per numa node */
+ SVC_POOL_WORKQUEUE, /* workqueue-based service */
};

struct svc_pool_map {
@@ -486,6 +489,11 @@ void svc_reserve(struct svc_rqst *rqstp, int space);
struct svc_pool * svc_pool_for_cpu(struct svc_serv *serv, int cpu);
char * svc_print_addr(struct svc_rqst *, char *, size_t);

+#if IS_ENABLED(CONFIG_SUNRPC_SVC_WORKQUEUE)
+int svc_wq_setup(struct svc_serv *, struct svc_pool *, int);
+void svc_wq_enqueue_xprt(struct svc_xprt *);
+#endif
+
#define RPC_MAX_ADDRBUFLEN (63U)

/*
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 096937871cda..ce7fd68a905e 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -117,6 +117,7 @@ void svc_xprt_init(struct net *, struct svc_xprt_class *, struct svc_xprt *,
struct svc_serv *);
int svc_create_xprt(struct svc_serv *, const char *, struct net *,
const int, const unsigned short, int);
+bool svc_xprt_has_something_to_do(struct svc_xprt *xprt);
void svc_xprt_do_enqueue(struct svc_xprt *xprt);
void svc_xprt_enqueue(struct svc_xprt *xprt);
void svc_xprt_put(struct svc_xprt *xprt);
diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 2e780134f449..3ce0a640605d 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -53,6 +53,7 @@ static inline u32 svc_sock_final_rec(struct svc_sock *svsk)
*/
void svc_close_net(struct svc_serv *, struct net *);
int svc_recv(struct svc_rqst *, long);
+int svc_wq_recv(struct svc_rqst *);
int svc_send(struct svc_rqst *);
void svc_drop(struct svc_rqst *);
void svc_sock_update_bufs(struct svc_serv *serv);
diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index fb78117b896c..08e01949bdc5 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -71,3 +71,13 @@ config SUNRPC_XPRT_RDMA_SERVER
choose M here: the module will be called svcrdma.

If unsure, say N.
+
+config SUNRPC_SVC_WORKQUEUE
+ bool "Support for workqueue-based SUNRPC services"
+ depends on SUNRPC
+ default n
+ help
+ Traditional SUNRPC services have required a dedicated thread
+ to handle incoming requests. This option enables support for
+ queueing incoming requests to a workqueue instead, eliminating
+ the need for a dedicated thread pool.
diff --git a/net/sunrpc/Makefile b/net/sunrpc/Makefile
index 15e6f6c23c5d..401341123fdd 100644
--- a/net/sunrpc/Makefile
+++ b/net/sunrpc/Makefile
@@ -18,3 +18,4 @@ sunrpc-$(CONFIG_SUNRPC_DEBUG) += debugfs.o
sunrpc-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel_rqst.o bc_svc.o
sunrpc-$(CONFIG_PROC_FS) += stats.o
sunrpc-$(CONFIG_SYSCTL) += sysctl.o
+sunrpc-$(CONFIG_SUNRPC_SVC_WORKQUEUE) += svc_wq.o
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 135ffbe9d983..7c8e33923210 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -71,6 +71,10 @@ param_set_pool_mode(const char *val, struct kernel_param *kp)
*ip = SVC_POOL_PERCPU;
else if (!strncmp(val, "pernode", 7))
*ip = SVC_POOL_PERNODE;
+#if IS_ENABLED(CONFIG_SUNRPC_SVC_WORKQUEUE)
+ else if (!strncmp(val, "workqueue", 9))
+ *ip = SVC_POOL_WORKQUEUE;
+#endif
else
err = -EINVAL;

@@ -94,6 +98,8 @@ param_get_pool_mode(char *buf, struct kernel_param *kp)
return strlcpy(buf, "percpu", 20);
case SVC_POOL_PERNODE:
return strlcpy(buf, "pernode", 20);
+ case SVC_POOL_WORKQUEUE:
+ return strlcpy(buf, "workqueue", 20);
default:
return sprintf(buf, "%d", *ip);
}
@@ -242,6 +248,10 @@ svc_pool_map_get(void)
case SVC_POOL_PERNODE:
npools = svc_pool_map_init_pernode(m);
break;
+ case SVC_POOL_WORKQUEUE:
+ /* workqueues get a pool per numa node, but don't need a map */
+ npools = nr_node_ids;
+ break;
}

if (npools < 0) {
@@ -534,6 +544,11 @@ svc_destroy(struct svc_serv *serv)
if (svc_serv_is_pooled(serv))
svc_pool_map_put();

+ if (serv->sv_wq) {
+ destroy_workqueue(serv->sv_wq);
+ module_put(serv->sv_ops->svo_module);
+ }
+
kfree(serv->sv_pools);
kfree(serv);
}
diff --git a/net/sunrpc/svc_wq.c b/net/sunrpc/svc_wq.c
new file mode 100644
index 000000000000..d4720ecd0b32
--- /dev/null
+++ b/net/sunrpc/svc_wq.c
@@ -0,0 +1,148 @@
+/*
+ * svc_wq - support for workqueue-based rpc svcs
+ */
+
+#include <linux/sched.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/sunrpc/stats.h>
+#include <linux/sunrpc/svc_xprt.h>
+#include <linux/module.h>
+#include <linux/workqueue.h>
+#include <trace/events/sunrpc.h>
+
+/*
+ * This workqueue job should run on each node when the workqueue is created. It
+ * walks the list of xprts for its node, and queues the workqueue job for each.
+ */
+static void
+process_queued_xprt_work(struct work_struct *work)
+{
+ struct svc_pool *pool = container_of(work, struct svc_pool, sp_work);
+
+ spin_lock_bh(&pool->sp_lock);
+ while (!list_empty(&pool->sp_sockets)) {
+ struct svc_xprt *xprt = list_first_entry(&pool->sp_sockets,
+ struct svc_xprt, xpt_ready);
+
+ list_del_init(&xprt->xpt_ready);
+ svc_xprt_get(xprt);
+ queue_work(xprt->xpt_server->sv_wq, &xprt->xpt_work);
+ }
+ spin_unlock_bh(&pool->sp_lock);
+}
+
+/*
+ * If any svc_xprts are enqueued before the workqueue is available, they get
+ * added to the pool->sp_sockets list. When the workqueue becomes available,
+ * we must walk the list for each pool and queue each xprt to the workqueue.
+ *
+ * In order to minimize inter-node communication, we queue a separate job for
+ * each node to walk its own list. We queue this job to any cpu in the node.
+ * Since the workqueues are unbound they'll end up queued to the pool_workqueue
+ * for their corresponding node, and not necessarily to the given CPU.
+ */
+static void
+process_queued_xprts(struct svc_serv *serv)
+{
+ int node;
+
+ for (node = 0; node < serv->sv_nrpools; ++node) {
+ int cpu = any_online_cpu(*cpumask_of_node(node));
+ struct svc_pool *pool = &serv->sv_pools[node];
+
+ INIT_WORK(&pool->sp_work, process_queued_xprt_work);
+ queue_work_on(cpu, serv->sv_wq, &pool->sp_work);
+ }
+}
+
+/*
+ * Start up or shut down a workqueue-based RPC service. Basically, we use this
+ * to allocate the workqueue. The function assumes that the caller holds one
+ * serv->sv_nrthreads reference.
+ */
+int
+svc_wq_setup(struct svc_serv *serv, struct svc_pool *pool, int max_active)
+{
+ int nrthreads = serv->sv_nrthreads - 1; /* -1 for caller's reference */
+
+ WARN_ON_ONCE(nrthreads < 0);
+
+ /*
+ * We don't allow tuning max_active on a per-node basis. If we got here
+ * via the pool_threads interface, then just return an error.
+ */
+ if (pool)
+ return -EINVAL;
+
+ if (!nrthreads) {
+ /* svc is down and none requested? */
+ if (!max_active)
+ return 0;
+ __module_get(serv->sv_ops->svo_module);
+ serv->sv_wq = alloc_workqueue("%s",
+ WQ_UNBOUND|WQ_FREEZABLE|WQ_SYSFS,
+ max_active, serv->sv_name);
+ if (!serv->sv_wq) {
+ module_put(serv->sv_ops->svo_module);
+ return -ENOMEM;
+ }
+ process_queued_xprts(serv);
+ } else {
+ /*
+ * If max_active is 0, then that means we're taking the service
+ * down. Don't destroy the workqueue just yet, as we need it
+ * to process the closing of the xprts.
+ */
+ if (max_active)
+ workqueue_set_max_active(serv->sv_wq, max_active);
+ }
+
+ /* +1 for caller's reference */
+ serv->sv_nrthreads = max_active + 1;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(svc_wq_setup);
+
+/*
+ * A svc_xprt needs to be serviced. Queue its workqueue job and return. In the
+ * event that the workqueue isn't available yet, add it to the sp_sockets list
+ * so that it can be processed when it does become available.
+ */
+void
+svc_wq_enqueue_xprt(struct svc_xprt *xprt)
+{
+ struct svc_serv *serv = xprt->xpt_server;
+
+ if (!svc_xprt_has_something_to_do(xprt))
+ return;
+
+ /* Don't enqueue transport while already enqueued */
+ if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
+ return;
+
+ /* No workqueue yet? Queue the socket until there is one. */
+ if (!serv->sv_wq) {
+ struct svc_pool *pool = &serv->sv_pools[numa_node_id()];
+
+ spin_lock_bh(&pool->sp_lock);
+
+ /*
+ * It's possible for the workqueue to be started up between
+ * when we checked for it before but before we took the lock.
+ * Check again while holding lock to avoid that potential race.
+ */
+ if (serv->sv_wq) {
+ spin_unlock_bh(&pool->sp_lock);
+ goto out;
+ }
+
+ list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
+ spin_unlock_bh(&pool->sp_lock);
+ return;
+ }
+out:
+ svc_xprt_get(xprt);
+ queue_work(serv->sv_wq, &xprt->xpt_work);
+}
+EXPORT_SYMBOL_GPL(svc_wq_enqueue_xprt);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 63b42a8578c0..30f9fdfbff0c 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -313,7 +313,7 @@ char *svc_print_addr(struct svc_rqst *rqstp, char *buf, size_t len)
}
EXPORT_SYMBOL_GPL(svc_print_addr);

-static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
+bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
{
if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
return true;
@@ -850,6 +850,53 @@ out:
}
EXPORT_SYMBOL_GPL(svc_recv);

+#if IS_ENABLED(CONFIG_SUNRPC_SVC_WORKQUEUE)
+/*
+ * Perform a receive off the rqstp->rq_xprt socket.
+ *
+ * This function is a bit different from the standard svc_recv function as it
+ * assumes that the xprt is already provided in rqstp->rq_xprt, and so it
+ * does not sleep when there is no more work to be done.
+ */
+int
+svc_wq_recv(struct svc_rqst *rqstp)
+{
+ int len, err;
+ struct svc_xprt *xprt = rqstp->rq_xprt;
+ struct svc_serv *serv = xprt->xpt_server;
+
+ err = svc_alloc_arg(rqstp);
+ if (err)
+ goto out;
+
+ len = svc_handle_xprt(rqstp, xprt);
+ if (len <= 0) {
+ err = -EAGAIN;
+ goto out_release;
+ }
+
+ clear_bit(XPT_OLD, &xprt->xpt_flags);
+
+ if (xprt->xpt_ops->xpo_secure_port(rqstp))
+ set_bit(RQ_SECURE, &rqstp->rq_flags);
+ else
+ clear_bit(RQ_SECURE, &rqstp->rq_flags);
+ rqstp->rq_chandle.defer = svc_defer;
+ rqstp->rq_xid = svc_getu32(&rqstp->rq_arg.head[0]);
+
+ if (serv->sv_stats)
+ serv->sv_stats->netcnt++;
+ trace_svc_recv(rqstp, len);
+ return len;
+out_release:
+ rqstp->rq_res.len = 0;
+ svc_xprt_release(rqstp);
+out:
+ return err;
+}
+EXPORT_SYMBOL_GPL(svc_wq_recv);
+#endif
+
/*
* Drop request
*/
--
2.1.0

2014-12-02 18:26:22

by Jeff Layton

Subject: [RFC PATCH 11/14] nfsd: add support for workqueue based service processing

Allow nfsd to create a workqueue-based service if the sunrpc pool_mode
is set to "workqueue".

Signed-off-by: Jeff Layton <[email protected]>
---
fs/fs_struct.c | 3 +-
fs/nfsd/nfssvc.c | 77 +++++++++++++++++++++++++++++++++++++++++++++--
include/linux/fs_struct.h | 1 +
3 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 9bc08ea2f433..68985c76cd75 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -130,7 +130,7 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
EXPORT_SYMBOL_GPL(copy_fs_struct);

/* Replace current fs struct with one given. Return a pointer to old one. */
-static struct fs_struct *
+struct fs_struct *
swap_fs_struct(struct fs_struct *new_fs)
{
struct fs_struct *old_fs;
@@ -142,6 +142,7 @@ swap_fs_struct(struct fs_struct *new_fs)

return old_fs;
}
+EXPORT_SYMBOL_GPL(swap_fs_struct);

/* Put a reference to a fs_struct. */
void put_fs_struct(struct fs_struct *fs)
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index f37bd7db2176..2c7ebced0311 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -28,6 +28,10 @@
extern struct svc_program nfsd_program;
static int nfsd(void *vrqstp);

+#if IS_ENABLED(CONFIG_SUNRPC_SVC_WORKQUEUE)
+static struct svc_serv_ops nfsd_wq_sv_ops;
+#endif
+
/*
* nfsd_mutex protects nn->nfsd_serv -- both the pointer itself and the members
* of the svc_serv struct. In particular, ->sv_nrthreads but also to some
@@ -398,10 +402,27 @@ static struct svc_serv_ops nfsd_thread_sv_ops = {
.svo_module = THIS_MODULE,
};

+#if IS_ENABLED(CONFIG_SUNRPC_SVC_WORKQUEUE)
+static struct svc_serv_ops *
+select_svc_serv_ops(void)
+{
+ if (svc_pool_map.mode == SVC_POOL_WORKQUEUE)
+ return &nfsd_wq_sv_ops;
+ return &nfsd_thread_sv_ops;
+}
+#else
+static struct svc_serv_ops *
+select_svc_serv_ops(void)
+{
+ return &nfsd_thread_sv_ops;
+}
+#endif
+
int nfsd_create_serv(struct net *net)
{
int error;
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+ struct svc_serv_ops *ops;

WARN_ON(!mutex_is_locked(&nfsd_mutex));
if (nn->nfsd_serv) {
@@ -411,8 +432,11 @@ int nfsd_create_serv(struct net *net)
if (nfsd_max_blksize == 0)
nfsd_max_blksize = nfsd_get_default_max_blksize();
nfsd_reset_versions();
- nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- &nfsd_thread_sv_ops);
+
+ svc_pool_map_get();
+ ops = select_svc_serv_ops();
+ nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize, ops);
+ svc_pool_map_put();
if (nn->nfsd_serv == NULL)
return -ENOMEM;

@@ -643,6 +667,55 @@ nfsd(void *vrqstp)
return 0;
}

+#if IS_ENABLED(CONFIG_SUNRPC_SVC_WORKQUEUE)
+/* work function for workqueue-based nfsd */
+static void
+nfsd_work(struct work_struct *work)
+{
+ int node = numa_node_id();
+ struct svc_xprt *xprt = container_of(work, struct svc_xprt, xpt_work);
+ struct net *net = xprt->xpt_net;
+ struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+ struct svc_serv *serv = xprt->xpt_server;
+ struct svc_rqst *rqstp;
+ struct fs_struct *saved_fs;
+ int err;
+
+ rqstp = svc_rqst_alloc(serv, &serv->sv_pools[node], node);
+ if (!rqstp) {
+ /* Alloc failure. Give up for now, and requeue the work */
+ queue_work(serv->sv_wq, &xprt->xpt_work);
+ return;
+ }
+
+ rqstp->rq_xprt = xprt;
+
+ get_fs_struct(rqstp->rq_fs);
+ saved_fs = swap_fs_struct(rqstp->rq_fs);
+
+ rqstp->rq_server->sv_maxconn = nn->max_connections;
+ err = svc_wq_recv(rqstp);
+ if (err >= 0) {
+ validate_process_creds();
+ svc_process(rqstp);
+ validate_process_creds();
+ }
+
+ saved_fs = swap_fs_struct(saved_fs);
+ put_fs_struct(saved_fs);
+
+ svc_rqst_free(rqstp);
+}
+
+static struct svc_serv_ops nfsd_wq_sv_ops = {
+ .svo_shutdown = nfsd_last_thread,
+ .svo_enqueue_xprt = svc_wq_enqueue_xprt,
+ .svo_xprt_work = nfsd_work,
+ .svo_setup = svc_wq_setup,
+ .svo_module = THIS_MODULE,
+};
+#endif
+
static __be32 map_new_errors(u32 vers, __be32 nfserr)
{
if (nfserr == nfserr_jukebox && vers == 2)
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index d2b7a1942790..671c49f13396 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -20,6 +20,7 @@ extern void exit_fs(struct task_struct *);
extern void set_fs_root(struct fs_struct *, const struct path *);
extern void set_fs_pwd(struct fs_struct *, const struct path *);
extern struct fs_struct *copy_fs_struct(struct fs_struct *);
+extern struct fs_struct *swap_fs_struct(struct fs_struct *);
extern void free_fs_struct(struct fs_struct *);
extern void replace_fs_struct(struct fs_struct *);
extern int unshare_fs_struct(void);
--
2.1.0

2014-12-02 18:26:56

by Jeff Layton

Subject: [RFC PATCH 08/14] sunrpc: set up workqueue function in svc_xprt

For workqueue-based services we'll want to queue a workqueue job
whenever an xprt needs to be serviced. Add a work_struct to svc_xprt
and initialize it from a new work_func_t field in the svc_serv_ops.
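
A workqueue-based service opts in by filling in that field (a condensed
sketch; nfsd's actual version appears in a later patch in the series):

static void nfsd_work(struct work_struct *work)
{
	struct svc_xprt *xprt = container_of(work, struct svc_xprt,
					     xpt_work);

	/* ...receive and process traffic on this xprt... */
}

static struct svc_serv_ops nfsd_wq_sv_ops = {
	.svo_xprt_work	= nfsd_work,
	/* ...other ops... */
};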

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 3 +++
include/linux/sunrpc/svc_xprt.h | 1 +
net/sunrpc/svc.c | 2 +-
net/sunrpc/svc_xprt.c | 2 ++
4 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 3d2113222dda..8bd53f723485 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -58,6 +58,9 @@ struct svc_serv_ops {
/* function for service threads to run */
int (*svo_function)(void *);

+ /* xprt work function */
+ work_func_t svo_xprt_work;
+
/* queue up a transport for servicing */
void (*svo_enqueue_xprt)(struct svc_xprt *);

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 78512cfe1fe6..096937871cda 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -53,6 +53,7 @@ struct svc_xprt {
struct kref xpt_ref;
struct list_head xpt_list;
struct list_head xpt_ready;
+ struct work_struct xpt_work;
unsigned long xpt_flags;
#define XPT_BUSY 0 /* enqueued/receiving */
#define XPT_CONN 1 /* conn pending */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 165fa1803a0a..135ffbe9d983 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -34,7 +34,7 @@

static void svc_unregister(const struct svc_serv *serv, struct net *net);

-#define svc_serv_is_pooled(serv) ((serv)->sv_ops->svo_function)
+#define svc_serv_is_pooled(serv) ((serv)->sv_ops->svo_function || (serv)->sv_ops->svo_xprt_work)

#define SVC_POOL_DEFAULT SVC_POOL_GLOBAL

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 4954339315b7..63b42a8578c0 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -157,6 +157,8 @@ void svc_xprt_init(struct net *net, struct svc_xprt_class *xcl,
xprt->xpt_ops = xcl->xcl_ops;
kref_init(&xprt->xpt_ref);
xprt->xpt_server = serv;
+ if (serv->sv_ops->svo_xprt_work)
+ INIT_WORK(&xprt->xpt_work, serv->sv_ops->svo_xprt_work);
INIT_LIST_HEAD(&xprt->xpt_list);
INIT_LIST_HEAD(&xprt->xpt_ready);
INIT_LIST_HEAD(&xprt->xpt_deferred);
--
2.1.0

2014-12-02 18:27:40

by Jeff Layton

Subject: [RFC PATCH 07/14] sunrpc: factor svc_rqst allocation and freeing from sv_nrthreads refcounting

In later patches, we'll want to be able to allocate and free svc_rqst
structures without monkeying with the serv->sv_nrthreads refcount.

Factor those pieces out of their respective functions.

Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/svc.h | 3 +++
net/sunrpc/svc.c | 58 ++++++++++++++++++++++++++++++----------------
2 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 28e5f5716a87..3d2113222dda 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -458,8 +458,11 @@ void svc_rpcb_cleanup(struct svc_serv *serv, struct net *net);
int svc_bind(struct svc_serv *serv, struct net *net);
struct svc_serv *svc_create(struct svc_program *, unsigned int,
struct svc_serv_ops *);
+struct svc_rqst *svc_rqst_alloc(struct svc_serv *serv,
+ struct svc_pool *pool, int node);
struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool, int node);
+void svc_rqst_free(struct svc_rqst *);
void svc_exit_thread(struct svc_rqst *);
unsigned int svc_pool_map_get(void);
void svc_pool_map_put(void);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 04c083a53121..165fa1803a0a 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -583,40 +583,52 @@ svc_release_buffer(struct svc_rqst *rqstp)
}

struct svc_rqst *
-svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
+svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
{
struct svc_rqst *rqstp;

rqstp = kzalloc_node(sizeof(*rqstp), GFP_KERNEL, node);
if (!rqstp)
- goto out_enomem;
+ return rqstp;

- serv->sv_nrthreads++;
__set_bit(RQ_BUSY, &rqstp->rq_flags);
spin_lock_init(&rqstp->rq_lock);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;
- spin_lock_bh(&pool->sp_lock);
- pool->sp_nrthreads++;
- list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
- spin_unlock_bh(&pool->sp_lock);

rqstp->rq_argp = kmalloc_node(serv->sv_xdrsize, GFP_KERNEL, node);
if (!rqstp->rq_argp)
- goto out_thread;
+ goto out_enomem;

rqstp->rq_resp = kmalloc_node(serv->sv_xdrsize, GFP_KERNEL, node);
if (!rqstp->rq_resp)
- goto out_thread;
+ goto out_enomem;

if (!svc_init_buffer(rqstp, serv->sv_max_mesg, node))
- goto out_thread;
+ goto out_enomem;

return rqstp;
-out_thread:
- svc_exit_thread(rqstp);
out_enomem:
- return ERR_PTR(-ENOMEM);
+ svc_rqst_free(rqstp);
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(svc_rqst_alloc);
+
+struct svc_rqst *
+svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
+{
+ struct svc_rqst *rqstp;
+
+ rqstp = svc_rqst_alloc(serv, pool, node);
+ if (!rqstp)
+ return ERR_PTR(-ENOMEM);
+
+ serv->sv_nrthreads++;
+ spin_lock_bh(&pool->sp_lock);
+ pool->sp_nrthreads++;
+ list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
+ spin_unlock_bh(&pool->sp_lock);
+ return rqstp;
}
EXPORT_SYMBOL_GPL(svc_prepare_thread);

@@ -747,19 +759,25 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
EXPORT_SYMBOL_GPL(svc_set_num_threads);

/*
- * Called from a server thread as it's exiting. Caller must hold the BKL or
- * the "service mutex", whichever is appropriate for the service.
+ * Called from a server thread as it's exiting. Caller must hold the service's
+ * mutex.
*/
void
-svc_exit_thread(struct svc_rqst *rqstp)
+svc_rqst_free(struct svc_rqst *rqstp)
{
- struct svc_serv *serv = rqstp->rq_server;
- struct svc_pool *pool = rqstp->rq_pool;
-
svc_release_buffer(rqstp);
kfree(rqstp->rq_resp);
kfree(rqstp->rq_argp);
kfree(rqstp->rq_auth_data);
+ kfree_rcu(rqstp, rq_rcu_head);
+}
+EXPORT_SYMBOL_GPL(svc_rqst_free);
+
+void
+svc_exit_thread(struct svc_rqst *rqstp)
+{
+ struct svc_serv *serv = rqstp->rq_server;
+ struct svc_pool *pool = rqstp->rq_pool;

spin_lock_bh(&pool->sp_lock);
pool->sp_nrthreads--;
@@ -767,7 +785,7 @@ svc_exit_thread(struct svc_rqst *rqstp)
list_del_rcu(&rqstp->rq_all);
spin_unlock_bh(&pool->sp_lock);

- kfree_rcu(rqstp, rq_rcu_head);
+ svc_rqst_free(rqstp);

/* Release the server */
if (serv)
--
2.1.0

2014-12-02 18:28:00

by Jeff Layton

[permalink] [raw]
Subject: [RFC PATCH 05/14] sunrpc: abstract out svc_set_num_threads to sv_ops

Add an operation that will do setup of the service. In the case of a
classic thread-based service that means starting up threads. In the case
of a workqueue-based service, the setup will do something different.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/nfssvc.c | 8 +++++---
include/linux/sunrpc/svc.h | 3 +++
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 5904c06cfd32..71e7b180c0d9 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -394,6 +394,7 @@ static struct svc_serv_ops nfsd_thread_sv_ops = {
.svo_shutdown = nfsd_last_thread,
.svo_function = nfsd,
.svo_enqueue_xprt = svc_xprt_do_enqueue,
+ .svo_setup = svc_set_num_threads,
.svo_module = THIS_MODULE,
};

@@ -506,8 +507,8 @@ int nfsd_set_nrthreads(int n, int *nthreads, struct net *net)
/* apply the new numbers */
svc_get(nn->nfsd_serv);
for (i = 0; i < n; i++) {
- err = svc_set_num_threads(nn->nfsd_serv, &nn->nfsd_serv->sv_pools[i],
- nthreads[i]);
+ err = nn->nfsd_serv->sv_ops->svo_setup(nn->nfsd_serv,
+ &nn->nfsd_serv->sv_pools[i], nthreads[i]);
if (err)
break;
}
@@ -546,7 +547,8 @@ nfsd_svc(int nrservs, struct net *net)
error = nfsd_startup_net(nrservs, net);
if (error)
goto out_destroy;
- error = svc_set_num_threads(nn->nfsd_serv, NULL, nrservs);
+ error = nn->nfsd_serv->sv_ops->svo_setup(nn->nfsd_serv,
+ NULL, nrservs);
if (error)
goto out_shutdown;
/* We are holding a reference to nn->nfsd_serv which
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 7429cdcb48b5..5e172c8329f8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -61,6 +61,9 @@ struct svc_serv_ops {
/* queue up a transport for servicing */
void (*svo_enqueue_xprt)(struct svc_xprt *);

+ /* set up thread (or whatever) execution context */
+ int (*svo_setup)(struct svc_serv *, struct svc_pool *, int);
+
/* optional module to count when adding threads (pooled svcs only) */
struct module *svo_module;
};
--
2.1.0

2014-12-02 18:28:25

by Jeff Layton

[permalink] [raw]
Subject: [RFC PATCH 04/14] sunrpc: turn enqueueing a svc_xprt into a svc_serv operation

For now, all services use svc_xprt_do_enqueue, but once we add
workqueue-based service support, we'll need to do something different.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 3 ++-
fs/nfs/callback.c | 1 +
fs/nfsd/nfssvc.c | 11 ++++++-----
include/linux/sunrpc/svc.h | 3 +++
include/linux/sunrpc/svc_xprt.h | 1 +
net/sunrpc/svc_xprt.c | 10 +++++-----
6 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index d93a8d0a04bb..b6bdba25d4fb 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -327,7 +327,8 @@ out_rqst:
}

static struct svc_serv_ops lockd_sv_ops = {
- .svo_shutdown = svc_rpcb_cleanup,
+ .svo_shutdown = svc_rpcb_cleanup,
+ .svo_enqueue_xprt = svc_xprt_do_enqueue,
};

static struct svc_serv *lockd_create_svc(void)
diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index ace6c70f807b..1800a918f244 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -307,6 +307,7 @@ err_bind:
}

static struct svc_serv_ops nfs_cb_sv_ops = {
+ .svo_enqueue_xprt = svc_xprt_do_enqueue,
};

static struct svc_serv *nfs_callback_create_svc(int minorversion)
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 94e7e173aa10..5904c06cfd32 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -390,10 +390,11 @@ static int nfsd_get_default_max_blksize(void)
return ret;
}

-static struct svc_serv_ops nfsd_sv_ops = {
- .svo_shutdown = nfsd_last_thread,
- .svo_function = nfsd,
- .svo_module = THIS_MODULE,
+static struct svc_serv_ops nfsd_thread_sv_ops = {
+ .svo_shutdown = nfsd_last_thread,
+ .svo_function = nfsd,
+ .svo_enqueue_xprt = svc_xprt_do_enqueue,
+ .svo_module = THIS_MODULE,
};

int nfsd_create_serv(struct net *net)
@@ -410,7 +411,7 @@ int nfsd_create_serv(struct net *net)
nfsd_max_blksize = nfsd_get_default_max_blksize();
nfsd_reset_versions();
nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- &nfsd_sv_ops);
+ &nfsd_thread_sv_ops);
if (nn->nfsd_serv == NULL)
return -ENOMEM;

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 377be73b736f..7429cdcb48b5 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -58,6 +58,9 @@ struct svc_serv_ops {
/* function for service threads to run */
int (*svo_function)(void *);

+ /* queue up a transport for servicing */
+ void (*svo_enqueue_xprt)(struct svc_xprt *);
+
/* optional module to count when adding threads (pooled svcs only) */
struct module *svo_module;
};
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 79f6f8f3dc0a..78512cfe1fe6 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -116,6 +116,7 @@ void svc_xprt_init(struct net *, struct svc_xprt_class *, struct svc_xprt *,
struct svc_serv *);
int svc_create_xprt(struct svc_serv *, const char *, struct net *,
const int, const unsigned short, int);
+void svc_xprt_do_enqueue(struct svc_xprt *xprt);
void svc_xprt_enqueue(struct svc_xprt *xprt);
void svc_xprt_put(struct svc_xprt *xprt);
void svc_xprt_copy_addrs(struct svc_rqst *rqstp, struct svc_xprt *xprt);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index c69358b3cf7f..4954339315b7 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -24,7 +24,6 @@ static int svc_deferred_recv(struct svc_rqst *rqstp);
static struct cache_deferred_req *svc_defer(struct cache_req *req);
static void svc_age_temp_xprts(unsigned long closure);
static void svc_delete_xprt(struct svc_xprt *xprt);
-static void svc_xprt_do_enqueue(struct svc_xprt *xprt);

/* apparently the "standard" is that clients close
* idle connections after 5 minutes, servers after
@@ -226,12 +225,12 @@ static void svc_xprt_received(struct svc_xprt *xprt)
}

/* As soon as we clear busy, the xprt could be closed and
- * 'put', so we need a reference to call svc_xprt_do_enqueue with:
+ * 'put', so we need a reference to call svc_enqueue_xprt with:
*/
svc_xprt_get(xprt);
smp_mb__before_atomic();
clear_bit(XPT_BUSY, &xprt->xpt_flags);
- svc_xprt_do_enqueue(xprt);
+ xprt->xpt_server->sv_ops->svo_enqueue_xprt(xprt);
svc_xprt_put(xprt);
}

@@ -321,7 +320,7 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
return false;
}

-static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
+void svc_xprt_do_enqueue(struct svc_xprt *xprt)
{
struct svc_pool *pool;
struct svc_rqst *rqstp = NULL;
@@ -403,6 +402,7 @@ redo_search:
out:
trace_svc_xprt_do_enqueue(xprt, rqstp);
}
+EXPORT_SYMBOL_GPL(svc_xprt_do_enqueue);

/*
* Queue up a transport with data pending. If there are idle nfsd
@@ -413,7 +413,7 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
{
if (test_bit(XPT_BUSY, &xprt->xpt_flags))
return;
- svc_xprt_do_enqueue(xprt);
+ xprt->xpt_server->sv_ops->svo_enqueue_xprt(xprt);
}
EXPORT_SYMBOL_GPL(svc_xprt_enqueue);

--
2.1.0

2014-12-02 18:28:57

by Jeff Layton

[permalink] [raw]
Subject: [RFC PATCH 01/14] sunrpc: add a new svc_serv_ops struct and move sv_shutdown into it

In later patches we'll need to abstract out more operations on a
per-service level, besides sv_shutdown and sv_function.

Declare a new svc_serv_ops struct to hold these operations, and move
sv_shutdown into this struct.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/svc.c | 6 +++++-
fs/nfs/callback.c | 5 ++++-
fs/nfsd/nfssvc.c | 6 +++++-
include/linux/sunrpc/svc.h | 20 ++++++++++----------
net/sunrpc/svc.c | 18 +++++++++---------
5 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index e94c887da2d7..d93a8d0a04bb 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -326,6 +326,10 @@ out_rqst:
return error;
}

+static struct svc_serv_ops lockd_sv_ops = {
+ .svo_shutdown = svc_rpcb_cleanup,
+};
+
static struct svc_serv *lockd_create_svc(void)
{
struct svc_serv *serv;
@@ -350,7 +354,7 @@ static struct svc_serv *lockd_create_svc(void)
printk(KERN_WARNING
"lockd_up: no pid, %d users??\n", nlmsvc_users);

- serv = svc_create(&nlmsvc_program, LOCKD_BUFSIZE, svc_rpcb_cleanup);
+ serv = svc_create(&nlmsvc_program, LOCKD_BUFSIZE, &lockd_sv_ops);
if (!serv) {
printk(KERN_WARNING "lockd_up: create service failed\n");
return ERR_PTR(-ENOMEM);
diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index b8fb3a4ef649..ace6c70f807b 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -306,6 +306,9 @@ err_bind:
return ret;
}

+static struct svc_serv_ops nfs_cb_sv_ops = {
+};
+
static struct svc_serv *nfs_callback_create_svc(int minorversion)
{
struct nfs_callback_data *cb_info = &nfs_callback_info[minorversion];
@@ -331,7 +334,7 @@ static struct svc_serv *nfs_callback_create_svc(int minorversion)
printk(KERN_WARNING "nfs_callback_create_svc: no kthread, %d users??\n",
cb_info->users);

- serv = svc_create(&nfs4_callback_program, NFS4_CALLBACK_BUFSIZE, NULL);
+ serv = svc_create(&nfs4_callback_program, NFS4_CALLBACK_BUFSIZE, &nfs_cb_sv_ops);
if (!serv) {
printk(KERN_ERR "nfs_callback_create_svc: create service failed\n");
return ERR_PTR(-ENOMEM);
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 314f5c8f8f1a..cc5dd1227732 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -390,6 +390,10 @@ static int nfsd_get_default_max_blksize(void)
return ret;
}

+static struct svc_serv_ops nfsd_sv_ops = {
+ .svo_shutdown = nfsd_last_thread,
+};
+
int nfsd_create_serv(struct net *net)
{
int error;
@@ -404,7 +408,7 @@ int nfsd_create_serv(struct net *net)
nfsd_max_blksize = nfsd_get_default_max_blksize();
nfsd_reset_versions();
nn->nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
- nfsd_last_thread, nfsd, THIS_MODULE);
+ &nfsd_sv_ops, nfsd, THIS_MODULE);
if (nn->nfsd_serv == NULL)
return -ENOMEM;

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 6f22cfeef5e3..35faea625d9f 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -54,6 +54,13 @@ struct svc_pool {
unsigned long sp_flags;
} ____cacheline_aligned_in_smp;

+struct svc_serv;
+
+struct svc_serv_ops {
+ /* Callback to use when last thread exits. */
+ void (*svo_shutdown)(struct svc_serv *serv, struct net *net);
+};
+
/*
* RPC service.
*
@@ -85,13 +92,7 @@ struct svc_serv {

unsigned int sv_nrpools; /* number of thread pools */
struct svc_pool * sv_pools; /* array of thread pools */
-
- void (*sv_shutdown)(struct svc_serv *serv,
- struct net *net);
- /* Callback to use when last thread
- * exits.
- */
-
+ struct svc_serv_ops * sv_ops; /* server operations */
struct module * sv_module; /* optional module to count when
* adding threads */
svc_thread_fn sv_function; /* main function for threads */
@@ -429,13 +430,12 @@ int svc_rpcb_setup(struct svc_serv *serv, struct net *net);
void svc_rpcb_cleanup(struct svc_serv *serv, struct net *net);
int svc_bind(struct svc_serv *serv, struct net *net);
struct svc_serv *svc_create(struct svc_program *, unsigned int,
- void (*shutdown)(struct svc_serv *, struct net *net));
+ struct svc_serv_ops *);
struct svc_rqst *svc_prepare_thread(struct svc_serv *serv,
struct svc_pool *pool, int node);
void svc_exit_thread(struct svc_rqst *);
struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
- void (*shutdown)(struct svc_serv *, struct net *net),
- svc_thread_fn, struct module *);
+ struct svc_serv_ops *, svc_thread_fn, struct module *);
int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
int svc_pool_stats_open(struct svc_serv *serv, struct file *file);
void svc_destroy(struct svc_serv *);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 4308881d9d0a..057185e11261 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -423,7 +423,7 @@ EXPORT_SYMBOL_GPL(svc_bind);
*/
static struct svc_serv *
__svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
- void (*shutdown)(struct svc_serv *serv, struct net *net))
+ struct svc_serv_ops *ops)
{
struct svc_serv *serv;
unsigned int vers;
@@ -440,7 +440,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
bufsize = RPCSVC_MAXPAYLOAD;
serv->sv_max_payload = bufsize? bufsize : 4096;
serv->sv_max_mesg = roundup(serv->sv_max_payload + PAGE_SIZE, PAGE_SIZE);
- serv->sv_shutdown = shutdown;
+ serv->sv_ops = ops;
xdrsize = 0;
while (prog) {
prog->pg_lovers = prog->pg_nvers-1;
@@ -486,21 +486,21 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,

struct svc_serv *
svc_create(struct svc_program *prog, unsigned int bufsize,
- void (*shutdown)(struct svc_serv *serv, struct net *net))
+ struct svc_serv_ops *ops)
{
- return __svc_create(prog, bufsize, /*npools*/1, shutdown);
+ return __svc_create(prog, bufsize, /*npools*/1, ops);
}
EXPORT_SYMBOL_GPL(svc_create);

struct svc_serv *
svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
- void (*shutdown)(struct svc_serv *serv, struct net *net),
- svc_thread_fn func, struct module *mod)
+ struct svc_serv_ops *ops, svc_thread_fn func,
+ struct module *mod)
{
struct svc_serv *serv;
unsigned int npools = svc_pool_map_get();

- serv = __svc_create(prog, bufsize, npools, shutdown);
+ serv = __svc_create(prog, bufsize, npools, ops);
if (!serv)
goto out_err;

@@ -517,8 +517,8 @@ void svc_shutdown_net(struct svc_serv *serv, struct net *net)
{
svc_close_net(serv, net);

- if (serv->sv_shutdown)
- serv->sv_shutdown(serv, net);
+ if (serv->sv_ops->svo_shutdown)
+ serv->sv_ops->svo_shutdown(serv, net);
}
EXPORT_SYMBOL_GPL(svc_shutdown_net);

--
2.1.0

2014-12-02 19:18:19

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

Hello, Jeff.

On Tue, Dec 02, 2014 at 01:24:09PM -0500, Jeff Layton wrote:
> 2) get some insight about the latency from those with a better
> understanding of the CMWQ code. Any thoughts as to why we might be
> seeing such high latency here? Any ideas of what we can do about it?

The latency is prolly from concurrency management. Work items which
participate in concurrency management (the ones on per-cpu workqueues
w/o WQ_CPU_INTENSIVE set) tend to get penalized on latency side quite
a bit as the "run" durations for all such work items end up being
serialized on the cpu. Setting WQ_CPU_INTENSIVE on the workqueue
disables concurrency management and so does making the workqueue
unbound. If strict cpu locality is likely to be beneficial and each
work item isn't likely to consume huge amount of cpu cycles,
WQ_CPU_INTENSIVE would fit better; otherwise, WQ_UNBOUND to let the
scheduler do its thing.
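
For example, the difference is only the flag passed at creation time.
A minimal sketch, with "nfsd" as a stand-in name:

-------------------------[snip]--------------------------
#include <linux/workqueue.h>

static struct workqueue_struct *nfsd_wq;

static int nfsd_alloc_wq(void)
{
	/* per-cpu execution, but exempt from concurrency management */
	nfsd_wq = alloc_workqueue("nfsd", WQ_CPU_INTENSIVE, 0);

	/* ...or unbound, letting the scheduler place workers freely: */
	/* nfsd_wq = alloc_workqueue("nfsd", WQ_UNBOUND, 0); */

	return nfsd_wq ? 0 : -ENOMEM;
}
-------------------------[snip]--------------------------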

Thanks.

--
tejun

2014-12-02 19:26:34

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Tue, 2 Dec 2014 14:18:14 -0500
Tejun Heo <[email protected]> wrote:

> Hello, Jeff.
>
> On Tue, Dec 02, 2014 at 01:24:09PM -0500, Jeff Layton wrote:
> > 2) get some insight about the latency from those with a better
> > understanding of the CMWQ code. Any thoughts as to why we might be
> > seeing such high latency here? Any ideas of what we can do about it?
>
> The latency is prolly from concurrency management. Work items which
> participate in concurrency management (the ones on per-cpu workqueues
> w/o WQ_CPU_INTENSIVE set) tend to get penalized on latency side quite
> a bit as the "run" durations for all such work items end up being
> serialized on the cpu. Setting WQ_CPU_INTENSIVE on the workqueue
> disables concurrency management and so does making the workqueue
> unbound. If strict cpu locality is likely to be beneficial and each
> work item isn't likely to consume huge amount of cpu cycles,
> WQ_CPU_INTENSIVE would fit better; otherwise, WQ_UNBOUND to let the
> scheduler do its thing.
>
> Thanks.
>

Thanks Tejun,

I'm already using WQ_UNBOUND workqueues. If that exempts this code from
the concurrency management, then that's probably not the problem. The
jobs here aren't terribly CPU intensive, but they can sleep for a long
time while waiting on I/O, etc...

I don't think we necessarily need CPU locality (though that's nice to
have of course), but NUMA affinity will likely be important. It looked
like you had done some work a year or so ago to make unbound workqueues
prefer to queue work on the same NUMA node which meshes nicely with
what I think we want for this.

I'll keep looking at it -- let me know if you have any other thoughts
on the latency...

Cheers!
--
Jeff Layton <[email protected]>

2014-12-02 19:29:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

Hello, Jeff.

On Tue, Dec 02, 2014 at 02:26:27PM -0500, Jeff Layton wrote:
> I'm already using WQ_UNBOUND workqueues. If that exempts this code from
> the concurrency management, then that's probably not the problem. The
> jobs here aren't terribly CPU intensive, but they can sleep for a long
> time while waiting on I/O, etc...

Yeap, noticed that. Replied in another message.

> I don't think we necessarily need CPU locality (though that's nice to
> have of course), but NUMA affinity will likely be important. It looked
> like you had done some work a year or so ago to make unbound workqueues
> prefer to queue work on the same NUMA node which meshes nicely with
> what I think we want for this.

Yeap, unbound workqueues are NUMA-affine by default.

Thanks.

--
tejun

2014-12-02 19:30:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Tue, Dec 02, 2014 at 02:18:14PM -0500, Tejun Heo wrote:
...
> unbound. If strict cpu locality is likely to be beneficial and each
> work item isn't likely to consume huge amount of cpu cycles,
> WQ_CPU_INTENSIVE would fit better; otherwise, WQ_UNBOUND to let the
> scheduler do its thing.

Hmmm... but you're already using WQ_UNBOUND. Concurrency management
doesn't matter for unbound workqueues. They really just behave as
shared worker thread pools. Does turning on WQ_HIGHPRI change
anything? Workqueue always prefers hot workers which can lead to the
hot ones being penalized for consuming too much CPU time.

Thanks.

--
tejun

2014-12-02 19:46:50

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Tue, 2 Dec 2014 14:26:55 -0500
Tejun Heo <[email protected]> wrote:

> On Tue, Dec 02, 2014 at 02:18:14PM -0500, Tejun Heo wrote:
> ...
> > unbound. If strict cpu locality is likely to be beneficial and each
> > work item isn't likely to consume huge amount of cpu cycles,
> > WQ_CPU_INTENSIVE would fit better; otherwise, WQ_UNBOUND to let the
> > scheduler do its thing.
>
> Hmmm... but you're already using WQ_UNBOUND. Concurrency management
> doesn't matter for unbound workqueues. They really just behave as
> shared worker thread pools. Does turning on WQ_HIGHPRI change
> anything? Workqueue always prefers hot workers which can lead to the
> hot ones being penalized for consuming too much CPU time.
>
> Thanks.
>

WQ_HIGHPRI doesn't seem to help. FWIW, here's the fio status line from
that test:

threaded nfsd:
WRITE: io=12628KB, aggrb=210KB/s, minb=52KB/s, maxb=52KB/s, mint=60064msec, maxt=60107msec

workqueue nfsd:
WRITE: io=7464KB, aggrb=124KB/s, minb=30KB/s, maxb=31KB/s, mint=60066msec, maxt=60133msec

...and the workqueue numbers didn't change much from the case without
WQ_HIGHPRI. I didn't gather significant queueing latency numbers for
that run but I can get those together tomorrow.

My fio test config follows for anyone who's interested (it's
tiobench-example.fio without the read tests):

-------------------------[snip]--------------------------
[global]
direct=1
size=128m
bsrange=4k-4k
timeout=60
numjobs=4 ; 4 simultaneous threads for each job

[f1]
rw=write
-------------------------[snip]--------------------------

... the server in this case is an older quad core AMD Phenom CPU, with
a 5400rpm SATA disk under it on a 3Gbps SATA interface. The client is
mounting it using NFSv3 and I'm only running a single client here.

The only difference between the two tests above is whether nfsd is
running with the sunrpc.ko pool_mode set to "global" or "workqueue".

--
Jeff Layton <[email protected]>

2014-12-03 01:11:28

by NeilBrown

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Tue, 2 Dec 2014 13:24:09 -0500 Jeff Layton <[email protected]>
wrote:

> tl;dr: this code works and is much simpler than the dedicated thread
> pool, but there are some latencies in the workqueue code that
> seem to keep it from being as fast as it could be.
>
> This patchset is a little skunkworks project that I've been poking at
> for the last few weeks. Currently nfsd uses a dedicated thread pool to
> handle RPCs, but that requires maintaining a rather large swath of
> "fiddly" code to handle the threads and transports.
>
> This patchset represents an alternative approach, which makes nfsd use
> workqueues to do its bidding rather than a dedicated thread pool. When a
> transport needs to do work, we simply queue it to the workqueue in
> softirq context and let it service the transport.
>
> The current draft is runtime-switchable via a new sunrpc pool_mode
> module parameter setting. When that's set to "workqueue", nfsd will use
> a workqueue-based service. One of the goals of this patchset was to
> *not* need to change any userland code, so starting it up using rpc.nfsd
> still works as expected. The only real difference is that the nfsdfs
> "threads" file is reinterpreted as the "max_active" value for the
> workqueue.

Hi Jeff,
I haven't looked very closely at the code, but in principle I think this is
an excellent idea. Having to set a number of threads manually was never
nice as it is impossible to give sensible guidance on what an appropriate
number is.
Tying max_active to "threads" doesn't really make sense I think.
"max_active" is a per-cpu number and I think the only meaningful numbers are
"1" (meaning concurrent works might mutually deadlock) or infinity (which is
approximated as 512). I would just ignore the "threads" number when
workqueues are used.... or maybe enable workqueues when "auto" is written to
"threads"??

Using a shrinker to manage the allocation and freeing of svc_rqst is a
really good idea. It will put pressure on the effective number of threads
when needed, but will not artificially constrain things.
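
Roughly, I would imagine something along these lines. This is only a
sketch; nr_idle_rqsts and svc_rqst_reap() are invented names for
whatever tracks and reaps the cached svc_rqsts:

-------------------------[snip]--------------------------
#include <linux/shrinker.h>

static unsigned long nr_idle_rqsts;	/* invented: count of cached rqsts */

static unsigned long svc_rqst_shrink_count(struct shrinker *shrink,
					   struct shrink_control *sc)
{
	return nr_idle_rqsts;
}

static unsigned long svc_rqst_shrink_scan(struct shrinker *shrink,
					  struct shrink_control *sc)
{
	/* invented helper: free up to nr_to_scan idle svc_rqsts and
	 * return how many were actually freed */
	return svc_rqst_reap(sc->nr_to_scan);
}

static struct shrinker svc_rqst_shrinker = {
	.count_objects	= svc_rqst_shrink_count,
	.scan_objects	= svc_rqst_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* registered once at service startup:
 *	register_shrinker(&svc_rqst_shrinker);
 */
-------------------------[snip]--------------------------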

The combination of workqueue and shrinker seems like a perfect match for
nfsd.

I hope you can work out the latency issues!

Thanks,
NeilBrown


2014-12-03 01:30:12

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, 3 Dec 2014 12:11:18 +1100
NeilBrown <[email protected]> wrote:

> On Tue, 2 Dec 2014 13:24:09 -0500 Jeff Layton <[email protected]>
> wrote:
>
> > tl;dr: this code works and is much simpler than the dedicated thread
> > pool, but there are some latencies in the workqueue code that
> > seem to keep it from being as fast as it could be.
> >
> > This patchset is a little skunkworks project that I've been poking at
> > for the last few weeks. Currently nfsd uses a dedicated thread pool to
> > handle RPCs, but that requires maintaining a rather large swath of
> > "fiddly" code to handle the threads and transports.
> >
> > This patchset represents an alternative approach, which makes nfsd use
> > workqueues to do its bidding rather than a dedicated thread pool. When a
> > transport needs to do work, we simply queue it to the workqueue in
> > softirq context and let it service the transport.
> >
> > The current draft is runtime-switchable via a new sunrpc pool_mode
> > module parameter setting. When that's set to "workqueue", nfsd will use
> > a workqueue-based service. One of the goals of this patchset was to
> > *not* need to change any userland code, so starting it up using rpc.nfsd
> > still works as expected. The only real difference is that the nfsdfs
> > "threads" file is reinterpreted as the "max_active" value for the
> > workqueue.
>
> Hi Jeff,
> I haven't looked very closely at the code, but in principle I think this is
> an excellent idea. Having to set a number of threads manually was never
> nice as it is impossible to give sensible guidance on what an appropriate
> number is.

Yes, this should be a lot more "dynamic" in principle. A lightly-used
nfs server using workqueues should really not consume much at all in
the way of resources.

> Tying max_active to "threads" doesn't really make sense I think.
> "max_active" is a per-cpu number and I think the only meaningful numbers are
> "1" (meaning concurrent works might mutually deadlock) or infinity (which is
> approximated as 512). I would just ignore the "threads" number when
> workqueues are used.... or maybe enable workqueues when "auto" is written to
> "threads"??
>

That's a good point. I had originally thought that max_active on an
unbound workqueue would be the number of concurrent jobs that could run
across all the CPUs, but now that I look I'm not sure that's really
the case.

Certainly, not bounding this at all has some appeal, but having some
limit here does help put an upper bound on the size of the svc_rqst
cache.

I'll definitely give that some thought.

> Using a shrinker to manage the allocation and freeing of svc_rqst is a
> really good idea. It will put pressure on the effective number of threads
> when needed, but will not artificially constrain things.
>

That's my hope. We might also consider reaping idle rqsts ourselves
as needed somehow, but for now I decided not to worry about it until we
see whether this is really viable or not.

> The combination of workqueue and shrinker seems like a perfect match for
> nfsd.
>
> I hope you can work out the latency issues!
>
> Thanks,
> NeilBrown

I've heard random grumblings from various people in the past that
workqueues have significant latency, but this is the first time I've
really hit it in practice. If we can get this fixed, then that may be a
significant perf win for all workqueue users. For instance, rpciod in
the NFS client is all workqueue-based. Getting that latency down could
really help things.

I'm currently trying to roll up a kernel module for benchmarking the
workqueue dispatching code in the hopes that we can use that to help
nail it down.

Thanks for having a look!
--
Jeff Layton <[email protected]>


2014-12-03 15:56:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

Hello, Neil, Jeff.

On Tue, Dec 02, 2014 at 08:29:46PM -0500, Jeff Layton wrote:
> That's a good point. I had originally thought that max_active on an
> unbound workqueue would be the number of concurrent jobs that could run
> across all the CPUs, but now that I look I'm not sure that's really
> the case.

@max_active is a per-pool number. By default, unbound wqs use
per-node pools, so @max_active would be per-node. Currently,
@max_active is mostly meant as a protection against run-away
workqueues creating crazy number of workers, which has been enough for
the existing wq users. *Maybe* it makes sense to make it actually
mean maximum concurrency which would prolly involve aggregated per-cpu
distribution mechanism so that we don't end up inc'ing and dec'ing the
same counter from all CPUs on each work item execution.

However, I do agree with Neil that making it user configurable is
almost always painful. It's usually a question without a good answer
and the same value may behave differently depending on a lot of
implementation details and a better approach, probably, is to use
@max_active as the last resort protection mechanism while providing
automatic throttling of in-flight work items which is meaningful for
the specific use cases.

> I've heard random grumblings from various people in the past that
> workqueues have significant latency, but this is the first time I've
> really hit it in practice. If we can get this fixed, then that may be a
> significant perf win for all workqueue users. For instance, rpciod in
> the NFS client is all workqueue-based. Getting that latency down could
> really help things.
>
> I'm currently trying to roll up a kernel module for benchmarking the
> workqueue dispatching code in the hopes that we can use that to help
> nail it down.

Definitely, there were some reportings but nothing really got tracked
down properly. It'd be awesome to actually find out where the latency
is coming from.

Thanks!

--
tejun

2014-12-03 16:04:12

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, 3 Dec 2014 10:56:49 -0500
Tejun Heo <[email protected]> wrote:

> Hello, Neil, Jeff.
>
> On Tue, Dec 02, 2014 at 08:29:46PM -0500, Jeff Layton wrote:
> > That's a good point. I had originally thought that max_active on an
> > unbound workqueue would be the number of concurrent jobs that could run
> > across all the CPUs, but now that I look I'm not sure that's really
> > the case.
>
> @max_active is a per-pool number. By default, unbound wqs use
> per-node pools, so @max_active would be per-node. Currently,
> @max_active is mostly meant as a protection against run-away
> workqueues creating crazy number of workers, which has been enough for
> the existing wq users. *Maybe* it makes sense to make it actually
> mean maximum concurrency which would prolly involve aggregated per-cpu
> distribution mechanism so that we don't end up inc'ing and dec'ing the
> same counter from all CPUs on each work item execution.
>
> However, I do agree with Neil that making it user configurable is
> almost always painful. It's usually a question without a good answer
> and the same value may behave differently depending on a lot of
> implementation details and a better approach, probably, is to use
> @max_active as the last resort protection mechanism while providing
> automatic throttling of in-flight work items which is meaningful for
> the specific use cases.
>
> > I've heard random grumblings from various people in the past that
> > workqueues have significant latency, but this is the first time I've
> > really hit it in practice. If we can get this fixed, then that may be a
> > significant perf win for all workqueue users. For instance, rpciod in
> > the NFS client is all workqueue-based. Getting that latency down could
> > really help things.
> >
> > I'm currently trying to roll up a kernel module for benchmarking the
> > workqueue dispatching code in the hopes that we can use that to help
> > nail it down.
>
> Definitely, there were some reportings but nothing really got tracked
> down properly. It'd be awesome to actually find out where the latency
> is coming from.
>
> Thanks!
>

I think I might have figured this out (and before I go any farther
allow me to say <facepalm>), thanks to the workqueue tracepoints in the
code. What I noticed is that when things are fairly idle, the work is
picked up quickly, but once things get busy it takes a lot longer.

I think that the issue is in the design of the workqueue-based nfsd
code. In particular, I attached a work_struct to the svc_xprt, which
limits the code to processing only one RPC at a time per xprt, from
beginning to end.

So, even if we requeue that work after the receive phase is done, the
workqueue won't pick it up again until the thing is processed and the
reply is sent.

What I think I need to do is to do the receive phase using the
work_struct attached to the xprt, and then do the rest of the
processing from the context of a different work_struct (possibly one
attached to the svc_rqst), which should free up the xprt's work_struct
sooner.
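
Roughly like this. It's only a sketch: svc_rqst_get(), svc_wq_recv(),
svc_rqst_release() and the rq_work field are illustrative names, not
the actual patch:

-------------------------[snip]--------------------------
#include <linux/workqueue.h>
#include <linux/sunrpc/svc.h>
#include <linux/sunrpc/svc_xprt.h>

static struct workqueue_struct *nfsd_workqueue;

/* receive phase: runs from the xprt's work_struct and hands off ASAP */
static void svc_xprt_recv_work(struct work_struct *work)
{
	struct svc_xprt *xprt = container_of(work, struct svc_xprt,
					     xpt_work);
	struct svc_rqst *rqstp;

	rqstp = svc_rqst_get(xprt);	/* illustrative: find or alloc */
	if (!rqstp)
		return;

	svc_wq_recv(rqstp);		/* illustrative: receive phase only */

	/* requeue so the xprt's work_struct frees up while the
	 * (potentially slow) RPC processing runs elsewhere */
	queue_work(nfsd_workqueue, &rqstp->rq_work);
}

/* processing phase: runs from a work_struct attached to the svc_rqst */
static void svc_rqst_process_work(struct work_struct *work)
{
	struct svc_rqst *rqstp = container_of(work, struct svc_rqst,
					      rq_work);

	svc_process(rqstp);		/* dispatch the RPC, send the reply */
	svc_rqst_release(rqstp);	/* illustrative: recycle the rqst */
}
-------------------------[snip]--------------------------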

I'm going to work on changing that today and see if it improves things.

Thanks for the help so far!
--
Jeff Layton <[email protected]>

2014-12-03 16:50:46

by Chuck Lever III

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd


On Dec 3, 2014, at 10:56 AM, Tejun Heo <[email protected]> wrote:

> Hello, Neil, Jeff.
>
> On Tue, Dec 02, 2014 at 08:29:46PM -0500, Jeff Layton wrote:
>> That's a good point. I had originally thought that max_active on an
>> unbound workqueue would be the number of concurrent jobs that could run
>> across all the CPUs, but now that I look I'm not sure that's really
>> the case.
>
> @max_active is a per-pool number. By default, unbound wqs use
> per-node pools, so @max_active would be per-node. Currently,
> @max_active is mostly meant as a protection against run-away
> workqueues creating crazy number of workers, which has been enough for
> the existing wq users. *Maybe* it makes sense to make it actually
> mean maximum concurrency which would prolly involve aggregated per-cpu
> distribution mechanism so that we don't end up inc'ing and dec'ing the
> same counter from all CPUs on each work item execution.
>
> However, I do agree with Neil that making it user configurable is
> almost always painful. It's usually a question without a good answer
> and the same value may behave differently depending on a lot of
> implementation details and a better approach, probably, is to use
> @max_active as the last resort protection mechanism while providing
> automatic throttling of in-flight work items which is meaningful for
> the specific use cases.
>
>> I've heard random grumblings from various people in the past that
>> workqueues have significant latency, but this is the first time I've
>> really hit it in practice. If we can get this fixed, then that may be a
>> significant perf win for all workqueue users. For instance, rpciod in
>> the NFS client is all workqueue-based. Getting that latency down could
>> really help things.
>>
>> I'm currently trying to roll up a kernel module for benchmarking the
>> workqueue dispatching code in the hopes that we can use that to help
>> nail it down.
>
> Definitely, there were some reportings but nothing really got tracked
> down properly. It'd be awesome to actually find out where the latency
> is coming from.

I've measured some long (>10usec) latencies for queue_work()
and wake_up_bit() calls with CPU-intensive workloads.

Often these APIs are invoked while one or more spinlocks are
held. That makes it easy to back up a lot of work if one of
these calls is slow for any reason.

For wake_up_bit(), this doesn't include how long it takes before
the awoken process is run. With CPU-intensive workloads, it's
often the case that hundreds of usecs elapse, and the awoken
process is migrated to another CPU (observed via ftrace).

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


2014-12-03 19:02:09

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, 3 Dec 2014 11:04:05 -0500
Jeff Layton <[email protected]> wrote:

> On Wed, 3 Dec 2014 10:56:49 -0500
> Tejun Heo <[email protected]> wrote:
>
> > Hello, Neil, Jeff.
> >
> > On Tue, Dec 02, 2014 at 08:29:46PM -0500, Jeff Layton wrote:
> > > That's a good point. I had originally thought that max_active on an
> > > unbound workqueue would be the number of concurrent jobs that could run
> > > across all the CPUs, but now that I look I'm not sure that's really
> > > the case.
> >
> > @max_active is a per-pool number. By default, unbound wqs use
> > per-node pools, so @max_active would be per-node. Currently,
> > @max_active is mostly meant as a protection against run-away
> > workqueues creating crazy number of workers, which has been enough for
> > the existing wq users. *Maybe* it makes sense to make it actually
> > mean maximum concurrency which would prolly involve aggregated per-cpu
> > distribution mechanism so that we don't end up inc'ing and dec'ing the
> > same counter from all CPUs on each work item execution.
> >
> > However, I do agree with Neil that making it user configurable is
> > almost always painful. It's usually a question without a good answer
> > and the same value may behave differently depending on a lot of
> > implementation details and a better approach, probably, is to use
> > @max_active as the last resort protection mechanism while providing
> > automatic throttling of in-flight work items which is meaningful for
> > the specific use cases.
> >
> > > I've heard random grumblings from various people in the past that
> > > workqueues have significant latency, but this is the first time I've
> > > really hit it in practice. If we can get this fixed, then that may be a
> > > significant perf win for all workqueue users. For instance, rpciod in
> > > the NFS client is all workqueue-based. Getting that latency down could
> > > really help things.
> > >
> > > I'm currently trying to roll up a kernel module for benchmarking the
> > > workqueue dispatching code in the hopes that we can use that to help
> > > nail it down.
> >
> > Definitely, there were some reportings but nothing really got tracked
> > down properly. It'd be awesome to actually find out where the latency
> > is coming from.
> >
> > Thanks!
> >
>
> I think I might have figured this out (and before I go any farther
> allow me to say <facepalm>), thanks to the workqueue tracepoints in the
> code. What I noticed is that when things are fairly idle, the work is
> picked up quickly, but once things get busy it takes a lot longer.
>
> I think that the issue is in the design of the workqueue-based nfsd
> code. In particular, I attached a work_struct to the svc_xprt which is
> limiting the code to only process one RPC at a time for a xprt, from
> beginning to end.
>
> So, even if we requeue that work after the receive phase is done, the
> workqueue won't pick it up again until the thing is processed and the
> reply is sent.
>
> What I think I need to do is to do the receive phase using the
> work_struct attached to the xprt, and then do the rest of the
> processing from the context of a different work_struct (possibly one
> attached to the svc_rqst), which should free up the xprt's work_struct
> sooner.
>
> I'm going to work on changing that today and see if it improves things.
>
> Thanks for the help so far!

Yes! That does help. The new workqueue-based code is a little (a few
percent?) slower than the thread-based code across the board. I suspect
that's due to the fact that I'm having to queue each RPC to the
workqueue twice (once for the receive and once to do the processing).

I suspect that I can remedy that, but I'll have to think about the best
way to do it.

Thanks again for the help!
--
Jeff Layton <[email protected]>

2014-12-03 19:08:05

by Trond Myklebust

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, Dec 3, 2014 at 2:02 PM, Jeff Layton <[email protected]> wrote:
> On Wed, 3 Dec 2014 11:04:05 -0500
> Jeff Layton <[email protected]> wrote:
>
>> On Wed, 3 Dec 2014 10:56:49 -0500
>> Tejun Heo <[email protected]> wrote:
>>
>> > Hello, Neil, Jeff.
>> >
>> > On Tue, Dec 02, 2014 at 08:29:46PM -0500, Jeff Layton wrote:
>> > > That's a good point. I had originally thought that max_active on an
>> > > unbound workqueue would be the number of concurrent jobs that could run
>> > > across all the CPUs, but now that I look I'm not sure that's really
>> > > the case.
>> >
>> > @max_active is a per-pool number. By default, unbound wqs use
>> > per-node pools, so @max_active would be per-node. Currently,
>> > @max_active is mostly meant as a protection against run-away
>> > workqueues creating crazy number of workers, which has been enough for
>> > the existing wq users. *Maybe* it makes sense to make it actually
>> > mean maximum concurrency which would prolly involve aggregated per-cpu
>> > distribution mechanism so that we don't end up inc'ing and dec'ing the
>> > same counter from all CPUs on each work item execution.
>> >
>> > However, I do agree with Neil that making it user configurable is
>> > almost always painful. It's usually a question without a good answer
>> > and the same value may behave differently depending on a lot of
>> > implementation details and a better approach, probably, is to use
>> > @max_active as the last resort protection mechanism while providing
>> > automatic throttling of in-flight work items which is meaningful for
>> > the specific use cases.
>> >
>> > > I've heard random grumblings from various people in the past that
>> > > workqueues have significant latency, but this is the first time I've
>> > > really hit it in practice. If we can get this fixed, then that may be a
>> > > significant perf win for all workqueue users. For instance, rpciod in
>> > > the NFS client is all workqueue-based. Getting that latency down could
>> > > really help things.
>> > >
>> > > I'm currently trying to roll up a kernel module for benchmarking the
>> > > workqueue dispatching code in the hopes that we can use that to help
>> > > nail it down.
>> >
>> > Definitely, there were some reportings but nothing really got tracked
>> > down properly. It'd be awesome to actually find out where the latency
>> > is coming from.
>> >
>> > Thanks!
>> >
>>
>> I think I might have figured this out (and before I go any farther
>> allow me to say <facepalm>), thanks to the workqueue tracepoints in the
>> code. What I noticed is that when things are fairly idle, the work is
>> picked up quickly, but once things get busy it takes a lot longer.
>>
>> I think that the issue is in the design of the workqueue-based nfsd
>> code. In particular, I attached a work_struct to the svc_xprt which is
>> limiting the code to only process one RPC at a time for a xprt, from
>> beginning to end.
>>
>> So, even if we requeue that work after the receive phase is done, the
>> workqueue won't pick it up again until the thing is processed and the
>> reply is sent.
>>
>> What I think I need to do is to do the receive phase using the
>> work_struct attached to the xprt, and then do the rest of the
>> processing from the context of a different work_struct (possibly one
>> attached to the svc_rqst), which should free up the xprt's work_struct
>> sooner.
>>
>> I'm going to work on changing that today and see if it improves things.
>>
>> Thanks for the help so far!
>
> Yes! That does help. The new workqueue based code is a little (a few
> percent?) slower than the thread-based code across the board. I suspect
> that's due to the fact that I'm having to queue each RPC to the
> workqueue twice (once for the receive and once to do the processing).
>
> I suspect that I can remedy that, but I'll have to think about the best
> way to do it.
>

Which workqueue are you using? Since the receive code is non-blocking,
I'd expect you might be able to use rpciod, for the initial socket
reads, but you wouldn't want to use that for the actual knfsd
processing.

--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

[email protected]

2014-12-03 19:20:39

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, 3 Dec 2014 14:08:01 -0500
Trond Myklebust <[email protected]> wrote:

> On Wed, Dec 3, 2014 at 2:02 PM, Jeff Layton <[email protected]> wrote:
> > On Wed, 3 Dec 2014 11:04:05 -0500
> > Jeff Layton <[email protected]> wrote:
> >
> >> On Wed, 3 Dec 2014 10:56:49 -0500
> >> Tejun Heo <[email protected]> wrote:
> >>
> >> > Hello, Neil, Jeff.
> >> >
> >> > On Tue, Dec 02, 2014 at 08:29:46PM -0500, Jeff Layton wrote:
> >> > > That's a good point. I had originally thought that max_active on an
> >> > > unbound workqueue would be the number of concurrent jobs that could run
> >> > > across all the CPUs, but now that I look I'm not sure that's really
> >> > > the case.
> >> >
> >> > @max_active is a per-pool number. By default, unbound wqs use
> >> > per-node pools, so @max_active would be per-node. Currently,
> >> > @max_active is mostly meant as a protection against run-away
> >> > workqueues creating crazy number of workers, which has been enough for
> >> > the existing wq users. *Maybe* it makes sense to make it actually
> >> > mean maximum concurrency which would prolly involve aggregated per-cpu
> >> > distribution mechanism so that we don't end up inc'ing and dec'ing the
> >> > same counter from all CPUs on each work item execution.
> >> >
> >> > However, I do agree with Neil that making it user configurable is
> >> > almost always painful. It's usually a question without a good answer
> >> > and the same value may behave differently depending on a lot of
> >> > implementation details and a better approach, probably, is to use
> >> > @max_active as the last resort protection mechanism while providing
> >> > automatic throttling of in-flight work items which is meaningful for
> >> > the specific use cases.
> >> >
> >> > > I've heard random grumblings from various people in the past that
> >> > > workqueues have significant latency, but this is the first time I've
> >> > > really hit it in practice. If we can get this fixed, then that may be a
> >> > > significant perf win for all workqueue users. For instance, rpciod in
> >> > > the NFS client is all workqueue-based. Getting that latency down could
> >> > > really help things.
> >> > >
> >> > > I'm currently trying to roll up a kernel module for benchmarking the
> >> > > workqueue dispatching code in the hopes that we can use that to help
> >> > > nail it down.
> >> >
> >> > Definitely, there were some reportings but nothing really got tracked
> >> > down properly. It'd be awesome to actually find out where the latency
> >> > is coming from.
> >> >
> >> > Thanks!
> >> >
> >>
> >> I think I might have figured this out (and before I go any farther
> >> allow me to say <facepalm>), thanks to the workqueue tracepoints in the
> >> code. What I noticed is that when things are fairly idle, the work is
> >> picked up quickly, but once things get busy it takes a lot longer.
> >>
> >> I think that the issue is in the design of the workqueue-based nfsd
> >> code. In particular, I attached a work_struct to the svc_xprt which is
> >> limiting the code to only process one RPC at a time for a xprt, from
> >> beginning to end.
> >>
> >> So, even if we requeue that work after the receive phase is done, the
> >> workqueue won't pick it up again until the thing is processed and the
> >> reply is sent.
> >>
> >> What I think I need to do is to do the receive phase using the
> >> work_struct attached to the xprt, and then do the rest of the
> >> processing from the context of a different work_struct (possibly one
> >> attached to the svc_rqst), which should free up the xprt's work_struct
> >> sooner.
> >>
> >> I'm going to work on changing that today and see if it improves things.
> >>
> >> Thanks for the help so far!
> >
> > Yes! That does help. The new workqueue based code is a little (a few
> > percent?) slower than the thread-based code across the board. I suspect
> > that's due to the fact that I'm having to queue each RPC to the
> > workqueue twice (once for the receive and once to do the processing).
> >
> > I suspect that I can remedy that, but I'll have to think about the best
> > way to do it.
> >
>
> Which workqueue are you using? Since the receive code is non-blocking,
> I'd expect you might be able to use rpciod, for the initial socket
> reads, but you wouldn't want to use that for the actual knfsd
> processing.
>

I'm using the same (nfsd) workqueue for everything. The workqueue
isn't really the bottleneck though, it's the work_struct.

Basically, the problem is that the work_struct in the svc_xprt was
remaining busy for far too long. So, even though the XPT_BUSY bit had
cleared, the work wouldn't get picked up again until the previous
workqueue job had returned.

With the change I made today, I just added a new work_struct to
svc_rqst and queue that to the same workqueue to do svc_process as soon
as the receive is done. That means though that each RPC ends up waiting
in the queue twice (once to do the receive and once to process the
RPC), and I think that's probably the reason for the performance delta.

What I think I'm going to do on the next pass is have the job that
enqueues the xprt instead try to find an svc_rqst. If it finds it,
then it can go ahead and queue the work struct in it to do the receive
and processing in a single go.

If it can't find one, it'll queue the xprt's work to allocate one and
then queue that to do all of the work as before. That will likely
penalize the case where there isn't an available svc_rqst, but in the
common case that there is one it should go quickly.
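
In rough pseudo-form (again just a sketch; svc_rqst_find_idle() and
the rq_work field are made-up names):

-------------------------[snip]--------------------------
static void svc_wq_enqueue_xprt(struct svc_xprt *xprt)
{
	struct svc_rqst *rqstp;

	/* made-up helper: look for a cached, idle svc_rqst */
	rqstp = svc_rqst_find_idle(xprt->xpt_server);
	if (rqstp) {
		/* common case: receive + processing as a single work item */
		rqstp->rq_xprt = xprt;
		queue_work(nfsd_workqueue, &rqstp->rq_work);
	} else {
		/* fallback: the xprt's own work allocates a rqst and then
		 * does all of the work, as before */
		queue_work(nfsd_workqueue, &xprt->xpt_work);
	}
}
-------------------------[snip]--------------------------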

Thanks!
--
Jeff Layton <[email protected]>

2014-12-03 19:59:46

by Trond Myklebust

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, Dec 3, 2014 at 2:20 PM, Jeff Layton <[email protected]> wrote:
> On Wed, 3 Dec 2014 14:08:01 -0500
> Trond Myklebust <[email protected]> wrote:
>> Which workqueue are you using? Since the receive code is non-blocking,
>> I'd expect you might be able to use rpciod, for the initial socket
>> reads, but you wouldn't want to use that for the actual knfsd
>> processing.
>>
>
> I'm using the same (nfsd) workqueue for everything. The workqueue
> isn't really the bottleneck though, it's the work_struct.
>
> Basically, the problem is that the work_struct in the svc_xprt was
> remaining busy for far too long. So, even though the XPT_BUSY bit had
> cleared, the work wouldn't get picked up again until the previous
> workqueue job had returned.
>
> With the change I made today, I just added a new work_struct to
> svc_rqst and queue that to the same workqueue to do svc_process as soon
> as the receive is done. That means though that each RPC ends up waiting
> in the queue twice (once to do the receive and once to process the
> RPC), and I think that's probably the reason for the performance delta.

Why would the queuing latency still be significant now?

> What I think I'm going to do on the next pass is have the job that
> enqueues the xprt instead try to find an svc_rqst. If it finds it,
> then it can go ahead and queue the work struct in it to do the receive
> and processing in a single go.
>
> If it can't find one, it'll queue the xprt's work to allocate one and
> then queue that to do all of the work as before. That will likely
> penalize the case where there isn't an available svc_rqst, but in the
> common case that there is one it should go quickly.


--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

[email protected]

2014-12-03 20:21:51

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, 3 Dec 2014 14:59:43 -0500
Trond Myklebust <[email protected]> wrote:

> On Wed, Dec 3, 2014 at 2:20 PM, Jeff Layton <[email protected]> wrote:
> > On Wed, 3 Dec 2014 14:08:01 -0500
> > Trond Myklebust <[email protected]> wrote:
> >> Which workqueue are you using? Since the receive code is non-blocking,
> >> I'd expect you might be able to use rpciod, for the initial socket
> >> reads, but you wouldn't want to use that for the actual knfsd
> >> processing.
> >>
> >
> > I'm using the same (nfsd) workqueue for everything. The workqueue
> > isn't really the bottleneck though, it's the work_struct.
> >
> > Basically, the problem is that the work_struct in the svc_xprt was
> > remaining busy for far too long. So, even though the XPT_BUSY bit had
> > cleared, the work wouldn't get picked up again until the previous
> > workqueue job had returned.
> >
> > With the change I made today, I just added a new work_struct to
> > svc_rqst and queue that to the same workqueue to do svc_process as soon
> > as the receive is done. That means though that each RPC ends up waiting
> > in the queue twice (once to do the receive and once to process the
> > RPC), and I think that's probably the reason for the performance delta.
>
> Why would the queuing latency still be significant now?
>

That, I'm not clear on yet and that may not be why this is slower. But,
I was seeing slightly faster performance with reads before I made
today's changes. If changing how these jobs get queued doesn't help the
performance, then I'll have to look elsewhere...

> > What I think I'm going to do on the next pass is have the job that
> > enqueues the xprt instead try to find an svc_rqst. If it finds it,
> > then it can go ahead and queue the work struct in it to do the
> > receive and processing in a single go.
> >
> > If it can't find one, it'll queue the xprt's work to allocate one
> > and then queue that to do all of the work as before. That will
> > likely penalize the case where there isn't an available svc_rqst,
> > but in the common case that there is one it should go quickly.
>
>


--
Jeff Layton <[email protected]>

2014-12-03 20:45:59

by Trond Myklebust

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, Dec 3, 2014 at 3:21 PM, Jeff Layton <[email protected]> wrote:
> On Wed, 3 Dec 2014 14:59:43 -0500
> Trond Myklebust <[email protected]> wrote:
>
>> On Wed, Dec 3, 2014 at 2:20 PM, Jeff Layton <[email protected]> wrote:
>> > On Wed, 3 Dec 2014 14:08:01 -0500
>> > Trond Myklebust <[email protected]> wrote:
>> >> Which workqueue are you using? Since the receive code is non-blocking,
>> >> I'd expect you might be able to use rpciod, for the initial socket
>> >> reads, but you wouldn't want to use that for the actual knfsd
>> >> processing.
>> >>
>> >
>> > I'm using the same (nfsd) workqueue for everything. The workqueue
>> > isn't really the bottleneck though, it's the work_struct.
>> >
>> > Basically, the problem is that the work_struct in the svc_xprt was
>> > remaining busy for far too long. So, even though the XPT_BUSY bit had
>> > cleared, the work wouldn't get picked up again until the previous
>> > workqueue job had returned.
>> >
>> > With the change I made today, I just added a new work_struct to
>> > svc_rqst and now queue that to the same workqueue to do svc_process as
>> > soon as the receive is done. That means, though, that each RPC ends up
>> > waiting in the queue twice (once to do the receive and once to process
>> > the RPC), and I think that's probably the reason for the performance delta.
>>
>> Why would the queuing latency still be significant now?
>>
>
> I'm not clear on that yet, and it may not be why this is slower. But
> I was seeing slightly faster performance with reads before I made
> today's changes. If changing how these jobs get queued doesn't help the
> performance, then I'll have to look elsewhere...

Do you have a good method for measuring that latency? If the queuing
latency turns out to depend on the execution latency for each job,
then perhaps running the message receives on a separate low-latency
queue could help (hence the suggestion to use rpciod).

--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

[email protected]

2014-12-04 11:47:17

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Wed, 3 Dec 2014 15:44:31 -0500
Trond Myklebust <[email protected]> wrote:

> On Wed, Dec 3, 2014 at 3:21 PM, Jeff Layton <[email protected]> wrote:
> > On Wed, 3 Dec 2014 14:59:43 -0500
> > Trond Myklebust <[email protected]> wrote:
> >
> >> On Wed, Dec 3, 2014 at 2:20 PM, Jeff Layton <[email protected]> wrote:
> >> > On Wed, 3 Dec 2014 14:08:01 -0500
> >> > Trond Myklebust <[email protected]> wrote:
> >> >> Which workqueue are you using? Since the receive code is non-blocking,
> >> >> I'd expect you might be able to use rpciod, for the initial socket
> >> >> reads, but you wouldn't want to use that for the actual knfsd
> >> >> processing.
> >> >>
> >> >
> >> > I'm using the same (nfsd) workqueue for everything. The workqueue
> >> > isn't really the bottleneck though, it's the work_struct.
> >> >
> >> > Basically, the problem is that the work_struct in the svc_xprt was
> >> > remaining busy for far too long. So, even though the XPT_BUSY bit had
> >> > cleared, the work wouldn't get picked up again until the previous
> >> > workqueue job had returned.
> >> >
> >> > With the change I made today, I just added a new work_struct to
> >> > svc_rqst and now queue that to the same workqueue to do svc_process as
> >> > soon as the receive is done. That means, though, that each RPC ends up
> >> > waiting in the queue twice (once to do the receive and once to process
> >> > the RPC), and I think that's probably the reason for the performance delta.
> >>
> >> Why would the queuing latency still be significant now?
> >>
> >
> > I'm not clear on that yet, and it may not be why this is slower. But
> > I was seeing slightly faster performance with reads before I made
> > today's changes. If changing how these jobs get queued doesn't help the
> > performance, then I'll have to look elsewhere...
>
> Do you have a good method for measuring that latency? If the queuing
> latency turns out to depend on the execution latency for each job,
> then perhaps running the message receives on a separate low-latency
> queue could help (hence the suggestion to use rpciod).
>

I was using ftrace with the sunrpc:* and workqueue:* tracepoints, and
had a simple perl script to postprocess the trace info to figure out
average/min/max latency.

I don't think the queueing latency is that significant per se, but I
think the best thing is to avoid making multiple trips through the
workqueue per RPC if we can help it. I tested and pushed a newer
patchset to my repo last night that does that (at least when there's
already an svc_rqst available when the xprt needs servicing). It seems
to be pretty close speed-wise to the thread-based code.

The next step is to test this out on something larger-scale. I'm hoping
to get access to just such a test rig soon. Once we have some results
from that, I think I'll have a much better idea of how viable this
approach is and where other potential bottlenecks might be.

Thanks!
--
Jeff Layton <[email protected]>
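
A sketch of the single-trip fast path described above, with
hypothetical names for the idle-svc_rqst lookup (the actual patches
keep a per-NUMA-node cache of svc_rqsts) and for the work members:

#include <linux/workqueue.h>

struct svc_rqst;

struct svc_xprt {
	struct work_struct xpt_work;	/* hypothetical: allocate an svc_rqst,
					 * then do receive + processing */
};

extern struct workqueue_struct *nfsd_wq;	/* hypothetical */
/* hypothetical: find an idle cached svc_rqst and bind it to the xprt */
extern struct svc_rqst *svc_rqst_find_idle(struct svc_xprt *xprt);
/* hypothetical: the rqst's work item does receive + svc_process */
extern struct work_struct *svc_rqst_work(struct svc_rqst *rqstp);

static void svc_wq_enqueue_xprt(struct svc_xprt *xprt)
{
	struct svc_rqst *rqstp = svc_rqst_find_idle(xprt);

	if (rqstp) {
		/* Fast path: one trip through the queue covers both the
		 * receive and svc_process. */
		queue_work(nfsd_wq, svc_rqst_work(rqstp));
	} else {
		/* Slow path: the xprt's own work item allocates an
		 * svc_rqst first, penalizing only this case. */
		queue_work(nfsd_wq, &xprt->xpt_work);
	}
}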

2014-12-04 17:17:36

by Shirley Ma

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

I am looking at how to reduce total RPC execution time in NFS/RDMA. mountstats output shows that RPC backlog wait is too long, but increasing the credit limit doesn't seem to help. Would this patchset help reduce total RPC execution time?

Shirley

On 12/04/2014 03:47 AM, Jeff Layton wrote:
> I was using ftrace with the sunrpc:* and workqueue:* tracepoints, and
> had a simple perl script to postprocess the trace info to figure out
> average/min/max latency.

2014-12-04 17:28:50

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On Thu, 04 Dec 2014 09:17:17 -0800
Shirley Ma <[email protected]> wrote:

> I am looking at how to reduce total RPC execution time in NFS/RDMA. mountstats output shows that RPC backlog wait is too long, but increasing the credit limit doesn't seem to help. Would this patchset help reduce total RPC execution time?
>
> Shirley
>

I'm not sure. It depends on why you're seeing a backlog.

So far, my testing still shows this to be slightly (~2%) slower than the
same setup running a threaded nfsd, but that could be different with a
NUMA server or faster disks.

Probably there is some more tuning to do here before this is quite
ready for prime-time. It may be worth testing in your environment
though if you have the time and ability to do so.

> On 12/04/2014 03:47 AM, Jeff Layton wrote:
> > I was using ftrace with the sunrpc:* and workqueue:* tracepoints, and
> > had a simple perl script to postprocess the trace info to figure out
> > average/min/max latency.


--
Jeff Layton <[email protected]>

2014-12-04 17:44:48

by Shirley Ma

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] nfsd/sunrpc: add support for a workqueue-based nfsd

On 12/04/2014 09:28 AM, Jeff Layton wrote:
> On Thu, 04 Dec 2014 09:17:17 -0800
> Shirley Ma <[email protected]> wrote:
>
>> I am looking at how to reduce total RPC execution time in NFS/RDMA. mountstats output shows that RPC backlog wait is too long, but increasing the credit limit doesn't seem to help. Would this patchset help reduce total RPC execution time?
>>
>> Shirley
>>
> I'm not sure. It depends on why you're seeing a backlog.

What are the major factors contributing to RPC backlog wait time? I guess I need to probe the backlog wait path to get a fine-grained breakdown of the latency cost.

> So far, my testing still shows this to be slightly (~2%) slower than the
> same setup running a threaded nfsd, but that could be different with a
> NUMA server or faster disks.
>
> Probably there is some more tuning to do here before this is quite
> ready for prime-time. It may be worth testing in your environment
> though if you have the time and ability to do so.
>
>> On 12/04/2014 03:47 AM, Jeff Layton wrote:
>>> I was using ftrace with the sunrpc:* and workqueue:* tracepoints, and
>>> had a simple perl script to postprocess the trace info to figure out
>>> average/min/max latency.

2014-12-08 20:47:14

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC PATCH 09/14] sunrpc: add basic support for workqueue-based services

On Tue, Dec 02, 2014 at 01:24:18PM -0500, Jeff Layton wrote:
> diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
> index fb78117b896c..08e01949bdc5 100644
> --- a/net/sunrpc/Kconfig
> +++ b/net/sunrpc/Kconfig
> @@ -71,3 +71,13 @@ config SUNRPC_XPRT_RDMA_SERVER
> choose M here: the module will be called svcrdma.
>
> If unsure, say N.
> +
> +config SUNRPC_SVC_WORKQUEUE
> + bool "Support for workqueue-based SUNRPC services"
> + depends on SUNRPC
> + default n
> + help
> + Traditional SUNRPC services have required a dedicated thread
> + to handle incoming requests. This option enables support for
> + queueing incoming requests to a workqueue instead, eliminating
> + the need for a dedicated thread pool.

Minor point, but: If people don't want this, they can turn it off at
runtime. It's annoying to test all the possible combinations of build
options, and this doesn't seem likely to let people e.g. build a
significantly smaller kernel, so I wouldn't bother.

--b.

2014-12-08 20:49:24

by Jeff Layton

[permalink] [raw]
Subject: Re: [RFC PATCH 09/14] sunrpc: add basic support for workqueue-based services

On Mon, 8 Dec 2014 15:47:09 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Tue, Dec 02, 2014 at 01:24:18PM -0500, Jeff Layton wrote:
> > diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
> > index fb78117b896c..08e01949bdc5 100644
> > --- a/net/sunrpc/Kconfig
> > +++ b/net/sunrpc/Kconfig
> > @@ -71,3 +71,13 @@ config SUNRPC_XPRT_RDMA_SERVER
> > choose M here: the module will be called svcrdma.
> >
> > If unsure, say N.
> > +
> > +config SUNRPC_SVC_WORKQUEUE
> > + bool "Support for workqueue-based SUNRPC services"
> > + depends on SUNRPC
> > + default n
> > + help
> > + Traditional SUNRPC services have required a dedicated thread
> > + to handle incoming requests. This option enables support for
> > + queueing incoming requests to a workqueue instead, eliminating
> > + the need for a dedicated thread pool.
>
> Minor point, but: If people don't want this, they can turn it off at
> runtime. It's annoying to test all the possible combinations of build
> options, and this doesn't seem likely to let people e.g. build a
> significantly smaller kernel, so I wouldn't bother.
>
> --b.

Ok, fair enough. I added the config option since this was experimental,
but if we keep it runtime-switchable then I think you're probably right
and there's no need.

--
Jeff Layton <[email protected]>