2006-07-31 00:42:41

by NeilBrown

Subject: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware


From: Greg Banks <[email protected]>

knfsd: Actually implement multiple pools. On NUMA machines, allocate
a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
global pool. Enqueue sockets on the svc_pool corresponding to the CPU
on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
have their cpu mask set to limit them to the CPUs in the svc_pool that
owns them.

This is the patch that allows an Altix to scale NFS traffic linearly
beyond 4 CPUs and 4 NICs.

Incorporates changes and feedback from Neil Brown, Trond Myklebust,
and Christoph Hellwig.

Signed-off-by: Greg Banks <[email protected]>
Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./include/linux/sunrpc/svc.h | 1
./net/sunrpc/svc.c | 246 ++++++++++++++++++++++++++++++++++++++++++-
./net/sunrpc/svcsock.c | 7 +
3 files changed, 252 insertions(+), 2 deletions(-)

diff .prev/include/linux/sunrpc/svc.h ./include/linux/sunrpc/svc.h
--- .prev/include/linux/sunrpc/svc.h 2006-07-31 10:30:30.000000000 +1000
+++ ./include/linux/sunrpc/svc.h 2006-07-31 10:31:12.000000000 +1000
@@ -369,5 +369,6 @@ int svc_process(struct svc_rqst *);
int svc_register(struct svc_serv *, int, unsigned short);
void svc_wake_up(struct svc_serv *);
void svc_reserve(struct svc_rqst *rqstp, int space);
+struct svc_pool * svc_pool_for_cpu(struct svc_serv *serv, int cpu);

#endif /* SUNRPC_SVC_H */

diff .prev/net/sunrpc/svc.c ./net/sunrpc/svc.c
--- .prev/net/sunrpc/svc.c 2006-07-31 10:30:30.000000000 +1000
+++ ./net/sunrpc/svc.c 2006-07-31 10:31:30.000000000 +1000
@@ -4,6 +4,10 @@
* High-level RPC service routines
*
* Copyright (C) 1995, 1996 Olaf Kirch <[email protected]>
+ *
+ * Multiple threads pools and NUMAisation
+ * Copyright (c) 2006 Silicon Graphics, Inc.
+ * by Greg Banks <[email protected]>
*/

#include <linux/linkage.h>
@@ -25,6 +29,233 @@
#define RPC_PARANOIA 1

/*
+ * Mode for mapping cpus to pools.
+ */
+enum {
+ SVC_POOL_NONE = -1, /* uninitialised, choose one of the others */
+ SVC_POOL_GLOBAL, /* no mapping, just a single global pool
+ * (legacy & UP mode) */
+ SVC_POOL_PERCPU, /* one pool per cpu */
+ SVC_POOL_PERNODE /* one pool per numa node */
+};
+
+/*
+ * Structure for mapping cpus to pools and vice versa.
+ * Setup once during sunrpc initialisation.
+ */
+static struct svc_pool_map {
+ int mode; /* Note: int not enum to avoid
+ * warnings about "enumeration value
+ * not handled in switch" */
+ unsigned int npools;
+ unsigned int *pool_to; /* maps pool id to cpu or node */
+ unsigned int *to_pool; /* maps cpu or node to pool id */
+} svc_pool_map = {
+ .mode = SVC_POOL_NONE
+};
+
+
+/*
+ * Detect best pool mapping mode heuristically,
+ * according to the machine's topology.
+ */
+static int
+svc_pool_map_choose_mode(void)
+{
+ unsigned int node;
+
+ if (num_online_nodes() > 1) {
+ /*
+ * Actually have multiple NUMA nodes,
+ * so split pools on NUMA node boundaries
+ */
+ return SVC_POOL_PERNODE;
+ }
+
+ node = any_online_node(node_online_map);
+ if (nr_cpus_node(node) > 2) {
+ /*
+ * Non-trivial SMP, or CONFIG_NUMA on
+ * non-NUMA hardware, e.g. with a generic
+ * x86_64 kernel on Xeons. In this case we
+ * want to divide the pools on cpu boundaries.
+ */
+ return SVC_POOL_PERCPU;
+ }
+
+ /* default: one global pool */
+ return SVC_POOL_GLOBAL;
+}
+
+/*
+ * Allocate the to_pool[] and pool_to[] arrays.
+ * Returns 0 on success or an errno.
+ */
+static int
+svc_pool_map_alloc_arrays(struct svc_pool_map *m, unsigned int maxpools)
+{
+ m->to_pool = kcalloc(maxpools, sizeof(unsigned int), GFP_KERNEL);
+ if (!m->to_pool)
+ goto fail;
+ m->pool_to = kcalloc(maxpools, sizeof(unsigned int), GFP_KERNEL);
+ if (!m->pool_to)
+ goto fail_free;
+
+ return 0;
+
+fail_free:
+ kfree(m->to_pool);
+fail:
+ return -ENOMEM;
+}
+
+/*
+ * Initialise the pool map for SVC_POOL_PERCPU mode.
+ * Returns number of pools or <0 on error.
+ */
+static int
+svc_pool_map_init_percpu(struct svc_pool_map *m)
+{
+ unsigned int maxpools = num_possible_cpus();
+ unsigned int pidx = 0;
+ unsigned int cpu;
+ int err;
+
+ err = svc_pool_map_alloc_arrays(m, maxpools);
+ if (err)
+ return err;
+
+ for_each_online_cpu(cpu) {
+ BUG_ON(pidx > maxpools);
+ m->to_pool[cpu] = pidx;
+ m->pool_to[pidx] = cpu;
+ pidx++;
+ }
+ /* cpus brought online later all get mapped to pool0, sorry */
+
+ return pidx;
+};
+
+
+/*
+ * Initialise the pool map for SVC_POOL_PERNODE mode.
+ * Returns number of pools or <0 on error.
+ */
+static int
+svc_pool_map_init_pernode(struct svc_pool_map *m)
+{
+ unsigned int maxpools = num_possible_nodes();
+ unsigned int pidx = 0;
+ unsigned int node;
+ int err;
+
+ err = svc_pool_map_alloc_arrays(m, maxpools);
+ if (err)
+ return err;
+
+ for_each_node_with_cpus(node) {
+ /* some architectures (e.g. SN2) have cpuless nodes */
+ BUG_ON(pidx > maxpools);
+ m->to_pool[node] = pidx;
+ m->pool_to[pidx] = node;
+ pidx++;
+ }
+ /* nodes brought online later all get mapped to pool0, sorry */
+
+ return pidx;
+}
+
+
+/*
+ * Build the global map of cpus to pools and vice versa.
+ */
+static unsigned int
+svc_pool_map_init(void)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ int npools = -1;
+
+ if (m->mode != SVC_POOL_NONE)
+ return m->npools;
+
+ m->mode = svc_pool_map_choose_mode();
+
+ switch (m->mode) {
+ case SVC_POOL_PERCPU:
+ npools = svc_pool_map_init_percpu(m);
+ break;
+ case SVC_POOL_PERNODE:
+ npools = svc_pool_map_init_pernode(m);
+ break;
+ }
+
+ if (npools < 0) {
+ /* default, or memory allocation failure */
+ npools = 1;
+ m->mode = SVC_POOL_GLOBAL;
+ }
+ m->npools = npools;
+
+ return m->npools;
+}
+
+/*
+ * Set the current thread's cpus_allowed mask so that it
+ * will only run on cpus in the given pool.
+ *
+ * Returns 1 and fills in oldmask iff a cpumask was applied.
+ */
+static inline int
+svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ unsigned int node; /* or cpu */
+
+ BUG_ON(m->mode == SVC_POOL_NONE);
+
+ switch (m->mode)
+ {
+ default:
+ return 0;
+ case SVC_POOL_PERCPU:
+ node = m->pool_to[pidx];
+ *oldmask = current->cpus_allowed;
+ set_cpus_allowed(current, cpumask_of_cpu(node));
+ return 1;
+ case SVC_POOL_PERNODE:
+ node = m->pool_to[pidx];
+ *oldmask = current->cpus_allowed;
+ set_cpus_allowed(current, node_to_cpumask(node));
+ return 1;
+ }
+}
+
+/*
+ * Use the mapping mode to choose a pool for a given CPU.
+ * Used when enqueueing an incoming RPC. Always returns
+ * a non-NULL pool pointer.
+ */
+struct svc_pool *
+svc_pool_for_cpu(struct svc_serv *serv, int cpu)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ unsigned int pidx = 0;
+
+ BUG_ON(m->mode == SVC_POOL_NONE);
+
+ switch (m->mode) {
+ case SVC_POOL_PERCPU:
+ pidx = m->to_pool[cpu];
+ break;
+ case SVC_POOL_PERNODE:
+ pidx = m->to_pool[cpu_to_node(cpu)];
+ break;
+ }
+ return &serv->sv_pools[pidx % serv->sv_nrpools];
+}
+
+
+/*
* Create an RPC service
*/
static struct svc_serv *
@@ -105,8 +336,9 @@ svc_create_pooled(struct svc_program *pr
svc_thread_fn func, int sig, struct module *mod)
{
struct svc_serv *serv;
+ unsigned int npools = svc_pool_map_init();

- serv = __svc_create(prog, bufsize, /*npools*/1, shutdown);
+ serv = __svc_create(prog, bufsize, npools, shutdown);

if (serv != NULL) {
serv->sv_function = func;
@@ -209,6 +441,8 @@ svc_release_buffer(struct svc_rqst *rqst

/*
* Create a thread in the given pool. Caller must hold BKL.
+ * On a NUMA or SMP machine, with a multi-pool serv, the thread
+ * will be restricted to run on the cpus belonging to the pool.
*/
static int
__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
@@ -216,6 +450,8 @@ __svc_create_thread(svc_thread_fn func,
{
struct svc_rqst *rqstp;
int error = -ENOMEM;
+ int have_oldmask = 0;
+ cpumask_t oldmask;

rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
if (!rqstp)
@@ -235,7 +471,15 @@ __svc_create_thread(svc_thread_fn func,
spin_unlock_bh(&pool->sp_lock);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;
+
+ if (serv->sv_nrpools > 1)
+ have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
+
error = kernel_thread((int (*)(void *)) func, rqstp, 0);
+
+ if (have_oldmask)
+ set_cpus_allowed(current, oldmask);
+
if (error < 0)
goto out_thread;
svc_sock_update_bufs(serv);

diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c 2006-07-31 10:29:38.000000000 +1000
+++ ./net/sunrpc/svcsock.c 2006-07-31 10:31:12.000000000 +1000
@@ -151,8 +151,9 @@ static void
svc_sock_enqueue(struct svc_sock *svsk)
{
struct svc_serv *serv = svsk->sk_server;
- struct svc_pool *pool = &serv->sv_pools[0];
+ struct svc_pool *pool;
struct svc_rqst *rqstp;
+ int cpu;

if (!(svsk->sk_flags &
( (1<<SK_CONN)|(1<<SK_DATA)|(1<<SK_CLOSE)|(1<<SK_DEFERRED)) ))
@@ -160,6 +161,10 @@ svc_sock_enqueue(struct svc_sock *svsk)
if (test_bit(SK_DEAD, &svsk->sk_flags))
return;

+ cpu = get_cpu();
+ pool = svc_pool_for_cpu(svsk->sk_server, cpu);
+ put_cpu();
+
spin_lock_bh(&pool->sp_lock);

if (!list_empty(&pool->sp_threads) &&



2006-07-31 04:14:57

by Andrew Morton

Subject: Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware

On Mon, 31 Jul 2006 10:42:34 +1000
NeilBrown <[email protected]> wrote:

> +static int
> +svc_pool_map_init_percpu(struct svc_pool_map *m)
> +{
> + unsigned int maxpools = num_possible_cpus();
> + unsigned int pidx = 0;
> + unsigned int cpu;
> + int err;
> +
> + err = svc_pool_map_alloc_arrays(m, maxpools);
> + if (err)
> + return err;
> +
> + for_each_online_cpu(cpu) {
> + BUG_ON(pidx > maxpools);
> + m->to_pool[cpu] = pidx;
> + m->pool_to[pidx] = cpu;
> + pidx++;
> + }

That isn't right - it assumes that cpu_possible_map is not sparse. If it
is sparse, we allocate undersized pools and then overindex them.


2006-07-31 04:36:17

by NeilBrown

Subject: Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware

On Sunday July 30, [email protected] wrote:
> On Mon, 31 Jul 2006 10:42:34 +1000
> NeilBrown <[email protected]> wrote:
>
> > +static int
> > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > +{
> > + unsigned int maxpools = num_possible_cpus();
> > + unsigned int pidx = 0;
> > + unsigned int cpu;
> > + int err;
> > +
> > + err = svc_pool_map_alloc_arrays(m, maxpools);
> > + if (err)
> > + return err;
> > +
> > + for_each_online_cpu(cpu) {
> > + BUG_ON(pidx > maxpools);
> > + m->to_pool[cpu] = pidx;
> > + m->pool_to[pidx] = cpu;
> > + pidx++;
> > + }
>
> That isn't right - it assumes that cpu_possible_map is not sparse. If it
> is sparse, we allocate undersized pools and then overindex them.

I don't think so.

At this point we are largely counting the number of online cpus
in pidx (the pool index), and that count is what gets returned. The
to_pool and pool_to arrays provide a two-way mapping between the
possibly-sparse cpu list and a dense list of pool indexes.

If further cpus come online they will automatically be included in
pool 0 (as to_pool[n] will still be zero).
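
For example (a minimal user-space sketch with made-up cpu numbers, not
the kernel code itself): if the online cpus were the sparse set
{0, 2, 5}, the loop would build a dense set of three pools like this:

/* sketch: dense pool ids built from a sparse online cpu set {0, 2, 5} */
#include <stdio.h>

int main(void)
{
	unsigned int online_cpus[] = { 0, 2, 5 };	/* possibly sparse */
	unsigned int to_pool[8] = { 0 };		/* cpu  -> pool id */
	unsigned int pool_to[8] = { 0 };		/* pool -> cpu     */
	unsigned int pidx = 0, i;

	for (i = 0; i < 3; i++) {	/* stands in for for_each_online_cpu() */
		unsigned int cpu = online_cpus[i];
		to_pool[cpu] = pidx;
		pool_to[pidx] = cpu;
		pidx++;
	}
	/* pidx == 3 pools; cpu 5 -> pool 2, and pool 2 -> cpu 5 */
	printf("npools=%u to_pool[5]=%u pool_to[2]=%u\n",
	       pidx, to_pool[5], pool_to[2]);
	/* a cpu hotplugged later, say cpu 6, reads to_pool[6] == 0 -> pool 0 */
	return 0;
}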

Does that make it at all clearer?

Thanks,
NeilBrown


2006-07-31 04:42:49

by Greg Banks

Subject: Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware

On Mon, 2006-07-31 at 14:36, Neil Brown wrote:
> On Sunday July 30, [email protected] wrote:
> > On Mon, 31 Jul 2006 10:42:34 +1000
> > NeilBrown <[email protected]> wrote:
> >
> > > +static int
> > > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > > +{
> > > + unsigned int maxpools = num_possible_cpus();
> > > + unsigned int pidx = 0;
> > > + unsigned int cpu;
> > > + int err;
> > > +
> > > + err = svc_pool_map_alloc_arrays(m, maxpools);
> > > + if (err)
> > > + return err;
> > > +
> > > + for_each_online_cpu(cpu) {
> > > + BUG_ON(pidx > maxpools);
> > > + m->to_pool[cpu] = pidx;
> > > + m->pool_to[pidx] = cpu;
> > > + pidx++;
> > > + }
> >
> > That isn't right - it assumes that cpu_possible_map is not sparse. If it
> > is sparse, we allocate undersized pools and then overindex them.
>
> I don't think so.
>
> At this point we are largely counting the number of online cpus
> (in pidx (pool index) - this is returned). The two-way mapping
> to_pool and pool_to provides a mapping between the possible-sparse cpu
> list and a dense list of pool indexes.
>
> If further cpus come on line they will be automatically included in
> pool-0. (as to_pool[n] will still be zero).
>
> Does that make it at all clearer?

Umm, I think Andrew's right: num_possible_cpus() should be NR_CPUS.
If there's a cpu number >= num_possible_cpus(), which happens when
the cpu numbers aren't contiguous, then m->to_pool[] will be
overindexed. My bad, I didn't even consider the case of
non-contiguous CPU numbers, and none of the machines available for
testing had that property.
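
To put hypothetical numbers on it: with a sparse possible map of
{0, 2}, num_possible_cpus() is 2, so to_pool[] and pool_to[] get 2
slots, yet the online loop writes m->to_pool[2], one past the end.
Sizing by the highest possible id + 1 (3 slots here) leaves room for
the holes:

/* sketch with made-up numbers, not kernel code */
#include <stdio.h>

int main(void)
{
	/* hypothetical sparse possible map: cpus 0 and 2 exist, cpu 1 does not */
	unsigned int possible_cpus[] = { 0, 2 };
	unsigned int num_possible = 2;	/* what num_possible_cpus() would return */
	unsigned int highest_id = 2;	/* what highest_possible_processor_id() would return */

	/* a count-sized array has 2 slots, but cpu 2 needs index 2, i.e. a 3rd slot */
	printf("count-sized: %u slots, largest index needed %u -> overindex\n",
	       num_possible, possible_cpus[1]);
	/* an id-sized array has 3 slots, enough for the sparse numbering */
	printf("id-sized:    %u slots, largest index needed %u -> ok\n",
	       highest_id + 1, possible_cpus[1]);
	return 0;
}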


Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.




2006-07-31 05:55:13

by Greg Banks

Subject: Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware

On Mon, 2006-07-31 at 14:42, Greg Banks wrote:
> On Mon, 2006-07-31 at 14:36, Neil Brown wrote:
> > On Sunday July 30, [email protected] wrote:
> > > On Mon, 31 Jul 2006 10:42:34 +1000
> > > NeilBrown <[email protected]> wrote:
> > >
> > > > +static int
> > > > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > > > +{
> > > > + unsigned int maxpools = num_possible_cpus();
> > > > + unsigned int pidx = 0;
> > > > + unsigned int cpu;
> > > > + int err;
> > > > +
> > > > +
> > >
> > > That isn't right - it assumes that cpu_possible_map is not sparse. If it
> > > is sparse, we allocate undersized pools and then overindex them.
>
> Umm, I think Andrew's right, num_possible_cpus() should be NR_CPUS.

How about this version of the patch? It replaces num_possible_cpus()
with highest_possible_processor_id()+1 and similarly for nodes.
--

knfsd: Actually implement multiple pools. On NUMA machines, allocate
a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
global pool. Enqueue sockets on the svc_pool corresponding to the CPU
on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
have their cpu mask set to limit them to the CPUs in the svc_pool that
owns them.

This is the patch that allows an Altix to scale NFS traffic linearly
beyond 4 CPUs and 4 NICs.

Incorporates changes and feedback from Neil Brown, Trond Myklebust,
Christoph Hellwig and Andrew Morton.

Signed-off-by: Greg Banks <[email protected]>
---

include/linux/sunrpc/svc.h | 1
net/sunrpc/svc.c | 258 +++++++++++++++++++++++++++++++++-
net/sunrpc/svcsock.c | 7
3 files changed, 264 insertions(+), 2 deletions(-)

Index: linus-git/net/sunrpc/svc.c
===================================================================
--- linus-git.orig/net/sunrpc/svc.c 2006-07-26 16:50:25.802187916 +1000
+++ linus-git/net/sunrpc/svc.c 2006-07-31 15:08:59.999800051 +1000
@@ -4,6 +4,10 @@
* High-level RPC service routines
*
* Copyright (C) 1995, 1996 Olaf Kirch <[email protected]>
+ *
+ * Multiple threads pools and NUMAisation
+ * Copyright (c) 2006 Silicon Graphics, Inc.
+ * by Greg Banks <[email protected]>
*/

#include <linux/linkage.h>
@@ -25,6 +29,245 @@
#define RPC_PARANOIA 1

/*
+ * Mode for mapping cpus to pools.
+ */
+enum {
+ SVC_POOL_NONE = -1, /* uninitialised, choose one of the others */
+ SVC_POOL_GLOBAL, /* no mapping, just a single global pool
+ * (legacy & UP mode) */
+ SVC_POOL_PERCPU, /* one pool per cpu */
+ SVC_POOL_PERNODE /* one pool per numa node */
+};
+
+/*
+ * Structure for mapping cpus to pools and vice versa.
+ * Setup once during sunrpc initialisation.
+ */
+static struct svc_pool_map {
+ int mode; /* Note: int not enum to avoid
+ * warnings about "enumeration value
+ * not handled in switch" */
+ unsigned int npools;
+ unsigned int *pool_to; /* maps pool id to cpu or node */
+ unsigned int *to_pool; /* maps cpu or node to pool id */
+} svc_pool_map = {
+ .mode = SVC_POOL_NONE
+};
+
+
+/*
+ * Detect best pool mapping mode heuristically,
+ * according to the machine's topology.
+ */
+static int
+svc_pool_map_choose_mode(void)
+{
+ unsigned int node;
+
+ if (num_online_nodes() > 1) {
+ /*
+ * Actually have multiple NUMA nodes,
+ * so split pools on NUMA node boundaries
+ */
+ return SVC_POOL_PERNODE;
+ }
+
+ node = any_online_node(node_online_map);
+ if (nr_cpus_node(node) > 2) {
+ /*
+ * Non-trivial SMP, or CONFIG_NUMA on
+ * non-NUMA hardware, e.g. with a generic
+ * x86_64 kernel on Xeons. In this case we
+ * want to divide the pools on cpu boundaries.
+ */
+ return SVC_POOL_PERCPU;
+ }
+
+ /* default: one global pool */
+ return SVC_POOL_GLOBAL;
+}
+
+/*
+ * Allocate the to_pool[] and pool_to[] arrays.
+ * Returns 0 on success or an errno.
+ */
+static int
+svc_pool_map_alloc_arrays(struct svc_pool_map *m, unsigned int maxpools)
+{
+ m->to_pool = kcalloc(maxpools, sizeof(unsigned int), GFP_KERNEL);
+ if (!m->to_pool)
+ goto fail;
+ m->pool_to = kcalloc(maxpools, sizeof(unsigned int), GFP_KERNEL);
+ if (!m->pool_to)
+ goto fail_free;
+
+ return 0;
+
+fail_free:
+ kfree(m->to_pool);
+fail:
+ return -ENOMEM;
+}
+
+/*
+ * Initialise the pool map for SVC_POOL_PERCPU mode.
+ * Returns number of pools or <0 on error.
+ */
+static int
+svc_pool_map_init_percpu(struct svc_pool_map *m)
+{
+ unsigned int maxpools = highest_possible_processor_id()+1;
+ unsigned int pidx = 0;
+ unsigned int cpu;
+ int err;
+
+ err = svc_pool_map_alloc_arrays(m, maxpools);
+ if (err)
+ return err;
+
+ for_each_online_cpu(cpu) {
+ BUG_ON(pidx > maxpools);
+ m->to_pool[cpu] = pidx;
+ m->pool_to[pidx] = cpu;
+ pidx++;
+ }
+ /* cpus brought online later all get mapped to pool0, sorry */
+
+ return pidx;
+};
+
+static int
+highest_possible_node_id(void)
+{
+ unsigned int node;
+ unsigned int highest = 0;
+
+ for_each_node(node)
+ highest = node;
+
+ return highest;
+}
+
+
+/*
+ * Initialise the pool map for SVC_POOL_PERNODE mode.
+ * Returns number of pools or <0 on error.
+ */
+static int
+svc_pool_map_init_pernode(struct svc_pool_map *m)
+{
+ unsigned int maxpools = highest_possible_node_id()+1;
+ unsigned int pidx = 0;
+ unsigned int node;
+ int err;
+
+ err = svc_pool_map_alloc_arrays(m, maxpools);
+ if (err)
+ return err;
+
+ for_each_node_with_cpus(node) {
+ /* some architectures (e.g. SN2) have cpuless nodes */
+ BUG_ON(pidx > maxpools);
+ m->to_pool[node] = pidx;
+ m->pool_to[pidx] = node;
+ pidx++;
+ }
+ /* nodes brought online later all get mapped to pool0, sorry */
+
+ return pidx;
+}
+
+
+/*
+ * Build the global map of cpus to pools and vice versa.
+ */
+static unsigned int
+svc_pool_map_init(void)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ int npools = -1;
+
+ if (m->mode != SVC_POOL_NONE)
+ return m->npools;
+
+ m->mode = svc_pool_map_choose_mode();
+
+ switch (m->mode) {
+ case SVC_POOL_PERCPU:
+ npools = svc_pool_map_init_percpu(m);
+ break;
+ case SVC_POOL_PERNODE:
+ npools = svc_pool_map_init_pernode(m);
+ break;
+ }
+
+ if (npools < 0) {
+ /* default, or memory allocation failure */
+ npools = 1;
+ m->mode = SVC_POOL_GLOBAL;
+ }
+ m->npools = npools;
+
+ return m->npools;
+}
+
+/*
+ * Set the current thread's cpus_allowed mask so that it
+ * will only run on cpus in the given pool.
+ *
+ * Returns 1 and fills in oldmask iff a cpumask was applied.
+ */
+static inline int
+svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ unsigned int node; /* or cpu */
+
+ BUG_ON(m->mode == SVC_POOL_NONE);
+
+ switch (m->mode)
+ {
+ default:
+ return 0;
+ case SVC_POOL_PERCPU:
+ node = m->pool_to[pidx];
+ *oldmask = current->cpus_allowed;
+ set_cpus_allowed(current, cpumask_of_cpu(node));
+ return 1;
+ case SVC_POOL_PERNODE:
+ node = m->pool_to[pidx];
+ *oldmask = current->cpus_allowed;
+ set_cpus_allowed(current, node_to_cpumask(node));
+ return 1;
+ }
+}
+
+/*
+ * Use the mapping mode to choose a pool for a given CPU.
+ * Used when enqueueing an incoming RPC. Always returns
+ * a non-NULL pool pointer.
+ */
+struct svc_pool *
+svc_pool_for_cpu(struct svc_serv *serv, int cpu)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ unsigned int pidx = 0;
+
+ BUG_ON(m->mode == SVC_POOL_NONE);
+
+ switch (m->mode) {
+ case SVC_POOL_PERCPU:
+ pidx = m->to_pool[cpu];
+ break;
+ case SVC_POOL_PERNODE:
+ pidx = m->to_pool[cpu_to_node(cpu)];
+ break;
+ }
+ return &serv->sv_pools[pidx % serv->sv_nrpools];
+}
+
+
+/*
* Create an RPC service
*/
static struct svc_serv *
@@ -103,8 +346,9 @@ svc_create_pooled(struct svc_program *pr
svc_thread_fn func, int sig, struct module *mod)
{
struct svc_serv *serv;
+ unsigned int npools = svc_pool_map_init();

- serv = __svc_create(prog, bufsize, shutdown, /*npools*/1);
+ serv = __svc_create(prog, bufsize, shutdown, npools);

if (serv != NULL) {
serv->sv_function = func;
@@ -207,12 +451,16 @@ svc_release_buffer(struct svc_rqst *rqst

/*
* Create a thread in the given pool. Caller must hold BKL.
+ * On a NUMA or SMP machine, with a multi-pool serv, the thread
+ * will be restricted to run on the cpus belonging to the pool.
*/
static int
__svc_create_thread(svc_thread_fn func, struct svc_serv *serv, struct svc_pool *pool)
{
struct svc_rqst *rqstp;
int error = -ENOMEM;
+ int have_oldmask = 0;
+ cpumask_t oldmask;

rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
if (!rqstp)
@@ -232,7 +480,15 @@ __svc_create_thread(svc_thread_fn func,
spin_unlock_bh(&pool->sp_lock);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;
+
+ if (serv->sv_nrpools > 1)
+ have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
+
error = kernel_thread((int (*)(void *)) func, rqstp, 0);
+
+ if (have_oldmask)
+ set_cpus_allowed(current, oldmask);
+
if (error < 0)
goto out_thread;
svc_sock_update_bufs(serv);
Index: linus-git/net/sunrpc/svcsock.c
===================================================================
--- linus-git.orig/net/sunrpc/svcsock.c 2006-07-26 16:47:46.018822248 +1000
+++ linus-git/net/sunrpc/svcsock.c 2006-07-31 15:00:27.710150626 +1000
@@ -151,8 +151,9 @@ static void
svc_sock_enqueue(struct svc_sock *svsk)
{
struct svc_serv *serv = svsk->sk_server;
- struct svc_pool *pool = &serv->sv_pools[0];
+ struct svc_pool *pool;
struct svc_rqst *rqstp;
+ int cpu;

if (!(svsk->sk_flags &
( (1<<SK_CONN)|(1<<SK_DATA)|(1<<SK_CLOSE)|(1<<SK_DEFERRED)) ))
@@ -160,6 +161,10 @@ svc_sock_enqueue(struct svc_sock *svsk)
if (test_bit(SK_DEAD, &svsk->sk_flags))
return;

+ cpu = get_cpu();
+ pool = svc_pool_for_cpu(svsk->sk_server, cpu);
+ put_cpu();
+
spin_lock_bh(&pool->sp_lock);

if (!list_empty(&pool->sp_threads) &&
Index: linus-git/include/linux/sunrpc/svc.h
===================================================================
--- linus-git.orig/include/linux/sunrpc/svc.h 2006-07-26 16:49:47.615119695 +1000
+++ linus-git/include/linux/sunrpc/svc.h 2006-07-31 15:00:28.946000862 +1000
@@ -368,5 +368,6 @@ int svc_process(struct svc_serv *, s
int svc_register(struct svc_serv *, int, unsigned short);
void svc_wake_up(struct svc_serv *);
void svc_reserve(struct svc_rqst *rqstp, int space);
+struct svc_pool * svc_pool_for_cpu(struct svc_serv *serv, int cpu);

#endif /* SUNRPC_SVC_H */




Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.




2006-08-06 09:47:10

by Andrew Morton

Subject: Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware

On Mon, 31 Jul 2006 10:42:34 +1000
NeilBrown <[email protected]> wrote:

> knfsd: Actually implement multiple pools. On NUMA machines, allocate
> a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> have their cpu mask set to limit them to the CPUs in the svc_pool that
> owns them.
>
> This is the patch that allows an Altix to scale NFS traffic linearly
> beyond 4 CPUs and 4 NICs.
>
> Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> and Christoph Hellwig.

This makes the NFS client go BUG. Simple nfsv3 workload (ie: mount, read
stuff). Uniproc, FC5.

+ BUG_ON(m->mode == SVC_POOL_NONE);


kernel BUG at net/sunrpc/svc.c:244!
invalid opcode: 0000 [#1]
4K_STACKS
last sysfs file: /class/net/eth1/flags
Modules linked in: nfs lockd nfs_acl ipw2200 sonypi autofs4 hidp l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables video sony_acpi sbs i2c_ec button battery asus_acpi ac nvram ohci1394 ieee1394 ehci_hcd uhci_hcd sg joydev snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ieee80211 snd_pcm_oss snd_mixer_oss ieee80211_crypt snd_pcm snd_timer snd i2c_i801 soundcore i2c_core piix snd_page_alloc pcspkr generic ext3 jbd ide_disk ide_core
CPU: 0
EIP: 0060:[<f8d6d308>] Not tainted VLI
EFLAGS: 00210246 (2.6.18-rc3-mm1 #21)
EIP is at svc_pool_for_cpu+0xc/0x43 [sunrpc]
eax: ffffffff ebx: f59a75c0 ecx: f59a76c0 edx: 00000000
esi: f59cbc20 edi: f59a75c0 ebp: f582d5c0 esp: f59cbb98
ds: 007b es: 007b ss: 0068
Process mount (pid: 2599, ti=f59cb000 task=f59a5550 task.ti=f59cb000)
Stack: f8d6e506 f59cbbb0 00000000 00200282 00000014 00000006 f59a76c0 f59a75c0
f59cbc20 f59a75c0 f582d5c0 f8d6ea83 00200286 c0270376 f582d5c0 f59a76c0
f59a75c0 c02dbbc0 00000006 f59a76c0 f5956c80 f8d6ebd7 00000001 00000000
Call Trace:
[<f8d6e506>] svc_sock_enqueue+0x33/0x294 [sunrpc]
[<f8d6ea83>] svc_setup_socket+0x31c/0x326 [sunrpc]
[<c0270376>] release_sock+0xc/0x83
[<f8d6ebd7>] svc_makesock+0x14a/0x185 [sunrpc]
[<f8ca3b10>] make_socks+0x72/0xae [lockd]
[<f8ca3bce>] lockd_up+0x82/0xd9 [lockd]
[<c01169a6>] __wake_up+0x11/0x1a
[<f9227743>] nfs_start_lockd+0x26/0x43 [nfs]
[<f9228264>] nfs_create_server+0x1dc/0x3da [nfs]
[<c02c4298>] wait_for_completion+0x70/0x99
[<c0116293>] default_wake_function+0x0/0xc
[<c0124918>] call_usermodehelper_keys+0xc4/0xd3
[<f922e348>] nfs_get_sb+0x398/0x3b4 [nfs]
[<c0124927>] __call_usermodehelper+0x0/0x43
[<c0158d68>] vfs_kern_mount+0x83/0xf6
[<c0158e1d>] do_kern_mount+0x2d/0x3e
[<c016a8ac>] do_mount+0x5b2/0x625
[<c019facb>] task_has_capability+0x56/0x5e
[<c029479e>] inet_bind_bucket_create+0x11/0x3c
[<c0295e57>] inet_csk_get_port+0x196/0x1a0
[<c0270376>] release_sock+0xc/0x83
[<c02add33>] inet_bind+0x1c6/0x1d0
[<c01397fe>] handle_IRQ_event+0x23/0x49
[<c013ec5e>] __alloc_pages+0x5e/0x28d
[<c01034d2>] common_interrupt+0x1a/0x20
[<c0169815>] copy_mount_options+0x26/0x109
[<c016a991>] sys_mount+0x72/0xa4
[<c0102b4b>] syscall_call+0x7/0xb
Code: 31 c0 eb 15 8b 40 10 89 d1 c1 e9 02 8b 50 1c 8d 41 02 89 42 04 8d 44 8b 08 5a 59 5b c3 90 90 89 c1 a1 88 86 d8 f8 83 f8 ff 75 0a <0f> 0b f4 00 2c 6f d7 f8 eb 0a 83 f8 01 74 09 83 f8 02 74 0e 31
EIP: [<f8d6d308>] svc_pool_for_cpu+0xc/0x43 [sunrpc] SS:ESP 0068:f59cbb98



2006-08-07 03:17:02

by Greg Banks

Subject: Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware

On Sun, 2006-08-06 at 19:47, Andrew Morton wrote:
> On Mon, 31 Jul 2006 10:42:34 +1000
> NeilBrown <[email protected]> wrote:
>
> > knfsd: Actually implement multiple pools. On NUMA machines, allocate
> > a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> > global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> > on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> > have their cpu mask set to limit them to the CPUs in the svc_pool that
> > owns them.
> >
> > This is the patch that allows an Altix to scale NFS traffic linearly
> > beyond 4 CPUs and 4 NICs.
> >
> > Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> > and Christoph Hellwig.
>
> This makes the NFS client go BUG. Simple nfsv3 workload (ie: mount, read
> stuff). Uniproc, FC5.
>
> + BUG_ON(m->mode == SVC_POOL_NONE);

Aha, I see what I b0rked up. On the client, lockd starts an RPC
service via the old svc_create() interface, which avoids calling
svc_pool_map_init(). When the first NLM callback arrives,
svc_sock_enqueue() calls svc_pool_for_cpu() which BUGs out because
the map is not initialised. The BUG_ON() was introduced in one
of the rewrites in response to review feedback in the last few
days; previously the code was simpler and would trivially return
pool 0, which is the right thing to do in this case. The bug was
hidden on my test machines because they have SLES userspaces,
where lockd is broken because both the kernel and userspace think
the other one is doing the rpc.statd functionality.

A simple patch should fix this, coming up as soon as I can find
a non-SLES machine and run some client tests.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.




2006-08-07 11:25:22

by Greg Banks

Subject: Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware

On Sun, 2006-08-06 at 19:47, Andrew Morton wrote:
> On Mon, 31 Jul 2006 10:42:34 +1000
> NeilBrown <[email protected]> wrote:
>
> > knfsd: Actually implement multiple pools. On NUMA machines, allocate
> > a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> > global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> > on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> > have their cpu mask set to limit them to the CPUs in the svc_pool that
> > owns them.
> >
> > This is the patch that allows an Altix to scale NFS traffic linearly
> > beyond 4 CPUs and 4 NICs.
> >
> > Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> > and Christoph Hellwig.
>
> This makes the NFS client go BUG. Simple nfsv3 workload (ie: mount, read
> stuff). Uniproc, FC5.
>
> + BUG_ON(m->mode == SVC_POOL_NONE);
>

Reproduced on RHAS4; this patch fixes it for me.
--

knfsd: Fix a regression on an NFS client where mounting an
NFS filesystem trips a spurious BUG_ON() in the server code.
Tested using cthon04 lock tests on RHAS4-U2 userspace.

Signed-off-by: Greg Banks <[email protected]>
---

net/sunrpc/svc.c | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletion(-)

Index: linux-2.6.18-rc2/net/sunrpc/svc.c
===================================================================
--- linux-2.6.18-rc2.orig/net/sunrpc/svc.c
+++ linux-2.6.18-rc2/net/sunrpc/svc.c
@@ -211,6 +211,11 @@ svc_pool_map_set_cpumask(unsigned int pi
struct svc_pool_map *m = &svc_pool_map;
unsigned int node; /* or cpu */

+ /*
+ * The caller checks for sv_nrpools > 1, which
+ * implies that we've been initialized and the
+ * map mode is not NONE.
+ */
BUG_ON(m->mode == SVC_POOL_NONE);

switch (m->mode)
@@ -241,7 +246,11 @@ svc_pool_for_cpu(struct svc_serv *serv,
struct svc_pool_map *m = &svc_pool_map;
unsigned int pidx = 0;

- BUG_ON(m->mode == SVC_POOL_NONE);
+ /*
+ * SVC_POOL_NONE happens in a pure client when
+ * lockd is brought up, so silently treat it the
+ * same as SVC_POOL_GLOBAL.
+ */

switch (m->mode) {
case SVC_POOL_PERCPU:





Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.




2006-08-01 04:43:35

by Andrew Morton

Subject: Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware

On Mon, 31 Jul 2006 15:54:57 +1000
Greg Banks <[email protected]> wrote:

> On Mon, 2006-07-31 at 14:42, Greg Banks wrote:
> > On Mon, 2006-07-31 at 14:36, Neil Brown wrote:
> > > On Sunday July 30, [email protected] wrote:
> > > > On Mon, 31 Jul 2006 10:42:34 +1000
> > > > NeilBrown <[email protected]> wrote:
> > > >
> > > > > +static int
> > > > > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > > > > +{
> > > > > + unsigned int maxpools = num_possible_cpus();
> > > > > + unsigned int pidx = 0;
> > > > > + unsigned int cpu;
> > > > > + int err;
> > > > > +
> > > > > +
> > > >
> > > > That isn't right - it assumes that cpu_possible_map is not sparse. If it
> > > > is sparse, we allocate undersized pools and then overindex them.
> >
> > Umm, I think Andrew's right, num_possible_cpus() should be NR_CPUS.
>
> How about this version of the patch? It replaces num_possible_cpus()
> with highest_possible_processor_id()+1 and similarly for nodes.
> --
>
> knfsd: Actually implement multiple pools. On NUMA machines, allocate
> a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> have their cpu mask set to limit them to the CPUs in the svc_pool that
> owns them.
>
> This is the patch that allows an Altix to scale NFS traffic linearly
> beyond 4 CPUs and 4 NICs.
>
> Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> Christoph Hellwig and Andrew Morton.
>

Something has gone rather wrong here.

> - serv = __svc_create(prog, bufsize, shutdown, /*npools*/1);
> + serv = __svc_create(prog, bufsize, shutdown, npools);

__svc_create() is:

__svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
void (*shutdown)(struct svc_serv *serv))

so heaven knows what tree you're patching.

Incremental patches really are preferred. So we can see what people are
monkeying with ;)

After fixing the rejects and cleaning a few things up, your proposed change
amounts to:

--- a/net/sunrpc/svc.c~knfsd-make-rpc-threads-pools-numa-aware-fix
+++ a/net/sunrpc/svc.c
@@ -116,7 +116,7 @@ fail:
static int
svc_pool_map_init_percpu(struct svc_pool_map *m)
{
- unsigned int maxpools = num_possible_cpus();
+ unsigned int maxpools = highest_possible_processor_id() + 1;
unsigned int pidx = 0;
unsigned int cpu;
int err;
@@ -136,6 +136,18 @@ svc_pool_map_init_percpu(struct svc_pool
return pidx;
};

+static int
+highest_possible_node_id(void)
+{
+ unsigned int node;
+ unsigned int highest = 0;
+
+ for_each_node(node)
+ highest = node;
+
+ return highest;
+}
+

/*
* Initialise the pool map for SVC_POOL_PERNODE mode.
@@ -144,7 +156,7 @@ svc_pool_map_init_percpu(struct svc_pool
static int
svc_pool_map_init_pernode(struct svc_pool_map *m)
{
- unsigned int maxpools = num_possible_nodes();
+ unsigned int maxpools = highest_possible_node_id() + 1;
unsigned int pidx = 0;
unsigned int node;
int err;
_


Which shouldn't have compiled, due to the missing forward declaration. And
I'd be surprised if it worked very well with CONFIG_NUMA=n. And it's
naughty to be sneaking general library functions into the sunrpc code
anyway.

Please,

- Write a standalone patch which adds highest_possible_node_id() to
lib/cpumask.c(?)

Make sure it's inside #ifdef CONFIG_NUMA

Remember to export it to modules.

Provide a !CONFIG_NUMA version in include/linux/nodemask.h which just
returns constant zero.

Consider doing something more efficient than the for_each_node() loop.
Although I'm not sure what that would be, given that we don't have
find_last_bit().

- Provide an incremental patch against
knfsd-make-rpc-threads-pools-numa-aware.patch which utilises
highest_possible_node_id().

A replacement patch will be grudgingly accepted, but I'll only go and
turn it into an incremental one, so you can't hide ;)

- Test it real good. Modular, non-modular, NUMA, non-NUMA, !SMP.

Thanks.


2006-08-01 05:22:39

by Greg Banks

Subject: Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware

On Tue, 2006-08-01 at 14:43, Andrew Morton wrote:
> On Mon, 31 Jul 2006 15:54:57 +1000
> Greg Banks <[email protected]> wrote:
>
> > On Mon, 2006-07-31 at 14:42, Greg Banks wrote:
> > > On Mon, 2006-07-31 at 14:36, Neil Brown wrote:
> > > > On Sunday July 30, [email protected] wrote:
> > > > > On Mon, 31 Jul 2006 10:42:34 +1000
> > > > > NeilBrown <[email protected]> wrote:
> > > > >
> > > > > > +static int
> > > > > > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > > > > > +{
> > > > > > + unsigned int maxpools = num_possible_cpus();
> > > > > > + unsigned int pidx = 0;
> > > > > > + unsigned int cpu;
> > > > > > + int err;
> > > > > > +
> > > > > > +
> > > > >
> > > > > That isn't right - it assumes that cpu_possible_map is not sparse. If it
> > > > > is sparse, we allocate undersized pools and then overindex them.
> > >
> > > Umm, I think Andrew's right, num_possible_cpus() should be NR_CPUS.
> >
> > How about this version of the patch? It replaces num_possible_cpus()
> > with highest_possible_processor_id()+1 and similarly for nodes.
> > --
> >
> > knfsd: Actually implement multiple pools. On NUMA machines, allocate
> > a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> > global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> > on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> > have their cpu mask set to limit them to the CPUs in the svc_pool that
> > owns them.
> >
> > This is the patch that allows an Altix to scale NFS traffic linearly
> > beyond 4 CPUs and 4 NICs.
> >
> > Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> > Christoph Hellwig and Andrew Morton.
> >
>
> Something has gone rather wrong here.
>
> > - serv = __svc_create(prog, bufsize, shutdown, /*npools*/1);
> > + serv = __svc_create(prog, bufsize, shutdown, npools);
>
> __svc_create() is:
>
> __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> void (*shutdown)(struct svc_serv *serv))
>
> so heaven knows what tree you're patching.

Hmm. Neil introduced the `shutdown' argument a few days ago, and I
thought I had his patches in my quilt tree, but apparently I screwed
something up. Sorry about that.

> Incremental patches really are preferred. So we can see what people are
> monkeying with ;)

Righto.

> After fixing the rejects and cleaning a few things up, your proposed change
> amounts to:
>
> --- a/net/sunrpc/svc.c~knfsd-make-rpc-threads-pools-numa-aware-fix
> +++ a/net/sunrpc/svc.c
> @@ -116,7 +116,7 @@ fail:
> static int
> svc_pool_map_init_percpu(struct svc_pool_map *m)
> {
> - unsigned int maxpools = num_possible_cpus();
> + unsigned int maxpools = highest_possible_processor_id() + 1;
> unsigned int pidx = 0;
> unsigned int cpu;
> int err;
> @@ -136,6 +136,18 @@ svc_pool_map_init_percpu(struct svc_pool
> return pidx;
> };
>
> +static int
> +highest_possible_node_id(void)
> +{
> + unsigned int node;
> + unsigned int highest = 0;
> +
> + for_each_node(node)
> + highest = node;
> +
> + return highest;
> +}
> +
>
> /*
> * Initialise the pool map for SVC_POOL_PERNODE mode.
> @@ -144,7 +156,7 @@ svc_pool_map_init_percpu(struct svc_pool
> static int
> svc_pool_map_init_pernode(struct svc_pool_map *m)
> {
> - unsigned int maxpools = num_possible_nodes();
> + unsigned int maxpools = highest_possible_node_id() + 1;
> unsigned int pidx = 0;
> unsigned int node;
> int err;
> _
>
>

Yes.

> Which shouldn't have compiled, due to the missing forward declaration.

? It compiled, booted, and ran traffic.

gnb@chook 1965> quilt push 2
Applying patch knfsd-make-pools-numa-aware-5
patching file net/sunrpc/svc.c
patching file net/sunrpc/svcsock.c
patching file include/linux/sunrpc/svc.h

Applying patch knfsd-allow-admin-to-set-nthreads-per-node-2
patching file fs/nfsd/nfsctl.c
patching file fs/nfsd/nfssvc.c

Now at patch knfsd-allow-admin-to-set-nthreads-per-node-2

gnb@chook 1966> make compressed modules
CHK include/linux/version.h
CHK include/linux/utsrelease.h
CHK include/linux/compile.h
CC arch/ia64/ia32/sys_ia32.o
LD arch/ia64/ia32/built-in.o
CC fs/compat.o
CC fs/nfsctl.o
CC [M] fs/lockd/clntlock.o
CC [M] fs/lockd/clntproc.o
CC [M] fs/lockd/host.o
CC [M] fs/lockd/svc.o
CC [M] fs/lockd/svclock.o
CC [M] fs/lockd/svcshare.o
CC [M] fs/lockd/svcproc.o
CC [M] fs/lockd/svcsubs.o
CC [M] fs/lockd/mon.o
CC [M] fs/lockd/xdr.o
CC [M] fs/lockd/xdr4.o
CC [M] fs/lockd/svc4proc.o
LD [M] fs/lockd/lockd.o
CC [M] fs/nfs/callback.o
CC [M] fs/nfs/callback_xdr.o
LD [M] fs/nfs/nfs.o
CC [M] fs/nfsd/nfssvc.o
CC [M] fs/nfsd/nfsctl.o
CC [M] fs/nfsd/nfsproc.o
CC [M] fs/nfsd/nfsfh.o
CC [M] fs/nfsd/vfs.o
CC [M] fs/nfsd/export.o
CC [M] fs/nfsd/auth.o
CC [M] fs/nfsd/lockd.o
CC [M] fs/nfsd/nfscache.o
CC [M] fs/nfsd/nfsxdr.o
CC [M] fs/nfsd/stats.o
CC [M] fs/nfsd/nfs2acl.o
CC [M] fs/nfsd/nfs3proc.o
CC [M] fs/nfsd/nfs3xdr.o
CC [M] fs/nfsd/nfs3acl.o
CC [M] fs/nfsd/nfs4proc.o
CC [M] fs/nfsd/nfs4xdr.o
CC [M] fs/nfsd/nfs4state.o
CC [M] fs/nfsd/nfs4idmap.o
CC [M] fs/nfsd/nfs4callback.o
CC [M] fs/nfsd/nfs4recover.o
LD [M] fs/nfsd/nfsd.o
LD fs/built-in.o
CC [M] net/sunrpc/svc.o
CC [M] net/sunrpc/svcsock.o
CC [M] net/sunrpc/svcauth.o
CC [M] net/sunrpc/svcauth_unix.o
CC [M] net/sunrpc/sunrpc_syms.o
CC [M] net/sunrpc/stats.o
LD [M] net/sunrpc/sunrpc.o
CC [M] net/sunrpc/auth_gss/auth_gss.o
CC [M] net/sunrpc/auth_gss/gss_mech_switch.o
CC [M] net/sunrpc/auth_gss/svcauth_gss.o
CC [M] net/sunrpc/auth_gss/gss_krb5_crypto.o
CC [M] net/sunrpc/auth_gss/gss_krb5_mech.o
CC [M] net/sunrpc/auth_gss/gss_krb5_seal.o
CC [M] net/sunrpc/auth_gss/gss_krb5_unseal.o
CC [M] net/sunrpc/auth_gss/gss_krb5_seqnum.o
CC [M] net/sunrpc/auth_gss/gss_krb5_wrap.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_mech.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_seal.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_unseal.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_token.o
LD [M] net/sunrpc/auth_gss/auth_rpcgss.o
LD [M] net/sunrpc/auth_gss/rpcsec_gss_krb5.o
LD [M] net/sunrpc/auth_gss/rpcsec_gss_spkm3.o
GEN .version
CHK include/linux/compile.h
UPD include/linux/compile.h
CC init/version.o
LD init/built-in.o
LD .tmp_vmlinux1
KSYM .tmp_kallsyms1.S
AS .tmp_kallsyms1.o
LD .tmp_vmlinux2
KSYM .tmp_kallsyms2.S
AS .tmp_kallsyms2.o
LD vmlinux
SYSMAP System.map
SYSMAP .tmp_System.map
OBJCOPY arch/ia64/hp/sim/boot/vmlinux.bin
GZIP arch/ia64/hp/sim/boot/vmlinux.gz
LN vmlinux.gz
Kernel: vmlinux.gz is ready
Building modules, stage 2.
MODPOST
WARNING: drivers/atm/fore_200e.o - Section mismatch: reference to .init.data:_fore200e_pca_fw_data from .data.rel.ro between 'fore200e_bus' (at offset 0x20) and 'fore200e_ops'
WARNING: drivers/atm/fore_200e.o - Section mismatch: reference to .init.data:_fore200e_pca_fw_size from .data.rel.ro between 'fore200e_bus' (at offset 0x28) and 'fore200e_ops'
WARNING: drivers/atm/fore_200e.o - Section mismatch: reference to .init.text:fore200e_pca_prom_read from .data.rel.ro between 'fore200e_bus' (at offset 0x90) and 'fore200e_ops'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text:he_init_irq from .text between 'he_start' (at offset 0x2ca2) and 'he_irq_handler'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text:he_init_cs_block_rcm from .text between 'he_start' (at offset 0x81c2) and 'he_irq_handler'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text: from .text between 'he_start' (at offset 0x81e2) and 'he_irq_handler'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text:he_init_group from .text between 'he_start' (at offset 0x82a2) and 'he_irq_handler'
WARNING: drivers/block/cpqarray.o - Section mismatch: reference to .init.text:cpqarray_init_one from .data.rel.local between 'cpqarray_pci_driver' (at offset 0x20) and 'smart1_access'
WARNING: drivers/net/dgrs.o - Section mismatch: reference to .init.text:dgrs_pci_probe from .data.rel.local after 'dgrs_pci_driver' (at offset 0x20)
WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to .init.text:de_init_one from .data.rel.local after 'de_driver' (at offset 0x20)
WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to .exit.text:de_remove_one from .data.rel.local after 'de_driver' (at offset 0x28)
WARNING: drivers/usb/host/isp116x-hcd.o - Section mismatch: reference to .init.text:isp116x_probe from .data.rel.local between 'isp116x_driver' (at offset 0x0) and 'isp116x_hc_driver'
WARNING: drivers/video/aty/atyfb.o - Section mismatch: reference to .init.text:aty_get_pll_ct from .data.rel.ro after 'aty_pll_ct' (at offset 0x18)
WARNING: drivers/video/aty/atyfb.o - Section mismatch: reference to .init.text:aty_init_pll_ct from .data.rel.ro after 'aty_pll_ct' (at offset 0x20)
WARNING: drivers/video/nvidia/nvidiafb.o - Section mismatch: reference to .exit.text:nvidiafb_remove from .data.rel.local after 'nvidiafb_driver' (at offset 0x28)
WARNING: drivers/video/riva/rivafb.o - Section mismatch: reference to .exit.text:rivafb_remove from .data.rel.local after 'rivafb_driver' (at offset 0x28)
WARNING: sound/drivers/snd-dummy.o - Section mismatch: reference to .init.text:snd_dummy_probe from .data.rel.local between 'snd_dummy_driver' (at offset 0x0) and 'snd_dummy_controls'
WARNING: sound/drivers/snd-mtpav.o - Section mismatch: reference to .init.text:snd_mtpav_probe from .data.rel.local between 'snd_mtpav_driver' (at offset 0x0) and 'snd_mtpav_input'
WARNING: sound/drivers/snd-serial-u16550.o - Section mismatch: reference to .init.text:snd_serial_probe from .data.rel.local between 'snd_serial_driver' (at offset 0x0) and 'ops.15504'
WARNING: sound/drivers/snd-virmidi.o - Section mismatch: reference to .init.text:snd_virmidi_probe from .data.rel.local after 'snd_virmidi_driver' (at offset 0x0)
LD [M] fs/lockd/lockd.ko
LD [M] fs/nfs/nfs.ko
LD [M] fs/nfsd/nfsd.ko
LD [M] net/sunrpc/auth_gss/auth_rpcgss.ko
LD [M] net/sunrpc/auth_gss/rpcsec_gss_krb5.ko
LD [M] net/sunrpc/auth_gss/rpcsec_gss_spkm3.ko
CC net/sunrpc/sunrpc.mod.o
LD [M] net/sunrpc/sunrpc.ko


gnb@chook 1967> gcc --version
gcc (GCC) 4.1.0 (SUSE Linux)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

> And
> I'd be surprised if it worked very well with CONFIG_NUMA=n.

Why? The only new code which is sensitive to CONFIG_NUMA is
for_each_node(node) and that has a definition for !NUMA that
appears (by inspection) to do the right thing.

> And it's
> naughty to be sneaking general library functions into the sunrpc code
> anyway.

Fair enough, my bad.

>
> Please,
>
> - Write a standalone patch which adds highest_possible_node_id() to
> lib/cpumask.c(?)
>
> Make sure it's inside #ifdef CONFIG_NUMA
>
> Remember to export it to modules.
>
> Provide a !CONFIG_NUMA version in include/linux/nodemask.h which just
> returns constant zero.

Will do.
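
Something along these lines, perhaps (a sketch only, untested; the
placement in lib/cpumask.c, the export, and the !CONFIG_NUMA stub just
follow your suggestion, and the loop is the one already in the patch):

/* lib/cpumask.c (sketch) */
#include <linux/module.h>
#include <linux/nodemask.h>

#ifdef CONFIG_NUMA
int highest_possible_node_id(void)
{
	unsigned int node;
	unsigned int highest = 0;

	for_each_node(node)		/* walks node_possible_map */
		highest = node;
	return highest;
}
EXPORT_SYMBOL(highest_possible_node_id);
#endif

/* include/linux/nodemask.h (sketch) */
#ifdef CONFIG_NUMA
int highest_possible_node_id(void);
#else
#define highest_possible_node_id()	0
#endif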

> Consider doing something more efficient than the for_each_node() loop.
> Although I'm not sure what that would be, given that we don't have
> find_last_bit().

That code is a copy-n-paste of highest_possible_processor_id().

> - Provide an incremental patch against
> knfsd-make-rpc-threads-pools-numa-aware.patch which utilises
> highest_possible_node_id().
>
> A replacement patch will be grudgingly accepted, but I'll only go and
> turn it into an incremental one, so you can't hide ;)

Ok, an incremental patch it is then. Let it not be said I
don't take a hint the fourth time.

> - Test it real good. Modular, non-modular, NUMA, non-NUMA, !SMP.
>

Right.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.


