2006-07-25 05:28:01

by Greg Banks

Subject: [PATCH 000 of 11] knfsd: NUMAisation

G'day,

These 11 patches make the Linux NFS server scale gracefully
on large NUMA and SMP machines. The two basic approaches
are to eliminate code (such as tempsock aging) from the main
svc loop, and to move the hot locks and data structures from
svc_serv into a new svc_pool which can be allocated per-node
or per-cpu.
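
Roughly speaking, the split looks like this (a sketch only; the field
names below are illustrative rather than the exact ones in the patches):

struct svc_pool {
	unsigned int		sp_id;		/* pool index */
	spinlock_t		sp_lock;	/* protects the lists below */
	struct list_head	sp_threads;	/* idle nfsd threads in this pool */
	struct list_head	sp_sockets;	/* sockets with pending work */
};

struct svc_serv {
	/* cold, rarely-written fields stay here, under sv_lock */
	unsigned int		sv_nrpools;
	struct svc_pool *	sv_pools;	/* array, one per node or cpu */
	/* ... */
};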

These applied cleanly to Linus' GIT yesterday, but clash with
today's set of 9 patches. I'd appreciate advice in picking up
those pieces, especially how your patch 007/9 interacts with
my 008/11.

001 knfsd: move tempsock aging to timer
002 knfsd: convert sk_inuse to atomic_t
003 knfsd: use new lock for svc_sock deferred list
004 knfsd: convert sk_reserved to atomic_t
These move various data structures out from the
protection of the svc_serv.sv_lock
005 knfsd: test and set SK_BUSY atomically
This provides correct update for the SK_BUSY bit
006 knfsd: split svc_serv into pools
The first major step: split off hot fields from svc_serv
into new svc_pool structure
007 knfsd: add svc_get
008 knfsd: add svc_set_num_threads
009 knfsd: use svc_set_num_threads
Move management of nfsd threads from fs/nfsd to net/sunrpc
010 knfsd: make pools numa aware 2
The second major step: a pool per node or cpu
011 knfsd: allow admin to set nthreads per node
Admin interface to control #threads per pool

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.





2006-07-25 07:14:10

by NeilBrown

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation

On Tuesday July 25, [email protected] wrote:
> G'day,
>
> These 11 patches make the Linux NFS server scale gracefully
> on large NUMA and SMP machines. The two basic approaches
> are to eliminate code (such as tempsock aging) from the main
> svc loop, and to move the hot locks and data structures from
> svc_serv into a new svc_pool which can be allocated per-node
> or per-cpu.
>

Sounds incredibly sensible.

> These applied cleanly to Linus' GIT yesterday, but clash with
> today's set of 9 patches. I'd appreciate advice in picking up
> those pieces, especially how your patch 007/9 interacts with
> my 008/11.

I cannot see a serious conflict...

>
> 001 knfsd: move tempsock aging to timer

Looks good.
Has the subtle side effect that when we reach our limit on the number
of permitted tcp connections, the one that is discarded is the one
that has been open the longest, rather than the one that has been unused
for the longest. The comment there says
/*
 * Always select the oldest socket. It's not fair,
 * but so is life
 */
so I guess it is still correct ('oldest' is ambiguous). The new
situation is probably equally (un)fair - would you agree?

> 002 knfsd: convert sk_inuse to atomic_t

Have you measured the performance impact of this?

You gain by taking the spinlock less often, but lose by having to use
atomic ops even when under the spinlock.
This could well be a net benefit, and could vary on different archs.
It would be nice to have measurements - or at least a believable
argument in favour of a net gain.
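
The pattern in question, in sketch form (not the patch verbatim):

/* before: the count is protected by the serv-wide spinlock */
spin_lock_bh(&serv->sv_lock);
svsk->sk_inuse++;
spin_unlock_bh(&serv->sv_lock);

/* after: an atomic_t -- callers no longer need sv_lock just to bump
 * the count, but it is an atomic read-modify-write even for callers
 * that already hold sv_lock for other reasons */
atomic_inc(&svsk->sk_inuse);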

> 003 knfsd: use new lock for svc_sock deferred list

Yep, good.

> 004 knfsd: convert sk_reserved to atomic_t
> These move various data structures out from the
> protection of the svc_serv.sv_lock

Cut/paste error with the comment for this one :-) - fixed.
Same comment as 002. I'm guessing that you measured sv_lock as being
heavily contended. Could you quote some measurements?

> 005 knfsd: test and set SK_BUSY atomically
> This provides correct update for the SK_BUSY bit

Whitespace error line 50 :-) - fixed

Feeling a little uneasy that the rule
must call svc_sock_enqueue after clearing SK_BUSY
is not being obeyed....
You have more pending patches that touch this code?

> 006 knfsd: split svc_serv into pools
> The first major step: split off hot fields from svc_serv
> into new svc_pool structure

kmalloc + memset == kzalloc. Fixed.
I also took the assignment out of the if and removed
the unnecessary cast.

> 007 knfsd: add svc_get

Yes... using nr_threads as a pseudo ref count really stinks.
I wonder if we can make that go away. Maybe have a
refcount separate from nr_threads, where nr_threads != 0
holds one reference...
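
Something vaguely like this, perhaps (completely untested; sv_refcnt
and the two helpers are names made up on the spot, svc_get is the one
from your 007):

static inline void svc_get(struct svc_serv *serv)
{
	atomic_inc(&serv->sv_refcnt);
}

/* both called under the BKL, which serialises thread start/exit */
static void svc_thread_started(struct svc_serv *serv)
{
	if (serv->sv_nrthreads++ == 0)
		svc_get(serv);		/* first thread pins the serv */
}

static void svc_thread_exited(struct svc_serv *serv)
{
	if (--serv->sv_nrthreads == 0)
		svc_put(serv);		/* last thread drops that pin */
}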

You remove a comment that you had previously added. Why?
-- not fixed pending explanation...

> 008 knfsd: add svc_set_num_threads

I like this a lot. I don't see the serious conflict yet.... maybe
when I try applying the patches.

> 009 knfsd: use svc_set_num_threads
> Move management of nfsd threads from fs/nfsd to net/sunrpc
> 010 knfsd: make pools numa aware 2
> The second major step: a pool per node or cpu

- printk("nfsd: initialising 1 global pool\n");
+ printk("nfsd: initialising 1 global thread pool\n");
Makes it a bit clearer for Jo Sysadmin. - fixed

All the #if's are a bit of a shame. I wonder what can be removed.
e.g. if we put
#ifndef CONFIG_NUMA
# define node_online_map undefined
# define any_online_node(x)
# define nr_cpus_node(node) num_online_cpus()
#endif
then I think svc_pool_map_init won't need any #if ...

I'll get you to review that fix...

Also, I don't see that you need both cpu_to_pool and node_to_pool.
Could you just have a node_to_pool and on SMP systems treat the
cpu number as a node number?

> 011 knfsd: allow admin to set nthreads per node
> Admin interface to control #threads per pool

Looks simple enough.

Thanks. I now have all those patches applied.

I'll post the interesting revisions in a little while.

Thanks a lot - this is much-needed work.

NeilBrown


2006-07-25 08:03:14

by NeilBrown

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation

On Tuesday July 25, [email protected] wrote:
>
> Thanks. I now have all those patches applied.
>
> I'll post the interesting revisions in a little while.

You can see them all at
http://cgi.cse.unsw.edu.au/~neilb/patches/linux-devel/2.6/2006-07-25-07/

The following patch removes some of the #if's.
Do you agree with it?

NeilBrown
---------------------------------------
Remove some #if tests.

The protected code should just compile away without needing
the #if (I hope).

Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./net/sunrpc/svc.c | 39 ++++++++++++++-------------------------
1 file changed, 14 insertions(+), 25 deletions(-)

diff .prev/net/sunrpc/svc.c ./net/sunrpc/svc.c
--- .prev/net/sunrpc/svc.c 2006-07-25 17:32:10.000000000 +1000
+++ ./net/sunrpc/svc.c 2006-07-25 17:31:14.000000000 +1000
@@ -28,7 +28,6 @@
#define RPCDBG_FACILITY RPCDBG_SVCDSP
#define RPC_PARANOIA 1

-
#if SVC_HAVE_MULTIPLE_POOLS

struct svc_pool_map svc_pool_map = { .mode = -1, .init = 0 };
@@ -54,7 +53,6 @@ svc_pool_map_init(void)
* Detect best pool mapping mode heuristically.
*/
m->mode = 0; /* default: one global pool */
-#ifdef CONFIG_NUMA
if (num_online_nodes() > 1) {
/*
* Actually have multiple NUMA nodes,
@@ -73,14 +71,6 @@ svc_pool_map_init(void)
m->mode = 1;
}
}
-#else
- if (num_online_cpus() > 1) {
- /*
- * Plain SMP with multiple CPUs online.
- */
- m->mode = 1;
- }
-#endif
}

switch (m->mode) {
@@ -156,8 +146,7 @@ static int
svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
{
struct svc_pool_map *m = &svc_pool_map;
- unsigned int node;
- unsigned int cpu;
+ unsigned int node; /* or cpu */

BUG_ON(!m->init);

@@ -167,9 +156,9 @@ svc_pool_map_set_cpumask(unsigned int pi
case 0:
return 0;
case 1:
- cpu = m->pool_to_cpu[pidx];
+ node = m->pool_to_cpu[pidx];
*oldmask = current->cpus_allowed;
- set_cpus_allowed(current, cpumask_of_cpu(cpu));
+ set_cpus_allowed(current, cpumask_of_cpu(node));
return 1;
#ifdef CONFIG_NUMA
case 2:
@@ -181,6 +170,16 @@ svc_pool_map_set_cpumask(unsigned int pi
}
}

+#else
+static inline int svc_pool_map_init(void)
+{
+ return 1;
+}
+static inline int
+svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
+{
+ return 0;
+}
#endif /* SVC_HAVE_MULTIPLE_POOLS */

/*
@@ -265,11 +264,7 @@ svc_create_pooled(struct svc_program *pr
svc_thread_fn func, int sig, struct module *mod)
{
struct svc_serv *serv;
- unsigned int npools = 1;
-
-#if SVC_HAVE_MULTIPLE_POOLS
- npools = svc_pool_map_init();
-#endif
+ unsigned int npools = svc_pool_map_init();

serv = __svc_create(prog, bufsize, npools, shutdown);

@@ -382,10 +377,8 @@ __svc_create_thread(svc_thread_fn func,
{
struct svc_rqst *rqstp;
int error = -ENOMEM;
-#if SVC_HAVE_MULTIPLE_POOLS
int have_oldmask = 0;
cpumask_t oldmask;
-#endif

rqstp = kmalloc(sizeof(*rqstp), GFP_KERNEL);
if (!rqstp)
@@ -407,17 +400,13 @@ __svc_create_thread(svc_thread_fn func,
rqstp->rq_server = serv;
rqstp->rq_pool = pool;

-#if SVC_HAVE_MULTIPLE_POOLS
if (serv->sv_nrpools > 1)
have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
-#endif

error = kernel_thread((int (*)(void *)) func, rqstp, 0);

-#if SVC_HAVE_MULTIPLE_POOLS
if (have_oldmask)
set_cpus_allowed(current, oldmask);
-#endif

if (error < 0)
goto out_thread;


2006-07-25 08:33:28

by Greg Banks

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation

On Tue, 2006-07-25 at 17:13, Neil Brown wrote:
> On Tuesday July 25, [email protected] wrote:
> > These applied cleanly to Linus' GIT yesterday, but clash with
> > today's set of 9 patches. I'd appreciate advice in picking up
> > those pieces, especially how your patch 007/9 interacts with
> > my 008/11.
>
> I cannot see a serious conflict...

So far it doesn't seem to be as bad as I'd feared ;-)

> > 001 knfsd: move tempsock aging to timer
>
> Looks good.
> Has the subtle side effect that when we reach our limit on the number
> of permitted tcp connections, the one that is discarded is the one
> that has been open the longest, rather than the one that has been unused
> for the longest. The comment there says
> /*
>  * Always select the oldest socket. It's not fair,
>  * but so is life
>  */
> so I guess it is still correct ('oldest' is ambiguous). The new
> situation is probably equally (un)fair - would you agree?

Agreed. The algorithm makes no pretence at fairness, but at least
we're no longer bouncing cachelines in global data structures to
check for old sockets (an infrequent corner case) on every incoming
packet (the main performance path).
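
For reference, the aging now happens entirely off the fast path, in a
periodic timer along these lines (a simplified sketch, not the patch
verbatim; sv_temptimer and the intervals are illustrative):

static void svc_age_temp_sockets(unsigned long closure)
{
	struct svc_serv *serv = (struct svc_serv *)closure;
	struct svc_sock *svsk;

	spin_lock_bh(&serv->sv_lock);
	list_for_each_entry(svsk, &serv->sv_tempsocks, sk_list) {
		/* mark long-idle connections; an nfsd will close them */
		if (time_before(svsk->sk_lastrecv + 6*60*HZ, jiffies))
			set_bit(SK_CLOSE, &svsk->sk_flags);
	}
	spin_unlock_bh(&serv->sv_lock);

	mod_timer(&serv->sv_temptimer, jiffies + 2*60*HZ);
}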

> > 002 knfsd: convert sk_inuse to atomic_t
>
> Have you measured the performance impact of this?

Not in isolation, just as part of this total patchset.

> You gain by taking the spinlock less often, but lose by having to use
> atomic ops even when under the spinlock.

Agreed.

> This could well be a net benefit, and could vary on different archs.
> It would be nice to have measurements - or at least a believable
> argument in favour of a net gain.

Ok, here are some (12-month-old) numbers. Test is an Altix A350
server, IRIX NFS clients mounting over TCP, 32K rsize, streaming
reads from separate files. N is the number of NICs in the server,
the number of CPUs in the server, and the number of clients
(all 3 increased together). The number of NUMA nodes is N/2.

  N    Throughput, MiB/s    CPU usage, % (max=N*100)
       Before    After      Before    After
---    ------    -----      ------    -----
  4       312      435         350      228
  6       500      656         501      418
  8       562      804         690      589


> > 004 knfsd: convert sk_reserved to atomic_t
> > These move various data structures out from the
> > protection of the svc_serv.sv_lock
>
> Cut/paste error with the comment for this one :-) - fixed.
> Same comment as 002. I'm guessing that you measured sv_lock as being
> heavily contended.

Actually, it wasn't contended all that much, although I'm sure if I'd
tried a 16- or 32-cpu test it would have been. However, even without
contention, on NUMA we get lots of CPU wastage from having to bounce
the cachelines of the (effectively) global svc_serv between nodes every
time an nfsd writes to it. It doesn't help that the CPU scheduler
moves nfsds around to other nodes "to balance the load".

A simple experiment showed that running the NFS server under any
significant load took far too much CPU, nonlinearly with load, and
that CPU usage could be reduced by limiting the NFS service to a
subset of CPUs, most effectively by limiting it to a single node.
CPU profiling showed no outstanding contributions from
ia64_spin_lock_contention() but plenty of time spent in the normal
svc loop. This, combined with previous experience on NUMA machines,
was enough to point the finger.

Basically, the goal of these patches is to reduce writes to the global
svc_serv, and it just happened that taking/dropping sv_lock was the
majority (by code lines) of those writes. BTW, the global `nfsdstat'
is also a significant contributor; I have a patch to help alleviate
that too.


> Could you quote some measurements?

The difference in CPU usage numbers in the above table should
give you an indication.

> > 005 knfsd: test and set SK_BUSY atomically
> > This provides correct update for the SK_BUSY bit
>
> Whitespace error line 50 :-) - fixed
>
> Feeling a little uneasy that the rule
> must call svc_sock_enqueue after clearing SK_BUSY
> is not being obeyed....

It's a bogus rule. That comment doesn't really describe
the way SK_BUSY works at all.

> You have more pending patches that touch this code?

These particular lines, no.

Perhaps a little background will help explain why this patch
is necessary (it's pretty unobvious). Currently, locking works
like this:

a) fields in svc_serv are protected by svc_serv->sv_lock

b) svc_socket is attached to exactly one svc_serv forever
and is protected by svc_serv->sv_lock too

After patch 011, we have a situation where

a) fields in svc_pool are protected by svc_pool->sp_lock

b) (remaining) fields in svc_serv are protected by svc_serv->sv_lock

c) svc_socket is attached to exactly one svc_serv forever but
we don't want to use svc_serv->sv_lock

d) svc_socket can float between svc_pool's (although this hardly
ever happens, unless you have irq spraying, or ethernet bonding
in round-robin mode, or UDP).

So we need some extra synchronisation to ensure that two cpus in
different svc_pool's don't both try to attach the svc_socket to
their svc_pool. When that happens, the lists get confused, with
all kinds of evil consequences (this happened during a customer
trial, most embarrassing).

The SK_BUSY bit's existing semantics are almost exactly "I'm
attached to a svc_pool", except for atomicity, so I just made it
atomic. Effectively, it's doing part of the job that sv_lock
used to.
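
In sketch form, the enqueue path ends up doing something like this
(the list field names here are illustrative):

static void svc_sock_enqueue_sketch(struct svc_pool *pool,
				    struct svc_sock *svsk)
{
	/* exactly one cpu wins the right to attach the socket */
	if (test_and_set_bit(SK_BUSY, &svsk->sk_flags))
		return;		/* already attached/queued on some pool */

	spin_lock_bh(&pool->sp_lock);
	list_add_tail(&svsk->sk_ready, &pool->sp_sockets);
	spin_unlock_bh(&pool->sp_lock);
	/* ... then wake an idle nfsd thread in this pool ... */
}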

> > 006 knfsd: split svc_serv into pools
> > The first major step: split off hot fields from svc_serv
> > into new svc_pool structure
>
> kmalloc + memset == kzalloc. Fixed.
> I also took the assignment out of the if and removed
> the unnecessary cast.

Sure. Wouldn't kcalloc() be a better fit?
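
In sketch form, the three allocation styles being compared (not the
patch verbatim; field and variable names are illustrative):

/* kmalloc followed by memset */
serv->sv_pools = kmalloc(npools * sizeof(struct svc_pool), GFP_KERNEL);
memset(serv->sv_pools, 0, npools * sizeof(struct svc_pool));

/* kzalloc: the zeroing is implicit */
serv->sv_pools = kzalloc(npools * sizeof(struct svc_pool), GFP_KERNEL);

/* kcalloc: same, but says "array of npools elements" explicitly */
serv->sv_pools = kcalloc(npools, sizeof(struct svc_pool), GFP_KERNEL);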

I just noticed there's also a bogus comment added; can you
please apply this incremental patch?

Index: linus-git/net/sunrpc/svcsock.c
===================================================================
--- linus-git.orig/net/sunrpc/svcsock.c 2006-07-25 18:28:48.000000000 +1000
+++ linus-git/net/sunrpc/svcsock.c 2006-07-25 18:29:21.488534818 +1000
@@ -49,7 +49,7 @@
* svc_pool->sp_lock protects most of the fields of that pool.
* svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
* when both need to be taken (rare), svc_serv->sv_lock is first.
- * BKL protects svc_serv->sv_nrthread, svc_pool->sp_nrthread
+ * BKL protects svc_serv->sv_nrthread.
* svc_sock->sk_defer_lock protects the svc_sock->sk_deferred list
* svc_sock->sk_flags.SK_BUSY prevents a svc_sock being enqueued multiply.
*


> > 007 knfsd: add svc_get
>
> Yes... using nr_threads as a pseudo ref count really stinks.
> I wonder if we can make that go away. Maybe have a
> refcount separate from nr_threads, where nr_threads != 0
> holds one reference...

Or just rename it sv_refcount and stop pretending it means
something else? It seems silly to have two variables which
are almost exactly the same number, and where the difference
doesn't matter (IIRC the only place it's actually used as
"number of threads" is for estimating upper bounds on socket
buffer space, which is already pretty approximate).

> You remove a comment that you had previously added. Why?
> -- not fixed pending explanation...

Umm...where?

> > 008 knfsd: add svc_set_num_threads
>
> I like this a lot. I don't see the serious conflict yet.... maybe
> when I try applying the patches.
>
> > 009 knfsd: use svc_set_num_threads
> > Move management of nfsd threads from fs/nfsd to net/sunrpc
> > 010 knfsd: make pools numa aware 2
> > The second major step: a pool per node or cpu
>
> - printk("nfsd: initialising 1 global pool\n");
> + printk("nfsd: initialising 1 global thread pool\n");
> Makes it a bit clearer for Jo Sysadmin. - fixed

Beaut.

> All the #if's are a bit of a shame. I wonder what can be removed.

I did try to isolate them in one patch and one small set
of functions. But if we can remove them, that'd be great.

> e.g. if we put
> #ifndef CONFIG_NUMA
> # define node_online_map undefined
> # define any_online_node(x)
> # define nr_cpus_node(node) num_online_cpus()
> #endif
> then I think svc_pool_map_init won't need any #if ...

Those look almost like good default definitions for <linux/topology.h>,
except it might be convenient to have any_online_node(x) non-empty.
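
Something like this, maybe (an untested sketch, assuming the !NUMA
case collapses to a single node 0 containing all online cpus):

#ifndef CONFIG_NUMA
#define any_online_node(mask)	0		/* node 0 is always it */
#define nr_cpus_node(node)	num_online_cpus()
#endif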

>
> I'll get you to review that fix...
>
> Also, I don't see that you need both cpu_to_pool and node_to_pool.
> Could you just have a node_to_pool and on SMP systems treat the
> cpu number as a node number?

Sure, we only need a single array for each direction; I could
have called them "map" and "reversemap" ;-) I tried to make
their purpose slightly clearer.
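
i.e. conceptually it's just this (illustrative names):

struct svc_pool_map {
	int		mode;		/* 0: global, 1: per-cpu, 2: per-node */
	unsigned int	npools;
	unsigned int	*pool_to;	/* pool index -> cpu or node id */
	unsigned int	*to_pool;	/* cpu or node id -> pool index */
};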

>
> I'll post the interesting revisions in a little while.

Cool.

>
> Thanks a lot - this is much-needed work.
>

No worries ;-)

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.




2006-07-25 09:04:06

by Greg Banks

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation

On Tue, 2006-07-25 at 18:02, Neil Brown wrote:
> On Tuesday July 25, [email protected] wrote:
> >
> > Thanks. I now have all those patches applied.
> >
> > I'll post the interesting revisions in a little while.
>
> You can see them all at
> http://cgi.cse.unsw.edu.au/~neilb/patches/linux-devel/2.6/2006-07-25-07/
>
> The following patch removes some of the #if's.
> Do you agree with it?

Yep.

>
> struct svc_pool_map svc_pool_map = { .mode = -1, .init = 0 };
> @@ -54,7 +53,6 @@ svc_pool_map_init(void)
> * Detect best pool mapping mode heuristically.
> */
> m->mode = 0; /* default: one global pool */
> -#ifdef CONFIG_NUMA
> if (num_online_nodes() > 1) {
> /*
> * Actually have multiple NUMA nodes,
> @@ -73,14 +71,6 @@ svc_pool_map_init(void)
> m->mode = 1;
> }
> }
> -#else
> - if (num_online_cpus() > 1) {
> - /*
> - * Plain SMP with multiple CPUs online.
> - */
> - m->mode = 1;
> - }
> -#endif
> }

This hunk actually changes the logic slightly, but not
detrimentally (the threshold for using mode `1' is
changed from 1 to 2 cpus in the !NUMA case). It shouldn't
matter.

The rest looks fine.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.




2006-07-25 14:05:55

by Chuck Lever

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation

On 7/25/06, Neil Brown <[email protected]> wrote:
> On Tuesday July 25, [email protected] wrote:
> > G'day,
> >
> >
> > 001 knfsd: move tempsock aging to timer
>
> Looks good.
> Has the subtle side effect that when we reach our limit on the number
> of permitted tcp connections, the one that is discarded is the one
> that has been open the longest, rather than the one that has been unused
> for the longest. The comment there says
> /*
>  * Always select the oldest socket. It's not fair,
>  * but so is life
>  */
> so I guess it is still correct ('oldest' is ambiguous). The new
> situation is probably equally (un)fair - would you agree?

Ouch. IMO this will penalize long-running workloads unnecessarily,
and will likely be much more of a performance hit for such workloads
than would be gained by improving the CPU cache behavior on the
server.

The previous algorithm (close the least recently used socket) is more fair.

--
"We who cut mere stones must always be envisioning cathedrals"
-- Quarry worker's creed


2006-07-25 14:50:08

by Paul Jimenez

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation


On Jul 25, 2006, at 9:05 AM, Chuck Lever wrote:

> On 7/25/06, Neil Brown <[email protected]> wrote:
>> On Tuesday July 25, [email protected] wrote:
>>> G'day,
>>>
>>>
>>> 001 knfsd: move tempsock aging to timer
>>
>> Looks good.
>> Has the subtle side effect that when we reach our limit on the number
>> of permitted tcp connections, the one that is discarded is the one
>> that has been open the longest, rather than the one that has been
>> unused for the longest. The comment there says
>> /*
>>  * Always select the oldest socket. It's not fair,
>>  * but so is life
>>  */
>> so I guess it is still correct ('oldest' is ambiguous). The new
>> situation is probably equally (un)fair - would you agree?
>
> Ouch. IMO this will penalize long-running workloads unnecessarily,
> and will likely be much more of a performance hit for such workloads
> than would be gained by improving the CPU cache behavior on the
> server.
>
> The previous algorithm (close the least recently used socket) is
> more fair.

OTOH, getting rid of the longest-open one may have beneficial side
effects by restarting state associated with that long-running task
from scratch, no? Restarting long-running workloads with a clean
slate every so often sounds like a good idea to me - I'm confident
that the performance and impact of shorter lived connections is
better understood than that of very long lived connections, so given
the choice I'd say go with restarting them every so often.






2006-07-25 15:36:39

by Chuck Lever

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation

On 7/25/06, Paul Jimenez <[email protected]> wrote:
>
> On Jul 25, 2006, at 9:05 AM, Chuck Lever wrote:
>
> > On 7/25/06, Neil Brown <[email protected]> wrote:
> >> On Tuesday July 25, [email protected] wrote:
> >>> G'day,
> >>>
> >>>
> >>> 001 knfsd: move tempsock aging to timer
> >>
> >> Looks good.
> >> Has the subtle side effect that when we reach our limit on the number
> >> of permitted tcp connections, the one that is discarded is the one
> >> that has been open the longest, rather than the one that has been
> >> unused for the longest. The comment there says
> >> /*
> >>  * Always select the oldest socket. It's not fair,
> >>  * but so is life
> >>  */
> >> so I guess it is still correct ('oldest' is ambiguous). The new
> >> situation is probably equally (un)fair - would you agree?
> >
> > Ouch. IMO this will penalize long-running workloads unnecessarily,
> > and will likely be much more of a performance hit for such workloads
> > than would be gained by improving the CPU cache behavior on the
> > server.
> >
> > The previous algorithm (close the least recently used socket) is
> > more fair.
>
> OTOH, getting rid of the longest-open one may have beneficial side
> effects by restarting state associated with that long-running task from
> scratch, no?

What is being restarted is only the TCP connection. Thus only network
layer state is being reconstructed here.

In addition, we have field evidence that breaking an active NFS/TCP
connection has lots of bad consequences. The client is forced to
retransmit any RPC requests that were active when the connection was
dropped. The client may reconnect on a different port number, which
makes the duplicate reply cache unable to prevent reapplying
non-idempotent RPC requests. The combination invites data corruption
or application crashes due to unexpected return codes (but I thought I
just created that file !?!).

My vote is to zap the least recently used connection. That is overall
the safest compromise.

> Restarting long-running workloads with a clean slate every so often
> sounds like a good idea to me - I'm confident that the performance
> and impact of shorter lived connections is better understood than
> that of very long lived connections

By your logic, we should kill anyone who reaches the age of 30 so they
won't die of old age.

I'd rather find those lingering bugs and fix them, instead of cover
them up with a risky workaround.

--
"We who cut mere stones must always be envisioning cathedrals"
-- Quarry worker's creed


2006-07-26 06:25:37

by Greg Banks

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation

On Wed, 2006-07-26 at 01:36, Chuck Lever wrote:
> On 7/25/06, Paul Jimenez <[email protected]> wrote:
> >
> My vote is to zap the least recently used connection. That is overall
> the safest compromise.

Agreed. The following patch (on top of my earlier patch 001) does
that. I've verified that it enforces the connection limit and
doesn't kill the oldest socket.
--

sunrpc: when enforcing the TCP connection limit, choose exactly the
least recently used TCP connection. This is fairer than the new
algorithm which was a side effect of the recent connection aging
patch, and minimises the chances of closing an active connection
(which is known to have adverse consequences).


Signed-off-by: Greg Banks <[email protected]>
---

net/sunrpc/svcsock.c | 34 +++++++++++++++++++++++++++-------
1 files changed, 27 insertions(+), 7 deletions(-)

Index: linus-git/net/sunrpc/svcsock.c
===================================================================
--- linus-git.orig/net/sunrpc/svcsock.c 2006-07-26 15:33:32.989291299 +1000
+++ linus-git/net/sunrpc/svcsock.c 2006-07-26 16:18:10.988508212 +1000
@@ -798,6 +798,32 @@ svc_tcp_data_ready(struct sock *sk, int
wake_up_interruptible(sk->sk_sleep);
}

+
+/*
+ * Find and return the Least Recently Used temp socket
+ * on the serv. Used to choose a victim when enforcing
+ * connection limits. The idea here is to avoid dropping
+ * active connections, which is known to have adverse
+ * effects. Caller must own serv->sv_lock.
+ */
+static struct svc_sock *
+svc_lru_temp_socket(struct svc_serv *serv)
+{
+ struct svc_sock *svsk;
+ struct svc_sock *lru_svsk = NULL;
+ unsigned long lru_when = 0;
+
+ list_for_each_entry(svsk, &serv->sv_tempsocks, sk_list) {
+ unsigned long when = svsk->sk_lastrecv;
+ if (lru_svsk == NULL || when < lru_when) {
+ lru_svsk = svsk;
+ lru_when = when;
+ }
+ }
+
+ return lru_svsk;
+}
+
/*
* Accept a TCP connection
*/
@@ -896,13 +922,7 @@ svc_tcp_accept(struct svc_sock *svsk)
NIPQUAD(sin.sin_addr.s_addr),
ntohs(sin.sin_port));
}
- /*
- * Always select the oldest socket. It's not fair,
- * but so is life
- */
- svsk = list_entry(serv->sv_tempsocks.prev,
- struct svc_sock,
- sk_list);
+ svsk = svc_lru_temp_socket(serv);
set_bit(SK_CLOSE, &svsk->sk_flags);
svsk->sk_inuse ++;
}



Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.




2006-07-26 09:57:58

by Olaf Kirch

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation

On Tue, Jul 25, 2006 at 06:33:20PM +1000, Greg Banks wrote:
> > /*
> >  * Always select the oldest socket. It's not fair,
> >  * but so is life
> >  */
> > so I guess it is still correct ('oldest' is ambiguous). The new
> > situation is probably equally (un)fair - would you agree?
>
> Agreed. The algorithm makes no pretence at fairness, but at least
> we're no longer bouncing cachelines in global data structures to
> check for old sockets (an infrequent corner case) on every incoming
> packet (the main performance path).

I think picking the oldest is just as good as picking a random one. The
only bad choice is picking the youngest socket, as that will make active
connections thrash while leaving potentially idle ones untouched.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax


2006-07-26 18:52:45

by Chuck Lever

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation

On 7/26/06, Greg Banks <[email protected]> wrote:
> On Wed, 2006-07-26 at 01:36, Chuck Lever wrote:
> > On 7/25/06, Paul Jimenez <[email protected]> wrote:
> > >
> > My vote is to zap the least recently used connection. That is overall
> > the safest compromise.
>
> Agreed. The following patch (on top of my earlier patch 001) does
> that. I've verified that it enforces the connection limit and
> doesn't kill the oldest socket.

Looks like you're only dropping sockets that aren't registered with
the local portmapper. Should you do something similar for the
permanent sockets, or am I missing something?


> --
>
> sunrpc: when enforcing the TCP connection limit, choose exactly the
> least recently used TCP connection. This is fairer than the new
> algorithm which was a side effect of the recent connection aging
> patch, and minimises the chances of closing an active connection
> (which is known to have adverse consequences).
>
>
> Signed-off-by: Greg Banks <[email protected]>
> ---
>
> net/sunrpc/svcsock.c | 34 +++++++++++++++++++++++++++-------
> 1 files changed, 27 insertions(+), 7 deletions(-)
>
> Index: linus-git/net/sunrpc/svcsock.c
> ===================================================================
> --- linus-git.orig/net/sunrpc/svcsock.c 2006-07-26 15:33:32.989291299 +1000
> +++ linus-git/net/sunrpc/svcsock.c 2006-07-26 16:18:10.988508212 +1000
> @@ -798,6 +798,32 @@ svc_tcp_data_ready(struct sock *sk, int
> wake_up_interruptible(sk->sk_sleep);
> }
>
> +
> +/*
> + * Find and return the Least Recently Used temp socket
> + * on the serv. Used to choose a victim when enforcing
> + * connection limits. The idea here is to avoid dropping
> + * active connections, which is known to have adverse
> + * effects. Caller must own serv->sv_lock.
> + */
> +static struct svc_sock *
> +svc_lru_temp_socket(struct svc_serv *serv)
> +{
> + struct svc_sock *svsk;
> + struct svc_sock *lru_svsk = NULL;
> + unsigned long lru_when = 0;
> +
> + list_for_each_entry(svsk, &serv->sv_tempsocks, sk_list) {
> + unsigned long when = svsk->sk_lastrecv;
> + if (lru_svsk == NULL || when < lru_when) {
> + lru_svsk = svsk;
> + lru_when = when;
> + }
> + }
> +
> + return lru_svsk;
> +}
> +
> /*
> * Accept a TCP connection
> */
> @@ -896,13 +922,7 @@ svc_tcp_accept(struct svc_sock *svsk)
> NIPQUAD(sin.sin_addr.s_addr),
> ntohs(sin.sin_port));
> }
> - /*
> - * Always select the oldest socket. It's not fair,
> - * but so is life
> - */
> - svsk = list_entry(serv->sv_tempsocks.prev,
> - struct svc_sock,
> - sk_list);
> + svsk = svc_lru_temp_socket(serv);
> set_bit(SK_CLOSE, &svsk->sk_flags);
> svsk->sk_inuse ++;
> }
>
>
>
> Greg.
> --
> Greg Banks, R&D Software Engineer, SGI Australian Software Group.
> I don't speak for SGI.
>
>
>


--
"We who cut mere stones must always be envisioning cathedrals"
-- Quarry worker's creed


2006-07-27 12:19:44

by Greg Banks

Subject: Re: [PATCH 000 of 11] knfsd: NUMAisation

On Thu, 2006-07-27 at 04:26, Chuck Lever wrote:
> On 7/26/06, Greg Banks <[email protected]> wrote:
> > On Wed, 2006-07-26 at 01:36, Chuck Lever wrote:
> > > On 7/25/06, Paul Jimenez <[email protected]> wrote:
> > > >
> > > My vote is to zap the least recently used connection. That is overall
> > > the safest compromise.
> >
> > Agreed. The following patch (on top of my earlier patch 001) does
> > that. I've verified that it enforces the connection limit and
> > doesn't kill the oldest socket.
>
> Looks like you're only dropping sockets that aren't registered with
> the local portmapper.

Yes, I've preserved that logic.

> Should you do something similar for the
> permanent sockets,

No.

> or am I missing something?

At this time there are two permanent sockets: the UDP socket and
the TCP rendezvous socket (i.e. the one in LISTEN state). Neither
of these is ever connected, nor should either ever be dropped
(because they're holding their respective ports, and dropping
them might allow a userspace program to bind to those ports,
which would be a security issue).

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.


