2009-01-13 10:26:58

by Greg Banks

Subject: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages

Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads. When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up. As long as there are idle
threads, one will be woken up.

If there are a lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them. Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run. This situation
causes two significant problems:

1. The CPU scheduler takes over 10% of each CPU, which is robbing
the nfsd threads of valuable CPU time.

2. At a high enough load, the nfsd threads starve userspace threads
of CPU time, to the point where daemons like portmap and rpc.mountd
do not schedule for tens of seconds at a time. Clients attempting
to mount an NFS filesystem time out at the very first step (opening
a TCP connection to portmap) because portmap cannot wake up from
select() and call accept() in time.

Disclaimer: these effects were observed on a SLES9 kernel; modern
kernels' schedulers may behave more gracefully.

The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.

Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server. That tree is small
enough to fit in the server's RAM so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server. The server was running 128 nfsds.

Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%. This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.

This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:

http://article.gmane.org/gmane.linux.nfs/10374

Signed-off-by: Greg Banks <[email protected]>
---

include/linux/sunrpc/svc.h | 2 ++
net/sunrpc/svc_xprt.c | 25 ++++++++++++++++++-------
2 files changed, 20 insertions(+), 7 deletions(-)

Index: bfields/include/linux/sunrpc/svc.h
===================================================================
--- bfields.orig/include/linux/sunrpc/svc.h
+++ bfields/include/linux/sunrpc/svc.h
@@ -41,6 +41,7 @@ struct svc_pool {
struct list_head sp_sockets; /* pending sockets */
unsigned int sp_nrthreads; /* # of threads in pool */
struct list_head sp_all_threads; /* all server threads */
+ int sp_nwaking; /* number of threads woken but not yet active */
} ____cacheline_aligned_in_smp;

/*
@@ -264,6 +265,7 @@ struct svc_rqst {
* cache pages */
wait_queue_head_t rq_wait; /* synchronization */
struct task_struct *rq_task; /* service thread */
+ int rq_waking; /* 1 if thread is being woken */
};

/*
Index: bfields/net/sunrpc/svc_xprt.c
===================================================================
--- bfields.orig/net/sunrpc/svc_xprt.c
+++ bfields/net/sunrpc/svc_xprt.c
@@ -14,6 +14,8 @@

#define RPCDBG_FACILITY RPCDBG_SVCXPRT

+#define SVC_MAX_WAKING 5
+
static struct svc_deferred_req *svc_deferred_dequeue(struct svc_xprt *xprt);
static int svc_deferred_recv(struct svc_rqst *rqstp);
static struct cache_deferred_req *svc_defer(struct cache_req *req);
@@ -298,6 +300,7 @@ void svc_xprt_enqueue(struct svc_xprt *x
struct svc_pool *pool;
struct svc_rqst *rqstp;
int cpu;
+ int thread_avail;

if (!(xprt->xpt_flags &
((1<<XPT_CONN)|(1<<XPT_DATA)|(1<<XPT_CLOSE)|(1<<XPT_DEFERRED))))
@@ -309,12 +312,6 @@ void svc_xprt_enqueue(struct svc_xprt *x

spin_lock_bh(&pool->sp_lock);

- if (!list_empty(&pool->sp_threads) &&
- !list_empty(&pool->sp_sockets))
- printk(KERN_ERR
- "svc_xprt_enqueue: "
- "threads and transports both waiting??\n");
-
if (test_bit(XPT_DEAD, &xprt->xpt_flags)) {
/* Don't enqueue dead transports */
dprintk("svc: transport %p is dead, not enqueued\n", xprt);
@@ -353,7 +350,14 @@ void svc_xprt_enqueue(struct svc_xprt *x
}

process:
- if (!list_empty(&pool->sp_threads)) {
+ /* Work out whether threads are available */
+ thread_avail = !list_empty(&pool->sp_threads); /* threads are asleep */
+ if (pool->sp_nwaking >= SVC_MAX_WAKING) {
+ /* too many threads are runnable and trying to wake up */
+ thread_avail = 0;
+ }
+
+ if (thread_avail) {
rqstp = list_entry(pool->sp_threads.next,
struct svc_rqst,
rq_list);
@@ -368,6 +372,8 @@ void svc_xprt_enqueue(struct svc_xprt *x
svc_xprt_get(xprt);
rqstp->rq_reserved = serv->sv_max_mesg;
atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
+ rqstp->rq_waking = 1;
+ pool->sp_nwaking++;
BUG_ON(xprt->xpt_pool != pool);
wake_up(&rqstp->rq_wait);
} else {
@@ -633,6 +639,11 @@ int svc_recv(struct svc_rqst *rqstp, lon
return -EINTR;

spin_lock_bh(&pool->sp_lock);
+ if (rqstp->rq_waking) {
+ rqstp->rq_waking = 0;
+ pool->sp_nwaking--;
+ BUG_ON(pool->sp_nwaking < 0);
+ }
xprt = svc_xprt_dequeue(pool);
if (xprt) {
rqstp->rq_xprt = xprt;

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-13 22:23:58

by Greg Banks

Subject: Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages

Peter Staubach wrote:
> Greg Banks wrote:
>> [...]
>> Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
>> synthetic client threads simulating an rsync (i.e. recursive directory
>> listing) workload[...]
>>
>> Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
>> taking 5.2%. This patch drops those contributions to 3.0% and 2.2%.
>> Load average was over 120 before the patch, and 20.9 after.
>> [...]
>
> Have you measured the impact of these changes for something
> like SpecSFS?

Not individually. This patch was part of some work I did in late
2005/early 2006 which was aimed at improving NFS server performance in
general. I do know that the server's SpecSFS numbers jumped by a factor
of somewhere over 2x, from embarrassingly bad to publishable, when
SpecSFS was re-run after that work. However, at the time I did not have
the ability to run SpecSFS myself; it was run by a separate group of
people who had dedicated hardware and experience. So I can't tell what
contribution this particular patch made to the overall SpecSFS
improvements. Sorry.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-13 23:12:14

by Greg Banks

Subject: Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages

Peter Staubach wrote:
> Greg Banks wrote:
>
> It would be interesting to get some better information regarding
> some of the measurable performance ramifications such as SFS
> though.
My NFS server work got SpecSFS to the condition of being disk subsystem
bound instead of CPU bound, at which point it was the XFS folks' problem.
> The Linux NFS server has not had much attention paid to
> it
It's had attention paid to it, just not yet published :-(

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-13 14:33:06

by Peter Staubach

Subject: Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages

Greg Banks wrote:
> Avoid overloading the CPU scheduler with enormous load averages
> when handling high call-rate NFS loads. When the knfsd bottom half
> is made aware of an incoming call by the socket layer, it tries to
> choose an nfsd thread and wake it up. As long as there are idle
> threads, one will be woken up.
>
> If there are a lot of nfsd threads (a sensible configuration when
> the server is disk-bound or is running an HSM), there will be many
> more nfsd threads than CPUs to run them. Under a high call-rate
> low service-time workload, the result is that almost every nfsd is
> runnable, but only a handful are actually able to run. This situation
> causes two significant problems:
>
> 1. The CPU scheduler takes over 10% of each CPU, which is robbing
> the nfsd threads of valuable CPU time.
>
> 2. At a high enough load, the nfsd threads starve userspace threads
> of CPU time, to the point where daemons like portmap and rpc.mountd
> do not schedule for tens of seconds at a time. Clients attempting
> to mount an NFS filesystem timeout at the very first step (opening
> a TCP connection to portmap) because portmap cannot wake up from
> select() and call accept() in time.
>
> Disclaimer: these effects were observed on a SLES9 kernel, modern
> kernels' schedulers may behave more gracefully.
>
> The solution is simple: keep in each svc_pool a counter of the number
> of threads which have been woken but have not yet run, and do not wake
> any more if that count reaches an arbitrary small threshold.
>
> Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
> synthetic client threads simulating an rsync (i.e. recursive directory
> listing) workload reading from an i386 RH9 install image (161480
> regular files in 10841 directories) on the server. That tree is small
> enough to fit in the server's RAM so no disk traffic was involved.
> This setup gives a sustained call rate in excess of 60000 calls/sec
> before being CPU-bound on the server. The server was running 128 nfsds.
>
> Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
> taking 5.2%. This patch drops those contributions to 3.0% and 2.2%.
> Load average was over 120 before the patch, and 20.9 after.
>
> This patch is a forward-ported version of knfsd-avoid-nfsd-overload
> which has been shipping in the SGI "Enhanced NFS" product since 2006.
> It has been posted before:
>
> http://article.gmane.org/gmane.linux.nfs/10374
>
> Signed-off-by: Greg Banks <[email protected]>
> ---

Have you measured the impact of these changes for something
like SpecSFS?

Thanx...

ps

2009-01-13 22:35:37

by Peter Staubach

Subject: Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages

Greg Banks wrote:
> Peter Staubach wrote:
>
>> Greg Banks wrote:
>>
>>> [...]
>>> Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
>>> synthetic client threads simulating an rsync (i.e. recursive directory
>>> listing) workload[...]
>>>
>>> Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
>>> taking 5.2%. This patch drops those contributions to 3.0% and 2.2%.
>>> Load average was over 120 before the patch, and 20.9 after.
>>> [...]
>>>
>> Have you measured the impact of these changes for something
>> like SpecSFS?
>>
>
> Not individually. This patch was part of some work I did in late
> 2005/early 2006 which was aimed at improving NFS server performance in
> general. I do know that the server's SpecSFS numbers jumped by a factor
> of somewhere over 2x, from embarrassingly bad to publishable, when
> SpecSFS was re-run after that work. However at the time I did not have
> the ability to run SpecSFS myself, it was run by a separate group of
> people who had dedicated hardware and experience. So I can't tell what
> contribution this particular patch made to the overall SpecSFS
> improvements. Sorry.

That does sound promising though. :-)

It would be interesting to get some better information regarding
some of the measurable performance ramifications such as SFS
though. The Linux NFS server has not had much attention paid to
it and I suspect that it could use some work in the performance
area.

Thanx...

ps

2009-02-19 06:29:05

by Greg Banks

Subject: Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages

J. Bruce Fields wrote:
> On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote:
>
>> [...] Under a high call-rate
>> low service-time workload, the result is that almost every nfsd is
>> runnable, but only a handful are actually able to run. This situation
>> causes two significant problems:
>>
>> 1. The CPU scheduler takes over 10% of each CPU, which is robbing
>> the nfsd threads of valuable CPU time.
>>
>> 2. At a high enough load, the nfsd threads starve userspace threads
>> of CPU time, to the point where daemons like portmap and rpc.mountd
>> do not schedule for tens of seconds at a time. Clients attempting
>> to mount an NFS filesystem timeout at the very first step (opening
>> a TCP connection to portmap) because portmap cannot wake up from
>> select() and call accept() in time.
>>
>> Disclaimer: these effects were observed on a SLES9 kernel, modern
>> kernels' schedulers may behave more gracefully.
>>
>
> Yes, googling for "SLES9 kernel"... Was that really 2.6.5 based?
>
> The scheduler's been through at least one complete rewrite since then,
> so the obvious question is whether it's wise to apply something that may
> turn out to have been very specific to an old version of the scheduler.
>
> It's a simple enough patch, but without any suggestion for how to retest
> on a more recent kernel, I'm uneasy.
>
>

Ok, fair enough. I retested using my local GIT tree, which is cloned
from yours and was last git-pull'd a couple of days ago. The test load
was the same as in my 2005 tests (multiple userspace threads each
simulating an rsync directory traversal from a 2.4 client, i.e. almost
entirely ACCESS calls with some READDIRs and GETATTRs, running as fast
as the server will respond). This was run on much newer hardware (and a
different architecture as well: a quad-core Xeon) so the results are not
directly comparable with my 2005 tests. However the effect with and
without the patch can be clearly seen, with otherwise identical hardware,
software and load (I added a sysctl to enable and disable the effect of
the patch at runtime).

A quick summary: the 2.6.29-rc4 CPU scheduler is not magically better
than the 2.6.5 one and NFS can still benefit from reducing load on it.

Here's a table of measured call rates and steady-state 1-minute load
averages, before and after the patch, versus number of client load
threads. The server was configured with 128 nfsds in the thread pool
which was under load. In all cases shown, the single CPU in the
thread pool was 100% busy (I've elided the 8-thread results, where that
wasn't the case).

#threads before after
call/sec loadavg call/sec loadavg
-------- -------- ------- -------- -------
16 57353 10.98 74965 6.11
24 57787 19.56 79397 13.58
32 57921 26.00 80746 21.35
40 57936 35.32 81629 31.73
48 57930 43.84 81775 42.64
56 57467 51.05 81411 52.39
64 57595 57.93 81543 64.61


As you can see, the patch improves NFS throughput for this load by up
to 40%, which is a surprisingly large improvement. I suspect it's a
larger improvement because my 2005 tests had multiple CPUs serving NFS
traffic, and the improvements due to this patch were drowned in various
SMP effects which are absent from this test.

Also surprising is that the patch improves the reported load average
number only at higher numbers of client threads; at low client thread
counts the load average is unchanged or even slightly higher. The patch
didn't have that effect back in 2005, so I'm confused by that
behaviour. Perhaps the difference is due to changes in the scheduler or
the accounting that measures load averages?

Profiling at 16 client threads, 32 server threads shows differences in
the CPU usage in the CPU scheduler itself, with some ACPI effects too.
The platform I ran on in 2005 did not support ACPI, so that's new to
me. Nevertheless it makes a difference. Here are the top samples from
a couple of 30-second flat profiles.

Before:

samples % image name app name symbol name
3013 4.9327 processor.ko processor acpi_idle_enter_simple <---
2583 4.2287 sunrpc.ko sunrpc svc_recv
1273 2.0841 e1000e.ko e1000e e1000_irq_enable
1235 2.0219 sunrpc.ko sunrpc svc_process
1070 1.7517 e1000e.ko e1000e e1000_intr_msi
966 1.5815 e1000e.ko e1000e e1000_xmit_frame
884 1.4472 sunrpc.ko sunrpc svc_xprt_enqueue
861 1.4096 e1000e.ko e1000e e1000_clean_rx_irq
774 1.2671 xfs.ko xfs xfs_iget
772 1.2639 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb schedule <---
726 1.1886 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb sched_clock <---
693 1.1345 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb read_hpet <---
680 1.1133 sunrpc.ko sunrpc cache_check
671 1.0985 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_sendpage
641 1.0494 sunrpc.ko sunrpc sunrpc_cache_lookup

Total % cpu from ACPI & scheduler: 8.5%

After:

samples % image name app name symbol name
5145 5.2163 sunrpc.ko sunrpc svc_recv
2908 2.9483 processor.ko processor acpi_idle_enter_simple <---
2731 2.7688 sunrpc.ko sunrpc svc_process
2092 2.1210 e1000e.ko e1000e e1000_clean_rx_irq
1988 2.0155 e1000e.ko e1000e e1000_xmit_frame
1863 1.8888 e1000e.ko e1000e e1000_irq_enable
1606 1.6282 xfs.ko xfs xfs_iget
1514 1.5350 sunrpc.ko sunrpc cache_check
1389 1.4082 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_recvmsg
1383 1.4022 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_sendpage
1310 1.3281 sunrpc.ko sunrpc svc_xprt_enqueue
1177 1.1933 sunrpc.ko sunrpc sunrpc_cache_lookup
1142 1.1578 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb get_page_from_freelist
1135 1.1507 sunrpc.ko sunrpc svc_tcp_recvfrom
1126 1.1416 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_transmit_skb
1040 1.0544 e1000e.ko e1000e e1000_intr_msi
1033 1.0473 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_ack
1030 1.0443 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb kref_get
1000 1.0138 nfsd.ko nfsd fh_verify

Total % cpu from ACPI & scheduler: 2.9%


Does that make you less uneasy?


--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-02-11 23:10:26

by J. Bruce Fields

Subject: Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages

On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote:
> Avoid overloading the CPU scheduler with enormous load averages
> when handling high call-rate NFS loads. When the knfsd bottom half
> is made aware of an incoming call by the socket layer, it tries to
> choose an nfsd thread and wake it up. As long as there are idle
> threads, one will be woken up.
>
> If there are a lot of nfsd threads (a sensible configuration when
> the server is disk-bound or is running an HSM), there will be many
> more nfsd threads than CPUs to run them. Under a high call-rate
> low service-time workload, the result is that almost every nfsd is
> runnable, but only a handful are actually able to run. This situation
> causes two significant problems:
>
> 1. The CPU scheduler takes over 10% of each CPU, which is robbing
> the nfsd threads of valuable CPU time.
>
> 2. At a high enough load, the nfsd threads starve userspace threads
> of CPU time, to the point where daemons like portmap and rpc.mountd
> do not schedule for tens of seconds at a time. Clients attempting
> to mount an NFS filesystem timeout at the very first step (opening
> a TCP connection to portmap) because portmap cannot wake up from
> select() and call accept() in time.
>
> Disclaimer: these effects were observed on a SLES9 kernel, modern
> kernels' schedulers may behave more gracefully.

Yes, googling for "SLES9 kernel"... Was that really 2.6.5 based?

The scheduler's been through at least one complete rewrite since then,
so the obvious question is whether it's wise to apply something that may
turn out to have been very specific to an old version of the scheduler.

It's a simple enough patch, but without any suggestion for how to retest
on a more recent kernel, I'm uneasy.

--b.

>
> The solution is simple: keep in each svc_pool a counter of the number
> of threads which have been woken but have not yet run, and do not wake
> any more if that count reaches an arbitrary small threshold.
>
> Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
> synthetic client threads simulating an rsync (i.e. recursive directory
> listing) workload reading from an i386 RH9 install image (161480
> regular files in 10841 directories) on the server. That tree is small
> enough to fit in the server's RAM so no disk traffic was involved.
> This setup gives a sustained call rate in excess of 60000 calls/sec
> before being CPU-bound on the server. The server was running 128 nfsds.
>
> Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
> taking 5.2%. This patch drops those contributions to 3.0% and 2.2%.
> Load average was over 120 before the patch, and 20.9 after.
>
> This patch is a forward-ported version of knfsd-avoid-nfsd-overload
> which has been shipping in the SGI "Enhanced NFS" product since 2006.
> It has been posted before:
>
> http://article.gmane.org/gmane.linux.nfs/10374
>
> Signed-off-by: Greg Banks <[email protected]>
> ---
>
> include/linux/sunrpc/svc.h | 2 ++
> net/sunrpc/svc_xprt.c | 25 ++++++++++++++++++-------
> 2 files changed, 20 insertions(+), 7 deletions(-)
>
> Index: bfields/include/linux/sunrpc/svc.h
> ===================================================================
> --- bfields.orig/include/linux/sunrpc/svc.h
> +++ bfields/include/linux/sunrpc/svc.h
> @@ -41,6 +41,7 @@ struct svc_pool {
> struct list_head sp_sockets; /* pending sockets */
> unsigned int sp_nrthreads; /* # of threads in pool */
> struct list_head sp_all_threads; /* all server threads */
> + int sp_nwaking; /* number of threads woken but not yet active */
> } ____cacheline_aligned_in_smp;
>
> /*
> @@ -264,6 +265,7 @@ struct svc_rqst {
> * cache pages */
> wait_queue_head_t rq_wait; /* synchronization */
> struct task_struct *rq_task; /* service thread */
> + int rq_waking; /* 1 if thread is being woken */
> };
>
> /*
> Index: bfields/net/sunrpc/svc_xprt.c
> ===================================================================
> --- bfields.orig/net/sunrpc/svc_xprt.c
> +++ bfields/net/sunrpc/svc_xprt.c
> @@ -14,6 +14,8 @@
>
> #define RPCDBG_FACILITY RPCDBG_SVCXPRT
>
> +#define SVC_MAX_WAKING 5
> +
> static struct svc_deferred_req *svc_deferred_dequeue(struct svc_xprt *xprt);
> static int svc_deferred_recv(struct svc_rqst *rqstp);
> static struct cache_deferred_req *svc_defer(struct cache_req *req);
> @@ -298,6 +300,7 @@ void svc_xprt_enqueue(struct svc_xprt *x
> struct svc_pool *pool;
> struct svc_rqst *rqstp;
> int cpu;
> + int thread_avail;
>
> if (!(xprt->xpt_flags &
> ((1<<XPT_CONN)|(1<<XPT_DATA)|(1<<XPT_CLOSE)|(1<<XPT_DEFERRED))))
> @@ -309,12 +312,6 @@ void svc_xprt_enqueue(struct svc_xprt *x
>
> spin_lock_bh(&pool->sp_lock);
>
> - if (!list_empty(&pool->sp_threads) &&
> - !list_empty(&pool->sp_sockets))
> - printk(KERN_ERR
> - "svc_xprt_enqueue: "
> - "threads and transports both waiting??\n");
> -
> if (test_bit(XPT_DEAD, &xprt->xpt_flags)) {
> /* Don't enqueue dead transports */
> dprintk("svc: transport %p is dead, not enqueued\n", xprt);
> @@ -353,7 +350,14 @@ void svc_xprt_enqueue(struct svc_xprt *x
> }
>
> process:
> - if (!list_empty(&pool->sp_threads)) {
> + /* Work out whether threads are available */
> + thread_avail = !list_empty(&pool->sp_threads); /* threads are asleep */
> + if (pool->sp_nwaking >= SVC_MAX_WAKING) {
> + /* too many threads are runnable and trying to wake up */
> + thread_avail = 0;
> + }
> +
> + if (thread_avail) {
> rqstp = list_entry(pool->sp_threads.next,
> struct svc_rqst,
> rq_list);
> @@ -368,6 +372,8 @@ void svc_xprt_enqueue(struct svc_xprt *x
> svc_xprt_get(xprt);
> rqstp->rq_reserved = serv->sv_max_mesg;
> atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> + rqstp->rq_waking = 1;
> + pool->sp_nwaking++;
> BUG_ON(xprt->xpt_pool != pool);
> wake_up(&rqstp->rq_wait);
> } else {
> @@ -633,6 +639,11 @@ int svc_recv(struct svc_rqst *rqstp, lon
> return -EINTR;
>
> spin_lock_bh(&pool->sp_lock);
> + if (rqstp->rq_waking) {
> + rqstp->rq_waking = 0;
> + pool->sp_nwaking--;
> + BUG_ON(pool->sp_nwaking < 0);
> + }
> xprt = svc_xprt_dequeue(pool);
> if (xprt) {
> rqstp->rq_xprt = xprt;
>
> --
> Greg Banks, P.Engineer, SGI Australian Software Group.
> the brightly coloured sporks of revolution.
> I don't speak for SGI.

2009-03-15 21:21:13

by J. Bruce Fields

Subject: Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages

On Thu, Feb 19, 2009 at 05:25:47PM +1100, Greg Banks wrote:
> J. Bruce Fields wrote:
> > On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote:
> >
> >> [...] Under a high call-rate
> >> low service-time workload, the result is that almost every nfsd is
> >> runnable, but only a handful are actually able to run. This situation
> >> causes two significant problems:
> >>
> >> 1. The CPU scheduler takes over 10% of each CPU, which is robbing
> >> the nfsd threads of valuable CPU time.
> >>
> >> 2. At a high enough load, the nfsd threads starve userspace threads
> >> of CPU time, to the point where daemons like portmap and rpc.mountd
> >> do not schedule for tens of seconds at a time. Clients attempting
> >> to mount an NFS filesystem timeout at the very first step (opening
> >> a TCP connection to portmap) because portmap cannot wake up from
> >> select() and call accept() in time.
> >>
> >> Disclaimer: these effects were observed on a SLES9 kernel, modern
> >> kernels' schedulers may behave more gracefully.
> >>
> >
> > Yes, googling for "SLES9 kernel"... Was that really 2.6.5 based?
> >
> > The scheduler's been through at least one complete rewrite since then,
> > so the obvious question is whether it's wise to apply something that may
> > turn out to have been very specific to an old version of the scheduler.
> >
> > It's a simple enough patch, but without any suggestion for how to retest
> > on a more recent kernel, I'm uneasy.
> >
> >
>
> Ok, fair enough. I retested using my local GIT tree, which is cloned
> from yours and was last git-pull'd a couple of days ago. The test load
> was the same as in my 2005 tests (multiple userspace threads each
> simulating an rsync directory traversal from a 2.4 client, i.e. almost
> entirely ACCESS calls with some READDIRs and GETATTRs, running as fast
> as the server will respond). This was run on much newer hardware (and a
> different architecture as well: a quad-core Xeon) so the results are not
> directly comparable with my 2005 tests. However the effect with and
> without the patch can be clearly seen, with otherwise identical hardware,
> software and load (I added a sysctl to enable and disable the effect of
> the patch at runtime).
>
> A quick summary: the 2.6.29-rc4 CPU scheduler is not magically better
> than the 2.6.5 one and NFS can still benefit from reducing load on it.
>
> Here's a table of measured call rates and steady-state 1-minute load
> averages, before and after the patch, versus number of client load
> threads. The server was configured with 128 nfsds in the thread pool
> which was under load. In all cases shown, the single CPU in the
> thread pool was 100% busy (I've elided the 8-thread results, where that
> wasn't the case).
>
> #threads before after
> call/sec loadavg call/sec loadavg
> -------- -------- ------- -------- -------
> 16 57353 10.98 74965 6.11
> 24 57787 19.56 79397 13.58
> 32 57921 26.00 80746 21.35
> 40 57936 35.32 81629 31.73
> 48 57930 43.84 81775 42.64
> 56 57467 51.05 81411 52.39
> 64 57595 57.93 81543 64.61
>
>
> As you can see, the patch improves NFS throughput for this load by up
> to 40%, which is a surprisingly large improvement. I suspect it's a
> larger improvement because my 2005 tests had multiple CPUs serving NFS
> traffic, and the improvements due to this patch were drowned in various
> SMP effects which are absent from this test.
>
> Also surprising is that the patch improves the reported load average
> number only at higher numbers of client threads; at low client thread
> counts the load average is unchanged or even slightly higher. The patch
> didn't have that effect back in 2005, so I'm confused by that
> behaviour. Perhaps the difference is due to changes in the scheduler or
> the accounting that measures load averages?
>
> Profiling at 16 client threads, 32 server threads shows differences in
> the CPU usage in the CPU scheduler itself, with some ACPI effects too.
> The platform I ran on in 2005 did not support ACPI, so that's new to
> me. Nevertheless it makes a difference. Here are the top samples from
> a couple of 30-second flat profiles.
>
> Before:
>
> samples % image name app name symbol name
> 3013 4.9327 processor.ko processor acpi_idle_enter_simple <---
> 2583 4.2287 sunrpc.ko sunrpc svc_recv
> 1273 2.0841 e1000e.ko e1000e e1000_irq_enable
> 1235 2.0219 sunrpc.ko sunrpc svc_process
> 1070 1.7517 e1000e.ko e1000e e1000_intr_msi
> 966 1.5815 e1000e.ko e1000e e1000_xmit_frame
> 884 1.4472 sunrpc.ko sunrpc svc_xprt_enqueue
> 861 1.4096 e1000e.ko e1000e e1000_clean_rx_irq
> 774 1.2671 xfs.ko xfs xfs_iget
> 772 1.2639 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb schedule <---
> 726 1.1886 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb sched_clock <---
> 693 1.1345 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb read_hpet <---
> 680 1.1133 sunrpc.ko sunrpc cache_check
> 671 1.0985 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_sendpage
> 641 1.0494 sunrpc.ko sunrpc sunrpc_cache_lookup
>
> Total % cpu from ACPI & scheduler: 8.5%
>
> After:
>
> samples % image name app name symbol name
> 5145 5.2163 sunrpc.ko sunrpc svc_recv
> 2908 2.9483 processor.ko processor acpi_idle_enter_simple <---
> 2731 2.7688 sunrpc.ko sunrpc svc_process
> 2092 2.1210 e1000e.ko e1000e e1000_clean_rx_irq
> 1988 2.0155 e1000e.ko e1000e e1000_xmit_frame
> 1863 1.8888 e1000e.ko e1000e e1000_irq_enable
> 1606 1.6282 xfs.ko xfs xfs_iget
> 1514 1.5350 sunrpc.ko sunrpc cache_check
> 1389 1.4082 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_recvmsg
> 1383 1.4022 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_sendpage
> 1310 1.3281 sunrpc.ko sunrpc svc_xprt_enqueue
> 1177 1.1933 sunrpc.ko sunrpc sunrpc_cache_lookup
> 1142 1.1578 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb get_page_from_freelist
> 1135 1.1507 sunrpc.ko sunrpc svc_tcp_recvfrom
> 1126 1.1416 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_transmit_skb
> 1040 1.0544 e1000e.ko e1000e e1000_intr_msi
> 1033 1.0473 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_ack
> 1030 1.0443 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb kref_get
> 1000 1.0138 nfsd.ko nfsd fh_verify
>
> Total % cpu from ACPI & scheduler: 2.9%
>
>
> Does that make you less uneasy?

Yes, thanks!

Queued up for 2.6.30, barring objections. But perhaps we should pass on
the patch and your results to people who know the scheduler better and
see if they can explain e.g. the loadavg numbers.

--b.

2009-03-16 03:05:25

by Greg Banks

Subject: Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages

J. Bruce Fields wrote:
> On Thu, Feb 19, 2009 at 05:25:47PM +1100, Greg Banks wrote:
>
>> J. Bruce Fields wrote:
>>
>>> On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote:
>>>
>>> [...]
>>> It's a simple enough patch, but without any suggestion for how to retest
>>> on a more recent kernel, I'm uneasy.
>>>
>> [...]
>>
>> Does that make you less uneasy?
>>
>
> Yes, thanks!
>
> Queued up for 2.6.30, barring objections.
Thanks.
> But perhaps we should pass on
> the patch and your results to people who know the scheduler better and
> see if they can explain e.g. the loadavg numbers.
>
If you like. Personally I'm happy with assuming that it's because nfsd
is putting an unnaturally harsh load on the scheduler.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.