2009-01-13 10:27:00

by Greg Banks

Subject: [patch 1/3] knfsd: remove the nfsd thread busy histogram

Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
because the questionable data provided is not worth the scalability
impact of calculating it. Instead, always report zeroes. The current
approach suffers from three major issues:

1. update_thread_usage() increments buckets by call service
time or call arrival time...in jiffies. On lightly loaded
machines, call service times are usually < 1 jiffy; on
heavily loaded machines call arrival times will be << 1 jiffy.
So a large portion of the updates to the buckets are rounded
down to zero, and the histogram is undercounting.

2. As seen previously on the nfs mailing list, the format in which
the histogram is presented is cryptic, difficult to explain,
and difficult to use.

3. Updating the histogram requires taking a global spinlock and
dirtying the global variables nfsd_last_call, nfsd_busy, and
nfsdstats *twice* on every RPC call, which is a significant
scaling limitation.

Testing on a 4-CPU, 4-NIC Altix, using 4 IRIX clients each doing
1K streaming reads at full line rate, shows that the stats update code
(inlined into nfsd()) takes about 1.7% of each CPU. This patch drops
the contribution from nfsd() into the profile noise.

This patch is a forward-ported version of knfsd-remove-nfsd-threadstats
which has been shipping in the SGI "Enhanced NFS" product since 2006.
In that time, exactly one customer has noticed that the threadstats
were missing. It has been previously posted:

http://article.gmane.org/gmane.linux.nfs/10376

and more recently requested to be posted again.

Signed-off-by: Greg Banks <[email protected]>
---

fs/nfsd/nfssvc.c | 28 ----------------------------
1 file changed, 28 deletions(-)

Index: bfields/fs/nfsd/nfssvc.c
===================================================================
--- bfields.orig/fs/nfsd/nfssvc.c
+++ bfields/fs/nfsd/nfssvc.c
@@ -40,9 +40,6 @@
extern struct svc_program nfsd_program;
static int nfsd(void *vrqstp);
struct timeval nfssvc_boot;
-static atomic_t nfsd_busy;
-static unsigned long nfsd_last_call;
-static DEFINE_SPINLOCK(nfsd_call_lock);

/*
* nfsd_mutex protects nfsd_serv -- both the pointer itself and the members
@@ -227,7 +224,6 @@ int nfsd_create_serv(void)
nfsd_max_blksize /= 2;
}

- atomic_set(&nfsd_busy, 0);
nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
AF_INET,
nfsd_last_thread, nfsd, THIS_MODULE);
@@ -376,26 +372,6 @@ nfsd_svc(unsigned short port, int nrserv
return error;
}

-static inline void
-update_thread_usage(int busy_threads)
-{
- unsigned long prev_call;
- unsigned long diff;
- int decile;
-
- spin_lock(&nfsd_call_lock);
- prev_call = nfsd_last_call;
- nfsd_last_call = jiffies;
- decile = busy_threads*10/nfsdstats.th_cnt;
- if (decile>0 && decile <= 10) {
- diff = nfsd_last_call - prev_call;
- if ( (nfsdstats.th_usage[decile-1] += diff) >= NFSD_USAGE_WRAP)
- nfsdstats.th_usage[decile-1] -= NFSD_USAGE_WRAP;
- if (decile == 10)
- nfsdstats.th_fullcnt++;
- }
- spin_unlock(&nfsd_call_lock);
-}

/*
* This is the NFS server kernel thread
@@ -464,8 +440,6 @@ nfsd(void *vrqstp)
continue;
}

- update_thread_usage(atomic_read(&nfsd_busy));
- atomic_inc(&nfsd_busy);

/* Lock the export hash tables for reading. */
exp_readlock();
@@ -474,8 +448,6 @@ nfsd(void *vrqstp)

/* Unlock export hash tables */
exp_readunlock();
- update_thread_usage(atomic_read(&nfsd_busy));
- atomic_dec(&nfsd_busy);
}

/* Clear signals before calling svc_exit_thread() */

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-13 22:59:04

by Greg Banks

Subject: Re: [patch 1/3] knfsd: remove the nfsd thread busy histogram

Chuck Lever wrote:
> On Jan 13, 2009, at 5:26 AM, Greg Banks wrote:
>> Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
>> because the questionable data provided is not worth the scalability
>> impact of calculating it. Instead, always report zeroes. The current
>> approach suffers from three major issues:
>>
>> 1. update_thread_usage() increments buckets by call service
>> time or call arrival time...in jiffies. On lightly loaded
>> machines, call service times are usually < 1 jiffy; on
>> heavily loaded machines call arrival times will be << 1 jiffy.
>> So a large portion of the updates to the buckets are rounded
>> down to zero, and the histogram is undercounting.
>
> Use ktime_get_real() instead. This is what the network layer uses.
IIRC that wasn't available when I wrote the patch (2.6.5 kernel in SLES9
in late 2005). I haven't looked at it again since.

Later, I looked at gathering better statistics on thread usage, and I
investigated real time clock (more precisely, monotonic clock)
implementations in Linux and came to the sad conclusion that there was
no API I could call that would be both accurate and efficient on two or
more platforms, so I gave up. The HPET hardware timer looked promising
for a while, but it turned out that a 32b kernel used a global spinlock
to access the 64b HPET registers, which created the same scalability
problem I was trying to fix. Things may have improved since then.

If we had such a clock though, the solution is very simple. Each nfsd
maintains two new 64b counters of nanoseconds spent in each of two
states "busy" and "idle", where "idle" is asleep waiting for a call and
"busy" is everything else. These are maintained in svc_recv().
Interfaces are provided for userspace to read an aggregation of these
counters, per-pool and globally. Userspace rate-converts the counters;
the rate of increase of the two counters tells you both how many threads
there are and how much actual demand on thread time there is. This is
how I did it in Irix (SGI Origin machines had a global distributed
monotonic clock in hardware).

> You could even steal the start timestamp from the first skbuff in an
> incoming RPC request.

This would help if we had skbuffs: NFS/RDMA doesn't.
>
> This problem is made worse on "server" configurations and in virtual
> guests which may still use HZ=100, though with tickless HZ this is a
> less frequently seen configuration.

Indeed.
>
>> 2. As seen previously on the nfs mailing list, the format in which
>> the histogram is presented is cryptic, difficult to explain,
>> and difficult to use.
>
> A user space script similar to mountstats that interprets these
> metrics might help here.

The formatting in the pseudofile isn't the entire problem. The problem
is translating the "thread usage histogram" information there into an
answer to the actual question the sysadmin wants, which is "should I
configure more nfsds?"

>
>> 3. Updating the histogram requires taking a global spinlock and
>> dirtying the global variables nfsd_last_call, nfsd_busy, and
>> nfsdstats *twice* on every RPC call, which is a significant
>> scaling limitation.
>
> You might fix this by making the global variables into per-CPU
> variables, then totaling the per-CPU variables only at presentation
> time (ie when someone cats /proc/net/rpc/nfsd). That would make the
> collection logic lockless.
This is how I fixed some of the other server stats in later patches.
IIRC that approach doesn't work for the thread usage histogram because
it's scaled as it's gathered by a potentially time-varying global number
so the on-demand totalling might not give correct results. Also, in the
presence of multiple thread pools any thread usage information should be
per-pool not global. At the time I wrote this patch I concluded that I
couldn't make the gathering scale and still preserve the exact semantics
of the data gathered.

>
>>
>
> Yeah. The real issue here is deciding whether these stats are useful
> or not;
In my experience, not.

> if not, can they be made useable?
A different form of the data could certainly be made useful.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-13 16:41:23

by Chuck Lever III

Subject: Re: [patch 1/3] knfsd: remove the nfsd thread busy histogram

On Jan 13, 2009, at 5:26 AM, Greg Banks wrote:
> Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
> because the questionable data provided is not worth the scalability
> impact of calculating it. Instead, always report zeroes. The current
> approach suffers from three major issues:
>
> 1. update_thread_usage() increments buckets by call service
> time or call arrival time...in jiffies. On lightly loaded
> machines, call service times are usually < 1 jiffy; on
> heavily loaded machines call arrival times will be << 1 jiffy.
> So a large portion of the updates to the buckets are rounded
> down to zero, and the histogram is undercounting.

Use ktime_get_real() instead. This is what the network layer uses.
You could even steal the start timestamp from the first skbuff in an
incoming RPC request.

This problem is made worse on "server" configurations and in virtual
guests which may still use HZ=100, though with tickless HZ this is a
less frequently seen configuration.

> 2. As seen previously on the nfs mailing list, the format in which
> the histogram is presented is cryptic, difficult to explain,
> and difficult to use.

A user space script similar to mountstats that interprets these
metrics might help here.

> 3. Updating the histogram requires taking a global spinlock and
> dirtying the global variables nfsd_last_call, nfsd_busy, and
> nfsdstats *twice* on every RPC call, which is a significant
> scaling limitation.

You might fix this by making the global variables into per-CPU
variables, then totaling the per-CPU variables only at presentation
time (ie when someone cats /proc/net/rpc/nfsd). That would make the
collection logic lockless.

> Testing on a 4 CPU 4 NIC Altix using 4 IRIX clients each doing
> 1K streaming reads at full line rate, shows the stats update code
> (inlined into nfsd()) takes about 1.7% of each CPU. This patch drops
> the contribution from nfsd() into the profile noise.
>
> This patch is a forward-ported version of knfsd-remove-nfsd-threadstats
> which has been shipping in the SGI "Enhanced NFS" product since 2006.
> In that time, exactly one customer has noticed that the threadstats
> were missing.

Yeah. The real issue here is deciding whether these stats are useful
or not; if not, can they be made useable?

> It has been previously posted:
>
> http://article.gmane.org/gmane.linux.nfs/10376
>
> and more recently requested to be posted again.
>
> Signed-off-by: Greg Banks <[email protected]>
> ---
>
> fs/nfsd/nfssvc.c | 28 ----------------------------
> 1 file changed, 28 deletions(-)
>
> Index: bfields/fs/nfsd/nfssvc.c
> ===================================================================
> --- bfields.orig/fs/nfsd/nfssvc.c
> +++ bfields/fs/nfsd/nfssvc.c
> @@ -40,9 +40,6 @@
> extern struct svc_program nfsd_program;
> static int nfsd(void *vrqstp);
> struct timeval nfssvc_boot;
> -static atomic_t nfsd_busy;
> -static unsigned long nfsd_last_call;
> -static DEFINE_SPINLOCK(nfsd_call_lock);
>
> /*
> * nfsd_mutex protects nfsd_serv -- both the pointer itself and the members
> @@ -227,7 +224,6 @@ int nfsd_create_serv(void)
> nfsd_max_blksize /= 2;
> }
>
> - atomic_set(&nfsd_busy, 0);
> nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
> AF_INET,
> nfsd_last_thread, nfsd, THIS_MODULE);
> @@ -376,26 +372,6 @@ nfsd_svc(unsigned short port, int nrserv
> return error;
> }
>
> -static inline void
> -update_thread_usage(int busy_threads)
> -{
> - unsigned long prev_call;
> - unsigned long diff;
> - int decile;
> -
> - spin_lock(&nfsd_call_lock);
> - prev_call = nfsd_last_call;
> - nfsd_last_call = jiffies;
> - decile = busy_threads*10/nfsdstats.th_cnt;
> - if (decile>0 && decile <= 10) {
> - diff = nfsd_last_call - prev_call;
> - if ( (nfsdstats.th_usage[decile-1] += diff) >= NFSD_USAGE_WRAP)
> - nfsdstats.th_usage[decile-1] -= NFSD_USAGE_WRAP;
> - if (decile == 10)
> - nfsdstats.th_fullcnt++;
> - }
> - spin_unlock(&nfsd_call_lock);
> -}
>
> /*
> * This is the NFS server kernel thread
> @@ -464,8 +440,6 @@ nfsd(void *vrqstp)
> continue;
> }
>
> - update_thread_usage(atomic_read(&nfsd_busy));
> - atomic_inc(&nfsd_busy);
>
> /* Lock the export hash tables for reading. */
> exp_readlock();
> @@ -474,8 +448,6 @@ nfsd(void *vrqstp)
>
> /* Unlock export hash tables */
> exp_readunlock();
> - update_thread_usage(atomic_read(&nfsd_busy));
> - atomic_dec(&nfsd_busy);
> }
>
> /* Clear signals before calling svc_exit_thread() */
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-02-11 21:59:41

by J. Bruce Fields

Subject: Re: [patch 1/3] knfsd: remove the nfsd thread busy histogram

On Wed, Jan 14, 2009 at 09:50:52AM +1100, Greg Banks wrote:
> Chuck Lever wrote:
> > On Jan 13, 2009, at 5:26 AM, Greg Banks wrote:
> >> Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
> >> because the questionable data provided is not worth the scalability
> >> impact of calculating it. Instead, always report zeroes. The current
> >> approach suffers from three major issues:
> >>
> >> 1. update_thread_usage() increments buckets by call service
> >> time or call arrival time...in jiffies. On lightly loaded
> >> machines, call service times are usually < 1 jiffy; on
> >> heavily loaded machines call arrival times will be << 1 jiffy.
> >> So a large portion of the updates to the buckets are rounded
> >> down to zero, and the histogram is undercounting.
> >
> > Use ktime_get_real() instead. This is what the network layer uses.
> IIRC that wasn't available when I wrote the patch (2.6.5 kernel in SLES9
> in late 2005). I haven't looked at it again since.
>
> Later, I looked at gathering better statistics on thread usage, and I
> investigated real time clock (more precisely, monotonic clock)
> implementations in Linux and came to the sad conclusion that there was
> no API I could call that would be both accurate and efficient on two or
> more platforms, so I gave up. The HPET hardware timer looked promising
> for a while, but it turned out that a 32b kernel used a global spinlock
> to access the 64b HPET registers, which created the same scalability
> problem I was trying to fix. Things may have improved since then.
>
> If we had such a clock though, the solution is very simple. Each nfsd
> maintains two new 64b counters of nanoseconds spent in each of two
> states "busy" and "idle", where "idle" is asleep waiting for a call and
> "busy" is everything else. These are maintained in svc_recv().
> Interfaces are provided for userspace to read an aggregation of these
> counters, per-pool and globally. Userspace rate-converts the counters;
> the rate of increase of the two counters tells you both how many threads
> there are and how much actual demand on thread time there is. This is
> how I did it in Irix (SGI Origin machines had a global distributed
> monotonic clock in hardware).

Is it worth looking into this again now?

> > You could even steal the start timestamp from the first skbuff in an
> > incoming RPC request.
>
> This would help if we had skbuffs: NFS/RDMA doesn't.
> >
> > This problem is made worse on "server" configurations and in virtual
> > guests which may still use HZ=100, though with tickless HZ this is a
> > less frequently seen configuration.
>
> Indeed.
> >
> >> 2. As seen previously on the nfs mailing list, the format in which
> >> the histogram is presented is cryptic, difficult to explain,
> >> and difficult to use.
> >
> > A user space script similar to mountstats that interprets these
> > metrics might help here.
>
> The formatting in the pseudofile isn't the entire problem. The problem
> is translating the "thread usage histogram" information there into an
> answer to the actual question the sysadmin wants, which is "should I
> configure more nfsds?"

Agreed.

> >> 3. Updating the histogram requires taking a global spinlock and
> >> dirtying the global variables nfsd_last_call, nfsd_busy, and
> >> nfsdstats *twice* on every RPC call, which is a significant
> >> scaling limitation.
> >
> > You might fix this by making the global variables into per-CPU
> > variables, then totaling the per-CPU variables only at presentation
> > time (ie when someone cats /proc/net/rpc/nfsd). That would make the
> > collection logic lockless.
> This is how I fixed some of the other server stats in later patches.
> IIRC that approach doesn't work for the thread usage histogram because
> it's scaled as it's gathered by a potentially time-varying global number
> so the on-demand totalling might not give correct results.

Right, so, amuse yourself watching me as I try to remember how the 'th'
line works: if it's attempting to report, e.g., what length of time 10%
of the threads are busy, it needs very local knowledge: if you know each
thread was busy 10 jiffies out of the last 100, you'd still need to know
whether they were all busy during the *same* 10-jiffy interval, or
whether that work was spread out more evenly over the 100 jiffies....

--b.

> Also, in the
> presence of multiple thread pools any thread usage information should be
> per-pool not global. At the time I wrote this patch I concluded that I
> couldn't make the gathering scale and still preserve the exact semantics
> of the data gathered.
>
> >
> >>
> >
> > Yeah. The real issue here is deciding whether these stats are useful
> > or not;
> In my experience, not.
>
> > if not, can they be made useable?
> A different form of the data could certainly be made useful.
>
> --
> Greg Banks, P.Engineer, SGI Australian Software Group.
> the brightly coloured sporks of revolution.
> I don't speak for SGI.
>