2009-09-03 00:07:04

by Andrew Morton

Subject: Re: VM issue causing high CPU loads

On Mon, 31 Aug 2009 22:39:20 +0200
Yohan <[email protected]> wrote:

> Yohan wrote:
> > Andrew Morton wrote:
> >> On Mon, 24 Aug 2009 16:23:22 +0200
> >> Yohan <[email protected]> wrote:
> >>> Hi,
> >>>
> >>> Does anyone have an idea about this:
> >>>
> >>> http://bugzilla.kernel.org/show_bug.cgi?id=14024
> >>>
> >> Please generate a kernel profile to work out where all the CPU time is
> >> being spent. Documentation/basic_profiling.txt is a starting point.
> >>
> > I posted some new reports; it seems that the problem is in
> > rpcauth_lookup_credcache ...

Thanks, that helps a lot.

> > For information, this is an IMAP mail server that mounts ~10 NetApps
> > over ~300 mountpoints.
> I saw this: http://patchwork.kernel.org/patch/24747/

I wonder what happened with Miquel's patch?

> I did only:
>
> --- linux-2.6.27.21/include/linux/sunrpc/auth.h 2009-03-23 23:04:09.000000000 +0100
> +++ linux-2.6.27.21/include/linux/sunrpc/auth.h 2009-05-19 16:02:35.000000000 +0200
> @@ -62,8 +62,12 @@
> */
> - #define RPC_CREDCACHE_HASHBITS 4
> + #define RPC_CREDCACHE_HASHBITS 12
>
>
> I have been testing it in production since Sunday: only 36% of one core is
> used by system time, versus more than 3 cores used by system time on another
> server that did a drop_caches in the morning...
>

OK, but it's still pretty bad. Let's tell the NFS guys.

In http://bugzilla.kernel.org/show_bug.cgi?id=14024 we appear to have a
major meltdown caused by the linear search in
rpcauth_lookup_credcache() with Yohan's workload.



2009-09-03 14:08:56

by Yohan

Subject: Re: VM issue causing high CPU loads

Trond Myklebust wrote:
> On Thu, 2009-09-03 at 15:39 +0200, Yohan wrote:
>
>>> As far as I can see, there is no RPCSEC_GSS involved, so credentials
>>> should never expire. They will be reused as long as processes aren't
>>> switching between thousands and thousands of different combinations of
>>> uid, gid and groups.
>>>
>> My servers are IMAP servers.
>> Each user (~15 million) has a specific uid, spread across ~10 NFS NetApp
>> storage systems.
>>
> OK, so 16 hash buckets are likely to be filled with ~10^6 entries each.
> I can see that might be a performance issue...
>
> So afaics, you did try adjusting the hashtable size. How much larger
> does it have to be before you start to get acceptable performance? If it
> solves your problem we could make hash table sizes adjustable via a
> module parameter, for instance.
>
I now run with a value of 12, and it works great for me...


2009-09-03 13:39:38

by Yohan

Subject: Re: VM issue causing high CPU loads


>>> I did only:
>>>
>>> --- linux-2.6.27.21/include/linux/sunrpc/auth.h 2009-03-23 23:04:09.000000000 +0100
>>> +++ linux-2.6.27.21/include/linux/sunrpc/auth.h 2009-05-19 16:02:35.000000000 +0200
>>> @@ -62,8 +62,12 @@
>>> */
>>> - #define RPC_CREDCACHE_HASHBITS 4
>>> + #define RPC_CREDCACHE_HASHBITS 12
>>>
>>>
>>> I have been testing it in production since Sunday: only 36% of one core is
>>> used by system time, versus more than 3 cores used by system time on another
>>> server that did a drop_caches in the morning...
>>>
>> OK, but it's still pretty bad. Let's tell the NFS guys.
>>
>> In http://bugzilla.kernel.org/show_bug.cgi?id=14024 we appear to have a
>> major meltdown caused by the linear search in
>> rpcauth_lookup_credcache() with Yohan's workload.
>>
> OK. Could we please have some more details about the actual workload involved here?
>
I added a new server CPU graph and a 60s readprofile to the bugzilla.

> As far as I can see, there is no RPCSEC_GSS involved, so credentials
> should never expire. They will be reused as long as processes aren't
> switching between thousands and thousands of different combinations of
> uid, gid and groups.
My servers are IMAP servers.
Each user (~15 million) has a specific uid, spread across ~10 NFS NetApp
storage systems.


2009-09-03 13:01:33

by Trond Myklebust

Subject: Re: VM issue causing high CPU loads

On Wed, 2009-09-02 at 17:06 -0700, Andrew Morton wrote:
> On Mon, 31 Aug 2009 22:39:20 +0200
> Yohan <[email protected]> wrote:
>
> > Yohan wrote:
> > > Andrew Morton wrote:
> > >> On Mon, 24 Aug 2009 16:23:22 +0200
> > >> Yohan <[email protected]> wrote:
> > >>> Hi,
> > >>>
> > >>> Does anyone have an idea about this:
> > >>>
> > >>> http://bugzilla.kernel.org/show_bug.cgi?id=14024
> > >>>
> > >> Please generate a kernel profile to work out where all the CPU time is
> > >> being spent. Documentation/basic_profiling.txt is a starting point.
> > >>
> > > I posted some new reports; it seems that the problem is in
> > > rpcauth_lookup_credcache ...
>
> Thanks, that helps a lot.
>
> > > For information, this is an IMAP mail server that mounts ~10 NetApps
> > > over ~300 mountpoints.
> > I saw this: http://patchwork.kernel.org/patch/24747/
>
> I wonder what happened with Miquel's patch?

At the time, I asked him to split out the various changes into several
patches.

His patch did a lot of different things that would impact workloads in
different ways. For instance, while increasing the hash table size is
not likely to cause a huge performance degradation for most people, the
change that decreases the garbage collection timeout is very likely to
cause issues (particularly with RPCSEC_GSS setups)...

> > I did only:
> >
> > --- linux-2.6.27.21/include/linux/sunrpc/auth.h 2009-03-23 23:04:09.000000000 +0100
> > +++ linux-2.6.27.21/include/linux/sunrpc/auth.h 2009-05-19 16:02:35.000000000 +0200
> > @@ -62,8 +62,12 @@
> > */
> > - #define RPC_CREDCACHE_HASHBITS 4
> > + #define RPC_CREDCACHE_HASHBITS 12
> >
> >
> > I have been testing it in production since Sunday: only 36% of one core is
> > used by system time, versus more than 3 cores used by system time on another
> > server that did a drop_caches in the morning...
> >
>
> OK, but it's still pretty bad. Let's tell the NFS guys.
>
> In http://bugzilla.kernel.org/show_bug.cgi?id=14024 we appear to have a
> major meltdown caused by the linear search in
> rpcauth_lookup_credcache() with Yohan's workload.
>

OK. Could we please have some more details about the actual workload
involved here?
As far as I can see, there is no RPCSEC_GSS involved, so credentials
should never expire. They will be reused as long as processes aren't
switching between thousands and thousands of different combinations of
uid, gid and groups.




2009-09-03 14:02:10

by Trond Myklebust

Subject: Re: VM issue causing high CPU loads

On Thu, 2009-09-03 at 15:39 +0200, Yohan wrote:
> > As far as I can see, there is no RPCSEC_GSS involved, so credentials
> > should never expire. They will be reused as long as processes aren't
> > switching between thousands and thousands of different combinations of
> > uid, gid and groups.
> My servers are IMAP servers.
> Each user (~15 million) has a specific uid, spread across ~10 NFS NetApp
> storage systems.

OK, so 16 hash buckets are likely to be filled with ~10^6 entries each.
I can see that might be a performance issue...

So afaics, you did try adjusting the hashtable size. How much larger
does it have to be before you start to get acceptable performance? If it
solves your problem we could make hash table sizes adjustable via a
module parameter, for instance.

Cheers
Trond
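
A rough check on the bucket arithmetic above, as a minimal C sketch. It
assumes that all ~15 million uids end up cached and that the hash spreads
them evenly, neither of which is measured in this thread. The helper names
bucket_count() and avg_chain_length() are made up for illustration and are
not sunrpc functions.

```c
#include <assert.h>
#include <stdint.h>

/* Number of buckets in a table indexed by `hashbits` bits of hash. */
static uint64_t bucket_count(unsigned int hashbits)
{
    assert(hashbits < 64);          /* keep the shift well defined */
    return (uint64_t)1 << hashbits;
}

/*
 * Average chain length per bucket, assuming an ideal (perfectly even)
 * hash distribution. Integer division, so the result rounds down.
 */
static uint64_t avg_chain_length(uint64_t entries, unsigned int hashbits)
{
    return entries / bucket_count(hashbits);
}
```

Under those assumptions, avg_chain_length(15000000, 4) gives 937500, in
line with the ~10^6 estimate, while avg_chain_length(15000000, 12) gives
3662, so a linear search in the 12-bit table walks a chain roughly 256
times shorter.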


2009-09-03 20:49:30

by Trond Myklebust

Subject: Re: VM issue causing high CPU loads

On Thu, 2009-09-03 at 13:05 -0700, Simon Kirby wrote:
> On Thu, Sep 03, 2009 at 10:02:06AM -0400, Trond Myklebust wrote:
>
> > OK, so 16 hash buckets are likely to be filled with ~10^6 entries each.
> > I can see that might be a performance issue...
>
> We have a similar setup with millions of UIDs over NFS (currently NFSv3).
> I _wish_ there were a way to use NFSv4 without having to use name-mapped
> UIDs and GIDs, since our user and group names come from MySQL anyway, and
> are guaranteed to be consistent across machines.

That's a separate issue.

I'm working on increasing the idmapper scalability; however, another
project is currently taking up most of my time. I can't guarantee that
the revised idmapper code will be finished in time to allow for
inclusion in 2.6.32.

> Why on earth does NFSv4 force the use of names?

NFSv4 aspires to be an internet-wide protocol, and so you cannot use
uids/gids: they just aren't guaranteed to represent a unique user
outside your local LDAP/NIS or /etc/passwd domain. Furthermore, uids and
gids are a posix construct. They simply don't work in environments where
you may have lots of non-posix systems.

Trond


2009-09-03 14:44:12

by Miquel van Smoorenburg

Subject: sunrpc: dynamically allocate credcache hashtables [was: Re: VM issue causing high CPU loads]

On Thu, 2009-09-03 at 10:02 -0400, Trond Myklebust wrote:
> On Thu, 2009-09-03 at 15:39 +0200, Yohan wrote:
> > > As far as I can see, there is no RPCSEC_GSS involved, so credentials
> > > should never expire. They will be reused as long as processes aren't
> > > switching between thousands and thousands of different combinations of
> > > uid, gid and groups.
> > My servers are IMAP servers.
> > Each user (~15 million) has a specific uid, spread across ~10 NFS NetApp
> > storage systems.
>
> OK, so 16 hash buckets are likely to be filled with ~10^6 entries each.
> I can see that might be a performance issue...
>
> So afaics, you did try adjusting the hashtable size. How much larger
> does it have to be before you start to get acceptable performance? If it
> solves your problem we could make hash table sizes adjustable via a
> module parameter, for instance.

That is *exactly* what my patch does :)
I ported it to 2.6.31-rc8-bk2 this afternoon, that was trivial.

What I wanted to discuss was whether there was another solution, whether
we should build something that auto-tunes hashtable sizes, or whether
there was a way to limit the size of the cache in another way.

I have the same usage pattern as Yohan (also an IMAP server for
potentially a few million different uids) - lots of uids are used, but
not simultaneously (maybe a few hundred or a thousand at the same time).
It's just that the inode/dentry/cred caches never expire because modern
boxes have lots and lots of memory.

Due to personal circumstances though I haven't been able to work on
anything much for the last few months. I apologize for keeping quiet.

Patch attached. I've removed the debugging stuff, this is only the
"dynamically allocate credcache hashtables" patch.

Patch description:

auth.h:          increase RPC_CREDCACHE_HASHBITS from 4 to 12
                 (16 hashtable entries -> 4096). This is just the default.
auth.c:          allocate hashtables dynamically
                 add sysctl for credcache_hashsize
auth_generic.c:  use rpcauth_init_credcache
auth_unix.c:     use rpcauth_init_credcache
sunrpc_syms.c:   add hashsize module parameter

Mike.
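
To illustrate the idea in the patch description above, here is a minimal
userspace C sketch of a credential cache whose bucket array is allocated
at init time from a runtime hashbits value, rather than sized by a
compile-time constant. This is not the attached patch and not the sunrpc
code: the types, the function names (cred_cache_init, cred_lookup,
cred_cache_destroy) and the multiplicative hash are all assumptions made
for the example, and the hashbits argument merely stands in for a module
parameter or sysctl.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cache entry: one per (uid, gid) pair. */
struct cred_entry {
    uint32_t uid;
    uint32_t gid;
    struct cred_entry *next;        /* chain within one bucket */
};

/* Hypothetical cache: bucket array sized at run time. */
struct cred_cache {
    unsigned int hashbits;
    struct cred_entry **buckets;
};

/* Allocate 2^hashbits empty buckets; returns 0 on success, -1 on failure. */
static int cred_cache_init(struct cred_cache *c, unsigned int hashbits)
{
    assert(hashbits >= 1 && hashbits <= 24);
    c->hashbits = hashbits;
    c->buckets = calloc((size_t)1 << hashbits, sizeof(*c->buckets));
    return c->buckets ? 0 : -1;
}

/* Multiplicative hash of the uid, keeping the top `hashbits` bits. */
static size_t cred_bucket(const struct cred_cache *c, uint32_t uid)
{
    return (size_t)((uid * 2654435761u) >> (32 - c->hashbits));
}

/*
 * Find the entry for (uid, gid), adding it if missing. The chain walk is
 * the linear search discussed in this thread; `steps`, if non-NULL, is
 * incremented once per entry examined.
 */
static struct cred_entry *cred_lookup(struct cred_cache *c, uint32_t uid,
                                      uint32_t gid, unsigned long *steps)
{
    struct cred_entry **head = &c->buckets[cred_bucket(c, uid)];
    struct cred_entry *e;

    for (e = *head; e != NULL; e = e->next) {
        if (steps)
            (*steps)++;
        if (e->uid == uid && e->gid == gid)
            return e;
    }
    e = malloc(sizeof(*e));
    if (e == NULL)
        return NULL;
    e->uid = uid;
    e->gid = gid;
    e->next = *head;                /* insert at head of the chain */
    *head = e;
    return e;
}

/* Free every entry and the bucket array. */
static void cred_cache_destroy(struct cred_cache *c)
{
    size_t i, n = (size_t)1 << c->hashbits;

    for (i = 0; i < n; i++) {
        struct cred_entry *e = c->buckets[i];
        while (e != NULL) {
            struct cred_entry *next = e->next;
            free(e);
            e = next;
        }
    }
    free(c->buckets);
    c->buckets = NULL;
}
```

Calling cred_cache_init(&cache, 4) versus cred_cache_init(&cache, 12) is
the only difference between the old and new default sizes in this sketch;
the lookup code is unchanged, only the number of chains the entries are
spread over.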


Attachments:
linux-2.6.31-rc8-git2-sunprc-credcache_hashsize.patch (8.92 kB)

2009-09-03 20:05:53

by Simon Kirby

Subject: Re: VM issue causing high CPU loads

On Thu, Sep 03, 2009 at 10:02:06AM -0400, Trond Myklebust wrote:

> OK, so 16 hash buckets are likely to be filled with ~10^6 entries each.
> I can see that might be a performance issue...

We have a similar setup with millions of UIDs over NFS (currently NFSv3).
I _wish_ there were a way to use NFSv4 without having to use name-mapped
UIDs and GIDs, since our user and group names come from MySQL anyway, and
are guaranteed to be consistent across machines.

Why on earth does NFSv4 force the use of names?

I was considering hacking the code to stick IDs in there anyway, but I
haven't looked at the feasibility of this. I suspect this would break or
complicate other things, but the current NFSv4 design just seems like an
incredible waste for this case.

Simon-

2009-09-03 21:22:01

by Muntz, Daniel

Subject: RE: VM issue causing high CPU loads

Amen. I understand that v4 wants to extend across domains, etc., but it
goes out of its way to prevent the use of uids/gids, which in the vast
majority of installations would work just fine and wouldn't incur the
overhead of the mapping/unmapping operations. There's no reason
uids/gids couldn't coexist with string names. If the 4.0 spec had a
slightly different version of this paragraph:

To provide a greater degree of compatibility with previous versions
of NFS (i.e., v2 and v3), which identified users and groups by 32-bit
unsigned uid's and gid's, owner and group strings that consist of
decimal numeric values with no leading zeros can be given a special
interpretation by clients and servers which choose to provide such
support. The receiver may treat such a user or group string as
representing the same user as would be represented by a v2/v3 uid or
gid having the corresponding numeric value. A server is not
obligated to accept such a string, but may return an NFS4ERR_BADOWNER
instead. To avoid this mechanism being used to subvert user and
group translation, so that a client might pass all of the owners and
groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
error when there is a valid translation for the user or owner
designated in this way. In that case, the client must use the
appropriate name@domain string and not the special form for
compatibility.

i.e., take out the "subvert" portion, and just plain allow string
representations of uids/gids, then at least the conversion would just be
an atoi and itoa. Even better, allow the uids/gids to be used directly
and avoid the atoi/itoa, perhaps with a flag. Either case is better
than idmapd and getting EDELAY and an X-second pause in odd places
because NFS has to go to userspace for a translation.

-Dan Quixote
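
As an illustration of the "atoi" step described above, here is a minimal C
sketch that recognises an owner string consisting of a decimal value with
no leading zeros and converts it to a 32-bit id. The function name
parse_numeric_owner() is hypothetical; this is not taken from any NFS
client or server, and it does not model the NFS4ERR_BADOWNER policy in the
quoted paragraph.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Returns true and stores the value in *id if `owner` is a decimal number
 * with no leading zeros that fits in 32 bits. Anything else, including a
 * name@domain string, returns false and leaves *id untouched.
 */
static bool parse_numeric_owner(const char *owner, uint32_t *id)
{
    uint64_t value = 0;
    const char *p = owner;

    assert(id != NULL);
    if (p == NULL || *p == '\0')
        return false;               /* empty string is not a number */
    if (p[0] == '0' && p[1] != '\0')
        return false;               /* leading zero, e.g. "0123" */

    for (; *p != '\0'; p++) {
        if (*p < '0' || *p > '9')
            return false;           /* non-digit, e.g. "bob@example.org" */
        value = value * 10 + (uint64_t)(*p - '0');
        if (value > UINT32_MAX)
            return false;           /* does not fit in a 32-bit uid/gid */
    }
    *id = (uint32_t)value;
    return true;
}
```

A receiver that chose to support the numeric form could try this first and
fall back to the usual name@domain mapping when it returns false.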

> -----Original Message-----
> From: Simon Kirby [mailto:[email protected]]
> Sent: Thursday, September 03, 2009 1:06 PM
> To: Trond Myklebust
> Cc: Yohan; Andrew Morton; [email protected];
> [email protected]; Neil Brown; J. Bruce Fields;
> [email protected]
> Subject: Re: VM issue causing high CPU loads
>
> On Thu, Sep 03, 2009 at 10:02:06AM -0400, Trond Myklebust wrote:
>
> > OK, so 16 hash buckets are likely to be filled with ~10^6
> entries each.
> > I can see that might be a performance issue...
>
> We have a similar setup with millions of UIDs over NFS
> (currently NFSv3).
> I _wish_ there were a way to use NFSv4 without having to use
> name-mapped UIDs and GIDs, since our user and group names
> come from MySQL anyway, and are guaranteed to be consistent
> across machines.
>
> Why on earth does NFSv4 force the use of names?
>
> I was considering hacking the code to stick IDs in there
> anyway, but I haven't looked at the feasibility of this. I
> suspect this would break or complicate other things, but the
> current NFSv4 design just seems like an incredible waste for
> this case.
>
> Simon-

2009-09-03 22:22:53

by Simon Kirby

Subject: Re: VM issue causing high CPU loads

On Thu, Sep 03, 2009 at 04:49:25PM -0400, Trond Myklebust wrote:

> I'm working on increasing the idmapper scalability; however, another
> project is currently taking up most of my time. I can't guarantee that
> the revised idmapper code will be finished in time to allow for
> inclusion in 2.6.32.

Sure, improving it would be nice for cases where it's needed, but in
environments where all IDs are consistent (by design), it just seems
silly to force this extra work for zero gain.

> NFSv4 aspires to be an internet-wide protocol, and so you cannot use
> uids/gids: they just aren't guaranteed to represent a unique user
> outside your local LDAP/NIS or /etc/passwd domain. Furthermore, uids and
> gids are a posix construct. They simply don't work in environments where
> you may have lots of non-posix systems.

So, for environments with all POSIX systems, what do you think about
perhaps a mount or export flag that violates the spec on purpose to allow
numeric IDs to be used?

I can understand that the quiet use of IDs if name-to-user mapping fails
will cause security issues in environments without consistent users, so
it would now be unsafe to turn this on silently. However, making this
an option seems reasonable to me.

(Not that I know what I'm doing.)

Simon-

2009-09-04 12:31:42

by Trond Myklebust

Subject: Re: VM issue causing high CPU loads

On Thu, 2009-09-03 at 15:22 -0700, Simon Kirby wrote:
> So, for environments with all POSIX systems, what do you think about
> perhaps a mount or export flag that violates the spec on purpose to allow
> numeric IDs to be used?

No! I'm not interested in starting a LinuxPrivateNFSv4 protocol on top
of everything else we've got...

Trond