2006-04-26 01:14:19

by Steve Dickson

[permalink] [raw]
Subject: [PATCH][RFC] NFS: Improving the access cache

Currently the NFS client caches ACCESS information on a per-uid basis,
which falls apart when different processes with different uids consistently
access the same directory. The end result is a storm of needless
ACCESS calls...

The attached patch uses a hash table to store the nfs_access_entry
entries, so that an ACCESS request only happens when the
attributes time out. The table is indexed by the sum of the
nfs_inode pointer and the cr_uid in the cred structure, which should
spread things out nicely for some decent scalability (although the
locking scheme may need to be reworked a bit). The table has 256 entries
of struct list_head, giving it a total size of 2k.
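
Roughly, the idea is something like this (simplified sketch only; the
actual code is in the attached patch, and the names here are just
illustrative):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/nfs_fs.h>

/* Sketch, not the attached patch: a small global table indexed by
 * adding the nfs_inode pointer to the cred's uid. */
#define NFS_ACCESS_HASH_BITS	8
#define NFS_ACCESS_HASH_SIZE	(1 << NFS_ACCESS_HASH_BITS)	/* 256 buckets */

static struct list_head nfs_access_hashtbl[NFS_ACCESS_HASH_SIZE];
static DEFINE_SPINLOCK(nfs_access_hash_lock);

static inline unsigned int nfs_access_hash(struct nfs_inode *nfsi, uid_t uid)
{
	return ((unsigned long)nfsi + uid) & (NFS_ACCESS_HASH_SIZE - 1);
}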

The patch is based on Trond's GIT tree...

Comments?

steved.


Attachments:
linux-2.6.17.rc2-nfs-accesscache-hash.patch (6.08 kB)

2006-04-26 14:51:27

by Steve Dickson

[permalink] [raw]
Subject: Re: [PATCH][RFC] NFS: Improving the access cache



Neil Brown wrote:
> - There is no upper bound imposed on the size of the cache, and no way
> for memory pressure to shrink the cache except indirectly by
> discarding inodes.
> I cannot see this being exploitable as getting access to lots of
> credentials would be hard for any given user. However I feel that
> some cleaning might be in order.
I guess there is no upper bound checking because I didn't see any type
of boundary checking in the server hashing code I stole 8-) Or maybe
I just missed it... Is there an example, and what would trigger
this cleanup?

> - The nfs_zap_access_cache call isn't cheap. If that could be
> amortised somehow it would be good. Maybe use some version tagging
> so that when an inode is reused the access entry will no longer
> match in some way. Then just clean the table by periodic scans that
> discard based on timeout information ?
yes.. I did realize that nfs_zap_access_cache would be expensive...
and I think there might be an issue with holding the access_hash
spin lock while calling put_rpccred(), since it can also take
another spin lock... Maybe I could just spin through the table marking
entries for deletion and then let somebody else clean them up??
Is there already a cleanup process that I could use? I don't
recall one...

>
> It occurs to me that the large majority of inodes will only be
> accessed by a single user and so need not reside in the cache.
>
> So how about having a single "struct nfs_access_entry" pointer in the
> inode.
> When we do a lookup we look there first.
> When we want to add an entry, we try to add it there first.
> When we end up with two current access entries for the same inode,
> only then do we insert them into the hash.
To rephrase to make sure I understand....
1) P1(uid=1) creates an access pointer in the nfs_inode
2) P2(uid=2) sees the access pointer is not null so it adds them both
to the table, right?

> We would need to be able to tell from the inode whether anything is
> hashed or not. This could simply be if the nfs_access_entry pointer is
> non-null, and its hashlist is non-empty. Or we could just use a bit
> flag somewhere.
So I guess it would be something like:
if (nfs_inode->access == NULL)
    set nfs_inode->access
if (nfs_inode->access != NULL && nfs_inode->access_hash == empty)
    move both pointers into the hash table.
if (nfs_inode->access == NULL && nfs_inode->access_hash != empty)
    use the hash table.

But now the question is how would I know when there is only one
entry in the table? Or do we just let the hash table "drain"
naturally, and when it becomes empty we start with the nfs_inode->access
pointer again... Is this close to what you're thinking??

steved.


2006-04-26 15:03:17

by Steve Dickson

[permalink] [raw]
Subject: Re: [PATCH][RFC] NFS: Improving the access cache

Trond Myklebust wrote:
>
>
> Instead of having the field 'id', why don't you let the nfs_inode keep a
> small (hashed?) list of all the nfs_access_entry objects that refer to
> it? That would speed up searches for cached entries.
Actually I did look into having a pointer in the nfs_inode... but
what do you do when the second hashed list is needed? Meaning
P2(uid2) comes along and hashes to a different queue. I guess
I thought it was a bit messy to keep overwriting the pointer in the
nfs_inode, so I just kept everything in the hash table...

> I agree with Neil's assessment that we need a bound on the size of the
> cache. In fact, enforcing a bound is pretty much the raison d'être for a
> global table (by which I mean that if we don't need a bound, then we
> might as well cache everything in the nfs_inode).
Ok..

> How about rather changing that hash table into an LRU list, then adding
> a shrinker callback (using set_shrinker()) to allow the VM to free up
> entries when memory pressure dictates that it must?
Sounds interesting.. Just to be clear, by LRU list you mean use an hlist,
correct?

steved.

2006-04-26 15:44:10

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH][RFC] NFS: Improving the access cache

On Wed, 2006-04-26 at 10:15 -0400, Peter Staubach wrote:
> >
> >What situations? AFAIA the number of processes in a typical setup are
> >almost always far smaller than the number of cached inodes.
> >
> >
> >
>
> The situation that doesn't scale is one where there are many different
> users on the system. It is the situation where there are more than just
> a few users per file. This can happen on compute servers or systems
> used for timesharing sorts of purposes.

Yes, but the number of users <= number of processes which even on those
systems is almost always much, much less than the number of cached
inodes.

> >For instance on my laptop, I'm currently running 146 processes, but
> >according to /proc/slabinfo I'm caching 330000 XFS inodes + 141500 ext3
> >inodes.
> >If I were to assume that a typical nfsroot system will show roughly the
> >same behaviour, then it would mean that a typical bucket in Steve's 256
> >hash entry table will contain at least 2000 entries that I need to
> >search through every time I want to do an access call.
> >
>
> For such a system, there needs to be more than 256 hash buckets. The number
> of the access cache hash buckets needs to be on scale with the number of hash
> buckets used for similarly sized caches and tables.

The inode cache is the only similarly sized cache I can think of.

That is set either by the user, or it takes a default value of (total
memory size) / 2^14 buckets (see alloc_large_system_hash). On a 1Gb
system, that makes the default hash table size ~ 65536 entries. I can't
see people wanting to put up with a 256K static hash table for access
caching too.

Furthermore, note that the inode cache is only searched when
initialising a dentry. It is not searched on _every_ traversal of a path
element.

Cheers,
Trond


2006-04-26 17:01:58

by Peter Staubach

[permalink] [raw]
Subject: Re: [PATCH][RFC] NFS: Improving the access cache

Trond Myklebust wrote:

>On Wed, 2006-04-26 at 10:15 -0400, Peter Staubach wrote:
>
>
>>>What situations? AFAIA the number of processes in a typical setup are
>>>almost always far smaller than the number of cached inodes.
>>>
>>>
>>>
>>>
>>>
>>The situation that doesn't scale is one where there are many different
>>users on the system. It is the situation where there are more than just
>>a few users per file. This can happen on compute servers or systems
>>used for timesharing sorts of purposes.
>>
>>
>
>Yes, but the number of users <= number of processes which even on those
>systems is almost always much, much less than the number of cached
>inodes.
>
>
>

There isn't a 1-to-1 correspondence between processes and files. A single
process accesses many different files and many of the processes will be
accessing the same files. Shared libraries are easy examples of files
which are accessed by multiple processes and processes themselves access
multiple shared libraries.

>>>For instance on my laptop, I'm currently running 146 processes, but
>>>according to /proc/slabinfo I'm caching 330000 XFS inodes + 141500 ext3
>>>inodes.
>>>If I were to assume that a typical nfsroot system will show roughly the
>>>same behaviour, then it would mean that a typical bucket in Steve's 256
>>>hash entry table will contain at least 2000 entries that I need to
>>>search through every time I want to do an access call.
>>>
>>>
>>>
>>For such a system, there needs to be more than 256 hash buckets. The number
>>of the access cache hash buckets needs to be on scale with the number of hash
>>buckets used for similarly sized caches and tables.
>>
>>
>
>The inode cache is the only similarly sized cache I can think of.
>
>That is set either by the user, or it takes a default value of (total
>memory size) / 2^14 buckets (see alloc_large_system_hash). On a 1Gb
>system, that makes the default hash table size ~ 65536 entries. I can't
>see people wanting to put up with a 256K static hash table for access
>caching too.
>
>
>

I think that if the performance benefits warrant such a cache, then it is
worth it. It is a very small percentage of the real memory on the system.

Previous informal studies showed that caching access privileges like
this was good at short-circuiting 90%+ of access calls.

However, we could always divide this further when sizing the access
cache. If we assume that 1/2 or 1/4 or some percentage of the files
accessed will be on NFS mounted file systems, then the access cache just
needs to be based on the number of NFS inodes, not the total number of
inodes.

>Furthermore, note that the inode cache is only searched when
>initialising a dentry. It is not searched on _every_ traversal of a path
>element.
>

Very true, which points out the importance of getting the access to the
access cache correct and fast. The number of entries in the access cache
will be at least the number of NFS inodes in the system and could be much
higher depending upon whether the system is a single-user, desktop-style
system or a multi-user shared system. The key to making this cache cheap
is to make the hash algorithm cheap and to keep the hash chains short.

ps

2006-04-26 01:31:36

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH][RFC] NFS: Improving the access cache

On Tue, Apr 25, 2006 at 09:14:19PM -0400, Steve Dickson wrote:
> The attached patch uses a hash table to store the nfs_access_entry
> entries, so that an ACCESS request only happens when the
> attributes time out. The table is indexed by the sum of the
> nfs_inode pointer and the cr_uid in the cred structure, which should
> spread things out nicely for some decent scalability (although the
> locking scheme may need to be reworked a bit). The table has 256 entries
> of struct list_head, giving it a total size of 2k.

Seems to me like you could use an hlist instead, saving you 1k.
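
A struct hlist_head bucket is a single pointer where struct list_head is
two, so a 256-bucket table would drop from 2k to 1k. A minimal sketch,
with an illustrative name:

#include <linux/list.h>

/* one pointer per bucket instead of two: 256 * sizeof(void *) in total */
static struct hlist_head nfs_access_hashtbl[256];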


2006-04-26 04:55:21

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH][RFC] NFS: Improving the access cache

On Tuesday April 25, [email protected] wrote:
> Currently the NFS client caches ACCESS information on a per-uid basis
> which falls apart when different processes with different uids consistently
> access the same directory. The end result is a storm of needless
> ACCESS calls...
>
> The attached patch uses a hash table to store the nfs_access_entry
> entries, so that an ACCESS request only happens when the
> attributes time out. The table is indexed by the sum of the
> nfs_inode pointer and the cr_uid in the cred structure, which should
> spread things out nicely for some decent scalability (although the
> locking scheme may need to be reworked a bit). The table has 256 entries
> of struct list_head, giving it a total size of 2k.
>
> The patch is based on Trond's GIT tree...
>
> Comments?

- There is no upper bound imposed on the size of the cache, and no way
for memory pressure to shrink the cache except indirectly by
discarding inodes.
I cannot see this being exploitable as getting access to lots of
credentials would be hard for any given user. However I feel that
some cleaning might be in order.
- The nfs_zap_access_cache call isn't cheap. If that could be
amortised somehow it would be good. Maybe use some version tagging
so that when an inode is reused the access entry will no longer
match in some way. Then just clean the table by periodic scans that
discard based on timeout information ?

It occurs to me that the large majority of inodes will only be
accessed by a single user and so need not reside in the cache.

So how about having a single "struct nfs_access_entry" pointer in the
inode.
When we do a lookup we look there first.
When we want to add an entry, we try to add it there first.
When we end up with two current access entries for the same inode,
only then do we insert them into the hash.
We would need to be able to tell from the inode whether anything is
hashed or not. This could simply be if the nfs_access_entry pointer is
non-null, and its hashlist is non-empty. Or we could just use a bit
flag somewhere.
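
In rough pseudo-C, the lookup side of that might look something like the
sketch below (the access_cache slot, the access_hashed flag and the
nfs_access_hash_lookup() helper are all made-up names, purely for
illustration):

#include <linux/nfs_fs.h>

/* hypothetical helper that searches the global hash table */
static struct nfs_access_entry *nfs_access_hash_lookup(struct nfs_inode *nfsi,
							struct rpc_cred *cred);

/* One cached entry lives directly in the nfs_inode; the global table is
 * only consulted once a second credential has shown up for this inode. */
static struct nfs_access_entry *nfs_access_lookup(struct nfs_inode *nfsi,
						  struct rpc_cred *cred)
{
	struct nfs_access_entry *entry = nfsi->access_cache;	/* hypothetical field */

	if (entry != NULL && entry->cred == cred)
		return entry;				/* common single-user case */
	if (nfsi->access_hashed)			/* hypothetical flag */
		return nfs_access_hash_lookup(nfsi, cred);
	return NULL;
}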

I suspect this would keep the cache much smaller, and some of my other
concerns might then become irrelevant.

NeilBrown

2006-04-26 13:03:21

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH][RFC] NFS: Improving the access cache

On Tue, 2006-04-25 at 21:14 -0400, Steve Dickson wrote:
> Currently the NFS client caches ACCESS information on a per-uid basis
> which falls apart when different processes with different uids consistently
> access the same directory. The end result is a storm of needless
> ACCESS calls...
>
> The attached patch uses a hash table to store the nfs_access_entry
> entries, so that an ACCESS request only happens when the
> attributes time out. The table is indexed by the sum of the
> nfs_inode pointer and the cr_uid in the cred structure, which should
> spread things out nicely for some decent scalability (although the
> locking scheme may need to be reworked a bit). The table has 256 entries
> of struct list_head, giving it a total size of 2k.

Instead of having the field 'id', why don't you let the nfs_inode keep a
small (hashed?) list of all the nfs_access_entry objects that refer to
it? That would speed up searches for cached entries.

I agree with Neil's assessment that we need a bound on the size of the
cache. In fact, enforcing a bound is pretty much the raison d'être for a
global table (by which I mean that if we don't need a bound, then we
might as well cache everything in the nfs_inode).
How about rather changing that hash table into an LRU list, then adding
a shrinker callback (using set_shrinker()) to allow the VM to free up
entries when memory pressure dictates that it must?
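
Roughly (going from memory of the 2.6.17-era shrinker interface; the LRU
bookkeeping below uses made-up names):

#include <linux/mm.h>

static int nfs_access_lru_count;		/* hypothetical: entries on the LRU */
static void nfs_access_prune_lru(int nr);	/* hypothetical helper */

/* Shrinker callback: with nr_to_scan == 0 the VM is only asking how much
 * is reclaimable; otherwise drop the oldest entries from the LRU. */
static int nfs_access_cache_shrink(int nr_to_scan, gfp_t gfp_mask)
{
	if (nr_to_scan != 0)
		nfs_access_prune_lru(nr_to_scan);
	return nfs_access_lru_count;
}

/* registered once at init time, e.g.:
 *	nfs_access_shrinker = set_shrinker(DEFAULT_SEEKS, nfs_access_cache_shrink);
 */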

Cheers,
Trond

2006-04-26 13:14:56

by Peter Staubach

[permalink] [raw]
Subject: Re: [PATCH][RFC] NFS: Improving the access cache

Trond Myklebust wrote:

>On Tue, 2006-04-25 at 21:14 -0400, Steve Dickson wrote:
>
>>Currently the NFS client caches ACCESS information on a per-uid basis
>>which falls apart when different processes with different uids consistently
>>access the same directory. The end result is a storm of needless
>>ACCESS calls...
>>
>>The attached patch uses a hash table to store the nfs_access_entry
>>entries, so that an ACCESS request only happens when the
>>attributes time out. The table is indexed by the sum of the
>>nfs_inode pointer and the cr_uid in the cred structure, which should
>>spread things out nicely for some decent scalability (although the
>>locking scheme may need to be reworked a bit). The table has 256 entries
>>of struct list_head, giving it a total size of 2k.
>>
>
>Instead of having the field 'id', why don't you let the nfs_inode keep a
>small (hashed?) list of all the nfs_access_entry objects that refer to
>it? That would speed up searches for cached entries.
>
>I agree with Neil's assessment that we need a bound on the size of the
>cache. In fact, enforcing a bound is pretty much the raison d'être for a
>global table (by which I mean that if we don't need a bound, then we
>might as well cache everything in the nfs_inode).
>How about rather changing that hash table into an LRU list, then adding
>a shrinker callback (using set_shrinker()) to allow the VM to free up
>entries when memory pressure dictates that it must?
>

Previous implementations have shown that a single per inode linear
linked list ends up not being scalable enough in certain situations.
There would end up being too many entries in the list and searching
the list would become a bottleneck. Adding a set of hash buckets per
inode also proved to be inefficient because in order to have enough
hash buckets to make the hashing efficient, much space was wasted.
Having a single set of hash buckets, adequately sized, ended up being
the best solution.

Why have a limit? A better solution is to have the cache grow dynamically
as need requires and then have the system reclaim the memory when it starts
to run low on memory.

ps

2006-04-26 13:17:45

by Chuck Lever

[permalink] [raw]
Subject: Re: [NFS] [PATCH][RFC] NFS: Improving the access cache

Steve Dickson wrote:
> Currently the NFS client caches ACCESS information on a per-uid basis
> which falls apart when different processes with different uids consistently
> access the same directory. The end result is a storm of needless
> ACCESS calls...
>
> The attached patch uses a hash table to store the nfs_access_entry
> entries, so that an ACCESS request only happens when the
> attributes time out. The table is indexed by the sum of the
> nfs_inode pointer and the cr_uid in the cred structure, which should
> spread things out nicely for some decent scalability (although the
> locking scheme may need to be reworked a bit). The table has 256 entries
> of struct list_head, giving it a total size of 2k.
>
> The patch is based on Trond's GIT tree...
>
> Comments?
>
> steved.

Hi Steve-

Thanks for digging into this.

I can't tell, but have you addressed the problem of racing processes
generating multiple similar ACCESS requests? Seems like some kind of
serialization through your caching mechanism would resolve that nicely.

--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>

2006-04-26 14:01:52

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH][RFC] NFS: Improving the access cache

On Wed, 2006-04-26 at 09:14 -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
>
> >On Tue, 2006-04-25 at 21:14 -0400, Steve Dickson wrote:
> >
> >>Currently the NFS client caches ACCESS information on a per-uid basis
> >>which falls apart when different processes with different uids consistently
> >>access the same directory. The end result is a storm of needless
> >>ACCESS calls...
> >>
> >>The attached patch uses a hash table to store the nfs_access_entry
> >>entries, so that an ACCESS request only happens when the
> >>attributes time out. The table is indexed by the sum of the
> >>nfs_inode pointer and the cr_uid in the cred structure, which should
> >>spread things out nicely for some decent scalability (although the
> >>locking scheme may need to be reworked a bit). The table has 256 entries
> >>of struct list_head, giving it a total size of 2k.
> >>
> >
> >Instead of having the field 'id', why don't you let the nfs_inode keep a
> >small (hashed?) list of all the nfs_access_entry objects that refer to
> >it? That would speed up searches for cached entries.
> >
> >I agree with Neil's assessment that we need a bound on the size of the
> >cache. In fact, enforcing a bound is pretty much the raison d'être for a
> >global table (by which I mean that if we don't need a bound, then we
> >might as well cache everything in the nfs_inode).
> >How about rather changing that hash table into an LRU list, then adding
> >a shrinker callback (using set_shrinker()) to allow the VM to free up
> >entries when memory pressure dictates that it must?
> >
>
> Previous implementations have shown that a single per inode linear
> linked list ends up not being scalable enough in certain situations.
> There would end up being too many entries in the list and searching
> the list would become a bottleneck. Adding a set of hash buckets per
> inode also proved to be inefficient because in order to have enough
> hash buckets to make the hashing efficient, much space was wasted.
> Having a single set of hash buckets, adequately sized, ended up being
> the best solution.

What situations? AFAIA the number of processes in a typical setup are
almost always far smaller than the number of cached inodes.

For instance on my laptop, I'm currently running 146 processes, but
according to /proc/slabinfo I'm caching 330000 XFS inodes + 141500 ext3
inodes.
If I were to assume that a typical nfsroot system will show roughly the
same behaviour, then it would mean that a typical bucket in Steve's 256
hash entry table will contain at least 2000 entries that I need to
search through every time I want to do an access call.

> Why have a limit? A better solution is to have the cache grow dynamically
> as need requires and then have the system reclaim the memory when it starts
> to run low on memory.

See my previous mail: that is exactly what I suggested in the 3rd
paragraph.

Cheers,
Trond

2006-04-26 14:15:50

by Peter Staubach

[permalink] [raw]
Subject: Re: [PATCH][RFC] NFS: Improving the access cache

Trond Myklebust wrote:

>On Wed, 2006-04-26 at 09:14 -0400, Peter Staubach wrote:
>
>>Trond Myklebust wrote:
>>
>>>On Tue, 2006-04-25 at 21:14 -0400, Steve Dickson wrote:
>>>
>>>>Currently the NFS client caches ACCESS information on a per-uid basis
>>>>which falls apart when different processes with different uids consistently
>>>>access the same directory. The end result is a storm of needless
>>>>ACCESS calls...
>>>>
>>>>The attached patch uses a hash table to store the nfs_access_entry
>>>>entries, so that an ACCESS request only happens when the
>>>>attributes time out. The table is indexed by the sum of the
>>>>nfs_inode pointer and the cr_uid in the cred structure, which should
>>>>spread things out nicely for some decent scalability (although the
>>>>locking scheme may need to be reworked a bit). The table has 256 entries
>>>>of struct list_head, giving it a total size of 2k.
>>>>
>>>Instead of having the field 'id', why don't you let the nfs_inode keep a
>>>small (hashed?) list of all the nfs_access_entry objects that refer to
>>>it? That would speed up searches for cached entries.
>>>
>>>I agree with Neil's assessment that we need a bound on the size of the
>>>cache. In fact, enforcing a bound is pretty much the raison d'être for a
>>>global table (by which I mean that if we don't need a bound, then we
>>>might as well cache everything in the nfs_inode).
>>>How about rather changing that hash table into an LRU list, then adding
>>>a shrinker callback (using set_shrinker()) to allow the VM to free up
>>>entries when memory pressure dictates that it must?
>>>
>>Previous implementations have shown that a single per inode linear
>>linked list ends up not being scalable enough in certain situations.
>>There would end up being too many entries in the list and searching
>>the list would become a bottleneck. Adding a set of hash buckets per
>>inode also proved to be inefficient because in order to have enough
>>hash buckets to make the hashing efficient, much space was wasted.
>>Having a single set of hash buckets, adequately sized, ended up being
>>the best solution.
>>
>
>What situations? AFAIA the number of processes in a typical setup are
>almost always far smaller than the number of cached inodes.
>

The situation that doesn't scale is one where there are many different
users on the system. It is the situation where there are more than just
a few users per file. This can happen on compute servers or systems
used for timesharing sorts of purposes.

>For instance on my laptop, I'm currently running 146 processes, but
>according to /proc/slabinfo I'm caching 330000 XFS inodes + 141500 ext3
>inodes.
>If I were to assume that a typical nfsroot system will show roughly the
>same behaviour, then it would mean that a typical bucket in Steve's 256
>hash entry table will contain at least 2000 entries that I need to
>search through every time I want to do an access call.
>

For such a system, there needs to be more than 256 hash buckets. The number
of the access cache hash buckets needs to be on scale with the number of hash
buckets used for similarly sized caches and tables.

ps

2006-04-26 14:19:11

by Steve Dickson

[permalink] [raw]
Subject: Re: [NFS] [PATCH][RFC] NFS: Improving the access cache

Chuck Lever wrote:
>
> Hi Steve-
>
> Thanks for digging into this.
>
> I can't tell, but have you addressed the problem of racing processes
> generating multiple similar ACCESS requests? Seems like some kind of
> serialization through your caching mechanism would resolve that nicely.
No.. I was just trying to duplicate the logic that was there...
but that is a good point... It turns out that I messed up the
locking anyways... the NFS_INO_INVALID_ACCESS bit is no longer
being protected by the inode->i_lock, which is not good... so I
need to rework that part...
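
Roughly, the rework is just to keep that flag manipulation under i_lock
again, e.g. something like this (the helper name is only for illustration):

#include <linux/nfs_fs.h>

/* sketch: cache_validity updates stay under inode->i_lock */
static void nfs_access_clear_invalid(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_ACCESS;
	spin_unlock(&inode->i_lock);
}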


steved.