We recently noticed that, with 5.4+ kernels, the generic/531 test takes
a very long time to finish for v4, especially when run on larger systems.
Case in point: a 72 vCPU, 144 GB EC2 instance as a client makes the test
take about 20 hours.
So I had a look to see what was going on. First of all, the test generates
a lot of files: it starts 2 * NCPU processes, each of which creates 50000
files. That's 144 processes in this case, 50000 files each. Each process
does this by setting its open file ulimit to 50000 and then opening files,
keeping them open, until it hits the limit.
So that's 7 million new/open files - that's a lot, but the problem can
be triggered with far fewer than that as well.
Looking at what the server was doing, I noticed a lot of lock contention
for nfsd_file_lru. Then I noticed that nfsd_filecache_count kept
going up, reflecting the number of open files by the client processes,
eventually reaching, for example, that 7 million number.
So here's what happens: for NFSv4, files that are associated with an
open stateid can stick around for a long time, as long as there's no
CLOSE done on them. That's what's happening here. Also, since those files
have a refcount of >= 2 (one for the hash table, one for being pointed to
by the state), they are never eligible for removal from the file cache.
Worse, since the code calls nfsd_file_gc() inline if the upper bound (8192)
is crossed, every single operation that calls nfsd_file_acquire() will end
up walking the entire LRU, trying to free files, and failing every time.
Walking a list with millions of files every single time isn't great.
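To make the failure mode concrete, here's a simplified sketch of what an
LRU walk callback ends up doing in this situation. This is not the actual
nfsd_file_lru_cb() from filecache.c; the name sketch_lru_cb and the
dispose-list argument are just illustrative:

#include <linux/list_lru.h>
#include <linux/refcount.h>
#include "filecache.h"          /* struct nfsd_file */

/*
 * Simplified sketch of a filecache LRU walk callback.  Files pinned by
 * v4 state keep nf_ref >= 2 (one reference from the hash table, one
 * from the stateid), so with millions of open v4 files every pass over
 * the LRU skips every single entry and frees nothing.
 */
static enum lru_status
sketch_lru_cb(struct list_head *item, struct list_lru_one *lru,
              spinlock_t *lock, void *arg)
{
        struct nfsd_file *nf = container_of(item, struct nfsd_file, nf_lru);
        struct list_head *dispose = arg;

        /* Still referenced beyond the hash table: can't be freed. */
        if (refcount_read(&nf->nf_ref) > 1)
                return LRU_SKIP;        /* ...which here is every entry */

        /* Idle and unreferenced: move to the caller's dispose list. */
        list_lru_isolate_move(lru, item, dispose);
        return LRU_REMOVED;
}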
There are some ways to fix this behavior, like:
* Always allow v4 cached file structures to be purged from the cache
(see the sketch after this list). They will stick around, since they
still have a reference, but at least they won't slow down cache
handling to a crawl.
* Don't add v4 files to the cache to begin with.
* Since the only advantage of the file cache for v4 is the caching
of files linked to special stateids (as far as I can tell), only
cache files associated with special stateids.
* Don't bother with v4 files at all, and revert the changes that
made v4 use the file cache.
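A rough sketch of that first option, assuming helpers that behave like
filecache.c's nfsd_file_unhash() and nfsd_file_put() do today (the
function name sketch_purge_v4_entry is made up):

/*
 * Sketch: drop a v4-pinned entry from the hash table and the LRU even
 * though v4 state still holds a reference.  The nfsd_file itself only
 * goes away once the stateid's reference is finally put, but in the
 * meantime it no longer clogs up cache handling.
 */
static void sketch_purge_v4_entry(struct nfsd_file *nf)
{
        nfsd_file_unhash(nf);                   /* out of nfsd_file_hashtbl */
        list_lru_del(&nfsd_file_lru, &nf->nf_lru);
        nfsd_file_put(nf);                      /* drop the cache's reference */
}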
In general, the resource control for files OPENed by the client is
probably an issue. Even if you fix the cache, what if there are
N clients that open millions of files and keep them open? Maybe
there should be a fallback to start using temporary open files
if a client goes beyond a reasonable limit and threatens to eat
all resources.
Thoughts?
- Frank
On Mon, Jun 08, 2020 at 07:21:22PM +0000, Frank van der Linden wrote:
> We recently noticed that, with 5.4+ kernels, the generic/531 test takes
> a very long time to finish for v4, especially when run on larger systems.
>
> Case in point: a 72 vCPU, 144 GB EC2 instance as a client makes the test
> take about 20 hours.
>
> So I had a look to see what was going on. First of all, the test generates
> a lot of files: it starts 2 * NCPU processes, each of which creates 50000
> files. That's 144 processes in this case, 50000 files each. Each process
> does this by setting its open file ulimit to 50000 and then opening files,
> keeping them open, until it hits the limit.
>
> So that's 7 million new/open files - that's a lot, but the problem can
> be triggered with far fewer than that as well.
>
> Looking at what the server was doing, I noticed a lot of lock contention
> for nfsd_file_lru. Then I noticed that nfsd_filecache_count kept
> going up, reflecting the number of open files by the client processes,
> eventually reaching, for example, that 7 million number.
>
> So here's what happens: for NFSv4, files that are associated with an
> open stateid can stick around for a long time, as long as there's no
> CLOSE done on them. That's what's happening here. Also, since those files
> have a refcount of >= 2 (one for the hash table, one for being pointed to
> by the state), they are never eligible for removal from the file cache.
> Worse, since the code calls nfsd_file_gc() inline if the upper bound (8192)
> is crossed, every single operation that calls nfsd_file_acquire() will end
> up walking the entire LRU, trying to free files, and failing every time.
> Walking a list with millions of files every single time isn't great.
Thanks for tracking this down.
>
> There are some ways to fix this behavior, like:
>
> * Always allow v4 cached file structures to be purged from the cache.
> They will stick around, since they still have a reference, but
> at least they won't slow down cache handling to a crawl.
If they have to stick around anyway it seems too bad not to be able to
use them.
I mean, just because a file's opened first by a v4 user doesn't mean it
might not also have other users, right?
Would it be that hard to make nfsd_file_gc() a little smarter?
I don't know, maybe it's not worth it.
--b.
> * Don't add v4 files to the cache to begin with.
>
> * Since the only advantage of the file cache for v4 is the caching
> of files linked to special stateids (as far as I can tell), only
> cache files associated with special stateids.
>
> * Don't bother with v4 files at all, and revert the changes that
> made v4 use the file cache.
>
> In general, the resource control for files OPENed by the client is
> probably an issue. Even if you fix the cache, what if there are
> N clients that open millions of files and keep them open? Maybe
> there should be a fallback to start using temporary open files
> if a client goes beyond a reasonable limit and threatens to eat
> all resources.
>
> Thoughts?
>
> - Frank
On Thu, Jun 25, 2020 at 01:10:21PM -0400, Bruce Fields wrote:
>
> On Mon, Jun 08, 2020 at 07:21:22PM +0000, Frank van der Linden wrote:
> > So here's what happens: for NFSv4, files that are associated with an
> > open stateid can stick around for a long time, as long as there's no
> > CLOSE done on them. That's what's happening here. Also, since those files
> > have a refcount of >= 2 (one for the hash table, one for being pointed to
> > by the state), they are never eligible for removal from the file cache.
> > Worse, since the code calls nfsd_file_gc() inline if the upper bound (8192)
> > is crossed, every single operation that calls nfsd_file_acquire() will end
> > up walking the entire LRU, trying to free files, and failing every time.
> > Walking a list with millions of files every single time isn't great.
>
> Thanks for tracking this down.
>
> >
> > There are some ways to fix this behavior, like:
> >
> > * Always allow v4 cached file structures to be purged from the cache.
> > They will stick around, since they still have a reference, but
> > at least they won't slow down cache handling to a crawl.
>
> If they have to stick around anyway it seems too bad not to be able to
> use them.
>
> I mean, just because a file's opened first by a v4 user doesn't mean it
> might not also have other users, right?
>
> Would it be that hard to make nfsd_file_gc() a little smarter?
>
> I don't know, maybe it's not worth it.
>
> --b.
Basically, opening, and keeping open, a very large number of v4 files on
a client blows up these data structures:
* nfs4state.c:file_hashtbl (FH -> nfs4_file)
...and with the addition of the filecache:
* filecache.c:nfsd_file_hashtbl (ino -> nfsd_file)
* filecache.c:nfsd_file_lru
nfsd_file_lru causes the most pain, as described above. But the others
aren't without pain either. I tried an experiment where v4 files don't
get added to the filecache, and file_hashtbl started showing up in perf
output in a serious way. Not surprising, really, if you hash millions
of items in a hash table with 256 buckets.
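For reference, this is roughly how small those tables are. The constants
below are from my reading of fs/nfsd/nfs4state.c (filecache.c sizes its
table similarly), so take the exact definitions with a grain of salt:

#define FILE_HASH_BITS  8
#define FILE_HASH_SIZE  (1 << FILE_HASH_BITS)  /* 256 buckets */

/*
 * Back of the envelope for the generic/531 run above: 7,000,000 hashed
 * entries / 256 buckets ~= 27,000 entries per chain, so every lookup
 * degenerates into a linear walk of a ~27,000-node hlist.
 */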
I guess there is an argument to be made that it's such an extreme use case
that it's not worth it.
On the other hand, a client running the server out of resources, and
slowing everything down for all clients in the process, isn't great either.
Generally, the only way to enforce an upper bound on resource usage without
returning permanent errors (to which the client might react badly) seems
to be to start invalidating v4 state under pressure. Clients should be
prepared for this, as they should be able to recover from a server reboot.
On the other hand, it's something you probably should only do as a last
resort. I'm not sure if consistent behavior for e.g. locks could be
guaranteed; I'm not very familiar with the locking code.
Some ideas to alleviate the pain short of doing the above:
* Count v4 references to nfsd_file (filecache) structures. If there
is a v4 reference, don't have the file on the LRU, as it's pointless.
Do include it in the hash table so that v2/v3 users can find it. This
avoids the worst offender (nfsd_file_lru), but does still blow up
nfsd_file_hashtbl.
* Use rhashtable for the hashtables, as it can automatically grow/shrink
the number of buckets (sketched below). I don't know if the rhashtable
code could handle the load, but it might be worth a shot.
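Here's a hedged sketch of the rhashtable idea. The nf_rhash member is
hypothetical (struct nfsd_file would have to grow a struct rhash_head);
nf_inode exists today:

#include <linux/rhashtable.h>
#include "filecache.h"          /* struct nfsd_file */

/*
 * Sketch: key the filecache on the inode pointer and let rhashtable
 * grow/shrink the bucket array by itself, so millions of entries no
 * longer pile up in 256 fixed chains.
 */
static const struct rhashtable_params nfsd_file_rht_params = {
        .key_len                = sizeof(struct inode *),
        .key_offset             = offsetof(struct nfsd_file, nf_inode),
        .head_offset            = offsetof(struct nfsd_file, nf_rhash), /* hypothetical */
        .automatic_shrinking    = true,
};

rhashtable_init(), rhashtable_lookup_fast() and rhashtable_insert_fast()
would then replace the open-coded hlist walks.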
- Frank
On Thu, Jun 25, 2020 at 07:12:05PM +0000, Frank van der Linden wrote:
> Some ideas to alleviate the pain short of doing the above:
>
> * Count v4 references to nfsd_file (filecache) structures. If there
> is a v4 reference, don't have the file on the LRU, as it's pointless.
> Do include it in the hash table so that v2/v3 users can find it. This
> avoids the worst offender (nfsd_file_lru), but does still blow up
> nfsd_file_hashtbl.
>
> * Use rhashtable for the hashtables, as it can automatically grow/shrink
> the number of buckets. I don't know if the rhashtable code could handle
> the load, but it might be worth a shot.
In fact, don't even put an nfsd_file on the LRU until the first time its
reference count drops to 1, i.e. when it's no longer being used, because
it will never be removed while the refs are > 1 anyway.
Let me come up with a patch for that, shouldn't be too hard.
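Something along these lines, purely as a sketch of the idea; the helper
names are placeholders, and the dec-then-read sequence glosses over
locking that a real patch would need:

/*
 * Sketch: only place an nfsd_file on the LRU once the last user
 * reference is dropped and just the hash table's reference remains.
 * Until then LRU membership is pointless, since the GC walk can never
 * free a file with refs > 1.
 */
void sketch_nfsd_file_put(struct nfsd_file *nf)
{
        if (refcount_dec_and_test(&nf->nf_ref)) {
                sketch_nfsd_file_free(nf);      /* placeholder free path */
                return;
        }
        /*
         * Only the hash table's reference is left: the file is idle,
         * and LRU eviction can finally do something with it.  (Racy as
         * written; a real patch would serialize against acquire.)
         */
        if (refcount_read(&nf->nf_ref) == 1)
                list_lru_add(&nfsd_file_lru, &nf->nf_lru);
}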
- Frank
On Thu, Jun 25, 2020 at 07:12:05PM +0000, Frank van der Linden wrote:
> On Thu, Jun 25, 2020 at 01:10:21PM -0400, Bruce Fields wrote:
> >
> > On Mon, Jun 08, 2020 at 07:21:22PM +0000, Frank van der Linden wrote:
> > > So here's what happens: for NFSv4, files that are associated with an
> > > open stateid can stick around for a long time, as long as there's no
> > > CLOSE done on them. That's what's happening here. Also, since those files
> > > have a refcount of >= 2 (one for the hash table, one for being pointed to
> > > by the state), they are never eligible for removal from the file cache.
> > > Worse, since the code calls nfsd_file_gc() inline if the upper bound (8192)
> > > is crossed, every single operation that calls nfsd_file_acquire() will end
> > > up walking the entire LRU, trying to free files, and failing every time.
> > > Walking a list with millions of files every single time isn't great.
> >
> > Thanks for tracking this down.
> >
> > >
> > > There are some ways to fix this behavior, like:
> > >
> > > * Always allow v4 cached file structures to be purged from the cache.
> > > They will stick around, since they still have a reference, but
> > > at least they won't slow down cache handling to a crawl.
> >
> > If they have to stick around anyway it seems too bad not to be able to
> > use them.
> >
> > I mean, just because a file's opened first by a v4 user doesn't mean it
> > might not also have other users, right?
> >
> > Would it be that hard to make nfsd_file_gc() a little smarter?
> >
> > I don't know, maybe it's not worth it.
> >
> > --b.
>
> Basically, opening, and keeping open, a very large number of v4 files on
> a client blows up these data structures:
>
> * nfs4state.c:file_hashtbl (FH -> nfs4_file)
>
> ...and with the addition of the filecache:
>
> * filecache.c:nfsd_file_hashtbl (ino -> nfsd_file)
> * filecache.c:nfsd_file_lru
>
> nfsd_file_lru causes the most pain, as described above. But the others
> aren't without pain either. I tried an experiment where v4 files don't
> get added to the filecache, and file_hashtbl started showing up in perf
> output in a serious way. Not surprising, really, if you hash millions
> of items in a hash table with 256 buckets.
>
> I guess there is an argument to be made that it's such an extreme use case
> that it's not worth it.
>
> On the other hand, a client running the server out of resources, and
> slowing everything down for all clients in the process, isn't great either.
>
> Generally, the only way to enforce an upper bound on resource usage without
> returning permanent errors (to which the client might react badly) seems
> to be to start invalidating v4 state under pressure. Clients should be
> prepared for this, as they should be able to recover from a server reboot.
> On the other hand, it's something you probably should only do as a last
> resort. I'm not sure if consistent behavior for e.g. locks could be
> guaranteed; I'm not very familiar with the locking code.
I don't think that would work, for a bunch of reasons.
Off hand I don't think I've actually seen reports in the wild of hitting
resource limits due to number of opens. Though I admit it bothers me
that we're not prepared for it.
--b.
> Some ideas to alleviate the pain short of doing the above:
>
> * Count v4 references to nfsd_file (filecache) structures. If there
> is a v4 reference, don't have the file on the LRU, as it's pointless.
> Do include it in the hash table so that v2/v3 users can find it. This
> avoids the worst offender (nfsd_file_lru), but does still blow up
> nfsd_file_hashtbl.
>
> * Use rhashtable for the hashtables, as it can automatically grow/shrink
> the number of buckets. I don't know if the rhashtable code could handle
> the load, but it might be worth a shot.
>
> - Frank