2022-05-27 22:47:09

by Chuck Lever

Subject: Re: filecache LRU performance regression



> On May 27, 2022, at 4:37 PM, Frank van der Linden <[email protected]> wrote:
>
> On Fri, May 27, 2022 at 06:59:47PM +0000, Chuck Lever III wrote:
>>
>>
>> Hi Frank-
>>
>> Bruce recently reminded me about this issue. Is there a bugzilla somewhere?
>> Do you have a reproducer I can try?
>
> Hi Chuck,
>
> The easiest way to reproduce the issue is to run generic/531 over an
> NFSv4 mount, using a system with a larger number of CPUs on the client
> side (or just scaling the test up manually - it has a calculation based
> on the number of CPUs).
>
> The test will take a long time to finish. I initially described the
> details here:
>
> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>
> Since then, it was also reported here:
>
> https://lore.kernel.org/all/[email protected]/T/#m8c3e4173696e17a9d5903d2a619550f352314d20

Thanks for the summary. So, there isn't a bugzilla tracking this
issue? If not, please create one here:

https://bugzilla.linux-nfs.org/

Then we don't have to keep asking for a repeat summary ;-)


> I posted an experimental patch, but it's actually not quite correct
> (although I think the idea behind it makes sense):
>
> https://lore.kernel.org/linux-nfs/[email protected]/T/#m869aa427f125afee2af9a89d569c6b98e12e516f

A handful of random comments:

- nfsd_file_put() is now significantly different than it used
to be, so that part of the patch will have to be updated in
order to apply to v5.18+

- IMO maybe that hash table should be replaced with something
more memory and CPU-cache efficient, like a Maple tree. That
patch can certainly be developed independently of the other
LRU/GC related fixes you proposed. (Added Matthew and Liam
for their take).

How can we take this forward? Would you like to continue
developing these patches, or should I adopt the project?
Maybe Matthew or Liam see a straightforward way to convert
the nfsd_file hash table as a separate project?
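
To make that suggestion a bit more concrete, here is a rough,
untested sketch of what a maple-tree-backed lookup could look like.
It assumes the mtree_load()/mtree_store()/mtree_erase() API and, for
simplicity, a single nfsd_file per inode; the real cache is more
complicated than that, so treat every name below as a placeholder:

/*
 * Sketch only: key the cache on the inode pointer with a maple tree
 * instead of a hash table.
 */
#include <linux/fs.h>
#include <linux/maple_tree.h>

struct nfsd_file;

static DEFINE_MTREE(nfsd_file_mt);

static struct nfsd_file *nfsd_file_lookup(struct inode *inode)
{
	/* One tree descent instead of a hash-bucket list walk. */
	return mtree_load(&nfsd_file_mt, (unsigned long)inode);
}

static int nfsd_file_insert(struct inode *inode, struct nfsd_file *nf)
{
	return mtree_store(&nfsd_file_mt, (unsigned long)inode, nf,
			   GFP_KERNEL);
}

static void nfsd_file_remove(struct inode *inode)
{
	mtree_erase(&nfsd_file_mt, (unsigned long)inode);
}

The point is only that the lookup becomes a short tree descent with
good cache locality rather than a pointer chase along a hash chain.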


> The basic problem from the initial email I sent:
>
>> So here's what happens: for NFSv4, files that are associated with an
>> open stateid can stick around for a long time, as long as there's no
>> CLOSE done on them. That's what's happening here. Also, since those files
>> have a refcount of >= 2 (one for the hash table, one for being pointed to
>> by the state), they are never eligible for removal from the file cache.
>> Worse, since the code calls nfsd_file_gc inline if the upper bound is crossed
>> (8192), every single operation that calls nfsd_file_acquire will end up
>> walking the entire LRU, trying to free files, and failing every time.
>> Walking a list with millions of files every single time isn't great.

Walking a linked list is just about the worst thing to do
to the cache on a modern CPU. I vote for replacing that
data structure.


> I guess the basic issues here are:
>
> 1) Should these NFSv4 files be in the filecache at all? They are readily
> available from the state, no need for additional caching.

I agree that LRU and garbage collection are not needed for
managing NFSv4 files. NFSv4 OPEN/CLOSE and lease destruction
are already effective at that.

Re-reading the old email thread, I'm not exactly sure what
"turning off the filecache for NFSv4" means exactly. Fleshing
that idea out could be helpful. Would it make sense to simply
disable the GC walk when releasing NFSv4 files?
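
To make the question concrete, here is the kind of thing I have in
mind. It is only a sketch, not a tested patch, and both the flag and
the release helper are invented names:

/*
 * Sketch only: NFSD_FILE_V4STATE and __nfsd_file_release() do not
 * exist; they stand in for "this nfsd_file is pinned by NFSv4 state"
 * and "drop one reference".
 */
static void nfsd_file_put_sketch(struct nfsd_file *nf)
{
	bool v4 = test_bit(NFSD_FILE_V4STATE, &nf->nf_flags);

	__nfsd_file_release(nf);	/* drop this caller's reference */

	/*
	 * Only legacy (NFSv2/v3) files age out via the LRU.  An NFSv4
	 * file is closed by CLOSE or DELEGRETURN, so don't pay for a
	 * garbage-collection walk on its behalf.
	 */
	if (!v4)
		nfsd_file_gc();
}

The NFSv4 state code would set the flag when it takes its long-term
reference on the nfsd_file.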

Interestingly, the dcache has the same problem with negative
dentries, which can grow without bound, creating a significant
performance impact on systems with high uptime. There were a
few other similar cases discussed at the recent LSF/MM.


> 2) In general: can state ever be gracefully retired if the client still
> has an OPEN?

I think that's what NFS4ERR_ADMIN_REVOKED is for? It's not
exactly graceful, but the server can warn the client that
state is gone.

Also, I expect a server might recall a delegation if it needs
to reclaim resources. We've talked in the past about a server
that recalls in order to cap the number of delegations that
need to be returned by the client when the lease is destroyed.


--
Chuck Lever



2022-05-29 02:32:35

by Wang Yugui

Subject: Re: filecache LRU performance regression

Hi,

>
>
> > On May 27, 2022, at 4:37 PM, Frank van der Linden <[email protected]> wrote:
> >
> > On Fri, May 27, 2022 at 06:59:47PM +0000, Chuck Lever III wrote:
> >>
> >>
> >> Hi Frank-
> >>
> >> Bruce recently reminded me about this issue. Is there a bugzilla somewhere?
> >> Do you have a reproducer I can try?
> >
> > Hi Chuck,
> >
> > The easiest way to reproduce the issue is to run generic/531 over an
> > NFSv4 mount, using a system with a larger number of CPUs on the client
> > side (or just scaling the test up manually - it has a calculation based
> > on the number of CPUs).
> >
> > The test will take a long time to finish. I initially described the
> > details here:
> >
> > https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> >
> > Since then, it was also reported here:
> >
> > https://lore.kernel.org/all/[email protected]/T/#m8c3e4173696e17a9d5903d2a619550f352314d20
>
> Thanks for the summary. So, there isn't a bugzilla tracking this
> issue? If not, please create one here:
>
> https://bugzilla.linux-nfs.org/
>
> Then we don't have to keep asking for a repeat summary ;-)
>
>
> > I posted an experimental patch, but it's actually not quite correct
> > (although I think the idea behind it makes sense):
> >
> > https://lore.kernel.org/linux-nfs/[email protected]/T/#m869aa427f125afee2af9a89d569c6b98e12e516f
>
> A handful of random comments:
>
> - nfsd_file_put() is now significantly different than it used
> to be, so that part of the patch will have to be updated in
> order to apply to v5.18+

When there are many open files (more than NFSD_FILE_LRU_THRESHOLD),
nfsd_file_gc() wastes a lot of CPU time.

Can we serialize nfsd_file_gc() for v5.18+ as a first step?

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 3d944ca..6abefd9 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -63,6 +63,8 @@ static struct delayed_work nfsd_filecache_laundrette;

static void nfsd_file_gc(void);

+static atomic_t nfsd_file_gc_queue_delayed = ATOMIC_INIT(0);
+
static void
nfsd_file_schedule_laundrette(void)
{
@@ -71,8 +73,10 @@ nfsd_file_schedule_laundrette(void)
if (count == 0 || test_bit(NFSD_FILE_SHUTDOWN, &nfsd_file_lru_flags))
return;

- queue_delayed_work(system_wq, &nfsd_filecache_laundrette,
+ if(atomic_cmpxchg(&nfsd_file_gc_queue_delayed, 0, 1)==0){
+ queue_delayed_work(system_wq, &nfsd_filecache_laundrette,
NFSD_LAUNDRETTE_DELAY);
+ }
}

static void
@@ -468,15 +472,22 @@ nfsd_file_lru_walk_list(struct shrink_control *sc)
return ret;
}

+// nfsd_file_gc() support concurrency, but serialize it
+atomic_t nfsd_file_gc_running = ATOMIC_INIT(0);
static void
nfsd_file_gc(void)
{
- nfsd_file_lru_walk_list(NULL);
+ if(atomic_cmpxchg(&nfsd_file_gc_running, 0, 1)==0){
+ nfsd_file_lru_walk_list(NULL);
+ atomic_set(&nfsd_file_gc_running, 0);
+ }
}

static void
nfsd_file_gc_worker(struct work_struct *work)
{
+ atomic_set(&nfsd_file_gc_queue_delayed, 0);
+ // pr_info("nfsd_file_gc_worker\n");
nfsd_file_gc();
nfsd_file_schedule_laundrette();
}

Best Regards
Wang Yugui ([email protected])
2022/05/29



2022-05-30 07:07:58

by Chuck Lever

Subject: Re: filecache LRU performance regression


> On May 28, 2022, at 10:32 PM, Wang Yugui <[email protected]> wrote:
>
> Hi,
>
>>> On May 27, 2022, at 4:37 PM, Frank van der Linden <[email protected]> wrote:
>>>
>>> On Fri, May 27, 2022 at 06:59:47PM +0000, Chuck Lever III wrote:
>>>>
>>>> Hi Frank-
>>>>
>>>> Bruce recently reminded me about this issue. Is there a bugzilla somewhere?
>>>> Do you have a reproducer I can try?
>>>
>>> Hi Chuck,
>>>
>>> The easiest way to reproduce the issue is to run generic/531 over an
>>> NFSv4 mount, using a system with a larger number of CPUs on the client
>>> side (or just scaling the test up manually - it has a calculation based
>>> on the number of CPUs).
>>>
>>> The test will take a long time to finish. I initially described the
>>> details here:
>>>
>>> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>>>
>>> Since then, it was also reported here:
>>>
>>> https://lore.kernel.org/all/[email protected]/T/#m8c3e4173696e17a9d5903d2a619550f352314d20
>>
>> Thanks for the summary. So, there isn't a bugzilla tracking this
>> issue? If not, please create one here:
>>
>> https://bugzilla.linux-nfs.org/
>>
>> Then we don't have to keep asking for a repeat summary ;-)
>>
>>
>>> I posted an experimental patch, but it's actually not quite correct
>>> (although I think the idea behind it makes sense):
>>>
>>> https://lore.kernel.org/linux-nfs/[email protected]/T/#m869aa427f125afee2af9a89d569c6b98e12e516f
>>
>> A handful of random comments:
>>
>> - nfsd_file_put() is now significantly different than it used
>> to be, so that part of the patch will have to be updated in
>> order to apply to v5.18+
>
> When there are many open files (more than NFSD_FILE_LRU_THRESHOLD),
> nfsd_file_gc() wastes a lot of CPU time.

Thanks for the suggestion. I agree that CPU and
memory bandwidth are not being used effectively
by the filecache garbage collector.


> Can we serialize nfsd_file_gc() for v5.18+ as a first step?

If I understand Frank's problem statement correctly,
garbage collection during an nfsd_file_put() under
an NFSv4-only workload walks the length of the LRU
list and finds nothing to evict 100% of the time.
That seems like a bug, and fixing it might give us
the most improvement in this area.


> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 3d944ca..6abefd9 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -63,6 +63,8 @@ static struct delayed_work nfsd_filecache_laundrette;
>
> static void nfsd_file_gc(void);
>
> +static atomic_t nfsd_file_gc_queue_delayed = ATOMIC_INIT(0);
> +
> static void
> nfsd_file_schedule_laundrette(void)
> {
> @@ -71,8 +73,10 @@ nfsd_file_schedule_laundrette(void)
> if (count == 0 || test_bit(NFSD_FILE_SHUTDOWN, &nfsd_file_lru_flags))
> return;
>
> - queue_delayed_work(system_wq, &nfsd_filecache_laundrette,
> + if(atomic_cmpxchg(&nfsd_file_gc_queue_delayed, 0, 1)==0){
> + queue_delayed_work(system_wq, &nfsd_filecache_laundrette,
> NFSD_LAUNDRETTE_DELAY);

I might misunderstand what this is doing exactly.
I'm sure there's a preferred idiom in the kernel
for not queuing a new work item when one is already
running, so that an open-coded cmpxchg is not
necessary.

It might be better to allocate a specific workqueue
for filecache garbage collection, and limit the
maximum number of active work items allowed on that
queue to 1. One benefit of that might be reducing
the number of threads contending for the filecache
data structures.

If GC is capped like this, maybe create one GC
workqueue per nfsd_net so GC activity in one
namespace does not starve filecache GC in other
namespaces.
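
For illustration only, a per-net version might look roughly like the
following; the workqueue and delayed-work fields on nfsd_net are
invented for the sketch:

/*
 * Sketch only: an ordered workqueue executes at most one work item
 * at a time, so GC is serialized per namespace without new atomics.
 * nn->filecache_wq and nn->filecache_laundrette are hypothetical.
 */
static int nfsd_filecache_net_init(struct nfsd_net *nn)
{
	nn->filecache_wq = alloc_ordered_workqueue("nfsd-filecache-gc", 0);
	if (!nn->filecache_wq)
		return -ENOMEM;
	INIT_DELAYED_WORK(&nn->filecache_laundrette, nfsd_file_gc_worker);
	return 0;
}

static void nfsd_file_schedule_laundrette_sketch(struct nfsd_net *nn)
{
	/*
	 * queue_delayed_work() is a no-op while the work item is still
	 * pending, so no open-coded cmpxchg is needed here.
	 */
	queue_delayed_work(nn->filecache_wq, &nn->filecache_laundrette,
			   NFSD_LAUNDRETTE_DELAY);
}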

Before I would take patches like this, though,
performance data demonstrating a problem and some
improvement should be presented separately or as
part of the patch descriptions.

If you repost, start a separate thread and cc:

M: Tejun Heo <[email protected]>
R: Lai Jiangshan <[email protected]>

to get review by workqueue domain experts.


> + }
> }
>
> static void
> @@ -468,15 +472,22 @@ nfsd_file_lru_walk_list(struct shrink_control *sc)
> return ret;
> }
>
> +// nfsd_file_gc() support concurrency, but serialize it
> +atomic_t nfsd_file_gc_running = ATOMIC_INIT(0);
> static void
> nfsd_file_gc(void)
> {
> - nfsd_file_lru_walk_list(NULL);
> + if(atomic_cmpxchg(&nfsd_file_gc_running, 0, 1)==0){
> + nfsd_file_lru_walk_list(NULL);
> + atomic_set(&nfsd_file_gc_running, 0);
> + }
> }
>
> static void
> nfsd_file_gc_worker(struct work_struct *work)
> {
> + atomic_set(&nfsd_file_gc_queue_delayed, 0);
> + // pr_info("nfsd_file_gc_worker\n");
> nfsd_file_gc();
> nfsd_file_schedule_laundrette();
> }
>
> Best Regards
> Wang Yugui ([email protected])
> 2022/05/29

--
Chuck Lever




2022-06-01 10:13:09

by Chuck Lever

Subject: Re: filecache LRU performance regression



> On May 27, 2022, at 5:34 PM, Chuck Lever III <[email protected]> wrote:
>
>
>
>> On May 27, 2022, at 4:37 PM, Frank van der Linden <[email protected]> wrote:
>>
>> On Fri, May 27, 2022 at 06:59:47PM +0000, Chuck Lever III wrote:
>>>
>>>
>>> Hi Frank-
>>>
>>> Bruce recently reminded me about this issue. Is there a bugzilla somewhere?
>>> Do you have a reproducer I can try?
>>
>> Hi Chuck,
>>
>> The easiest way to reproduce the issue is to run generic/531 over an
>> NFSv4 mount, using a system with a larger number of CPUs on the client
>> side (or just scaling the test up manually - it has a calculation based
>> on the number of CPUs).
>>
>> The test will take a long time to finish. I initially described the
>> details here:
>>
>> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>>
>> Since then, it was also reported here:
>>
>> https://lore.kernel.org/all/[email protected]/T/#m8c3e4173696e17a9d5903d2a619550f352314d20
>
> Thanks for the summary. So, there isn't a bugzilla tracking this
> issue? If not, please create one here:
>
> https://bugzilla.linux-nfs.org/
>
> Then we don't have to keep asking for a repeat summary ;-)

I can easily reproduce this scenario in my lab. I've opened:

https://bugzilla.linux-nfs.org/show_bug.cgi?id=386


--
Chuck Lever




2022-06-01 19:27:16

by Frank van der Linden

Subject: Re: filecache LRU performance regression

On Wed, Jun 01, 2022 at 12:34:34AM +0000, Chuck Lever III wrote:
> > On May 27, 2022, at 5:34 PM, Chuck Lever III <[email protected]> wrote:
> >
> >
> >
> >> On May 27, 2022, at 4:37 PM, Frank van der Linden <[email protected]> wrote:
> >>
> >> On Fri, May 27, 2022 at 06:59:47PM +0000, Chuck Lever III wrote:
> >>>
> >>>
> >>> Hi Frank-
> >>>
> >>> Bruce recently reminded me about this issue. Is there a bugzilla somewhere?
> >>> Do you have a reproducer I can try?
> >>
> >> Hi Chuck,
> >>
> >> The easiest way to reproduce the issue is to run generic/531 over an
> >> NFSv4 mount, using a system with a larger number of CPUs on the client
> >> side (or just scaling the test up manually - it has a calculation based
> >> on the number of CPUs).
> >>
> >> The test will take a long time to finish. I initially described the
> >> details here:
> >>
> >> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> >>
> >> Since then, it was also reported here:
> >>
> >> https://lore.kernel.org/all/[email protected]/T/#m8c3e4173696e17a9d5903d2a619550f352314d20
> >
> > Thanks for the summary. So, there isn't a bugzilla tracking this
> > issue? If not, please create one here:
> >
> > https://bugzilla.linux-nfs.org/
> >
> > Then we don't have to keep asking for a repeat summary ;-)
>
> I can easily reproduce this scenario in my lab. I've opened:
>
> https://bugzilla.linux-nfs.org/show_bug.cgi?id=386
>

Thanks for taking care of that. I'm switching jobs, so I won't have much
time to look at it or test for a few weeks.

I think the basic problem is that the filecache is a clear win for
v3, since that's stateless and it avoids a lookup for each operation.

For v4, it's not clear to me that it's much of a win, and in this case
it definitely gets in the way.

Maybe the best thing is to not bother at all with the caching for v4,
although that might hurt mixed v3/v4 clients accessing the same fs
slightly. Not sure how common of a scenario that is, though.

- Frank

2022-06-01 19:27:53

by Chuck Lever

Subject: Re: filecache LRU performance regression



> On Jun 1, 2022, at 12:10 PM, Frank van der Linden <[email protected]> wrote:
>
> On Wed, Jun 01, 2022 at 12:34:34AM +0000, Chuck Lever III wrote:
>>> On May 27, 2022, at 5:34 PM, Chuck Lever III <[email protected]> wrote:
>>>
>>>
>>>
>>>> On May 27, 2022, at 4:37 PM, Frank van der Linden <[email protected]> wrote:
>>>>
>>>> On Fri, May 27, 2022 at 06:59:47PM +0000, Chuck Lever III wrote:
>>>>>
>>>>>
>>>>> Hi Frank-
>>>>>
>>>>> Bruce recently reminded me about this issue. Is there a bugzilla somewhere?
>>>>> Do you have a reproducer I can try?
>>>>
>>>> Hi Chuck,
>>>>
>>>> The easiest way to reproduce the issue is to run generic/531 over an
>>>> NFSv4 mount, using a system with a larger number of CPUs on the client
>>>> side (or just scaling the test up manually - it has a calculation based
>>>> on the number of CPUs).
>>>>
>>>> The test will take a long time to finish. I initially described the
>>>> details here:
>>>>
>>>> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>>>>
>>>> Since then, it was also reported here:
>>>>
>>>> https://lore.kernel.org/all/[email protected]/T/#m8c3e4173696e17a9d5903d2a619550f352314d20
>>>
>>> Thanks for the summary. So, there isn't a bugzilla tracking this
>>> issue? If not, please create one here:
>>>
>>> https://bugzilla.linux-nfs.org/
>>>
>>> Then we don't have to keep asking for a repeat summary ;-)
>>
>> I can easily reproduce this scenario in my lab. I've opened:
>>
>> https://bugzilla.linux-nfs.org/show_bug.cgi?id=386
>>
>
> Thanks for taking care of that. I'm switching jobs, so I won't have much
> time to look at it or test for a few weeks.

No problem. I can reproduce the failure, and I have some ideas
of how to address the issue, so I've assigned the bug to myself.


> I think the basic problem is that the filecache is a clear win for
> v3, since that's stateless and it avoids a lookup for each operation.
>
> For v4, it's not clear to me that it's much of a win, and in this case
> it definitely gets in the way.
>
> Maybe the best thing is to not bother at all with the caching for v4,

At this point I don't think we can go that way. The NFSv4 code
uses a lot of the same infrastructural helpers as NFSv3, and
all of those now depend on the use of nfsd_file objects.

Certainly, though, the filecache plays somewhat different roles
for legacy NFS and NFSv4. I've been toying with the idea of
maintaining separate filecaches for NFSv3 and NFSv4, since
the garbage collection and shrinker rules are fundamentally
different for the two, and NFSv4 wants a file closed completely
(no lingering open) when it does a CLOSE or DELEGRETURN.

In the meantime, the obvious culprit is that the LRU walk during
garbage collection is broken. I've talked with Dave Chinner,
co-author of list_lru, about a way to straighten this out so
that the LRU walk is very nicely bounded and at the same time
deals properly with NFSv4 OPEN and CLOSE. Trond also had an
idea or two here, and it seems the three of us are on nearly
the same page.
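
As a very rough sketch of that direction (not necessarily the design
we will end up with, just what the existing list_lru API already
allows; the batch constant is invented, and I'm assuming the current
isolate callback and dispose-list helper in fs/nfsd/filecache.c):

/*
 * Sketch only: bound each garbage-collection pass instead of walking
 * the entire LRU.  The isolate callback moves evictable entries onto
 * @dispose and skips files still pinned by NFSv4 state.
 */
#define NFSD_FILE_GC_BATCH	32UL

static void nfsd_file_gc_sketch(void)
{
	LIST_HEAD(dispose);

	/* Visit at most NFSD_FILE_GC_BATCH entries per pass. */
	list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb, &dispose,
		      NFSD_FILE_GC_BATCH);
	nfsd_file_dispose_list(&dispose);
}

That keeps each pass proportional to the batch size rather than to
the number of cached files, and leaves NFSv4-held files alone.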

Once that is addressed, we can revisit Wang's suggestion of
serializing garbage collection, as a nice optimization.

Good luck with your new position!


> although that might hurt mixed v3/v4 clients accessing the same fs
> slightly. Not sure how common of a scenario that is, though.



--
Chuck Lever




2022-06-01 21:53:33

by Frank van der Linden

Subject: Re: filecache LRU performance regression

On Wed, Jun 01, 2022 at 04:37:47PM +0000, Chuck Lever III wrote:
>
> > On Jun 1, 2022, at 12:10 PM, Frank van der Linden <[email protected]> wrote:
> >
> > On Wed, Jun 01, 2022 at 12:34:34AM +0000, Chuck Lever III wrote:
> >>> On May 27, 2022, at 5:34 PM, Chuck Lever III <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>>> On May 27, 2022, at 4:37 PM, Frank van der Linden <[email protected]> wrote:
> >>>>
> >>>> On Fri, May 27, 2022 at 06:59:47PM +0000, Chuck Lever III wrote:
> >>>>>
> >>>>>
> >>>>> Hi Frank-
> >>>>>
> >>>>> Bruce recently reminded me about this issue. Is there a bugzilla somewhere?
> >>>>> Do you have a reproducer I can try?
> >>>>
> >>>> Hi Chuck,
> >>>>
> >>>> The easiest way to reproduce the issue is to run generic/531 over an
> >>>> NFSv4 mount, using a system with a larger number of CPUs on the client
> >>>> side (or just scaling the test up manually - it has a calculation based
> >>>> on the number of CPUs).
> >>>>
> >>>> The test will take a long time to finish. I initially described the
> >>>> details here:
> >>>>
> >>>> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
> >>>>
> >>>> Since then, it was also reported here:
> >>>>
> >>>> https://lore.kernel.org/all/[email protected]/T/#m8c3e4173696e17a9d5903d2a619550f352314d20
> >>>
> >>> Thanks for the summary. So, there isn't a bugzilla tracking this
> >>> issue? If not, please create one here:
> >>>
> >>> https://bugzilla.linux-nfs.org/
> >>>
> >>> Then we don't have to keep asking for a repeat summary ;-)
> >>
> >> I can easily reproduce this scenario in my lab. I've opened:
> >>
> >> https://bugzilla.linux-nfs.org/show_bug.cgi?id=386
> >>
> >
> > Thanks for taking care of that. I'm switching jobs, so I won't have much
> > time to look at it or test for a few weeks.
>
> No problem. I can reproduce the failure, and I have some ideas
> of how to address the issue, so I've assigned the bug to myself.
>
>
> > I think the basic problem is that the filecache is a clear win for
> > v3, since that's stateless and it avoids a lookup for each operation.
> >
> > For v4, it's not clear to me that it's much of a win, and in this case
> > it definitely gets in the way.
> >
> > Maybe the best thing is to not bother at all with the caching for v4,
>
> At this point I don't think we can go that way. The NFSv4 code
> uses a lot of the same infrastructural helpers as NFSv3, and
> all of those now depend on the use of nfsd_file objects.
>
> Certainly, though, the filecache plays somewhat different roles
> for legacy NFS and NFSv4. I've been toying with the idea of
> maintaining separate filecaches for NFSv3 and NFSv4, since
> the garbage collection and shrinker rules are fundamentally
> different for the two, and NFSv4 wants a file closed completely
> (no lingering open) when it does a CLOSE or DELEGRETURN.
>
> In the meantime, the obvious culprit is that the LRU walk during
> garbage collection is broken. I've talked with Dave Chinner,
> co-author of list_lru, about a way to straighten this out so
> that the LRU walk is very nicely bounded and at the same time
> deals properly with NFSv4 OPEN and CLOSE. Trond also had an
> idea or two here, and it seems the three of us are on nearly
> the same page.
>
> Once that is addressed, we can revisit Wang's suggestion of
> serializing garbage collection, as a nice optimization.

Sounds good, thanks!

A related issue: there is currently no upper limit that I can see
on the number of active OPENs for a client. So essentially, a
client can run a server out of resources by doing a very large
number of OPENs.

Should there be an upper limit, above which requests are either
denied, or old state is invalidated?

- Frank

2022-06-01 22:03:39

by Chuck Lever

Subject: Re: filecache LRU performance regression



> On Jun 1, 2022, at 5:18 PM, Frank van der Linden <[email protected]> wrote:
>
> On Wed, Jun 01, 2022 at 04:37:47PM +0000, Chuck Lever III wrote:
>>
>>> On Jun 1, 2022, at 12:10 PM, Frank van der Linden <[email protected]> wrote:
>>>
>>> On Wed, Jun 01, 2022 at 12:34:34AM +0000, Chuck Lever III wrote:
>>>>> On May 27, 2022, at 5:34 PM, Chuck Lever III <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On May 27, 2022, at 4:37 PM, Frank van der Linden <[email protected]> wrote:
>>>>>>
>>>>>> On Fri, May 27, 2022 at 06:59:47PM +0000, Chuck Lever III wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi Frank-
>>>>>>>
>>>>>>> Bruce recently reminded me about this issue. Is there a bugzilla somewhere?
>>>>>>> Do you have a reproducer I can try?
>>>>>>
>>>>>> Hi Chuck,
>>>>>>
>>>>>> The easiest way to reproduce the issue is to run generic/531 over an
>>>>>> NFSv4 mount, using a system with a larger number of CPUs on the client
>>>>>> side (or just scaling the test up manually - it has a calculation based
>>>>>> on the number of CPUs).
>>>>>>
>>>>>> The test will take a long time to finish. I initially described the
>>>>>> details here:
>>>>>>
>>>>>> https://lore.kernel.org/linux-nfs/20200608192122.GA19171@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>>>>>>
>>>>>> Since then, it was also reported here:
>>>>>>
>>>>>> https://lore.kernel.org/all/[email protected]/T/#m8c3e4173696e17a9d5903d2a619550f352314d20
>>>>>
>>>>> Thanks for the summary. So, there isn't a bugzilla tracking this
>>>>> issue? If not, please create one here:
>>>>>
>>>>> https://bugzilla.linux-nfs.org/
>>>>>
>>>>> Then we don't have to keep asking for a repeat summary ;-)
>>>>
>>>> I can easily reproduce this scenario in my lab. I've opened:
>>>>
>>>> https://bugzilla.linux-nfs.org/show_bug.cgi?id=386
>>>>
>>>
>>> Thanks for taking care of that. I'm switching jobs, so I won't have much
>>> time to look at it or test for a few weeks.
>>
>> No problem. I can reproduce the failure, and I have some ideas
>> of how to address the issue, so I've assigned the bug to myself.
>>
>>
>>> I think the basic problem is that the filecache is a clear win for
>>> v3, since that's stateless and it avoids a lookup for each operation.
>>>
>>> For v4, it's not clear to me that it's much of a win, and in this case
>>> it definitely gets in the way.
>>>
>>> Maybe the best thing is to not bother at all with the caching for v4,
>>
>> At this point I don't think we can go that way. The NFSv4 code
>> uses a lot of the same infrastructural helpers as NFSv3, and
>> all of those now depend on the use of nfsd_file objects.
>>
>> Certainly, though, the filecache plays somewhat different roles
>> for legacy NFS and NFSv4. I've been toying with the idea of
>> maintaining separate filecaches for NFSv3 and NFSv4, since
>> the garbage collection and shrinker rules are fundamentally
>> different for the two, and NFSv4 wants a file closed completely
>> (no lingering open) when it does a CLOSE or DELEGRETURN.
>>
>> In the meantime, the obvious culprit is that the LRU walk during
>> garbage collection is broken. I've talked with Dave Chinner,
>> co-author of list_lru, about a way to straighten this out so
>> that the LRU walk is very nicely bounded and at the same time
>> deals properly with NFSv4 OPEN and CLOSE. Trond also had an
>> idea or two here, and it seems the three of us are on nearly
>> the same page.
>>
>> Once that is addressed, we can revisit Wang's suggestion of
>> serializing garbage collection, as a nice optimization.
>
> Sounds good, thanks!
>
> A related issue: there is currently no upper limit that I can see
> on the number of active OPENs for a client. So essentially, a
> client can run a server out of resources by doing a very large
> number of OPENs.
>
> Should there be an upper limit, above which requests are either
> denied, or old state is invalidated?

We need to explore the server's behavior in low resource
situations. I prefer graceful degradation of service rather
than adding admin knobs, plus maybe adding some mechanism that
can report unruly clients so the server admin can deal with them.

The VFS will stop giving out struct file's after a certain point.
NFSv4 OPEN operations will return an error.

Some of the server's resource pools have shrinker callbacks.
Those pools will be reaped when there is memory pressure. We
could do better there.

I don't believe there's any mechanism yet to get rid of state
that is still active. I consider that a "heroic measure"
that adds complexity for perhaps little benefit. And doing so
is potentially catastrophic for clients that trust the server
won't rip the rug out from under them. I would rather prevent
the creation of new state at that point rather than invalidating
existing state.

There are ways that server administrators can clean out known
defunct clients. That would be perhaps a little safer, and might
go along with the idea that server admins should deal mindfully
with malfunctioning or malicious clients that are causing a
denial of service.

In the end we have to address problems like CPU soft lock-up
to be certain server administrative interfaces will remain
available when the server is under a significant workload.


--
Chuck Lever




2022-06-01 23:05:56

by J. Bruce Fields

Subject: Re: filecache LRU performance regression

On Wed, Jun 01, 2022 at 09:18:27PM +0000, Frank van der Linden wrote:
> A related issue: there is currently no upper limit that I can see
> on the number of active OPENs for a client. So essentially, a
> client can run a server out of resources by doing a very large
> number of OPENs.
>
> Should there be an upper limit, above which requests are either
> denied, or old state is invalidated?

Could be, if only to prevent abuse. It doesn't seem to be a problem
people hit in practice, though.

Delegations add a bit of a wrinkle, as they allow the client to do opens
and locks locally without limit, and denying them during recovery from a
delegation break is ugly.

--b.

2022-06-02 01:51:22

by Wang Yugui

Subject: Re: filecache LRU performance regression

Hi,

> Certainly, though, the filecache plays somewhat different roles
> for legacy NFS and NFSv4. I've been toying with the idea of
> maintaining separate filecaches for NFSv3 and NFSv4, since
> the garbage collection and shrinker rules are fundamentally
> different for the two, and NFSv4 wants a file closed completely
> (no lingering open) when it does a CLOSE or DELEGRETURN.

if NFSv4 closed a file completely (no lingering open) when it does a
CLOSE or DELEGRETURN,
then the following problem seems to be fixed too.
https://lore.kernel.org/linux-nfs/[email protected]/

Best Regards
Wang Yugui ([email protected])
2022/06/02



2022-06-05 04:02:10

by Chuck Lever

Subject: Re: filecache LRU performance regression


> On Jun 1, 2022, at 9:49 PM, Wang Yugui <[email protected]> wrote:
>
> Hi,
>
>> Certainly, though, the filecache plays somewhat different roles
>> for legacy NFS and NFSv4. I've been toying with the idea of
>> maintaining separate filecaches for NFSv3 and NFSv4, since
>> the garbage collection and shrinker rules are fundamentally
>> different for the two, and NFSv4 wants a file closed completely
>> (no lingering open) when it does a CLOSE or DELEGRETURN.
>
> if NFSv4 closed a file completely (no lingering open) when it does a
> CLOSE or DELEGRETURN,
> then the following problem seems to be fixed too.
> https://lore.kernel.org/linux-nfs/[email protected]/

I agree, that's probably related.

But let's document this issue as a separate bug, in case it
isn't actually as related as it looks at first. Can you
open a bug report on bugzilla.linux-nfs.org, under kernel/server?

--
Chuck Lever



2022-06-06 04:45:38

by Wang Yugui

Subject: Re: filecache LRU performance regression

Hi,

> > On Jun 1, 2022, at 9:49 PM, Wang Yugui <[email protected]> wrote:
> >
> > Hi,
> >
> >> Certainly, though, the filecache plays somewhat different roles
> >> for legacy NFS and NFSv4. I've been toying with the idea of
> >> maintaining separate filecaches for NFSv3 and NFSv4, since
> >> the garbage collection and shrinker rules are fundamentally
> >> different for the two, and NFSv4 wants a file closed completely
> >> (no lingering open) when it does a CLOSE or DELEGRETURN.
> >
> > if NFSv4 closed a file completely (no lingering open) when it does a
> > CLOSE or DELEGRETURN,
> > then the following problem seems to be fixed too.
> > https://lore.kernel.org/linux-nfs/[email protected]/
>
> I agree, that's probably related.
>
> But let's document this issue as a separate bug, in case it
> isn't actually as related as it looks at first. Can you
> open a bug report on bugzilla.linux-nfs.org, under kernel/server?
>
> --
> Chuck Lever

I opened a bug report in bugzilla.linux-nfs.org
https://bugzilla.linux-nfs.org/show_bug.cgi?id=387

Best Regards
Wang Yugui ([email protected])
2022/06/05