Hello,
during our performance testing we have noticed that commit b6669305d35a
("nfsd: Reduce the number of calls to nfsd_file_gc()") has introduced a
performance regression when a client does random buffered writes. The
workload on the NFS client is fio running 4 processes doing random buffered writes to 4
different files and the files are large enough to hit dirty limits and
force writeback from the client. In particular the invocation is like:
fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 --invalidate=1 --group_reporting=1 --runtime=300 --fallocate=posix --ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --rw=randwrite --size=4000m --numjobs=4 --bs=4k --filename_format=FioWorkloads.\$jobnum --end_fsync=1
The reason why commit b6669305d35a regresses performance is the
filemap_flush() call it adds into nfsd_file_put(). Before this commit
writeback on the server happened from nfsd_commit() code resulting in
rather long semisequential streams of 4k writes. After commit b6669305d35a
all the writeback happens from filemap_flush() calls, resulting in a much
longer average seek distance (IO from different files is more interleaved)
and a roughly 16-20% regression in the achieved writeback throughput when
the backing store is rotational storage.
I think the filemap_flush() from nfsd_file_put() is indeed rather
aggressive, and we'd be better off just leaving writeback to either
nfsd_commit() or the standard dirty page cleaning happening on the
system. I assume the rationale for the filemap_flush() call was to make
it more likely that the file can be evicted during the garbage
collection run? Was there any particular problem that led to the
addition of this call, or was it just an "it seemed like a good idea"
thing?
Thanks in advance for ideas.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
> On Mar 30, 2022, at 12:19 PM, Jan Kara <[email protected]> wrote:
>
> On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
>>> On Mar 30, 2022, at 11:03 AM, Trond Myklebust <[email protected]> wrote:
>>>
>>> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
>>>> [... original problem report trimmed, see top of thread ...]
>>>
>>> It was mainly introduced to reduce the amount of work that
>>> nfsd_file_free() needs to do. In particular when re-exporting NFS, the
>>> call to filp_close() can be expensive because it synchronously flushes
>>> out dirty pages. That again means that some of the calls to
>>> nfsd_file_dispose_list() can end up being very expensive (particularly
>>> the ones run by the garbage collector itself).
>>
>> The "no regressions" rule suggests that some kind of action needs
>> to be taken. I don't have a sense of whether Jan's workload or NFS
>> re-export is the more common use case, however.
>>
>> I can see that filp_close() on a file on an NFS mount could be
>> costly if that file has dirty pages, due to the NFS client's
>> "flush on close" semantic.
>>
>> Trond, are there alternatives to flushing in the nfsd_file_put()
>> path? I'm willing to entertain bug fix patches rather than a
>> mechanical revert of b6669305d35a.
>
> Yeah, I don't think we need to rush fixing this with a revert.
Sorry I wasn't clear: I would prefer to apply a bug fix over
sending a revert commit, and I do not have enough information
yet to make that choice. Waiting a bit is OK with me.
> Also because
> just removing the filemap_flush() from nfsd_file_put() would keep other
> benefits of that commit while fixing the regression AFAIU. But I think
> making flushing less aggressive is desirable because as I wrote in my other
> reply, currently it is preventing pagecache from accumulating enough dirty
> data for a good IO pattern.
I might even go so far as to say that a small write workload
isn't especially good for solid state storage either.
I know Trond is trying to address NFS re-export performance, but
there appear to be some palpable effects outside of that narrow
use case that need to be considered. Thus a server-side fix,
rather than a fix in the NFS client used to do the re-export,
seems appropriate to consider.
--
Chuck Lever
> On Mar 30, 2022, at 11:03 AM, Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
>> [... original problem report trimmed, see top of thread ...]
>
> It was mainly introduced to reduce the amount of work that
> nfsd_file_free() needs to do. In particular when re-exporting NFS, the
> call to filp_close() can be expensive because it synchronously flushes
> out dirty pages. That again means that some of the calls to
> nfsd_file_dispose_list() can end up being very expensive (particularly
> the ones run by the garbage collector itself).
The "no regressions" rule suggests that some kind of action needs
to be taken. I don't have a sense of whether Jan's workload or NFS
re-export is the more common use case, however.
I can see that filp_close() on a file on an NFS mount could be
costly if that file has dirty pages, due to the NFS client's
"flush on close" semantic.
Trond, are there alternatives to flushing in the nfsd_file_put()
path? I'm willing to entertain bug fix patches rather than a
mechanical revert of b6669305d35a.
--
Chuck Lever
On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
> > On Mar 30, 2022, at 11:03 AM, Trond Myklebust <[email protected]> wrote:
> >
> > On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> >> [... original problem report trimmed, see top of thread ...]
> >
> > It was mainly introduced to reduce the amount of work that
> > nfsd_file_free() needs to do. In particular when re-exporting NFS, the
> > call to filp_close() can be expensive because it synchronously flushes
> > out dirty pages. That again means that some of the calls to
> > nfsd_file_dispose_list() can end up being very expensive (particularly
> > the ones run by the garbage collector itself).
>
> The "no regressions" rule suggests that some kind of action needs
> to be taken. I don't have a sense of whether Jan's workload or NFS
> re-export is the more common use case, however.
>
> I can see that filp_close() on a file on an NFS mount could be
> costly if that file has dirty pages, due to the NFS client's
> "flush on close" semantic.
>
> Trond, are there alternatives to flushing in the nfsd_file_put()
> path? I'm willing to entertain bug fix patches rather than a
> mechanical revert of b6669305d35a.
Yeah, I don't think we need to rush fixing this with a revert. Also,
just removing the filemap_flush() from nfsd_file_put() would keep the
other benefits of that commit while fixing the regression, AFAIU. But I
think making flushing less aggressive is desirable because, as I wrote
in my other reply, it currently prevents the pagecache from
accumulating enough dirty data for a good IO pattern.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 30-03-22 15:03:30, Trond Myklebust wrote:
> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> > [... original problem report trimmed, see top of thread ...]
>
> It was mainly introduced to reduce the amount of work that
> nfsd_file_free() needs to do. In particular when re-exporting NFS, the
> call to filp_close() can be expensive because it synchronously flushes
> out dirty pages. That again means that some of the calls to
> nfsd_file_dispose_list() can end up being very expensive (particularly
> the ones run by the garbage collector itself).
I see, thanks for the info. So I'm pondering what options we have for
fixing the performance regression, because the filemap_flush() call in
nfsd_file_put() is just too aggressive and doesn't allow enough dirty
data to accumulate in the page cache for a reasonable IO pattern.
E.g. if the concern is just too long an nfsd_file_dispose_list() runtime
when there are more files in the dispose list, we could do two passes
there: a first one that walks all the files and starts async writeback
for each of them, and a second one that drops the references, which
among other things may end up in ->flush() doing the synchronous
writeback (but that will now have little left to do). This is how
generic writeback actually handles synchronous writeback, because it is
much faster than looping over "submit one file, wait for one file" when
there are multiple files to write. Would something like this be
acceptable for you?
If something like this is not enough, we could also add another delayed
work item that walks unused files and starts writeback for them.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
>
>
> > On Mar 30, 2022, at 12:19 PM, Jan Kara <[email protected]> wrote:
> >
> > On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
> > > > On Mar 30, 2022, at 11:03 AM, Trond Myklebust
> > > > <[email protected]> wrote:
> > > >
> > > > On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> > > > > [... original problem report trimmed, see top of thread ...]
> > > >
> > > > It was mainly introduced to reduce the amount of work that
> > > > nfsd_file_free() needs to do. In particular when re-exporting
> > > > NFS, the
> > > > call to filp_close() can be expensive because it synchronously
> > > > flushes
> > > > out dirty pages. That again means that some of the calls to
> > > > nfsd_file_dispose_list() can end up being very expensive
> > > > (particularly
> > > > the ones run by the garbage collector itself).
> > >
> > > The "no regressions" rule suggests that some kind of action needs
> > > to be taken. I don't have a sense of whether Jan's workload or
> > > NFS
> > > re-export is the more common use case, however.
> > >
> > > I can see that filp_close() on a file on an NFS mount could be
> > > costly if that file has dirty pages, due to the NFS client's
> > > "flush on close" semantic.
> > >
> > > Trond, are there alternatives to flushing in the nfsd_file_put()
> > > path? I'm willing to entertain bug fix patches rather than a
> > > mechanical revert of b6669305d35a.
> >
> > Yeah, I don't think we need to rush fixing this with a revert.
>
> Sorry I wasn't clear: I would prefer to apply a bug fix over
> sending a revert commit, and I do not have enough information
> yet to make that choice. Waiting a bit is OK with me.
>
>
> > Also because
> > just removing the filemap_flush() from nfsd_file_put() would keep
> > other
> > benefits of that commit while fixing the regression AFAIU. But I
> > think
> > making flushing less aggressive is desirable because as I wrote in
> > my other
> > reply, currently it is preventing pagecache from accumulating
> > enough dirty
> > data for a good IO pattern.
>
> I might even go so far as to say that a small write workload
> isn't especially good for solid state storage either.
>
> I know Trond is trying to address NFS re-export performance, but
> there appear to be some palpable effects outside of that narrow
> use case that need to be considered. Thus a server-side fix,
> rather than a fix in the NFS client used to do the re-export,
> seems appropriate to consider.
Turns out it is not just the NFS client that is the problem. Rather, we
need in general to be able to detect flush errors and either report
them directly (through commit) or change the boot verifier to force
clients to resend the unstable writes.
Hence, I think we're looking at something like this:
8<--------------------------------------------------------------------
From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Wed, 30 Mar 2022 16:55:38 -0400
Subject: [PATCH] nfsd: Fix a write performance regression
The call to filemap_flush() in nfsd_file_put() is there to ensure that
we clear out any writes belonging to an NFSv3 client relatively quickly
and avoid situations where the file can't be evicted by the garbage
collector. It also ensures that we detect write errors quickly.
The problem is that this causes a performance regression for some
workloads.
So try to improve matters by deferring writeback until we're ready to
close the file and need to detect errors so that we can force the
client to resend.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfsd/filecache.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 8bc807c5fea4..9578a6317709 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
}
+static void
+nfsd_file_flush(struct nfsd_file *nf)
+{
+ if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
+ nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
+}
+
static void
nfsd_file_do_unhash(struct nfsd_file *nf)
{
@@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
return;
}
- filemap_flush(nf->nf_file->f_mapping);
is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
- nfsd_file_put_noref(nf);
- if (is_hashed)
+ if (!is_hashed) {
+ nfsd_file_flush(nf);
+ nfsd_file_put_noref(nf);
+ } else {
+ nfsd_file_put_noref(nf);
nfsd_file_schedule_laundrette();
+ }
if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
nfsd_file_gc();
}
@@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
while(!list_empty(dispose)) {
nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
list_del(&nf->nf_lru);
+ nfsd_file_flush(nf);
nfsd_file_put_noref(nf);
}
}
@@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
while(!list_empty(dispose)) {
nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
list_del(&nf->nf_lru);
+ nfsd_file_flush(nf);
if (!refcount_dec_and_test(&nf->nf_ref))
continue;
if (nfsd_file_free(nf))
--
2.35.1
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> [... original problem report trimmed, see top of thread ...]
It was mainly introduced to reduce the amount of work that
nfsd_file_free() needs to do. In particular when re-exporting NFS, the
call to filp_close() can be expensive because it synchronously flushes
out dirty pages. That again means that some of the calls to
nfsd_file_dispose_list() can end up being very expensive (particularly
the ones run by the garbage collector itself).
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Wed 30-03-22 22:02:39, Trond Myklebust wrote:
> On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
> >
> >
> > > On Mar 30, 2022, at 12:19 PM, Jan Kara <[email protected]> wrote:
> > >
> > > On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
> > > > > On Mar 30, 2022, at 11:03 AM, Trond Myklebust
> > > > > <[email protected]> wrote:
> > > > >
> > > > > On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
> > > > > > [... original problem report trimmed, see top of thread ...]
> > > > >
> > > > > It was mainly introduced to reduce the amount of work that
> > > > > nfsd_file_free() needs to do. In particular when re-exporting
> > > > > NFS, the
> > > > > call to filp_close() can be expensive because it synchronously
> > > > > flushes
> > > > > out dirty pages. That again means that some of the calls to
> > > > > nfsd_file_dispose_list() can end up being very expensive
> > > > > (particularly
> > > > > the ones run by the garbage collector itself).
> > > >
> > > > The "no regressions" rule suggests that some kind of action needs
> > > > to be taken. I don't have a sense of whether Jan's workload or
> > > > NFS
> > > > re-export is the more common use case, however.
> > > >
> > > > I can see that filp_close() on a file on an NFS mount could be
> > > > costly if that file has dirty pages, due to the NFS client's
> > > > "flush on close" semantic.
> > > >
> > > > Trond, are there alternatives to flushing in the nfsd_file_put()
> > > > path? I'm willing to entertain bug fix patches rather than a
> > > > mechanical revert of b6669305d35a.
> > >
> > > Yeah, I don't think we need to rush fixing this with a revert.
> >
> > Sorry I wasn't clear: I would prefer to apply a bug fix over
> > sending a revert commit, and I do not have enough information
> > yet to make that choice. Waiting a bit is OK with me.
> >
> >
> > > Also because
> > > just removing the filemap_flush() from nfsd_file_put() would keep
> > > other
> > > benefits of that commit while fixing the regression AFAIU. But I
> > > think
> > > making flushing less aggressive is desirable because as I wrote in
> > > my other
> > > reply, currently it is preventing pagecache from accumulating
> > > enough dirty
> > > data for a good IO pattern.
> >
> > I might even go so far as to say that a small write workload
> > isn't especially good for solid state storage either.
> >
> > I know Trond is trying to address NFS re-export performance, but
> > there appear to be some palpable effects outside of that narrow
> > use case that need to be considered. Thus a server-side fix,
> > rather than a fix in the NFS client used to do the re-export,
> > seems appropriate to consider.
>
> Turns out it is not just the NFS client that is the problem. It is
> rather that we need in general to be able to detect flush errors and
> either report them directly (through commit) or we need to change the
> boot verifier to force clients to resend the unstable writes.
>
> Hence, I think we're looking at something like this:
Thanks for the fix! I've run the patch through some testing and it
indeed restores the good IO pattern and brings performance back to the
original levels. So feel free to add:
Tested-by: Jan Kara <[email protected]>
Honza
>
> 8<--------------------------------------------------------------------
> From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <[email protected]>
> Date: Wed, 30 Mar 2022 16:55:38 -0400
> Subject: [PATCH] nfsd: Fix a write performance regression
>
> The call to filemap_flush() in nfsd_file_put() is there to ensure that
> we clear out any writes belonging to a NFSv3 client relatively quickly
> and avoid situations where the file can't be evicted by the garbage
> collector. It also ensures that we detect write errors quickly.
>
> The problem is this causes a regression in performance for some
> workloads.
>
> So try to improve matters by deferring writeback until we're ready to
> close the file, and need to detect errors so that we can force the
> client to resend.
>
> Signed-off-by: Trond Myklebust <[email protected]>
> ---
> fs/nfsd/filecache.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 8bc807c5fea4..9578a6317709 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
> return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
> }
>
> +static void
> +nfsd_file_flush(struct nfsd_file *nf)
> +{
> + if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
> + nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
> +}
> +
> static void
> nfsd_file_do_unhash(struct nfsd_file *nf)
> {
> @@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
> return;
> }
>
> - filemap_flush(nf->nf_file->f_mapping);
> is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
> - nfsd_file_put_noref(nf);
> - if (is_hashed)
> + if (!is_hashed) {
> + nfsd_file_flush(nf);
> + nfsd_file_put_noref(nf);
> + } else {
> + nfsd_file_put_noref(nf);
> nfsd_file_schedule_laundrette();
> + }
> if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
> nfsd_file_gc();
> }
> @@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
> while(!list_empty(dispose)) {
> nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
> list_del(&nf->nf_lru);
> + nfsd_file_flush(nf);
> nfsd_file_put_noref(nf);
> }
> }
> @@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
> while(!list_empty(dispose)) {
> nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
> list_del(&nf->nf_lru);
> + nfsd_file_flush(nf);
> if (!refcount_dec_and_test(&nf->nf_ref))
> continue;
> if (nfsd_file_free(nf))
> --
> 2.35.1
>
>
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text you find below is compiled from
a few template paragraphs you might have encountered already
from similar mails.]
Hi, this is your Linux kernel regression tracker. CCing the regression
mailing list, as it should be in the loop for all regressions, as
explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
On 30.03.22 12:34, Jan Kara wrote:
> Hello,
>
> during our performance testing we have noticed that commit b6669305d35a
> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has introduced a
> performance regression when a client does random buffered writes. The
> workload on NFS client is fio running 4 processes doing random buffered writes to 4
> different files and the files are large enough to hit dirty limits and
> force writeback from the client. In particular the invocation is like:
>
> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 --invalidate=1 --group_reporting=1 --runtime=300 --fallocate=posix --ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --rw=randwrite --size=4000m --numjobs=4 --bs=4k --filename_format=FioWorkloads.\$jobnum --end_fsync=1
>
> The reason why commit b6669305d35a regresses performance is the
> filemap_flush() call it adds into nfsd_file_put(). Before this commit
> writeback on the server happened from nfsd_commit() code resulting in
> rather long semisequential streams of 4k writes. After commit b6669305d35a
> all the writeback happens from filemap_flush() calls resulting in much
> longer average seek distance (IO from different files is more interleaved)
> and about 16-20% regression in the achieved writeback throughput when the
> backing store is rotational storage.
>
> I think the filemap_flush() from nfsd_file_put() is indeed rather
> aggressive and I think we'd be better off to just leave writeback to either
> nfsd_commit() or standard dirty page cleaning happening on the system. I
> assume the rationale for the filemap_flush() call was to make it more
> likely the file can be evicted during the garbage collection run? Was there
> any particular problem leading to addition of this call or was it just "it
> seemed like a good idea" thing?
>
> Thanks in advance for ideas.
>
To be sure below issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:
#regzbot ^introduced b6669305d35a
#regzbot title nfs: Performance regression with random IO pattern from the client
#regzbot backburner: introduced more than two years ago, first report, hence developers want to give it some thought
#regzbot ignore-activity
If it turns out this isn't a regression, feel free to remove it from the
tracking by sending a reply to this thread containing a paragraph like
"#regzbot invalid: reason why this is invalid" (without the quotes).
Reminder for developers: when fixing the issue, please add 'Link:'
tags pointing to the report (the mail quoted above) using
lore.kernel.org/r/, as explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'. Regzbot needs them to
automatically connect reports with fixes, but they are useful in
general, too.
I'm sending this to everyone that got the initial report, to make
everyone aware of the tracking. I also hope that messages like this
motivate people to directly get at least the regression mailing list and
ideally even regzbot involved when dealing with regressions, as messages
like this wouldn't be needed then. And don't worry, if I need to send
other mails regarding this regression only relevant for regzbot I'll
send them to the regressions lists only (with a tag in the subject so
people can filter them away). With a bit of luck no such messages will
be needed anyway.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.
--
Additional information about regzbot:
If you want to know more about regzbot, check out its web-interface, the
getting started guide, and the reference documentation:
https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md
The last two documents will explain how you can interact with regzbot
yourself if you want to.
Hint for reporters: when reporting a regression it's in your interest to
CC the regression list and tell regzbot about the issue, as that ensures
the regression makes it onto the radar of the Linux kernel's regression
tracker -- that's in your interest, as it ensures your report won't fall
through the cracks unnoticed.
Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include 'Link:' tag in the patch descriptions pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.
> On Mar 31, 2022, at 9:09 AM, Jan Kara <[email protected]> wrote:
>
> On Wed 30-03-22 22:02:39, Trond Myklebust wrote:
>> On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
>>>
>>>
>>>> On Mar 30, 2022, at 12:19 PM, Jan Kara <[email protected]> wrote:
>>>>
>>>> On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
>>>>>> On Mar 30, 2022, at 11:03 AM, Trond Myklebust
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> during our performance testing we have noticed that commit
>>>>>>> b6669305d35a
>>>>>>> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has
>>>>>>> introduced
>>>>>>> a
>>>>>>> performance regression when a client does random buffered
>>>>>>> writes. The
>>>>>>> workload on NFS client is fio running 4 processes doing
>>>>>>> random
>>>>>>> buffered writes to 4
>>>>>>> different files and the files are large enough to hit dirty
>>>>>>> limits
>>>>>>> and
>>>>>>> force writeback from the client. In particular the invocation
>>>>>>> is
>>>>>>> like:
>>>>>>>
>>>>>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1
>>>>>>> --
>>>>>>> invalidate=1 --group_reporting=1 --runtime=300 --
>>>>>>> fallocate=posix --
>>>>>>> ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
>>>>>>> rw=randwrite --size=4000m --numjobs=4 --bs=4k --
>>>>>>> filename_format=FioWorkloads.\$jobnum --end_fsync=1
>>>>>>>
>>>>>>> The reason why commit b6669305d35a regresses performance is
>>>>>>> the
>>>>>>> filemap_flush() call it adds into nfsd_file_put(). Before
>>>>>>> this commit
>>>>>>> writeback on the server happened from nfsd_commit() code
>>>>>>> resulting in
>>>>>>> rather long semisequential streams of 4k writes. After commit
>>>>>>> b6669305d35a
>>>>>>> all the writeback happens from filemap_flush() calls
>>>>>>> resulting in
>>>>>>> much
>>>>>>> longer average seek distance (IO from different files is more
>>>>>>> interleaved)
>>>>>>> and about 16-20% regression in the achieved writeback
>>>>>>> throughput when
>>>>>>> the
>>>>>>> backing store is rotational storage.
>>>>>>>
>>>>>>> I think the filemap_flush() from nfsd_file_put() is indeed
>>>>>>> rather
>>>>>>> aggressive and I think we'd be better off to just leave
>>>>>>> writeback to
>>>>>>> either
>>>>>>> nfsd_commit() or standard dirty page cleaning happening on
>>>>>>> the
>>>>>>> system. I
>>>>>>> assume the rationale for the filemap_flush() call was to make
>>>>>>> it more
>>>>>>> likely the file can be evicted during the garbage collection
>>>>>>> run? Was
>>>>>>> there
>>>>>>> any particular problem leading to addition of this call or
>>>>>>> was it
>>>>>>> just "it
>>>>>>> seemed like a good idea" thing?
>>>>>>>
>>>>>>> Thanks in advance for ideas.
>>>>>>>
>>>>>>>
>>>>>>> Honza
>>>>>>
>>>>>> It was mainly introduced to reduce the amount of work that
>>>>>> nfsd_file_free() needs to do. In particular when re-exporting
>>>>>> NFS, the
>>>>>> call to filp_close() can be expensive because it synchronously
>>>>>> flushes
>>>>>> out dirty pages. That again means that some of the calls to
>>>>>> nfsd_file_dispose_list() can end up being very expensive
>>>>>> (particularly
>>>>>> the ones run by the garbage collector itself).
>>>>>
>>>>> The "no regressions" rule suggests that some kind of action needs
>>>>> to be taken. I don't have a sense of whether Jan's workload or
>>>>> NFS
>>>>> re-export is the more common use case, however.
>>>>>
>>>>> I can see that filp_close() on a file on an NFS mount could be
>>>>> costly if that file has dirty pages, due to the NFS client's
>>>>> "flush on close" semantic.
>>>>>
>>>>> Trond, are there alternatives to flushing in the nfsd_file_put()
>>>>> path? I'm willing to entertain bug fix patches rather than a
>>>>> mechanical revert of b6669305d35a.
>>>>
>>>> Yeah, I don't think we need to rush fixing this with a revert.
>>>
>>> Sorry I wasn't clear: I would prefer to apply a bug fix over
>>> sending a revert commit, and I do not have enough information
>>> yet to make that choice. Waiting a bit is OK with me.
>>>
>>>
>>>> Also because
>>>> just removing the filemap_flush() from nfsd_file_put() would keep
>>>> other
>>>> benefits of that commit while fixing the regression AFAIU. But I
>>>> think
>>>> making flushing less aggressive is desirable because as I wrote in
>>>> my other
>>>> reply, currently it is preventing pagecache from accumulating
>>>> enough dirty
>>>> data for a good IO pattern.
>>>
>>> I might even go so far as to say that a small write workload
>>> isn't especially good for solid state storage either.
>>>
>>> I know Trond is trying to address NFS re-export performance, but
>>> there appear to be some palpable effects outside of that narrow
>>> use case that need to be considered. Thus a server-side fix,
>>> rather than a fix in the NFS client used to do the re-export,
>>> seems appropriate to consider.
>>
>> Turns out it is not just the NFS client that is the problem. It is
>> rather that we need in general to be able to detect flush errors and
>> either report them directly (through commit) or we need to change the
>> boot verifier to force clients to resend the unstable writes.
>>
>> Hence, I think we're looking at something like this:
>
> Thanks for the fix! I've run the patch through some testing and your fix
> indeed restores the good IO pattern and returns the performance back to
> original levels. So feel free to add:
>
> Tested-by: Jan Kara <[email protected]>
Excellent. I'll queue this up in the NFSD tree for 5.17-rc.
>
> Honza
>
>>
>> 8<--------------------------------------------------------------------
>> From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
>> From: Trond Myklebust <[email protected]>
>> Date: Wed, 30 Mar 2022 16:55:38 -0400
>> Subject: [PATCH] nfsd: Fix a write performance regression
>>
>> The call to filemap_flush() in nfsd_file_put() is there to ensure that
>> we clear out any writes belonging to a NFSv3 client relatively quickly
>> and avoid situations where the file can't be evicted by the garbage
>> collector. It also ensures that we detect write errors quickly.
>>
>> The problem is this causes a regression in performance for some
>> workloads.
>>
>> So try to improve matters by deferring writeback until we're ready to
>> close the file, and need to detect errors so that we can force the
>> client to resend.
>>
>> Signed-off-by: Trond Myklebust <[email protected]>
>> ---
>> fs/nfsd/filecache.c | 18 +++++++++++++++---
>> 1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>> index 8bc807c5fea4..9578a6317709 100644
>> --- a/fs/nfsd/filecache.c
>> +++ b/fs/nfsd/filecache.c
>> @@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
>> return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
>> }
>>
>> +static void
>> +nfsd_file_flush(struct nfsd_file *nf)
>> +{
>> + if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
>> + nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>> +}
>> +
>> static void
>> nfsd_file_do_unhash(struct nfsd_file *nf)
>> {
>> @@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
>> return;
>> }
>>
>> - filemap_flush(nf->nf_file->f_mapping);
>> is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
>> - nfsd_file_put_noref(nf);
>> - if (is_hashed)
>> + if (!is_hashed) {
>> + nfsd_file_flush(nf);
>> + nfsd_file_put_noref(nf);
>> + } else {
>> + nfsd_file_put_noref(nf);
>> nfsd_file_schedule_laundrette();
>> + }
>> if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
>> nfsd_file_gc();
>> }
>> @@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
>> while(!list_empty(dispose)) {
>> nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>> list_del(&nf->nf_lru);
>> + nfsd_file_flush(nf);
>> nfsd_file_put_noref(nf);
>> }
>> }
>> @@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
>> while(!list_empty(dispose)) {
>> nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>> list_del(&nf->nf_lru);
>> + nfsd_file_flush(nf);
>> if (!refcount_dec_and_test(&nf->nf_ref))
>> continue;
>> if (nfsd_file_free(nf))
>> --
>> 2.35.1
>>
>>
>>
>> --
>> Trond Myklebust
>> Linux NFS client maintainer, Hammerspace
>> [email protected]
>>
>>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
--
Chuck Lever
> On Mar 31, 2022, at 10:20 AM, Chuck Lever III <[email protected]> wrote:
>
>
>
>> On Mar 31, 2022, at 9:09 AM, Jan Kara <[email protected]> wrote:
>>
>> On Wed 30-03-22 22:02:39, Trond Myklebust wrote:
>>> On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote:
>>>>
>>>>
>>>>> On Mar 30, 2022, at 12:19 PM, Jan Kara <[email protected]> wrote:
>>>>>
>>>>> On Wed 30-03-22 15:38:38, Chuck Lever III wrote:
>>>>>>> On Mar 30, 2022, at 11:03 AM, Trond Myklebust
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> during our performance testing we have noticed that commit
>>>>>>>> b6669305d35a
>>>>>>>> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has
>>>>>>>> introduced
>>>>>>>> a
>>>>>>>> performance regression when a client does random buffered
>>>>>>>> writes. The
>>>>>>>> workload on NFS client is fio running 4 processes doing
>>>>>>>> random
>>>>>>>> buffered writes to 4
>>>>>>>> different files and the files are large enough to hit dirty
>>>>>>>> limits
>>>>>>>> and
>>>>>>>> force writeback from the client. In particular the invocation
>>>>>>>> is
>>>>>>>> like:
>>>>>>>>
>>>>>>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1
>>>>>>>> --
>>>>>>>> invalidate=1 --group_reporting=1 --runtime=300 --
>>>>>>>> fallocate=posix --
>>>>>>>> ramp_time=10 --name=RandomReads-128000-4k-4 --new_group --
>>>>>>>> rw=randwrite --size=4000m --numjobs=4 --bs=4k --
>>>>>>>> filename_format=FioWorkloads.\$jobnum --end_fsync=1
>>>>>>>>
>>>>>>>> The reason why commit b6669305d35a regresses performance is
>>>>>>>> the
>>>>>>>> filemap_flush() call it adds into nfsd_file_put(). Before
>>>>>>>> this commit
>>>>>>>> writeback on the server happened from nfsd_commit() code
>>>>>>>> resulting in
>>>>>>>> rather long semisequential streams of 4k writes. After commit
>>>>>>>> b6669305d35a
>>>>>>>> all the writeback happens from filemap_flush() calls
>>>>>>>> resulting in
>>>>>>>> much
>>>>>>>> longer average seek distance (IO from different files is more
>>>>>>>> interleaved)
>>>>>>>> and about 16-20% regression in the achieved writeback
>>>>>>>> throughput when
>>>>>>>> the
>>>>>>>> backing store is rotational storage.
>>>>>>>>
>>>>>>>> I think the filemap_flush() from nfsd_file_put() is indeed
>>>>>>>> rather
>>>>>>>> aggressive and I think we'd be better off to just leave
>>>>>>>> writeback to
>>>>>>>> either
>>>>>>>> nfsd_commit() or standard dirty page cleaning happening on
>>>>>>>> the
>>>>>>>> system. I
>>>>>>>> assume the rationale for the filemap_flush() call was to make
>>>>>>>> it more
>>>>>>>> likely the file can be evicted during the garbage collection
>>>>>>>> run? Was
>>>>>>>> there
>>>>>>>> any particular problem leading to addition of this call or
>>>>>>>> was it
>>>>>>>> just "it
>>>>>>>> seemed like a good idea" thing?
>>>>>>>>
>>>>>>>> Thanks in advance for ideas.
>>>>>>>>
>>>>>>>>
>>>>>>>> Honza
>>>>>>>
>>>>>>> It was mainly introduced to reduce the amount of work that
>>>>>>> nfsd_file_free() needs to do. In particular when re-exporting
>>>>>>> NFS, the
>>>>>>> call to filp_close() can be expensive because it synchronously
>>>>>>> flushes
>>>>>>> out dirty pages. That again means that some of the calls to
>>>>>>> nfsd_file_dispose_list() can end up being very expensive
>>>>>>> (particularly
>>>>>>> the ones run by the garbage collector itself).
>>>>>>
>>>>>> The "no regressions" rule suggests that some kind of action needs
>>>>>> to be taken. I don't have a sense of whether Jan's workload or
>>>>>> NFS
>>>>>> re-export is the more common use case, however.
>>>>>>
>>>>>> I can see that filp_close() on a file on an NFS mount could be
>>>>>> costly if that file has dirty pages, due to the NFS client's
>>>>>> "flush on close" semantic.
>>>>>>
>>>>>> Trond, are there alternatives to flushing in the nfsd_file_put()
>>>>>> path? I'm willing to entertain bug fix patches rather than a
>>>>>> mechanical revert of b6669305d35a.
>>>>>
>>>>> Yeah, I don't think we need to rush fixing this with a revert.
>>>>
>>>> Sorry I wasn't clear: I would prefer to apply a bug fix over
>>>> sending a revert commit, and I do not have enough information
>>>> yet to make that choice. Waiting a bit is OK with me.
>>>>
>>>>
>>>>> Also because
>>>>> just removing the filemap_flush() from nfsd_file_put() would keep
>>>>> other
>>>>> benefits of that commit while fixing the regression AFAIU. But I
>>>>> think
>>>>> making flushing less aggressive is desirable because as I wrote in
>>>>> my other
>>>>> reply, currently it is preventing pagecache from accumulating
>>>>> enough dirty
>>>>> data for a good IO pattern.
>>>>
>>>> I might even go so far as to say that a small write workload
>>>> isn't especially good for solid state storage either.
>>>>
>>>> I know Trond is trying to address NFS re-export performance, but
>>>> there appear to be some palpable effects outside of that narrow
>>>> use case that need to be considered. Thus a server-side fix,
>>>> rather than a fix in the NFS client used to do the re-export,
>>>> seems appropriate to consider.
>>>
>>> Turns out it is not just the NFS client that is the problem. It is
>>> rather that we need in general to be able to detect flush errors and
>>> either report them directly (through commit) or we need to change the
>>> boot verifier to force clients to resend the unstable writes.
>>>
>>> Hence, I think we're looking at something like this:
>>
>> Thanks for the fix! I've run the patch through some testing and your fix
>> indeed restores the good IO pattern and returns the performance back to
>> original levels. So feel free to add:
>>
>> Tested-by: Jan Kara <[email protected]>
>
> Excellent. I'll queue this up in the NFSD tree for 5.17-rc.
Belay that. 5.18-rc.
>>
>> Honza
>>
>>>
>>> 8<--------------------------------------------------------------------
>>> From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001
>>> From: Trond Myklebust <[email protected]>
>>> Date: Wed, 30 Mar 2022 16:55:38 -0400
>>> Subject: [PATCH] nfsd: Fix a write performance regression
>>>
>>> The call to filemap_flush() in nfsd_file_put() is there to ensure that
>>> we clear out any writes belonging to a NFSv3 client relatively quickly
>>> and avoid situations where the file can't be evicted by the garbage
>>> collector. It also ensures that we detect write errors quickly.
>>>
>>> The problem is this causes a regression in performance for some
>>> workloads.
>>>
>>> So try to improve matters by deferring writeback until we're ready to
>>> close the file, and need to detect errors so that we can force the
>>> client to resend.
>>>
>>> Signed-off-by: Trond Myklebust <[email protected]>
>>> ---
>>> fs/nfsd/filecache.c | 18 +++++++++++++++---
>>> 1 file changed, 15 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>> index 8bc807c5fea4..9578a6317709 100644
>>> --- a/fs/nfsd/filecache.c
>>> +++ b/fs/nfsd/filecache.c
>>> @@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf)
>>> return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err));
>>> }
>>>
>>> +static void
>>> +nfsd_file_flush(struct nfsd_file *nf)
>>> +{
>>> + if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0)
>>> + nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>>> +}
>>> +
>>> static void
>>> nfsd_file_do_unhash(struct nfsd_file *nf)
>>> {
>>> @@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf)
>>> return;
>>> }
>>>
>>> - filemap_flush(nf->nf_file->f_mapping);
>>> is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0;
>>> - nfsd_file_put_noref(nf);
>>> - if (is_hashed)
>>> + if (!is_hashed) {
>>> + nfsd_file_flush(nf);
>>> + nfsd_file_put_noref(nf);
>>> + } else {
>>> + nfsd_file_put_noref(nf);
>>> nfsd_file_schedule_laundrette();
>>> + }
>>> if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT)
>>> nfsd_file_gc();
>>> }
>>> @@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose)
>>> while(!list_empty(dispose)) {
>>> nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>>> list_del(&nf->nf_lru);
>>> + nfsd_file_flush(nf);
>>> nfsd_file_put_noref(nf);
>>> }
>>> }
>>> @@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose)
>>> while(!list_empty(dispose)) {
>>> nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
>>> list_del(&nf->nf_lru);
>>> + nfsd_file_flush(nf);
>>> if (!refcount_dec_and_test(&nf->nf_ref))
>>> continue;
>>> if (nfsd_file_free(nf))
>>> --
>>> 2.35.1
>>>
>>>
>>>
>>> --
>>> Trond Myklebust
>>> Linux NFS client maintainer, Hammerspace
>>> [email protected]
>>>
>>>
>> --
>> Jan Kara <[email protected]>
>> SUSE Labs, CR
>
> --
> Chuck Lever
--
Chuck Lever