Hello!

I've been debugging an NFS performance regression with recent kernels. It
seems to be at least partially related to the following behavior of NFS
(which has been there for a long time, AFAICT). Suppose the following
workload:

fio --direct=0 --ioengine=sync --thread --directory=/test --invalidate=1 \
--group_reporting=1 --runtime=100 --fallocate=posix --ramp_time=10 \
--name=RandomWrites-async --new_group --rw=randwrite --size=32000m \
--numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
--filename_format='FioWorkloads.$jobnum'

So we do 4k buffered random writes from 4 threads into 4 different files.
Now the interesting behavior comes on the final fsync(2). What I observe is
that the NFS server gets a stream of 4-8k writes which have the 'stable'
flag set. For each such write the server performs the write and then calls
fsync(2). Since by the time fio calls fsync(2) on the NFS client there are
some 6-8 GB worth of dirty pages to write, and the server effectively ends
up writing each individual 4k page as an O_SYNC write, the throughput is
not great...
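
In other words, each of those stable 4k WRITEs costs the server roughly the
equivalent of the following (an illustrative userspace sketch only;
stable_write() is a made-up helper, not the actual nfsd code path):

        #include <unistd.h>

        /*
         * Illustrative sketch: a WRITE with stable = NFS_FILE_SYNC obliges
         * the server to get both data and metadata onto stable storage
         * before replying, so every small write effectively turns into a
         * pwrite() followed by a full fsync().
         */
        static ssize_t stable_write(int fd, const void *buf, size_t len, off_t off)
        {
                ssize_t ret = pwrite(fd, buf, len, off);

                if (ret < 0)
                        return ret;
                if (fsync(fd) < 0)
                        return -1;
                return ret;
        }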

The reason why the client sets the 'stable' flag for each page write seems
to be that nfs_writepages() issues writes with FLUSH_COND_STABLE for
WB_SYNC_ALL writeback and nfs_pgio_rpcsetup() has this logic:

        switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
        case 0:
                break;
        case FLUSH_COND_STABLE:
                if (nfs_reqs_to_commit(cinfo))
                        break;
                fallthrough;
        default:
                hdr->args.stable = NFS_FILE_SYNC;
        }

but since this is the final fsync(2), there are no more requests to commit
and so we set the NFS_FILE_SYNC flag.

Now I'd think the client is being stupid in submitting so many
NFS_FILE_SYNC writes instead of submitting them all as async writes and
then issuing a commit. I.e., the switch above in nfs_pgio_rpcsetup() could
gain something like:

        if (count > <small_magic_number>)
                break;
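
In context that would look something like the sketch below
(<small_magic_number> is of course still a placeholder, and count is the
size of the request being set up):

        switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
        case 0:
                break;
        case FLUSH_COND_STABLE:
                if (nfs_reqs_to_commit(cinfo))
                        break;
                /* Sketch: large flushes go out unstable and get a COMMIT */
                if (count > <small_magic_number>)
                        break;
                fallthrough;
        default:
                hdr->args.stable = NFS_FILE_SYNC;
        }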

But I'm not 100% sure this is the correct thing to do since I'm not 100%
sure about the FLUSH_COND_STABLE requirements. On the other hand it could
also be argued that the NFS server could be more clever and batch the
fsync(2)s for many sync writes to the same file. But there the heuristic
is less clear.

So what do people think?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, 2024-05-23 at 18:54 +0200, Jan Kara wrote:
> Hello!
>
> I've been debugging an NFS performance regression with recent kernels. It
> seems to be at least partially related to the following behavior of NFS
> (which has been there for a long time, AFAICT). Suppose the following
> workload:
>
> fio --direct=0 --ioengine=sync --thread --directory=/test --invalidate=1 \
> --group_reporting=1 --runtime=100 --fallocate=posix --ramp_time=10 \
> --name=RandomWrites-async --new_group --rw=randwrite --size=32000m \
> --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
> --filename_format='FioWorkloads.$jobnum'
>
> So we do 4k buffered random writes from 4 threads into 4 different files.
> Now the interesting behavior comes on the final fsync(2). What I observe is
> that the NFS server gets a stream of 4-8k writes which have the 'stable'
> flag set. For each such write the server performs the write and then calls
> fsync(2). Since by the time fio calls fsync(2) on the NFS client there are
> some 6-8 GB worth of dirty pages to write, and the server effectively ends
> up writing each individual 4k page as an O_SYNC write, the throughput is
> not great...
>
> The reason why the client sets the 'stable' flag for each page write seems
> to be that nfs_writepages() issues writes with FLUSH_COND_STABLE for
> WB_SYNC_ALL writeback and nfs_pgio_rpcsetup() has this logic:
>
>         switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
>         case 0:
>                 break;
>         case FLUSH_COND_STABLE:
>                 if (nfs_reqs_to_commit(cinfo))
>                         break;
>                 fallthrough;
>         default:
>                 hdr->args.stable = NFS_FILE_SYNC;
>         }
>
> but since this is the final fsync(2), there are no more requests to commit
> and so we set the NFS_FILE_SYNC flag.
>
> Now I'd think the client is being stupid in submitting so many
> NFS_FILE_SYNC writes instead of submitting them all as async writes and
> then issuing a commit. I.e., the switch above in nfs_pgio_rpcsetup() could
> gain something like:
>
>         if (count > <small_magic_number>)
>                 break;
>
> But I'm not 100% sure this is the correct thing to do since I'm not 100%
> sure about the FLUSH_COND_STABLE requirements. On the other hand it could
> also be argued that the NFS server could be more clever and batch the
> fsync(2)s for many sync writes to the same file. But there the heuristic
> is less clear.
>
> So what do people think?

We can probably remove that case FLUSH_COND_STABLE in
nfs_pgio_rpcsetup() altogether, since we have the following just before
the call to nfs_pgio_rpcsetup():

        if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
            (desc->pg_moreio || nfs_reqs_to_commit(&cinfo)))
                desc->pg_ioflags &= ~FLUSH_COND_STABLE;

The above is telling you that if we're flushing because we cannot
coalesce any more in __nfs_pageio_add_request(), then we do an unstable
write. Ditto if there are already unstable requests waiting for a
COMMIT.
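
I.e., roughly something like this (untested sketch, just to illustrate the
idea):

        switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
        case 0:
                break;
        default:
                /*
                 * FLUSH_COND_STABLE has already been cleared in the pageio
                 * layer whenever an unstable write + COMMIT is appropriate,
                 * so anything that still carries a stable flag here really
                 * does want NFS_FILE_SYNC.
                 */
                hdr->args.stable = NFS_FILE_SYNC;
        }
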
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
Hello Trond!
On Thu 23-05-24 18:00:01, Trond Myklebust wrote:
> On Thu, 2024-05-23 at 18:54 +0200, Jan Kara wrote:
> > [...]
> >
> > So what do people think?
>
> We can probably remove that case FLUSH_COND_STABLE in
> nfs_pgio_rpcsetup() altogether, since we have the following just before
> the call to nfs_pgio_rpcsetup():
>
>         if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
>             (desc->pg_moreio || nfs_reqs_to_commit(&cinfo)))
>                 desc->pg_ioflags &= ~FLUSH_COND_STABLE;

Yeah, I was somewhat wondering about that as well. But my point was that
the client should apparently be dropping FLUSH_COND_STABLE in more cases
because currently fsync(2) submits several GB worth of dirty pages, each
page with NFS_FILE_SYNC set. Which is suboptimal... I'll try to understand
better why none of the above conditions is met.

> The above is telling you that if we're flushing because we cannot
> coalesce any more in __nfs_pageio_add_request(), then we do an unstable
> write. Ditto if there are already unstable requests waiting for a
> COMMIT.

Thanks for the explanation! It helps me better understand the NFS page state
machinery :)
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 24-05-24 15:17:55, Jan Kara wrote:
> On Thu 23-05-24 18:00:01, Trond Myklebust wrote:
> > On Thu, 2024-05-23 at 18:54 +0200, Jan Kara wrote:
> > > [...]
> > >
> > > So what do people think?
> > We can probably remove that case FLUSH_COND_STABLE in
> > nfs_pgio_rpcsetup() altogether, since we have the following just before
> > the call to nfs_pgio_rpcsetup():
> >
> >         if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
> >             (desc->pg_moreio || nfs_reqs_to_commit(&cinfo)))
> >                 desc->pg_ioflags &= ~FLUSH_COND_STABLE;
>
> Yeah, I was somewhat wondering about that as well. But my point was that
> the client should apparently be dropping FLUSH_COND_STABLE in more cases
> because currently fsync(2) submits several GB worth of dirty pages, each
> page with NFS_FILE_SYNC set. Which is suboptimal... I'll try to understand
> better why none of the above conditions is met.

OK, I think I've tracked the problem down. Patch submitted [1].
Honza
[1] https://lore.kernel.org/r/[email protected]
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, 23 May 2024 at 18:55, Jan Kara <[email protected]> wrote:
>
> Hello!
>
> I've been debugging an NFS performance regression with recent kernels. It
> seems to be at least partially related to the following behavior of NFS
> (which has been there for a long time, AFAICT). Suppose the following
> workload:
>
> fio --direct=0 --ioengine=sync --thread --directory=/test --invalidate=1 \
> --group_reporting=1 --runtime=100 --fallocate=posix --ramp_time=10 \
> --name=RandomWrites-async --new_group --rw=randwrite --size=32000m \
> --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
> --filename_format='FioWorkloads.$jobnum'
Where do you get the fio command from?
Dan
--
Dan Shelton - Cluster Specialist Win/Lin/Bsd
On Fri 24-05-24 19:08:51, Dan Shelton wrote:
> On Thu, 23 May 2024 at 18:55, Jan Kara <[email protected]> wrote:
> >
> > Hello!
> >
> > I've been debugging an NFS performance regression with recent kernels. It
> > seems to be at least partially related to the following behavior of NFS
> > (which has been there for a long time, AFAICT). Suppose the following
> > workload:
> >
> > fio --direct=0 --ioengine=sync --thread --directory=/test --invalidate=1 \
> > --group_reporting=1 --runtime=100 --fallocate=posix --ramp_time=10 \
> > --name=RandomWrites-async --new_group --rw=randwrite --size=32000m \
> > --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
> > --filename_format='FioWorkloads.$jobnum'
>
> Where do you get the fio command from?

Well, this is a somewhat hand-edited fio command that our QA runs as part
of the performance testing of our kernels.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR