2024-05-29 16:11:18

by Jan Kara

Subject: NFS write congestion size

Hello,

so I was investigating why random writes to a large file over NFS got
noticeably slower. The workload we use to test this is the following fio
command:

fio --direct=0 --ioengine=sync --thread --directory=/mnt --invalidate=1 \
--group_reporting=1 --runtime=300 --fallocate=posix --ramp_time=10 \
--name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite \
--size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
--filename_format='FioWorkloads.$jobnum'

Eventually I tracked the regression down to commit 6df25e58532b
("nfs: remove reliance on bdi congestion"), which changed the congestion
mechanism from the generic bdi congestion handling to an NFS-private one.
Before this commit fio achieved a throughput of 180 MB/s, after this commit
only 120 MB/s. Part of the regression was actually caused by inefficient
fsync(2) and the fact that more dirty data was cached at the time of the
last fsync after commit 6df25e58532b. After fixing fsync [1], the
throughput got to 150 MB/s, which is better but still not quite the
throughput seen before 6df25e58532b.

The reason for the remaining regression is that the bdi congestion handling
was broken: the client happily had ~8GB of outstanding IO against the server
despite the congestion limit being 256 MB. The new congestion handling
actually works, but as a result the server does not have enough dirty data
to operate on efficiently and the server disk often goes idle before the
client can send more.

I wanted to discuss possible solutions here.

Generally, 256MB is not enough even for contemporary consumer-grade disks to
max out throughput. There is a tunable, /proc/sys/fs/nfs/nfs_congestion_kb.
If I tweak it to, say, 1GB, that is enough to give the server enough data to
saturate the disk (most of the time) and fio reaches 180 MB/s as before
commit 6df25e58532b. So one solution to the problem would be to change the
default of nfs_congestion_kb to 1GB.
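
To illustrate what such a fixed limit amounts to, here is a minimal
userspace model of a byte-counted writeback gate. This is purely
illustrative; the names are made up and it is not the kernel code:

    #include <stdbool.h>
    #include <stddef.h>

    struct wb_gate {
        size_t limit;          /* bytes, e.g. nfs_congestion_kb * 1024 */
        size_t outstanding;    /* bytes queued to the server, not yet completed */
    };

    /* Asked before queueing another WRITE; false means "congested, back off". */
    static bool wb_gate_try_queue(struct wb_gate *g, size_t bytes)
    {
        if (g->outstanding + bytes > g->limit)
            return false;
        g->outstanding += bytes;
        return true;
    }

    /* Called when the server acknowledges the WRITE (or the COMMIT). */
    static void wb_gate_complete(struct wb_gate *g, size_t bytes)
    {
        g->outstanding -= bytes;
    }

The whole question below is really just what value 'limit' should have, and
who is in a position to know it.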

The problem with this tuning is that faster disks may need an even larger
nfs_congestion_kb, and the NFS client has no way of knowing what the right
value of nfs_congestion_kb is. I personally find the concept of the client
throttling writes to the server flawed. The *server* should push back (or
throttle) if the client is pushing out data too aggressively, and the client
can then react to this backpressure, because only the server knows how much
it can handle (also given the load from other clients). And I believe this
is actually what happens in practice (e.g. when I tune nfs_congestion_kb to
a really high number). So I think an even better solution may be to just
remove the write congestion handling from the client completely. The history
before commit 6df25e58532b, when congestion was effectively ignored, shows
that this is unlikely to cause any practical problems. What do people think?

Honza


[1] https://lore.kernel.org/all/[email protected]
--
Jan Kara <[email protected]>
SUSE Labs, CR


2024-05-29 17:05:36

by Trond Myklebust

Subject: Re: NFS write congestion size

On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
> Hello,
>
> so I was investigating why random writes to a large file over NFS got
> noticeably slower. The workload we use to test this is this fio
> command:
>
> fio --direct=0 --ioengine=sync --thread --directory=/mnt --
> invalidate=1 \
>     --group_reporting=1 --runtime=300 --fallocate=posix --
> ramp_time=10 \
>     --name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite
> \
>     --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --
> end_fsync=1 \
>     --filename_format='FioWorkloads.$jobnum'
>
> Eventually I've tracked down the regression to be caused by
> 6df25e58532b
> ("nfs: remove reliance on bdi congestion") which changed the
> congestion
> mechanism from a generic bdi congestion handling to NFS private one.
> Before
> this commit the fio achieved throughput of 180 MB/s, after this
> commit only
> 120 MB/s. Now part of the regression was actually caused by
> inefficient
> fsync(2) and the fact that more dirty data was cached at the time of
> the
> last fsync after commit 6df25e58532b. After fixing fsync [1], the
> throughput got to 150 MB/s so better but still not quite the
> throughput
> before 6df25e58532b.
>
> The reason for remaining regression is that bdi congestion handling
> was
> broken and the client had happily ~8GB of outstanding IO against the
> server
> despite the congestion limit was 256 MB. The new congestion handling
> actually works but as a result the server does not have enough dirty
> data
> to efficiently operate on and the server disk often gets idle before
> the
> client can send more.
>
> I wanted to discuss possible solutions here.
>
> Generally 256MB is not enough even for consumer grade contemporary
> disks to
> max out throughput. There is tunable
> /proc/sys/fs/nfs/nfs_congestion_kb.
> If I tweak it to say 1GB, that is enough to give the server enough
> data to
> saturate the disk (most of the time) and fio reaches 180MB/s as
> before
> commit 6df25e58532b. So one solution to the problem would be to
> change the
> default of nfs_congestion_kb to 1GB.
>
> Generally the problem with this tuning is that faster disks may need
> even
> larger nfs_congestion_kb, the NFS client has no way of knowing what
> the
> right value of nfs_congestion_kb is. I personally find the concept of
> client throttling writes to the server flawed. The *server* should
> push
> back (or throttle) if the client is too aggressively pushing out the
> data
> and then the client can react to this backpressure. Because only the
> server
> knows how much it can handle (also given the load from other
> clients). And
> I believe this is actually what is happening in practice (e.g. when I
> tune
> nfs_congestion_kb to really high number). So I think even better
> solution
> may be to just remove the write congestion handling from the client
> completely. The history before commit 6df25e58532b, when congestion
> was
> effectively ignored, shows that this is unlikely to cause any
> practical
> problems. What do people think?

I think we do still need a mechanism to prevent the client from pushing
more writebacks into the RPC layer when the server throttling is
causing RPC transmission queues to build up. Otherwise we end up
increasing the latency when the application is trying to do more I/O to
pages that are queued up for writeback in the RPC layer (since the
latter will be write locked).
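
As a toy illustration of that latency concern (not kernel code; everything
here is invented): once a page has been handed to the RPC layer it is
effectively write-locked until the WRITE completes, so an application
re-dirtying that page waits out the whole queueing delay:

    #define _POSIX_C_SOURCE 199309L
    #include <stdbool.h>
    #include <time.h>

    struct page_state {
        volatile bool queued_for_rpc;  /* cleared by the (imaginary) RPC completion */
    };

    /* The deeper the RPC transmit queue, the longer this loop spins. */
    static void app_rewrite_page(struct page_state *p)
    {
        const struct timespec tick = { 0, 1000000 };  /* 1 ms */

        while (p->queued_for_rpc)
            nanosleep(&tick, NULL);    /* application write blocked here */
        /* ...now the page can be modified and re-dirtied... */
    }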

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2024-05-29 21:48:19

by NeilBrown

Subject: Re: NFS write congestion size

On Thu, 30 May 2024, Trond Myklebust wrote:
> On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
> > Hello,
> >
> > so I was investigating why random writes to a large file over NFS got
> > noticeably slower. The workload we use to test this is this fio
> > command:
> >
> > fio --direct=0 --ioengine=sync --thread --directory=/mnt --
> > invalidate=1 \
> >     --group_reporting=1 --runtime=300 --fallocate=posix --
> > ramp_time=10 \
> >     --name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite
> > \
> >     --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --
> > end_fsync=1 \
> >     --filename_format='FioWorkloads.$jobnum'
> >
> > Eventually I've tracked down the regression to be caused by
> > 6df25e58532b
> > ("nfs: remove reliance on bdi congestion") which changed the
> > congestion
> > mechanism from a generic bdi congestion handling to NFS private one.
> > Before
> > this commit the fio achieved throughput of 180 MB/s, after this
> > commit only
> > 120 MB/s. Now part of the regression was actually caused by
> > inefficient
> > fsync(2) and the fact that more dirty data was cached at the time of
> > the
> > last fsync after commit 6df25e58532b. After fixing fsync [1], the
> > throughput got to 150 MB/s so better but still not quite the
> > throughput
> > before 6df25e58532b.
> >
> > The reason for remaining regression is that bdi congestion handling
> > was
> > broken and the client had happily ~8GB of outstanding IO against the
> > server
> > despite the congestion limit was 256 MB. The new congestion handling
> > actually works but as a result the server does not have enough dirty
> > data
> > to efficiently operate on and the server disk often gets idle before
> > the
> > client can send more.
> >
> > I wanted to discuss possible solutions here.
> >
> > Generally 256MB is not enough even for consumer grade contemporary
> > disks to
> > max out throughput. There is tunable
> > /proc/sys/fs/nfs/nfs_congestion_kb.
> > If I tweak it to say 1GB, that is enough to give the server enough
> > data to
> > saturate the disk (most of the time) and fio reaches 180MB/s as
> > before
> > commit 6df25e58532b. So one solution to the problem would be to
> > change the
> > default of nfs_congestion_kb to 1GB.
> >
> > Generally the problem with this tuning is that faster disks may need
> > even
> > larger nfs_congestion_kb, the NFS client has no way of knowing what
> > the
> > right value of nfs_congestion_kb is. I personally find the concept of
> > client throttling writes to the server flawed. The *server* should
> > push
> > back (or throttle) if the client is too aggressively pushing out the
> > data
> > and then the client can react to this backpressure. Because only the
> > server
> > knows how much it can handle (also given the load from other
> > clients). And
> > I believe this is actually what is happening in practice (e.g. when I
> > tune
> > nfs_congestion_kb to really high number). So I think even better
> > solution
> > may be to just remove the write congestion handling from the client
> > completely. The history before commit 6df25e58532b, when congestion
> > was
> > effectively ignored, shows that this is unlikely to cause any
> > practical
> > problems. What do people think?
>
> I think we do still need a mechanism to prevent the client from pushing
> more writebacks into the RPC layer when the server throttling is
> causing RPC transmission queues to build up. Otherwise we end up
> increasing the latency when the application is trying to do more I/O to
> pages that are queued up for writeback in the RPC layer (since the
> latter will be write locked).

If latency is the issue, could we make the throttling time-based?
So we throttle if the oldest queued request is more than X seconds old?
How easy is it to find the oldest queued request?

NeilBrown

2024-05-30 07:44:10

by Sagi Grimberg

Subject: Re: NFS write congestion size



On 29/05/2024 20:05, Trond Myklebust wrote:
> On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
>> Hello,
>>
>> so I was investigating why random writes to a large file over NFS got
>> noticeably slower. The workload we use to test this is this fio
>> command:
>>
>> fio --direct=0 --ioengine=sync --thread --directory=/mnt --
>> invalidate=1 \
>>     --group_reporting=1 --runtime=300 --fallocate=posix --
>> ramp_time=10 \
>>     --name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite
>> \
>>     --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --
>> end_fsync=1 \
>>     --filename_format='FioWorkloads.$jobnum'
>>
>> Eventually I've tracked down the regression to be caused by
>> 6df25e58532b
>> ("nfs: remove reliance on bdi congestion") which changed the
>> congestion
>> mechanism from a generic bdi congestion handling to NFS private one.
>> Before
>> this commit the fio achieved throughput of 180 MB/s, after this
>> commit only
>> 120 MB/s. Now part of the regression was actually caused by
>> inefficient
>> fsync(2) and the fact that more dirty data was cached at the time of
>> the
>> last fsync after commit 6df25e58532b. After fixing fsync [1], the
>> throughput got to 150 MB/s so better but still not quite the
>> throughput
>> before 6df25e58532b.
>>
>> The reason for remaining regression is that bdi congestion handling
>> was
>> broken and the client had happily ~8GB of outstanding IO against the
>> server
>> despite the congestion limit was 256 MB. The new congestion handling
>> actually works but as a result the server does not have enough dirty
>> data
>> to efficiently operate on and the server disk often gets idle before
>> the
>> client can send more.
>>
>> I wanted to discuss possible solutions here.
>>
>> Generally 256MB is not enough even for consumer grade contemporary
>> disks to
>> max out throughput. There is tunable
>> /proc/sys/fs/nfs/nfs_congestion_kb.
>> If I tweak it to say 1GB, that is enough to give the server enough
>> data to
>> saturate the disk (most of the time) and fio reaches 180MB/s as
>> before
>> commit 6df25e58532b. So one solution to the problem would be to
>> change the
>> default of nfs_congestion_kb to 1GB.
>>
>> Generally the problem with this tuning is that faster disks may need
>> even
>> larger nfs_congestion_kb, the NFS client has no way of knowing what
>> the
>> right value of nfs_congestion_kb is. I personally find the concept of
>> client throttling writes to the server flawed. The *server* should
>> push
>> back (or throttle) if the client is too aggressively pushing out the
>> data
>> and then the client can react to this backpressure. Because only the
>> server
>> knows how much it can handle (also given the load from other
>> clients). And
>> I believe this is actually what is happening in practice (e.g. when I
>> tune
>> nfs_congestion_kb to really high number). So I think even better
>> solution
>> may be to just remove the write congestion handling from the client
>> completely. The history before commit 6df25e58532b, when congestion
>> was
>> effectively ignored, shows that this is unlikely to cause any
>> practical
>> problems. What do people think?
> I think we do still need a mechanism to prevent the client from pushing
> more writebacks into the RPC layer when the server throttling is
> causing RPC transmission queues to build up. Otherwise we end up
> increasing the latency when the application is trying to do more I/O to
> pages that are queued up for writeback in the RPC layer (since the
> latter will be write locked).

Plus the server is likely serving multiple clients, so removing any type
of congestion handling from the client may overwhelm the server.

2024-05-30 08:32:08

by Jan Kara

Subject: Re: NFS write congestion size

On Thu 30-05-24 10:44:01, Sagi Grimberg wrote:
> On 29/05/2024 20:05, Trond Myklebust wrote:
> > On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
> > > Hello,
> > >
> > > so I was investigating why random writes to a large file over NFS got
> > > noticeably slower. The workload we use to test this is this fio
> > > command:
> > >
> > > fio --direct=0 --ioengine=sync --thread --directory=/mnt --
> > > invalidate=1 \
> > >     --group_reporting=1 --runtime=300 --fallocate=posix --
> > > ramp_time=10 \
> > >     --name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite
> > > \
> > >     --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --
> > > end_fsync=1 \
> > >     --filename_format='FioWorkloads.$jobnum'
> > >
> > > Eventually I've tracked down the regression to be caused by
> > > 6df25e58532b
> > > ("nfs: remove reliance on bdi congestion") which changed the
> > > congestion
> > > mechanism from a generic bdi congestion handling to NFS private one.
> > > Before
> > > this commit the fio achieved throughput of 180 MB/s, after this
> > > commit only
> > > 120 MB/s. Now part of the regression was actually caused by
> > > inefficient
> > > fsync(2) and the fact that more dirty data was cached at the time of
> > > the
> > > last fsync after commit 6df25e58532b. After fixing fsync [1], the
> > > throughput got to 150 MB/s so better but still not quite the
> > > throughput
> > > before 6df25e58532b.
> > >
> > > The reason for remaining regression is that bdi congestion handling
> > > was
> > > broken and the client had happily ~8GB of outstanding IO against the
> > > server
> > > despite the congestion limit was 256 MB. The new congestion handling
> > > actually works but as a result the server does not have enough dirty
> > > data
> > > to efficiently operate on and the server disk often gets idle before
> > > the
> > > client can send more.
> > >
> > > I wanted to discuss possible solutions here.
> > >
> > > Generally 256MB is not enough even for consumer grade contemporary
> > > disks to
> > > max out throughput. There is tunable
> > > /proc/sys/fs/nfs/nfs_congestion_kb.
> > > If I tweak it to say 1GB, that is enough to give the server enough
> > > data to
> > > saturate the disk (most of the time) and fio reaches 180MB/s as
> > > before
> > > commit 6df25e58532b. So one solution to the problem would be to
> > > change the
> > > default of nfs_congestion_kb to 1GB.
> > >
> > > Generally the problem with this tuning is that faster disks may need
> > > even
> > > larger nfs_congestion_kb, the NFS client has no way of knowing what
> > > the
> > > right value of nfs_congestion_kb is. I personally find the concept of
> > > client throttling writes to the server flawed. The *server* should
> > > push
> > > back (or throttle) if the client is too aggressively pushing out the
> > > data
> > > and then the client can react to this backpressure. Because only the
> > > server
> > > knows how much it can handle (also given the load from other
> > > clients). And
> > > I believe this is actually what is happening in practice (e.g. when I
> > > tune
> > > nfs_congestion_kb to really high number). So I think even better
> > > solution
> > > may be to just remove the write congestion handling from the client
> > > completely. The history before commit 6df25e58532b, when congestion
> > > was
> > > effectively ignored, shows that this is unlikely to cause any
> > > practical
> > > problems. What do people think?
> > I think we do still need a mechanism to prevent the client from pushing
> > more writebacks into the RPC layer when the server throttling is
> > causing RPC transmission queues to build up. Otherwise we end up
> > increasing the latency when the application is trying to do more I/O to
> > pages that are queued up for writeback in the RPC layer (since the
> > latter will be write locked).
>
> Plus the server is likely serving multiple clients, so removing any type
> of congestion handling from the client may overwhelm the server.

I understand this concern but before commit 6df25e58532b we effectively
didn't do any throttling for years and nobody complained. So servers
apparently know how to cope with clients sending too much IO to them.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-05-30 10:11:31

by Sagi Grimberg

Subject: Re: NFS write congestion size



On 30/05/2024 11:31, Jan Kara wrote:
> On Thu 30-05-24 10:44:01, Sagi Grimberg wrote:
>> On 29/05/2024 20:05, Trond Myklebust wrote:
>>> On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
>>>> Hello,
>>>>
>>>> so I was investigating why random writes to a large file over NFS got
>>>> noticeably slower. The workload we use to test this is this fio
>>>> command:
>>>>
>>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt --
>>>> invalidate=1 \
>>>>     --group_reporting=1 --runtime=300 --fallocate=posix --
>>>> ramp_time=10 \
>>>>     --name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite
>>>> \
>>>>     --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --
>>>> end_fsync=1 \
>>>>     --filename_format='FioWorkloads.$jobnum'
>>>>
>>>> Eventually I've tracked down the regression to be caused by
>>>> 6df25e58532b
>>>> ("nfs: remove reliance on bdi congestion") which changed the
>>>> congestion
>>>> mechanism from a generic bdi congestion handling to NFS private one.
>>>> Before
>>>> this commit the fio achieved throughput of 180 MB/s, after this
>>>> commit only
>>>> 120 MB/s. Now part of the regression was actually caused by
>>>> inefficient
>>>> fsync(2) and the fact that more dirty data was cached at the time of
>>>> the
>>>> last fsync after commit 6df25e58532b. After fixing fsync [1], the
>>>> throughput got to 150 MB/s so better but still not quite the
>>>> throughput
>>>> before 6df25e58532b.
>>>>
>>>> The reason for remaining regression is that bdi congestion handling
>>>> was
>>>> broken and the client had happily ~8GB of outstanding IO against the
>>>> server
>>>> despite the congestion limit was 256 MB. The new congestion handling
>>>> actually works but as a result the server does not have enough dirty
>>>> data
>>>> to efficiently operate on and the server disk often gets idle before
>>>> the
>>>> client can send more.
>>>>
>>>> I wanted to discuss possible solutions here.
>>>>
>>>> Generally 256MB is not enough even for consumer grade contemporary
>>>> disks to
>>>> max out throughput. There is tunable
>>>> /proc/sys/fs/nfs/nfs_congestion_kb.
>>>> If I tweak it to say 1GB, that is enough to give the server enough
>>>> data to
>>>> saturate the disk (most of the time) and fio reaches 180MB/s as
>>>> before
>>>> commit 6df25e58532b. So one solution to the problem would be to
>>>> change the
>>>> default of nfs_congestion_kb to 1GB.
>>>>
>>>> Generally the problem with this tuning is that faster disks may need
>>>> even
>>>> larger nfs_congestion_kb, the NFS client has no way of knowing what
>>>> the
>>>> right value of nfs_congestion_kb is. I personally find the concept of
>>>> client throttling writes to the server flawed. The *server* should
>>>> push
>>>> back (or throttle) if the client is too aggressively pushing out the
>>>> data
>>>> and then the client can react to this backpressure. Because only the
>>>> server
>>>> knows how much it can handle (also given the load from other
>>>> clients). And
>>>> I believe this is actually what is happening in practice (e.g. when I
>>>> tune
>>>> nfs_congestion_kb to really high number). So I think even better
>>>> solution
>>>> may be to just remove the write congestion handling from the client
>>>> completely. The history before commit 6df25e58532b, when congestion
>>>> was
>>>> effectively ignored, shows that this is unlikely to cause any
>>>> practical
>>>> problems. What do people think?
>>> I think we do still need a mechanism to prevent the client from pushing
>>> more writebacks into the RPC layer when the server throttling is
>>> causing RPC transmission queues to build up. Otherwise we end up
>>> increasing the latency when the application is trying to do more I/O to
>>> pages that are queued up for writeback in the RPC layer (since the
>>> latter will be write locked).
>> Plus the server is likely serving multiple clients, so removing any type
>> of congestion handling from the client may overwhelm the server.
> I understand this concern but before commit 6df25e58532b we effectively
> didn't do any throttling for years and nobody complained.

I don't know about the history, nor what problems people might have attributed to it.

> So servers
> apparently know how to cope with clients sending too much IO to them.

I'm not sure how an NFS server would cope with this. NFSv4 can reduce the
number of slots, but I'm not sure what an NFSv3 server would do...

Btw, I think you meant that *slower* devices may need a larger queue to
saturate, because if the device is fast, 256MB in flight is probably
enough... So you are solving for the "consumer grade contemporary disks".

2024-05-30 11:13:54

by Trond Myklebust

Subject: Re: NFS write congestion size

On Thu, 2024-05-30 at 13:01 +0300, Sagi Grimberg wrote:
>
>
> On 30/05/2024 11:31, Jan Kara wrote:
> > On Thu 30-05-24 10:44:01, Sagi Grimberg wrote:
> > > On 29/05/2024 20:05, Trond Myklebust wrote:
> > > > On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
> > > > > Hello,
> > > > >
> > > > > so I was investigating why random writes to a large file over
> > > > > NFS got
> > > > > noticeably slower. The workload we use to test this is this
> > > > > fio
> > > > > command:
> > > > >
> > > > > fio --direct=0 --ioengine=sync --thread --directory=/mnt --
> > > > > invalidate=1 \
> > > > >       --group_reporting=1 --runtime=300 --fallocate=posix --
> > > > > ramp_time=10 \
> > > > >       --name=RandomWrites-async-257024-4k-4 --new_group --
> > > > > rw=randwrite
> > > > > \
> > > > >       --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --
> > > > > end_fsync=1 \
> > > > >       --filename_format='FioWorkloads.$jobnum'
> > > > >
> > > > > Eventually I've tracked down the regression to be caused by
> > > > > 6df25e58532b
> > > > > ("nfs: remove reliance on bdi congestion") which changed the
> > > > > congestion
> > > > > mechanism from a generic bdi congestion handling to NFS
> > > > > private one.
> > > > > Before
> > > > > this commit the fio achieved throughput of 180 MB/s, after
> > > > > this
> > > > > commit only
> > > > > 120 MB/s. Now part of the regression was actually caused by
> > > > > inefficient
> > > > > fsync(2) and the fact that more dirty data was cached at the
> > > > > time of
> > > > > the
> > > > > last fsync after commit 6df25e58532b. After fixing fsync [1],
> > > > > the
> > > > > throughput got to 150 MB/s so better but still not quite the
> > > > > throughput
> > > > > before 6df25e58532b.
> > > > >
> > > > > The reason for remaining regression is that bdi congestion
> > > > > handling
> > > > > was
> > > > > broken and the client had happily ~8GB of outstanding IO
> > > > > against the
> > > > > server
> > > > > despite the congestion limit was 256 MB. The new congestion
> > > > > handling
> > > > > actually works but as a result the server does not have
> > > > > enough dirty
> > > > > data
> > > > > to efficiently operate on and the server disk often gets idle
> > > > > before
> > > > > the
> > > > > client can send more.
> > > > >
> > > > > I wanted to discuss possible solutions here.
> > > > >
> > > > > Generally 256MB is not enough even for consumer grade
> > > > > contemporary
> > > > > disks to
> > > > > max out throughput. There is tunable
> > > > > /proc/sys/fs/nfs/nfs_congestion_kb.
> > > > > If I tweak it to say 1GB, that is enough to give the server
> > > > > enough
> > > > > data to
> > > > > saturate the disk (most of the time) and fio reaches 180MB/s
> > > > > as
> > > > > before
> > > > > commit 6df25e58532b. So one solution to the problem would be
> > > > > to
> > > > > change the
> > > > > default of nfs_congestion_kb to 1GB.
> > > > >
> > > > > Generally the problem with this tuning is that faster disks
> > > > > may need
> > > > > even
> > > > > larger nfs_congestion_kb, the NFS client has no way of
> > > > > knowing what
> > > > > the
> > > > > right value of nfs_congestion_kb is. I personally find the
> > > > > concept of
> > > > > client throttling writes to the server flawed. The *server*
> > > > > should
> > > > > push
> > > > > back (or throttle) if the client is too aggressively pushing
> > > > > out the
> > > > > data
> > > > > and then the client can react to this backpressure. Because
> > > > > only the
> > > > > server
> > > > > knows how much it can handle (also given the load from other
> > > > > clients). And
> > > > > I believe this is actually what is happening in practice
> > > > > (e.g. when I
> > > > > tune
> > > > > nfs_congestion_kb to really high number). So I think even
> > > > > better
> > > > > solution
> > > > > may be to just remove the write congestion handling from the
> > > > > client
> > > > > completely. The history before commit 6df25e58532b, when
> > > > > congestion
> > > > > was
> > > > > effectively ignored, shows that this is unlikely to cause any
> > > > > practical
> > > > > problems. What do people think?
> > > > I think we do still need a mechanism to prevent the client from
> > > > pushing
> > > > more writebacks into the RPC layer when the server throttling
> > > > is
> > > > causing RPC transmission queues to build up. Otherwise we end
> > > > up
> > > > increasing the latency when the application is trying to do
> > > > more I/O to
> > > > pages that are queued up for writeback in the RPC layer (since
> > > > the
> > > > latter will be write locked).
> > > Plus the server is likely serving multiple clients, so removing
> > > any type
> > > of congestion handling from the client may overwhelm the server.
> > I understand this concern but before commit 6df25e58532b we
> > effectively
> > didn't do any throttling for years and nobody complained.


Commit 6df25e58532b doesn't add throttling. It just converts from using
the bdi-based mechanism to an NFS-specific one.
That bdi-based throttling mechanism dates at least back to 2007, although
there was code before that going back to pre-git history.

>
> don't know about the history nor what people could have attributed
> problems.
>
> >   So servers
> > apparently know how to cope with clients sending too much IO to
> > them.
>
> not sure how an nfs server would cope with this. nfsv4 can reduce
> slots,
> but not
> sure what nfsv3 server would do...
>
> btw, I think you meant that *slower* devices may need a larger queue
> to
> saturate,
> because if the device is fast, 256MB inflight is probably enough...
> So
> you are solving
> for the "consumer grade contemporary disks".
>

It is hard to do server-side congestion control with UDP, since it does not
have a native congestion mechanism to leverage.

However connection based transports are essentially a queuing mechanism
as far as the server is concerned. It can trivially push back on the
client by slowing down the rate at which it pulls RPC calls from the
transport (or stopping them altogether). That's a mechanism that works
just fine for both TCP and RDMA.
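
A toy userspace illustration of that push-back (this has nothing to do with
the actual nfsd code; the names and the backlog limit are invented): the
server simply stops calling recv() once its backlog is full, and TCP's
receive window throttles the client:

    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #define MAX_PENDING 128             /* arbitrary backlog limit */

    struct server_conn {
        int fd;
        size_t pending;                 /* RPCs received but not yet processed */
    };

    /* Called from the server's event loop when the socket is readable. */
    static void maybe_pull_rpc(struct server_conn *c, void *buf, size_t len)
    {
        ssize_t n;

        if (c->pending >= MAX_PENDING)
            return;                     /* push back: leave the bytes in the socket */

        n = recv(c->fd, buf, len, 0);
        if (n > 0)
            c->pending++;               /* hand the request off to a worker */
    }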

Additionally, NFSv4 has the session slot mechanism, and while that can
be used as a throttling mechanism, it is more about providing safe
only-once replay semantics. A prudent server implementation would not
rely on it to replace transport level throttling.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2024-05-30 11:23:24

by NeilBrown

Subject: Re: NFS write congestion size

On Thu, 30 May 2024, Sagi Grimberg wrote:
>
> On 30/05/2024 11:31, Jan Kara wrote:
> > On Thu 30-05-24 10:44:01, Sagi Grimberg wrote:
> >> On 29/05/2024 20:05, Trond Myklebust wrote:
> >>> On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
> >>>> Hello,
> >>>>
> >>>> so I was investigating why random writes to a large file over NFS got
> >>>> noticeably slower. The workload we use to test this is this fio
> >>>> command:
> >>>>
> >>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt --
> >>>> invalidate=1 \
> >>>>     --group_reporting=1 --runtime=300 --fallocate=posix --
> >>>> ramp_time=10 \
> >>>>     --name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite
> >>>> \
> >>>>     --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --
> >>>> end_fsync=1 \
> >>>>     --filename_format='FioWorkloads.$jobnum'
> >>>>
> >>>> Eventually I've tracked down the regression to be caused by
> >>>> 6df25e58532b
> >>>> ("nfs: remove reliance on bdi congestion") which changed the
> >>>> congestion
> >>>> mechanism from a generic bdi congestion handling to NFS private one.
> >>>> Before
> >>>> this commit the fio achieved throughput of 180 MB/s, after this
> >>>> commit only
> >>>> 120 MB/s. Now part of the regression was actually caused by
> >>>> inefficient
> >>>> fsync(2) and the fact that more dirty data was cached at the time of
> >>>> the
> >>>> last fsync after commit 6df25e58532b. After fixing fsync [1], the
> >>>> throughput got to 150 MB/s so better but still not quite the
> >>>> throughput
> >>>> before 6df25e58532b.
> >>>>
> >>>> The reason for remaining regression is that bdi congestion handling
> >>>> was
> >>>> broken and the client had happily ~8GB of outstanding IO against the
> >>>> server
> >>>> despite the congestion limit was 256 MB. The new congestion handling
> >>>> actually works but as a result the server does not have enough dirty
> >>>> data
> >>>> to efficiently operate on and the server disk often gets idle before
> >>>> the
> >>>> client can send more.
> >>>>
> >>>> I wanted to discuss possible solutions here.
> >>>>
> >>>> Generally 256MB is not enough even for consumer grade contemporary
> >>>> disks to
> >>>> max out throughput. There is tunable
> >>>> /proc/sys/fs/nfs/nfs_congestion_kb.
> >>>> If I tweak it to say 1GB, that is enough to give the server enough
> >>>> data to
> >>>> saturate the disk (most of the time) and fio reaches 180MB/s as
> >>>> before
> >>>> commit 6df25e58532b. So one solution to the problem would be to
> >>>> change the
> >>>> default of nfs_congestion_kb to 1GB.
> >>>>
> >>>> Generally the problem with this tuning is that faster disks may need
> >>>> even
> >>>> larger nfs_congestion_kb, the NFS client has no way of knowing what
> >>>> the
> >>>> right value of nfs_congestion_kb is. I personally find the concept of
> >>>> client throttling writes to the server flawed. The *server* should
> >>>> push
> >>>> back (or throttle) if the client is too aggressively pushing out the
> >>>> data
> >>>> and then the client can react to this backpressure. Because only the
> >>>> server
> >>>> knows how much it can handle (also given the load from other
> >>>> clients). And
> >>>> I believe this is actually what is happening in practice (e.g. when I
> >>>> tune
> >>>> nfs_congestion_kb to really high number). So I think even better
> >>>> solution
> >>>> may be to just remove the write congestion handling from the client
> >>>> completely. The history before commit 6df25e58532b, when congestion
> >>>> was
> >>>> effectively ignored, shows that this is unlikely to cause any
> >>>> practical
> >>>> problems. What do people think?
> >>> I think we do still need a mechanism to prevent the client from pushing
> >>> more writebacks into the RPC layer when the server throttling is
> >>> causing RPC transmission queues to build up. Otherwise we end up
> >>> increasing the latency when the application is trying to do more I/O to
> >>> pages that are queued up for writeback in the RPC layer (since the
> >>> latter will be write locked).
> >> Plus the server is likely serving multiple clients, so removing any type
> >> of congestion handling from the client may overwhelm the server.
> > I understand this concern but before commit 6df25e58532b we effectively
> > didn't do any throttling for years and nobody complained.
>
> don't know about the history nor what people could have attributed problems.
>
> > So servers
> > apparently know how to cope with clients sending too much IO to them.
>
> not sure how an nfs server would cope with this. nfsv4 can reduce slots,
> but not
> sure what nfsv3 server would do...

An nfsv3 server would simply respond slowly, or just drop the request.
It is not up to the client to rate-limit requests so as to make life
easier for the server.

>
> btw, I think you meant that *slower* devices may need a larger queue to
> saturate,
> because if the device is fast, 256MB inflight is probably enough... So
> you are solving
> for the "consumer grade contemporary disks".

No, faster devices. The context here is random writes. The more of
those that are in the queue on the server, the more chance it has to
re-order or coalesce requests.
We don't want the device to go idle so we want lots of data buffered.
It will take the client some moments to respond to the client-side queue
dropping below a threshold by finding some pages to write and to submit
them. We don't want those moments to be long enough for the server-side
queue to empty.

NeilBrown

2024-05-30 18:14:37

by Sagi Grimberg

Subject: Re: NFS write congestion size



On 30/05/2024 14:13, Trond Myklebust wrote:
> On Thu, 2024-05-30 at 13:01 +0300, Sagi Grimberg wrote:
>>
>> On 30/05/2024 11:31, Jan Kara wrote:
>>> On Thu 30-05-24 10:44:01, Sagi Grimberg wrote:
>>>> On 29/05/2024 20:05, Trond Myklebust wrote:
>>>>> On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
>>>>>> Hello,
>>>>>>
>>>>>> so I was investigating why random writes to a large file over
>>>>>> NFS got
>>>>>> noticeably slower. The workload we use to test this is this
>>>>>> fio
>>>>>> command:
>>>>>>
>>>>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt --
>>>>>> invalidate=1 \
>>>>>>       --group_reporting=1 --runtime=300 --fallocate=posix --
>>>>>> ramp_time=10 \
>>>>>>       --name=RandomWrites-async-257024-4k-4 --new_group --
>>>>>> rw=randwrite
>>>>>> \
>>>>>>       --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --
>>>>>> end_fsync=1 \
>>>>>>       --filename_format='FioWorkloads.$jobnum'
>>>>>>
>>>>>> Eventually I've tracked down the regression to be caused by
>>>>>> 6df25e58532b
>>>>>> ("nfs: remove reliance on bdi congestion") which changed the
>>>>>> congestion
>>>>>> mechanism from a generic bdi congestion handling to NFS
>>>>>> private one.
>>>>>> Before
>>>>>> this commit the fio achieved throughput of 180 MB/s, after
>>>>>> this
>>>>>> commit only
>>>>>> 120 MB/s. Now part of the regression was actually caused by
>>>>>> inefficient
>>>>>> fsync(2) and the fact that more dirty data was cached at the
>>>>>> time of
>>>>>> the
>>>>>> last fsync after commit 6df25e58532b. After fixing fsync [1],
>>>>>> the
>>>>>> throughput got to 150 MB/s so better but still not quite the
>>>>>> throughput
>>>>>> before 6df25e58532b.
>>>>>>
>>>>>> The reason for remaining regression is that bdi congestion
>>>>>> handling
>>>>>> was
>>>>>> broken and the client had happily ~8GB of outstanding IO
>>>>>> against the
>>>>>> server
>>>>>> despite the congestion limit was 256 MB. The new congestion
>>>>>> handling
>>>>>> actually works but as a result the server does not have
>>>>>> enough dirty
>>>>>> data
>>>>>> to efficiently operate on and the server disk often gets idle
>>>>>> before
>>>>>> the
>>>>>> client can send more.
>>>>>>
>>>>>> I wanted to discuss possible solutions here.
>>>>>>
>>>>>> Generally 256MB is not enough even for consumer grade
>>>>>> contemporary
>>>>>> disks to
>>>>>> max out throughput. There is tunable
>>>>>> /proc/sys/fs/nfs/nfs_congestion_kb.
>>>>>> If I tweak it to say 1GB, that is enough to give the server
>>>>>> enough
>>>>>> data to
>>>>>> saturate the disk (most of the time) and fio reaches 180MB/s
>>>>>> as
>>>>>> before
>>>>>> commit 6df25e58532b. So one solution to the problem would be
>>>>>> to
>>>>>> change the
>>>>>> default of nfs_congestion_kb to 1GB.
>>>>>>
>>>>>> Generally the problem with this tuning is that faster disks
>>>>>> may need
>>>>>> even
>>>>>> larger nfs_congestion_kb, the NFS client has no way of
>>>>>> knowing what
>>>>>> the
>>>>>> right value of nfs_congestion_kb is. I personally find the
>>>>>> concept of
>>>>>> client throttling writes to the server flawed. The *server*
>>>>>> should
>>>>>> push
>>>>>> back (or throttle) if the client is too aggressively pushing
>>>>>> out the
>>>>>> data
>>>>>> and then the client can react to this backpressure. Because
>>>>>> only the
>>>>>> server
>>>>>> knows how much it can handle (also given the load from other
>>>>>> clients). And
>>>>>> I believe this is actually what is happening in practice
>>>>>> (e.g. when I
>>>>>> tune
>>>>>> nfs_congestion_kb to really high number). So I think even
>>>>>> better
>>>>>> solution
>>>>>> may be to just remove the write congestion handling from the
>>>>>> client
>>>>>> completely. The history before commit 6df25e58532b, when
>>>>>> congestion
>>>>>> was
>>>>>> effectively ignored, shows that this is unlikely to cause any
>>>>>> practical
>>>>>> problems. What do people think?
>>>>> I think we do still need a mechanism to prevent the client from
>>>>> pushing
>>>>> more writebacks into the RPC layer when the server throttling
>>>>> is
>>>>> causing RPC transmission queues to build up. Otherwise we end
>>>>> up
>>>>> increasing the latency when the application is trying to do
>>>>> more I/O to
>>>>> pages that are queued up for writeback in the RPC layer (since
>>>>> the
>>>>> latter will be write locked).
>>>> Plus the server is likely serving multiple clients, so removing
>>>> any type
>>>> of congestion handling from the client may overwhelm the server.
>>> I understand this concern but before commit 6df25e58532b we
>>> effectively
>>> didn't do any throttling for years and nobody complained.
>
> Commit 6df25e58532b doesn't add throttling. It just converts from using
> the bdi based mechanism to using a NFS-specific one.
> That bdi based throttling mechanism dates at least back to 2007,
> although there was code before that dating back to pre-git history.
>
>> don't know about the history nor what people could have attributed
>> problems.
>>
>>>   So servers
>>> apparently know how to cope with clients sending too much IO to
>>> them.
>> not sure how an nfs server would cope with this. nfsv4 can reduce
>> slots,
>> but not
>> sure what nfsv3 server would do...
>>
>> btw, I think you meant that *slower* devices may need a larger queue
>> to
>> saturate,
>> because if the device is fast, 256MB inflight is probably enough...
>> So
>> you are solving
>> for the "consumer grade contemporary disks".
>>
> It is hard to do server side congestion control with UDP, since it does
> not have a native congestion mechanism to leverage.
>
> However connection based transports are essentially a queuing mechanism
> as far as the server is concerned. It can trivially push back on the
> client by slowing down the rate at which it pulls RPC calls from the
> transport (or stopping them altogether). That's a mechanism that works
> just fine for both TCP and RDMA.

Yes, I agree. TCP throttling works, it just has some downsides. I was
referring to a higher-level credit-based flow control.

>
> Additionally, NFSv4 has the session slot mechanism, and while that can
> be used as a throttling mechanism, it is more about providing safe
> only-once replay semantics.

OK, that is good to know. I assumed that it was designed to be used for
throttling.

> A prudent server implementation would not rely on it to replace transport level throttling.
>

Oh, definitely not replace. But I get your point.

2024-05-30 18:20:02

by Sagi Grimberg

Subject: Re: NFS write congestion size

>> btw, I think you meant that *slower* devices may need a larger queue to
>> saturate,
>> because if the device is fast, 256MB inflight is probably enough... So
>> you are solving
>> for the "consumer grade contemporary disks".
> No, faster devices. The context here is random writes. The more of
> those that are in the queue on the server, the more chance it has to
> re-order or coalesce requests.

Umm, what would reordering help with if the writes are random?

> We don't want the device to go idle so we want lots of data buffered.
> It will take the client some moments to respond to the client-side queue
> dropping below a threshold by finding some pages to write and to submit
> them. We don't want those moments to be long enough for the server-side
> queue to empty.

Yes, I agree. I was just arguing that the faster the device is, the lower the
completion latency is. Meaning, if a device has a write latency of 1us, you
need a shorter queue than for a device that has a write latency of 100ms.

Anyways, we're getting side-tracked...

2024-05-31 14:14:47

by Jan Kara

Subject: Re: NFS write congestion size

On Thu 30-05-24 21:19:55, Sagi Grimberg wrote:
> > > btw, I think you meant that *slower* devices may need a larger queue to
> > > saturate,
> > > because if the device is fast, 256MB inflight is probably enough... So
> > > you are solving
> > > for the "consumer grade contemporary disks".
> > No, faster devices. The context here is random writes. The more of
> > those that are in the queue on the server, the more chance it has to
> > re-order or coalesce requests.
>
> Umm, what would reorder help with if writes are random?

Well, even though the writes are random, they are still from a limited space,
so the more of them you have, the larger the contiguous chunks you are
actually going to get. But that's not really the main helping factor here.
The main benefit of allowing more than 256MB to be fed into the server is
indeed that the writeback on the server can be more contiguous. With the
256MB limit, what I see is that the server gets the data, then a commit,
which triggers writeout of the data on the server; the server then signals
completion and idles before it gets more data from the client. With, say, a
1GB or larger limit, the server has some data to write most of the time, so
overall throughput and disk utilization are much better.

So it is all about how the latency of the network, the speed of the storage,
and the batching of writeback completions play together. And yes, we can just
increase the limit, but my opinion still is that limiting the number of MB of
writeback we have outstanding against the server is the wrong way of
throttling things, because that number is difficult to get right without
knowing storage speeds, network speeds, the load from other clients and the
internals of page writeback...

I understand Trond's concerns about the latency increase as well as your
concerns about overloading the server with RPC traffic, so perhaps we do need
*some* mechanism by which the client can detect contention. But I think we
need to come up with something better than a fixed number of MB...
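
Just to sketch the kind of direction I mean (purely illustrative, nothing
that exists; it only combines the time/latency ideas mentioned above): the
client could grow its writeback budget while observed WRITE round-trip times
stay close to a baseline and shrink it when they climb, instead of using a
fixed byte count:

    #include <stddef.h>

    struct adaptive_gate {
        size_t budget;         /* current allowance of outstanding writeback bytes */
        size_t min_budget;     /* floor, e.g. a few MB */
        size_t max_budget;     /* ceiling, e.g. a fraction of the dirty threshold */
        double base_rtt;       /* smoothed "uncongested" WRITE round-trip time */
    };

    /* Called on every WRITE completion with its measured round-trip time. */
    static void gate_feedback(struct adaptive_gate *g, double rtt, size_t wsize)
    {
        if (rtt < 2.0 * g->base_rtt) {
            /* The server keeps up: additively grow the budget. */
            if (g->budget + wsize < g->max_budget)
                g->budget += wsize;
        } else {
            /* Latency is climbing: the server is pushing back, halve the budget. */
            g->budget /= 2;
            if (g->budget < g->min_budget)
                g->budget = g->min_budget;
        }
        /* base_rtt itself would be updated with a slow-moving average elsewhere. */
    }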

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR