2022-09-11 18:59:09

by Isak

Subject: nfs client strange behavior with cpuwait and memory writeback

Hi everybody!!!

I am very happy to be writing my first email to one of the Linux mailing lists.

I have read the FAQ and I know this mailing list is not a user help
desk, but I am seeing strange behaviour with memory writeback and NFS.
Maybe someone can help me. I am sorry if this is not the right
"forum".

I did three simple tests writing to the same NFS filesystem, and the
behavior of the CPU and memory is frying my brain.

The Environment:

- Red Hat Enterprise Linux 8.6, 2 vCPUs (VMware VM) and 8 GB RAM (but the
same behavior with Red Hat 7.9)

- The same NFS filesystem mounted both with and without the sync option:

1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_with_sync type nfs
(rw,relatime,sync,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx)

1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_without_sync type nfs
(rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx)

- The link between the NFS client and the NFS server is 10 Gb (fiber), and
iperf3 data shows the link runs at full speed (a rough check is sketched
below). No problems here. I know there are NFS options like nconnect to
improve performance, but I am interested in Linux kernel internals.
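
For reference, the kind of iperf3 run I mean (server address masked as
above; standard iperf3 client/server usage):

# on the nfs server
iperf3 -s
# on the nfs client: a 30-second run against the server
iperf3 -c 1x.1x.2xx.1xx -t 30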

The test:

1.- dd in /mnt/test_fs_without_sync

dd if=/dev/zero of=test.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 21.4122 s, 245 MB/s

* High cpuwait
* High nfs latency
* Writeback in use

Evidence:
https://zerobin.net/?43f9bea1953ed7aa#TaUk+K0GDhxjPq1EgJ2aAHgEyhntQ0NQzeFF51d9qI0=

https://i.stack.imgur.com/pTong.png



2.- dd in /mnt/test_fs_with_sync

dd if=/dev/zero of=test.out bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 35.6462 s, 147 MB/s

* High cpuwait
* Low nfs latency
* No writeback

Evidence:
https://zerobin.net/?0ce52c5c5d946d7a#ZeyjHFIp7B+K+65DX2RzEGlp+Oq9rCidAKL8RpKpDJ8=

https://i.stack.imgur.com/Pf1xS.png



3.- dd in /mnt/test_fs_with_sync and oflag=direct

dd if=/dev/zero of=test.out bs=1M oflag=direct count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 34.6491 s, 151 MB/s

* Low cpuwait
* Low nfs latency
* No writeback

Evidence:
https://zerobin.net/?03c4aa040a7a5323#bScEK36+Sdcz18VwKnBXNbOsi/qFt/O+qFyNj5FUs8k=

https://i.stack.imgur.com/Qs6y5.png




The questions:

I know writeback is an old issue in Linux and it seems to be the problem
here. I played with vm.dirty_background_bytes/vm.dirty_background_ratio
and vm.dirty_bytes/vm.dirty_ratio (I know only one of each pair can be
active at a time), but whatever values I put in these tunables I always
get iowait (except for dd with oflag=direct).
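
The exact values were only examples; the combinations I tried looked
something like this (numbers purely illustrative):

sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
# or the byte-based variants instead (writing one form of a pair
# zeroes the other)
sysctl -w vm.dirty_background_bytes=268435456
sysctl -w vm.dirty_bytes=1073741824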

- In test number 2: how is it possible that it has low NFS latency but
high cpuwait?

- In test number 2: how is it possible that it follows almost the same
code path as test number 1? Test number 2 uses an NFS filesystem mounted
with the sync option but seems to use the pagecache codepath (see flame graph).

- In test number 1: why isn't there a change in cpuwait behavior when the
vm.dirty tunables are changed? (I have tested a lot of combinations.)

Thank you very much!!

Best regards.


2022-09-12 10:50:36

by Jeffrey Layton

Subject: Re: nfs client strange behavior with cpuwait and memory writeback

On Sun, 2022-09-11 at 20:58 +0200, Isak wrote:
> Hi everybody!!!
>
> I am very happy writing my first email to one of the Linux mailing list.
>
> I have read the faq and i know this mailing list is not a user help
> desk but i have strange behaviour with memory write back and NFS.
> Maybe someone can help me. I am so sorry if this is not the right
> "forum".
>
> I did three simple tests writing to the same NFS filesystem and the
> behavior of the cpu and memory is extruding my brain.
>
> The Environment:
>
> - Linux RedHat 8.6, 2 vCPU (VMWare VM) and 8 GB RAM (but same behavior
> with Red Hat 7.9)
>
> - One nfs filesystem mounted with sync and without sync
>
> 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_with_sync type nfs
> (rw,relatime,sync,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx)
>
> 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_without_sync type nfs
> (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx:)
>
> - Link between nfs client and nfs server is a 10Gb (Fiber) and iperf3
> data show the link works at maximum speed. No problems here. I know
> there are nfs options like nconnect to improve performance but I am
> interested in linux kernel internals.
>
> The test:
>
> 1.- dd in /mnt/test_fs_without_sync
>
> dd if=/dev/zero of=test.out bs=1M count=5000
> 5000+0 records in
> 5000+0 records out
> 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 21.4122 s, 245 MB/s
>
> * High cpuwait
> * High nfs latency
> * Writeback in use
>
> Evidences:
> https://zerobin.net/?43f9bea1953ed7aa#TaUk+K0GDhxjPq1EgJ2aAHgEyhntQ0NQzeFF51d9qI0=
>
> https://i.stack.imgur.com/pTong.png
>
>
>
> 2.- dd in /mnt/test_fs_with_sync
>
> dd if=/dev/zero of=test.out bs=1M count=5000
> 5000+0 records in
> 5000+0 records out
> 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 35.6462 s, 147 MB/s
>
> * High cpuwait
> * Low nfs latency
> * No writeback
>
> Evidences
> https://zerobin.net/?0ce52c5c5d946d7a#ZeyjHFIp7B+K+65DX2RzEGlp+Oq9rCidAKL8RpKpDJ8=
>
> https://i.stack.imgur.com/Pf1xS.png
>
>
>
> 3.- dd in /mnt/test_fs_with_sync and oflag=direct
>
> dd if=/dev/zero of=test.out bs=1M oflag=direct count=5000
> 5000+0 records in
> 5000+0 records out
> 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 34.6491 s, 151 MB/s
>
> * Low cpuwait
> * Low nfs latency
> * No writeback
>
> Evidences:
> https://zerobin.net/?03c4aa040a7a5323#bScEK36+Sdcz18VwKnBXNbOsi/qFt/O+qFyNj5FUs8k=
>
> https://i.stack.imgur.com/Qs6y5.png
>
>
>
>
> The questions:
>
> I know write back is an old issue in linux and seems is the problem
> here.I played with vm.dirty_background_bytes/vm.dirty_background_ratio
> and vm.dirty_background_ratio/vm.dirty_background_ratio (i know only
> one is valid) but whatever value put in this tunables I always have
> iowait (except from dd with oflag=direct)
>
> - In test number 2. How is it possible that it has no nfs latency but
> has a high cpu wait?
>
> - In test number 2. How is it possible that have almost the same code
> path than test number 1? Test number 2 use a nfs filesystem mounted
> with sync option but seems to use pagecache codepath (see flame graph)
>

"sync" just means that the write codepaths do an implicit fsync of the
written range after every write. The data still goes through the
pagecache in that case. It just does a (synchronous) flush of the data
to the server and a commit after every 1M (in your case).
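
As a rough illustration (not an exact equivalent), each 1M write on the
sync mount ends up behaving much like a per-write O_SYNC write would on
the non-sync mount:

# approximate comparison only: O_SYNC makes every 1M write wait until
# the data is stable on the server before write() returns
dd if=/dev/zero of=/mnt/test_fs_without_sync/test.out bs=1M count=5000 oflag=sync

Watching the WRITE and COMMIT call counts with "nfsstat -c" while each
test runs is an easy way to see the difference.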

>
> - In test number 1. Why isn't there a change in cpuwait behavior when
> vm.dirty tunables are changed? (i have tested a lot of combinations)
>
>

Depends on which tunables you're twiddling, but you have 8G of RAM and
are writing a 5G file. All of that should fit in the pagecache without
needing to flush anything before all the writes are done. I imagine the
vm.dirty tunables don't really come into play in these tests, other than
maybe the background ones, and those shouldn't really affect your
buffered write throughput.
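
If you want to confirm that, something simple like this during each run
is enough (just an example; any similar tool will do):

# current thresholds
sysctl vm.dirty_ratio vm.dirty_background_ratio
sysctl vm.dirty_bytes vm.dirty_background_bytes
# dirty and under-writeback pagecache, sampled every second
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'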
--
Jeff Layton <[email protected]>

2022-09-12 19:09:17

by Isak

Subject: Re: nfs client strange behavior with cpuwait and memory writeback

On Mon, 12 Sept 2022 at 12:40, Jeff Layton (<[email protected]>) wrote:
>
> On Sun, 2022-09-11 at 20:58 +0200, Isak wrote:
> > Hi everybody!!!
> >
> > I am very happy writing my first email to one of the Linux mailing list.
> >
> > I have read the faq and i know this mailing list is not a user help
> > desk but i have strange behaviour with memory write back and NFS.
> > Maybe someone can help me. I am so sorry if this is not the right
> > "forum".
> >
> > I did three simple tests writing to the same NFS filesystem and the
> > behavior of the cpu and memory is extruding my brain.
> >
> > The Environment:
> >
> > - Linux RedHat 8.6, 2 vCPU (VMWare VM) and 8 GB RAM (but same behavior
> > with Red Hat 7.9)
> >
> > - One nfs filesystem mounted with sync and without sync
> >
> > 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_with_sync type nfs
> > (rw,relatime,sync,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx)
> >
> > 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_without_sync type nfs
> > (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx:)
> >
> > - Link between nfs client and nfs server is a 10Gb (Fiber) and iperf3
> > data show the link works at maximum speed. No problems here. I know
> > there are nfs options like nconnect to improve performance but I am
> > interested in linux kernel internals.
> >
> > The test:
> >
> > 1.- dd in /mnt/test_fs_without_sync
> >
> > dd if=/dev/zero of=test.out bs=1M count=5000
> > 5000+0 records in
> > 5000+0 records out
> > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 21.4122 s, 245 MB/s
> >
> > * High cpuwait
> > * High nfs latency
> > * Writeback in use
> >
> > Evidences:
> > https://zerobin.net/?43f9bea1953ed7aa#TaUk+K0GDhxjPq1EgJ2aAHgEyhntQ0NQzeFF51d9qI0=
> >
> > https://i.stack.imgur.com/pTong.png
> >
> >
> >
> > 2.- dd in /mnt/test_fs_with_sync
> >
> > dd if=/dev/zero of=test.out bs=1M count=5000
> > 5000+0 records in
> > 5000+0 records out
> > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 35.6462 s, 147 MB/s
> >
> > * High cpuwait
> > * Low nfs latency
> > * No writeback
> >
> > Evidences
> > https://zerobin.net/?0ce52c5c5d946d7a#ZeyjHFIp7B+K+65DX2RzEGlp+Oq9rCidAKL8RpKpDJ8=
> >
> > https://i.stack.imgur.com/Pf1xS.png
> >
> >
> >
> > 3.- dd in /mnt/test_fs_with_sync and oflag=direct
> >
> > dd if=/dev/zero of=test.out bs=1M oflag=direct count=5000
> > 5000+0 records in
> > 5000+0 records out
> > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 34.6491 s, 151 MB/s
> >
> > * Low cpuwait
> > * Low nfs latency
> > * No writeback
> >
> > Evidences:
> > https://zerobin.net/?03c4aa040a7a5323#bScEK36+Sdcz18VwKnBXNbOsi/qFt/O+qFyNj5FUs8k=
> >
> > https://i.stack.imgur.com/Qs6y5.png
> >
> >
> >
> >
> > The questions:
> >
> > I know write back is an old issue in linux and seems is the problem
> > here.I played with vm.dirty_background_bytes/vm.dirty_background_ratio
> > and vm.dirty_background_ratio/vm.dirty_background_ratio (i know only
> > one is valid) but whatever value put in this tunables I always have
> > iowait (except from dd with oflag=direct)
> >
> > - In test number 2. How is it possible that it has no nfs latency but
> > has a high cpu wait?
> >
> > - In test number 2. How is it possible that have almost the same code
> > path than test number 1? Test number 2 use a nfs filesystem mounted
> > with sync option but seems to use pagecache codepath (see flame graph)
> >
>
> "sync" just means that the write codepaths do an implicit fsync of the
> written range after every write. The data still goes through the
> pagecache in that case. It just does a (synchronous) flush of the data
> to the server and a commit after every 1M (in your case).

Thank you very much Jeff. Understood. My mistake. I thought that, with
the nfs sync option, the page cache was actually not used. What about
test 2 (with sync) with regard to cpuwait? It seems like a CPU accounting
"problem"? I have high cpuwait and low NFS latency (nfsiostat). If dd
is launched with oflag=direct on the same NFS filesystem (mounted with
sync), the page cache is not used and there is no cpuwait.

>
> >
> > - In test number 1. Why isn't there a change in cpuwait behavior when
> > vm.dirty tunables are changed? (i have tested a lot of combinations)
> >
> >
>
> Depends on which tunables you're twiddling, but you have 8G of RAM and
> are writing a 5G file. All of that should fit in the pagecache without
> needing to flush anything before all the writes are done. I imagine the
> vm.dirty tunables don't really come into play in these tests, other than
> maybe the background ones, and those shouldn't really affect your
> buffered write throughput.

My understanding (surely wrong) of the page cache in Linux is that we
actually have two caches: one is the "read cache" and the other is the
"write cache", a.k.a. dirty pages. The write cache should not exceed
vm.dirty_bytes or vm.dirty_ratio, so I don't think a 5 GB file on 8 GB of
RAM with a low vm.dirty_ratio gets much buffering.

Thanks a lot Jeff for your help. I really appreciate it.

Best regards.

>
> Jeff Layton <[email protected]>

2022-09-13 12:05:22

by Jeffrey Layton

Subject: Re: nfs client strange behavior with cpuwait and memory writeback

On Mon, 2022-09-12 at 21:00 +0200, Isak wrote:
> On Mon, 12 Sept 2022 at 12:40, Jeff Layton (<[email protected]>) wrote:
> >
> > On Sun, 2022-09-11 at 20:58 +0200, Isak wrote:
> > > Hi everybody!!!
> > >
> > > I am very happy writing my first email to one of the Linux mailing list.
> > >
> > > I have read the faq and i know this mailing list is not a user help
> > > desk but i have strange behaviour with memory write back and NFS.
> > > Maybe someone can help me. I am so sorry if this is not the right
> > > "forum".
> > >
> > > I did three simple tests writing to the same NFS filesystem and the
> > > behavior of the cpu and memory is extruding my brain.
> > >
> > > The Environment:
> > >
> > > - Linux RedHat 8.6, 2 vCPU (VMWare VM) and 8 GB RAM (but same behavior
> > > with Red Hat 7.9)
> > >
> > > - One nfs filesystem mounted with sync and without sync
> > >
> > > 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_with_sync type nfs
> > > (rw,relatime,sync,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx)
> > >
> > > 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_without_sync type nfs
> > > (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx:)
> > >
> > > - Link between nfs client and nfs server is a 10Gb (Fiber) and iperf3
> > > data show the link works at maximum speed. No problems here. I know
> > > there are nfs options like nconnect to improve performance but I am
> > > interested in linux kernel internals.
> > >
> > > The test:
> > >
> > > 1.- dd in /mnt/test_fs_without_sync
> > >
> > > dd if=/dev/zero of=test.out bs=1M count=5000
> > > 5000+0 records in
> > > 5000+0 records out
> > > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 21.4122 s, 245 MB/s
> > >
> > > * High cpuwait
> > > * High nfs latency
> > > * Writeback in use
> > >
> > > Evidences:
> > > https://zerobin.net/?43f9bea1953ed7aa#TaUk+K0GDhxjPq1EgJ2aAHgEyhntQ0NQzeFF51d9qI0=
> > >
> > > https://i.stack.imgur.com/pTong.png
> > >
> > >
> > >
> > > 2.- dd in /mnt/test_fs_with_sync
> > >
> > > dd if=/dev/zero of=test.out bs=1M count=5000
> > > 5000+0 records in
> > > 5000+0 records out
> > > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 35.6462 s, 147 MB/s
> > >
> > > * High cpuwait
> > > * Low nfs latency
> > > * No writeback
> > >
> > > Evidences
> > > https://zerobin.net/?0ce52c5c5d946d7a#ZeyjHFIp7B+K+65DX2RzEGlp+Oq9rCidAKL8RpKpDJ8=
> > >
> > > https://i.stack.imgur.com/Pf1xS.png
> > >
> > >
> > >
> > > 3.- dd in /mnt/test_fs_with_sync and oflag=direct
> > >
> > > dd if=/dev/zero of=test.out bs=1M oflag=direct count=5000
> > > 5000+0 records in
> > > 5000+0 records out
> > > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 34.6491 s, 151 MB/s
> > >
> > > * Low cpuwait
> > > * Low nfs latency
> > > * No writeback
> > >
> > > Evidences:
> > > https://zerobin.net/?03c4aa040a7a5323#bScEK36+Sdcz18VwKnBXNbOsi/qFt/O+qFyNj5FUs8k=
> > >
> > > https://i.stack.imgur.com/Qs6y5.png
> > >
> > >
> > >
> > >
> > > The questions:
> > >
> > > I know write back is an old issue in linux and seems is the problem
> > > here.I played with vm.dirty_background_bytes/vm.dirty_background_ratio
> > > and vm.dirty_background_ratio/vm.dirty_background_ratio (i know only
> > > one is valid) but whatever value put in this tunables I always have
> > > iowait (except from dd with oflag=direct)
> > >
> > > - In test number 2. How is it possible that it has no nfs latency but
> > > has a high cpu wait?
> > >
> > > - In test number 2. How is it possible that have almost the same code
> > > path than test number 1? Test number 2 use a nfs filesystem mounted
> > > with sync option but seems to use pagecache codepath (see flame graph)
> > >
> >
> > "sync" just means that the write codepaths do an implicit fsync of the
> > written range after every write. The data still goes through the
> > pagecache in that case. It just does a (synchronous) flush of the data
> > to the server and a commit after every 1M (in your case).
>
> Thank you very much Jeff. Understood. My mistake. I thought that, with
> the nfs sync option, page cache was actually not used. What about test
> 2 (with Sync) regarding to cpuwait? Seems like a CPU accounting
> "problem"? I have high cpuwait and low NFS latency (nfsiostat). If dd
> is launched with oflag=direct in the same NFS filesystem (mounted with
> Sync), page cache is not used and there isn't cpuwait.
>
> >
> > >
> > > - In test number 1. Why isn't there a change in cpuwait behavior when
> > > vm.dirty tunables are changed? (i have tested a lot of combinations)
> > >
> > >
> >
> > Depends on which tunables you're twiddling, but you have 8G of RAM and
> > are writing a 5G file. All of that should fit in the pagecache without
> > needing to flush anything before all the writes are done. I imagine the
> > vm.dirty tunables don't really come into play in these tests, other than
> > maybe the background ones, and those shouldn't really affect your
> > buffered write throughput.
>
> My understanding (surely wrong) about page cache in Linux is that we
> actually have two caches. One is "read cache" and the other is "write
> cache" aka dirty pages so write cache should not exceed the parameter
> vm.dirty_bytes or vm.dirty_ratio so I don't think 5gb file in 8gb RAM
> with low vm.dirty_ratio does much buffering.
>

Not exactly.

The VM has two sets of thresholds: dirty_bytes and dirty_ratio, along
with "background" versions of the same tunables. Most distros these days
don't use the "bytes" values, but work with the "ratio" ones, primarily
because memory sizes can vary wildly and that gives better results.

The dirty_ratio indicates the point where the client starts forcibly
flushing pages in order to satisfy new allocation requests. If you want
to do a write, you have to allocate pages to hold the data and that will
block until the ratio of dirty memory to clean is below the threshold.

The dirty_background_ratio indicates the point where the VM starts more
aggressively flushing data in the background, but that doesn't usually
affect userland activity. Your allocations won't block when you exceed
the background ratio, for instance so you can just keep writing and
filling memory.

Ideally, you never want to hit the dirty_ratio, and if things are tuned
well, you never will as long as background writeback is keeping up with
the rate of page dirtying.
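
As a rough worked example (assuming stock values of
dirty_background_ratio=10 and dirty_ratio=20, and treating most of your
8G as available memory): background writeback starts at roughly 0.8G of
dirty data, and writers start getting throttled at roughly 1.6G. If you
want background writeback to kick in sooner, you could try something
like:

# start background flushing earlier, keep the hard limit where it is
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.dirty_ratio=20
# note: the *_bytes and *_ratio forms are mutually exclusive; writing
# one of them zeroes the other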
--
Jeff Layton <[email protected]>

2022-09-13 19:20:02

by Isak

Subject: Re: nfs client strange behavior with cpuwait and memory writeback

Thank you very much Jeff. Understood. That explains why I have high cpu
iowait and high latency (nfsiostat) when launching a dd to an NFS
filesystem without the sync option. Surely my vm.dirty tunables are wrong
for the workload. Maybe more RAM is needed.

But what about dd to the NFS filesystem mounted with sync (test 2)? As I
understand it (from your free lessons), the vm.dirty tunables don't have
much effect in that scenario because an fsync is done after every 1 MB
write (in my case). I have high cpuwait and low NFS latency (nfsiostat).
How is this possible? This is frying my brain. If dd is launched with
oflag=direct on the same NFS filesystem (mounted with sync), the page
cache is not used and there is no cpuwait.


Thank you very much.

2022-09-13 13:51 GMT+02:00, Jeff Layton <[email protected]>:
> On Mon, 2022-09-12 at 21:00 +0200, Isak wrote:
>> On Mon, 12 Sept 2022 at 12:40, Jeff Layton (<[email protected]>)
>> wrote:
>> >
>> > On Sun, 2022-09-11 at 20:58 +0200, Isak wrote:
>> > > Hi everybody!!!
>> > >
>> > > I am very happy writing my first email to one of the Linux mailing
>> > > list.
>> > >
>> > > I have read the faq and i know this mailing list is not a user help
>> > > desk but i have strange behaviour with memory write back and NFS.
>> > > Maybe someone can help me. I am so sorry if this is not the right
>> > > "forum".
>> > >
>> > > I did three simple tests writing to the same NFS filesystem and the
>> > > behavior of the cpu and memory is extruding my brain.
>> > >
>> > > The Environment:
>> > >
>> > > - Linux RedHat 8.6, 2 vCPU (VMWare VM) and 8 GB RAM (but same
>> > > behavior
>> > > with Red Hat 7.9)
>> > >
>> > > - One nfs filesystem mounted with sync and without sync
>> > >
>> > > 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_with_sync type nfs
>> > > (rw,relatime,sync,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx)
>> > >
>> > > 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_without_sync type nfs
>> > > (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx:)
>> > >
>> > > - Link between nfs client and nfs server is a 10Gb (Fiber) and iperf3
>> > > data show the link works at maximum speed. No problems here. I know
>> > > there are nfs options like nconnect to improve performance but I am
>> > > interested in linux kernel internals.
>> > >
>> > > The test:
>> > >
>> > > 1.- dd in /mnt/test_fs_without_sync
>> > >
>> > > dd if=/dev/zero of=test.out bs=1M count=5000
>> > > 5000+0 records in
>> > > 5000+0 records out
>> > > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 21.4122 s, 245 MB/s
>> > >
>> > > * High cpuwait
>> > > * High nfs latency
>> > > * Writeback in use
>> > >
>> > > Evidences:
>> > > https://zerobin.net/?43f9bea1953ed7aa#TaUk+K0GDhxjPq1EgJ2aAHgEyhntQ0NQzeFF51d9qI0=
>> > >
>> > > https://i.stack.imgur.com/pTong.png
>> > >
>> > >
>> > >
>> > > 2.- dd in /mnt/test_fs_with_sync
>> > >
>> > > dd if=/dev/zero of=test.out bs=1M count=5000
>> > > 5000+0 records in
>> > > 5000+0 records out
>> > > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 35.6462 s, 147 MB/s
>> > >
>> > > * High cpuwait
>> > > * Low nfs latency
>> > > * No writeback
>> > >
>> > > Evidences
>> > > https://zerobin.net/?0ce52c5c5d946d7a#ZeyjHFIp7B+K+65DX2RzEGlp+Oq9rCidAKL8RpKpDJ8=
>> > >
>> > > https://i.stack.imgur.com/Pf1xS.png
>> > >
>> > >
>> > >
>> > > 3.- dd in /mnt/test_fs_with_sync and oflag=direct
>> > >
>> > > dd if=/dev/zero of=test.out bs=1M oflag=direct count=5000
>> > > 5000+0 records in
>> > > 5000+0 records out
>> > > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 34.6491 s, 151 MB/s
>> > >
>> > > * Low cpuwait
>> > > * Low nfs latency
>> > > * No writeback
>> > >
>> > > Evidences:
>> > > https://zerobin.net/?03c4aa040a7a5323#bScEK36+Sdcz18VwKnBXNbOsi/qFt/O+qFyNj5FUs8k=
>> > >
>> > > https://i.stack.imgur.com/Qs6y5.png
>> > >
>> > >
>> > >
>> > >
>> > > The questions:
>> > >
>> > > I know write back is an old issue in linux and seems is the problem
>> > > here.I played with
>> > > vm.dirty_background_bytes/vm.dirty_background_ratio
>> > > and vm.dirty_background_ratio/vm.dirty_background_ratio (i know only
>> > > one is valid) but whatever value put in this tunables I always have
>> > > iowait (except from dd with oflag=direct)
>> > >
>> > > - In test number 2. How is it possible that it has no nfs latency but
>> > > has a high cpu wait?
>> > >
>> > > - In test number 2. How is it possible that have almost the same code
>> > > path than test number 1? Test number 2 use a nfs filesystem mounted
>> > > with sync option but seems to use pagecache codepath (see flame
>> > > graph)
>> > >
>> >
>> > "sync" just means that the write codepaths do an implicit fsync of the
>> > written range after every write. The data still goes through the
>> > pagecache in that case. It just does a (synchronous) flush of the data
>> > to the server and a commit after every 1M (in your case).
>>
>> Thank you very much Jeff. Understood. My mistake. I thought that, with
>> the nfs sync option, page cache was actually not used. What about test
>> 2 (with Sync) regarding to cpuwait? Seems like a CPU accounting
>> "problem"? I have high cpuwait and low NFS latency (nfsiostat). If dd
>> is launched with oflag=direct in the same NFS filesystem (mounted with
>> Sync), page cache is not used and there isn't cpuwait.
>>
>> >
>> > >
>> > > - In test number 1. Why isn't there a change in cpuwait behavior when
>> > > vm.dirty tunables are changed? (i have tested a lot of combinations)
>> > >
>> > >
>> >
>> > Depends on which tunables you're twiddling, but you have 8G of RAM and
>> > are writing a 5G file. All of that should fit in the pagecache without
>> > needing to flush anything before all the writes are done. I imagine the
>> > vm.dirty tunables don't really come into play in these tests, other
>> > than
>> > maybe the background ones, and those shouldn't really affect your
>> > buffered write throughput.
>>
>> My understanding (surely wrong) about page cache in Linux is that we
>> actually have two caches. One is "read cache" and the other is "write
>> cache" aka dirty pages so write cache should not exceed the parameter
>> vm.dirty_bytes or vm.dirty_ratio so I don't think 5gb file in 8gb RAM
>> with low vm.dirty_ratio does much buffering.
>>
>
> Not exactly.
>
> The VM has two sets of thresholds: dirty_bytes and dirty_ratio, along
> with "background" versions of the same tunables. Most distros these days
> don't work the "bytes" values, but work with the "ratio" ones, primarily
> because memory sizes can vary wildly and that gives better results.
>
> The dirty_ratio indicates the point where the client starts forcibly
> flushing pages in order to satisfy new allocation requests. If you want
> to do a write, you have to allocate pages to hold the data and that will
> block until the ratio of dirty memory to clean is below the threshold.
>
> The dirty_background_ratio indicates the point where the VM starts more
> aggressively flushing data in the background, but that doesn't usually
> affect userland activity. Your allocations won't block when you exceed
> the background ratio, for instance so you can just keep writing and
> filling memory.
>
> Ideally, you never want to hit the dirty_ratio, and if things are tuned
> well, you never will as long as background writeback is keeping up with
> the rate of page dirtying.
> --
> Jeff Layton <[email protected]>
>

2022-09-14 11:56:21

by Jeff Layton

Subject: Re: nfs client strange behavior with cpuwait and memory writeback

On Tue, 2022-09-13 at 21:12 +0200, Isak wrote:
> Thank you very much Jeff. Understood. That explains why I have high cpu
> iowait and high latency (nfsiostat) when launching a dd to an NFS
> filesystem without the sync option. Surely my vm.dirty tunables are wrong
> for the workload. Maybe more RAM is needed.
>
> But what about dd to NFS filesystem mounted with Sync (Test 2)?. As I
> understood (from your free lessons) VM.dirty tunables don't have much sense
> in that scenario because a fsync is done after 1MB write (in my case). I
> have high cpuwait and low NFS latency (nfsiostat).How is This posible?This
> is extruding my brain. If dd is launched with oflag=direct in the same NFS
> filesystem (mounted with Sync), page cache is not used and there isn't
> cpuwait.
>
>
> Thank you very much.
>

I'm not familiar with the tool you used to collect this info, but
assuming that the "wa" column refers to the same thing as it would in
vmstat, then what you're talking about is iowait:

wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.

That just means that the CPU was waiting for I/O to complete, which is
basically what you'd expect with a sync mount. Each of those COMMIT
calls at the end of the writes is synchronous and the CPU has to wait
for it to complete.
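
A plain vmstat run alongside each dd shows the same counter, if you want
a common yardstick (generic example, nothing NFS-specific):

# the "wa" column is iowait: CPU idle time while I/O was outstanding
vmstat 1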

> 2022-09-13 13:51 GMT+02:00, Jeff Layton <[email protected]>:
> > On Mon, 2022-09-12 at 21:00 +0200, Isak wrote:
> > > On Mon, 12 Sept 2022 at 12:40, Jeff Layton (<[email protected]>)
> > > wrote:
> > > >
> > > > On Sun, 2022-09-11 at 20:58 +0200, Isak wrote:
> > > > > Hi everybody!!!
> > > > >
> > > > > I am very happy writing my first email to one of the Linux mailing
> > > > > list.
> > > > >
> > > > > I have read the faq and i know this mailing list is not a user help
> > > > > desk but i have strange behaviour with memory write back and NFS.
> > > > > Maybe someone can help me. I am so sorry if this is not the right
> > > > > "forum".
> > > > >
> > > > > I did three simple tests writing to the same NFS filesystem and the
> > > > > behavior of the cpu and memory is extruding my brain.
> > > > >
> > > > > The Environment:
> > > > >
> > > > > - Linux RedHat 8.6, 2 vCPU (VMWare VM) and 8 GB RAM (but same
> > > > > behavior
> > > > > with Red Hat 7.9)
> > > > >
> > > > > - One nfs filesystem mounted with sync and without sync
> > > > >
> > > > > 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_with_sync type nfs
> > > > > (rw,relatime,sync,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx)
> > > > >
> > > > > 1x.1x.2xx.1xx:/test_fs on /mnt/test_fs_without_sync type nfs
> > > > > (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=1x.1x.2xx.1xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=1x.1x.2xx.1xx:)
> > > > >
> > > > > - Link between nfs client and nfs server is a 10Gb (Fiber) and iperf3
> > > > > data show the link works at maximum speed. No problems here. I know
> > > > > there are nfs options like nconnect to improve performance but I am
> > > > > interested in linux kernel internals.
> > > > >
> > > > > The test:
> > > > >
> > > > > 1.- dd in /mnt/test_fs_without_sync
> > > > >
> > > > > dd if=/dev/zero of=test.out bs=1M count=5000
> > > > > 5000+0 records in
> > > > > 5000+0 records out
> > > > > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 21.4122 s, 245 MB/s
> > > > >
> > > > > * High cpuwait
> > > > > * High nfs latency
> > > > > * Writeback in use
> > > > >
> > > > > Evidences:
> > > > > https://zerobin.net/?43f9bea1953ed7aa#TaUk+K0GDhxjPq1EgJ2aAHgEyhntQ0NQzeFF51d9qI0=
> > > > >
> > > > > https://i.stack.imgur.com/pTong.png
> > > > >
> > > > >
> > > > >
> > > > > 2.- dd in /mnt/test_fs_with_sync
> > > > >
> > > > > dd if=/dev/zero of=test.out bs=1M count=5000
> > > > > 5000+0 records in
> > > > > 5000+0 records out
> > > > > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 35.6462 s, 147 MB/s
> > > > >
> > > > > * High cpuwait
> > > > > * Low nfs latency
> > > > > * No writeback
> > > > >
> > > > > Evidences
> > > > > https://zerobin.net/?0ce52c5c5d946d7a#ZeyjHFIp7B+K+65DX2RzEGlp+Oq9rCidAKL8RpKpDJ8=
> > > > >
> > > > > https://i.stack.imgur.com/Pf1xS.png
> > > > >
> > > > >
> > > > >
> > > > > 3.- dd in /mnt/test_fs_with_sync and oflag=direct
> > > > >
> > > > > dd if=/dev/zero of=test.out bs=1M oflag=direct count=5000
> > > > > 5000+0 records in
> > > > > 5000+0 records out
> > > > > 5242880000 bytes (5.2 GB, 4.9 GiB) copied, 34.6491 s, 151 MB/s
> > > > >
> > > > > * Low cpuwait
> > > > > * Low nfs latency
> > > > > * No writeback
> > > > >
> > > > > Evidences:
> > > > > https://zerobin.net/?03c4aa040a7a5323#bScEK36+Sdcz18VwKnBXNbOsi/qFt/O+qFyNj5FUs8k=
> > > > >
> > > > > https://i.stack.imgur.com/Qs6y5.png
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > The questions:
> > > > >
> > > > > I know write back is an old issue in linux and seems is the problem
> > > > > here.I played with
> > > > > vm.dirty_background_bytes/vm.dirty_background_ratio
> > > > > and vm.dirty_background_ratio/vm.dirty_background_ratio (i know only
> > > > > one is valid) but whatever value put in this tunables I always have
> > > > > iowait (except from dd with oflag=direct)
> > > > >
> > > > > - In test number 2. How is it possible that it has no nfs latency but
> > > > > has a high cpu wait?
> > > > >
> > > > > - In test number 2. How is it possible that have almost the same code
> > > > > path than test number 1? Test number 2 use a nfs filesystem mounted
> > > > > with sync option but seems to use pagecache codepath (see flame
> > > > > graph)
> > > > >
> > > >
> > > > "sync" just means that the write codepaths do an implicit fsync of the
> > > > written range after every write. The data still goes through the
> > > > pagecache in that case. It just does a (synchronous) flush of the data
> > > > to the server and a commit after every 1M (in your case).
> > >
> > > Thank you very much Jeff. Understood. My mistake. I thought that, with
> > > the nfs sync option, page cache was actually not used. What about test
> > > 2 (with Sync) regarding to cpuwait? Seems like a CPU accounting
> > > "problem"? I have high cpuwait and low NFS latency (nfsiostat). If dd
> > > is launched with oflag=direct in the same NFS filesystem (mounted with
> > > Sync), page cache is not used and there isn't cpuwait.
> > >
> > > >
> > > > >
> > > > > - In test number 1. Why isn't there a change in cpuwait behavior when
> > > > > vm.dirty tunables are changed? (i have tested a lot of combinations)
> > > > >
> > > > >
> > > >
> > > > Depends on which tunables you're twiddling, but you have 8G of RAM and
> > > > are writing a 5G file. All of that should fit in the pagecache without
> > > > needing to flush anything before all the writes are done. I imagine the
> > > > vm.dirty tunables don't really come into play in these tests, other
> > > > than
> > > > maybe the background ones, and those shouldn't really affect your
> > > > buffered write throughput.
> > >
> > > My understanding (surely wrong) about page cache in Linux is that we
> > > actually have two caches. One is "read cache" and the other is "write
> > > cache" aka dirty pages so write cache should not exceed the parameter
> > > vm.dirty_bytes or vm.dirty_ratio so I don't think 5gb file in 8gb RAM
> > > with low vm.dirty_ratio does much buffering.
> > >
> >
> > Not exactly.
> >
> > The VM has two sets of thresholds: dirty_bytes and dirty_ratio, along
> > with "background" versions of the same tunables. Most distros these days
> > don't work the "bytes" values, but work with the "ratio" ones, primarily
> > because memory sizes can vary wildly and that gives better results.
> >
> > The dirty_ratio indicates the point where the client starts forcibly
> > flushing pages in order to satisfy new allocation requests. If you want
> > to do a write, you have to allocate pages to hold the data and that will
> > block until the ratio of dirty memory to clean is below the threshold.
> >
> > The dirty_background_ratio indicates the point where the VM starts more
> > aggressively flushing data in the background, but that doesn't usually
> > affect userland activity. Your allocations won't block when you exceed
> > the background ratio, for instance so you can just keep writing and
> > filling memory.
> >
> > Ideally, you never want to hit the dirty_ratio, and if things are tuned
> > well, you never will as long as background writeback is keeping up with
> > the rate of page dirtying.
> > --
> > Jeff Layton <[email protected]>
> >

--
Jeff Layton <[email protected]>