2008-08-20 04:30:47

by Shehjar Tikoo

[permalink] [raw]
Subject: Server bottleneck(?) due to large record write() buffer size from client app

Hi All

If I understand it correctly, there are three points at which the
Linux NFS client sends NFS write requests:

1. Inside nfs_flush_incompatible(), where it needs to send writes as
stable because the pages are required for a new write request from an
application. I think this happens only under high memory pressure.

2. Inside nfs_file_write(), when nfs_do_fsync() is called if the file
was opened with O_SYNC.

3. When the file is closed, any remaining writes are flushed out as
unstable and then the final commit is sent.

In some of the tests I am running, I see a drastic fall in write
throughput between a record size (i.e. the size of the buffer handed
to the write() syscall) of 32 KB and record sizes of, say, 50 MB and
100 MB. This fall is seen for NFS wsize values of 32k, 64k and 1M,
and with tcp_slot_table_entries values of 16, 64, 96 and 128. The
test files are opened without O_SYNC over NFS mounted with the sync
option. The client is a big machine with 16 logical processors and
16 GB of RAM.

I suspect that the fall happens because the NFS client stack sends
all the NFS writes as unstable until the file gets closed, when it
sends the final commit request. Since the write() record sizes are
pretty big, throughput drops because the final commit takes
extraordinarily long while the whole 100 MB is committed at the
server, resulting in lower aggregate throughput.

Is this understanding correct?

Can this behaviour be modified so that the client uses its knowledge
of the write() buffer size to initiate writeback earlier, instead of
committing the full 100 MB to the server in one go?

Thanks
Shehjar


2008-08-20 19:15:27

by Trond Myklebust

[permalink] [raw]
Subject: Re: Server bottleneck(?) due to large record write() buffer size from client app

On Wed, 2008-08-20 at 14:14 +1000, Shehjar Tikoo wrote:
> Hi All
>
> If I understand it correctly, there are three points at which the
> Linux NFS client sends NFS write requests:
>
> 1. Inside nfs_flush_incompatible(), where it needs to send writes as
> stable because the pages are required for a new write request from an
> application. I think this happens only under high memory pressure.
>
> 2. Inside nfs_file_write(), when nfs_do_fsync() is called if the file
> was opened with O_SYNC.
>
> 3. When the file is closed, any remaining writes are flushed out as
> unstable and then the final commit is sent.
>
> In some of the tests I am running, I see a drastic fall in write
> throughput between a record size (i.e. the size of the buffer handed
> to the write() syscall) of 32 KB and record sizes of, say, 50 MB and
> 100 MB. This fall is seen for NFS wsize values of 32k, 64k and 1M,
> and with tcp_slot_table_entries values of 16, 64, 96 and 128. The
> test files are opened without O_SYNC over NFS mounted with the sync
> option. The client is a big machine with 16 logical processors and
> 16 GB of RAM.
>
> I suspect that the fall happens because the NFS client stack sends
> all the NFS writes as unstable until the file gets closed, when it
> sends the final commit request. Since the write() record sizes are
> pretty big, throughput drops because the final commit takes
> extraordinarily long while the whole 100 MB is committed at the
> server, resulting in lower aggregate throughput.
>
> Is this understanding correct?
>
> Can this behaviour be modified so that the client uses its knowledge
> of the write() buffer size to initiate writeback earlier, instead of
> committing the full 100 MB to the server in one go?

You fail to mention which kernels you are using for your testing, but
in most recent kernels you should be able to adjust the pdflush
background writeback rates using the tunables in /proc/sys/vm.

Cheers
Trond


2008-08-21 02:21:37

by Shehjar Tikoo

[permalink] [raw]
Subject: Re: Server bottleneck(?) due to large record write() buffer size from client app

Trond Myklebust wrote:
> On Wed, 2008-08-20 at 14:14 +1000, Shehjar Tikoo wrote:
>> If I understand it correctly, there are three points at which the
>> Linux NFS client sends NFS write requests:
>>
>> 1. Inside nfs_flush_incompatible(), where it needs to send writes
>> as stable because the pages are required for a new write request
>> from an application. I think this happens only under high memory
>> pressure.
>>
>> 2. Inside nfs_file_write(), when nfs_do_fsync() is called if the
>> file was opened with O_SYNC.
>>
>> 3. When the file is closed, any remaining writes are flushed out
>> as unstable and then the final commit is sent.
>>
>> In some of the tests I am running, I see a drastic fall in write
>> throughput between a record size (i.e. the size of the buffer
>> handed to the write() syscall) of 32 KB and record sizes of, say,
>> 50 MB and 100 MB. This fall is seen for NFS wsize values of 32k,
>> 64k and 1M, and with tcp_slot_table_entries values of 16, 64, 96
>> and 128. The test files are opened without O_SYNC over NFS mounted
>> with the sync option. The client is a big machine with 16 logical
>> processors and 16 GB of RAM.
>>
>> I suspect that the fall happens because the NFS client stack
>> sends all the NFS writes as unstable until the file gets closed,
>> when it sends the final commit request. Since the write() record
>> sizes are pretty big, throughput drops because the final commit
>> takes extraordinarily long while the whole 100 MB is committed at
>> the server, resulting in lower aggregate throughput.
>>
>> Is this understanding correct?
>>
>> Can this behaviour be modified so that the client uses its
>> knowledge of the write() buffer size to initiate writeback
>> earlier, instead of committing the full 100 MB to the server in
>> one go?
>
> You fail to mention which kernels you are using for your testing,
> but in most recent kernels you should be able to adjust the pdflush
> background writeback rates using the tunables in /proc/sys/vm.
>

The server is running 2.6.26 and the client is running 2.6.27-rc3.

By changing the pdflush settings on the client, I'd be changing the
settings for the whole system. Is there a procfs entry or any other
config parameter that lets me lower the number of write requests
buffered at the client before the commit request is sent?

Thanks
Shehjar




2008-08-21 19:20:21

by Trond Myklebust

[permalink] [raw]
Subject: Re: Server bottleneck(?) due to large record write() buffer size from client app

On Thu, 2008-08-21 at 12:05 +1000, Shehjar Tikoo wrote:
> By changing the pdflush settings on the client, I'd be changing the
> settings for the whole system. Is there a procfs entry or any other
> config parameter that lets me lower the number of write requests
> buffered at the client before the commit request is sent?

Write buffering is not something which is under the control of the NFS
filesystem: it is entirely managed by the VM. The filesystem only
enforces the close-to-open cache consistency requirements, which of
course are specific to NFS.

Trond


2008-08-22 01:05:26

by Shehjar Tikoo

[permalink] [raw]
Subject: Re: Server bottleneck(?) due to large record write() buffer size from client app

Trond Myklebust wrote:
> On Thu, 2008-08-21 at 12:05 +1000, Shehjar Tikoo wrote:
>> By changing pdflush settings on the client, I'd be changing the
>> settings for the whole system. Is there a proc FS entry or any
>> other config param that lets me lower the number of write
>> requests buffered at client before the commit request is sent?
>
> Write buffering is not something which is under the control of the
> NFS filesystem: it is entirely managed by the VM. The filesystem
> only enforces the close-to-open cache consistency requirements,
> which of course are specific to NFS.

True, but I see that there are at least two cases where the NFS
client will send out the outstanding requests:

1. In nfs_flush_incompatible() and through the
balance_dirty_pages_ratelimited() code, which is essentially an
interface into the VM.

2. In a few other places in the VFS, where it sends the requests
after checking whether the file was opened with O_SYNC or whether the
inode is IS_SYNC.

WRT point 2, if it's possible to send the requests out explicitly for
these two flags, i.e. O_SYNC or S_SYNC, it might be possible to
optimize the transmit behaviour for large write() record sizes, which
are otherwise left uncommitted either:

1. till the full record has been transmitted as UNSTABLE,

2. or till the file is closed,

...resulting in a big flush at the server when the commit does come
in.

See the attached file for a plot of what I am observing. The plot is
based on data for 2.6.26, but I see the same behaviour on 2.6.27-rc3.

Thanks
Shehjar


Attachments:
nfs_xfs_recsize_write_128slots_1mwsize.png (4.78 kB)