LinuxLists.cc - [PATCH] Smooth out NFS client writeback

2005-06-02 01:38:32

Subject: [PATCH] Smooth out NFS client writeback

Hi Trond,

The current NFS client can cause a program to stall
for long periods of time because it flushes all dirty
pages at once. The attached patch addresses this by
only writing back the amount requested by the VM
layer. It also reduces the # commit requests by
waiting for some writeback to complete before issuing
a commit. The patch also speeds up writebacks of
mmap'ed data by accumulating dirty pages but sending
commits earlier and omitting FLUSH_STABLE.

As part of the patch, I reinstated specifying the
commit range because I observed a marked speed up when
testing on a filesystem mounted from a Solaris 2.10
server.

The patch is against 2.6.12-rc5 w/NFS_ALL.

Thanks,
Shantanu

Attachments:

nfs-writeback.patch (17.72 kB)
2449169141-nfs-writeback.patch

2005-06-02 03:26:16

by Shantanu Goel

Subject: Re: [PATCH] Smooth out NFS client writeback

Here are some numbers with iozone (avg of 3 runs).
The command was run as: iozone -i0 -r4k -s64m -c -t4
I manually booted the client machine with mem=256M.
It has 2 Xeon CPUs with hyperthreading enabled. The
server is running 2.6.11. The machines are on 100Mbit
Ethernet. The filesystem is mounted with
tcp,rsize=32768,wsize=32768.

                   Stock   w/Patch
parent write        7227      6155
parent rewrite      7068      6973
children write      7826     11659
children rewrite    7083      6974

I also ran iozone in mmap mode with same options as
above but specifying -B as well. The machine hung
with the stock client so could not complete the test.
Here are the numbers with the patched client.

parent write        6350
parent rewrite      4167
children write      9375
children rewrite    4792

I also ran a latency test using the script I'm
attaching to this email. The script was invoked as:
write-test.pl 256m
It writes a 256MB file and prints the time elapsed for each
5% increment.

Stock:

5(0s)
10(0s)
15(0s)
20(0s)
25(0s)
30(7s)
35(3s)
40(1s)
45(0s)
50(0s)
55(0s)
60(7s)
65(3s)
70(0s)
75(0s)
80(1s)
85(0s)
90(7s)
95(3s)
100(0s)
sync(4s)

Throughput: 7084 KB/s (6 MB/s)

w/Patch:

5(0s)
10(0s)
15(0s)
20(0s)
25(0s)
30(0s)
35(0s)
40(0s)
45(2s)
50(1s)
55(3s)
60(2s)
65(2s)
70(2s)
75(1s)
80(2s)
85(2s)
90(2s)
95(1s)
100(2s)
sync(11s)

Throughput: 7710 KB/s (7 MB/s)
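
The latency test described above (write a file, print the time spent in each 5% slice, then time the final sync) can be approximated with the sketch below. It is not the attached write-test.pl, which is not reproduced in this thread; the function names, the 64 KB chunk size, and the use of fsync() for the "sync" line are all assumptions made for illustration.

```python
import os
import sys
import time


def parse_size(text):
    """Parse sizes such as '256m' or '64k' into bytes (suffixes k, m, g)."""
    units = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def write_test(path, total_bytes, chunk_bytes=64 * 1024, steps=20, out=sys.stdout):
    """Write total_bytes to path, reporting the time spent in each 1/steps slice.

    Returns a list of (label, seconds) pairs ending with a ("sync", seconds) entry.
    """
    chunk = b"\0" * chunk_bytes
    results = []
    written = 0
    next_step = 1
    start = time.monotonic()
    mark = start
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
            # Emit one line each time another 1/steps of the file is written.
            while next_step <= steps and written * steps >= next_step * total_bytes:
                now = time.monotonic()
                label = str(next_step * 100 // steps)
                results.append((label, now - mark))
                print(f"{label}({now - mark:.0f}s)", file=out)
                mark = now
                next_step += 1
        # Time the final flush to stable storage separately.
        before = time.monotonic()
        os.fsync(f.fileno())
        after = time.monotonic()
    results.append(("sync", after - before))
    print(f"sync({after - before:.0f}s)", file=out)
    elapsed = max(after - start, 1e-9)
    print(f"Throughput: {total_bytes / 1024 / elapsed:.0f} KB/s", file=out)
    return results
```

Calling write_test("/mnt/nfs/testfile", parse_size("256m")) against a file on the NFS mount (the path here is only a placeholder) would produce output in the same shape as the listings above. Long stalls show up as large per-slice times.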

--- Trond Myklebust <[email protected]> wrote:

> On Wed, 01.06.2005 at 18:38 (-0700), Shantanu Goel wrote:
> > Hi Trond,
> >
> > The current NFS client can cause a program to stall
> > for long periods of time because it flushes all dirty
> > pages at once. The attached patch addresses this by
> > only writing back the amount requested by the VM
> > layer. It also reduces the # commit requests by
> > waiting for some writeback to complete before issuing
> > a commit. The patch also speeds up writebacks of
> > mmap'ed data by accumulating dirty pages but sending
> > commits earlier and omitting FLUSH_STABLE.
>
> Hi Shantanu,
>
> Do you have any figures on this subject? I'd very much like to see how
> this changes the figures for both the throughput and the latency. We
> should look at both slow and fast networks (say 10Mbit, 100Mbit &
> 1GigE).
>
> My other question is how it affects stability in the case of low memory
> situations?
>
> Cheers,
> Trond

Attachments:

write-test.pl (1.65 kB)
3263050112-write-test.pl

2005-06-02 12:16:31

by Shantanu Goel

Subject: Re: [PATCH] Smooth out NFS client writeback

--- Trond Myklebust <[email protected]> wrote:
> What happens when you increase the file size to significantly beyond the
> memory size on a slow network? That is the interesting case.
> Try, for instance, booting with mem=32m and iozone -s
>

I'll run this case and report back.

> > I also ran iozone in mmap mode with same options as
> > above but specifying -B as well. The machine hung
> > with the stock client so could not complete the test.
> > Here are the numbers with the patched client.
>
> Where was the machine hanging?
>

I didn't have console access due to X. I'll try and
get a sysrq-t when I get a chance.

> BTW: how did you determine the values for NFS_WRITE_CLUSTER and
> NFS_COMMIT_CLUSTER? They appear to be completely arbitrary AFAICS.
>

NFS_WRITE_CLUSTER is a variant of SWAP_CLUSTER_MAX
used by kswapd. NFS_COMMIT_CLUSTER was chosen to
match balance_dirty_pages() which uses 4MB for
throttling a writer.

> Also, have you compared to the latest NFS_ALL kernels? They contain a
> bunch of extra latency fixes that came from Ingo's RT work.
>

This patch is on top of 2.6.12-rc5 with NFS_ALL for
2.6.12-rc4. The stock numbers I reported are with
NFS_ALL. Is there a newer version somewhere?

> Finally, please explain _why_ you have removed the FLUSH_STABLE from
> nfs_writepage()? The reason for it in the existing code is to avoid the
> extra COMMIT call in situations where we know we are already very low on
> memory. I don't see anything new in your patches that avoids these low
> memory situations.

The current FLUSH_STABLE behaviour absolutely shoots
our performance on iozone -B compared to Solaris.
The low memory situation is avoided because of the
congestion check I added in writepage(). If the inode
is congested, it redirties the page and returns. It
also issues a commit request if there are at least
NFS_WRITE_CLUSTER pages.
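
The congestion check and commit threshold described above can be modelled roughly as follows. This is a toy model under stated assumptions, not the code from the patch: the class, the counters, and the congestion limit are hypothetical, the value 32 for the write cluster is a placeholder (the thread does not give the real NFS_WRITE_CLUSTER value), and "pages" is taken to mean pages written UNSTABLE that still await a COMMIT. The only figure taken from the thread is the 4MB throttling amount, shown here in pages assuming 4 KB pages.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096                               # assumption: 4 KB pages
WRITE_CLUSTER_PAGES = 32                       # assumption: placeholder value
COMMIT_CLUSTER_PAGES = (4 << 20) // PAGE_SIZE  # 4 MB expressed in pages


@dataclass
class InodeState:
    """Hypothetical per-inode counters; not the kernel's struct nfs_inode."""
    in_flight: int = 0            # pages sent with WRITE, reply not yet seen
    uncommitted: int = 0          # pages written UNSTABLE, awaiting COMMIT
    congestion_limit: int = 256   # assumption: arbitrary limit for the model


def writepage(state):
    """Return the actions taken for one dirty page, as a list of strings."""
    if state.in_flight >= state.congestion_limit:
        # Congested: hand the page back to the VM instead of queueing more I/O.
        return ["redirty"]
    actions = ["write_unstable"]
    state.in_flight += 1
    if state.uncommitted >= WRITE_CLUSTER_PAGES:
        # Enough unstable data has accumulated to make a COMMIT worthwhile.
        actions.append("commit")
        state.uncommitted = 0
    return actions


def write_done(state, pages=1):
    """Model a WRITE reply: pages leave flight and wait for a COMMIT."""
    state.in_flight -= pages
    state.uncommitted += pages
```

The model only shows the control flow of the decision. It says nothing about whether redirtying is safe when memory is low.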

_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2005-06-28 22:44:03

by Shantanu Goel

Subject: Re: [PATCH] Smooth out NFS client writeback

Hi Trond,

Attached is the long delayed revised version of the
writeback smoothing patch this time against
2.6.12-mm2. I have omitted the commit w/range and
mmap writeback from this one. If this one is deemed
acceptable for inclusion I'll post the other 2 later.
The commit w/range really should be restored as it
makes quite a difference against Solaris NFS servers
with regular disks. I observed a difference of 2-3
MB/s under sustained writes. It makes no difference
with the Linux NFS server since it ignores the range.

I tested this version by reducing memory to 32M but
iozone OOM'ed. However, I observed the same behaviour
with the unpatched client. If I reduce dirty_ratio to
20 from 40 (the default), iozone completes without a
problem on both versions. I noticed the slab cache
gets as big as the page cache and the VM fails to take
that into account. In summary, the behaviour is at
least as good as that of the stock client in this
case.

Thanks,
Shantanu

Attachments:

nfs-writeback.patch (19.31 kB)
2449169141-nfs-writeback.patch

2005-06-29 14:15:22

by Peter Staubach

Subject: Re: Re: [PATCH] Smooth out NFS client writeback

Shantanu Goel wrote:

>Hi Trond,
>
>Attached is the long delayed revised version of the
>writeback smoothing patch this time against
>2.6.12-mm2. I have omitted the commit w/range and
>mmap writeback from this one. If this one is deemed
>acceptable for inclusion I'll post the other 2 later.
>The commit w/range really should be restored as it
>makes quite a difference against Solaris NFS servers
>with regular disks. I observed a difference of 2-3
>MB/s under sustained writes. It makes no difference
>with the Linux NFS server since it ignores the range.
>

On Solaris, at least with UFS as the underlying file system, the COMMIT
operations are processed by looking through the entire cached page list
or by doing page lookup operations on each individual page. If the entire
file is specified, ie. len = 0, then the page list is walked. If a range
is specified, then just the pages within the range are looked up.

Specifying the range can result in significantly less CPU overhead on the
server. This is why the NFSv3 COMMIT operation has a range which can be
specified... :-)
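
As a rough illustration of the cost difference described above, the sketch below counts how many pages a server would examine for a whole-file COMMIT versus a ranged one. It is a simplified model of the behaviour as described in this message, not Solaris code; the function name and the 4 KB page size are assumptions.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages


def pages_examined(cached_pages, offset, length):
    """Count the pages a server would look at for COMMIT(offset, length).

    cached_pages: length of the cached page list that a whole-file commit walks.
    length == 0 means "the entire file", as in the description above.
    """
    if length == 0:
        # Whole-file commit: walk the entire cached page list.
        return cached_pages
    # Ranged commit: one lookup per page overlapping [offset, offset + length).
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return last - first + 1
```

With 65536 cached pages (256 MB), a whole-file commit touches all of them in this model, while a commit covering a 4 MB range touches 1024.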

ps

2005-06-29 14:25:50

by Lever, Charles

Subject: RE: Re: [PATCH] Smooth out NFS client writeback

hi peter-

> On Solaris, at least with UFS as the underlying file system, the COMMIT
> operations are processed by looking through the entire cached page list
> or by doing page lookup operations on each individual page. If the entire
> file is specified, ie. len = 0, then the page list is walked. If a range
> is specified, then just the pages within the range are looked up.
>
> Specifying the range can result in significantly less CPU overhead on the
> server. This is why the NFSv3 COMMIT operation has a range which can be
> specified... :-)

a server CPU inefficiency hardly qualifies as a client bug. in the most
common cases where the client is creating and writing to a file, then
closing with a COMMIT(0,0) request, the server should be changed to
behave in a more efficient manner.

in other words, i think the client should optimize the number of
requests on the wire, and the server should optimize for using its CPU
and disks most efficiently. i haven't looked closely at shantanu's
patch, but i'm a little leery of the change if it means more wire
operations are generated than before.

2005-06-29 15:11:41

by Peter Staubach

Subject: Re: Re: [PATCH] Smooth out NFS client writeback

Lever, Charles wrote:

>hi peter-
>
>>On Solaris, at least with UFS as the underlying file system, the COMMIT
>>operations are processed by looking through the entire cached page list
>>or by doing page lookup operations on each individual page. If the entire
>>file is specified, ie. len = 0, then the page list is walked. If a range
>>is specified, then just the pages within the range are looked up.
>>
>>Specifying the range can result in significantly less CPU overhead on the
>>server. This is why the NFSv3 COMMIT operation has a range which can be
>>specified... :-)
>
>a server CPU inefficiency hardly qualifies as a client bug. in the most
>common cases where the client is creating and writing to a file, then
>closing with a COMMIT(0,0) request, the server should be changed to
>behave in a more efficient manner.
>
>in other words, i think the client should optimize the number of
>requests on the wire, and the server should optimize for using its CPU
>and disks most efficiently. i haven't looked closely at shantanu's
>patch, but i'm a little leery of the change if it means more wire
>operations are generated than before.

I wouldn't claim that this is a client side bug either. I would claim that
it is an opportunity, for very little cost, to help a server to perform
better and this makes the client, and NFS in general, look better.

I don't think that there will be more over the wire operations generated
when using the range versus always specifying the entire file. Typically,
COMMITs are done for a range of the file or for the entire file at close.
The client typically needs to know what data is marked as needing to be
committed anyway, so it isn't very hard to figure out what the range that
needs to be committed is at the same time. I would claim that a client,
which simply issues a blanket COMMIT(0,0), without already having gathered
up the buffers/pages/whatever that need committing, is broken. It will be
unsafe because it will be subject to races like more data getting written
with UNSTABLE while the COMMIT is happening. This new data may or may not
have been committed by the COMMIT.
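
The point about deriving the range from data the client already tracks can be sketched as below. This is a minimal illustration, assuming the client keeps a list of (offset, length) pairs for writes awaiting COMMIT; the function name and data layout are hypothetical and do not reflect any particular client implementation.

```python
def commit_range(requests):
    """Return (offset, count) covering every (offset, length) pair in requests.

    Returns None when nothing is waiting to be committed.
    """
    if not requests:
        return None
    start = min(off for off, _ in requests)
    end = max(off + length for off, length in requests)
    return (start, end - start)
```

Because the range is computed from the same list that is handed to the COMMIT, data written after the list was gathered is simply not covered and stays marked as uncommitted.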

Bottom line for me -- if the client can do something to help the server to
help the client and it is an overall win, then I think that it should do
so...

Thanx...

ps

2005-06-29 15:35:39

by Lever, Charles

Subject: RE: Re: [PATCH] Smooth out NFS client writeback

> I don't think that there will be more over the wire operations generated
> when using the range versus always specifying the entire file.

committing the whole file at once is simply more efficient from a
network perspective when an application is performing random writes
(unless it is using O_SYNC). there are some cases where it won't make a
difference whether a range or a whole file commit is used, but i think
it would be really hard to figure out a client-side heuristic to decide
which is better.

> I would claim that a client, which simply issues a blanket COMMIT(0,0),
> without already having gathered up the buffers/pages/whatever that need
> committing, is broken. It will be unsafe because it will be subject to
> races like more data getting written with UNSTABLE while the COMMIT is
> happening. This new data may or may not have been committed by the COMMIT.

the linux client already keeps track of the order of writes and commits
well enough that this isn't an issue.

> Bottom line for me -- if the client can do something to help the server to
> help the client and it is an overall win, then I think that it should do
> so...

we're talking about potentially adding complexity to the client and
increasing the number of write and commit operations on the wire, which
could have a negative performance impact in other environments (think
WAN). all this to optimize a particular workload against a particular
server implementation.

i'm not saying this type of work shouldn't be explored, but as we
consider the change, we should be very careful especially since we don't
have adequate performance regression test coverage yet.

2005-06-29 15:54:11

by Peter Staubach

Subject: Re: Re: [PATCH] Smooth out NFS client writeback

Lever, Charles wrote:

>>I don't think that there will be more over the wire operations generated
>>when using the range versus always specifying the entire file.
>
>committing the whole file at once is simply more efficient from a
>network perspective when an application is performing random writes
>(unless it is using O_SYNC). there are some cases where it won't make a
>difference whether a range or a whole file commit is used, but i think
>it would be really hard to figure out a client-side heuristic to decide
>which is better.

The network perspective is important, to be sure. However, so is disk
i/o.

Having done an implementation, it isn't so hard to develop the heuristics,
actually.

Please keep in mind that COMMIT operations are very expensive. They are
to be avoided whenever possible. Clients should delay as long as possible
to commit data because they may be able to avoid committing it at all.
Cases that benefit include data which is repeatedly overwritten, small
writes to the same file system block on the server, temporary files, etc.
The list goes on and on. Not going to stable storage on the server is a
huge win in performance all the way around.

>>I would claim that a client, which simply issues a blanket COMMIT(0,0),
>>without already having gathered up the buffers/pages/whatever that need
>>committing, is broken. It will be unsafe because it will be subject to
>>races like more data getting written with UNSTABLE while the COMMIT is
>>happening. This new data may or may not have been committed by the COMMIT.
>
>the linux client already keeps track of the order of writes and commits
>well enough that this isn't an issue.

Good to hear!

>>Bottom line for me -- if the client can do something to help the server to
>>help the client and it is an overall win, then I think that it should do
>>so...
>
>we're talking about potentially adding complexity to the client and
>increasing the number of write and commit operations on the wire, which
>could have a negative performance impact in other environments (think
>WAN). all this to optimize a particular workload against a particular
>server implementation.
>
>i'm not saying this type of work shouldn't be explored, but as we
>consider the change, we should be very careful especially since we don't
>have adequate performance regression test coverage yet.

I think that there may be more server implementations out there that
could benefit from the client strategy than just the Solaris with UFS
NFS server. I would guess that most, if not all, would benefit. I
even suspect that we could make the Linux server go faster if it paid
attention to the byte range specified, instead of always sync'ing the
entire file. The cost of looking at clean pages/buffers is small, but
adds up quickly... :-)

ps

2005-06-07 04:12:40

by Trond Myklebust

Subject: Re: [PATCH] Smooth out NFS client writeback

On Thu, 02.06.2005 at 05:16 (-0700), Shantanu Goel wrote:

> The low memory situation is avoided because of the
> congestion check I added in writepage(). If the inode
> is congested, it redirties the page and returns. It
> also issues a commit request if there are at least
> NFS_WRITE_CLUSTER pages.

This sounds very much like what we were doing in the early 2.6 series
when we were returning WRITEPAGE_ACTIVATE in order to achieve the same
effect. We ended up reverting those patches after learning that
systematically redirtying pages in low-memory situations may lead to
some _very_ nasty deadlocks.
The above "works" for ramdisks because they have hard limits on the
amount of physical memory they can use in the form of the ramdisk size.
NFS OTOH really shouldn't do it, since it has no such limits, and can
basically end up eating all of physical memory.

Have you BTW tested the performance of the other changes without this
part of the patch? I assume the numbers you presented last week were for
the combined changes.

Cheers,
Trond

2005-06-02 01:56:25

by Trond Myklebust

Subject: Re: [PATCH] Smooth out NFS client writeback

On Wed, 01.06.2005 at 18:38 (-0700), Shantanu Goel wrote:
> Hi Trond,
>
> The current NFS client can cause a program to stall
> for long periods of time because it flushes all dirty
> pages at once. The attached patch addresses this by
> only writing back the amount requested by the VM
> layer. It also reduces the # commit requests by
> waiting for some writeback to complete before issuing
> a commit. The patch also speeds up writebacks of
> mmap'ed data by accumulating dirty pages but sending
> commits earlier and omitting FLUSH_STABLE.

Hi Shantanu,

Do you have any figures on this subject? I'd very much like to see how
this changes the figures for both the throughput and the latency. We
should look at both slow and fast networks (say 10Mbit, 100Mbit &
1GigE).

My other question is how it affects stability in the case of low memory
situations?

Cheers,
Trond

2005-06-02 04:38:53

by Trond Myklebust

Subject: Re: [PATCH] Smooth out NFS client writeback

On Wed, 01.06.2005 at 20:26 (-0700), Shantanu Goel wrote:
> Here are some numbers with iozone (avg of 3 runs).
> The command was run as: iozone -i0 -r4k -s64m -c -t4
> I manually booted the client machine with mem=256M.

What happens when you increase the file size to significantly beyond the
memory size on a slow network? That is the interesting case.
Try, for instance, booting with mem=32m and iozone -s

> I also ran iozone in mmap mode with same options as
> above but specifying -B as well. The machine hung
> with the stock client so could not complete the test.
> Here are the numbers with the patched client.

Where was the machine hanging?

Cheers,
Trond

BTW: how did you determine the values for NFS_WRITE_CLUSTER and
NFS_COMMIT_CLUSTER? They appear to be completely arbitrary AFAICS.

Also, have you compared to the latest NFS_ALL kernels? They contain a
bunch of extra latency fixes that came from Ingo's RT work.

Finally, please explain _why_ you have removed the FLUSH_STABLE from
nfs_writepage()? The reason for it in the existing code is to avoid the
extra COMMIT call in situations where we know we are already very low on
memory. I don't see anything new in your patches that avoids these low
memory situations.

2005-06-29 22:35:03

by Shantanu Goel

Subject: Re: Re: [PATCH] Smooth out NFS client writeback

> Lever, Charles wrote:
> >committing the whole file at once is simply more efficient from a
> >network perspective when an application is performing random writes
> >(unless it is using O_SYNC). there are some cases where it won't make a
> >difference whether a range or a whole file commit is used, but i think
> >it would be really hard to figure out a client-side heuristic to decide
> >which is better.

The patch I posted is actually much better in terms of
the number of commit requests it issues compared to the
stock client. Here are some numbers for iozone -r4k -s256m -c:

Proc     Stock   Patch
write    16486   16384
commit     518      80

Here are the numbers for write throughput as seen by
the children in KB/s in the iozone test above to see
the difference that commit range makes.

Stock   Patch w/o commit   Patch w/commit
 7064               6038            10226

When I get a chance I'll modify the Linux NFS server
to honour the range as well and see what if any
difference it makes.
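
For readers who want the ratios behind the two tables above, the short script below works them out. The figures are copied from the tables; the dictionary keys and the helper name are just labels chosen for this sketch.

```python
# Figures copied from the two tables above.
writes = {"stock": 16486, "patch": 16384}
commits = {"stock": 518, "patch": 80}
throughput_kbs = {"stock": 7064, "patch_no_range": 6038, "patch_range": 10226}


def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for name in ("stock", "patch"):
    print(f"{name}: {writes[name] / commits[name]:.1f} WRITEs per COMMIT")

print(f"COMMIT count change: {percent_change(commits['patch'], commits['stock']):+.1f}%")

for name in ("patch_no_range", "patch_range"):
    change = percent_change(throughput_kbs[name], throughput_kbs["stock"])
    print(f"{name} vs stock: {change:+.1f}%")
```

On these numbers the patched client issues roughly 85% fewer COMMITs, and throughput seen by the children is about 45% above stock with the commit range and about 15% below stock without it.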

Thanks,
Shantanu
