Hi,
I've been looking at the write throughput with NFSv3, and played around
a little. Here's a patch that seems to increase iozone's write throughput
from 6MB/s to close to 10MB/s on my test machine.
(The improvement in rewrite throughput is less pronounced). I'm still
doing some testing on this, but I would still appreciate some feedback,
especially from people seeing throughput problems.
Olaf
--
Olaf Kirch | The Hardware Gods hate me.
[email protected] |
---------------+
On Mon, Aug 02, 2004 at 06:24:49PM +0200, Olaf Kirch wrote:
> @@ -810,6 +811,22 @@
> }
> last_ino = inode->i_ino;
> last_dev = inode->i_sb->s_dev;
> + } else if (err >= 0 && !stable) {
> + /* If we've been writing several pages, schedule them
> + * for the disk immediately. The client may be streaming
> + * and we don't want to hang on a huge journal sync when the
> + * commit comes in
> + */
> + struct address_space *mapping;
> +
> + /* This assumes a minimum page size of 1K, and will issue
> + * a filemap_flushfast call every 64 pages written by the
> + * client. */
> + if ((cnt & 1023) == 0
> + && ((offset / cnt) & 63) == 0
> + && (mapping = inode->i_mapping) != NULL
> + && !bdi_write_congested(mapping->backing_dev_info))
> + filemap_flushfast(mapping);
> }
>
> dprintk("nfsd: write complete err=%d\n", err);
Olaf, I think this patch has problems.
First, the way the v3 server is supposed to work is that normal page
cache pressure pushes pages from unstable writes to disk before the
COMMIT call arrives from the client. The best way to achieve this
for a dedicated NFS server box is tuning the pdflush parameters
to be more aggressive about writing back dirty pages, e.g. bumping
down the following in /proc/sys/vm: dirty_background_ratio, dirty_ratio,
dirty_writeback_centisecs, and dirty_expire_centisecs. I have to
admit I've not tried this yet on 2.6 but the equivalent on 2.4 has
been generally useful.
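For concreteness, on 2.6 the tuning amounts to writing smaller values
into a handful of /proc/sys/vm files. The values below are purely
illustrative, not recommendations (untested sketch):

#include <stdio.h>

/* Push the 2.6 writeback thresholds down so that dirty pages from
 * unstable WRITEs get flushed well before the COMMIT arrives.
 * Values are examples only; tune to taste. */
static void set_vm_tunable(const char *name, int value)
{
	char path[128];
	FILE *fp;

	snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
	if ((fp = fopen(path, "w")) == NULL) {
		perror(path);
		return;
	}
	fprintf(fp, "%d\n", value);
	fclose(fp);
}

int main(void)
{
	set_vm_tunable("dirty_background_ratio", 2);      /* start background writeback earlier */
	set_vm_tunable("dirty_ratio", 10);                /* throttle writers earlier */
	set_vm_tunable("dirty_expire_centisecs", 1000);   /* pages count as old after 10s */
	set_vm_tunable("dirty_writeback_centisecs", 100); /* wake pdflush every 1s */
	return 0;
}

An init script doing the equivalent writes is of course just as good.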
I think another useful approach would be to writeback pages which
have been written by NFS unstable writes at a faster rate than pages
written by local applications, i.e. add a new /proc/sys/vm sysctl like
nfs_dirty_writeback_centisecs and a per-page flag.  With a separate
sysctl the default value can be smaller so that you get the desired
behaviour for NFS pages without the sysadmin having to do page cache
tuning or perturbing the behaviour of local IO.
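The sysctl half of that is the easy part - a new entry in the vm_table
in kernel/sysctl.c, along these lines (every nfs_* name below is
invented for illustration; the real work is the per-page flag and
teaching the writeback path to honour the shorter interval):

int nfs_dirty_writeback_centisecs = 100;	/* example default: 1 second */

static ctl_table vm_table[] = {
	/* ... existing dirty_* entries ... */
	{
		.ctl_name	= VM_NFS_DIRTY_WB_CS,	/* hypothetical new enum value */
		.procname	= "nfs_dirty_writeback_centisecs",
		.data		= &nfs_dirty_writeback_centisecs,
		.maxlen		= sizeof(nfs_dirty_writeback_centisecs),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,	/* or whatever the dirty_* entries use */
	},
	/* ... */
};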
The justification for this approach is that data in such pages is
most likely also stored in clients' page caches too. Recent IRIX
releases do this, and I have an open bug to implement something like
that in Linux.
Second, I have several problems with the heuristics for choosing when
to call filemap_flushfast().
For example, imagine the disk backend is a hardware RAID5 with a
stripe size of 128K or greater and the client is doing streaming
32K WRITE calls. With your patch, every second WRITE call will now
try to write half a RAID stripe unit, requiring the RAID controller
to read the other half to update parity, which will significantly
hurt performance.  Similar bad things happen if the client is doing
strided or random writes of 1024 B at offsets which are multiples
of 64 KB.
If the disk writes are being pushed by the normal page cache
mechanisms, then the normal page cache and filesystem write clustering
has at least some chance (and the state, e.g. XFS is aware of the
hardware RAID stripe parameters) to construct writes of an appropriate
size. Whether the page cache and fs actually do the right thing is
another matter, but that's where the responsibility lies.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Monday August 2, [email protected] wrote:
> Hi,
>
> I've been looking at the write throughput with NFSv3, and played around
> a little. Here's a patch that seems to increase iozone's write throughput
> from 6MB/s to close to 10MB/s on my test machine.
>
> (The improvement in rewrite throughput is less pronounced). I'm still
> doing some testing on this, but I would still appreciate some feedback,
> especially from people seeing throughput problems.
Interesting idea, and probably worth pursuing, but it does look rather
"hackish".
The comment sounds very filesystem-specific. Which filesystem(s) are
you testing this with? ext3?
It would be nice if this sort of functionality could go in the
filesystem. ext3 already has a tunable for the commit_interval
("commit="). It seems like what you are wanting is a similar tunable
that is measured in blocks rather than seconds.  Would that be right,
or have I missed something important?
NeilBrown
>
> Olaf
> --
> Olaf Kirch | The Hardware Gods hate me.
> [email protected] |
> ---------------+
> Index: linux-2.6.5/fs/nfsd/vfs.c
> ===================================================================
> --- linux-2.6.5.orig/fs/nfsd/vfs.c 2004-08-02 14:48:02.000000000 +0200
> +++ linux-2.6.5/fs/nfsd/vfs.c 2004-08-02 17:54:28.000000000 +0200
> @@ -45,6 +45,7 @@
> #include <linux/quotaops.h>
> #include <linux/dnotify.h>
> #include <linux/xattr_acl.h>
> +#include <linux/backing-dev.h>
>
> #include <asm/uaccess.h>
>
> @@ -810,6 +811,22 @@
> }
> last_ino = inode->i_ino;
> last_dev = inode->i_sb->s_dev;
> + } else if (err >= 0 && !stable) {
> + /* If we've been writing several pages, schedule them
> + * for the disk immediately. The client may be streaming
> + * and we don't want to hang on a huge journal sync when the
> + * commit comes in
> + */
> + struct address_space *mapping;
> +
> + /* This assumes a minimum page size of 1K, and will issue
> + * a filemap_flushfast call every 64 pages written by the
> + * client. */
> + if ((cnt & 1023) == 0
> + && ((offset / cnt) & 63) == 0
> + && (mapping = inode->i_mapping) != NULL
> + && !bdi_write_congested(mapping->backing_dev_info))
> + filemap_flushfast(mapping);
> }
>
> dprintk("nfsd: write complete err=%d\n", err);
On Tue, Aug 03, 2004 at 12:10:18PM +1000, Greg Banks wrote:
> > + if ((cnt & 1023) == 0
> > + && ((offset / cnt) & 63) == 0
> First, the way the v3 server is supposed to work is that normal page
> cache pressure pushes pages from unstable writes to disk before the
> COMMIT call arrives from the client. The best way to achieve this
> for a dedicated NFS server box is tuning the pdflush parameters
> to be more aggressive about writing back dirty pages, e.g. bumping
> down the following in /proc/sys/vm: dirty_background_ratio, dirty_ratio,
> dirty_writeback_centisecs, and dirty_expire_centisecs. I have to
Yes and no. Can we expect every user to fiddle with the pdflush
tunables to get an NFS server that performs reasonably well?
> I think another useful approach would be to writeback pages which
> have been written by NFS unstable writes at a faster rate than pages
> written by local applications, i.e. add a new /proc/sys/vm sysctl like
> nfs_dirty_writeback_centisecs and a per-page flag.
That may be a useful solution, too. My patch basically does what
fadvise(WONTNEED) does.
> For example, imagine the disk backend is a hardware RAID5 with a
> stripe size of 128K or greater and the client is doing streaming
> 32K WRITE calls. With your patch, every second WRITE call will now
> try to write half a RAID stripe unit,
No, it doesn't. If you look at the if() expression, you'll see it
writes every 64 client-size pages. In the worst case that's every 64K,
but for Linux clients that's every 256K, which is a reasonable size
for IDE DMA as well as most RAID configurations.
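Spelled out, the test is:

/* restated from the patch: only flush when cnt is a multiple of 1K
 * and the offset sits (to within one write) on a multiple of 64*cnt:
 *	cnt = 1024  ->  flush at every  64 KB boundary (worst case)
 *	cnt = 4096  ->  flush at every 256 KB boundary (Linux client pages)
 */
if ((cnt & 1023) == 0
    && ((offset / cnt) & 63) == 0
    && (mapping = inode->i_mapping) != NULL
    && !bdi_write_congested(mapping->backing_dev_info))
	filemap_flushfast(mapping);

(assuming, as the patch does, that cnt matches the client's page size).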
> size. Whether the page cache and fs actually do the right thing is
> another matter, but that's where the responsibility lies.
I agree.
Olaf
--
Olaf Kirch | The Hardware Gods hate me.
[email protected] |
---------------+
On Tue, Aug 03, 2004 at 08:02:13AM +0200, Olaf Kirch wrote:
> On Tue, Aug 03, 2004 at 12:10:18PM +1000, Greg Banks wrote:
>
> > First, the way the v3 server is supposed to work is that normal page
> > cache pressure pushes pages from unstable writes to disk before the
> > COMMIT call arrives from the client. The best way to achieve this
> > for a dedicated NFS server box is tuning the pdflush parameters
> > to be more aggressive about writing back dirty pages, e.g. bumping
> > down the following in /proc/sys/vm: dirty_background_ratio, dirty_ratio,
> > dirty_writeback_centisecs, and dirty_expire_centisecs. I have to
>
> Yes and no. Can we expect every user to fiddle with the pdflush
> tunables to get an NFS server that performs reasonably well?
No, of course not, hence the next idea.
> > I think another useful approach would be to writeback pages which
> > have been written by NFS unstable writes at a faster rate than pages
> > written by local applications, i.e. add a new /proc/sys/vm sysctl like
> > nfs_dirty_writeback_centisecs and a per-page flag.
>
> That may be a useful solution, too. My patch basically does what
> fadvise(WONTNEED) does.
Sure, the key question is when and for how many pages. You don't
really have enough information in nfsd_write() to tell that safely.
> > For example, imagine the disk backend is a hardware RAID5 with a
> > stripe size of 128K or greater and the client is doing streaming
> > 32K WRITE calls. With your patch, every second WRITE call will now
> > try to write half a RAID stripe unit,
>
> No, it doesn't. If you look at the if() expression, you'll see it
> writes every 64 client-size pages.
> > > + if ((cnt & 1023) == 0
> > > + && ((offset / cnt) & 63) == 0
It writes every time `offset' is a multiple of 64 times `cnt' and
`cnt' is a multiple of 1024. At this point `cnt' is the length
of the data received in the WRITE call, which has only a vague
relationship to the client page size.
> In the worst case that's every 64K,
> but for Linux clients that's every 256K which is a reasonable size
> for IDE DMA, as well as most RAID configurations
We have configurations with hardware RAID stripe sizes up to 4MB.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Tue, Aug 03, 2004 at 05:55:06PM +1000, Greg Banks wrote:
> I think another useful approach would be to writeback pages which
> have been written by NFS unstable writes at a faster rate than pages
> written by local applications, i.e. add a new /proc/sys/vm sysctl like
> nfs_dirty_writeback_centisecs and a per-page flag.
The problem with this approach is that we have no access to the
pages. nfsd_write goes through writev.
> > That may be a useful solution, too. My patch basically does what
> > fadvise(WONTNEED) does.
>
> Sure, the key question is when and for how many pages. You don't
> really have enough information in nfsd_write() to tell that safely.
Well, for streaming writes "everything we've written so far" is a
reasonable approximation. Random writes may receive a penalty, I
admit.
> It writes every time `offset' is a multiple of 64 times `cnt' and
> `cnt' is a multiple of 1024. At this point `cnt' is the length
> of the data received in the WRITE call, which has only a vague
> relationship to the client page size.
Okay, I should have been more precise: the test tries to make
sure we're seeing a full wsize worth of data, which is usually a
good indication of writes being streamed, and tries to lump
enough of them together to allow the file system to make an
intelligent decision.
I'm not claiming this is god's wisdom - I'm trying out ideas :)
Olaf
--
Olaf Kirch | The Hardware Gods hate me.
[email protected] |
---------------+
On Tue, Aug 03, 2004 at 10:09:13AM +0200, Olaf Kirch wrote:
> On Tue, Aug 03, 2004 at 05:55:06PM +1000, Greg Banks wrote:
> > I think another useful approach would be to writeback pages which
> > have been written by NFS unstable writes at a faster rate than pages
> > written by local applications, i.e. add a new /proc/sys/vm sysctl like
> > nfs_dirty_writeback_centisecs and a per-page flag.
>
> The problem with this approach is that we have no access to the
> pages. nfsd_write goes through writev.
I understand that the IRIX approach involves passing a special
flag through the equivalent of vfs_writev(). For Linux you
could probably do this with a magic flag in the struct file.
Obviously this is non-trivial.
> > > That may be a useful solution, too. My patch basically does what
> > > fadvise(WONTNEED) does.
> >
> > Sure, the key question is when and for how many pages. You don't
> > really have enough information in nfsd_write() to tell that safely.
>
> Well, for streaming writes "everything we've written so far" is a
> reasonable approximation. Random writes may receive a penalty, I
> admit.
Also reverse writes, and writes of many complete small files (each
too small to trigger the streaming heuristic).  Doing the work in the
page cache has a better chance of handling those cases.
> > It writes every time `offset' is a multiple of 64 times `cnt' and
> > `cnt' is a multiple of 1024. At this point `cnt' is the length
> > of the data received in the WRITE call, which has only a vague
> > relationship to the client page size.
>
> Okay, I should have been more precise: the test tries to make
> sure we're seeing a full wsize worth of data, which is usually a
> good indication of writes being streamed, and tries to lump
> enough of them together to allow the file system to make an
> intelligent decision.
The two problems are:
1. the heuristic is too simple and can be fooled by a number of
non-streaming access patterns which won't benefit from the
early flush. You could fix this by keeping per-file state
like the readahead state, but...
2. even when the heuristic does detect a streaming write it may
be too early to usefully flush data.
In both cases the page cache has a better chance of getting it right.
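If you did want to stay with a heuristic in nfsd, point 1 would mean
keeping a little per-file write state, much as nfsd already keeps
per-file readahead parameters.  A sketch - every name and threshold
below is invented:

/* hypothetical per-file streaming-write state, looked up per WRITE
 * call in the same way as the readahead parameter cache */
struct wstream_parms {
	dev_t		ws_dev;
	ino_t		ws_ino;
	loff_t		ws_next_offset;	/* where a sequential writer writes next */
	unsigned int	ws_streak;	/* consecutive sequential WRITEs seen */
};

/* returns nonzero once enough strictly sequential WRITEs have been
 * seen to justify an early asynchronous flush; 16 is arbitrary */
static int wstream_detect(struct wstream_parms *ws, loff_t offset,
			  unsigned long cnt)
{
	if (offset != ws->ws_next_offset) {
		ws->ws_streak = 0;
	} else if (++ws->ws_streak >= 16) {
		ws->ws_streak = 0;
		ws->ws_next_offset = offset + cnt;
		return 1;
	}
	ws->ws_next_offset = offset + cnt;
	return 0;
}

But that is exactly the kind of state the page cache already has.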
> I'm not claiming this is god's wisdom - I'm trying out ideas :)
Sure.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Hi folks,
I've been looking at the problem from a different angle...
Theory:
The main bottleneck is that we spend a long time in commit(),
blocking other WRITE calls from making any progress (thereby
stalling all NFS clients). The reason is that we take inode->i_sem
in nfsd_sync, but the writev() code wants to grab the same
semaphore.
Circumstantial Evidence:
I've been doing some tests with the latencies of WRITE and COMMIT, using a
single stream write. The average time we spend in nfsd_write is minuscule,
usually it's less than 2 milliseconds. However when a commit comes in,
we take a hit there as well - something around 500 ms for reiser, and
400 ms for ext3. Syncing to reiser frequently takes up to 1.2 seconds,
while the 400 ms for ext3 is pretty constant.
Right now, nfsd_sync calls
filemap_fdatawrite
filp->f_op->fsync
filemap_fdatawait
all under the i_sem. However, it seems we don't need the i_sem for
the filemap_* functions (is that valid? at least sync_page_range
doesn't take it). So I changed the code to make it grab i_sem only for the fsync
call, but unfortunately, that doesn't seem to make much of a difference,
as I found out. Most of the time taken by a commit is spent in fsync
(the delta between the fsync latency and the overall commit latency is
usually less than 5 ms, i.e. ~1%).
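Roughly, the variant I tested looks like this (a sketch from memory
against 2.6.5, not the exact diff):

static void
nfsd_sync(struct file *filp)
{
	struct dentry *dentry = filp->f_dentry;
	struct inode *inode = dentry->d_inode;

	filemap_fdatawrite(inode->i_mapping);	/* queue dirty pages, no i_sem */
	if (filp->f_op && filp->f_op->fsync) {
		down(&inode->i_sem);
		filp->f_op->fsync(filp, dentry, 0);	/* journal sync under i_sem */
		up(&inode->i_sem);
	}
	filemap_fdatawait(inode->i_mapping);	/* wait for the I/O, no i_sem */
}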
I also changed nfsd_sync to call filemap_fdatawrite_range instead of
filemap_fdatawrite, but that doesn't make a noticeable difference either.
I then re-enabled my flushfast hack, and the commit latencies went
down to 30 ms on ext3, with the occasional spike of 300 ms. On reiser,
the commit latency went down to something like 50 ms on average.
(The reiserfs rewrite case was fairly bad, however. Rewrite over NFS
on top of reiser is fairly slow to begin with, much slower than write;
and the gain from the flushfast patch is minimal - but that's a different
story)
Conclusion:
So this at least supports my theory that the commits are throttling the
writes quite a bit. For the sake of completeness, I did some more iozone
measurements, and on write/rewrite the performance gain is about 50%
on both reiser and ext3, for a single client. I would think for several
clients writing concurrently, the gain should be even more pronounced,
but I haven't run these tests yet.
I'm wondering what could happen if we change nfsd_sync to not take the
i_sem at all... I'll talk to a few VFS folks around here and try to find out.
PS:
Another thing I noticed was that the commit calls sent by the Linux
client (2.6.5) are not evenly distributed over time. Much of the time,
the client will call COMMIT 4-6 times a second, and then all of a sudden
I see 30-80 calls a second several times in a row.
Olaf
--
Olaf Kirch | The Hardware Gods hate me.
[email protected] |
---------------+
On Tue, Aug 03, 2004 at 12:32:14PM +0200, Olaf Kirch wrote:
> I'm wondering what could happen if we change nfsd_sync to not take the
> i_sem at all... I'll talk to a few VFS folks around here and try to find out.
Just for fun, I changed nfsd_sync to not take i_sem, and the
performance gain was about halfway between the unmodified
nfsd and the one with the flushfast hack.
I think the reason this isn't better than flushfast is this: when several
commit calls for the same file come in (and as I mentioned, the Linux
client does send up to 80 commits per second), we end up with all nfsd
threads handling one commit, sitting in filemap_fdatawait or fsync,
waiting for the same file to be flushed.
Olaf
--
Olaf Kirch | The Hardware Gods hate me.
[email protected] |
---------------+
On Tue, Aug 03, 2004 at 12:32:14PM +0200, Olaf Kirch wrote:
> Hi folks,
>
> I've been looking at the problem from a different angle...
>
> Theory:
> The main bottleneck is that we spend a long time in commit(),
> blocking other WRITE calls from making any progress (thereby
> stalling all NFS clients). The reason is that we take inode->i_sem
> in nfsd_sync, but the writev() code wants to grab the same
> semaphore.
>
> Circumstantial Evidence:
>
> I've been doing some tests with the latencies of WRITE and COMMIT, using a
> single stream write. The average time we spend in nfsd_write is minuscule,
> usually it's less than 2 milliseconds. However when a commit comes in,
> we take a hit there as well - something around 500 ms for reiser, and
> 400 ms for ext3. Syncing to reiser frequently takes up to 1.2 seconds,
> while the 400 ms for ext3 is pretty constant.
With IRIX clients, which do far fewer COMMITs than (at least 2.4) Linux
clients, I have seen COMMIT latencies in the order of multiple seconds
as over a gigabyte of data is written to disk at 180 MB/s.
> Right now, nfsd_sync calls
>
> filemap_fdatawrite
> filp->f_op->fsync
> filemap_fdatawait
>
> all under the i_sem. However, it seems we don't need the i_sem for
> the filemap_* functions (is that valid? at least sync_page_range
> doesn't take it). So I changed the code to make it grab i_sem only for the fsync
> call, but unfortunately, that doesn't seem to make much of a difference,
> as I found out. Most of the time taken by a commit is spent in fsync
> (the delta between the fsync latency and the overall commit latency is
> usually less than 5 ms, i.e. ~1%).
>
> I also changed nfsd_sync to call filemap_fdatawrite_range instead of
> filemap_fdatawrite, but that doesn't make a noticeable difference either.
I tried this many months ago on 2.4, including tweaking the VFS layer to
pass a ranged flush down to XFS, also without any noticeable effect.
The performance limitation was nfsds being locked out because they were
unable to get the BKL which was held in sync_old_buffers(). In the
single streaming writer case reducing the flush range made no difference
to the number of pages queued to disk and hence no difference to
performance.
BTW this BKLage is the main reason why I gave up hope trying to get the
2.4 kernel's write performance up.
> I then re-enabled my flushfast hack, and the commit latencies went
> down to 30 ms on ext3, with the occasional spike of 300 ms. On reiser,
> the commit latency went down to something like 50 ms on average.
>
> (The reiserfs rewrite case was fairly bad, however. Rewrite over NFS
> on top of reiser is fairly slow to begin with, much slower than write;
> and the gain from the flushfast patch is minimal - but that's a different
> story)
>
>
> Conclusion:
>
> So this at least supports my theory that the commits are throttling the
> writes quite a bit.
Indeed.
> For the sake of completeness, I did some more iozone
> measurements, and on write/rewrite the performance gain is about 50%
> on both reiser and ext3, for a single client. I would think for several
> clients writing concurrently, the gain should be even more pronounced,
> but I haven't run these tests yet.
>
> I'm wondering what could happen if we change nfsd_sync to not take the
> i_sem at all... I'll talk to a few VFS folks around here and try to find out.
I imagine they will not be thrilled by the idea.
I still think the best approach is to get the page cache to start
pushing unstable NFS pages to disk more aggressively, after the WRITE
but before the COMMIT. This should avoid long waits for disk IO
with i_sem held; IIRC the page cache will only hold i_sem long enough
to traverse page lists, allowing another WRITE call to get in soon.
> PS:
>
> Another thing I noticed was that the commit calls sent by the Linux
> client (2.6.5) are not evenly distributed over time. Much of the time,
> the client will call COMMIT 4-6 times a second, and then all of a sudden
> I see 30-80 calls a second several times in a row.
That's not good.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Tue, Aug 03, 2004 at 09:24:46PM +1000, Greg Banks wrote:
> With IRIX clients, which do far fewer COMMITs than (at least 2.4) Linux
> clients, I have seen COMMIT latencies in the order of multiple seconds
> as over a gigabyte of data is written to disk at 180 MB/s.
Yes - commit latencies vary a lot, depending on disk speed, client,
network bandwidth, and the number of clients.
> I imagine they will not be thrilled by the idea.
Probably not :-) And judging by the experiments I did with this, it's
not worth it - now we end up with all nfsd's stuck in filemap_fdatawait.
> I still think the best approach is to get the page cache to start
> pushing unstable NFS pages to disk more aggressively, after the WRITE
> but before the COMMIT. This should avoid long waits for disk IO
> with i_sem held; IIRC the page cache will only hold i_sem long enough
> to traverse page lists, allowing another WRITE call to get in soon.
Right.
I looked into how else I could do this, using the normal
background writeout as you suggested. I looked at the way
pdflush_operation(background_writeout) does it, but I'm wondering
if this is the right place for us. background_writeout loops over all
dirty inodes in all super blocks, only to call do_writepages in the end,
pretty much the same way filemap_flush does - and it may send out
the wrong pages.
So in order to implement what you suggest, we would need to change a lot
of code in page-writeback.c and fs-writeback.c - just for the benefit
of nfsd. Is this worth it?
On the other hand, fadvise and filemap_flush are a perfectly sane way
of telling the kernel to get rid of a specific range of pages we're no
longer interested in, without having to have pdflush do a linear crawl
over all dirty supers and inodes.
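Concretely, the call I have in mind is no more than this (a sketch;
nfsd_flush_unstable is a made-up name, and when to call it is exactly
the open question):

/* kick off asynchronous writeback for just this file's dirty pages,
 * instead of waiting for pdflush to find them */
static void
nfsd_flush_unstable(struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;

	if (mapping == NULL || bdi_write_congested(mapping->backing_dev_info))
		return;
	/* filemap_flush() starts WB_SYNC_NONE writeback for the whole
	 * mapping and returns without waiting; a ranged variant could
	 * use filemap_fdatawrite_range() on the region just written */
	filemap_flush(mapping);
}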
So I guess the question I'm asking is - what would be a reasonable
heuristic for nfsd_write that improves streaming writes without hurting
random ones, and that doesn't cause fragmentation etc?
Olaf
--
Olaf Kirch | The Hardware Gods hate me.
[email protected] |
---------------+
Greg Banks <[email protected]> wrote on 08/02/2004 07:10:18 PM:
<snip>
> First, the way the v3 server is supposed to work is that normal
> page cache pressure pushes pages from unstable writes to disk
> before the COMMIT call arrives from the client. The best way to
> achieve this for a dedicated NFS server box is tuning the pdflush
> parameters to be more aggressive about writing back dirty pages,
> e.g. bumping down the following in /proc/sys/vm:
> dirty_background_ratio, dirty_ratio, dirty_writeback_centisecs,
> and dirty_expire_centisecs. I have to admit I've not tried this
> yet on 2.6 but the equivalent on 2.4 has been generally useful.
<snip>
This information (and the comparable info for 2.6) should be in Chapter
5, Optimizing NFS Performance, of the NFS-HOWTO
(http://nfs.sourceforge.net/nfs-howto/performance.html). Greg, can you
throw together a documentation patch? If you don't have the time and/or
inclination, I could give it a shot if you care to review the content.
Or will this be a wasted effort if Olaf's changes (or something along
those lines) get in?
Regards,
---
Bruce Allan <[email protected]>
Software Engineer, Linux Technology Center
IBM Corporation, Beaverton OR
503-578-4187 IBM Tie-line 775-4187
On Tue, Aug 03, 2004 at 05:10:44PM -0700, Bruce Allan wrote:
> Greg Banks <[email protected]> wrote on 08/02/2004 07:10:18 PM:
>
> <snip>
> > First, the way the v3 server is supposed to work is that normal
> > page cache pressure pushes pages from unstable writes to disk
> > before the COMMIT call arrives from the client. The best way to
> > achieve this for a dedicated NFS server box is tuning the pdflush
> > parameters to be more aggressive about writing back dirty pages,
> > e.g. bumping down the following in /proc/sys/vm:
> > dirty_background_ratio, dirty_ratio, dirty_writeback_centisecs,
> > and dirty_expire_centisecs. I have to admit I've not tried this
> > yet on 2.6 but the equivalent on 2.4 has been generally useful.
> <snip>
>
> This information (and the comparable info for 2.6) should be in Chapter
> 5. Optimizing NFS Performance of the NFS-HOWTO
> (http://nfs.sourceforge.net/nfs-howto/performance.html).
Yes.
> Greg, can you
> throw together a documentation patch? If you don't have the time and/or
> inclination, I could give it a shot if you care to review the content.
That document has several wrong or outdated pieces of advice, e.g. in
sections 5.1, 5.4, 5.7. If you want me to write a diff it won't be
a trivial one and you'll have to wait while I plow through some of
the other four dozen bugs I have queued up. I'm happy to do it if
you're happy to wait.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.