LinuxLists.cc - [PATCH] Fix nfsd rewrite performance

2005-08-01 11:40:01

Subject: [PATCH] Fix nfsd rewrite performance

Since no-one commented on the previous versions of the patch, I'm
submitting it almost unchanged now. This patch is relative to 2.6.12-rc3,
and has undergone some iozone testing, which showed write and rewrite
performance coming out almost the same.

------------------------------------------------------------------
Subject: NFS: Fix rewrite performance

This patch fixes an nfsd performance issue with rewrite. Most of the time,
the iovecs passed to nfsd_vfs_write are unaligned. As the default writev
implementation will just call write() on each chunk in the iovec, this
will cause partial blocks to be dirtied, triggering a read-modify-write
cycle for each block.

The short-term fix is to make sure nfsd aligns the data properly.
The long-term fix would be to make the VFS smarter about writev requests.

Signed-off-by: Olaf Kirch <[email protected]>

Index: linux-2.6.12.new/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.12.new.orig/fs/nfsd/vfs.c
+++ linux-2.6.12.new/fs/nfsd/vfs.c
@@ -874,6 +874,46 @@ out:
 	return err;
 }

+/*
+ * Helper function to page-align the write payload.
+ */
+static inline int
+nfsd_page_align_payload(struct kvec *vec, int vlen)
+{
+	unsigned char *this_page, *prev_page;
+	int i, chunk0, chunk1;
+
+	/* The following checks are just paranoia */
+	if (vlen < 2)
+		return 0;
+
+	if (vec[0].iov_len + vec[vlen-1].iov_len != PAGE_CACHE_SIZE)
+		return 0;
+	for (i = 1; i < vlen - 1; ++i) {
+		if (vec[i].iov_len != PAGE_CACHE_SIZE)
+			return 0;
+	}
+
+	chunk0 = vec[0].iov_len;
+	chunk1 = PAGE_CACHE_SIZE - chunk0;
+
+	this_page = (unsigned char *) vec[vlen-1].iov_base;
+	for (i = vlen-1; i; --i) {
+		prev_page = (unsigned char *) vec[i-1].iov_base;
+
+		/* Push trailing partial page so it's
+		 * aligned with the end of the page, then
+		 * pull up the missing chunk from the previous
+		 * page */
+		memmove(this_page + chunk0, this_page, chunk1);
+		memcpy(this_page, prev_page + chunk1, chunk0);
+		vec[i].iov_len = PAGE_CACHE_SIZE;
+		this_page = prev_page;
+	}
+
+	return 1;
+}
+
 static inline int
 nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 				loff_t offset, struct kvec *vec, int vlen,
@@ -917,6 +957,17 @@ nfsd_vfs_write(struct svc_rqst *rqstp, s
 	if (stable && !EX_WGATHER(exp))
 		file->f_flags |= O_SYNC;

+	/* Hack: if we're rewriting the file, make sure
+	 * we align the iovec properly to avoid costly
+	 * read-modify-write operations on the block devices.
+	 * This hack can go away once we have generic_file_writev.
+	 */
+	if ((offset < inode->i_size)
+	 && (cnt % PAGE_CACHE_SIZE) == 0
+	 && vec->iov_len != PAGE_CACHE_SIZE
+	 && nfsd_page_align_payload(vec, vlen))
+		vec++, vlen--;
+
 	/* Write the data. */
 	oldfs = get_fs(); set_fs(KERNEL_DS);
 	err = vfs_writev(file, (struct iovec __user *)vec, vlen, &offset);
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

-------------------------------------------------------
SF.Net email is sponsored by: Discover Easy Linux Migration Strategies
from IBM. Find simple to follow Roadmaps, straightforward articles,
informative Webcasts and more! Get everything you need to get up to
speed, fast. http://ads.osdn.com/?ad_id=7477&alloc_id=16492&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
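
The shifting that nfsd_page_align_payload performs can be tried out in user space. What follows is a minimal sketch, not the kernel code: PAGE is a toy 8-byte stand-in for PAGE_CACHE_SIZE, shift_toward_end is a hypothetical name, and every page is addressed from its first byte, so the sketch says nothing about where the kernel's iov_base pointers actually point.

```c
#include <assert.h>
#include <string.h>

#define PAGE 8	/* toy page size; stands in for PAGE_CACHE_SIZE */

/*
 * Assumed layout: pages[0] carries chunk0 payload bytes at its tail
 * (after the header), the middle pages are full, and the last page
 * carries PAGE - chunk0 payload bytes at its head.
 * Afterwards pages[1..npages-1] hold the payload contiguously, each
 * page starting on a page boundary.
 */
static void shift_toward_end(unsigned char pages[][PAGE], int npages, int chunk0)
{
	int chunk1 = PAGE - chunk0;
	int i;

	assert(npages >= 2 && chunk0 > 0 && chunk0 < PAGE);

	for (i = npages - 1; i > 0; --i) {
		/* push this page's leading bytes to the end of the page */
		memmove(pages[i] + chunk0, pages[i], chunk1);
		/* pull the tail of the previous page into the gap */
		memcpy(pages[i], pages[i - 1] + chunk1, chunk0);
	}
}
```

With a 5-byte header plus 3 payload bytes in the first page, shift_toward_end(pages, 3, 3) on three pages leaves the 16 payload bytes in pages[1] and pages[2]. Whatever followed the payload in the last page is overwritten along the way.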

2005-08-01 11:53:34

by J. Bruce Fields

Subject: Re: [PATCH] Fix nfsd rewrite performance

I haven't had the chance to read carefully; just one quick question:

On Mon, Aug 01, 2005 at 01:39:54PM +0200, Olaf Kirch wrote:
> +	this_page = (unsigned char *) vec[vlen-1].iov_base;
> +	for (i = vlen-1; i; --i) {
> +		prev_page = (unsigned char *) vec[i-1].iov_base;
> +
> +		/* Push trailing partial page so it's
> +		 * aligned with the end of the page, then
> +		 * pull up the missing chunk from the previous
> +		 * page */
> +		memmove(this_page + chunk0, this_page, chunk1);
> +		memcpy(this_page, prev_page + chunk1, chunk0);
> +		vec[i].iov_len = PAGE_CACHE_SIZE;
> +		this_page = prev_page;
> +	}

If there's stuff after the write data (as there could be in the NFSv4
case at least), does this overwrite it?

--b.

2005-08-01 11:59:32

by Olaf Kirch

Subject: Re: [PATCH] Fix nfsd rewrite performance

On Mon, Aug 01, 2005 at 07:53:19AM -0400, J. Bruce Fields wrote:
> On Mon, Aug 01, 2005 at 01:39:54PM +0200, Olaf Kirch wrote:
> > +	this_page = (unsigned char *) vec[vlen-1].iov_base;
> > +	for (i = vlen-1; i; --i) {
> > +		prev_page = (unsigned char *) vec[i-1].iov_base;
> > +
> > +		/* Push trailing partial page so it's
> > +		 * aligned with the end of the page, then
> > +		 * pull up the missing chunk from the previous
> > +		 * page */
> > +		memmove(this_page + chunk0, this_page, chunk1);
> > +		memcpy(this_page, prev_page + chunk1, chunk0);
> > +		vec[i].iov_len = PAGE_CACHE_SIZE;
> > +		this_page = prev_page;
> > +	}
>
> If there's stuff after the write data (as there could be in the NFSv4
> case at least), does this overwrite it?

It does. I hadn't thought of these cursed NFSv4 compound requests.

Alternatively, we could pull in everything and align it with the beginning
of the message. This would clobber the RPC header etc. I don't see us
doing any in-place decodes of strings etc in the svcauth code,
so that may be a safer choice. I would like to avoid allocating an extra
page just for aligning things.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax
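
The alternative of aligning toward the beginning of the message can be sketched the same way. This is a toy user-space model under stated assumptions, not kernel code: PAGE is an 8-byte stand-in for PAGE_CACHE_SIZE, shift_toward_start is a hypothetical name, and pages are addressed from their first byte. It is meant to show which region gets clobbered (the header) and which survives (anything after the payload in the last page).

```c
#include <assert.h>
#include <string.h>

#define PAGE 8	/* toy page size; stands in for PAGE_CACHE_SIZE */

/*
 * Assumed layout: pages[0] carries chunk0 payload bytes at its tail
 * (after the header), the middle pages are full, and the last page
 * carries PAGE - chunk0 payload bytes at its head, followed by
 * trailing request data.
 * Afterwards pages[0..npages-2] hold the payload, page aligned.
 * The header in pages[0] is overwritten; the last page is only read.
 */
static void shift_toward_start(unsigned char pages[][PAGE], int npages, int chunk0)
{
	int chunk1 = PAGE - chunk0;
	int i;

	assert(npages >= 2 && chunk0 > 0 && chunk0 < PAGE);

	for (i = 0; i < npages - 1; ++i) {
		/* slide this page's payload tail down to the page start */
		memmove(pages[i], pages[i] + chunk1, chunk0);
		/* fill the rest of the page from the head of the next page */
		memcpy(pages[i] + chunk0, pages[i + 1], chunk1);
	}
}
```

The write would then use the first npages - 1 elements rather than skipping the first one. Whether clobbering the header is actually safe depends on what else still points into it, which the sketch cannot answer.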

2005-08-01 12:10:57

by J. Bruce Fields

Subject: Re: [PATCH] Fix nfsd rewrite performance

On Mon, Aug 01, 2005 at 01:59:22PM +0200, Olaf Kirch wrote:
> Alternatively, we could pull in everything and align it with the beginning
> of the message. This would clobber the RPC header etc. I don't see us
> doing any in-place decodes of strings etc in the svcauth code,
> so that may be a safer choice. I would like to avoid allocating an extra
> page just for aligning things.

Yeah. We'd also need to check, though, that we aren't taking pointers
to data before the write (when decoding a previous compound op). I
think we might be....

--b.

2005-08-01 12:14:26

by Olaf Kirch

Subject: Re: [PATCH] Fix nfsd rewrite performance

On Mon, Aug 01, 2005 at 08:10:53AM -0400, J. Bruce Fields wrote:
> > Alternatively, we could pull in everything and align it with the beginning
> > of the message. This would clobber the RPC header etc. I don't see us
> > doing any in-place decodes of strings etc in the svcauth code,
> > so that may be a safer choice. I would like to avoid allocating an extra
> > page just for aligning things.
>
> Yeah. We'd also need to check, though, that we aren't taking pointers
> to data before the write (when decoding a previous compound op). I
> think we might be....

Can we assume that the previous op inside the compound op has been
dealt with completely? Why would it be unsafe to clobber it?

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

2005-08-01 12:58:34

by J. Bruce Fields

Subject: Re: [PATCH] Fix nfsd rewrite performance

On Mon, Aug 01, 2005 at 02:14:22PM +0200, Olaf Kirch wrote:
> On Mon, Aug 01, 2005 at 08:10:53AM -0400, J. Bruce Fields wrote:
> > > Alternatively, we could pull in everything and align it with the beginning
> > > of the message. This would clobber the RPC header etc. I don't see us
> > > doing any in-place decodes of strings etc in the svcauth code,
> > > so that may be a safer choice. I would like to avoid allocating an extra
> > > page just for aligning things.
> >
> > Yeah. We'd also need to check, though, that we aren't taking pointers
> > to data before the write (when decoding a previous compound op). I
> > think we might be....
>
> Can we assume that the previous op inside the compound op has been
> dealt with completely?

Yeah, you're right, that's not a problem.

There's a different problem, though, with the request deferral stuff.
It waits for upcalls by aborting request processing, saving a copy of
the raw request data, then processing it from scratch again when the
upcall response comes.

This is already unfortunate for writes since nfsd will try to kmalloc()
enough memory to store the whole request. (See the "FIXME" in
svcsock.c:svc_defer().)

But moving the data around in the pages will cause weird corruption if
we decide to wait on an upcall after a write, because it's no longer
possible to reprocess the request from scratch. It can only happen with v4, and
probably only in fairly weird cases, but we can't guarantee it won't
happen.

So I think we need to fix that upcall deferral mechanism before doing
this shifting of data.

--b.
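
The hazard described above can be modelled in a few lines. This is a deliberately simplified sketch: deferred_copy, defer_request, shift_in_place and replay_is_faithful are hypothetical names, the 16-byte buffer stands in for a whole request, and nothing here mirrors the real svc_defer() code. It only shows that a raw copy taken after an in-place shift no longer matches what the client sent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REQ_LEN 16	/* toy request size */

/* Hypothetical stand-in for a deferred request: a raw copy of the bytes. */
struct deferred_copy {
	unsigned char raw[REQ_LEN];
};

/* Snapshot the request buffer as it looks at the moment of deferral. */
static void defer_request(struct deferred_copy *d, const unsigned char *req)
{
	assert(d != NULL && req != NULL);
	memcpy(d->raw, req, REQ_LEN);
}

/* Toy in-place shift: slide the first 12 bytes up by 4, as an alignment pass might. */
static void shift_in_place(unsigned char *req)
{
	memmove(req + 4, req, REQ_LEN - 4);
}

/* Returns 1 if replaying the snapshot would see the request as the client sent it. */
static int replay_is_faithful(const struct deferred_copy *d, const unsigned char *as_sent)
{
	return memcmp(d->raw, as_sent, REQ_LEN) == 0;
}
```

A snapshot taken before shift_in_place() compares equal to the original bytes; one taken afterwards does not, so decoding it again from scratch would read shifted data.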

2005-08-02 09:49:56

by Olaf Kirch

Subject: Re: [PATCH] Fix nfsd rewrite performance

On Mon, Aug 01, 2005 at 08:58:32AM -0400, J. Bruce Fields wrote:
> There's a different problem, though, with the request deferral stuff.
> It waits for upcalls by aborting request processing, saving a copy of
> the raw request data, then processing it from scratch again when the
> upcall response comes.

I just looked at the code, and there are two types of deferrals.
One is for authentication stuff - this will actually use svc_defer and
svc_revisit, which goes back to look at the entire request.

The other type is what the nfs4idmap stuff does, which uses its own
defer/revisit routines, which just make the nfsd thread wait.

So neither poses a problem to munging the iovec in nfsd_write: The
authentication stuff will happen when we parse the RPC header, which is
way before we decide to call nfsd_write. The idmap stuff may be called
after processing a write call, but it will not go back to revisit the
entire request, it will just stall briefly while idmapd is doing its job.
Correct?

In general, I believe it does not make sense to call svc_defer once we've
started to process a request, because the request may not be idempotent.
It sounds like a reasonable assumption to me that deferring and revisiting
an entire request is only done before we hit the RPC program handler.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

2005-08-02 10:02:28

by J. Bruce Fields

Subject: Re: [PATCH] Fix nfsd rewrite performance

On Tue, Aug 02, 2005 at 11:49:50AM +0200, Olaf Kirch wrote:
> I just looked at the code, and there are two types of deferrals.
> One is for authentication stuff - this will actually use svc_defer and
> svc_revisit, which goes back to look at the entire request.
>
> The other type is what the nfs4idmap stuff does, which uses its own
> defer/revisit routines, which just make the nfsd thread wait.
>
> So neither poses a problem to munging the iovec in nfsd_write:

It can also happen whenever we do fh_verify(). (See the exp_find() call in
fh_verify().)

> The authentication stuff will happen when we parse the RPC header,
> which is way before we decide to call nfsd_write. The idmap stuff may
> be called after processing a write call, but it will not go back to
> revisit the entire request, it will just stall briefly while idmapd is
> doing its job. Correct?
>
> In general, I believe it does not make sense to call svc_defer once
> we've started to process a request, because the request may not be
> idempotent. It sounds like a reasonable assumption to me that
> deferring and revisiting an entire request is only done before we hit
> the RPC program handler.

Yeah. So the simplest solution might be to make the other upcalls use
the same sort of deferral as the idmap upcalls, at least in the v4 case.
In the v3 case fh_verify always happens early enough not to be a
problem, I believe.

Though I still think it's unfortunate even for v3 that it will try to
copy (and allocate space for) the entire write request in the case of an
authentication- or export- related upcall.

--b.

2005-08-02 10:49:11

by Olaf Kirch

Subject: Re: [PATCH] Fix nfsd rewrite performance

On Tue, Aug 02, 2005 at 06:02:22AM -0400, J. Bruce Fields wrote:
> Yeah. So the simplest solution might be to make the other upcalls use
> the same sort of deferral as the idmap upcalls, at least in the v4 case.
> In the v3 case fh_verify always happens early enough not to be a
> problem, I believe.
>
> Though I still think it's unfortunate even for v3 that it will try to
> copy (and allocate space for) the entire write request in the case of an
> authentication- or export- related upcall.

I just looked at svc_defer in 2.6.12 and it seems it cannot handle
more than one page of data anyway, and ditches anything else.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

2005-08-02 12:07:40

by J. Bruce Fields

Subject: Re: [PATCH] Fix nfsd rewrite performance

On Tue, Aug 02, 2005 at 12:49:04PM +0200, Olaf Kirch wrote:
> On Tue, Aug 02, 2005 at 06:02:22AM -0400, J. Bruce Fields wrote:
> > Yeah. So the simplest solution might be to make the other upcalls use
> > the same sort of deferral as the idmap upcalls, at least in the v4 case.
> > In the v3 case fh_verify always happens early enough not to be a
> > problem, I believe.
> >
> > Though I still think it's unfortunate even for v3 that it will try to
> > copy (and allocate space for) the entire write request in the case of an
> > authentication- or export- related upcall.
>
> I just looked at svc_defer in 2.6.12 and it seems it cannot handle
> more than one page of data anyway, and ditches anything else.

Oops, you're right. Forcing the client to time out and retry there is
kind of unfortunate too, though. A half hour of activity (to expire
entries from the export cache) followed by a write could trigger this.

--b.
