2007-09-05 14:15:27

by Bernd Schubert

Subject: patch: improve generic_file_buffered_write()

Hi,

recently we discovered that writing to an NFS-exported Lustre filesystem is rather
slow (20-40 MB/s writing, but over 200 MB/s reading).

As I already explained on the NFS mailing list, this happens because there is an
offset into the very first page due to the NFS header.

http://sourceforge.net/mailarchive/forum.php?thread_name=200708312003.30446.bernd-schubert%40gmx.de&forum_name=nfs

While this especially affects Lustre, Olaf Kirch also noticed it on another
filesystem before and wrote an NFS patch for it. That patch has two
disadvantages: it requires moving all data within the pages, which is IMHO
rather CPU-intensive, and it presently causes data corruption when more
than one NFS thread is running.

After thinking it over and over again, we (Goswin and I) believe it would be
best to improve generic_file_buffered_write().
If there is sufficient data, as is usual for aio writes,
generic_file_buffered_write() will now fill each page as much as possible and
only then prepare/commit it. Before, generic_file_buffered_write() committed
chunks of pages even though there was still more data.
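
In rough pseudocode (just a sketch of the new flow, not the literal patch),
the loop changes from

    for each chunk, cut at every segment AND page boundary:
        __grab_cache_page()
        a_ops->prepare_write()
        filemap_copy_from_user[_iovec]()
        a_ops->commit_write()

to

    __grab_cache_page()
    for each chunk, cut at segment boundaries only:
        if the chunk fills the page, or no more data follows:
            a_ops->prepare_write()
        filemap_copy_from_user[_iovec]()
        if the chunk fills the page, or no more data follows:
            a_ops->commit_write()
            release the page and grab the next one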

The attached patch still has two FIXMEs, both for likely()/unlikely()
conditions which IMHO don't reflect the actual likelihood for the new aio
data path.

filemap.c | 144
+++++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 96 insertions(+), 48 deletions(-)

Signed-off-by: Bernd Schubert <[email protected]>
Signed-off-by: Goswin von Brederlow <[email protected]>


Cheers,
Goswin and Bernd


Index: linux-2.6.20.3/mm/filemap.c
===================================================================
--- linux-2.6.20.3.orig/mm/filemap.c 2007-09-04 13:43:04.000000000 +0200
+++ linux-2.6.20.3/mm/filemap.c 2007-09-05 12:39:23.000000000 +0200
@@ -2057,6 +2057,19 @@
}
EXPORT_SYMBOL(generic_file_direct_write);

+/**
+ * This function will do 3 main tasks for each iov:
+ * - prepare a write
+ * - copy the data from iov into a new page
+ * - commit this page
+ * @iob: file operations
+ * @iov: vector of data to write
+ * @nr_segs: number of iov segments
+ * @pos: position in the file
+ * @ppos: position in the file after this function
+ * @count: number of bytes to write
+ * written: offset in iov->base (data to skip on write)
+ */
ssize_t
generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -2074,6 +2087,11 @@
const struct iovec *cur_iov = iov; /* current iovec */
size_t iov_base = 0; /* offset in the current iovec */
char __user *buf;
+ unsigned long data_start = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+ loff_t wpos = pos; /* the position in the file we will return */
+
+ /* position in file as index of pages */
+ unsigned long index = pos >> PAGE_CACHE_SHIFT;

pagevec_init(&lru_pvec, 0);

@@ -2087,9 +2105,15 @@
buf = cur_iov->iov_base + iov_base;
}

+ page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
+ if (!page) {
+ status = -ENOMEM;
+ goto out;
+ }
+
do {
- unsigned long index;
unsigned long offset;
+ unsigned long data_end; /* end of data within the page */
size_t copied;

offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
@@ -2106,6 +2130,8 @@
*/
bytes = min(bytes, cur_iov->iov_len - iov_base);

+ data_end = offset + bytes;
+
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
@@ -2114,95 +2140,117 @@
*/
fault_in_pages_readable(buf, bytes);

- page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
- if (!page) {
- status = -ENOMEM;
- break;
- }
-
if (unlikely(bytes == 0)) {
status = 0;
copied = 0;
goto zero_length_segment;
}

- status = a_ops->prepare_write(file, page, offset, offset+bytes);
- if (unlikely(status)) {
- loff_t isize = i_size_read(inode);
-
- if (status != AOP_TRUNCATED_PAGE)
- unlock_page(page);
- page_cache_release(page);
- if (status == AOP_TRUNCATED_PAGE)
- continue;
+ if (data_end == PAGE_CACHE_SIZE || count == bytes) {
/*
- * prepare_write() may have instantiated a few blocks
- * outside i_size. Trim these off again.
+ * Only prepare a write if it's an entire page or if
+ * we don't have more data
*/
- if (pos + bytes > isize)
- vmtruncate(inode, isize);
- break;
+ status = a_ops->prepare_write(file, page, data_start, data_end);
+ if (unlikely(status)) {
+ loff_t isize = i_size_read(inode);
+
+ if (status == AOP_TRUNCATED_PAGE)
+ continue;
+ /*
+ * prepare_write() may have instantiated a few blocks
+ * outside i_size. Trim these off again.
+ */
+ if (pos + bytes > isize)
+ vmtruncate(inode, isize);
+ }
}
- if (likely(nr_segs == 1))
+ if (likely(nr_segs == 1)) /* FIXME: really likely with aio? */
copied = filemap_copy_from_user(page, offset,
buf, bytes);
else
copied = filemap_copy_from_user_iovec(page, offset,
cur_iov, iov_base, bytes);
- flush_dcache_page(page);
- status = a_ops->commit_write(file, page, offset, offset+bytes);
- if (status == AOP_TRUNCATED_PAGE) {
+
+ if (data_end == PAGE_CACHE_SIZE || count == bytes) {
+ /*
+ * Same as above, always try to fill pages up to
+ * PAGE_CACHE_SIZE if possible (sufficient data)
+ */
+ flush_dcache_page(page);
+ status = a_ops->commit_write(file, page,
+ data_start, data_end);
+ if (status == AOP_TRUNCATED_PAGE) {
+ continue;
+ }
+ unlock_page(page);
+ mark_page_accessed(page);
page_cache_release(page);
- continue;
+ balance_dirty_pages_ratelimited(mapping);
+ cond_resched();
+ if (bytes < count) { /* still more iov data to write */
+ page = __grab_cache_page(mapping, index + 1,
+ &cached_page, &lru_pvec);
+ if (!page) {
+ status = -ENOMEM;
+ goto out;
+ }
+ } else {
+ page = NULL;
+ }
+ written += data_end - data_start;
+ wpos += data_end - data_start;
+ data_start = 0; /* correct would be data_end % PAGE_SIZE
+ * but if this would be != 0, we don't
+ * have more data
+ */
}
zero_length_segment:
if (likely(copied >= 0)) {
- if (!status)
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
- if (unlikely(nr_segs > 1)) {
+ if (!status) {
+ count -= copied;
+ pos += copied;
+ buf += copied;
+ if (unlikely(nr_segs > 1)) { /* FIXME: really unlikely with aio? */
filemap_set_next_iovec(&cur_iov,
- &iov_base, status);
+ &iov_base, copied);
if (count)
buf = cur_iov->iov_base +
iov_base;
} else {
- iov_base += status;
+ iov_base += copied;
}
}
}
if (unlikely(copied != bytes))
- if (status >= 0)
+ if (!status)
status = -EFAULT;
- unlock_page(page);
- mark_page_accessed(page);
- page_cache_release(page);
- if (status < 0)
+ if (status)
break;
- balance_dirty_pages_ratelimited(mapping);
- cond_resched();
} while (count);
- *ppos = pos;
+
+out:
+ *ppos = wpos;

if (cached_page)
page_cache_release(cached_page);

+ if (page) {
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
/*
* For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
*/
- if (likely(status >= 0)) {
+ if (likely(!status)) {
if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
if (!a_ops->writepage || !is_sync_kiocb(iocb))
status = generic_osync_inode(inode, mapping,
OSYNC_METADATA|OSYNC_DATA);
}
- }
-
+ }
+
/*
* If we get here for O_DIRECT writes then we must have fallen through
* to buffered writes (block instantiation inside i_size). So we sync


--
Bernd Schubert
Q-Leap Networks GmbH



2007-09-05 15:36:16

by Randy Dunlap

Subject: Re: patch: improve generic_file_buffered_write()

On Wed, 5 Sep 2007 15:45:36 +0200 Bernd Schubert wrote:

> Hi,

meta-comments:

> filemap.c | 144
> +++++++++++++++++++++++++++++++++++++++++---------------------
> 1 file changed, 96 insertions(+), 48 deletions(-)

Use "diffstat -p 1 -w 70" per Documentation/SubmittingPatches.


> Index: linux-2.6.20.3/mm/filemap.c
> ===================================================================
> --- linux-2.6.20.3.orig/mm/filemap.c 2007-09-04 13:43:04.000000000 +0200
> +++ linux-2.6.20.3/mm/filemap.c 2007-09-05 12:39:23.000000000 +0200
> @@ -2057,6 +2057,19 @@
> }
> EXPORT_SYMBOL(generic_file_direct_write);
>
> +/**
> + * This function will do 3 main tasks for each iov:
> + * - prepare a write
> + * - copy the data from iov into a new page
> + * - commit this page
> + * @iob: file operations
> + * @iov: vector of data to write
> + * @nr_segs: number of iov segments
> + * @pos: position in the file
> + * @ppos: position in the file after this function
> + * @count: number of bytes to write
> + * written: offset in iov->base (data to skip on write)
> + */
> ssize_t
> generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
> unsigned long nr_segs, loff_t pos, loff_t *ppos,

Use proper kernel-doc notation, per Documentation/kernel-doc-nano-HOWTO.txt.

This comment block should be:

/**
* generic_file_buffered_write - handle an iov
* @iocb: file operations
* @iov: vector of data to write
* @nr_segs: number of iov segments
* @pos: position in the file
* @ppos: position in the file after this function
* @count: number of bytes to write
* @written: offset in iov->base (data to skip on write)
*
* This function will do 3 main tasks for each iov:
* - prepare a write
* - copy the data from iov into a new page
* - commit this page
*/

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

2007-09-05 17:42:17

by Bernd Schubert

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

Hello Randy,

thanks for your review.

On Wednesday 05 September 2007 17:35:29 Randy Dunlap wrote:
> On Wed, 5 Sep 2007 15:45:36 +0200 Bernd Schubert wrote:
> > Hi,
>
> meta-comments:
> > filemap.c | 144
> > +++++++++++++++++++++++++++++++++++++++++---------------------
> > 1 file changed, 96 insertions(+), 48 deletions(-)
>
> Use "diffstat -p 1 -w 70" per Documentation/SubmittingPatches.

Thanks, I never would have thought this was documented.

[...]

> Use proper kernel-doc notation, per
> Documentation/kernel-doc-nano-HOWTO.txt.

Ouch, I really should have read these files before.
<joking>
Now I know why there are so few functions commented. Nobody wants to read the
documentation.
</joking>

>
> This comment block should be:
>
> /**
> * generic_file_buffered_write - handle an iov
> * @iocb: file operations
> * @iov: vector of data to write
> * @nr_segs: number of iov segments
> * @pos: position in the file
> * @ppos: position in the file after this function
> * @count: number of bytes to write
> * @written: offset in iov->base (data to skip on write)
> *
> * This function will do 3 main tasks for each iov:
> * - prepare a write
> * - copy the data from iov into a new page
> * - commit this page

Thanks, done.

I also removed the FIXMEs and created a second patch.

Signed-off-by: Bernd Schubert <[email protected]>
Signed-off-by: Goswin von Brederlow <[email protected]>

mm/filemap.c | 142 +++++++++++++++++++++++++++++++++----------------
1 file changed, 96 insertions(+), 46 deletions(-)

Index: linux-2.6.20.3/mm/filemap.c
===================================================================
--- linux-2.6.20.3.orig/mm/filemap.c 2007-09-05 14:04:18.000000000 +0200
+++ linux-2.6.20.3/mm/filemap.c 2007-09-05 18:50:26.000000000 +0200
@@ -2057,6 +2057,21 @@
}
EXPORT_SYMBOL(generic_file_direct_write);

+/**
+ * generic_file_buffered_write - handle iov'ectors
+ * @iocb: file operations
+ * @iov: vector of data to write
+ * @nr_segs: number of iov segments
+ * @pos: position in the file
+ * @ppos: position in the file after this function
+ * @count: number of bytes to write
+ * @written: offset in iov->base (data to skip on write)
+ *
+ * This function will do 3 main tasks for each iov:
+ * - prepare a write
+ * - copy the data from iov into a new page
+ * - commit this page
+ */
ssize_t
generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -2074,6 +2089,11 @@
const struct iovec *cur_iov = iov; /* current iovec */
size_t iov_base = 0; /* offset in the current iovec */
char __user *buf;
+ unsigned long data_start = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+ loff_t wpos = pos; /* the position in the file we will return */
+
+ /* position in file as index of pages */
+ unsigned long index = pos >> PAGE_CACHE_SHIFT;

pagevec_init(&lru_pvec, 0);

@@ -2087,9 +2107,15 @@
buf = cur_iov->iov_base + iov_base;
}

+ page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
+ if (!page) {
+ status = -ENOMEM;
+ goto out;
+ }
+
do {
- unsigned long index;
unsigned long offset;
+ unsigned long data_end; /* end of data within the page */
size_t copied;

offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
@@ -2106,6 +2132,8 @@
*/
bytes = min(bytes, cur_iov->iov_len - iov_base);

+ data_end = offset + bytes;
+
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
@@ -2114,34 +2142,30 @@
*/
fault_in_pages_readable(buf, bytes);

- page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
- if (!page) {
- status = -ENOMEM;
- break;
- }
-
if (unlikely(bytes == 0)) {
status = 0;
copied = 0;
goto zero_length_segment;
}

- status = a_ops->prepare_write(file, page, offset, offset+bytes);
- if (unlikely(status)) {
- loff_t isize = i_size_read(inode);
-
- if (status != AOP_TRUNCATED_PAGE)
- unlock_page(page);
- page_cache_release(page);
- if (status == AOP_TRUNCATED_PAGE)
- continue;
+ if (data_end == PAGE_CACHE_SIZE || count == bytes) {
/*
- * prepare_write() may have instantiated a few blocks
- * outside i_size. Trim these off again.
+ * Only prepare a write if it's an entire page or if
+ * we don't have more data
*/
- if (pos + bytes > isize)
- vmtruncate(inode, isize);
- break;
+ status = a_ops->prepare_write(file, page, data_start, data_end);
+ if (unlikely(status)) {
+ loff_t isize = i_size_read(inode);
+
+ if (status == AOP_TRUNCATED_PAGE)
+ continue;
+ /*
+ * prepare_write() may have instantiated a few blocks
+ * outside i_size. Trim these off again.
+ */
+ if (pos + bytes > isize)
+ vmtruncate(inode, isize);
+ }
}
if (likely(nr_segs == 1))
copied = filemap_copy_from_user(page, offset,
@@ -2149,60 +2173,86 @@
else
copied = filemap_copy_from_user_iovec(page, offset,
cur_iov, iov_base, bytes);
- flush_dcache_page(page);
- status = a_ops->commit_write(file, page, offset, offset+bytes);
- if (status == AOP_TRUNCATED_PAGE) {
+
+ if (data_end == PAGE_CACHE_SIZE || count == bytes) {
+ /*
+ * Same as above, always try to fill pages up to
+ * PAGE_CACHE_SIZE if possible (sufficient data)
+ */
+ flush_dcache_page(page);
+ status = a_ops->commit_write(file, page,
+ data_start, data_end);
+ if (status == AOP_TRUNCATED_PAGE) {
+ continue;
+ }
+ unlock_page(page);
+ mark_page_accessed(page);
page_cache_release(page);
- continue;
+ balance_dirty_pages_ratelimited(mapping);
+ cond_resched();
+ if (bytes < count) { /* still more iov data to write */
+ page = __grab_cache_page(mapping, index + 1,
+ &cached_page, &lru_pvec);
+ if (!page) {
+ status = -ENOMEM;
+ goto out;
+ }
+ } else {
+ page = NULL;
+ }
+ written += data_end - data_start;
+ wpos += data_end - data_start;
+ data_start = 0; /* correct would be data_end % PAGE_SIZE
+ * but if this would be != 0, we don't
+ * have more data
+ */
}
zero_length_segment:
if (likely(copied >= 0)) {
- if (!status)
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
+ if (!status) {
+ count -= copied;
+ pos += copied;
+ buf += copied;
if (unlikely(nr_segs > 1)) {
filemap_set_next_iovec(&cur_iov,
- &iov_base, status);
+ &iov_base, copied);
if (count)
buf = cur_iov->iov_base +
iov_base;
} else {
- iov_base += status;
+ iov_base += copied;
}
}
}
if (unlikely(copied != bytes))
- if (status >= 0)
+ if (!status)
status = -EFAULT;
- unlock_page(page);
- mark_page_accessed(page);
- page_cache_release(page);
- if (status < 0)
+ if (status)
break;
- balance_dirty_pages_ratelimited(mapping);
- cond_resched();
} while (count);
- *ppos = pos;
+
+out:
+ *ppos = wpos;

if (cached_page)
page_cache_release(cached_page);

+ if (page) {
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
/*
* For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
*/
- if (likely(status >= 0)) {
+ if (likely(!status)) {
if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
if (!a_ops->writepage || !is_sync_kiocb(iocb))
status = generic_osync_inode(inode, mapping,
OSYNC_METADATA|OSYNC_DATA);
}
- }
-
+ }
+
/*
* If we get here for O_DIRECT writes then we must have fallen through
* to buffered writes (block instantiation inside i_size). So we sync



--
Bernd Schubert
Q-Leap Networks GmbH

2007-09-05 17:49:41

by Bernd Schubert

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 2/2)

I guess when aio was introduced this was probably forgotten. For small chunks
or synchronous I/O the likely()/unlikely() hints are correct, but for big data
chunks and aio they are wrong.

Signed-off-by: Bernd Schubert <[email protected]>
Signed-off-by: Goswin von Brederlow <[email protected]>

Index: linux-2.6.20.3/mm/filemap.c
===================================================================
--- linux-2.6.20.3.orig/mm/filemap.c 2007-09-05 18:51:59.000000000 +0200
+++ linux-2.6.20.3/mm/filemap.c 2007-09-05 18:53:12.000000000 +0200
@@ -2100,7 +2100,7 @@
/*
* handle partial DIO write. Adjust cur_iov if needed.
*/
- if (likely(nr_segs == 1))
+ if (nr_segs == 1)
buf = iov->iov_base + written;
else {
filemap_set_next_iovec(&cur_iov, &iov_base, written);
@@ -2167,7 +2167,7 @@
vmtruncate(inode, isize);
}
}
- if (likely(nr_segs == 1))
+ if (nr_segs == 1)
copied = filemap_copy_from_user(page, offset,
buf, bytes);
else
@@ -2213,7 +2213,7 @@
count -= copied;
pos += copied;
buf += copied;
- if (unlikely(nr_segs > 1)) {
+ if (nr_segs > 1) {
filemap_set_next_iovec(&cur_iov,
&iov_base, copied);
if (count)

2007-09-07 18:18:32

by Nick Piggin

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

On Thursday 06 September 2007 03:41, Bernd Schubert wrote:

> > This comment block should be:
> >
> > /**
> > * generic_file_buffered_write - handle an iov
> > * @iocb: file operations
> > * @iov: vector of data to write
> > * @nr_segs: number of iov segments
> > * @pos: position in the file
> > * @ppos: position in the file after this function
> > * @count: number of bytes to write
> > * @written: offset in iov->base (data to skip on write)
> > *
> > * This function will do 3 main tasks for each iov:
> > * - prepare a write
> > * - copy the data from iov into a new page
> > * - commit this page
>
> Thanks, done.
>
> I also removed the FIXMEs and created a second patch.
>
> Signed-off-by: Bernd Schubert <[email protected]>
> Signed-off-by: Goswin von Brederlow <[email protected]>

Minor nit: when resubmitting a patch, you should include everything
(ie. the full changelog of problem statement and fix description) in a
single mail. It's just a bit easier...

So I believe the problem is that for a multi-segment iovec, we currently
prepare_write/commit_write once for each segment, right? We do this
because there is a nasty deadlock in the VM (copy_from_user being
called with a page locked), and copying multiple segs dramatically
increases the chances that one of these copies will cause a page fault
and thus potentially deadlock.
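
To make that deadlock concrete (an illustrative sketch only; assume for the
example that buf points into an mmap() of the very page being written):

	fault_in_pages_readable(buf, bytes);	/* user page resident, for now */
	page = __grab_cache_page(mapping, index, ...);	/* returned locked */
	status = a_ops->prepare_write(file, page, offset, offset+bytes);
	/*
	 * If the user page was reclaimed in the meantime, this copy
	 * faults, and the fault handler must lock 'page' to bring it
	 * uptodate -- the very lock we already hold.
	 */
	copied = filemap_copy_from_user(page, offset, buf, bytes);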

The fix you have I don't think can work because a filesystem must be
notified of the modification _before_ it has happened. (If I understand
correctly, you are skipping the prepare_write potentially until after
some data is copied?).

Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
also a workaround for the NFSD problem in git commit 29dbb3fc. Did
you try a later kernel to see if it is fixed there?

Thanks,
Nick

>
> mm/filemap.c | 142 +++++++++++++++++++++++++++++++++----------------
> 1 file changed, 96 insertions(+), 46 deletions(-)
>
> Index: linux-2.6.20.3/mm/filemap.c
> ===================================================================
> --- linux-2.6.20.3.orig/mm/filemap.c 2007-09-05 14:04:18.000000000 +0200
> +++ linux-2.6.20.3/mm/filemap.c 2007-09-05 18:50:26.000000000 +0200
> @@ -2057,6 +2057,21 @@
> }
> EXPORT_SYMBOL(generic_file_direct_write);
>
> +/**
> + * generic_file_buffered_write - handle iov'ectors
> + * @iocb: file operations
> + * @iov: vector of data to write
> + * @nr_segs: number of iov segments
> + * @pos: position in the file
> + * @ppos: position in the file after this function
> + * @count: number of bytes to write
> + * @written: offset in iov->base (data to skip on write)
> + *
> + * This function will do 3 main tasks for each iov:
> + * - prepare a write
> + * - copy the data from iov into a new page
> + * - commit this page
> + */
> ssize_t
> generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
> unsigned long nr_segs, loff_t pos, loff_t *ppos,
> @@ -2074,6 +2089,11 @@
> const struct iovec *cur_iov = iov; /* current iovec */
> size_t iov_base = 0; /* offset in the current iovec */
> char __user *buf;
> + unsigned long data_start = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
> + loff_t wpos = pos; /* the position in the file we will return */
> +
> + /* position in file as index of pages */
> + unsigned long index = pos >> PAGE_CACHE_SHIFT;
>
> pagevec_init(&lru_pvec, 0);
>
> @@ -2087,9 +2107,15 @@
> buf = cur_iov->iov_base + iov_base;
> }
>
> + page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
> + if (!page) {
> + status = -ENOMEM;
> + goto out;
> + }
> +
> do {
> - unsigned long index;
> unsigned long offset;
> + unsigned long data_end; /* end of data within the page */
> size_t copied;
>
> offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
> @@ -2106,6 +2132,8 @@
> */
> bytes = min(bytes, cur_iov->iov_len - iov_base);
>
> + data_end = offset + bytes;
> +
> /*
> * Bring in the user page that we will copy from _first_.
> * Otherwise there's a nasty deadlock on copying from the
> @@ -2114,34 +2142,30 @@
> */
> fault_in_pages_readable(buf, bytes);
>
> - page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
> - if (!page) {
> - status = -ENOMEM;
> - break;
> - }
> -
> if (unlikely(bytes == 0)) {
> status = 0;
> copied = 0;
> goto zero_length_segment;
> }
>
> - status = a_ops->prepare_write(file, page, offset, offset+bytes);
> - if (unlikely(status)) {
> - loff_t isize = i_size_read(inode);
> -
> - if (status != AOP_TRUNCATED_PAGE)
> - unlock_page(page);
> - page_cache_release(page);
> - if (status == AOP_TRUNCATED_PAGE)
> - continue;
> + if (data_end == PAGE_CACHE_SIZE || count == bytes) {
> /*
> - * prepare_write() may have instantiated a few blocks
> - * outside i_size. Trim these off again.
> + * Only prepare a write if it's an entire page or if
> + * we don't have more data
> */
> - if (pos + bytes > isize)
> - vmtruncate(inode, isize);
> - break;
> + status = a_ops->prepare_write(file, page, data_start, data_end);
> + if (unlikely(status)) {
> + loff_t isize = i_size_read(inode);
> +
> + if (status == AOP_TRUNCATED_PAGE)
> + continue;
> + /*
> + * prepare_write() may have instantiated a few blocks
> + * outside i_size. Trim these off again.
> + */
> + if (pos + bytes > isize)
> + vmtruncate(inode, isize);
> + }
> }
> if (likely(nr_segs == 1))
> copied = filemap_copy_from_user(page, offset,
> @@ -2149,60 +2173,86 @@
> else
> copied = filemap_copy_from_user_iovec(page, offset,
> cur_iov, iov_base, bytes);
> - flush_dcache_page(page);
> - status = a_ops->commit_write(file, page, offset, offset+bytes);
> - if (status == AOP_TRUNCATED_PAGE) {
> +
> + if (data_end == PAGE_CACHE_SIZE || count == bytes) {
> + /*
> + * Same as above, always try to fill pages up to
> + * PAGE_CACHE_SIZE if possible (sufficient data)
> + */
> + flush_dcache_page(page);
> + status = a_ops->commit_write(file, page,
> + data_start, data_end);
> + if (status == AOP_TRUNCATED_PAGE) {
> + continue;
> + }
> + unlock_page(page);
> + mark_page_accessed(page);
> page_cache_release(page);
> - continue;
> + balance_dirty_pages_ratelimited(mapping);
> + cond_resched();
> + if (bytes < count) { /* still more iov data to write */
> + page = __grab_cache_page(mapping, index + 1,
> + &cached_page, &lru_pvec);
> + if (!page) {
> + status = -ENOMEM;
> + goto out;
> + }
> + } else {
> + page = NULL;
> + }
> + written += data_end - data_start;
> + wpos += data_end - data_start;
> + data_start = 0; /* correct would be data_end % PAGE_SIZE
> + * but if this would be != 0, we don't
> + * have more data
> + */
> }
> zero_length_segment:
> if (likely(copied >= 0)) {
> - if (!status)
> - status = copied;
> -
> - if (status >= 0) {
> - written += status;
> - count -= status;
> - pos += status;
> - buf += status;
> + if (!status) {
> + count -= copied;
> + pos += copied;
> + buf += copied;
> if (unlikely(nr_segs > 1)) {
> filemap_set_next_iovec(&cur_iov,
> - &iov_base, status);
> + &iov_base, copied);
> if (count)
> buf = cur_iov->iov_base +
> iov_base;
> } else {
> - iov_base += status;
> + iov_base += copied;
> }
> }
> }
> if (unlikely(copied != bytes))
> - if (status >= 0)
> + if (!status)
> status = -EFAULT;
> - unlock_page(page);
> - mark_page_accessed(page);
> - page_cache_release(page);
> - if (status < 0)
> + if (status)
> break;
> - balance_dirty_pages_ratelimited(mapping);
> - cond_resched();
> } while (count);
> - *ppos = pos;
> +
> +out:
> + *ppos = wpos;
>
> if (cached_page)
> page_cache_release(cached_page);
>
> + if (page) {
> + unlock_page(page);
> + page_cache_release(page);
> + }
> +
> /*
> * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
> */
> - if (likely(status >= 0)) {
> + if (likely(!status)) {
> if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> if (!a_ops->writepage || !is_sync_kiocb(iocb))
> status = generic_osync_inode(inode, mapping,
> OSYNC_METADATA|OSYNC_DATA);
> }
> - }
> -
> + }
> +
> /*
> * If we get here for O_DIRECT writes then we must have fallen through
> * to buffered writes (block instantiation inside i_size). So we sync

2007-09-07 20:31:17

by Nick Piggin

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

On Saturday 08 September 2007 06:01, Goswin von Brederlow wrote:
> Nick Piggin <[email protected]> writes:
> > On Thursday 06 September 2007 03:41, Bernd Schubert wrote:
> > Minor nit: when resubmitting a patch, you should include everything
> > (ie. the full changelog of problem statement and fix description) in a
> > single mail. It's just a bit easier...
>
> Will do next time.
>
> > So I believe the problem is that for a multi-segment iovec, we currently
> > prepare_write/commit_write once for each segment, right? We do this
>
> It is more complex.
>
> Currently a __grab_cache_page, a_ops->prepare_write,
> filemap_copy_from_user[_iovec] and a_ops->commit_write are done
> whenever we hit
>
> a) a page boundary

This is required by the prepare_write/commit_write API. The write_begin
/ write_end API is also a page-based one, but in future we are looking
at having a more general API; we haven't completely decided on the form
yet. "perform_write" is one proposal you can look for.


> b) a segment boundary

This is done, as I said, because of the deadlock issue. While the issue is
more completely fixed in -mm, a special case for kernel memory (eg. nfsd)
is in the latest mainline kernels.


> Those two cases don't have to, and from the stats basically never,
> coincide. For NFSd this means we do this TWICE per segment and TWICE
> per page.

The page boundary doesn't matter so much (well it does for other reasons,
but we've never been good at them...). The segment boundary means that
we aren't able to do block sized writes very well and end up doing a lot of
read-modify-write operations that could be avoided.
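
(Invented numbers for illustration: a 512-byte segment landing in the middle
of a 4096-byte block means prepare_write() may have to read the rest of the
block from disk before the copy can proceed, even though the caller is
actually streaming the whole block sequentially.)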


> > because there is a nasty deadlock in the VM (copy_from_user being
> > called with a page locked), and copying multiple segs dramatically
> > increases the chances that one of these copies will cause a page fault
> > and thus potentially deadlock.
>
> What actually locks the page? Is it __grab_cache_page or
> a_ops->prepare_write?

prepare_write must be given a locked page.


> Note that the patch does not change the number of copy_from_user calls
> being made nor does it change their arguments. If we need 2 (or more)
>> segments to fill a page we still do 2 separate calls to
> filemap_copy_from_user_iovec, both only spanning (part of) one
> segment.
>
> What the patch changes is the number of copy_from_user calls between
> __grab_cache_page and a_ops->commit_write.

So you're doing all copy_from_user calls within a prepare_write? Then
you're increasing the chances of deadlock. If not, then you're breaking
the API contract.


> Copying a full PAGE_SIZE bytes from multiple segments in one go would
> be a further improvement if that is possible.
>
> > The fix you have I don't think can work because a filesystem must be
> > notified of the modification _before_ it has happened. (If I understand
> > correctly, you are skipping the prepare_write potentially until after
> > some data is copied?).
>
> Yes. We changed the order of copy_from_user calls and
> a_ops->prepare_write by mistake. We will rectify that and do the
> prepare_write for the full page (when possible) before copying the
> data into the page.

OK, that is what used to be done, but the API is broken due to this
deadlock. write_begin/write_end fixes it properly.


> > Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
> > also a workaround for the NFSD problem in git commit 29dbb3fc. Did
> > you try a later kernel to see if it is fixed there?
>
> Later than 2.6.23-rc5?

No it would be included earlier. The "segment_eq" check should be
allowing kernel writes (nfsd) to write multiple segments. If you have a
patch which changes this significantly, then it would indicate the
existing logic has a problem (or you've got a userspace application doing
the writev, which should be fixed by the write_begin patches in -mm).
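
From memory, the gist of that check is roughly the following (a sketch; the
real commit may differ in detail):

	/*
	 * Kernel-space segments (e.g. nfsd) cannot fault, so the copy
	 * only needs to be clamped to the current segment for
	 * user-space callers.
	 */
	if (!segment_eq(get_fs(), KERNEL_DS))
		bytes = min(bytes, cur_iov->iov_len - iov_base);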

2007-09-07 20:35:59

by Goswin von Brederlow

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

Nick Piggin <[email protected]> writes:

> On Thursday 06 September 2007 03:41, Bernd Schubert wrote:
> Minor nit: when resubmitting a patch, you should include everything
> (ie. the full changelog of problem statement and fix description) in a
> single mail. It's just a bit easier...

Will do next time.

> So I believe the problem is that for a multi-segment iovec, we currently
> prepare_write/commit_write once for each segment, right? We do this

It is more complex.

Currently a __grab_cache_page, a_ops->prepare_write,
filemap_copy_from_user[_iovec] and a_ops->commit_write are done
whenever we hit

a) a page boundary
b) a segment boundary

Those two cases don't have to, and from the stats basically never,
coincide. For NFSd this means we do this TWICE per segment and TWICE
per page.
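
A worked example with invented numbers: a writev() of 6 segments of 700
bytes each (4200 bytes, PAGE_CACHE_SIZE 4096) touches 2 pages:

	chunks cut at segment boundaries:        6
	page boundary at 4096 splits one chunk: +1
	prepare/commit pairs in mainline:        7
	filling each page first needs only:      2  (one per page)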

> because there is a nasty deadlock in the VM (copy_from_user being
> called with a page locked), and copying multiple segs dramatically
> increases the chances that one of these copies will cause a page fault
> and thus potentially deadlock.

What actually locks the page? Is it __grab_cache_page or
a_ops->prepare_write?

Note that the patch does not change the number of copy_from_user calls
being made nor does it change their arguments. If we need 2 (or more)
segments to fill a page we still do 2 separate calls to
filemap_copy_from_user_iovec, both only spanning (part of) one
segment.

What the patch changes is the number of copy_from_user calls between
__grab_cache_page and a_ops->commit_write.

Copying a full PAGE_SIZE bytes from multiple segments in one go would
be a further improvement if that is possible.

> The fix you have I don't think can work because a filesystem must be
> notified of the modification _before_ it has happened. (If I understand
> correctly, you are skipping the prepare_write potentially until after
> some data is copied?).

Yes. We changed the order of copy_from_user calls and
a_ops->prepare_write by mistake. We will rectify that and do the
prepare_write for the full page (when possible) before copying the
data into the page.

> Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
> also a workaround for the NFSD problem in git commit 29dbb3fc. Did
> you try a later kernel to see if it is fixed there?

Later than 2.6.23-rc5?

> Thanks,
> Nick

Regards,
Goswin

2007-09-07 21:00:25

by Goswin von Brederlow

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

Nick Piggin <[email protected]> writes:

> Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
> also a workaround for the NFSD problem in git commit 29dbb3fc. Did
> you try a later kernel to see if it is fixed there?

I had a chance to look up that commit (git clone took a while so sorry
for writing 2 mails). It is present in 2.6.23-rc5 so I already noticed
it when merging our patch in 2.6.23-rc5.

Upon closer reading of the patch, though, I see that it will indeed
prevent writes by nfsd from being split smaller than PAGE_SIZE, and it
will cause filemap_copy_from_user[_iovec] to be called with a source
spanning multiple pages.

So commit 29dbb3fc should have a similar, even slightly better, gain
for nfsd and other kernel-space segments. But it will not improve
writes from user space, where ~14% of the commits were saved during a
day's work for me.


Now I have a question about fault_in_pages_readable(). Can I call it
for multiple pages and then call __grab_cache_page() without risking
one of the pages getting lost again and causing a deadlock?

Regards,
Goswin

2007-09-07 21:12:26

by Goswin von Brederlow

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

Nick Piggin <[email protected]> writes:

> On Saturday 08 September 2007 06:01, Goswin von Brederlow wrote:
>> Nick Piggin <[email protected]> writes:
>> > So I believe the problem is that for a multi-segment iovec, we currently
>> > prepare_write/commit_write once for each segment, right? We do this
>>
>> It is more complex.
>>
>> Currently a __grab_cache_page, a_ops->prepare_write,
>> filemap_copy_from_user[_iovec] and a_ops->commit_write are done
>> whenever we hit
>>
>> a) a page boundary
>
> This is required by the prepare_write/commit_write API. The write_begin
> / write_end API is also a page-based one, but in future we are looking
> at having a more general API; we haven't completely decided on the form
> yet. "perform_write" is one proposal you can look for.
>
>> b) a segment boundary
>
> This is done, as I said, because of the deadlock issue. While the issue is
> more completely fixed in -mm, a special case for kernel memory (eg. nfsd)
> is in the latest mainline kernels.

Can you tell me where to get the fix from -mm? If it is completely
fixed there then that could make our patch obsolete.

>> Those two cases don't have to, and from the stats basically never,
>> coincide. For NFSd this means we do this TWICE per segment and TWICE
>> per page.
>
> The page boundary doesn't matter so much (well it does for other reasons,
> but we've never been good at them...). The segment boundary means that
> we aren't able to do block sized writes very well and end up doing a lot of
> read-modify-write operations that could be avoided.

Those are extremely costly for Lustre. We have tested exporting a
Lustre filesystem to NFS. Without the fixes we get 40 MB/s and with the
fixes it rises to nearly 200 MB/s. That is a factor of 5 in speed.

>> > because there is a nasty deadlock in the VM (copy_from_user being
>> > called with a page locked), and copying multiple segs dramatically
>> > increases the chances that one of these copies will cause a page fault
>> > and thus potentially deadlock.
>>
>> What actually locks the page? Is it __grab_cache_page or
>> a_ops->prepare_write?
>
> prepare_write must be given a locked page.

Then that means __grab_cache_page does return a locked page because
there is nothing between the two calls that would.

>> Note that the patch does not change the number of copy_from_user calls
>> being made nor does it change their arguments. If we need 2 (or more)
>> segments to fill a page we still do 2 separate calls to
>> filemap_copy_from_user_iovec, both only spanning (part of) one
>> segment.
>>
>> What the patch changes is the number of copy_from_user calls between
>> __grab_cache_page and a_ops->commit_write.
>
> So you're doing all copy_from_user calls within a prepare_write? Then
> you're increasing the chances of deadlock. If not, then you're breaking
> the API contract.

Actually, due to a bug, as you noticed, we do the copy first and then
prepare/commit. But fixing that would indeed do multiple copies between
prepare and commit.

>> Copying a full PAGE_SIZE bytes from multiple segments in one go would
>> be a further improvement if that is possible.
>>
>> > The fix you have I don't think can work because a filesystem must be
>> > notified of the modification _before_ it has happened. (If I understand
>> > correctly, you are skipping the prepare_write potentially until after
>> > some data is copied?).
>>
>> Yes. We changed the order of copy_from_user calls and
>> a_ops->prepare_write by mistake. We will rectify that and do the
>> prepare_write for the full page (when possible) before copying the
>> data into the page.
>
> OK, that is what used to be done, but the API is broken due to this
> deadlock. write_begin/write_end fixes it properly.

I'm very interested in that fix.

>> > Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
>> > also a workaround for the NFSD problem in git commit 29dbb3fc. Did
>> > you try a later kernel to see if it is fixed there?
>>
>> Later than 2.6.23-rc5?
>
> No it would be included earlier. The "segment_eq" check should be
> allowing kernel writes (nfsd) to write multiple segments. If you have a
> patch which changes this significantly, then it would indicate the
> existing logic has a problem (or you've got a userspace application doing
> the writev, which should be fixed by the write_begin patches in -mm).

I've got a userspace application doing the writev. To be exact, 14% of
the commits were saved by combining multiple segments into a single
prepare/commit pair. Since the kernel segments don't fragment anymore
in 2.6.23-rc5, those savings must come from user space.

From the stats posted earlier you can see that there is a substantial
number of calls with 6 segments, all (a lot) smaller than a page. Lots
of those calls our patch or the write_begin/end patches would save.

Regards,
Goswin

2007-09-07 21:16:53

by Nick Piggin

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

On Saturday 08 September 2007 07:00, Goswin von Brederlow wrote:
> Nick Piggin <[email protected]> writes:
> > Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
> > also a workaround for the NFSD problem in git commit 29dbb3fc. Did
> > you try a later kernel to see if it is fixed there?
>
> I had a chance to look up that commit (git clone took a while so sorry
> for writing 2 mails). It is present in 2.6.23-rc5 so I already noticed
> it when merging our patch in 2.6.23-rc5.
>
> Upon closer reading of the patch, though, I see that it will indeed
> prevent writes by nfsd from being split smaller than PAGE_SIZE, and it
> will cause filemap_copy_from_user[_iovec] to be called with a source
> spanning multiple pages.

OK, good.


> So commit 29dbb3fc should have a similar, even slightly better, gain
> for nfsd and other kernel-space segments. But it will not improve
> writes from user space, where ~14% of the commits were saved during a
> day's work for me.
>
>
> Now I have a question about fault_in_pages_readable(). Can I call it
> for multiple pages and then call __grab_cache_page() without risking
> one of the pages getting lost again and causing a deadlock?

No. The existing mainline code is buggy, and it is just hoping that
the userspace page does not get paged out between the fault_in_pages
and the subsequent copy_from_user.

If you do multiple fault_in_pages_readable(), you probably have less
chance of deadlocking, but on the 2nd call, you might still have to
wait for a long time to page in, during which time the 1st page may get
paged out.

But there is a proper solution in -mm with the write_begin aop.
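
The core of that loop looks roughly like this (simplified from memory, so
details may differ):

again:
	if (iov_iter_fault_in_readable(i, bytes))
		break;			/* -EFAULT; no locks held here */
	status = a_ops->write_begin(file, mapping, pos, bytes, flags,
					&page, &fsdata);
	pagefault_disable();
	copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
	pagefault_enable();		/* a fault only shortens the copy */
	flush_dcache_page(page);
	status = a_ops->write_end(file, mapping, pos, bytes, copied,
					page, fsdata);
	if (unlikely(copied == 0))
		goto again;		/* the user page went away; fault it
					   back in with no page lock held */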

2007-09-07 21:27:49

by Nick Piggin

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

On Saturday 08 September 2007 07:12, Goswin von Brederlow wrote:
> Nick Piggin <[email protected]> writes:
> > On Saturday 08 September 2007 06:01, Goswin von Brederlow wrote:

> >> b) a segment boundary
> >
> > This is done, as I said, because of the deadlock issue. While the issue
> > is more completely fixed in -mm, a special case for kernel memory (eg.
> > nfsd) is in the latest mainline kernels.
>
> Can you tell me where to get the fix from -mm? If it is completely
> fixed there then that could make our patch obsolete.

In the latest -mm series file, they start at
mm-revert-kernel_ds-buffered-write-optimisation.patch
...
and go to
ocfs2-convert-to-new-aops.patch


> >> What actually locks the page? Is it __grab_cache_page or
> >> a_ops->prepare_write?
> >
> > prepare_write must be given a locked page.
>
> Then that means __grab_cache_page does return a locked page because
> there is nothing between the two calls that would.

That's right.


> > No it would be included earlier. The "segment_eq" check should be
> > allowing kernel writes (nfsd) to write multiple segments. If you have a
> > patch which changes this significantly, then it would indicate the
> > existing logic has a problem (or you've got a userspace application doing
> > the writev, which should be fixed by the write_begin patches in -mm).
>
> I've got a userspace application doing the writev. To be exact, 14% of
> the commits were saved by combining multiple segments into a single
> prepare/commit pair. Since the kernel segments don't fragment anymore
> in 2.6.23-rc5, those savings must come from user space.
>
> From the stats posted earlier you can see that there is a substantial
> number of calls with 6 segments, all (a lot) smaller than a page. Lots
> of those calls our patch or the write_begin/end patches would save.

OK. The write_begin/write_end patchset is intrusive, no question. I'm not sure
what you're intending to do with it. They have been tested in -mm for quite a
while now, but just going with a simple patch that tries to copy more segments
might be OK for you if you're backporting. The deadlock is pretty uncommon.

2007-09-07 21:34:29

by Nick Piggin

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

On Saturday 08 September 2007 17:25, Nick Piggin wrote:
> On Saturday 08 September 2007 07:12, Goswin von Brederlow wrote:
> > Nick Piggin <[email protected]> writes:
> > > On Saturday 08 September 2007 06:01, Goswin von Brederlow wrote:
> > >> b) a segment boundary
> > >
> > > This is done, as I said, because of the deadlock issue. While the issue
> > > is more completely fixed in -mm, a special case for kernel memory (eg.
> > > nfsd) is in the latest mainline kernels.
> >
> > Can you tell me where to get the fix from -mm? If it is completely
> > fixed there then that could make our patch obsolete.
>
> In the latest -mm series file, they start at
> mm-revert-kernel_ds-buffered-write-optimisation.patch
> ...
> and go to
> ocfs2-convert-to-new-aops.patch
>
> > >> What actually locks the page? Is it __grab_cache_page or
> > >> a_ops->prepare_write?
> > >
> > > prepare_write must be given a locked page.
> >
> > Then that means __grab_cache_page does return a locked page because
> > there is nothing between the two calls that would.
>
> That's right.
>
> > > No it would be included earlier. The "segment_eq" check should be
> > > allowing kernel writes (nfsd) to write multiple segments. If you have a
> > > patch which changes this significantly, then it would indicate the
> > > existing logic has a problem (or you've got a userspace application
> > > doing the writev, which should be fixed by the write_begin patches in
> > > -mm).
> >
> > I've got a userspace application doing the writev. To be exact, 14% of
> > the commits were saved by combining multiple segments into a single
> > prepare/commit pair. Since the kernel segments don't fragment anymore
> > in 2.6.23-rc5, those savings must come from user space.
> >
> > From the stats posted earlier you can see that there is a substantial
> > number of calls with 6 segments, all (a lot) smaller than a page. Lots
> > of those calls our patch or the write_begin/end patches would save.
>
> OK. The write_begin/write_end patchset is intrusive, no question. I'm not
> sure what you're intending to do with it. They have been tested in -mm for
> quite a while now, but just going with a simple patch that tries to copy
> more segments might be OK for you if you're backporting. The deadlock is
> pretty uncommon.

Lustre would probably have to be ported over to write_begin/write_end in
order to use it too. With the patches in -mm, if a filesystem is still using
prepare_write/commit_write, the vm reverts to a safe path which avoids
the deadlock (and allows multi-seg io copies), but copies the data twice.
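
The fallback works roughly like this (again simplified from memory): the
user data is first copied into a kernel-side scratch page, where a fault is
harmless, and only then, under the page lock, into the pagecache page:

	/* no page lock held yet, so a fault here is safe */
	copy_from_user(page_address(src_page) + offset, buf, bytes);

	status = a_ops->prepare_write(file, page, offset, offset+bytes);
	/* second, kernel-to-kernel copy: cannot fault */
	memcpy(page_address(page) + offset,
	       page_address(src_page) + offset, bytes);
	status = a_ops->commit_write(file, page, offset, offset+bytes);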

OTOH, this is very likely to go upstream, so your filesystem will need to be
ported over sooner or later anyway.

2007-09-08 09:43:32

by Goswin von Brederlow

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

Nick Piggin <[email protected]> writes:

> Lustre would probably have to be ported over to write_begin/write_end in
> order to use it too. With the patches in -mm, if a filesystem is still using
> prepare_write/commit_write, the vm reverts to a safe path which avoids
> the deadlock (and allows multi-seg io copies), but copies the data twice.

Not quite relevant for the performance problem. The situation is like
this:

lustre servers <-lustre network protocol-> lustre client <-NFS-> desktop

The NFSd problem is on the Lustre client that only plays gateway. Not
that the Lustre servers or the desktop don't also lose performance due
to fragmenting writes, but it isn't that noticeable there.

> OTOH, this is very likely to go upstream, so your filesystem will need to be
> ported over sooner or later anyway.

Lustre copies the ext3 source from the kernel, patches in some extra
features and renames it during build. So on the one hand it always
breaks whenever someone meddles with the ext3 code. On the other hand,
improvements for ext3 get picked up by Lustre semi-automatically.

In this case Lustre would get the write_begin() function from ext3 and
use it.

Regards,
Goswin

2007-09-08 10:15:31

by Nick Piggin

Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

On Saturday 08 September 2007 19:43, Goswin von Brederlow wrote:
> Nick Piggin <[email protected]> writes:
> > Lustre would probably have to be ported over to write_begin/write_end in
> > order to use it too. With the patches in -mm, if a filesystem is still
> > using prepare_write/commit_write, the vm reverts to a safe path which
> > avoids the deadlock (and allows multi-seg io copies), but copies the data
> > twice.
>
> Not quite relevant for the performance problem. The situation is like
> this:
>
> lustre servers <-lustre network protocol-> lustre client <-NFS-> desktop
>
> The NFSd problem is on the Lustre client that only plays gateway. Not
> that the Lustre servers or the desktop don't also lose performance due
> to fragmenting writes, but it isn't that noticeable there.

OK, but the filesystem client would need to support write_begin/write_end
in order for writes to do multi-seg iovecs. nfsd of course will also be fixed
by the earlier patch I referenced, but you did want userspace multi-seg
writes to work too...


> > OTOH, this is very likely to go upstream, so your filesystem will need to
> > be ported over sooner or later anyway.
>
> Lustre copies the ext3 source from the kernel, patches in some extra
> features and renames it during build. So on the one hand it always
> breaks whenever someone meddles with the ext3 code. On the other hand,
> improvements for ext3 get picked up by Lustre semi-automatically.
>
> In this case Lustre would get the write_begin() function from ext3 and
> use it.

OK. Yes ext3 is converted in -mm.