2007-01-13 03:24:20

by Nick Piggin

Subject: [patch 0/10] buffered write deadlock fix

The following set of patches attempts to fix the buffered write
locking problems (and there are a couple of peripheral patches
and cleanups in there too).

This does pass the write deadlock tests that otherwise fail.

Has survived a few hours of fsx-linux on ext2 and 3.

Patches against 2.6.20-rc4. I didn't have the heart to attempt
to rebase them on -mm, at least until I get some feedback ;)

Thanks,
Nick

--
SuSE Labs


2007-01-13 03:24:30

by Nick Piggin

Subject: [patch 1/10] fs: libfs buffered write leak fix

simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
Fix the former, make a note of the latter. Several other filesystems seem
to be iffy here, too.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
int simple_prepare_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- if (!PageUptodate(page)) {
- if (to - from != PAGE_CACHE_SIZE) {
- void *kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr, 0, from);
- memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
- }
+ if (PageUptodate(page))
+ return 0;
+
+ if (to - from != PAGE_CACHE_SIZE) {
+ clear_highpage(page);
+ flush_dcache_page(page);
SetPageUptodate(page);
}
+
return 0;
}

int simple_commit_write(struct file *file, struct page *page,
- unsigned offset, unsigned to)
+ unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
-
- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold the i_mutex.
- */
- if (pos > inode->i_size)
- i_size_write(inode, pos);
- set_page_dirty(page);
+ if (to > from) {
+ struct inode *inode = page->mapping->host;
+ loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+ if (to - from == PAGE_CACHE_SIZE)
+ SetPageUptodate(page);
+ /*
+ * No need to use i_size_read() here, the i_size
+ * cannot change under us because we hold the i_mutex.
+ */
+ if (pos > inode->i_size)
+ i_size_write(inode, pos);
+ set_page_dirty(page);
+ }
return 0;
}

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page

if (is_mapped_to_disk)
SetPageMappedToDisk(page);
+
+ /* XXX: information leak vs read(2) */
SetPageUptodate(page);

/*

2007-01-13 03:24:44

by Nick Piggin

Subject: [patch 2/10] mm: revert "generic_file_buffered_write(): handle zero length iovec segments"

From: Andrew Morton <[email protected]>

Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6.

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1911,12 +1911,6 @@ generic_file_buffered_write(struct kiocb
break;
}

- if (unlikely(bytes == 0)) {
- status = 0;
- copied = 0;
- goto zero_length_segment;
- }
-
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
loff_t isize = i_size_read(inode);
@@ -1946,8 +1940,7 @@ generic_file_buffered_write(struct kiocb
page_cache_release(page);
continue;
}
-zero_length_segment:
- if (likely(copied >= 0)) {
+ if (likely(copied > 0)) {
if (!status)
status = copied;

Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
const struct iovec *iov = *iovp;
size_t base = *basep;

- do {
+ while (bytes) {
int copy = min(bytes, iov->iov_len - base);

bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
iov++;
base = 0;
}
- } while (bytes);
+ }
*iovp = iov;
*basep = base;
}

2007-01-13 03:25:04

by Nick Piggin

Subject: [patch 3/10] mm: revert "generic_file_buffered_write(): deadlock on vectored write"

From: Andrew Morton <[email protected]>

Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

The patch being reverted fixed the following bug:

When prefaulting in the pages in generic_file_buffered_write(), we only
faulted in the pages for the first segment of the iovec. If the second or a
subsequent segment described an mmapping of the page into which we're
write()ing, and that page is not up-to-date, the fault handler tries to lock
the already-locked page (to bring it up to date) and deadlocks.

An exploit for this bug is in writev-deadlock-demo.c, in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

(These demos assume blocksize < PAGE_CACHE_SIZE).
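
For illustration, a minimal sketch of the kind of program that can trigger the
deadlock (this is not the actual writev-deadlock-demo.c; the file name, sizes
and offsets are made up). The key ingredient is that the second iovec segment
is a not-yet-faulted-in mapping of the very file page being written to:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	static char first_seg[512];
	struct iovec iov[2];
	char *map;
	int fd;

	fd = open("deadlock-demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;

	/* Map the first file page but do not touch it, so it is not
	 * resident/up-to-date when the kernel copies from it. */
	map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	memset(first_seg, 'a', sizeof(first_seg));
	iov[0].iov_base = first_seg;	/* only this segment gets prefaulted */
	iov[0].iov_len = sizeof(first_seg);
	iov[1].iov_base = map + 1024;	/* same page as the write target */
	iov[1].iov_len = 512;

	/*
	 * Writing into page 0 of the file: copying iov[1] can fault on the
	 * mapping of that same page while the kernel holds its page lock.
	 */
	if (writev(fd, iov, 2) < 0)
		perror("writev");
	return 0;
}

Whether this actually hangs depends on the destination page not being
up-to-date at copy time, hence the blocksize < PAGE_CACHE_SIZE assumption
above.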

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment. So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem with the fix is that it didn't actually fix all the
locking problems.


<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Signed-off-by: Andrew Morton <[email protected]>

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1881,21 +1881,14 @@ generic_file_buffered_write(struct kiocb
do {
unsigned long index;
unsigned long offset;
+ unsigned long maxlen;
size_t copied;

offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
-
- /* Limit the size of the copy to the caller's write size */
- bytes = min(bytes, count);
-
- /*
- * Limit the size of the copy to that of the current segment,
- * because fault_in_pages_readable() doesn't know how to walk
- * segments.
- */
- bytes = min(bytes, cur_iov->iov_len - iov_base);
+ if (bytes > count)
+ bytes = count;

/*
* Bring in the user page that we will copy from _first_.
@@ -1903,7 +1896,10 @@ generic_file_buffered_write(struct kiocb
* same page as we're writing to, without it being marked
* up-to-date.
*/
- fault_in_pages_readable(buf, bytes);
+ maxlen = cur_iov->iov_len - iov_base;
+ if (maxlen > bytes)
+ maxlen = bytes;
+ fault_in_pages_readable(buf, maxlen);

page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {

2007-01-13 03:25:42

by Nick Piggin

Subject: [patch 8/10] mm: generic_file_buffered_write cleanup more

No need for the confusing switch of variables from copied into status.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1898,28 +1898,22 @@ generic_file_buffered_write(struct kiocb
goto fs_write_aop_error;

if (likely(copied > 0)) {
- if (!status)
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
- if (unlikely(nr_segs > 1)) {
- filemap_set_next_iovec(&cur_iov,
- &iov_offset, status);
- if (count)
- buf = cur_iov->iov_base +
- iov_offset;
- } else {
- iov_offset += status;
- }
+ written += copied;
+ count -= copied;
+ pos += copied;
+ buf += copied;
+ if (unlikely(nr_segs > 1)) {
+ filemap_set_next_iovec(&cur_iov,
+ &iov_offset, copied);
+ if (count)
+ buf = cur_iov->iov_base + iov_offset;
+ } else {
+ iov_offset += copied;
}
}
if (unlikely(copied != bytes))
- if (status >= 0)
- status = -EFAULT;
+ status = -EFAULT;
+
unlock_page(page);
mark_page_accessed(page);
page_cache_release(page);

2007-01-13 03:25:43

by Nick Piggin

Subject: [patch 5/10] mm: debug write deadlocks

Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a
subsequent patch.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1894,6 +1894,7 @@ generic_file_buffered_write(struct kiocb
if (maxlen > bytes)
maxlen = bytes;

+#ifndef CONFIG_DEBUG_VM
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
@@ -1901,6 +1902,7 @@ generic_file_buffered_write(struct kiocb
* up-to-date.
*/
fault_in_pages_readable(buf, maxlen);
+#endif

page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {

2007-01-13 03:25:55

by Nick Piggin

Subject: [patch 9/10] mm: generic_file_buffered_write iovec cleanup

Move some of the open-coded nr_segs tests into the iovec helpers. This is
all to simplify generic_file_buffered_write, because it gets more complex
in the next patch.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -22,82 +22,82 @@ __filemap_copy_from_user_iovec_inatomic(

/*
* Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied. If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
- *
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy. If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero). A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * were sucessfully copied. If a fault is encountered then return the number of
+ * bytes which were copied.
*/
static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
- const char __user *buf, unsigned bytes)
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
+ const struct iovec *iov, unsigned long nr_segs,
+ size_t base, size_t bytes)
{
char *kaddr;
- int left;
+ size_t copied;

kaddr = kmap_atomic(page, KM_USER0);
- left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
+ if (likely(nr_segs == 1)) {
+ int left;
+ char __user *buf = iov->iov_base + base;
+ left = __copy_from_user_inatomic_nocache(kaddr + offset,
+ buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+ iov, base, bytes);
+ }
kunmap_atomic(kaddr, KM_USER0);

- if (left != 0) {
- /* Do it the slow way */
- kaddr = kmap(page);
- left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
- kunmap(page);
- }
- return bytes - left;
+ return copied;
}

/*
- * This has the same sideeffects and return value as filemap_copy_from_user().
- * The difference is that on a fault we need to memset the remainder of the
- * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
- * single-segment behaviour.
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
*/
static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
- const struct iovec *iov, size_t base, size_t bytes)
+filemap_copy_from_user(struct page *page, unsigned long offset,
+ const struct iovec *iov, unsigned long nr_segs,
+ size_t base, size_t bytes)
{
char *kaddr;
size_t copied;

- kaddr = kmap_atomic(page, KM_USER0);
- copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
- base, bytes);
- kunmap_atomic(kaddr, KM_USER0);
- if (copied != bytes) {
- kaddr = kmap(page);
- copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
- base, bytes);
- if (bytes - copied)
- memset(kaddr + offset + copied, 0, bytes - copied);
- kunmap(page);
+ kaddr = kmap(page);
+ if (likely(nr_segs == 1)) {
+ int left;
+ char __user *buf = iov->iov_base + base;
+ left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+ iov, base, bytes);
}
+ kunmap(page);
return copied;
}

static inline void
-filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
+filemap_set_next_iovec(const struct iovec **iovp, unsigned long nr_segs,
+ size_t *basep, size_t bytes)
{
- const struct iovec *iov = *iovp;
- size_t base = *basep;
-
- while (bytes) {
- int copy = min(bytes, iov->iov_len - base);
-
- bytes -= copy;
- base += copy;
- if (iov->iov_len == base) {
- iov++;
- base = 0;
+ if (likely(nr_segs == 1)) {
+ *basep += bytes;
+ } else {
+ const struct iovec *iov = *iovp;
+ size_t base = *basep;
+
+ while (bytes) {
+ int copy = min(bytes, iov->iov_len - base);
+
+ bytes -= copy;
+ base += copy;
+ if (iov->iov_len == base) {
+ iov++;
+ base = 0;
+ }
}
+ *iovp = iov;
+ *basep = base;
}
- *iovp = iov;
- *basep = base;
}
#endif
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1840,12 +1840,7 @@ generic_file_buffered_write(struct kiocb
/*
* handle partial DIO write. Adjust cur_iov if needed.
*/
- if (likely(nr_segs == 1))
- buf = iov->iov_base + written;
- else {
- filemap_set_next_iovec(&cur_iov, &iov_offset, written);
- buf = cur_iov->iov_base + iov_offset;
- }
+ filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);

do {
struct page *page;
@@ -1855,6 +1850,7 @@ generic_file_buffered_write(struct kiocb
size_t bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */

+ buf = cur_iov->iov_base + iov_offset;
offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
@@ -1886,31 +1882,19 @@ generic_file_buffered_write(struct kiocb
if (unlikely(status))
goto fs_write_aop_error;

- if (likely(nr_segs == 1))
- copied = filemap_copy_from_user(page, offset,
- buf, bytes);
- else
- copied = filemap_copy_from_user_iovec(page, offset,
- cur_iov, iov_offset, bytes);
+ copied = filemap_copy_from_user(page, offset,
+ cur_iov, nr_segs, iov_offset, bytes);
flush_dcache_page(page);
+
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (unlikely(status))
goto fs_write_aop_error;

- if (likely(copied > 0)) {
- written += copied;
- count -= copied;
- pos += copied;
- buf += copied;
- if (unlikely(nr_segs > 1)) {
- filemap_set_next_iovec(&cur_iov,
- &iov_offset, copied);
- if (count)
- buf = cur_iov->iov_base + iov_offset;
- } else {
- iov_offset += copied;
- }
- }
+ written += copied;
+ count -= copied;
+ pos += copied;
+ filemap_set_next_iovec(&cur_iov, nr_segs,
+ &iov_offset, written);
if (unlikely(copied != bytes))
status = -EFAULT;

2007-01-13 03:26:26

by Nick Piggin

Subject: [patch 10/10] mm: fix pagecache write deadlocks

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1. generic_buffered_write
2. lock_page
3. prepare_write
4. unlock_page+vmtruncate
5. copy_from_user
6. mmap_sem(r)
7. handle_mm_fault
8. lock_page (filemap_nopage)
9. commit_write
10. unlock_page

b. sys_munmap / sys_mlock / others
c. mmap_sem(w)
d. make_pages_present
e. get_user_pages
f. handle_mm_fault
g. lock_page (filemap_nopage)

2,8 - recursive deadlock if page is same
2,8;2,8 - ABBA deadlock if page is different
2,6;c,g - ABBA deadlock if page is same

The solution is as follows:
1. If we find the destination page is uptodate, continue as normal, but use
atomic usercopies which do not take pagefaults and do not zero the uncopied
tail of the destination. The destination is already uptodate, so we can
commit_write the full length even if there was a partial copy: it does not
matter that the tail was not modified, because if it is dirtied and written
back to disk it will not cause any problems (uptodate *means* that the
destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
information, because atomic usercopies cannot distinguish non-present pages
in a readable mapping from the lack of a readable mapping.

2. If we find the destination page is not uptodate, unlock it (this could be
made slightly more optimal), then find and pin the source page with
get_user_pages. Relock the destination page and continue with the copy.
However, instead of a usercopy (which might take a fault), copy the data
via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1843,11 +1843,12 @@ generic_file_buffered_write(struct kiocb
filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);

do {
+ struct page *src_page;
struct page *page;
pgoff_t index; /* Pagecache index for current page */
unsigned long offset; /* Offset into pagecache page */
- unsigned long maxlen; /* Bytes remaining in current iovec */
- size_t bytes; /* Bytes to write to page */
+ unsigned long seglen; /* Bytes remaining in current iovec */
+ unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */

buf = cur_iov->iov_base + iov_offset;
@@ -1857,20 +1858,30 @@ generic_file_buffered_write(struct kiocb
if (bytes > count)
bytes = count;

- maxlen = cur_iov->iov_len - iov_offset;
- if (maxlen > bytes)
- maxlen = bytes;
+ /*
+ * a non-NULL src_page indicates that we're doing the
+ * copy via get_user_pages and kmap.
+ */
+ src_page = NULL;
+
+ seglen = cur_iov->iov_len - iov_offset;
+ if (seglen > bytes)
+ seglen = bytes;

-#ifndef CONFIG_DEBUG_VM
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
+ *
+ * Not only is this an optimisation, but it is also required
+ * to check that the address is actually valid, when atomic
+ * usercopies are used, below.
*/
- fault_in_pages_readable(buf, maxlen);
-#endif
-
+ if (unlikely(fault_in_pages_readable(buf, seglen))) {
+ status = -EFAULT;
+ break;
+ }

page = __grab_cache_page(mapping, index);
if (!page) {
@@ -1878,31 +1889,88 @@ generic_file_buffered_write(struct kiocb
break;
}

+ /*
+ * non-uptodate pages cannot cope with short copies, and we
+ * cannot take a pagefault with the destination page locked.
+ * So pin the source page to copy it.
+ */
+ if (!PageUptodate(page)) {
+ unlock_page(page);
+
+ bytes = min(bytes, PAGE_CACHE_SIZE -
+ ((unsigned long)buf & ~PAGE_CACHE_MASK));
+
+ /*
+ * Cannot get_user_pages with a page locked for the
+ * same reason as we can't take a page fault with a
+ * page locked (as explained below).
+ */
+ status = get_user_pages(current, current->mm,
+ (unsigned long)buf & PAGE_CACHE_MASK, 1,
+ 0, 0, &src_page, NULL);
+ if (status != 1) {
+ page_cache_release(page);
+ break;
+ }
+
+ lock_page(page);
+ if (!page->mapping) {
+ unlock_page(page);
+ page_cache_release(page);
+ page_cache_release(src_page);
+ continue;
+ }
+
+ }
+
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status))
goto fs_write_aop_error;

- copied = filemap_copy_from_user(page, offset,
+ if (!src_page) {
+ /*
+ * Must not enter the pagefault handler here, because
+ * we hold the page lock, so we might recursively
+ * deadlock on the same lock, or get an ABBA deadlock
+ * against a different lock, or against the mmap_sem
+ * (which nests outside the page lock). So increment
+ * preempt count, and use _atomic usercopies.
+ *
+ * The page is uptodate so we are OK to encounter a
+ * short copy: if unmodified parts of the page are
+ * marked dirty and written out to disk, it doesn't
+ * really matter.
+ */
+ copied = filemap_copy_from_user_atomic(page, offset,
cur_iov, nr_segs, iov_offset, bytes);
+ } else {
+ char *src, *dst;
+ src = kmap(src_page);
+ dst = kmap(page);
+ memcpy(dst + offset,
+ src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
+ bytes);
+ kunmap(page);
+ kunmap(src_page);
+ copied = bytes;
+ }
flush_dcache_page(page);

status = a_ops->commit_write(file, page, offset, offset+bytes);
if (unlikely(status))
goto fs_write_aop_error;

+ unlock_page(page);
+ mark_page_accessed(page);
+ page_cache_release(page);
+ if (src_page)
+ page_cache_release(src_page);
+
written += copied;
count -= copied;
pos += copied;
- filemap_set_next_iovec(&cur_iov, nr_segs,
- &iov_offset, written);
- if (unlikely(copied != bytes))
- status = -EFAULT;
+ filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);

- unlock_page(page);
- mark_page_accessed(page);
- page_cache_release(page);
- if (status < 0)
- break;
balance_dirty_pages_ratelimited(mapping);
cond_resched();
continue;
@@ -1911,6 +1979,8 @@ fs_write_aop_error:
if (status != AOP_TRUNCATED_PAGE)
unlock_page(page);
page_cache_release(page);
+ if (src_page)
+ page_cache_release(src_page);

/*
* prepare_write() may have instantiated a few blocks
@@ -1923,7 +1993,6 @@ fs_write_aop_error:
continue;
else
break;
-
} while (count);
*ppos = pos;

Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
{
int ret;

+ if (unlikely(size == 0))
+ return 0;
+
/*
* Writing zeroes into userspace here is OK, because we know that if
* the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
return ret;
}

-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
{
volatile char c;
int ret;

+ if (unlikely(size == 0))
+ return 0;
+
ret = __get_user(c, uaddr);
if (ret == 0) {
const char __user *end = uaddr + size - 1;

if (((unsigned long)uaddr & PAGE_MASK) !=
((unsigned long)end & PAGE_MASK))
- __get_user(c, end);
+ ret = __get_user(c, end);
}
+ return ret;
}

#endif /* _LINUX_PAGEMAP_H */

2007-01-13 03:25:37

by Nick Piggin

Subject: [patch 7/10] mm: cleanup pagecache insertion operations

Quite a bit of code is used in maintaining these "cached pages" that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page. Then a multi-page write into an uncached
part of the file would be needed to make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -686,26 +686,22 @@ EXPORT_SYMBOL(find_lock_page);
struct page *find_or_create_page(struct address_space *mapping,
unsigned long index, gfp_t gfp_mask)
{
- struct page *page, *cached_page = NULL;
+ struct page *page;
int err;
repeat:
page = find_lock_page(mapping, index);
if (!page) {
- if (!cached_page) {
- cached_page = alloc_page(gfp_mask);
- if (!cached_page)
- return NULL;
- }
- err = add_to_page_cache_lru(cached_page, mapping,
- index, gfp_mask);
- if (!err) {
- page = cached_page;
- cached_page = NULL;
- } else if (err == -EEXIST)
- goto repeat;
+ page = alloc_page(gfp_mask);
+ if (!page)
+ return NULL;
+ err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+ if (unlikely(err)) {
+ page_cache_release(page);
+ page = NULL;
+ if (err == -EEXIST)
+ goto repeat;
+ }
}
- if (cached_page)
- page_cache_release(cached_page);
return page;
}
EXPORT_SYMBOL(find_or_create_page);
@@ -891,11 +887,9 @@ void do_generic_mapping_read(struct addr
unsigned long next_index;
unsigned long prev_index;
loff_t isize;
- struct page *cached_page;
int error;
struct file_ra_state ra = *_ra;

- cached_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_page;
@@ -1059,23 +1053,20 @@ no_cached_page:
* Ok, it wasn't cached, so we need to create a new
* page..
*/
- if (!cached_page) {
- cached_page = page_cache_alloc_cold(mapping);
- if (!cached_page) {
- desc->error = -ENOMEM;
- goto out;
- }
+ page = page_cache_alloc_cold(mapping);
+ if (!page) {
+ desc->error = -ENOMEM;
+ goto out;
}
- error = add_to_page_cache_lru(cached_page, mapping,
+ error = add_to_page_cache_lru(page, mapping,
index, GFP_KERNEL);
if (error) {
+ page_cache_release(page);
if (error == -EEXIST)
goto find_page;
desc->error = error;
goto out;
}
- page = cached_page;
- cached_page = NULL;
goto readpage;
}

@@ -1083,8 +1074,6 @@ out:
*_ra = ra;

*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
- if (cached_page)
- page_cache_release(cached_page);
if (filp)
file_accessed(filp);
}
@@ -1542,35 +1531,28 @@ static inline struct page *__read_cache_
int (*filler)(void *,struct page*),
void *data)
{
- struct page *page, *cached_page = NULL;
+ struct page *page;
int err;
repeat:
page = find_get_page(mapping, index);
if (!page) {
- if (!cached_page) {
- cached_page = page_cache_alloc_cold(mapping);
- if (!cached_page)
- return ERR_PTR(-ENOMEM);
- }
- err = add_to_page_cache_lru(cached_page, mapping,
- index, GFP_KERNEL);
- if (err == -EEXIST)
- goto repeat;
- if (err < 0) {
+ page = page_cache_alloc_cold(mapping);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+ err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+ if (unlikely(err)) {
+ page_cache_release(page);
+ if (err == -EEXIST)
+ goto repeat;
/* Presumably ENOMEM for radix tree node */
- page_cache_release(cached_page);
return ERR_PTR(err);
}
- page = cached_page;
- cached_page = NULL;
err = filler(data, page);
if (err < 0) {
page_cache_release(page);
page = ERR_PTR(err);
}
}
- if (cached_page)
- page_cache_release(cached_page);
return page;
}

@@ -1621,40 +1603,6 @@ retry:
EXPORT_SYMBOL(read_cache_page);

/*
- * If the page was newly created, increment its refcount and add it to the
- * caller's lru-buffering pagevec. This function is specifically for
- * generic_file_write().
- */
-static inline struct page *
-__grab_cache_page(struct address_space *mapping, unsigned long index,
- struct page **cached_page, struct pagevec *lru_pvec)
-{
- int err;
- struct page *page;
-repeat:
- page = find_lock_page(mapping, index);
- if (!page) {
- if (!*cached_page) {
- *cached_page = page_cache_alloc(mapping);
- if (!*cached_page)
- return NULL;
- }
- err = add_to_page_cache(*cached_page, mapping,
- index, GFP_KERNEL);
- if (err == -EEXIST)
- goto repeat;
- if (err == 0) {
- page = *cached_page;
- page_cache_get(page);
- if (!pagevec_add(lru_pvec, page))
- __pagevec_lru_add(lru_pvec);
- *cached_page = NULL;
- }
- }
- return page;
-}
-
-/*
* The logic we want is
*
* if suid or (sgid and xgrp)
@@ -1848,6 +1796,33 @@ generic_file_direct_write(struct kiocb *
}
EXPORT_SYMBOL(generic_file_direct_write);

+/*
+ * Find or create a page at the given pagecache position. Return the locked
+ * page. This function is specifically for buffered writes.
+ */
+static struct page *__grab_cache_page(struct address_space *mapping,
+ pgoff_t index)
+{
+ int status;
+ struct page *page;
+repeat:
+ page = find_lock_page(mapping, index);
+ if (likely(page))
+ return page;
+
+ page = page_cache_alloc(mapping);
+ if (!page)
+ return NULL;
+ status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+ if (unlikely(status)) {
+ page_cache_release(page);
+ if (status == -EEXIST)
+ goto repeat;
+ return NULL;
+ }
+ return page;
+}
+
ssize_t
generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -1858,15 +1833,10 @@ generic_file_buffered_write(struct kiocb
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
- struct page *page;
- struct page *cached_page = NULL;
- struct pagevec lru_pvec;
const struct iovec *cur_iov = iov; /* current iovec */
size_t iov_offset = 0; /* offset in the current iovec */
char __user *buf;

- pagevec_init(&lru_pvec, 0);
-
/*
* handle partial DIO write. Adjust cur_iov if needed.
*/
@@ -1878,6 +1848,7 @@ generic_file_buffered_write(struct kiocb
}

do {
+ struct page *page;
pgoff_t index; /* Pagecache index for current page */
unsigned long offset; /* Offset into pagecache page */
unsigned long maxlen; /* Bytes remaining in current iovec */
@@ -1904,7 +1875,8 @@ generic_file_buffered_write(struct kiocb
fault_in_pages_readable(buf, maxlen);
#endif

- page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+
+ page = __grab_cache_page(mapping, index);
if (!page) {
status = -ENOMEM;
break;
@@ -1977,9 +1949,6 @@ fs_write_aop_error:
} while (count);
*ppos = pos;

- if (cached_page)
- page_cache_release(cached_page);
-
/*
* For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
*/
@@ -1999,7 +1968,6 @@ fs_write_aop_error:
if (unlikely(file->f_flags & O_DIRECT) && written)
status = filemap_write_and_wait(mapping);

- pagevec_lru_add(&lru_pvec);
return written ? written : status;
}
EXPORT_SYMBOL(generic_file_buffered_write);
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -389,31 +389,25 @@ mpage_readpages(struct address_space *ma
struct bio *bio = NULL;
unsigned page_idx;
sector_t last_block_in_bio = 0;
- struct pagevec lru_pvec;
struct buffer_head map_bh;
unsigned long first_logical_block = 0;

clear_buffer_mapped(&map_bh);
- pagevec_init(&lru_pvec, 0);
for (page_idx = 0; page_idx < nr_pages; page_idx++) {
struct page *page = list_entry(pages->prev, struct page, lru);

prefetchw(&page->flags);
list_del(&page->lru);
- if (!add_to_page_cache(page, mapping,
+ if (!add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
bio = do_mpage_readpage(bio, page,
nr_pages - page_idx,
&last_block_in_bio, &map_bh,
&first_logical_block,
get_block);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
- } else {
- page_cache_release(page);
}
+ page_cache_release(page);
}
- pagevec_lru_add(&lru_pvec);
BUG_ON(!list_empty(pages));
if (bio)
mpage_bio_submit(READ, bio);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -133,28 +133,25 @@ int read_cache_pages(struct address_spac
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
- struct pagevec lru_pvec;
int ret = 0;

- pagevec_init(&lru_pvec, 0);
-
while (!list_empty(pages)) {
page = list_to_page(pages);
list_del(&page->lru);
- if (add_to_page_cache(page, mapping, page->index, GFP_KERNEL)) {
+ if (add_to_page_cache_lru(page, mapping,
+ page->index, GFP_KERNEL)) {
page_cache_release(page);
continue;
}
+ page_cache_release(page);
+
ret = filler(data, page);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
- if (ret) {
+ if (unlikely(ret)) {
put_pages_list(pages);
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
}
- pagevec_lru_add(&lru_pvec);
return ret;
}

@@ -164,7 +161,6 @@ static int read_pages(struct address_spa
struct list_head *pages, unsigned nr_pages)
{
unsigned page_idx;
- struct pagevec lru_pvec;
int ret;

if (mapping->a_ops->readpages) {
@@ -174,19 +170,15 @@ static int read_pages(struct address_spa
goto out;
}

- pagevec_init(&lru_pvec, 0);
for (page_idx = 0; page_idx < nr_pages; page_idx++) {
struct page *page = list_to_page(pages);
list_del(&page->lru);
- if (!add_to_page_cache(page, mapping,
+ if (!add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
mapping->a_ops->readpage(filp, page);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
- } else
- page_cache_release(page);
+ }
+ page_cache_release(page);
}
- pagevec_lru_add(&lru_pvec);
ret = 0;
out:
return ret;

2007-01-13 03:27:08

by Nick Piggin

Subject: [patch 6/10] mm: be sure to trim blocks

If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
we may have failed the write operation despite prepare_write having
instantiated blocks past i_size. Fix this, and consolidate the trimming into
one place.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1911,22 +1911,9 @@ generic_file_buffered_write(struct kiocb
}

status = a_ops->prepare_write(file, page, offset, offset+bytes);
- if (unlikely(status)) {
- loff_t isize = i_size_read(inode);
+ if (unlikely(status))
+ goto fs_write_aop_error;

- if (status != AOP_TRUNCATED_PAGE)
- unlock_page(page);
- page_cache_release(page);
- if (status == AOP_TRUNCATED_PAGE)
- continue;
- /*
- * prepare_write() may have instantiated a few blocks
- * outside i_size. Trim these off again.
- */
- if (pos + bytes > isize)
- vmtruncate(inode, isize);
- break;
- }
if (likely(nr_segs == 1))
copied = filemap_copy_from_user(page, offset,
buf, bytes);
@@ -1935,10 +1922,9 @@ generic_file_buffered_write(struct kiocb
cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
- if (status == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- continue;
- }
+ if (unlikely(status))
+ goto fs_write_aop_error;
+
if (likely(copied > 0)) {
if (!status)
status = copied;
@@ -1969,6 +1955,25 @@ generic_file_buffered_write(struct kiocb
break;
balance_dirty_pages_ratelimited(mapping);
cond_resched();
+ continue;
+
+fs_write_aop_error:
+ if (status != AOP_TRUNCATED_PAGE)
+ unlock_page(page);
+ page_cache_release(page);
+
+ /*
+ * prepare_write() may have instantiated a few blocks
+ * outside i_size. Trim these off again. Don't need
+ * i_size_read because we hold i_mutex.
+ */
+ if (pos + bytes > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ if (status == AOP_TRUNCATED_PAGE)
+ continue;
+ else
+ break;
+
} while (count);
*ppos = pos;

2007-01-13 03:27:41

by Nick Piggin

Subject: [patch 4/10] mm: generic_file_buffered_write cleanup

From: Andrew Morton <[email protected]>

Clean up buffered write code. Rename some variables and fix some types.

Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1854,16 +1854,15 @@ generic_file_buffered_write(struct kiocb
size_t count, ssize_t written)
{
struct file *file = iocb->ki_filp;
- struct address_space * mapping = file->f_mapping;
+ struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
struct page *page;
struct page *cached_page = NULL;
- size_t bytes;
struct pagevec lru_pvec;
const struct iovec *cur_iov = iov; /* current iovec */
- size_t iov_base = 0; /* offset in the current iovec */
+ size_t iov_offset = 0; /* offset in the current iovec */
char __user *buf;

pagevec_init(&lru_pvec, 0);
@@ -1874,31 +1873,33 @@ generic_file_buffered_write(struct kiocb
if (likely(nr_segs == 1))
buf = iov->iov_base + written;
else {
- filemap_set_next_iovec(&cur_iov, &iov_base, written);
- buf = cur_iov->iov_base + iov_base;
+ filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+ buf = cur_iov->iov_base + iov_offset;
}

do {
- unsigned long index;
- unsigned long offset;
- unsigned long maxlen;
- size_t copied;
+ pgoff_t index; /* Pagecache index for current page */
+ unsigned long offset; /* Offset into pagecache page */
+ unsigned long maxlen; /* Bytes remaining in current iovec */
+ size_t bytes; /* Bytes to write to page */
+ size_t copied; /* Bytes copied from user */

- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+ offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes > count)
bytes = count;

+ maxlen = cur_iov->iov_len - iov_offset;
+ if (maxlen > bytes)
+ maxlen = bytes;
+
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
*/
- maxlen = cur_iov->iov_len - iov_base;
- if (maxlen > bytes)
- maxlen = bytes;
fault_in_pages_readable(buf, maxlen);

page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1929,7 +1930,7 @@ generic_file_buffered_write(struct kiocb
buf, bytes);
else
copied = filemap_copy_from_user_iovec(page, offset,
- cur_iov, iov_base, bytes);
+ cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (status == AOP_TRUNCATED_PAGE) {
@@ -1947,12 +1948,12 @@ generic_file_buffered_write(struct kiocb
buf += status;
if (unlikely(nr_segs > 1)) {
filemap_set_next_iovec(&cur_iov,
- &iov_base, status);
+ &iov_offset, status);
if (count)
buf = cur_iov->iov_base +
- iov_base;
+ iov_offset;
} else {
- iov_base += status;
+ iov_offset += status;
}
}
}

2007-01-14 04:00:10

by Nick Piggin

Subject: Re: [patch 10/10] mm: fix pagecache write deadlocks

Nick Piggin wrote:

> @@ -1878,31 +1889,88 @@ generic_file_buffered_write(struct kiocb
> break;
> }
>
> + /*
> + * non-uptodate pages cannot cope with short copies, and we
> + * cannot take a pagefault with the destination page locked.
> + * So pin the source page to copy it.
> + */
> + if (!PageUptodate(page)) {
> + unlock_page(page);
> +
> + bytes = min(bytes, PAGE_CACHE_SIZE -
> + ((unsigned long)buf & ~PAGE_CACHE_MASK));
> +
> + /*
> + * Cannot get_user_pages with a page locked for the
> + * same reason as we can't take a page fault with a
> + * page locked (as explained below).
> + */
> + status = get_user_pages(current, current->mm,
> + (unsigned long)buf & PAGE_CACHE_MASK, 1,
> + 0, 0, &src_page, NULL);

Thinko... get_user_pages needs to be called with mmap_sem held, obviously.
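
A minimal sketch of what that correction might look like around the quoted
call (illustrative only, not an actual follow-up patch; the fragment lives
inside generic_file_buffered_write(), after the destination page has been
unlocked):

	/*
	 * Sketch: take mmap_sem for reading around get_user_pages(),
	 * which must not be called without it.
	 */
	down_read(&current->mm->mmap_sem);
	status = get_user_pages(current, current->mm,
			(unsigned long)buf & PAGE_CACHE_MASK, 1,
			0, 0, &src_page, NULL);
	up_read(&current->mm->mmap_sem);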

--
SUSE Labs, Novell Inc.

2007-01-14 14:25:41

by Dmitri Monakhov

Subject: Re: [patch 6/10] mm: be sure to trim blocks

Nick Piggin <[email protected]> writes:

> If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
> we may have failed the write operation despite prepare_write having
> instantiated blocks past i_size. Fix this, and consolidate the trimming into
> one place.
>
> Signed-off-by: Nick Piggin <[email protected]>
>
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -1911,22 +1911,9 @@ generic_file_buffered_write(struct kiocb
> }
>
> status = a_ops->prepare_write(file, page, offset, offset+bytes);
> - if (unlikely(status)) {
> - loff_t isize = i_size_read(inode);
> + if (unlikely(status))
> + goto fs_write_aop_error;
Maybe it's a stupid question, but still...
Why do we treat a non-zero prepare_write() return code as an error? It may be
positive. A positive return code could be used as a fine-grained 'bytes'
limiter in the blksize < pgsize case, as follows:

status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
	if (status > 0) {
		bytes = min(bytes, status);
		status = 0;
	} else {
		goto fs_write_aop_error;
	}
}
---
This is useful because an fs may want to reduce 'bytes' for a number of
reasons, for example to make it block-size bound.
Example: a filesystem has a 1k block size and only two free blocks, and we try
to write 4k bytes.
Currently write(fd, buff, 4096) will return -ENOSPC, but after this change
write(fd, buff, 4096) will return as much as it can (2048).
>
> - if (status != AOP_TRUNCATED_PAGE)
> - unlock_page(page);
> - page_cache_release(page);
> - if (status == AOP_TRUNCATED_PAGE)
> - continue;
> - /*
> - * prepare_write() may have instantiated a few blocks
> - * outside i_size. Trim these off again.
> - */
> - if (pos + bytes > isize)
> - vmtruncate(inode, isize);
> - break;
> - }
> if (likely(nr_segs == 1))
> copied = filemap_copy_from_user(page, offset,
> buf, bytes);
> @@ -1935,10 +1922,9 @@ generic_file_buffered_write(struct kiocb
> cur_iov, iov_offset, bytes);
> flush_dcache_page(page);
> status = a_ops->commit_write(file, page, offset, offset+bytes);
> - if (status == AOP_TRUNCATED_PAGE) {
> - page_cache_release(page);
> - continue;
> - }
> + if (unlikely(status))
> + goto fs_write_aop_error;
> +
> if (likely(copied > 0)) {
> if (!status)
> status = copied;
> @@ -1969,6 +1955,25 @@ generic_file_buffered_write(struct kiocb
> break;
> balance_dirty_pages_ratelimited(mapping);
> cond_resched();
> + continue;
> +
> +fs_write_aop_error:
> + if (status != AOP_TRUNCATED_PAGE)
> + unlock_page(page);
> + page_cache_release(page);
> +
> + /*
> + * prepare_write() may have instantiated a few blocks
> + * outside i_size. Trim these off again. Don't need
> + * i_size_read because we hold i_mutex.
> + */
> + if (pos + bytes > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> + if (status == AOP_TRUNCATED_PAGE)
> + continue;
> + else
> + break;
> +
> } while (count);
> *ppos = pos;
>

2007-01-16 17:36:47

by Peter Zijlstra

Subject: Re: [patch 6/10] mm: be sure to trim blocks

On Sat, 2007-01-13 at 04:25 +0100, Nick Piggin wrote:
> If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
> we may have failed the write operation despite prepare_write having
> instantiated blocks past i_size. Fix this, and consolidate the trimming into
> one place.
>
> Signed-off-by: Nick Piggin <[email protected]>
>
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -1911,22 +1911,9 @@ generic_file_buffered_write(struct kiocb
> }
>
> status = a_ops->prepare_write(file, page, offset, offset+bytes);
> - if (unlikely(status)) {
> - loff_t isize = i_size_read(inode);
> + if (unlikely(status))
> + goto fs_write_aop_error;
>
> - if (status != AOP_TRUNCATED_PAGE)
> - unlock_page(page);
> - page_cache_release(page);
> - if (status == AOP_TRUNCATED_PAGE)
> - continue;
> - /*
> - * prepare_write() may have instantiated a few blocks
> - * outside i_size. Trim these off again.
> - */
> - if (pos + bytes > isize)
> - vmtruncate(inode, isize);
> - break;
> - }
> if (likely(nr_segs == 1))
> copied = filemap_copy_from_user(page, offset,
> buf, bytes);
> @@ -1935,10 +1922,9 @@ generic_file_buffered_write(struct kiocb
> cur_iov, iov_offset, bytes);
> flush_dcache_page(page);
> status = a_ops->commit_write(file, page, offset, offset+bytes);
> - if (status == AOP_TRUNCATED_PAGE) {
> - page_cache_release(page);
> - continue;
> - }
> + if (unlikely(status))
> + goto fs_write_aop_error;
> +

I don't think this is correct, see how status >= 0 is used a few lines
downwards. Perhaps something along the lines of an
is_positive_aop_return() to test on?

> if (likely(copied > 0)) {
> if (!status)
> status = copied;
> @@ -1969,6 +1955,25 @@ generic_file_buffered_write(struct kiocb
> break;
> balance_dirty_pages_ratelimited(mapping);
> cond_resched();
> + continue;
> +
> +fs_write_aop_error:
> + if (status != AOP_TRUNCATED_PAGE)
> + unlock_page(page);
> + page_cache_release(page);
> +
> + /*
> + * prepare_write() may have instantiated a few blocks
> + * outside i_size. Trim these off again. Don't need
> + * i_size_read because we hold i_mutex.
> + */
> + if (pos + bytes > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> + if (status == AOP_TRUNCATED_PAGE)
> + continue;
> + else
> + break;
> +
> } while (count);
> *ppos = pos;


2007-01-16 19:14:30

by Peter Zijlstra

Subject: Re: [patch 6/10] mm: be sure to trim blocks

On Tue, 2007-01-16 at 18:36 +0100, Peter Zijlstra wrote:
> buf, bytes);
> > @@ -1935,10 +1922,9 @@ generic_file_buffered_write(struct kiocb
> > cur_iov, iov_offset, bytes);
> > flush_dcache_page(page);
> > status = a_ops->commit_write(file, page, offset, offset+bytes);
> > - if (status == AOP_TRUNCATED_PAGE) {
> > - page_cache_release(page);
> > - continue;
> > - }
> > + if (unlikely(status))
> > + goto fs_write_aop_error;
> > +
>
> I don't think this is correct, see how status >= 0 is used a few lines
> downwards. Perhaps something along the lines of an
> is_positive_aop_return() to test on?

Hmm, if commit_write() will never return non-error positive values, then
this and 8/10 look sane.

2007-01-20 03:50:42

by Nick Piggin

Subject: Re: [patch 6/10] mm: be sure to trim blocks

On Sun, Jan 14, 2007 at 05:25:44PM +0300, Dmitriy Monakhov wrote:
> Nick Piggin <[email protected]> writes:
>
> > If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
> > we may have failed the write operation despite prepare_write having
> > instantiated blocks past i_size. Fix this, and consolidate the trimming into
> > one place.
> >
> > Signed-off-by: Nick Piggin <[email protected]>
> >
> > Index: linux-2.6/mm/filemap.c
> > ===================================================================
> > --- linux-2.6.orig/mm/filemap.c
> > +++ linux-2.6/mm/filemap.c
> > @@ -1911,22 +1911,9 @@ generic_file_buffered_write(struct kiocb
> > }
> >
> > status = a_ops->prepare_write(file, page, offset, offset+bytes);
> > - if (unlikely(status)) {
> > - loff_t isize = i_size_read(inode);
> > + if (unlikely(status))
> > + goto fs_write_aop_error;
> Maybe it's a stupid question, but still...
> Why do we treat a non-zero prepare_write() return code as an error? It may be
> positive. A positive return code could be used as a fine-grained 'bytes'
> limiter in the blksize < pgsize case, as follows:
>
> status = a_ops->prepare_write(file, page, offset, offset+bytes);
> if (unlikely(status)) {
> if (status > 0) {
> bytes = min(bytes, status);
> status = 0;
> } else {
> goto fs_write_aop_error;
> }
> }
> ---
> This is useful because an fs may want to reduce 'bytes' for a number of
> reasons, for example to make it block-size bound.
> Example: a filesystem has a 1k block size and only two free blocks, and we try
> to write 4k bytes.
> Currently write(fd, buff, 4096) will return -ENOSPC, but after this change
> write(fd, buff, 4096) will return as much as it can (2048).

It isn't a stupid question. Hmm, while it isn't documented in vfs.txt, it
seems like some filesystems actually do this. AFFS, maybe JFFS. So good
catch, thanks.

2007-01-20 03:52:07

by Nick Piggin

Subject: Re: [patch 6/10] mm: be sure to trim blocks

On Tue, Jan 16, 2007 at 08:14:16PM +0100, Peter Zijlstra wrote:
> On Tue, 2007-01-16 at 18:36 +0100, Peter Zijlstra wrote:
> > buf, bytes);
> > > @@ -1935,10 +1922,9 @@ generic_file_buffered_write(struct kiocb
> > > cur_iov, iov_offset, bytes);
> > > flush_dcache_page(page);
> > > status = a_ops->commit_write(file, page, offset, offset+bytes);
> > > - if (status == AOP_TRUNCATED_PAGE) {
> > > - page_cache_release(page);
> > > - continue;
> > > - }
> > > + if (unlikely(status))
> > > + goto fs_write_aop_error;
> > > +
> >
> > I don't think this is correct, see how status >= 0 is used a few lines
> > downwards. Perhaps something along the lines of an
> > is_positive_aop_return() to test on?
>
> Hmm, if commit_write() will never return non error positive values then
> this and 8/10 look sane.

It's really ugly, but it looks like at least some filesystems do. So
I'll fix this up.