2007-01-29 10:31:42

by Nick Piggin

Subject: [patch 0/9] buffered write deadlock fix

The following set of patches attempt to fix the buffered write
locking problems (and there are a couple of peripheral patches
and cleanups there too).

Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
be an easier diff with the fsaio patches gone, but the readahead
rewrite clashes badly :(

Please apply?

Thanks,
Nick

--
SuSE Labs


2007-01-29 10:31:56

by Nick Piggin

Subject: [patch 1/9] fs: libfs buffered write leak fix

simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
Fix the former, make a note of the latter. Several other filesystems seem
to be iffy here, too.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
int simple_prepare_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- if (!PageUptodate(page)) {
- if (to - from != PAGE_CACHE_SIZE) {
- void *kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr, 0, from);
- memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
- }
+ if (PageUptodate(page))
+ return 0;
+
+ if (to - from != PAGE_CACHE_SIZE) {
+ clear_highpage(page);
+ flush_dcache_page(page);
SetPageUptodate(page);
}
+
return 0;
}

int simple_commit_write(struct file *file, struct page *page,
- unsigned offset, unsigned to)
+ unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
-
- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold the i_mutex.
- */
- if (pos > inode->i_size)
- i_size_write(inode, pos);
- set_page_dirty(page);
+ if (to > from) {
+ struct inode *inode = page->mapping->host;
+ loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+ if (to - from == PAGE_CACHE_SIZE)
+ SetPageUptodate(page);
+ /*
+ * No need to use i_size_read() here, the i_size
+ * cannot change under us because we hold the i_mutex.
+ */
+ if (pos > inode->i_size)
+ i_size_write(inode, pos);
+ set_page_dirty(page);
+ }
return 0;
}

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page

if (is_mapped_to_disk)
SetPageMappedToDisk(page);
+
+ /* XXX: information leak vs read(2) */
SetPageUptodate(page);

/*

2007-01-29 10:32:26

by Nick Piggin

Subject: [patch 2/9] mm: revert "generic_file_buffered_write(): handle zero length iovec segments"

From: Andrew Morton <[email protected]>

Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6.

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.

Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1911,12 +1911,6 @@ generic_file_buffered_write(struct kiocb
break;
}

- if (unlikely(bytes == 0)) {
- status = 0;
- copied = 0;
- goto zero_length_segment;
- }
-
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
loff_t isize = i_size_read(inode);
@@ -1946,8 +1940,7 @@ generic_file_buffered_write(struct kiocb
page_cache_release(page);
continue;
}
-zero_length_segment:
- if (likely(copied >= 0)) {
+ if (likely(copied > 0)) {
if (!status)
status = copied;

Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
const struct iovec *iov = *iovp;
size_t base = *basep;

- do {
+ while (bytes) {
int copy = min(bytes, iov->iov_len - base);

bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
iov++;
base = 0;
}
- } while (bytes);
+ }
*iovp = iov;
*basep = base;
}

2007-01-29 10:32:29

by Nick Piggin

Subject: [patch 4/9] mm: generic_file_buffered_write cleanup

From: Andrew Morton <[email protected]>

Clean up buffered write code. Rename some variables and fix some types.

Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1854,16 +1854,15 @@ generic_file_buffered_write(struct kiocb
size_t count, ssize_t written)
{
struct file *file = iocb->ki_filp;
- struct address_space * mapping = file->f_mapping;
+ struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
struct page *page;
struct page *cached_page = NULL;
- size_t bytes;
struct pagevec lru_pvec;
const struct iovec *cur_iov = iov; /* current iovec */
- size_t iov_base = 0; /* offset in the current iovec */
+ size_t iov_offset = 0; /* offset in the current iovec */
char __user *buf;

pagevec_init(&lru_pvec, 0);
@@ -1874,31 +1873,33 @@ generic_file_buffered_write(struct kiocb
if (likely(nr_segs == 1))
buf = iov->iov_base + written;
else {
- filemap_set_next_iovec(&cur_iov, &iov_base, written);
- buf = cur_iov->iov_base + iov_base;
+ filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+ buf = cur_iov->iov_base + iov_offset;
}

do {
- unsigned long index;
- unsigned long offset;
- unsigned long maxlen;
- size_t copied;
+ pgoff_t index; /* Pagecache index for current page */
+ unsigned long offset; /* Offset into pagecache page */
+ unsigned long maxlen; /* Bytes remaining in current iovec */
+ size_t bytes; /* Bytes to write to page */
+ size_t copied; /* Bytes copied from user */

- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+ offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes > count)
bytes = count;

+ maxlen = cur_iov->iov_len - iov_offset;
+ if (maxlen > bytes)
+ maxlen = bytes;
+
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
*/
- maxlen = cur_iov->iov_len - iov_base;
- if (maxlen > bytes)
- maxlen = bytes;
fault_in_pages_readable(buf, maxlen);

page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1929,7 +1930,7 @@ generic_file_buffered_write(struct kiocb
buf, bytes);
else
copied = filemap_copy_from_user_iovec(page, offset,
- cur_iov, iov_base, bytes);
+ cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (status == AOP_TRUNCATED_PAGE) {
@@ -1947,12 +1948,12 @@ generic_file_buffered_write(struct kiocb
buf += status;
if (unlikely(nr_segs > 1)) {
filemap_set_next_iovec(&cur_iov,
- &iov_base, status);
+ &iov_offset, status);
if (count)
buf = cur_iov->iov_base +
- iov_base;
+ iov_offset;
} else {
- iov_base += status;
+ iov_offset += status;
}
}
}

2007-01-29 10:33:07

by Nick Piggin

Subject: [patch 3/9] mm: revert "generic_file_buffered_write(): deadlock on vectored write"

From: Andrew Morton <[email protected]>

Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

This patch fixed the following bug:

When prefaulting in the pages in generic_file_buffered_write(), we only
faulted in the pages for the first segment of the iovec. If a second or
successive segment described an mmapping of the page into which we're
write()ing, and that page is not up-to-date, the fault handler tries to lock
the already-locked page (to bring it up to date) and deadlocks.

An exploit for this bug is in writev-deadlock-demo.c, in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

(These demos assume blocksize < PAGE_CACHE_SIZE).
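
For reference, the shape of such a demo is roughly as follows (an
illustrative sketch of the idea, not the actual writev-deadlock-demo.c;
the filename and sizes here are made up):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        struct iovec iov[2];
        char head[16] = "x";
        char *map;
        int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

        ftruncate(fd, 4096);
        map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

        iov[0].iov_base = head; /* only the first segment is prefaulted */
        iov[0].iov_len = sizeof(head);
        iov[1].iov_base = map;  /* mmap of the very page being written */
        iov[1].iov_len = 100;

        /*
         * write locks the (not uptodate) pagecache page, then the copy
         * from iov[1] faults; filemap_nopage tries to lock the same
         * page to bring it uptodate -> deadlock.
         */
        writev(fd, iov, 2);
        return 0;
}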

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment. So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem with the fix is that it didn't fix all the
locking problems anyway.


<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Signed-off-by: Andrew Morton <[email protected]>

Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1881,21 +1881,14 @@ generic_file_buffered_write(struct kiocb
do {
unsigned long index;
unsigned long offset;
+ unsigned long maxlen;
size_t copied;

offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
-
- /* Limit the size of the copy to the caller's write size */
- bytes = min(bytes, count);
-
- /*
- * Limit the size of the copy to that of the current segment,
- * because fault_in_pages_readable() doesn't know how to walk
- * segments.
- */
- bytes = min(bytes, cur_iov->iov_len - iov_base);
+ if (bytes > count)
+ bytes = count;

/*
* Bring in the user page that we will copy from _first_.
@@ -1903,7 +1896,10 @@ generic_file_buffered_write(struct kiocb
* same page as we're writing to, without it being marked
* up-to-date.
*/
- fault_in_pages_readable(buf, bytes);
+ maxlen = cur_iov->iov_len - iov_base;
+ if (maxlen > bytes)
+ maxlen = bytes;
+ fault_in_pages_readable(buf, maxlen);

page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {

2007-01-29 10:33:44

by Nick Piggin

Subject: [patch 5/9] mm: debug write deadlocks

Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit.

This is useful for demonstration and testing purposes, but is removed in a
subsequent patch.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1894,6 +1894,7 @@ generic_file_buffered_write(struct kiocb
if (maxlen > bytes)
maxlen = bytes;

+#ifndef CONFIG_DEBUG_VM
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
@@ -1901,6 +1902,7 @@ generic_file_buffered_write(struct kiocb
* up-to-date.
*/
fault_in_pages_readable(buf, maxlen);
+#endif

page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {

2007-01-29 10:33:45

by Nick Piggin

Subject: [patch 6/9] mm: be sure to trim blocks

If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
we may have failed the write operation despite prepare_write having
instantiated blocks past i_size. Fix this, and consolidate the trimming into
one place.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1911,22 +1911,9 @@ generic_file_buffered_write(struct kiocb
}

status = a_ops->prepare_write(file, page, offset, offset+bytes);
- if (unlikely(status)) {
- loff_t isize = i_size_read(inode);
+ if (unlikely(status))
+ goto fs_write_aop_error;

- if (status != AOP_TRUNCATED_PAGE)
- unlock_page(page);
- page_cache_release(page);
- if (status == AOP_TRUNCATED_PAGE)
- continue;
- /*
- * prepare_write() may have instantiated a few blocks
- * outside i_size. Trim these off again.
- */
- if (pos + bytes > isize)
- vmtruncate(inode, isize);
- break;
- }
if (likely(nr_segs == 1))
copied = filemap_copy_from_user(page, offset,
buf, bytes);
@@ -1935,40 +1922,53 @@ generic_file_buffered_write(struct kiocb
cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
- if (status == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- continue;
+ if (unlikely(status < 0))
+ goto fs_write_aop_error;
+ if (unlikely(copied != bytes)) {
+ status = -EFAULT;
+ goto fs_write_aop_error;
}
- if (likely(copied > 0)) {
- if (!status)
- status = copied;
+ if (unlikely(status > 0)) /* filesystem did partial write */
+ copied = status;

- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
- if (unlikely(nr_segs > 1)) {
- filemap_set_next_iovec(&cur_iov,
- &iov_offset, status);
- if (count)
- buf = cur_iov->iov_base +
- iov_offset;
- } else {
- iov_offset += status;
- }
+ if (likely(copied > 0)) {
+ written += copied;
+ count -= copied;
+ pos += copied;
+ buf += copied;
+ if (unlikely(nr_segs > 1)) {
+ filemap_set_next_iovec(&cur_iov,
+ &iov_offset, copied);
+ if (count)
+ buf = cur_iov->iov_base + iov_offset;
+ } else {
+ iov_offset += copied;
}
}
- if (unlikely(copied != bytes))
- if (status >= 0)
- status = -EFAULT;
unlock_page(page);
mark_page_accessed(page);
page_cache_release(page);
- if (status < 0)
- break;
balance_dirty_pages_ratelimited(mapping);
cond_resched();
+ continue;
+
+fs_write_aop_error:
+ if (status != AOP_TRUNCATED_PAGE)
+ unlock_page(page);
+ page_cache_release(page);
+
+ /*
+ * prepare_write() may have instantiated a few blocks
+ * outside i_size. Trim these off again. Don't need
+ * i_size_read because we hold i_mutex.
+ */
+ if (pos + bytes > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ if (status == AOP_TRUNCATED_PAGE)
+ continue;
+ else
+ break;
+
} while (count);
*ppos = pos;

2007-01-29 10:33:55

by Nick Piggin

Subject: [patch 9/9] mm: fix pagecache write deadlocks

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1. generic_buffered_write
2. lock_page
3. prepare_write
4. unlock_page+vmtruncate
5. copy_from_user
6. mmap_sem(r)
7. handle_mm_fault
8. lock_page (filemap_nopage)
9. commit_write
10. unlock_page

a. sys_munmap / sys_mlock / others
b. mmap_sem(w)
c. make_pages_present
d. get_user_pages
e. handle_mm_fault
f. lock_page (filemap_nopage)

2,8 - recursive deadlock if page is same
2,8;2,8 - ABBA deadlock if page is different
2,6;b,f - ABBA deadlock if page is same

The solution is as follows:
1. If we find the destination page is uptodate, continue as normal, but use
atomic usercopies which do not take pagefaults and do not zero the uncopied
tail of the destination. The destination is already uptodate, so we can
commit_write the full length even if there was a partial copy: it does not
matter that the tail was not modified, because if it is dirtied and written
back to disk it will not cause any problems (uptodate *means* that the
destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
information, because atomic usercopies cannot distinguish between
non-present pages in a readable mapping and the lack of a readable mapping.

2. If we find the destination page is non uptodate, unlock it (this could be
made slightly more optimal), then find and pin the source page with
get_user_pages. Relock the destination page and continue with the copy.
However, instead of a usercopy (which might take a fault), copy the data
via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)
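
In outline, the copy step becomes something like this (a condensed sketch:
mmap_sem handling, error paths and the truncate check are omitted, and
copy_from_src_page is a made-up name for the kmap+memcpy step -- the real
code is in the patch below):

        page = __grab_cache_page(mapping, index);
        if (!PageUptodate(page)) {
                /* can't fault or run get_user_pages with the page
                 * locked, so unlock, pin the source, relock */
                unlock_page(page);
                get_user_pages(current, current->mm,
                                (unsigned long)buf & PAGE_CACHE_MASK,
                                1, 0, 0, &src_page, NULL);
                lock_page(page);
        }
        a_ops->prepare_write(file, page, offset, offset+bytes);
        if (!src_page) {
                /* uptodate destination: short atomic copies are OK */
                pagefault_disable();
                copied = filemap_copy_from_user_atomic(page, offset,
                                cur_iov, nr_segs, iov_offset, bytes);
                pagefault_enable();
        } else {
                /* copy from the pinned source page: cannot fault */
                copied = copy_from_src_page(page, src_page, offset, bytes);
        }
        a_ops->commit_write(file, page, offset, offset+bytes);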

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1843,11 +1843,12 @@ generic_file_buffered_write(struct kiocb
filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);

do {
+ struct page *src_page;
struct page *page;
pgoff_t index; /* Pagecache index for current page */
unsigned long offset; /* Offset into pagecache page */
- unsigned long maxlen; /* Bytes remaining in current iovec */
- size_t bytes; /* Bytes to write to page */
+ unsigned long seglen; /* Bytes remaining in current iovec */
+ unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */

buf = cur_iov->iov_base + iov_offset;
@@ -1857,20 +1858,30 @@ generic_file_buffered_write(struct kiocb
if (bytes > count)
bytes = count;

- maxlen = cur_iov->iov_len - iov_offset;
- if (maxlen > bytes)
- maxlen = bytes;
+ /*
+ * a non-NULL src_page indicates that we're doing the
+ * copy via get_user_pages and kmap.
+ */
+ src_page = NULL;
+
+ seglen = cur_iov->iov_len - iov_offset;
+ if (seglen > bytes)
+ seglen = bytes;

-#ifndef CONFIG_DEBUG_VM
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
+ *
+ * Not only is this an optimisation, but it is also required
+ * to check that the address is actually valid, when atomic
+ * usercopies are used, below.
*/
- fault_in_pages_readable(buf, maxlen);
-#endif
-
+ if (unlikely(fault_in_pages_readable(buf, seglen))) {
+ status = -EFAULT;
+ break;
+ }

page = __grab_cache_page(mapping, index);
if (!page) {
@@ -1878,32 +1889,94 @@ generic_file_buffered_write(struct kiocb
break;
}

+ /*
+ * non-uptodate pages cannot cope with short copies, and we
+ * cannot take a pagefault with the destination page locked.
+ * So pin the source page to copy it.
+ */
+ if (!PageUptodate(page)) {
+ unlock_page(page);
+
+ bytes = min(bytes, PAGE_CACHE_SIZE -
+ ((unsigned long)buf & ~PAGE_CACHE_MASK));
+
+ /*
+ * Cannot get_user_pages with a page locked for the
+ * same reason as we can't take a page fault with a
+ * page locked (as explained below).
+ */
+ down_read(&current->mm->mmap_sem);
+ status = get_user_pages(current, current->mm,
+ (unsigned long)buf & PAGE_CACHE_MASK, 1,
+ 0, 0, &src_page, NULL);
+ up_read(&current->mm->mmap_sem);
+ if (status != 1) {
+ page_cache_release(page);
+ break;
+ }
+
+ lock_page(page);
+ if (!page->mapping) {
+ unlock_page(page);
+ page_cache_release(page);
+ page_cache_release(src_page);
+ continue;
+ }
+
+ }
+
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status))
goto fs_write_aop_error;

- copied = filemap_copy_from_user(page, offset,
+ if (!src_page) {
+ /*
+ * Must not enter the pagefault handler here, because
+ * we hold the page lock, so we might recursively
+ * deadlock on the same lock, or get an ABBA deadlock
+ * against a different lock, or against the mmap_sem
+ * (which nests outside the page lock). So increment
+ * preempt count, and use _atomic usercopies.
+ *
+ * The page is uptodate so we are OK to encounter a
+ * short copy: if unmodified parts of the page are
+ * marked dirty and written out to disk, it doesn't
+ * really matter.
+ */
+ pagefault_disable();
+ copied = filemap_copy_from_user_atomic(page, offset,
cur_iov, nr_segs, iov_offset, bytes);
+ pagefault_enable();
+ } else {
+ char *src, *dst;
+ src = kmap(src_page);
+ dst = kmap(page);
+ memcpy(dst + offset,
+ src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
+ bytes);
+ kunmap(page);
+ kunmap(src_page);
+ copied = bytes;
+ }
flush_dcache_page(page);

status = a_ops->commit_write(file, page, offset, offset+bytes);
if (unlikely(status < 0))
goto fs_write_aop_error;
- if (unlikely(copied != bytes)) {
- status = -EFAULT;
- goto fs_write_aop_error;
- }
if (unlikely(status > 0)) /* filesystem did partial write */
- copied = status;
+ copied = min_t(size_t, copied, status);
+
+ unlock_page(page);
+ mark_page_accessed(page);
+ page_cache_release(page);
+ if (src_page)
+ page_cache_release(src_page);

written += copied;
count -= copied;
pos += copied;
filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);

- unlock_page(page);
- mark_page_accessed(page);
- page_cache_release(page);
balance_dirty_pages_ratelimited(mapping);
cond_resched();
continue;
@@ -1912,6 +1985,8 @@ fs_write_aop_error:
if (status != AOP_TRUNCATED_PAGE)
unlock_page(page);
page_cache_release(page);
+ if (src_page)
+ page_cache_release(src_page);

/*
* prepare_write() may have instantiated a few blocks
@@ -1924,7 +1999,6 @@ fs_write_aop_error:
continue;
else
break;
-
} while (count);
*ppos = pos;

Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
{
int ret;

+ if (unlikely(size == 0))
+ return 0;
+
/*
* Writing zeroes into userspace here is OK, because we know that if
* the zero gets there, we'll be overwriting it.
@@ -217,19 +220,23 @@ static inline int fault_in_pages_writeab
return ret;
}

-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
{
volatile char c;
int ret;

+ if (unlikely(size == 0))
+ return 0;
+
ret = __get_user(c, uaddr);
if (ret == 0) {
const char __user *end = uaddr + size - 1;

if (((unsigned long)uaddr & PAGE_MASK) !=
((unsigned long)end & PAGE_MASK))
- __get_user(c, end);
+ ret = __get_user(c, end);
}
+ return ret;
}

#endif /* _LINUX_PAGEMAP_H */

2007-01-29 10:34:46

by Nick Piggin

Subject: [patch 7/9] mm: cleanup pagecache insertion operations

Quite a bit of code is used in maintaining these "cached pages" that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page, and then a multi-page write into an
uncached part of the file to make use of it.

Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).
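
The recurring transformation, in a nutshell (a sketch with error handling
elided):

        /* before: open-coded insertion plus a private, on-stack pagevec */
        if (!add_to_page_cache(page, mapping, index, GFP_KERNEL)) {
                if (!pagevec_add(&lru_pvec, page))
                        __pagevec_lru_add(&lru_pvec);
        }
        /* and at the end of the call: */
        pagevec_lru_add(&lru_pvec);

        /* after: one call, batched via the persistent per-CPU pagevec */
        add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);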

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -686,26 +686,22 @@ EXPORT_SYMBOL(find_lock_page);
struct page *find_or_create_page(struct address_space *mapping,
unsigned long index, gfp_t gfp_mask)
{
- struct page *page, *cached_page = NULL;
+ struct page *page;
int err;
repeat:
page = find_lock_page(mapping, index);
if (!page) {
- if (!cached_page) {
- cached_page = alloc_page(gfp_mask);
- if (!cached_page)
- return NULL;
- }
- err = add_to_page_cache_lru(cached_page, mapping,
- index, gfp_mask);
- if (!err) {
- page = cached_page;
- cached_page = NULL;
- } else if (err == -EEXIST)
- goto repeat;
+ page = alloc_page(gfp_mask);
+ if (!page)
+ return NULL;
+ err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+ if (unlikely(err)) {
+ page_cache_release(page);
+ page = NULL;
+ if (err == -EEXIST)
+ goto repeat;
+ }
}
- if (cached_page)
- page_cache_release(cached_page);
return page;
}
EXPORT_SYMBOL(find_or_create_page);
@@ -891,11 +887,9 @@ void do_generic_mapping_read(struct addr
unsigned long next_index;
unsigned long prev_index;
loff_t isize;
- struct page *cached_page;
int error;
struct file_ra_state ra = *_ra;

- cached_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_page;
@@ -1059,23 +1053,20 @@ no_cached_page:
* Ok, it wasn't cached, so we need to create a new
* page..
*/
- if (!cached_page) {
- cached_page = page_cache_alloc_cold(mapping);
- if (!cached_page) {
- desc->error = -ENOMEM;
- goto out;
- }
+ page = page_cache_alloc_cold(mapping);
+ if (!page) {
+ desc->error = -ENOMEM;
+ goto out;
}
- error = add_to_page_cache_lru(cached_page, mapping,
+ error = add_to_page_cache_lru(page, mapping,
index, GFP_KERNEL);
if (error) {
+ page_cache_release(page);
if (error == -EEXIST)
goto find_page;
desc->error = error;
goto out;
}
- page = cached_page;
- cached_page = NULL;
goto readpage;
}

@@ -1083,8 +1074,6 @@ out:
*_ra = ra;

*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
- if (cached_page)
- page_cache_release(cached_page);
if (filp)
file_accessed(filp);
}
@@ -1542,35 +1531,28 @@ static inline struct page *__read_cache_
int (*filler)(void *,struct page*),
void *data)
{
- struct page *page, *cached_page = NULL;
+ struct page *page;
int err;
repeat:
page = find_get_page(mapping, index);
if (!page) {
- if (!cached_page) {
- cached_page = page_cache_alloc_cold(mapping);
- if (!cached_page)
- return ERR_PTR(-ENOMEM);
- }
- err = add_to_page_cache_lru(cached_page, mapping,
- index, GFP_KERNEL);
- if (err == -EEXIST)
- goto repeat;
- if (err < 0) {
+ page = page_cache_alloc_cold(mapping);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+ err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+ if (unlikely(err)) {
+ page_cache_release(page);
+ if (err == -EEXIST)
+ goto repeat;
/* Presumably ENOMEM for radix tree node */
- page_cache_release(cached_page);
return ERR_PTR(err);
}
- page = cached_page;
- cached_page = NULL;
err = filler(data, page);
if (err < 0) {
page_cache_release(page);
page = ERR_PTR(err);
}
}
- if (cached_page)
- page_cache_release(cached_page);
return page;
}

@@ -1621,40 +1603,6 @@ retry:
EXPORT_SYMBOL(read_cache_page);

/*
- * If the page was newly created, increment its refcount and add it to the
- * caller's lru-buffering pagevec. This function is specifically for
- * generic_file_write().
- */
-static inline struct page *
-__grab_cache_page(struct address_space *mapping, unsigned long index,
- struct page **cached_page, struct pagevec *lru_pvec)
-{
- int err;
- struct page *page;
-repeat:
- page = find_lock_page(mapping, index);
- if (!page) {
- if (!*cached_page) {
- *cached_page = page_cache_alloc(mapping);
- if (!*cached_page)
- return NULL;
- }
- err = add_to_page_cache(*cached_page, mapping,
- index, GFP_KERNEL);
- if (err == -EEXIST)
- goto repeat;
- if (err == 0) {
- page = *cached_page;
- page_cache_get(page);
- if (!pagevec_add(lru_pvec, page))
- __pagevec_lru_add(lru_pvec);
- *cached_page = NULL;
- }
- }
- return page;
-}
-
-/*
* The logic we want is
*
* if suid or (sgid and xgrp)
@@ -1848,6 +1796,33 @@ generic_file_direct_write(struct kiocb *
}
EXPORT_SYMBOL(generic_file_direct_write);

+/*
+ * Find or create a page at the given pagecache position. Return the locked
+ * page. This function is specifically for buffered writes.
+ */
+static struct page *__grab_cache_page(struct address_space *mapping,
+ pgoff_t index)
+{
+ int status;
+ struct page *page;
+repeat:
+ page = find_lock_page(mapping, index);
+ if (likely(page))
+ return page;
+
+ page = page_cache_alloc(mapping);
+ if (!page)
+ return NULL;
+ status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+ if (unlikely(status)) {
+ page_cache_release(page);
+ if (status == -EEXIST)
+ goto repeat;
+ return NULL;
+ }
+ return page;
+}
+
ssize_t
generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -1858,15 +1833,10 @@ generic_file_buffered_write(struct kiocb
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
- struct page *page;
- struct page *cached_page = NULL;
- struct pagevec lru_pvec;
const struct iovec *cur_iov = iov; /* current iovec */
size_t iov_offset = 0; /* offset in the current iovec */
char __user *buf;

- pagevec_init(&lru_pvec, 0);
-
/*
* handle partial DIO write. Adjust cur_iov if needed.
*/
@@ -1878,6 +1848,7 @@ generic_file_buffered_write(struct kiocb
}

do {
+ struct page *page;
pgoff_t index; /* Pagecache index for current page */
unsigned long offset; /* Offset into pagecache page */
unsigned long maxlen; /* Bytes remaining in current iovec */
@@ -1904,7 +1875,8 @@ generic_file_buffered_write(struct kiocb
fault_in_pages_readable(buf, maxlen);
#endif

- page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+
+ page = __grab_cache_page(mapping, index);
if (!page) {
status = -ENOMEM;
break;
@@ -1972,9 +1944,6 @@ fs_write_aop_error:
} while (count);
*ppos = pos;

- if (cached_page)
- page_cache_release(cached_page);
-
/*
* For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
*/
@@ -1994,7 +1963,6 @@ fs_write_aop_error:
if (unlikely(file->f_flags & O_DIRECT) && written)
status = filemap_write_and_wait(mapping);

- pagevec_lru_add(&lru_pvec);
return written ? written : status;
}
EXPORT_SYMBOL(generic_file_buffered_write);
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -389,31 +389,25 @@ mpage_readpages(struct address_space *ma
struct bio *bio = NULL;
unsigned page_idx;
sector_t last_block_in_bio = 0;
- struct pagevec lru_pvec;
struct buffer_head map_bh;
unsigned long first_logical_block = 0;

clear_buffer_mapped(&map_bh);
- pagevec_init(&lru_pvec, 0);
for (page_idx = 0; page_idx < nr_pages; page_idx++) {
struct page *page = list_entry(pages->prev, struct page, lru);

prefetchw(&page->flags);
list_del(&page->lru);
- if (!add_to_page_cache(page, mapping,
+ if (!add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
bio = do_mpage_readpage(bio, page,
nr_pages - page_idx,
&last_block_in_bio, &map_bh,
&first_logical_block,
get_block);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
- } else {
- page_cache_release(page);
}
+ page_cache_release(page);
}
- pagevec_lru_add(&lru_pvec);
BUG_ON(!list_empty(pages));
if (bio)
mpage_bio_submit(READ, bio);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -133,28 +133,25 @@ int read_cache_pages(struct address_spac
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
- struct pagevec lru_pvec;
int ret = 0;

- pagevec_init(&lru_pvec, 0);
-
while (!list_empty(pages)) {
page = list_to_page(pages);
list_del(&page->lru);
- if (add_to_page_cache(page, mapping, page->index, GFP_KERNEL)) {
+ if (add_to_page_cache_lru(page, mapping,
+ page->index, GFP_KERNEL)) {
page_cache_release(page);
continue;
}
+ page_cache_release(page);
+
ret = filler(data, page);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
- if (ret) {
+ if (unlikely(ret)) {
put_pages_list(pages);
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
}
- pagevec_lru_add(&lru_pvec);
return ret;
}

@@ -164,7 +161,6 @@ static int read_pages(struct address_spa
struct list_head *pages, unsigned nr_pages)
{
unsigned page_idx;
- struct pagevec lru_pvec;
int ret;

if (mapping->a_ops->readpages) {
@@ -174,19 +170,15 @@ static int read_pages(struct address_spa
goto out;
}

- pagevec_init(&lru_pvec, 0);
for (page_idx = 0; page_idx < nr_pages; page_idx++) {
struct page *page = list_to_page(pages);
list_del(&page->lru);
- if (!add_to_page_cache(page, mapping,
+ if (!add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
mapping->a_ops->readpage(filp, page);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
- } else
- page_cache_release(page);
+ }
+ page_cache_release(page);
}
- pagevec_lru_add(&lru_pvec);
ret = 0;
out:
return ret;

2007-01-29 10:34:53

by Nick Piggin

Subject: [patch 8/9] mm: generic_file_buffered_write iovec cleanup

Hide some of the open-coded nr_segs tests into the iovec helpers. This is
all to simplify generic_file_buffered_write, because that gets more complex
in the next patch.
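
The effect on the callers, in a nutshell (sketch):

        /* before */
        if (likely(nr_segs == 1))
                copied = filemap_copy_from_user(page, offset,
                                buf, bytes);
        else
                copied = filemap_copy_from_user_iovec(page, offset,
                                cur_iov, iov_offset, bytes);

        /* after: the helper takes nr_segs and does the test itself */
        copied = filemap_copy_from_user(page, offset,
                        cur_iov, nr_segs, iov_offset, bytes);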

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -22,82 +22,82 @@ __filemap_copy_from_user_iovec_inatomic(

/*
* Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied. If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
- *
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy. If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero). A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * were sucessfully copied. If a fault is encountered then return the number of
+ * bytes which were copied.
*/
static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
- const char __user *buf, unsigned bytes)
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
+ const struct iovec *iov, unsigned long nr_segs,
+ size_t base, size_t bytes)
{
char *kaddr;
- int left;
+ size_t copied;

kaddr = kmap_atomic(page, KM_USER0);
- left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
+ if (likely(nr_segs == 1)) {
+ int left;
+ char __user *buf = iov->iov_base + base;
+ left = __copy_from_user_inatomic_nocache(kaddr + offset,
+ buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+ iov, base, bytes);
+ }
kunmap_atomic(kaddr, KM_USER0);

- if (left != 0) {
- /* Do it the slow way */
- kaddr = kmap(page);
- left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
- kunmap(page);
- }
- return bytes - left;
+ return copied;
}

/*
- * This has the same sideeffects and return value as filemap_copy_from_user().
- * The difference is that on a fault we need to memset the remainder of the
- * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
- * single-segment behaviour.
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
*/
static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
- const struct iovec *iov, size_t base, size_t bytes)
+filemap_copy_from_user(struct page *page, unsigned long offset,
+ const struct iovec *iov, unsigned long nr_segs,
+ size_t base, size_t bytes)
{
char *kaddr;
size_t copied;

- kaddr = kmap_atomic(page, KM_USER0);
- copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
- base, bytes);
- kunmap_atomic(kaddr, KM_USER0);
- if (copied != bytes) {
- kaddr = kmap(page);
- copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
- base, bytes);
- if (bytes - copied)
- memset(kaddr + offset + copied, 0, bytes - copied);
- kunmap(page);
+ kaddr = kmap(page);
+ if (likely(nr_segs == 1)) {
+ int left;
+ char __user *buf = iov->iov_base + base;
+ left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+ iov, base, bytes);
}
+ kunmap(page);
return copied;
}

static inline void
-filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
+filemap_set_next_iovec(const struct iovec **iovp, unsigned long nr_segs,
+ size_t *basep, size_t bytes)
{
- const struct iovec *iov = *iovp;
- size_t base = *basep;
-
- while (bytes) {
- int copy = min(bytes, iov->iov_len - base);
-
- bytes -= copy;
- base += copy;
- if (iov->iov_len == base) {
- iov++;
- base = 0;
+ if (likely(nr_segs == 1)) {
+ *basep += bytes;
+ } else {
+ const struct iovec *iov = *iovp;
+ size_t base = *basep;
+
+ while (bytes) {
+ int copy = min(bytes, iov->iov_len - base);
+
+ bytes -= copy;
+ base += copy;
+ if (iov->iov_len == base) {
+ iov++;
+ base = 0;
+ }
}
+ *iovp = iov;
+ *basep = base;
}
- *iovp = iov;
- *basep = base;
}
#endif
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1840,12 +1840,7 @@ generic_file_buffered_write(struct kiocb
/*
* handle partial DIO write. Adjust cur_iov if needed.
*/
- if (likely(nr_segs == 1))
- buf = iov->iov_base + written;
- else {
- filemap_set_next_iovec(&cur_iov, &iov_offset, written);
- buf = cur_iov->iov_base + iov_offset;
- }
+ filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);

do {
struct page *page;
@@ -1855,6 +1850,7 @@ generic_file_buffered_write(struct kiocb
size_t bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */

+ buf = cur_iov->iov_base + iov_offset;
offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
@@ -1886,13 +1882,10 @@ generic_file_buffered_write(struct kiocb
if (unlikely(status))
goto fs_write_aop_error;

- if (likely(nr_segs == 1))
- copied = filemap_copy_from_user(page, offset,
- buf, bytes);
- else
- copied = filemap_copy_from_user_iovec(page, offset,
- cur_iov, iov_offset, bytes);
+ copied = filemap_copy_from_user(page, offset,
+ cur_iov, nr_segs, iov_offset, bytes);
flush_dcache_page(page);
+
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (unlikely(status < 0))
goto fs_write_aop_error;
@@ -1903,20 +1896,11 @@ generic_file_buffered_write(struct kiocb
if (unlikely(status > 0)) /* filesystem did partial write */
copied = status;

- if (likely(copied > 0)) {
- written += copied;
- count -= copied;
- pos += copied;
- buf += copied;
- if (unlikely(nr_segs > 1)) {
- filemap_set_next_iovec(&cur_iov,
- &iov_offset, copied);
- if (count)
- buf = cur_iov->iov_base + iov_offset;
- } else {
- iov_offset += copied;
- }
- }
+ written += copied;
+ count -= copied;
+ pos += copied;
+ filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
+
unlock_page(page);
mark_page_accessed(page);
page_cache_release(page);

2007-01-29 11:11:25

by Nick Piggin

Subject: Re: [patch 9/9] mm: fix pagecache write deadlocks

On Mon, Jan 29, 2007 at 11:33:03AM +0100, Nick Piggin wrote:
> + } else {
> + char *src, *dst;
> + src = kmap(src_page);
> + dst = kmap(page);
> + memcpy(dst + offset,
> + src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> + bytes);
> + kunmap(page);
> + kunmap(src_page);
> + copied = bytes;
> + }
> flush_dcache_page(page);

Hmm, I guess these should use kmap_atomic with KM_USER[01]?

The kmap is from an earlier iteration that wanted to sleep
with the page mapped into kernel.
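
i.e. something like this (untested sketch):

        char *src, *dst;

        src = kmap_atomic(src_page, KM_USER0);
        dst = kmap_atomic(page, KM_USER1);
        memcpy(dst + offset,
                src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
                bytes);
        kunmap_atomic(dst, KM_USER1);
        kunmap_atomic(src, KM_USER0);
        copied = bytes;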

2007-01-30 20:56:10

by Andrew Morton

Subject: Re: [patch 0/9] buffered write deadlock fix

On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
Nick Piggin <[email protected]> wrote:

> The following set of patches attempt to fix the buffered write
> locking problems

y'know, four or five years back I fixed this bug by doing

current->locked_page = page;

in the write() code, and then teaching the pagefault code to avoid locking
the same page. Patch below.

But then evil mean Hugh pointed out that the patch is still vulnerable to
ab/ba deadlocking so I dropped it.

But Hugh lied! There is no ab/ba deadlock because both a and b need
i_mutex to get into the write() path.

This approach doesn't fix the writev() performance regression which
nobody has measured yet but which the NFS guys reported.

But I think with this fix in place we can go back to a modified version of
the 2.6.17 filemap.c code and get that performance back, but I haven't
thought about that.

It's a heck of a lot simpler than your patches though ;)





There is a longstanding problem in which the kernel can deadlock if an
application performs a write(fd, buf, n), and `buf' points at an mmapped
page of `fd', and that page is the pagecache page into which this write
will write, and the page is not yet present.

The kernel will put a new page into pagecache, lock it, run
copy_from_user(). The copy will fault and filemap_nopage() will try to
lock the page in preparation for reading it in. But it was already
locked up in generic_file_write().

We have had a workaround in there - touch the userspace address by hand
(to fault it in) before locking the pagecache page, and hope that it
doesn't get evicted before we try to lock it.

But there's a place in the kmap_atomic-for-generic_file_write code
where this race is detectable. I put a printk in there and saw a
stream of them during heavy testing. So the workaround doesn't work.


What this patch does is

- within generic_file_write(): record the locked page in current->locked_page

- within filemap_nopage, look to see if the page which we're faulting
in is equal to current->locked_page.

If it is, taken special action: instead of locking the page,
reading it and unlocking it, we assume that the page is already
locked. Bring it uptodate and relock it before returning.


I tested this by disabling the fault-it-in-by-hand workaround, writing
a special test app and adding several printks. Not pretty, but it does
the job.




include/linux/sched.h | 2 ++
mm/filemap.c | 26 ++++++++++++++++++++++----
2 files changed, 24 insertions(+), 4 deletions(-)

--- linux-mnm/mm/filemap.c~write-deadlock Thu Oct 31 22:42:38 2002
+++ linux-mnm-akpm/mm/filemap.c Thu Oct 31 22:42:38 2002
@@ -997,6 +997,7 @@ struct page * filemap_nopage(struct vm_a
struct page *page;
unsigned long size, pgoff, endoff;
int did_readahead;
+ struct page *locked_page = current->locked_page;

pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
@@ -1092,10 +1093,12 @@ no_cached_page:

page_not_uptodate:
inc_page_state(pgmajfault);
- lock_page(page);
+ if (page != locked_page)
+ lock_page(page);

/* Did it get unhashed while we waited for it? */
if (!page->mapping) {
+ BUG_ON(page == locked_page); /* can't happen: i_sem */
unlock_page(page);
page_cache_release(page);
goto retry_all;
@@ -1103,12 +1106,15 @@ page_not_uptodate:

/* Did somebody else get it up-to-date? */
if (PageUptodate(page)) {
- unlock_page(page);
+ if (page != locked_page)
+ unlock_page(page);
goto success;
}

if (!mapping->a_ops->readpage(file, page)) {
wait_on_page_locked(page);
+ if (page == locked_page)
+ lock_page(page);
if (PageUptodate(page))
goto success;
}
@@ -1119,10 +1125,12 @@ page_not_uptodate:
* because there really aren't any performance issues here
* and we need to check for errors.
*/
- lock_page(page);
+ if (page != locked_page)
+ lock_page(page);

/* Somebody truncated the page on us? */
if (!page->mapping) {
+ BUG_ON(page == locked_page); /* can't happen: i_sem */
unlock_page(page);
page_cache_release(page);
goto retry_all;
@@ -1130,12 +1138,15 @@ page_not_uptodate:

/* Somebody else successfully read it in? */
if (PageUptodate(page)) {
- unlock_page(page);
+ if (page != locked_page)
+ unlock_page(page);
goto success;
}
ClearPageError(page);
if (!mapping->a_ops->readpage(file, page)) {
wait_on_page_locked(page);
+ if (page == locked_page)
+ lock_page(page);
if (PageUptodate(page))
goto success;
}
@@ -1445,6 +1456,11 @@ void remove_suid(struct dentry *dentry)
}
}

+/*
+ * There is rare deadlock scenario in which the source and dest pages are
+ * the same, and the VM evicts the page at the wrong time. Here we set
+ * current->locked_page to signal that to special-case code in filemap_nopage
+ */
static inline int
filemap_copy_from_user(struct page *page, unsigned long offset,
const char *buf, unsigned bytes)
@@ -1459,7 +1475,9 @@ filemap_copy_from_user(struct page *page
if (left != 0) {
/* Do it the slow way */
kaddr = kmap(page);
+ current->locked_page = page;
left = __copy_from_user(kaddr + offset, buf, bytes);
+ current->locked_page = NULL;
kunmap(page);
}
return left;
--- linux-mnm/include/linux/sched.h~write-deadlock Thu Oct 31 22:42:38 2002
+++ linux-mnm-akpm/include/linux/sched.h Thu Oct 31 22:42:38 2002
@@ -266,6 +266,7 @@ extern struct user_struct root_user;

typedef struct prio_array prio_array_t;
struct backing_dev_info;
+struct page;

struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
@@ -387,6 +388,7 @@ struct task_struct {
void *journal_info;
struct dentry *proc_dentry;
struct backing_dev_info *backing_dev_info;
+ struct page *locked_page;
};

extern void __put_task_struct(struct task_struct *tsk);

.

2007-01-30 23:21:31

by Andrew Morton

Subject: Re: [patch 0/9] buffered write deadlock fix

On Tue, 30 Jan 2007 12:55:58 -0800
Andrew Morton <[email protected]> wrote:

> y'know, four or five years back I fixed this bug by doing
>
> current->locked_page = page;
>
> in the write() code, and then teaching the pagefault code to avoid locking
> the same page. Patch below.
>
> But then evil mean Hugh pointed out that the patch is still vulnerable to
> ab/ba deadlocking so I dropped it.

And he was right, of course. Task A holds file a's i_mutex and takes a
fault against file b's page. Task B holds file b's i_mutex and takes a
fault against file a's page. Drat.

I wonder if there's a sane way of preventing that.

2007-01-31 00:32:18

by Nick Piggin

Subject: Re: [patch 0/9] buffered write deadlock fix

On Tue, Jan 30, 2007 at 12:55:58PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <[email protected]> wrote:
>
> > The following set of patches attempt to fix the buffered write
> > locking problems
>
> y'know, four or five years back I fixed this bug by doing
>
> current->locked_page = page;
>
> in the write() code, and then teaching the pagefault code to avoid locking
> the same page. Patch below.
>
> But then evil mean Hugh pointed out that the patch is still vulnerable to
> ab/ba deadlocking so I dropped it.
>
> But Hugh lied! There is no ab/ba deadlock because both a and b need
> i_mutex to get into the write() path.

Not only is there still the abba deadlock on page locks, as you point
out in your next mail, but there is also an abba on page lock vs mmap_sem.

> This approach doesn't fix the writev() performance regression which
> nobody has measured yet but which the NFS guys reported.
>
> But I think with this fix in place we can go back to a modified version of
> the 2.6.17 filemap.c code and get that performance back, but I haven't
> thought about that.
>
> It's a heck of a lot simpler than your patches though ;)

Ignoring the cleanup patches, the only thing mine really do is to use
get_user_pages in generic_file_buffered_write if the destination page
is not uptodate, and use atomic copies if it is. Conceptually pretty simple.

My patches fix the writev performance regression iff the destination page
is uptodate. More importantly, they fix 3 deadlocks in the core mm.

2007-01-31 01:32:07

by Nick Piggin

Subject: Re: [patch 0/9] buffered write deadlock fix

On Tue, Jan 30, 2007 at 03:21:19PM -0800, Andrew Morton wrote:
> On Tue, 30 Jan 2007 12:55:58 -0800
> Andrew Morton <[email protected]> wrote:
>
> > y'know, four or five years back I fixed this bug by doing
> >
> > current->locked_page = page;
> >
> > in the write() code, and then teaching the pagefault code to avoid locking
> > the same page. Patch below.
> >
> > But then evil mean Hugh pointed out that the patch is still vulnerable to
> > ab/ba deadlocking so I dropped it.
>
> And he was right, of course. Task A holds file a's i_mutex and takes a
> fault against file b's page. Task B holds file b's i_mutex and takes a
> fault against file a's page. Drat.
>
> I wonder if there's a sane way of preventing that.

If you want to go down the path of carrying state around in task_struct,
you can take the mmap_sem and set a flag, then get_user_pages the source
page and lock both source and destination in ascending order, then your
page fault handler checks the flag and skips mmap_sem, and the rest of
your fault path checks both the page locks you're holding.

At which point you arrive at a horrible mess :)
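
(For the record, the lock-ordering piece of that mess might look like the
following -- purely a sketch:)

        /* lock two pagecache pages in a stable (address) order */
        static void lock_two_pages(struct page *p1, struct page *p2)
        {
                if (p1 == p2) {
                        lock_page(p1);
                } else if (p1 < p2) {
                        lock_page(p1);
                        lock_page(p2);
                } else {
                        lock_page(p2);
                        lock_page(p1);
                }
        }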

2007-02-02 23:52:46

by Andrew Morton

Subject: Re: [patch 1/9] fs: libfs buffered write leak fix

On Mon, 29 Jan 2007 11:31:46 +0100 (CET)
Nick Piggin <[email protected]> wrote:

> simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.

They do? Under what situation?

> Fix the former,

How?

> make a note of the latter. Several other filesystems seem
> to be iffy here, too.

Please, tell us what the bug is so that others have a chance of reviewing
and, if needed, fixing those other filesystems.

> --- linux-2.6.orig/fs/libfs.c
> +++ linux-2.6/fs/libfs.c
> @@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
> int simple_prepare_write(struct file *file, struct page *page,
> unsigned from, unsigned to)
> {
> - if (!PageUptodate(page)) {
> - if (to - from != PAGE_CACHE_SIZE) {
> - void *kaddr = kmap_atomic(page, KM_USER0);
> - memset(kaddr, 0, from);
> - memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
> - flush_dcache_page(page);
> - kunmap_atomic(kaddr, KM_USER0);
> - }
> + if (PageUptodate(page))
> + return 0;
> +
> + if (to - from != PAGE_CACHE_SIZE) {
> + clear_highpage(page);
> + flush_dcache_page(page);
> SetPageUptodate(page);
> }

memclear_highpage_flush() is fashionable.

> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
>
> if (is_mapped_to_disk)
> SetPageMappedToDisk(page);
> +
> + /* XXX: information leak vs read(2) */
> SetPageUptodate(page);
>
> /*

That comment is too terse to be useful.

2007-02-02 23:52:50

by Andrew Morton

Subject: Re: [patch 0/9] buffered write deadlock fix

On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
Nick Piggin <[email protected]> wrote:

> The following set of patches attempt to fix the buffered write
> locking problems (and there are a couple of peripheral patches
> and cleanups there too).
>
> Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> be an easier diff with the fsaio patches gone, but the readahead
> rewrite clashes badly :(

Well fsaio is restored, but there's now considerable doubt over it due to
the recent febrile febrility.

How bad is the clash with the readahead patches?

Clashes with git-block are likely, too.

Bugfixes come first, so I will drop readahead and fsaio and git-block to get
this work completed if needed - please work against mainline.

2007-02-02 23:53:24

by Andrew Morton

Subject: Re: [patch 9/9] mm: fix pagecache write deadlocks

On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
Nick Piggin <[email protected]> wrote:

> Modify the core write() code so that it won't take a pagefault while holding a
> lock on the pagecache page. There are a number of different deadlocks possible
> if we try to do such a thing:
>
> 1. generic_buffered_write
> 2. lock_page
> 3. prepare_write
> 4. unlock_page+vmtruncate
> 5. copy_from_user
> 6. mmap_sem(r)
> 7. handle_mm_fault
> 8. lock_page (filemap_nopage)
> 9. commit_write
> 10. unlock_page
>
> a. sys_munmap / sys_mlock / others
> b. mmap_sem(w)
> c. make_pages_present
> d. get_user_pages
> e. handle_mm_fault
> f. lock_page (filemap_nopage)
>
> 2,8 - recursive deadlock if page is same
> 2,8;2,8 - ABBA deadlock if page is different
> 2,6;b,f - ABBA deadlock if page is same
>
> The solution is as follows:
> 1. If we find the destination page is uptodate, continue as normal, but use
> atomic usercopies which do not take pagefaults and do not zero the uncopied
> tail of the destination. The destination is already uptodate, so we can
> commit_write the full length even if there was a partial copy: it does not
> matter that the tail was not modified, because if it is dirtied and written
> back to disk it will not cause any problems (uptodate *means* that the
> destination page is as new or newer than the copy on disk).
>
> 1a. The above requires that fault_in_pages_readable correctly returns access
> information, because atomic usercopies cannot distinguish between
> non-present pages in a readable mapping and the lack of a readable mapping.
>
> 2. If we find the destination page is non uptodate, unlock it (this could be
> made slightly more optimal), then find and pin the source page with
> get_user_pages. Relock the destination page and continue with the copy.
> However, instead of a usercopy (which might take a fault), copy the data
> via the kernel address space.
>

Oh what a mess we're making :(

Unfortunately, write() into a non-uptodate page is very much the common
case. We've always tried to avoid doing a pte-walk in the write() path to
fix this bug. Careful performance testing is needed here so we can assess
the impact. For threaded applications, simply the taking of mmap_sem might
be the biggest problem.

And I can't think of any tricks we can play to avoid doing the pte-walk in
most cases. For example, we don't yet have a page to run page_mapped()
against.

> break;
> }
>
> + /*
> + * non-uptodate pages cannot cope with short copies, and we
> + * cannot take a pagefault with the destination page locked.
> + * So pin the source page to copy it.
> + */
> + if (!PageUptodate(page)) {
> + unlock_page(page);
> +
> + bytes = min(bytes, PAGE_CACHE_SIZE -
> + ((unsigned long)buf & ~PAGE_CACHE_MASK));
> +
> + /*
> + * Cannot get_user_pages with a page locked for the
> + * same reason as we can't take a page fault with a
> + * page locked (as explained below).
> + */
> + down_read(&current->mm->mmap_sem);
> + status = get_user_pages(current, current->mm,
> + (unsigned long)buf & PAGE_CACHE_MASK, 1,
> + 0, 0, &src_page, NULL);
> + up_read(&current->mm->mmap_sem);
> + if (status != 1) {
> + page_cache_release(page);
> + break;
> + }
> +
> + lock_page(page);
> + if (!page->mapping) {

Hopefully this can't happen? If it can, who went and took our page off the
mapping? Reclaim? The elevated page_count will prevent that?

> + unlock_page(page);
> + page_cache_release(page);
> + page_cache_release(src_page);
> + continue;
> + }
> +
> + }
> +
> status = a_ops->prepare_write(file, page, offset, offset+bytes);
> if (unlikely(status))
> goto fs_write_aop_error;
>
> - copied = filemap_copy_from_user(page, offset,
> + if (!src_page) {
> + /*
> + * Must not enter the pagefault handler here, because
> + * we hold the page lock, so we might recursively
> + * deadlock on the same lock, or get an ABBA deadlock
> + * against a different lock, or against the mmap_sem
> + * (which nests outside the page lock). So increment
> + * preempt count, and use _atomic usercopies.
> + *
> + * The page is uptodate so we are OK to encounter a
> + * short copy: if unmodified parts of the page are
> + * marked dirty and written out to disk, it doesn't
> + * really matter.
> + */
> + pagefault_disable();
> + copied = filemap_copy_from_user_atomic(page, offset,
> cur_iov, nr_segs, iov_offset, bytes);
> + pagefault_enable();
> + } else {
> + char *src, *dst;
> + src = kmap(src_page);
> + dst = kmap(page);
> + memcpy(dst + offset,
> + src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> + bytes);
> + kunmap(page);
> + kunmap(src_page);
> + copied = bytes;
> + }

As you say, we don't want to do this: taking two kmap()s at the same time
leaves us vulnerable to kmap() deadlocks: we deadlock when there are 512
tasks (LAST_PKMAP) each holding one kmap, and all waiting for someone else
to give one back.



2007-02-03 01:23:10

by Nick Piggin

Subject: Re: [patch 0/9] buffered write deadlock fix

On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <[email protected]> wrote:
>
> > The following set of patches attempt to fix the buffered write
> > locking problems (and there are a couple of peripheral patches
> > and cleanups there too).
> >
> > Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> > be an easier diff with the fsaio patches gone, but the readahead
> > rewrite clashes badly :(
>
> Well fsaio is restored, but there's now considerable doubt over it due to
> the recent febrile febrility.
>
> How bad is the clash with the readahead patches?

I don't think it would be so bad that one couldn't merge readahead
back on top quite easily... The fsaio ones are a little harder because
they change generic_file_buffered_write.

> Clashes with git-block are likely, too.
>
> Bugfixes come first, so I will drop readahead and fsaio and git-block to get
> this work completed if needed - please work against mainline.

OK.

2007-02-03 01:33:18

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/9] fs: libfs buffered write leak fix

On Fri, Feb 02, 2007 at 03:52:36PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:46 +0100 (CET)
> Nick Piggin <[email protected]> wrote:
>
> > simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
>
> They do? Under what situation?

Yes, I have at least reproduced the libfs leak.

The situation is this: when you write into a !uptodate page, the
prepare_write function runs SetPageUptodate *before* we have copied the
data in. Thus a concurrent read can pull uninitialised data out of there.

SetPageUptodate must not be used (or at least used carefully) in
prepare_write. commit_write is the correct place to do this.
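
To make the window concrete, a reproducer along these lines can catch it
(a sketch only: the ramfs mount point and offsets are illustrative, and
hitting the race window is probabilistic):

#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define FILENAME "/tmp/ramfs/leaktest"  /* any fs using the simple_* aops */
#define PG  4096
#define OFF 1024        /* the write lands mid-page */

static void *reader(void *arg)
{
        char buf[3];
        int fd = open(FILENAME, O_RDONLY);
        int i;

        for (;;) {
                if (pread(fd, buf, 3, OFF) != 3)
                        continue;       /* raced with the truncate */
                /* the hole is either zeroed (fixed), holds the written
                 * 'X's, or -- with the bug -- stale kernel memory */
                for (i = 0; i < 3; i++)
                        if (buf[i] != 0 && buf[i] != 'X')
                                printf("leaked byte 0x%02x\n",
                                                (unsigned char)buf[i]);
        }
        return NULL;
}

int main(void)
{
        int fd = open(FILENAME, O_CREAT | O_RDWR, 0600);
        pthread_t t;

        pthread_create(&t, NULL, reader, NULL);
        for (;;) {
                ftruncate(fd, 0);       /* drop the cached page */
                ftruncate(fd, PG);      /* i_size covers the page again */
                /* instantiates a fresh !uptodate page: between
                 * SetPageUptodate in prepare_write and the usercopy,
                 * bytes [OFF, OFF+3) are uninitialised but readable */
                pwrite(fd, "XXX", 3, OFF);
        }
}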

> > Fix the former,
>
> How?

If doing a partial-write, simply clear the whole page and set it uptodate
(don't need to get too tricky). If doing a full-write, only set it uptodate
in the commit_write.

> > make a note of the latter. Several other filesystems seem
> > to be iffy here, too.
>
> Please, tell us what the bug is so that others have a chance of reviewing
> and, if needed, fixing those other filesystems.
>
> > --- linux-2.6.orig/fs/libfs.c
> > +++ linux-2.6/fs/libfs.c
> > @@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
> > int simple_prepare_write(struct file *file, struct page *page,
> > unsigned from, unsigned to)
> > {
> > - if (!PageUptodate(page)) {
> > - if (to - from != PAGE_CACHE_SIZE) {
> > - void *kaddr = kmap_atomic(page, KM_USER0);
> > - memset(kaddr, 0, from);
> > - memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
> > - flush_dcache_page(page);
> > - kunmap_atomic(kaddr, KM_USER0);
> > - }
> > + if (PageUptodate(page))
> > + return 0;
> > +
> > + if (to - from != PAGE_CACHE_SIZE) {
> > + clear_highpage(page);
> > + flush_dcache_page(page);
> > SetPageUptodate(page);
> > }
>
> memclear_highpage_flush() is fashionable.

Good one.

> > ===================================================================
> > --- linux-2.6.orig/fs/buffer.c
> > +++ linux-2.6/fs/buffer.c
> > @@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
> >
> > if (is_mapped_to_disk)
> > SetPageMappedToDisk(page);
> > +
> > + /* XXX: information leak vs read(2) */
> > SetPageUptodate(page);
> >
> > /*
>
> That comment is too terse to be useful.

OK, similar problem here - we have brought all the buffers uptodate
that we are *not* going to write over, or partially write over, but
we can have an uninitialised hole over the region we want to write.

I think just setting page uptodate in commit_write might do the
trick? (and getting rid of the set_page_dirty there).

2007-02-03 01:38:26

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 9/9] mm: fix pagecache write deadlocks

On Fri, Feb 02, 2007 at 03:53:11PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:33:03 +0100 (CET)
> Nick Piggin <[email protected]> wrote:
>
> > Modify the core write() code so that it won't take a pagefault while holding a
> > lock on the pagecache page. There are a number of different deadlocks possible
> > if we try to do such a thing:
> >
> > 1. generic_buffered_write
> > 2. lock_page
> > 3. prepare_write
> > 4. unlock_page+vmtruncate
> > 5. copy_from_user
> > 6. mmap_sem(r)
> > 7. handle_mm_fault
> > 8. lock_page (filemap_nopage)
> > 9. commit_write
> > 10. unlock_page
> >
> > a. sys_munmap / sys_mlock / others
> > b. mmap_sem(w)
> > c. make_pages_present
> > d. get_user_pages
> > e. handle_mm_fault
> > f. lock_page (filemap_nopage)
> >
> > 2,8 - recursive deadlock if page is same
> > 2,8;2,8 - ABBA deadlock if page is different
> > 2,6;b,f - ABBA deadlock if page is same
> >
> > The solution is as follows:
> > 1. If we find the destination page is uptodate, continue as normal, but use
> > atomic usercopies which do not take pagefaults and do not zero the uncopied
> > tail of the destination. The destination is already uptodate, so we can
> > commit_write the full length even if there was a partial copy: it does not
> > matter that the tail was not modified, because if it is dirtied and written
> > back to disk it will not cause any problems (uptodate *means* that the
> > destination page is as new or newer than the copy on disk).
> >
> > 1a. The above requires that fault_in_pages_readable correctly returns access
> > information, because atomic usercopies cannot distinguish between
> > non-present pages in a readable mapping and the lack of a readable mapping.
> >
> > 2. If we find the destination page is non uptodate, unlock it (this could be
> > made slightly more optimal), then find and pin the source page with
> > get_user_pages. Relock the destination page and continue with the copy.
> > However, instead of a usercopy (which might take a fault), copy the data
> > via the kernel address space.
> >
>
> Oh what a mess we're making :(
>
> Unfortunately, write() into a non-uptodate page is very much the common
> case. We've always tried to avoid doing a pte-walk in the write() path to
> fix this bug. Careful performance testing is needed here so we can assess
> the impact. For threaded applications, simply the taking of mmap_sem might
> be the biggest problem.
>
> And I can't think of any tricks we can play to avoid doing the pte-walk in
> most cases. For example, we don't yet have a page to run page_mapped()
> against.

After this patch series, I am working on another that will allow filesystems
to specifically code around the problem (eg. by handling short usercopies
properly).

I tried to take this approach generically the first time, but it turns out
lots of filesystems had subtle problems, so if we do it this way instead,
then filesystem developers who actually care enough can improve their
code, and those that don't won't hold them back (or prevent this bug from
being fixed).
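
For reference, on an uptodate destination the fast path boils down to
the following shape (a sketch with the surrounding loop and error
handling elided; the helpers are the ones the full patch uses, and per
point 1a the series also makes fault_in_pages_readable report whether
the access faulted):

        fault_in_pages_readable(buf, bytes); /* prefault source, no locks held */

        lock_page(page);
        status = a_ops->prepare_write(file, page, offset, offset+bytes);
        if (unlikely(status))
                goto fs_write_aop_error;

        pagefault_disable();
        copied = filemap_copy_from_user_atomic(page, offset,
                        cur_iov, nr_segs, iov_offset, bytes);
        pagefault_enable();

        /*
         * A short copy is harmless here: the page was already uptodate,
         * so committing the full range only redirties bytes that match
         * what is on disk; the outer loop re-copies the tail.
         */
        status = a_ops->commit_write(file, page, offset, offset+bytes);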

> > break;
> > }
> >
> > + /*
> > + * non-uptodate pages cannot cope with short copies, and we
> > + * cannot take a pagefault with the destination page locked.
> > + * So pin the source page to copy it.
> > + */
> > + if (!PageUptodate(page)) {
> > + unlock_page(page);
> > +
> > + bytes = min(bytes, PAGE_CACHE_SIZE -
> > + ((unsigned long)buf & ~PAGE_CACHE_MASK));
> > +
> > + /*
> > + * Cannot get_user_pages with a page locked for the
> > + * same reason as we can't take a page fault with a
> > + * page locked (as explained below).
> > + */
> > + down_read(&current->mm->mmap_sem);
> > + status = get_user_pages(current, current->mm,
> > + (unsigned long)buf & PAGE_CACHE_MASK, 1,
> > + 0, 0, &src_page, NULL);
> > + up_read(&current->mm->mmap_sem);
> > + if (status != 1) {
> > + page_cache_release(page);
> > + break;
> > + }
> > +
> > + lock_page(page);
> > + if (!page->mapping) {
>
> Hopefully this can't happen? If it can, who went and took our page off the
> mapping? Reclaim? The elevated page_count will prevent that?

Truncate/invalidate?

> > + unlock_page(page);
> > + page_cache_release(page);
> > + page_cache_release(src_page);
> > + continue;
> > + }
> > +
> > + }
> > +
> > status = a_ops->prepare_write(file, page, offset, offset+bytes);
> > if (unlikely(status))
> > goto fs_write_aop_error;
> >
> > - copied = filemap_copy_from_user(page, offset,
> > + if (!src_page) {
> > + /*
> > + * Must not enter the pagefault handler here, because
> > + * we hold the page lock, so we might recursively
> > + * deadlock on the same lock, or get an ABBA deadlock
> > + * against a different lock, or against the mmap_sem
> > + * (which nests outside the page lock). So increment
> > + * preempt count, and use _atomic usercopies.
> > + *
> > + * The page is uptodate so we are OK to encounter a
> > + * short copy: if unmodified parts of the page are
> > + * marked dirty and written out to disk, it doesn't
> > + * really matter.
> > + */
> > + pagefault_disable();
> > + copied = filemap_copy_from_user_atomic(page, offset,
> > cur_iov, nr_segs, iov_offset, bytes);
> > + pagefault_enable();
> > + } else {
> > + char *src, *dst;
> > + src = kmap(src_page);
> > + dst = kmap(page);
> > + memcpy(dst + offset,
> > + src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
> > + bytes);
> > + kunmap(page);
> > + kunmap(src_page);
> > + copied = bytes;
> > + }
>
> As you say, we don't want to do this: taking two kmap()s at the same time
> leaves us vulnerable to kmap() deadlocks: we deadlock when there are 512
> tasks (LAST_PKMAP) each holding one kmap, and all waiting for someone else
> to give one back.

Yep, thinko. My updated version uses KM_USER[01].
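I.e. something like this (a sketch of the fixed copy, not the final
patch text): per-CPU atomic kmaps can't pile up across tasks the way
sleeping kmap()s can, so taking two slots at once is safe as long as we
don't sleep in between.

        char *src, *dst;

        src = kmap_atomic(src_page, KM_USER0);
        dst = kmap_atomic(page, KM_USER1);
        memcpy(dst + offset,
                src + ((unsigned long)buf & ~PAGE_CACHE_MASK),
                bytes);
        kunmap_atomic(dst, KM_USER1);
        kunmap_atomic(src, KM_USER0);
        copied = bytes;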

Thanks for reviewing so far.

2007-02-03 01:58:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 1/9] fs: libfs buffered write leak fix

On Sat, 3 Feb 2007 02:33:16 +0100
Nick Piggin <[email protected]> wrote:

> > > ===================================================================
> > > --- linux-2.6.orig/fs/buffer.c
> > > +++ linux-2.6/fs/buffer.c
> > > @@ -2344,6 +2344,8 @@ int nobh_prepare_write(struct page *page
> > >
> > > if (is_mapped_to_disk)
> > > SetPageMappedToDisk(page);
> > > +
> > > + /* XXX: information leak vs read(2) */
> > > SetPageUptodate(page);
> > >
> > > /*
> >
> > That comment is too terse to be useful.
>
> OK, similar problem here - we have brought all the buffers uptodate
> that we are *not* going to write over, or partially write over, but
> we can have an uninitialised hole over the region we want to write.
>
> I think just setting page uptodate in commit_write might do the
> trick? (and getting rid of the set_page_dirty there).

Yes, the page just isn't uptodate yet in prepare_write() - moving things
to commit_write() sounds sane.

But please, can we have sufficient changelogs and comments in the next version?

2007-02-03 02:09:29

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/9] fs: libfs buffered write leak fix

On Fri, Feb 02, 2007 at 05:58:01PM -0800, Andrew Morton wrote:
> On Sat, 3 Feb 2007 02:33:16 +0100
> Nick Piggin <[email protected]> wrote:
>
> > I think just setting page uptodate in commit_write might do the
> > trick? (and getting rid of the set_page_dirty there).
>
> Yes, the page just isn't uptodate yet in prepare_write() - moving things
> to commit_write() sounds sane.
>
> But please, can we have sufficient changelogs and comments in the next version?

You're right, sorry. Is this any better? (warning: nobh code is untested)

--
simple_prepare_write and nobh_prepare_write leak uninitialised kernel data.
This happens because the prepare_write functions leave an uninitialised
"hole" over the part of the page that the write is expected to go to. This
is fine, but they then mark the page uptodate, which means a concurrent read
can come in and copy the uninitialised memory into userspace before it is written
to.

Fix simple_prepare_write by initialising the whole page in the case of a
partial-page write. In the case of a full-page write, we don't SetPageUptodate
until commit_write time. Fix nobh_prepare_write the same way, by moving the
SetPageUptodate into nobh_commit_write.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,25 +327,32 @@ int simple_readpage(struct file *file, s
int simple_prepare_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- if (!PageUptodate(page)) {
- if (to - from != PAGE_CACHE_SIZE) {
- void *kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr, 0, from);
- memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
- }
+ if (PageUptodate(page))
+ return 0;
+
+ if (to - from != PAGE_CACHE_SIZE) {
+ /*
+ * Partial-page write? Initialise the complete page and
+ * set it uptodate. We could avoid initialising the
+ * (from, to) hole, and opt to mark it uptodate in
+ * simple_commit_write, but that's probably only a win
+ * for filesystems that would need to read blocks off disk.
+ */
+ memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
SetPageUptodate(page);
}
+
return 0;
}

int simple_commit_write(struct file *file, struct page *page,
- unsigned offset, unsigned to)
+ unsigned from, unsigned to)
{
struct inode *inode = page->mapping->host;
loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;

+ if (to - from == PAGE_CACHE_SIZE)
+ SetPageUptodate(page);
/*
* No need to use i_size_read() here, the i_size
* cannot change under us because we hold the i_mutex.
@@ -353,6 +360,7 @@ int simple_commit_write(struct file *fil
if (pos > inode->i_size)
i_size_write(inode, pos);
set_page_dirty(page);
+
return 0;
}

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2344,17 +2344,6 @@ int nobh_prepare_write(struct page *page

if (is_mapped_to_disk)
SetPageMappedToDisk(page);
- SetPageUptodate(page);
-
- /*
- * Setting the page dirty here isn't necessary for the prepare_write
- * function - commit_write will do that. But if/when this function is
- * used within the pagefault handler to ensure that all mmapped pages
- * have backing space in the filesystem, we will need to dirty the page
- * if its contents were altered.
- */
- if (dirtied_it)
- set_page_dirty(page);

return 0;

@@ -2384,6 +2373,7 @@ int nobh_commit_write(struct file *file,
struct inode *inode = page->mapping->host;
loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;

+ SetPageUptodate(page);
set_page_dirty(page);
if (pos > inode->i_size) {
i_size_write(inode, pos);
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -617,6 +617,11 @@ struct address_space_operations {
 In this case the prepare_write will be retried once the lock is
regained.

+ Note: the page _must not_ be marked uptodate in this function
+ (or anywhere else) unless it actually is uptodate right now. As
+ soon as a page is marked uptodate, it is possible for a concurrent
+ read(2) to copy it to userspace.
+
commit_write: If prepare_write succeeds, new data will be copied
into the page and then commit_write will be called. It will
typically update the size of the file (if appropriate) and

2007-02-03 02:20:03

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 1/9] fs: libfs buffered write leak fix

On Sat, 3 Feb 2007 03:09:26 +0100
Nick Piggin <[email protected]> wrote:

> From: Nick Piggin <[email protected]>
> To: Andrew Morton <[email protected]>

argh. Yesterday all my emails were getting a mysterious
s/osdl/linux-foundation/ done to them at the server, so I switched everything
over. Now it would appear that they are getting an equally mysterious
s/linux-foundation/osdl/ done to them. I assume you sent this to
[email protected]?


> Cc: Linux Kernel <[email protected]>, Linux Filesystems <[email protected]>, Linux Memory Management <[email protected]>
> Subject: Re: [patch 1/9] fs: libfs buffered write leak fix
> Date: Sat, 3 Feb 2007 03:09:26 +0100
> User-Agent: Mutt/1.5.9i
>
> On Fri, Feb 02, 2007 at 05:58:01PM -0800, Andrew Morton wrote:
> > On Sat, 3 Feb 2007 02:33:16 +0100
> > Nick Piggin <[email protected]> wrote:
> >
> > > I think just setting page uptodate in commit_write might do the
> > > trick? (and getting rid of the set_page_dirty there).
> >
> > Yes, the page just isn't uptodate yet in prepare_write() - moving things
> > > to commit_write() sounds sane.
> >
> > But please, can we have sufficient changelogs and comments in the next version?
>
> You're right, sorry. Is this any better?

yup, thanks.

> (warning: nobh code is untested)

ow.

2007-02-03 02:28:54

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/9] fs: libfs buffered write leak fix

On Fri, Feb 02, 2007 at 06:19:55PM -0800, Andrew Morton wrote:
> On Sat, 3 Feb 2007 03:09:26 +0100
> Nick Piggin <[email protected]> wrote:
>
> > From: Nick Piggin <[email protected]>
> > To: Andrew Morton <[email protected]>
>
> argh. Yesterday all my emails were getting a mysterious
> s/osdl/linux-foundation/ done to them at the server, so I switched everything
> over. Now it would appear that they are getting an equally mysterious
> s/linux-foundation/osdl/ done to them. I assume you sent this to
> [email protected]?

No. Your first reply I got to this patch came from linux-foundation, and
that's what I replied to. Your subsequent reply back to me ("Yes, the page
just isn't uptodate yet...") came from osdl.org, which is what I replied
to.

> > Cc: Linux Kernel <[email protected]>, Linux Filesystems <[email protected]>, Linux Memory Management <[email protected]>
> > Subject: Re: [patch 1/9] fs: libfs buffered write leak fix
> > Date: Sat, 3 Feb 2007 03:09:26 +0100
> > User-Agent: Mutt/1.5.9i
> >
> > On Fri, Feb 02, 2007 at 05:58:01PM -0800, Andrew Morton wrote:
> > > On Sat, 3 Feb 2007 02:33:16 +0100
> > > Nick Piggin <[email protected]> wrote:
> > >
> > > > I think just setting page uptodate in commit_write might do the
> > > > trick? (and getting rid of the set_page_dirty there).
> > >
> > > Yes, the page just isn't uptodate yet in prepare_write() - moving things
> > > to commit_write() sounds sane.
> > >
> > > But please, can we have sufficient changelogs and comments in the next version?
> >
> > You're right, sorry. Is this any better?
>
> yup, thanks.
>
> > (warning: nobh code is untested)
>
> ow.

I'll get a chance to do that later today. I have to fire up the old test
case and see if I can reproduce the problem with nobh on a real fs...
Will get back to you when I do.

2007-02-03 06:41:00

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [patch 0/9] buffered write deadlock fix

On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 11:31:37 +0100 (CET)
> Nick Piggin <[email protected]> wrote:
>
> > The following set of patches attempt to fix the buffered write
> > locking problems (and there are a couple of peripheral patches
> > and cleanups there too).
> >
> > Patches against 2.6.20-rc6. I was hoping that 2.6.20-rc6-mm2 would
> > be an easier diff with the fsaio patches gone, but the readahead
> > rewrite clashes badly :(
>
> Well fsaio is restored, but there's now considerable doubt over it due to
> the recent febril febrility.

I think Ingo made a point earlier about letting the old co-exist with the
new. Fibrils + kevents have great potential as a next-generation solution,
but we need to give the whole story some time to play out: to prove it in
practice, debate and benchmark the alternative combinations, optimize it
for various workloads, etc. It will also take more work on top before we
can get the whole POSIX AIO implementation supported on top of this. I'll
be very happy when that happens ... it is just that it is still too early
to be sure.

Since this is going to be a new interface, not the existing linux AIO
interface, I do not see any conflict between the two. Samba4 already uses
fsaio, and we now have the ability to do POSIX AIO over kernel AIO (which
depends on fsaio). The more we delay real world usage the longer we take
to learn about the application patterns that matter. And it is those
patterns that are key.

>
> How bad is the clash with the readahead patches?
>
> Clashes with git-block are likely, too.
>
> Bugfixes come first, so I will drop readahead and fsaio and git-block to get
> this work completed if needed - please work against mainline.

If you need help with fixing the clashes, please let me know.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-02-03 15:30:34

by Wu Fengguang

[permalink] [raw]
Subject: Re: [patch 0/9] buffered write deadlock fix

On Fri, Feb 02, 2007 at 03:52:32PM -0800, Andrew Morton wrote:
> Bugfixes come first, so I will drop readahead and fsaio and git-block to get
> this work completed if needed - please work against mainline.

OK with readahead.

There are too many fixes in the series. I'd like to fold them up and
update some changelogs, and then there will be one more update.

Regards,
Wu

2007-02-03 17:53:41

by Jörn Engel

[permalink] [raw]
Subject: Re: [patch 1/9] fs: libfs buffered write leak fix

On Sat, 3 February 2007 02:33:16 +0100, Nick Piggin wrote:
>
> If doing a partial-write, simply clear the whole page and set it uptodate
> (don't need to get too tricky).

That sounds just like a bug I recently fixed in logfs. prepare_write()
would clear the page, commit_write() would write the whole page. The bug
can be reproduced with a simple testcase:

echo -n foo > foo
echo -n bar >> foo
cat foo

With the bug, the second write will replace "foo" with "\0\0\0" and
cat will return "bar". Doing a read instead of clearing the page will
return "foobar", as would be expected.

Can you hit the same bug with your patch or did I miss something?

Jörn

--
When people work hard for you for a pat on the back, you've got
to give them that pat.
-- Robert Heinlein

2007-02-04 03:56:07

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/9] fs: libfs buffered write leak fix

On Sat, Feb 03, 2007 at 05:49:47PM +0000, Jörn Engel wrote:
> On Sat, 3 February 2007 02:33:16 +0100, Nick Piggin wrote:
> >
> > If doing a partial-write, simply clear the whole page and set it uptodate
> > (don't need to get too tricky).
>
> That sounds just like a bug I recently fixed in logfs. prepare_write()
> would clear the page, commit_write() would write the whole page. The bug
> can be reproduced with a simple testcase:
>
> echo -n foo > foo
> echo -n bar >> foo
> cat foo
>
> With the bug, the second write will replace "foo" with "\0\0\0" and
> cat will return "bar". Doing a read instead of clearing the page will
> return "foobar", as would be expected.
>
> Can you hit the same bug with your patch or did I miss something?

You missed that the page is only cleared if it is not uptodate: after the
first write the page *is* uptodate, so the append never clears it. This is
fine for the simple filesystems.
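
Tracing your testcase through the patched simple_* functions makes this
concrete (illustration only):

        /*
         *   echo -n foo > foo    page is !uptodate; partial write, so
         *                        prepare_write zeroes the page and sets
         *                        it uptodate, then "foo" is copied in
         *                        and committed
         *   echo -n bar >> foo   PageUptodate() is already true, so
         *                        prepare_write returns without touching
         *                        the page; "bar" lands at offset 3
         *   cat foo              -> "foobar"
         */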