2006-10-13 16:44:03

by Nick Piggin

[permalink] [raw]
Subject: [rfc] buffered write deadlock fix

The following set of patches attempt to fix the buffered write
locking problems.

While looking at this deadlock, it became apparent that there are
several others which are equally bad or worse. It will be very
good to fix these.

I ceased to be an admirer of this problem when it stopped my
pagefault vs invalidate race fix from being merged!

Review and comments would be very nice. Testing only if you don't
value your data. I realise all filesystem developers are busy
solving the 10TB fsck problem now, but please take a minute to
look at the fs/ changes, and also ensure your filesystem's
prepare_write and commit_write handlers aren't broken.

Sorry for the shotgun mail. It is your fault for ever being
mentioned in the same email as the buffered write deadlock ;)

Thanks,
Nick

--
SuSE Labs


2006-10-13 16:44:08

by Nick Piggin

[permalink] [raw]
Subject: [patch 1/6] mm: revert "generic_file_buffered_write(): handle zero length iovec segments"

From: Andrew Morton <[email protected]>

Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6.

This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.


Signed-off-by: Andrew Morton <[email protected]>
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1912,12 +1912,6 @@ generic_file_buffered_write(struct kiocb
break;
}

- if (unlikely(bytes == 0)) {
- status = 0;
- copied = 0;
- goto zero_length_segment;
- }
-
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
loff_t isize = i_size_read(inode);
@@ -1947,8 +1941,7 @@ generic_file_buffered_write(struct kiocb
page_cache_release(page);
continue;
}
-zero_length_segment:
- if (likely(copied >= 0)) {
+ if (likely(copied > 0)) {
if (!status)
status = copied;

Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
const struct iovec *iov = *iovp;
size_t base = *basep;

- do {
+ while (bytes) {
int copy = min(bytes, iov->iov_len - base);

bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
iov++;
base = 0;
}
- } while (bytes);
+ }
*iovp = iov;
*basep = base;
}

2006-10-13 16:44:58

by Nick Piggin

[permalink] [raw]
Subject: [patch 5/6] mm: debug write deadlocks

Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit.

This probably needn't go upstream.

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1895,6 +1895,7 @@ generic_file_buffered_write(struct kiocb
if (maxlen > bytes)
maxlen = bytes;

+#ifndef CONFIG_DEBUG_VM
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
@@ -1902,6 +1903,7 @@ generic_file_buffered_write(struct kiocb
* up-to-date.
*/
fault_in_pages_readable(buf, maxlen);
+#endif

page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {

2006-10-13 16:44:35

by Nick Piggin

[permalink] [raw]
Subject: [patch 2/6] mm: revert "generic_file_buffered_write(): deadlock on vectored write"

From: Andrew Morton <[email protected]>

Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

This patch fixed the following bug:

When prefaulting in the pages in generic_file_buffered_write(), we only
faulted in the pages for the first segment of the iovec. If the second or
a subsequent segment described an mmapping of the page into which we're
write()ing, and that page is not up-to-date, the fault handler tries to lock
the already-locked page (to bring it up to date) and deadlocks.

An exploit for this bug is in writev-deadlock-demo.c, in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

(These demos assume blocksize < PAGE_CACHE_SIZE).
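
Something like the following is the shape of that exploit (a hypothetical
sketch, not the actual writev-deadlock-demo.c; file name, sizes and offsets
are made up): the second iovec segment points into a mapping of the very
file being written, so prefaulting only the first segment doesn't help, and
the copy then faults on a page that write() is holding locked.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	static char first[512];
	struct iovec iov[2];
	char *map;

	memset(first, 'x', sizeof(first));
	ftruncate(fd, 65536);
	map = mmap(NULL, 65536, PROT_READ, MAP_SHARED, fd, 0);

	iov[0].iov_base = first;	/* only this segment gets prefaulted */
	iov[0].iov_len = sizeof(first);
	iov[1].iov_base = map;		/* maps the page write() has locked */
	iov[1].iov_len = 512;

	writev(fd, iov, 2);		/* faults under the page lock */
	return 0;
}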

The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment. So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem with the fix is that it didn't fix all the locking
problems.


<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Signed-off-by: Andrew Morton <[email protected]>
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1882,21 +1882,14 @@ generic_file_buffered_write(struct kiocb
do {
unsigned long index;
unsigned long offset;
+ unsigned long maxlen;
size_t copied;

offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
-
- /* Limit the size of the copy to the caller's write size */
- bytes = min(bytes, count);
-
- /*
- * Limit the size of the copy to that of the current segment,
- * because fault_in_pages_readable() doesn't know how to walk
- * segments.
- */
- bytes = min(bytes, cur_iov->iov_len - iov_base);
+ if (bytes > count)
+ bytes = count;

/*
* Bring in the user page that we will copy from _first_.
@@ -1904,7 +1897,10 @@ generic_file_buffered_write(struct kiocb
* same page as we're writing to, without it being marked
* up-to-date.
*/
- fault_in_pages_readable(buf, bytes);
+ maxlen = cur_iov->iov_len - iov_base;
+ if (maxlen > bytes)
+ maxlen = bytes;
+ fault_in_pages_readable(buf, maxlen);

page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {

2006-10-13 16:44:30

by Nick Piggin

[permalink] [raw]
Subject: [patch 3/6] mm: generic_file_buffered_write cleanup

From: Andrew Morton <[email protected]>

Signed-off-by: Andrew Morton <[email protected]>
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1855,16 +1855,15 @@ generic_file_buffered_write(struct kiocb
size_t count, ssize_t written)
{
struct file *file = iocb->ki_filp;
- struct address_space * mapping = file->f_mapping;
+ struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
struct page *page;
struct page *cached_page = NULL;
- size_t bytes;
struct pagevec lru_pvec;
const struct iovec *cur_iov = iov; /* current iovec */
- size_t iov_base = 0; /* offset in the current iovec */
+ size_t iov_offset = 0; /* offset in the current iovec */
char __user *buf;

pagevec_init(&lru_pvec, 0);
@@ -1875,31 +1874,33 @@ generic_file_buffered_write(struct kiocb
if (likely(nr_segs == 1))
buf = iov->iov_base + written;
else {
- filemap_set_next_iovec(&cur_iov, &iov_base, written);
- buf = cur_iov->iov_base + iov_base;
+ filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+ buf = cur_iov->iov_base + iov_offset;
}

do {
- unsigned long index;
- unsigned long offset;
- unsigned long maxlen;
- size_t copied;
+ pgoff_t index; /* Pagecache index for current page */
+ unsigned long offset; /* Offset into pagecache page */
+ unsigned long maxlen; /* Bytes remaining in current iovec */
+ size_t bytes; /* Bytes to write to page */
+ size_t copied; /* Bytes copied from user */

- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+ offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes > count)
bytes = count;

+ maxlen = cur_iov->iov_len - iov_offset;
+ if (maxlen > bytes)
+ maxlen = bytes;
+
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
*/
- maxlen = cur_iov->iov_len - iov_base;
- if (maxlen > bytes)
- maxlen = bytes;
fault_in_pages_readable(buf, maxlen);

page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1930,7 +1931,7 @@ generic_file_buffered_write(struct kiocb
buf, bytes);
else
copied = filemap_copy_from_user_iovec(page, offset,
- cur_iov, iov_base, bytes);
+ cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (status == AOP_TRUNCATED_PAGE) {
@@ -1948,12 +1949,12 @@ generic_file_buffered_write(struct kiocb
buf += status;
if (unlikely(nr_segs > 1)) {
filemap_set_next_iovec(&cur_iov,
- &iov_base, status);
+ &iov_offset, status);
if (count)
buf = cur_iov->iov_base +
- iov_base;
+ iov_offset;
} else {
- iov_base += status;
+ iov_offset += status;
}
}
}

2006-10-13 16:45:02

by Nick Piggin

[permalink] [raw]
Subject: [patch 4/6] mm: comment mmap_sem / lock_page lockorder

Add a few more examples to the mmap_sem / lock_page ordering.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -73,7 +73,7 @@ generic_file_direct_IO(int rw, struct ki
* ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock)
*
* ->mmap_sem
- * ->lock_page (access_process_vm)
+ * ->lock_page (page fault, sys_mmap, access_process_vm)
*
* ->mmap_sem
* ->i_mutex (msync)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -29,7 +29,7 @@
* taken together; in truncation, i_mutex is taken outermost.
*
* mm->mmap_sem
- * page->flags PG_locked (lock_page)
+ * page->flags PG_locked (lock_page, eg from pagefault)
* mapping->i_mmap_lock
* anon_vma->lock
* mm->page_table_lock or pte_lock

2006-10-13 16:45:40

by Nick Piggin

[permalink] [raw]
Subject: [patch 6/6] mm: fix pagecache write deadlocks

From: Andrew Morton <[email protected]> and Nick Piggin <[email protected]>

The idea is to modify the core write() code so that it won't take a pagefault
while holding a lock on the pagecache page. There are a number of different
deadlocks possible if we try to do such a thing:

1. generic_buffered_write
2. lock_page
3. prepare_write
4. unlock_page+vmtruncate
5. copy_from_user
6. mmap_sem(r)
7. handle_mm_fault
8. lock_page (filemap_nopage)
9. commit_write
1. unlock_page

b. sys_munmap / sys_mlock / others
c. mmap_sem(w)
d. make_pages_present
e. get_user_pages
f. handle_mm_fault
g. lock_page (filemap_nopage)

2,8 - recursive deadlock if page is same
2,8;2,7 - ABBA deadlock if page is different
2,6;c,g - ABBA deadlock if page is same

- Instead of copy_from_user(), use inc_preempt_count() and
copy_from_user_inatomic().

- If the copy_from_user_inatomic() hits a pagefault, it'll return a short
copy.

- if the page was not uptodate, we cannot commit the write, because the
uncopied part could contain uninitialised data. Commit a zero length copy,
which should do the right thing (ie. not set the page uptodate).

- if the page was uptodate, commit the copied portion so we make some
progress.

- unlock_page()

- go back and try to fault the page in again, redo the lock_page,
prepare_write, copy_from_user_inatomic(), etc.

- Now if we have a non uptodate page and we keep faulting on a 2nd or later
iovec, this gives a deadlock, because fault_in_pages_readable only faults
in the first iovec. To fix this situation, if we fault on a !uptodate page,
make the next iteration only attempt to copy a single iovec.

- This also showed up a number of buggy prepare_write / commit_write
implementations that were setting the page uptodate in the prepare_write
side: bad! this allows uninitialised data to be read. Fix these.

(also, rename maxlen to seglen, because it was confusing)
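
As a reading aid, here is a heavily condensed sketch of the resulting loop
(single-segment case shown; grab_and_lock_page() and advance() are
simplifications, not the real helpers -- the real code is in the patch below):

	while (count) {
		seglen = min(cur_iov->iov_len - iov_offset, bytes);
retry:
		fault_in_pages_readable(buf, seglen);	/* best effort only */

		page = grab_and_lock_page(mapping, index);	/* simplified */
		a_ops->prepare_write(file, page, offset, offset+bytes);

		inc_preempt_count();	/* no pagefaults under the page lock */
		copied = filemap_copy_from_user_atomic(page, offset, buf, bytes);
		dec_preempt_count();

		if (!PageUptodate(page) && copied != bytes)
			copied = 0;	/* never expose uninitialised data */

		a_ops->commit_write(file, page, offset, offset+copied);
		unlock_page(page);
		page_cache_release(page);

		if (copied == 0) {
			/* no progress: retry with a single segment, so the
			 * prefault above can bring the source page in */
			bytes = seglen;
			goto retry;
		}
		advance(&buf, &pos, &count, copied);	/* simplified bookkeeping */
	}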

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1881,7 +1881,7 @@ generic_file_buffered_write(struct kiocb
do {
pgoff_t index; /* Pagecache index for current page */
unsigned long offset; /* Offset into pagecache page */
- unsigned long maxlen; /* Bytes remaining in current iovec */
+ unsigned long seglen; /* Bytes remaining in current iovec */
size_t bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */

@@ -1891,18 +1891,16 @@ generic_file_buffered_write(struct kiocb
if (bytes > count)
bytes = count;

- maxlen = cur_iov->iov_len - iov_offset;
- if (maxlen > bytes)
- maxlen = bytes;
+ seglen = min(cur_iov->iov_len - iov_offset, bytes);

+retry_noprogress:
#ifndef CONFIG_DEBUG_VM
/*
- * Bring in the user page that we will copy from _first_.
- * Otherwise there's a nasty deadlock on copying from the
- * same page as we're writing to, without it being marked
- * up-to-date.
+ * Bring in the user page that we will copy from _first_, this
+ * minimises the chance we have to break into the slowpath
+ * below.
*/
- fault_in_pages_readable(buf, maxlen);
+ fault_in_pages_readable(buf, seglen);
#endif

page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1913,8 +1911,6 @@ generic_file_buffered_write(struct kiocb

status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
- loff_t isize = i_size_read(inode);
-
if (status != AOP_TRUNCATED_PAGE)
unlock_page(page);
page_cache_release(page);
@@ -1922,52 +1918,86 @@ generic_file_buffered_write(struct kiocb
continue;
/*
* prepare_write() may have instantiated a few blocks
- * outside i_size. Trim these off again.
+ * outside i_size. Trim these off again. Don't need
+ * i_size_read because we hold i_mutex.
*/
- if (pos + bytes > isize)
- vmtruncate(inode, isize);
+ if (pos + bytes > inode->i_size)
+ vmtruncate(inode, inode->i_size);
break;
}
+
+ /*
+ * Must not enter the pagefault handler here, because we hold
+ * the page lock, so we might recursively deadlock on the same
+ * lock, or get an ABBA deadlock against a different lock, or
+ * against the mmap_sem (which nests outside the page lock).
+ * So increment preempt count, and use _atomic usercopies.
+ */
+ inc_preempt_count();
if (likely(nr_segs == 1))
- copied = filemap_copy_from_user(page, offset,
+ copied = filemap_copy_from_user_atomic(page, offset,
buf, bytes);
else
- copied = filemap_copy_from_user_iovec(page, offset,
- cur_iov, iov_offset, bytes);
+ copied = filemap_copy_from_user_iovec_atomic(page,
+ offset, cur_iov, iov_offset,
+ bytes);
+ dec_preempt_count();
+
+ if (!PageUptodate(page)) {
+ /*
+ * If the page is not uptodate, we cannot allow a
+ * partial commit_write, because that might expose
+ * uninitialised data.
+ *
+ * We will enter the single-segment path below, which
+ * should get the filesystem to bring the page
+ * uputodate for us next time.
+ */
+ if (unlikely(copied != bytes))
+ copied = 0;
+ }
+
flush_dcache_page(page);
- status = a_ops->commit_write(file, page, offset, offset+bytes);
+ status = a_ops->commit_write(file, page, offset, offset+copied);
if (status == AOP_TRUNCATED_PAGE) {
page_cache_release(page);
continue;
}
- if (likely(copied > 0)) {
- if (!status)
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
- if (unlikely(nr_segs > 1)) {
- filemap_set_next_iovec(&cur_iov,
- &iov_offset, status);
- if (count)
- buf = cur_iov->iov_base +
- iov_offset;
- } else {
- iov_offset += status;
- }
- }
- }
- if (unlikely(copied != bytes))
- if (status >= 0)
- status = -EFAULT;
unlock_page(page);
mark_page_accessed(page);
page_cache_release(page);
+
if (status < 0)
break;
+ if (likely(copied > 0)) {
+ written += copied;
+ count -= copied;
+ pos += copied;
+ buf += copied;
+ if (unlikely(nr_segs > 1)) {
+ filemap_set_next_iovec(&cur_iov,
+ &iov_offset, copied);
+ if (count)
+ buf = cur_iov->iov_base + iov_offset;
+ } else {
+ iov_offset += copied;
+ }
+ } else {
+#ifdef CONFIG_DEBUG_VM
+ fault_in_pages_readable(buf, bytes);
+#endif
+ /*
+ * OK, we took a fault without making progress. Fall
+ * back to trying just a single segment next time.
+ * This is important to prevent deadlock if a full
+ * page write was not causing the page to be brought
+ * uptodate. A smaller write will tend to bring it
+ * uptodate in ->prepare_write, and thus we have a
+ * chance of making some progress.
+ */
+ bytes = seglen;
+ goto retry_noprogress;
+ }
balance_dirty_pages_ratelimited(mapping);
cond_resched();
} while (count);
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -22,19 +22,19 @@ __filemap_copy_from_user_iovec_inatomic(

/*
* Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied. If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
+ * were sucessfully copied. If a fault is encountered then return the number of
+ * bytes which were copied.
*
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy. If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero). A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * NOTE: For this to work reliably we really want
+ * copy_from_user_inatomic_nocache to *NOT* zero any tail of the buffer that it
+ * failed to copy. If it does, and if the following non-atomic copy succeeds,
+ * then there is a small window where the target page contains neither the data
+ * before the write, nor the data after the write (it contains zero). A read at
+ * this time will see data that is inconsistent with any ordering of the read
+ * and the write. (This has been detected in practice).
*/
static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
const char __user *buf, unsigned bytes)
{
char *kaddr;
@@ -44,23 +44,32 @@ filemap_copy_from_user(struct page *page
left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
kunmap_atomic(kaddr, KM_USER0);

- if (left != 0) {
- /* Do it the slow way */
- kaddr = kmap(page);
- left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
- kunmap(page);
- }
+ return bytes - left;
+}
+
+static inline size_t
+filemap_copy_from_user_nonatomic(struct page *page, unsigned long offset,
+ const char __user *buf, unsigned bytes)
+{
+ char *kaddr;
+ int left;
+
+ kaddr = kmap(page);
+ left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+ kunmap(page);
+
return bytes - left;
}

/*
- * This has the same sideeffects and return value as filemap_copy_from_user().
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_atomic().
* The difference is that on a fault we need to memset the remainder of the
* page (out to offset+bytes), to emulate filemap_copy_from_user()'s
* single-segment behaviour.
*/
static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
+filemap_copy_from_user_iovec_atomic(struct page *page, unsigned long offset,
const struct iovec *iov, size_t base, size_t bytes)
{
char *kaddr;
@@ -70,14 +79,27 @@ filemap_copy_from_user_iovec(struct page
copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
base, bytes);
kunmap_atomic(kaddr, KM_USER0);
- if (copied != bytes) {
- kaddr = kmap(page);
- copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
- base, bytes);
- if (bytes - copied)
- memset(kaddr + offset + copied, 0, bytes - copied);
- kunmap(page);
- }
+ return copied;
+}
+
+/*
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_nonatomic().
+ * The difference is that on a fault we need to memset the remainder of the
+ * page (out to offset+bytes), to emulate filemap_copy_from_user_nonatomic()'s
+ * single-segment behaviour.
+ */
+static inline size_t
+filemap_copy_from_user_iovec_nonatomic(struct page *page, unsigned long offset,
+ const struct iovec *iov, size_t base, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ kaddr = kmap(page);
+ copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
+ base, bytes);
+ kunmap(page);
return copied;
}

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
int simple_prepare_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- if (!PageUptodate(page)) {
- if (to - from != PAGE_CACHE_SIZE) {
- void *kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr, 0, from);
- memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
- }
+ if (PageUptodate(page))
+ return 0;
+
+ if (to - from != PAGE_CACHE_SIZE) {
+ clear_highpage(page);
+ flush_dcache_page(page);
SetPageUptodate(page);
}
+
return 0;
}

int simple_commit_write(struct file *file, struct page *page,
- unsigned offset, unsigned to)
+ unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
-
- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold the i_mutex.
- */
- if (pos > inode->i_size)
- i_size_write(inode, pos);
- set_page_dirty(page);
+ if (to > from) {
+ struct inode *inode = page->mapping->host;
+ loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+ if (to - from == PAGE_CACHE_SIZE)
+ SetPageUptodate(page);
+ /*
+ * No need to use i_size_read() here, the i_size
+ * cannot change under us because we hold the i_mutex.
+ */
+ if (pos > inode->i_size)
+ i_size_write(inode, pos);
+ set_page_dirty(page);
+ }
return 0;
}

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1856,6 +1856,9 @@ static int __block_commit_write(struct i
unsigned blocksize;
struct buffer_head *bh, *head;

+ if (from == to)
+ return 0;
+
blocksize = 1 << inode->i_blkbits;

for(bh = head = page_buffers(page), block_start = 0;
@@ -2317,17 +2320,6 @@ int nobh_prepare_write(struct page *page

if (is_mapped_to_disk)
SetPageMappedToDisk(page);
- SetPageUptodate(page);
-
- /*
- * Setting the page dirty here isn't necessary for the prepare_write
- * function - commit_write will do that. But if/when this function is
- * used within the pagefault handler to ensure that all mmapped pages
- * have backing space in the filesystem, we will need to dirty the page
- * if its contents were altered.
- */
- if (dirtied_it)
- set_page_dirty(page);

return 0;

@@ -2337,15 +2329,6 @@ failed:
free_buffer_head(read_bh[i]);
}

- /*
- * Error recovery is pretty slack. Clear the page and mark it dirty
- * so we'll later zero out any blocks which _were_ allocated.
- */
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr, 0, PAGE_CACHE_SIZE);
- kunmap_atomic(kaddr, KM_USER0);
- SetPageUptodate(page);
- set_page_dirty(page);
return ret;
}
EXPORT_SYMBOL(nobh_prepare_write);
@@ -2353,13 +2336,16 @@ EXPORT_SYMBOL(nobh_prepare_write);
int nobh_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
- struct inode *inode = page->mapping->host;
- loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ if (to > from) {
+ struct inode *inode = page->mapping->host;
+ loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;

- set_page_dirty(page);
- if (pos > inode->i_size) {
- i_size_write(inode, pos);
- mark_inode_dirty(inode);
+ SetPageUptodate(page);
+ set_page_dirty(page);
+ if (pos > inode->i_size) {
+ i_size_write(inode, pos);
+ mark_inode_dirty(inode);
+ }
}
return 0;
}
@@ -2450,6 +2436,7 @@ int nobh_truncate_page(struct address_sp
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);
+ SetPageUptodate(page);
set_page_dirty(page);
}
unlock_page(page);

2006-10-13 22:16:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 6/6] mm: fix pagecache write deadlocks

On Fri, 13 Oct 2006 18:44:52 +0200 (CEST)
Nick Piggin <[email protected]> wrote:

> From: Andrew Morton <[email protected]> and Nick Piggin <[email protected]>
>
> The idea is to modify the core write() code so that it won't take a pagefault
> while holding a lock on the pagecache page. There are a number of different
> deadlocks possible if we try to do such a thing:
>
> 1. generic_buffered_write
> 2. lock_page
> 3. prepare_write
> 4. unlock_page+vmtruncate
> 5. copy_from_user
> 6. mmap_sem(r)
> 7. handle_mm_fault
> 8. lock_page (filemap_nopage)
> 9. commit_write
> 1. unlock_page
>
> b. sys_munmap / sys_mlock / others
> c. mmap_sem(w)
> d. make_pages_present
> e. get_user_pages
> f. handle_mm_fault
> g. lock_page (filemap_nopage)
>
> 2,8 - recursive deadlock if page is same
> 2,8;2,7 - ABBA deadlock is page is different
> 2,6;c,g - ABBA deadlock if page is same
>
> - Instead of copy_from_user(), use inc_preempt_count() and
> copy_from_user_inatomic().
>
> - If the copy_from_user_inatomic() hits a pagefault, it'll return a short
> copy.
>
> - if the page was not uptodate, we cannot commit the write, because the
> uncopied bit could have uninitialised data. Commit zero length copy,
> which should do the right thing (ie. not set the page uptodate).
>
> - if the page was uptodate, commit the copied portion so we make some
> progress.
>
> - unlock_page()
>
> - go back and try to fault the page in again, redo the lock_page,
> prepare_write, copy_from_user_inatomic(), etc.
>
> - Now we have a non uptodate page, and we keep faulting on a 2nd or later
> iovec, this gives a deadlock, because fault_in_pages readable only faults
> in the first iovec. To fix this situation, if we fault on a !uptodate page,
> make the next iteration only attempt to copy a single iovec.
>
> - This also showed up a number of buggy prepare_write / commit_write
> implementations that were setting the page uptodate in the prepare_write
> side: bad! this allows uninitialised data to be read. Fix these.

Well. It's non-buggy under the current protocol because the page remains
locked throughout. This patch would make these ->prepare_write()
implementations buggy.

> +#ifdef CONFIG_DEBUG_VM
> + fault_in_pages_readable(buf, bytes);
> +#endif

I'll need to remember to take that out later on. Or reorder the patches, I
guess.

> int simple_commit_write(struct file *file, struct page *page,
> - unsigned offset, unsigned to)
> + unsigned from, unsigned to)
> {
> - struct inode *inode = page->mapping->host;
> - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> -
> - /*
> - * No need to use i_size_read() here, the i_size
> - * cannot change under us because we hold the i_mutex.
> - */
> - if (pos > inode->i_size)
> - i_size_write(inode, pos);
> - set_page_dirty(page);
> + if (to > from) {
> + struct inode *inode = page->mapping->host;
> + loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> +
> + if (to - from == PAGE_CACHE_SIZE)
> + SetPageUptodate(page);

This SetPageUptodate() can go away?

> @@ -2317,17 +2320,6 @@ int nobh_prepare_write(struct page *page
>
> if (is_mapped_to_disk)
> SetPageMappedToDisk(page);
> - SetPageUptodate(page);
> -
> - /*
> - * Setting the page dirty here isn't necessary for the prepare_write
> - * function - commit_write will do that. But if/when this function is
> - * used within the pagefault handler to ensure that all mmapped pages
> - * have backing space in the filesystem, we will need to dirty the page
> - * if its contents were altered.
> - */
> - if (dirtied_it)
> - set_page_dirty(page);
>
> return 0;

Local variable `dirtied_it' can go away now.

Or can it? We've modified the page's contents. If the copy_from_user gets
a fault and we do a zero-length ->commit_write(), nobody ends up dirtying
the page.

> @@ -2450,6 +2436,7 @@ int nobh_truncate_page(struct address_sp
> memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
> flush_dcache_page(page);
> kunmap_atomic(kaddr, KM_USER0);
> + SetPageUptodate(page);
> set_page_dirty(page);
> }
> unlock_page(page);

I've already forgotten why this was added. Comment, please ;)

2006-10-14 04:19:47

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 6/6] mm: fix pagecache write deadlocks

On Fri, Oct 13, 2006 at 03:14:57PM -0700, Andrew Morton wrote:
> On Fri, 13 Oct 2006 18:44:52 +0200 (CEST)
> Nick Piggin <[email protected]> wrote:
> >
> > - This also showed up a number of buggy prepare_write / commit_write
> > implementations that were setting the page uptodate in the prepare_write
> > side: bad! this allows uninitialised data to be read. Fix these.
>
> Well. It's non-buggy under the current protocol because the page remains
> locked throughout. This patch would make these ->prepare_write()
> implementations buggy.

But if it becomes uptodate, then do_generic_mapping_read can read it
without locking it (and so can filemap_nopage at present, although it
looks like that's going to take the page lock soon).

> > +#ifdef CONFIG_DEBUG_VM
> > + fault_in_pages_readable(buf, bytes);
> > +#endif
>
> I'll need to remember to take that out later on. Or reorder the patches, I
> guess.
>
> > int simple_commit_write(struct file *file, struct page *page,
> > - unsigned offset, unsigned to)
> > + unsigned from, unsigned to)
> > {
> > - struct inode *inode = page->mapping->host;
> > - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> > -
> > - /*
> > - * No need to use i_size_read() here, the i_size
> > - * cannot change under us because we hold the i_mutex.
> > - */
> > - if (pos > inode->i_size)
> > - i_size_write(inode, pos);
> > - set_page_dirty(page);
> > + if (to > from) {
> > + struct inode *inode = page->mapping->host;
> > + loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> > +
> > + if (to - from == PAGE_CACHE_SIZE)
> > + SetPageUptodate(page);
>
> This SetPageUptodate() can go away?

It is needed because the prepare_write does not try to bring the page
uptodate if it is a full page write. (I was considering another patch
to just remove that completely, as it might be overkill for something
like sysfs, but it demonstrates the principle quite nicely).

>
> > @@ -2317,17 +2320,6 @@ int nobh_prepare_write(struct page *page
> >
> > if (is_mapped_to_disk)
> > SetPageMappedToDisk(page);
> > - SetPageUptodate(page);
> > -
> > - /*
> > - * Setting the page dirty here isn't necessary for the prepare_write
> > - * function - commit_write will do that. But if/when this function is
> > - * used within the pagefault handler to ensure that all mmapped pages
> > - * have backing space in the filesystem, we will need to dirty the page
> > - * if its contents were altered.
> > - */
> > - if (dirtied_it)
> > - set_page_dirty(page);
> >
> > return 0;
>
> Local variable `dirtied_it' can go away now.
>
> Or can it? We've modified the page's contents. If the copy_from_user gets
> a fault and we do a zero-length ->commit_write(), nobody ends up dirtying
> the page.

But only if the page is not uptodate. Otherwise we won't modify its contents
(no need to).

> > @@ -2450,6 +2436,7 @@ int nobh_truncate_page(struct address_sp
> > memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
> > flush_dcache_page(page);
> > kunmap_atomic(kaddr, KM_USER0);
> > + SetPageUptodate(page);
> > set_page_dirty(page);
> > }
> > unlock_page(page);
>
> I've already forgotten why this was added. Comment, please ;)

Well, nobh_prepare_write no longer sets it uptodate, so we need to if
we're going to set_page_dirty. OTOH, why does truncate_page need to
zero the pagecache anyway? I wonder if we couldn't delete this whole
function? (not in this patchset!)

2006-10-14 04:30:44

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 6/6] mm: fix pagecache write deadlocks

On Sat, Oct 14, 2006 at 06:19:27AM +0200, Nick Piggin wrote:
> On Fri, Oct 13, 2006 at 03:14:57PM -0700, Andrew Morton wrote:
> > On Fri, 13 Oct 2006 18:44:52 +0200 (CEST)
> > Nick Piggin <[email protected]> wrote:
> > >
> > > - This also showed up a number of buggy prepare_write / commit_write
> > > implementations that were setting the page uptodate in the prepare_write
> > > side: bad! this allows uninitialised data to be read. Fix these.
> >
> > Well. It's non-buggy under the current protocol because the page remains
> > locked throughout. This patch would make these ->prepare_write()
> > implementations buggy.
>
> But if it becomes uptodate, then do_generic_mapping_read can read it
> without locking it (and so can filemap_nopage at present, although it
> looks like that's going to take the page lock soon).

So the simple_prepare_write bug is an uninitialised data leak. If
you read the part of the file which is about to be written to (and thus
does not get memset()ed), you can read junk.

I was able to trigger this with a simple test on ramfs.
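
For reference, this is roughly the shape of such a test (a hypothetical
sketch, not the actual one; mount point, offsets and pattern are made up).
The writer keeps recreating a sparse file and writing a pattern into the
middle of a hole page; because the buggy simple_prepare_write marks the page
uptodate before any data has been copied, a concurrent reader, which needs
no page lock, can see bytes that are neither zero nor the pattern:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE	4096
#define OFF	(PAGE + 100)	/* middle of the second page */
#define LEN	200

int main(void)
{
	char pat[LEN], got[LEN];
	ssize_t n, i;

	memset(pat, 0xaa, LEN);

	if (fork() == 0) {
		for (;;) {	/* writer: a fresh hole page each iteration */
			int fd = open("/mnt/ramfs/f",
					O_RDWR | O_CREAT | O_TRUNC, 0644);
			ftruncate(fd, 2 * PAGE);	/* extend i_size first */
			pwrite(fd, pat, LEN, OFF);
			close(fd);
		}
	}

	for (;;) {		/* reader: PageUptodate is all it checks */
		int fd = open("/mnt/ramfs/f", O_RDONLY);
		n = pread(fd, got, LEN, OFF);
		close(fd);
		for (i = 0; i < n; i++) {
			if (got[i] != 0 && got[i] != (char)0xaa) {
				fprintf(stderr, "uninitialised byte leaked\n");
				return 1;
			}
		}
	}
}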

2006-10-14 05:04:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 6/6] mm: fix pagecache write deadlocks

On Fri, Oct 13, 2006 at 06:44:52PM +0200, Nick Piggin wrote:
> From: Andrew Morton <[email protected]> and Nick Piggin <[email protected]>
>
> The idea is to modify the core write() code so that it won't take a pagefault
> while holding a lock on the pagecache page. There are a number of different
> deadlocks possible if we try to do such a thing:

Here is a patch to improve the comment a little. This is a pretty tricky
situation so we must be clear as to why it works.
--

Comment was not entirely clear about why we must eliminate all other
possibilities.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1946,12 +1946,19 @@ retry_noprogress:
if (!PageUptodate(page)) {
/*
* If the page is not uptodate, we cannot allow a
- * partial commit_write, because that might expose
- * uninitialised data.
+ * partial commit_write because when we unlock the
+ * page below, someone else might bring it uptodate
+ * and we lose our write. We cannot allow a full
+ * commit_write, because that exposes uninitialised
+ * data. We cannot zero the rest of the file and do
+ * a full commit_write because that exposes transient
+ * zeroes.
*
- * We will enter the single-segment path below, which
- * should get the filesystem to bring the page
- * uputodate for us next time.
+ * Abort the operation entirely with a zero length
+ * commit_write. Retry. We will enter the
+ * single-segment path below, which should get the
+ * filesystem to bring the page uputodate for us next
+ * time.
*/
if (unlikely(copied != bytes))
copied = 0;

2006-10-15 11:37:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 6/6] mm: fix pagecache write deadlocks

On Fri, 2006-10-13 at 18:44 +0200, Andrew Morton wrote:
> The idea is to modify the core write() code so that it won't take a pagefault
> while holding a lock on the pagecache page. There are a number of different
> deadlocks possible if we try to do such a thing:
>
> 1. generic_buffered_write
> 2. lock_page
> 3. prepare_write
> 4. unlock_page+vmtruncate
> 5. copy_from_user
> 6. mmap_sem(r)
> 7. handle_mm_fault
> 8. lock_page (filemap_nopage)
> 9. commit_write
> 1. unlock_page
>
> b. sys_munmap / sys_mlock / others
> c. mmap_sem(w)
> d. make_pages_present
> e. get_user_pages
> f. handle_mm_fault
> g. lock_page (filemap_nopage)
>
> 2,8 - recursive deadlock if page is same
> 2,8;2,7 - ABBA deadlock is page is different

2,8;2,8 I think you mean

> 2,6;c,g - ABBA deadlock if page is same

> +
> + /*
> + * Must not enter the pagefault handler here, because we hold
> + * the page lock, so we might recursively deadlock on the same
> + * lock, or get an ABBA deadlock against a different lock, or
> + * against the mmap_sem (which nests outside the page lock).
> + * So increment preempt count, and use _atomic usercopies.
> + */
> + inc_preempt_count();
> if (likely(nr_segs == 1))
> - copied = filemap_copy_from_user(page, offset,
> + copied = filemap_copy_from_user_atomic(page, offset,
> buf, bytes);
> else
> - copied = filemap_copy_from_user_iovec(page, offset,
> - cur_iov, iov_offset, bytes);
> + copied = filemap_copy_from_user_iovec_atomic(page,
> + offset, cur_iov, iov_offset,
> + bytes);
> + dec_preempt_count();
> +

Why use raw {inc,dec}_preempt_count() and not
preempt_{disable,enable}()? Is the compiler barrier not needed here? And
do we really want to avoid the preempt_check_resched()?

> Index: linux-2.6/mm/filemap.h
> ===================================================================
> --- linux-2.6.orig/mm/filemap.h
> +++ linux-2.6/mm/filemap.h
> @@ -22,19 +22,19 @@ __filemap_copy_from_user_iovec_inatomic(
>
> /*
> * Copy as much as we can into the page and return the number of bytes which
> - * were sucessfully copied. If a fault is encountered then clear the page
> - * out to (offset+bytes) and return the number of bytes which were copied.
> + * were sucessfully copied. If a fault is encountered then return the number of
> + * bytes which were copied.
> *
> - * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
> - * to *NOT* zero any tail of the buffer that it failed to copy. If it does,
> - * and if the following non-atomic copy succeeds, then there is a small window
> - * where the target page contains neither the data before the write, nor the
> - * data after the write (it contains zero). A read at this time will see
> - * data that is inconsistent with any ordering of the read and the write.
> - * (This has been detected in practice).
> + * NOTE: For this to work reliably we really want
> + * copy_from_user_inatomic_nocache to *NOT* zero any tail of the buffer that it
> + * failed to copy. If it does, and if the following non-atomic copy succeeds,
> + * then there is a small window where the target page contains neither the data
> + * before the write, nor the data after the write (it contains zero). A read at
> + * this time will see data that is inconsistent with any ordering of the read
> + * and the write. (This has been detected in practice).
> */
> static inline size_t
> -filemap_copy_from_user(struct page *page, unsigned long offset,
> +filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
> const char __user *buf, unsigned bytes)
> {
> char *kaddr;
> @@ -44,23 +44,32 @@ filemap_copy_from_user(struct page *page
> left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> kunmap_atomic(kaddr, KM_USER0);
>
> - if (left != 0) {
> - /* Do it the slow way */
> - kaddr = kmap(page);
> - left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
> - kunmap(page);
> - }
> + return bytes - left;
> +}
> +
> +static inline size_t
> +filemap_copy_from_user_nonatomic(struct page *page, unsigned long offset,
> + const char __user *buf, unsigned bytes)
> +{
> + char *kaddr;
> + int left;
> +
> + kaddr = kmap(page);
> + left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
> + kunmap(page);
> +
> return bytes - left;
> }
>
> /*
> - * This has the same sideeffects and return value as filemap_copy_from_user().
> + * This has the same sideeffects and return value as
> + * filemap_copy_from_user_atomic().
> * The difference is that on a fault we need to memset the remainder of the
> * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
> * single-segment behaviour.
> */
> static inline size_t
> -filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
> +filemap_copy_from_user_iovec_atomic(struct page *page, unsigned long offset,
> const struct iovec *iov, size_t base, size_t bytes)
> {
> char *kaddr;
> @@ -70,14 +79,27 @@ filemap_copy_from_user_iovec(struct page
> copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
> base, bytes);
> kunmap_atomic(kaddr, KM_USER0);
> - if (copied != bytes) {
> - kaddr = kmap(page);
> - copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
> - base, bytes);
> - if (bytes - copied)
> - memset(kaddr + offset + copied, 0, bytes - copied);
> - kunmap(page);
> - }
> + return copied;
> +}
> +
> +/*
> + * This has the same sideeffects and return value as
> + * filemap_copy_from_user_nonatomic().
> + * The difference is that on a fault we need to memset the remainder of the
> + * page (out to offset+bytes), to emulate filemap_copy_from_user_nonatomic()'s
> + * single-segment behaviour.
> + */
> +static inline size_t
> +filemap_copy_from_user_iovec_nonatomic(struct page *page, unsigned long offset,
> + const struct iovec *iov, size_t base, size_t bytes)
> +{
> + char *kaddr;
> + size_t copied;
> +
> + kaddr = kmap(page);
> + copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
> + base, bytes);
> + kunmap(page);
> return copied;
> }
>

Why create the _nonatomic versions? There are no users.


2006-10-15 11:35:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 6/6] mm: fix pagecache write deadlocks

On Sat, 2006-10-14 at 06:19 +0200, Nick Piggin wrote:
> On Fri, Oct 13, 2006 at 03:14:57PM -0700, Andrew Morton wrote:
> > On Fri, 13 Oct 2006 18:44:52 +0200 (CEST)
> > Nick Piggin <[email protected]> wrote:

> > > @@ -2450,6 +2436,7 @@ int nobh_truncate_page(struct address_sp
> > > memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
> > > flush_dcache_page(page);
> > > kunmap_atomic(kaddr, KM_USER0);
> > > + SetPageUptodate(page);
> > > set_page_dirty(page);
> > > }
> > > unlock_page(page);
> >
> > I've already forgotten why this was added. Comment, please ;)
>
> Well, nobh_prepare_write no longer sets it uptodate, so we need to if
> we're going to set_page_dirty. OTOH, why does truncate_page need to
> zero the pagecache anyway? I wonder if we couldn't delete this whole
> function? (not in this patchset!)

It zeros the tail end of the page so we don't leak old data?

2006-10-15 11:56:59

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 6/6] mm: fix pagecache write deadlocks

On Sun, Oct 15, 2006 at 01:37:10PM +0200, Peter Zijlstra wrote:
> On Fri, 2006-10-13 at 18:44 +0200, Andrew Morton wrote:
> > The idea is to modify the core write() code so that it won't take a pagefault
> > while holding a lock on the pagecache page. There are a number of different
> > deadlocks possible if we try to do such a thing:
> >
> > 1. generic_buffered_write
> > 2. lock_page
> > 3. prepare_write
> > 4. unlock_page+vmtruncate
> > 5. copy_from_user
> > 6. mmap_sem(r)
> > 7. handle_mm_fault
> > 8. lock_page (filemap_nopage)
> > 9. commit_write
> > 1. unlock_page
> >
> > b. sys_munmap / sys_mlock / others
> > c. mmap_sem(w)
> > d. make_pages_present
> > e. get_user_pages
> > f. handle_mm_fault
> > g. lock_page (filemap_nopage)
> >
> > 2,8 - recursive deadlock if page is same
> > 2,8;2,7 - ABBA deadlock is page is different
>
> 2,8;2,8 I think you mean

Right. I've asked akpm to make a note of it (I don't think I can send a
meta-patch ;))

> > + /*
> > + * Must not enter the pagefault handler here, because we hold
> > + * the page lock, so we might recursively deadlock on the same
> > + * lock, or get an ABBA deadlock against a different lock, or
> > + * against the mmap_sem (which nests outside the page lock).
> > + * So increment preempt count, and use _atomic usercopies.
> > + */
> > + inc_preempt_count();
> > if (likely(nr_segs == 1))
> > - copied = filemap_copy_from_user(page, offset,
> > + copied = filemap_copy_from_user_atomic(page, offset,
> > buf, bytes);
> > else
> > - copied = filemap_copy_from_user_iovec(page, offset,
> > - cur_iov, iov_offset, bytes);
> > + copied = filemap_copy_from_user_iovec_atomic(page,
> > + offset, cur_iov, iov_offset,
> > + bytes);
> > + dec_preempt_count();
> > +
>
> Why use raw {inc,dec}_preempt_count() and not
> preempt_{disable,enable}()? Is the compiler barrier not needed here? And
> do we really want to avoid the preempt_check_resched()?

Counter to intuition, we actually don't mind being preempted here,
but we do mind entering the (core) pagefault handler. Incrementing
the preempt count causes the arch specific handler to bail out early
before it takes any locks.

Clear as mud? Wrapping it in a better name might be an improvement?
Or wrapping it into the copy*user_atomic functions themselves (which
is AFAIK the only place we use it).
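
For reference, the check involved looks roughly like this (modelled on the
i386 do_page_fault of this era; a sketch, not an exact quote). With the
preempt count raised, in_atomic() is true, so the handler bails out before
taking mmap_sem or any page lock, and the exception fixup makes the atomic
usercopy return a short count instead:

	/* early in the arch pagefault handler */
	if (in_atomic() || !mm)
		goto bad_area_nosemaphore;	/* -> exception fixup, short copy */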

> > Index: linux-2.6/mm/filemap.h
> > ===================================================================
> > --- linux-2.6.orig/mm/filemap.h
> > +++ linux-2.6/mm/filemap.h
> > @@ -22,19 +22,19 @@ __filemap_copy_from_user_iovec_inatomic(

> > +/*
> > + * This has the same sideeffects and return value as
> > + * filemap_copy_from_user_nonatomic().
> > + * The difference is that on a fault we need to memset the remainder of the
> > + * page (out to offset+bytes), to emulate filemap_copy_from_user_nonatomic()'s
> > + * single-segment behaviour.
> > + */
> > +static inline size_t
> > +filemap_copy_from_user_iovec_nonatomic(struct page *page, unsigned long offset,
> > + const struct iovec *iov, size_t base, size_t bytes)
> > +{
> > + char *kaddr;
> > + size_t copied;
> > +
> > + kaddr = kmap(page);
> > + copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
> > + base, bytes);
> > + kunmap(page);
> > return copied;
> > }
> >
>
> Why create the _nonatomic versions? There are no users.

This was leftover from Andrew's patch... maybe filemap_xip wants it and
I've forgotten about it?

2006-10-15 13:51:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 6/6] mm: fix pagecache write deadlocks


> > > + /*
> > > + * Must not enter the pagefault handler here, because we hold
> > > + * the page lock, so we might recursively deadlock on the same
> > > + * lock, or get an ABBA deadlock against a different lock, or
> > > + * against the mmap_sem (which nests outside the page lock).
> > > + * So increment preempt count, and use _atomic usercopies.
> > > + */
> > > + inc_preempt_count();
> > > if (likely(nr_segs == 1))
> > > - copied = filemap_copy_from_user(page, offset,
> > > + copied = filemap_copy_from_user_atomic(page, offset,
> > > buf, bytes);
> > > else
> > > - copied = filemap_copy_from_user_iovec(page, offset,
> > > - cur_iov, iov_offset, bytes);
> > > + copied = filemap_copy_from_user_iovec_atomic(page,
> > > + offset, cur_iov, iov_offset,
> > > + bytes);
> > > + dec_preempt_count();
> > > +
> >
> > Why use raw {inc,dec}_preempt_count() and not
> > preempt_{disable,enable}()? Is the compiler barrier not needed here? And
> > do we really want to avoid the preempt_check_resched()?
>
> Counter to intuition, we actually don't mind being preempted here,
> but we do mind entering the (core) pagefault handler. Incrementing
> the preempt count causes the arch specific handler to bail out early
> before it takes any locks.
>
> Clear as mud? Wrapping it in a better name might be an improvement?
> Or wrapping it into the copy*user_atomic functions themselves (which
> is AFAIK the only place we use it).

Right, but since you do inc the preempt_count you do disable preemption,
so you might as well check TIF_NEED_RESCHED when enabling preemption again.

Sticking it in the atomic copy functions does make sense to me.

2006-10-15 14:19:55

by Nick Piggin

[permalink] [raw]
Subject: Re: SPAM: Re: [patch 6/6] mm: fix pagecache write deadlocks

On Sun, Oct 15, 2006 at 03:51:09PM +0200, Peter Zijlstra wrote:
>
> > >
> > > Why use raw {inc,dec}_preempt_count() and not
> > > preempt_{disable,enable}()? Is the compiler barrier not needed here? And
> > > do we really want to avoid the preempt_check_resched()?
> >
> > Counter to intuition, we actually don't mind being preempted here,
> > but we do mind entering the (core) pagefault handler. Incrementing
> > the preempt count causes the arch specific handler to bail out early
> > before it takes any locks.
> >
> > Clear as mud? Wrapping it in a better name might be an improvement?
> > Or wrapping it into the copy*user_atomic functions themselves (which
> > is AFAIK the only place we use it).
>
> Right, but since you do inc the preempt_count you do disable preemption,
> might as well check TIF_NEED_RESCHED when enabling preemption again.

Yeah, you are right about that. Unfortunately there isn't a good
way to do this at the moment... well we could disable preempt
around the section, but that would be silly for a PREEMPT kernel.

And we should really decouple it from preempt entirely, in case we
ever want to check for it some other way in the pagefault handler.

2006-10-15 15:47:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: SPAM: Re: [patch 6/6] mm: fix pagecache write deadlocks

On Sun, 2006-10-15 at 16:19 +0200, Nick Piggin wrote:
> On Sun, Oct 15, 2006 at 03:51:09PM +0200, Peter Zijlstra wrote:
> >
> > > >
> > > > Why use raw {inc,dec}_preempt_count() and not
> > > > preempt_{disable,enable}()? Is the compiler barrier not needed here? And
> > > > do we really want to avoid the preempt_check_resched()?
> > >
> > > Counter to intuition, we actually don't mind being preempted here,
> > > but we do mind entering the (core) pagefault handler. Incrementing
> > > the preempt count causes the arch specific handler to bail out early
> > > before it takes any locks.
> > >
> > > Clear as mud? Wrapping it in a better name might be an improvement?
> > > Or wrapping it into the copy*user_atomic functions themselves (which
> > > is AFAIK the only place we use it).
> >
> > Right, but since you do inc the preempt_count you do disable preemption,
> > might as well check TIF_NEED_RESCHED when enabling preemption again.
>
> Yeah, you are right about that. Unfortunately there isn't a good
> way to do this at the moment... well we could disable preempt
> around the section, but that would be silly for a PREEMPT kernel.
>
> And we should really decouple it from preempt entirely, in case we
> ever want to check for it some other way in the pagefault handler.

How about we make sure all kmap_atomic implementations behave properly
and make in_atomic() true?

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/asm-frv/highmem.h | 5 +++--
include/asm-mips/highmem.h | 10 ++++++++--
include/linux/highmem.h | 9 ++++++---
mm/filemap.c | 4 +---
4 files changed, 18 insertions(+), 10 deletions(-)

Index: linux-2.6/include/asm-frv/highmem.h
===================================================================
--- linux-2.6.orig/include/asm-frv/highmem.h 2006-07-17 22:30:51.000000000 +0200
+++ linux-2.6/include/asm-frv/highmem.h 2006-10-15 17:32:02.000000000 +0200
@@ -115,7 +115,7 @@ static inline void *kmap_atomic(struct p
{
unsigned long paddr;

- preempt_disable();
+ inc_preempt_count();
paddr = page_to_phys(page);

switch (type) {
@@ -170,7 +170,8 @@ static inline void kunmap_atomic(void *k
default:
BUG();
}
- preempt_enable();
+ dec_preempt_count();
+ preempt_check_resched();
}

#endif /* !__ASSEMBLY__ */
Index: linux-2.6/include/asm-mips/highmem.h
===================================================================
--- linux-2.6.orig/include/asm-mips/highmem.h 2006-07-17 22:30:56.000000000 +0200
+++ linux-2.6/include/asm-mips/highmem.h 2006-10-15 17:36:49.000000000 +0200
@@ -70,11 +70,17 @@ static inline void *kmap(struct page *pa

static inline void *kmap_atomic(struct page *page, enum km_type type)
{
+ inc_preempt_count();
return page_address(page);
}

-static inline void kunmap_atomic(void *kvaddr, enum km_type type) { }
-#define kmap_atomic_pfn(pfn, idx) page_address(pfn_to_page(pfn))
+static inline void kunmap_atomic(void *kvaddr, enum km_type type)
+{
+ dec_preempt_count();
+ preempt_check_resched();
+}
+
+#define kmap_atomic_pfn(pfn, idx) kmap_atomic(pfn_to_page(pfn), (idx))

#define kmap_atomic_to_page(ptr) virt_to_page(ptr)

Index: linux-2.6/include/linux/highmem.h
===================================================================
--- linux-2.6.orig/include/linux/highmem.h 2006-10-07 18:47:32.000000000 +0200
+++ linux-2.6/include/linux/highmem.h 2006-10-15 17:32:44.000000000 +0200
@@ -3,6 +3,7 @@

#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/preempt.h>

#include <asm/cacheflush.h>

@@ -41,9 +42,11 @@ static inline void *kmap(struct page *pa

#define kunmap(page) do { (void) (page); } while (0)

-#define kmap_atomic(page, idx) page_address(page)
-#define kunmap_atomic(addr, idx) do { } while (0)
-#define kmap_atomic_pfn(pfn, idx) page_address(pfn_to_page(pfn))
+#define kmap_atomic(page, idx) \
+ ({ inc_preempt_count(); page_address(page); })
+#define kunmap_atomic(addr, idx) \
+ do { dec_preempt_count(); preempt_check_resched(); } while (0)
+#define kmap_atomic_pfn(pfn, idx) kmap_atomic(pfn_to_page(pfn), (idx))
#define kmap_atomic_to_page(ptr) virt_to_page(ptr)
#endif

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c 2006-10-15 17:24:41.000000000 +0200
+++ linux-2.6/mm/filemap.c 2006-10-15 17:40:19.000000000 +0200
@@ -2140,9 +2140,8 @@ retry_noprogress:
* the page lock, so we might recursively deadlock on the same
* lock, or get an ABBA deadlock against a different lock, or
* against the mmap_sem (which nests outside the page lock).
- * So increment preempt count, and use _atomic usercopies.
+ * So use _atomic usercopies.
*/
- inc_preempt_count();
if (likely(nr_segs == 1))
copied = filemap_copy_from_user_atomic(page, offset,
buf, bytes);
@@ -2150,7 +2149,6 @@ retry_noprogress:
copied = filemap_copy_from_user_iovec_atomic(page,
offset, cur_iov, iov_offset,
bytes);
- dec_preempt_count();

if (!PageUptodate(page)) {
/*


2006-10-15 15:57:51

by Nick Piggin

[permalink] [raw]
Subject: RRe: [patch 6/6] mm: fix pagecache write deadlocks

On Sun, Oct 15, 2006 at 05:47:03PM +0200, Peter Zijlstra wrote:
> >
> > And we should really decouple it from preempt entirely, in case we
> > ever want to check for it some other way in the pagefault handler.
>
> How about we make sure all kmap_atomic implementations behave properly
> and make in_atomic true.

Hmm, but you may not be doing a copy*user within the kmap. And you may
want an atomic copy*user not within a kmap (maybe).

I think it really would be more logical to do it in wrapper functions,
pagefault_disable()/pagefault_enable()? ;)
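
A minimal sketch of what such wrappers might look like (assumed
implementation, just mirroring the inc/dec_preempt_count() usage from the
patch; not code from this series):

static inline void pagefault_disable(void)
{
	inc_preempt_count();
	barrier();	/* keep the usercopy inside the raised count */
}

static inline void pagefault_enable(void)
{
	barrier();
	dec_preempt_count();
	preempt_check_resched();	/* we did disable preemption after all */
}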

2006-10-15 16:14:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RRe: [patch 6/6] mm: fix pagecache write deadlocks

On Sun, 2006-10-15 at 17:57 +0200, Nick Piggin wrote:
> On Sun, Oct 15, 2006 at 05:47:03PM +0200, Peter Zijlstra wrote:
> > >
> > > And we should really decouple it from preempt entirely, in case we
> > > ever want to check for it some other way in the pagefault handler.
> >
> > How about we make sure all kmap_atomic implementations behave properly
> > and make in_atomic true.
>
> Hmm, but you may not be doing a copy*user within the kmap. And you may
> want an atomic copy*user not within a kmap (maybe).
>
> I think it really would be more logical to do it in a wrapper function
> pagefault_disable() pagefault_enable()? ;)

I did that one first, but then noticed that most non trivial kmap_atomic
implementations already did the inc_preempt_count()/dec_preempt_count()
thing (except frv which did preempt_disable()/preempt_enable() ?)

Anyway, here goes:

Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/filemap.c | 4 +---
mm/filemap.h | 20 ++++++++++++++++++++
2 files changed, 21 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c 2006-10-14 20:20:20.000000000 +0200
+++ linux-2.6/mm/filemap.c 2006-10-15 17:16:59.000000000 +0200
@@ -2140,9 +2140,8 @@ retry_noprogress:
* the page lock, so we might recursively deadlock on the same
* lock, or get an ABBA deadlock against a different lock, or
* against the mmap_sem (which nests outside the page lock).
- * So increment preempt count, and use _atomic usercopies.
+ * So use _atomic usercopies.
*/
- inc_preempt_count();
if (likely(nr_segs == 1))
copied = filemap_copy_from_user_atomic(page, offset,
buf, bytes);
@@ -2150,7 +2149,6 @@ retry_noprogress:
copied = filemap_copy_from_user_iovec_atomic(page,
offset, cur_iov, iov_offset,
bytes);
- dec_preempt_count();

if (!PageUptodate(page)) {
/*
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h 2006-10-14 20:20:20.000000000 +0200
+++ linux-2.6/mm/filemap.h 2006-10-15 17:17:45.000000000 +0200
@@ -21,6 +21,22 @@ __filemap_copy_from_user_iovec_inatomic(
size_t bytes);

/*
+ * By increasing the preempt_count we make sure the arch preempt
+ * handler bails out early, before taking any locks, so that the copy
+ * operation gets terminated early.
+ */
+pagefault_static inline void disable(void)
+{
+ inc_preempt_count();
+}
+
+pagefault_static inline void enable(void)
+{
+ dec_preempt_count();
+ preempt_check_resched();
+}
+
+/*
* Copy as much as we can into the page and return the number of bytes which
* were sucessfully copied. If a fault is encountered then return the number of
* bytes which were copied.
@@ -40,9 +56,11 @@ filemap_copy_from_user_atomic(struct pag
char *kaddr;
int left;

+ pagefault_disable();
kaddr = kmap_atomic(page, KM_USER0);
left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
kunmap_atomic(kaddr, KM_USER0);
+ pagefault_enable();

return bytes - left;
}
@@ -75,10 +93,12 @@ filemap_copy_from_user_iovec_atomic(stru
char *kaddr;
size_t copied;

+ pagefault_disable();
kaddr = kmap_atomic(page, KM_USER0);
copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
base, bytes);
kunmap_atomic(kaddr, KM_USER0);
+ pagefault_enable();
return copied;
}



2006-10-16 15:24:10

by Nick Piggin

[permalink] [raw]
Subject: pagefault_disable (was Re: [patch 6/6] mm: fix pagecache write deadlocks)

(trimming cc list)

Peter Zijlstra wrote:
> On Sun, 2006-10-15 at 17:57 +0200, Nick Piggin wrote:

>>Hmm, but you may not be doing a copy*user within the kmap. And you may
>>want an atomic copy*user not within a kmap (maybe).
>>
>>I think it really would be more logical to do it in a wrapper function
>>pagefault_disable() pagefault_enable()? ;)
>
>
> I did that one first, but then noticed that most non trivial kmap_atomic
> implementations already did the inc_preempt_count()/dec_preempt_count()
> thing (except frv which did preempt_disable()/preempt_enable() ?)
>
> Anyway, here goes:
>
> Signed-off-by: Peter Zijlstra <[email protected]>

I think this is a good approach. The missed preempt checks could easily
have been causing scheduling delays because the usercopy can take up a
lot of kernel time.

I don't know that the function should go in filemap.h... uaccess.h seems
more appropriate, and I had thought pagefault_disable() would be called
directly from within the copy_*_user_inatomic functions themselves, not
the filemap helper.
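
For illustration, a minimal sketch of what that placement could look like,
assuming an arch-independent wrapper in uaccess.h (the wrapper name
copy_from_user_inatomic and its exact home are illustrative only, not
something posted in this thread):

	/*
	 * Hypothetical wrapper: disable pagefaults around the raw
	 * inatomic copy, so callers such as the filemap helpers no
	 * longer open-code inc_preempt_count()/dec_preempt_count().
	 */
	static inline unsigned long
	copy_from_user_inatomic(void *dst, const void __user *src,
				unsigned long n)
	{
		unsigned long left;

		pagefault_disable();
		left = __copy_from_user_inatomic(dst, src, n);
		pagefault_enable();

		return left;	/* bytes not copied, as with __copy_from_user() */
	}

The filemap helpers would then only keep the kmap_atomic()/kunmap_atomic()
pair around the copy.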

Also, the rest of the kernel tree (mainly uaccess and futexes) should be
converted ;)

> Index: linux-2.6/mm/filemap.h
> ===================================================================
> --- linux-2.6.orig/mm/filemap.h 2006-10-14 20:20:20.000000000 +0200
> +++ linux-2.6/mm/filemap.h 2006-10-15 17:17:45.000000000 +0200
> @@ -21,6 +21,22 @@ __filemap_copy_from_user_iovec_inatomic(
> size_t bytes);
>
> /*
> + * By increasing the preempt_count we make sure the arch preempt
> + * handler bails out early, before taking any locks, so that the copy
> + * operation gets terminated early.
> + */
> +pagefault_static inline void disable(void)
> +{
> + inc_preempt_count();
> +}
> +
> +pagefault_static inline void enable(void)
> +{
> + dec_preempt_count();
> + preempt_check_resched();
> +}

Interesting prototype ;)

--
SUSE Labs, Novell Inc.

2006-10-16 16:05:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: pagefault_disable (was Re: [patch 6/6] mm: fix pagecache write deadlocks)

On Tue, 2006-10-17 at 01:24 +1000, Nick Piggin wrote:
> (trimming cc list)
>
> Peter Zijlstra wrote:
> > On Sun, 2006-10-15 at 17:57 +0200, Nick Piggin wrote:
>
> >>Hmm, but you may not be doing a copy*user within the kmap. And you may
> >>want an atomic copy*user not within a kmap (maybe).
> >>
> >>I think it really would be more logical to do it in a wrapper function
> >>pagefault_disable() pagefault_enable()? ;)
> >
> >
> > I did that one first, but then noticed that most non trivial kmap_atomic
> > implementations already did the inc_preempt_count()/dec_preempt_count()
> > thing (except frv which did preempt_disable()/preempt_enable() ?)
> >
> > Anyway, here goes:
> >
> > Signed-off-by: Peter Zijlstra <[email protected]>
>
> I think this is a good approach. The missed preempt checks could easily
> have been causing scheduling delays because the usercopy can take up a
> lot of kernel time.
>
> I don't know that the function should go in filemap.h... uaccess.h seems
> more appropriate, and I had thought pagefault_disable() would be called
> directly from within the copy_*_user_inatomic functions themselves, not
> the filemap helper.
>
> Also, the rest of the kernel tree (mainly uaccess and futexes) should be
> converted ;)

Yeah, lotsa places to touch.

> > Index: linux-2.6/mm/filemap.h
> > ===================================================================
> > --- linux-2.6.orig/mm/filemap.h 2006-10-14 20:20:20.000000000 +0200
> > +++ linux-2.6/mm/filemap.h 2006-10-15 17:17:45.000000000 +0200
> > @@ -21,6 +21,22 @@ __filemap_copy_from_user_iovec_inatomic(
> > size_t bytes);
> >
> > /*
> > + * By increasing the preempt_count we make sure the arch preempt
> > + * handler bails out early, before taking any locks, so that the copy
> > + * operation gets terminated early.
> > + */
> > +pagefault_static inline void disable(void)
> > +{
> > + inc_preempt_count();

I think we also need a barrier(); here. We need to make sure the preempt
count is written to memory before we hit the fault handler.

> > +}
> > +
> > +pagefault_static inline void enable(void)
> > +{
> > + dec_preempt_count();
> > + preempt_check_resched();
> > +}
>
> Interesting prototype ;)

Bah, sed magic gone awry ;-(

2006-10-16 16:12:36

by Nick Piggin

[permalink] [raw]
Subject: Re: pagefault_disable (was Re: [patch 6/6] mm: fix pagecache write deadlocks)

Peter Zijlstra wrote:
> On Tue, 2006-10-17 at 01:24 +1000, Nick Piggin wrote:

>>Also, the rest of the kernel tree (mainly uaccess and futexes) should be
>>converted ;)
>
>
> Yeah, lotsa places to touch.
>
>
>>>Index: linux-2.6/mm/filemap.h
>>>===================================================================
>>>--- linux-2.6.orig/mm/filemap.h 2006-10-14 20:20:20.000000000 +0200
>>>+++ linux-2.6/mm/filemap.h 2006-10-15 17:17:45.000000000 +0200
>>>@@ -21,6 +21,22 @@ __filemap_copy_from_user_iovec_inatomic(
>>> size_t bytes);
>>>
>>> /*
>>>+ * By increasing the preempt_count we make sure the arch preempt
>>>+ * handler bails out early, before taking any locks, so that the copy
>>>+ * operation gets terminated early.
>>>+ */
>>>+pagefault_static inline void disable(void)
>>>+{
>>>+ inc_preempt_count();
>
>
> I think we also need a barrier(); here. We need to make sure the preempt
> count is written to memory before we hit the fault handler.

The fault will come from this thread, but I guess the fault is not an event
the compiler can foresee, so indeed it might optimise this into the wrong
place. Perhaps not with any copy*user implementation we have, but at least
in theory...

>>>+pagefault_static inline void enable(void)
>>>+{
>>>+ dec_preempt_count();
>>>+ preempt_check_resched();
>>>+}

You'll want barriers before and after the dec_preempt_count, for similar
reasons.
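
Putting those two points together, the helpers would presumably end up
looking roughly like this (a sketch only, with the sed-mangled prototypes
from the earlier patch restored to the intended static inlines):

	static inline void pagefault_disable(void)
	{
		inc_preempt_count();
		/* keep the compiler from moving the store below a faulting access */
		barrier();
	}

	static inline void pagefault_enable(void)
	{
		/* keep faulting accesses from being moved below the decrement */
		barrier();
		dec_preempt_count();
		barrier();
		preempt_check_resched();
	}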

--
SUSE Labs, Novell Inc.

2006-10-18 14:25:39

by Chris Mason

[permalink] [raw]
Subject: Re: [patch 6/6] mm: fix pagecache write deadlocks

> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -1856,6 +1856,9 @@ static int __block_commit_write(struct i
> unsigned blocksize;
> struct buffer_head *bh, *head;
>
> + if (from == to)
> + return 0;
> +
> blocksize = 1 << inode->i_blkbits;

reiserfs v3 copied the __block_commit_write logic for checking for a
partially updated page, so reiserfs_commit_page will have to be updated
to handle from==to. Right now it will set the page up to date.

I also used a prepare/commit pair where from==to as a way to trigger
tail conversions in the lilo ioctl. I'll fix both for you and make a
patch.
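
For illustration, the analogous guard would presumably be no more than
something like this at the top of reiserfs_commit_page() (a sketch only;
whether reiserfs needs anything beyond the early return is for that patch
to decide):

	/* mirror __block_commit_write(): a zero-length commit copied nothing,
	 * so don't mark buffers or the page up to date */
	if (from == to)
		return 0;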

-chris