2006-10-22 08:40:29

by Nick Piggin

Subject: [RFC] commit_write less than prepared by prepare_write

This is a request for comments and review by filesystem maintainers.
Especially if you own GFS2, OCFS2, Reiserfs, JFFS! Those using fs/buffer.c
routines directly should be reasonably safe.

We have several long standing deadlocks in the core MM when doing
buffered writes. The full explanation is in 2.6.19-rc2-mm2 in the
patch mm-fix-pagecache-write-deadlocks.patch and there are some followup
fixes.

Basically: when copying data from the user-supplied address into the
pagecache, between prepare_write and commit_write, we can take a fault
on the copy_from_user and enter the pagefault handler while holding
the page lock.

There is no way that this can work safely, so we must do atomic copies
here, which will just return a short write in the case of a fault.
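
To make that copy step concrete, here is a minimal sketch (the helper
name is mine, not from the -mm patch): kmap_atomic() disables
preemption, so a fault on the user address cannot be handled
(in_atomic() is true) and __copy_from_user_inatomic() returns the
number of uncopied bytes instead of entering the pagefault handler.

/*
 * Sketch of the atomic-copy step, done under the page lock.  A short
 * return here is the "short write" the protocol below deals with.
 */
static size_t example_copy_from_user_atomic(struct page *page,
		unsigned offset, const char __user *buf, unsigned bytes)
{
	char *kaddr;
	size_t left;

	kaddr = kmap_atomic(page, KM_USER0);
	left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
	kunmap_atomic(kaddr, KM_USER0);

	return bytes - left;		/* bytes actually copied */
}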

The protocol for handling short writes is as follows:

- If the page is uptodate, commit_write with the shorter length
(potentially 0), move the counters forward, and retry.

- If the page is not uptodate, do a zero-length commit_write at the
start offset of the prepare_write being closed.

* Note that if we hit some other error condition, we may never run another
prepare_write or commit_write, so we can't rely on that.
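
In code, the caller side of that protocol looks roughly like this
(a sketch only, using the hypothetical helper above; the real loop
lives in mm-fix-pagecache-write-deadlocks.patch):

static int example_commit_short_write(struct file *file,
		struct page *page, unsigned offset, unsigned bytes,
		unsigned copied)
{
	const struct address_space_operations *a_ops =
					file->f_mapping->a_ops;

	if (likely(copied == bytes))
		return a_ops->commit_write(file, page, offset,
						offset + bytes);

	if (PageUptodate(page))
		/* Uptodate page: commit the shorter length (maybe 0). */
		return a_ops->commit_write(file, page, offset,
						offset + copied);

	/*
	 * Not uptodate: a partial commit could expose uninitialised
	 * data once the page is unlocked, so close the prepare_write
	 * with a zero-length commit; the caller treats copied as 0
	 * and retries the whole segment.
	 */
	return a_ops->commit_write(file, page, offset, offset);
}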

For the case of an uptodate page, it may be somewhat safer to commit the
full length that was prepared (we can always pretend that the uncopied
part got dirtied; the retry can happily overwrite it). Comments? I figure
that if we have to audit everything anyway, then keeping this wart would
be silly.

For the non-uptodate page:
We can't run a full length commit_write, because the data in the page is
uninitialised. Can't zero out the uninitialised data, because that would
lead to zeros temporarily appearing in the pagecache. Any other ways to
fix it?

The problem with doing a short commit, as noted by Mark Fasheh, is that
prepare_write might have allocated a disk mapping (eg. if we write over a
hole), and if the commit fails and we unlock the page, then a read from
the filesystem might think that is valid data.

Mark's idea is to loop through the page buffers and zero out the
buffer_new ones when the caller detects a fault here. This should work
because a hole is zeroes anyway, so we could mark the page uptodate and it
would eventually get written back to initialise those blocks. But, is
there any other filesystem specific state that would need to be fixed
up? Do we need another filesystem call?
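
For reference, buffer_new comes from the filesystem's get_block
callback; schematically (illustrative code, with hypothetical
allocator helpers standing in for the fs-specific parts):

static int example_get_block(struct inode *inode, sector_t iblock,
			     struct buffer_head *bh, int create)
{
	/* 0 means hole in this toy example */
	sector_t phys = example_lookup_block(inode, iblock);

	if (!phys && create) {
		phys = example_alloc_block(inode, iblock);
		if (!phys)
			return -ENOSPC;
		/* Newly allocated over a hole: contents undefined */
		set_buffer_new(bh);
	}
	if (phys)
		map_bh(bh, inode->i_sb, phys);
	return 0;
}

It is exactly those set_buffer_new() blocks that the zeroing loop
would have to cover.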

I would prefer that disk block allocation be performed at commit_write
time. I realise that is probably a lot more rework, and there may be
reasons why it can't be done?

Another alternative is to pre-read the page in prepare_write, which
should always get us out of trouble. Could be a good option for
simple filesystems. People who care about performance probably want to
avoid this.
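
A sketch of that option, assuming the filesystem can read the locked
page synchronously (example_read_page_sync() is hypothetical, and the
get_block is the illustrative one above):

static int example_preread_prepare_write(struct file *file,
		struct page *page, unsigned from, unsigned to)
{
	if (!PageUptodate(page)) {
		int err = example_read_page_sync(page);
		if (err)
			return err;
	}
	/* Page is now uptodate: a short commit is always safe. */
	return block_prepare_write(page, from, to, example_get_block);
}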

Any other ideas?

The regular buffered filesystems seem to be working OK. The bulk of the
filesystem fixes are in fs-prepare_write-fixes.patch, but they are a bit
scattered at the moment. Applying the full -mm patch would be a good
idea.

The conversion involves:
* ensure we don't mark the page uptodate in prepare_write if it is not
(which is a bug anyway... I found a few of those).
* ensure we don't mark a non-uptodate page as uptodate if the commit
was short.
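
To illustrate the first point, this is the shape of the bug
(illustrative code, not lifted from any one filesystem):

static int example_buggy_prepare_write(struct file *file,
		struct page *page, unsigned from, unsigned to)
{
	void *kaddr = kmap_atomic(page, KM_USER0);

	/* Zeroing outside [from, to) is fine... */
	memset(kaddr, 0, from);
	memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
	flush_dcache_page(page);
	kunmap_atomic(kaddr, KM_USER0);

	/* ...but this is WRONG: [from, to) stays uninitialised until
	 * the copy_from_user and commit_write happen. */
	SetPageUptodate(page);
	return 0;
}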

The non-trivial filesystems are GFS2, OCFS2, Reiserfs, and JFFS, so
I need maintainers to look at those.

For the "moronic filesystems that do not allow holes", the cont_ routines
have already been using zero length prepare/commit for something else.
OGAWA Hirofumi is thinking about this. Any comments?

Thanks,
Nick


2006-10-24 09:33:11

by Jörn Engel

Subject: Re: [RFC] commit_write less than prepared by prepare_write

On Sun, 22 October 2006 10:40:20 +0200, Nick Piggin wrote:
>
> The non-trivial filesystems are GFS2, OCFS2, Reiserfs, and JFFS, so
> I need maintainers to look at those.

JFFS has been unmaintained for some years and barely evaded removal at
least once. Now might be the time to pull support for good or else
convert it by adding an #error to it.

Jörn

--
This above all: to thine own self be true.
-- Shakespeare

2006-10-24 15:47:42

by Vladimir V. Saveliev

Subject: Re: [RFC] commit_write less than prepared by prepare_write

Hello

On Sunday 22 October 2006 12:40, Nick Piggin wrote:
> This is a request for comments and review by filesystem maintainers.
> Especially if you own GFS2, OCFS2, Reiserfs, JFFS! Those using fs/buffer.c
> routines directly should be reasonably safe.
>
> We have several long standing deadlocks in the core MM when doing
> buffered writes. The full explanation is in 2.6.19-rc2-mm2 in the
> patch mm-fix-pagecache-write-deadlocks.patch and there are some followup
> fixes.
>
> Basically: when copying data from the user-supplied address into the
> pagecache, between prepare_write and commit_write, we can take a fault
> on the copy_from_user and enter the pagefault handler while holding
> the page lock.
>
> There is no way that this can work safely, so we must do atomic copies
> here, which will just return a short write in the case of a fault.
>
> The protocol for handling short writes is as follows:
>
> - If the page is uptodate, commit_write with the shorter length
> (potentially 0), move the counters forward, and retry.
>
> - If the page is not uptodate, do a zero-length commit_write at the
> start offset of the prepare_write being closed.
>
> * Note that if we hit some other error condition, we may never run another
> prepare_write or commit_write, so we can't rely on that.
>
> For the case of an uptodate page, it may be somewhat safer to commit the
> full length that was prepared (we can always pretend that the uncopied
> part got dirtied; the retry can happily overwrite it). Comments?

The retry can return -EFAULT, I guess. For example, if the file mapped
into user space gets truncated.

> I figure
> that if we have to audit everything anyway, then keeping this wart would
> be silly.
>
> For the non-uptodate page:
> We can't run a full length commit_write, because the data in the page is
> uninitialised. Can't zero out the uninitialised data, because that would
> lead to zeros temporarily appearing in the pagecache. Any other ways to
> fix it?
>
> The problem with doing a short commit, as noted by Mark Fasheh, is that
> prepare_write might have allocated a disk mapping (eg. if we write over a
> hole), and if the commit fails and we unlock the page, then a read from
> the filesystem might think that is valid data.
>
> Mark's idea is to loop through the page buffers and zero out the
> buffer_new ones when the caller detects a fault here. This should work
> because a hole is zeroes anyway, so we could mark the page uptodate and it
> would eventually get written back to initialise those blocks. But, is
> there any other filesystem specific state that would need to be fixed
> up? Do we need another filesystem call?
>
> I would prefer that disk block allocation be performed at commit_write
> time. I realise that is probably a lot more rework, and there may be
> reasons why it can't be done?
>
> Another alternative is to pre-read the page in prepare_write, which
> should always get us out of trouble. Could be a good option for
> simple filesystems. People who care about performance probably want to
> avoid this.
>
> Any other ideas?
>
> The regular buffered filesystems seem to be working OK. The bulk of the
> filesystem fixes are in fs-prepare_write-fixes.patch, but they are a bit
> scattered at the moment. Applying the full -mm patch would be a good
> idea.
>
> The conversion involves:
> * ensure we don't mark the page uptodate in prepare_write if it is not
> (which is a bug anyway... I found a few of those).
> * ensure we don't mark a non-uptodate page as uptodate if the commit
> was short.
>
> The non-trivial filesystems are GFS2, OCFS2, Reiserfs, and JFFS, so
> I need maintainers to look at those.
>
> For the "moronic filesystems that do not allow holes", the cont_ routines
> have already been using zero length prepare/commit for something else.
> OGAWA Hirofumi is thinking about this. Any comments?
>
> Thanks,
> Nick
>
>

2006-11-04 11:57:18

by Nick Piggin

Subject: Re: [RFC] commit_write less than prepared by prepare_write

On Sun, Oct 22, 2006 at 10:40:20AM +0200, Nick Piggin wrote:
> For the non-uptodate page:
> We can't run a full length commit_write, because the data in the page is
> uninitialised. Can't zero out the uninitialised data, because that would
> lead to zeros temporarily appearing in the pagecache. Any other ways to
> fix it?
>
> The problem with doing a short commit, as noted by Mark Fasheh, is that
> prepare_write might have allocated a disk mapping (eg. if we write over a
> hole), and if the commit fails and we unlock the page, then a read from
> the filesystem might think that is valid data.
>
> Mark's idea is to loop through the page buffers and zero out the
> buffer_new ones when the caller detects a fault here. This should work
> because a hole is zeroes anyway, so we could mark the page uptodate and it
> would eventually get written back to initialise those blocks. But, is
> there any other filesystem specific state that would need to be fixed
> up? Do we need another filesystem call?
>
> I would prefer that disk block allocation be performed at commit_write
> time. I realise that is probably a lot more rework, and there may be
> reasons why it can't be done?
>
> Another alternative is to pre-read the page in prepare_write, which
> should always get us out of trouble. Could be a good option for
> simple filesystems. People who care about performance probably want to
> avoid this.
>
> Any other ideas?

OK, so I was able to pretty easily reproduce the stale data problem
with CONFIG_DEBUG_VM turned on (if anyone wants a copy of the test
program, let me know privately).

The following patch attempts to implement Mark's idea, and does seem
to work for ext2 in bh mode. nobh seems to be broken in a non-trivial
way: because we discard the bh, we don't know that we might have just
allocated under a hole. The old "error handling" zeroed out the entire
page, which, AFAICS, could destroy existing data. My patches in -mm
ripped this out, which introduces the same sort of stale data problem
as doing a zero-length commit (arguably better? probably not: one is
data corruption, the other is data corruption plus an information
leak).

So nobh isn't fixed, but a note is made of it. I think it was broken when
I got there, though?

Are there any other filesystems that don't mark buffer_new properly, and
also don't zero out allocated holes? If so, I think they were also broken
before I did anything (think of the -EFAULT case).

Nick
--
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1951,7 +1951,7 @@ retry_noprogress:
bytes);
dec_preempt_count();

- if (!PageUptodate(page)) {
+ if (unlikely(copied != bytes)) {
/*
* If the page is not uptodate, we cannot allow a
* partial commit_write because when we unlock the
@@ -1967,9 +1967,15 @@ retry_noprogress:
* single-segment path below, which should get the
filesystem to bring the page uptodate for us next
* time.
+ *
+ * Must zero out new buffers here so that we do end
+ * up properly filling holes rather than leaving stale
+ * data in them that might be read in future.
*/
- if (unlikely(copied != bytes))
+ if (!PageUptodate(page)) {
+ page_zero_new_buffers(page);
copied = 0;
+ }
}

flush_dcache_page(page);
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1491,6 +1491,38 @@ out:
}
EXPORT_SYMBOL(block_invalidatepage);

+void page_zero_new_buffers(struct page *page)
+{
+ unsigned int block_start, block_end;
+ struct buffer_head *head, *bh;
+
+ BUG_ON(!PageLocked(page));
+ BUG_ON(PageUptodate(page));
+ if (!page_has_buffers(page))
+ return;
+
+ bh = head = page_buffers(page);
+ block_start = 0;
+ do {
+ block_end = block_start + bh->b_size;
+
+ if (buffer_new(bh)) {
+ void *kaddr;
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr+block_start, 0, bh->b_size);
+ flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
+ clear_buffer_new(bh);
+ set_buffer_uptodate(bh);
+ mark_buffer_dirty(bh);
+ }
+
+ block_start = block_end;
+ bh = bh->b_this_page;
+ } while (bh != head);
+}
+
/*
* We attach and possibly dirty the buffers atomically wrt
* __set_page_dirty_buffers() via private_lock. try_to_free_buffers
@@ -1784,36 +1816,33 @@ static int __block_prepare_write(struct
}
continue;
}
- if (buffer_new(bh))
- clear_buffer_new(bh);
if (!buffer_mapped(bh)) {
WARN_ON(bh->b_size != blocksize);
err = get_block(inode, block, bh, 1);
if (err)
break;
- if (buffer_new(bh)) {
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
- if (PageUptodate(page)) {
- set_buffer_uptodate(bh);
- continue;
- }
- if (block_end > to || block_start < from) {
- void *kaddr;
-
- kaddr = kmap_atomic(page, KM_USER0);
- if (block_end > to)
- memset(kaddr+to, 0,
- block_end-to);
- if (block_start < from)
- memset(kaddr+block_start,
- 0, from-block_start);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
- }
+ }
+ if (buffer_new(bh)) {
+ unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
+ if (PageUptodate(page)) {
+ set_buffer_uptodate(bh);
continue;
}
+ if (block_end > to || block_start < from) {
+ void *kaddr;
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ if (block_end > to)
+ memset(kaddr+to, 0, block_end-to);
+ if (block_start < from)
+ memset(kaddr+block_start,
+ 0, from-block_start);
+ flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
+ }
+ continue;
}
+
if (PageUptodate(page)) {
if (!buffer_uptodate(bh))
set_buffer_uptodate(bh);
@@ -1833,43 +1862,10 @@ static int __block_prepare_write(struct
if (!buffer_uptodate(*wait_bh))
err = -EIO;
}
- if (!err) {
- bh = head;
- do {
- if (buffer_new(bh))
- clear_buffer_new(bh);
- } while ((bh = bh->b_this_page) != head);
- return 0;
- }
- /* Error case: */
- /*
- * Zero out any newly allocated blocks to avoid exposing stale
- * data. If BH_New is set, we know that the block was newly
- * allocated in the above loop.
- */
- bh = head;
- block_start = 0;
- do {
- block_end = block_start+blocksize;
- if (block_end <= from)
- goto next_bh;
- if (block_start >= to)
- break;
- if (buffer_new(bh)) {
- void *kaddr;

- clear_buffer_new(bh);
- kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr+block_start, 0, bh->b_size);
- flush_dcache_page(page);
- kunmap_atomic(kaddr, KM_USER0);
- set_buffer_uptodate(bh);
- mark_buffer_dirty(bh);
- }
-next_bh:
- block_start = block_end;
- bh = bh->b_this_page;
- } while (bh != head);
+ if (err)
+ page_zero_new_buffers(page);
+
return err;
}

@@ -2246,7 +2242,6 @@ int nobh_prepare_write(struct page *page
int i;
int ret = 0;
int is_mapped_to_disk = 1;
- int dirtied_it = 0;

if (PageMappedToDisk(page))
return 0;
@@ -2282,15 +2277,16 @@ int nobh_prepare_write(struct page *page
if (PageUptodate(page))
continue;
if (buffer_new(&map_bh) || !buffer_mapped(&map_bh)) {
+ /*
+ * XXX: stale data can be exposed as we are not zeroing
+ * out newly allocated blocks. If a subsequent operation
+ * fails, we'll never know about this :(
+ */
kaddr = kmap_atomic(page, KM_USER0);
- if (block_start < from) {
- memset(kaddr+block_start, 0, from-block_start);
- dirtied_it = 1;
- }
- if (block_end > to) {
+ if (block_start < from)
+ memset(kaddr+block_start, 0, block_end-block_start);
+ if (block_end > to)
memset(kaddr + to, 0, block_end - to);
- dirtied_it = 1;
- }
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);
continue;
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -151,6 +151,7 @@ struct buffer_head *alloc_page_buffers(s
int retry);
void create_empty_buffers(struct page *, unsigned long,
unsigned long b_state);
+void page_zero_new_buffers(struct page *page);
void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
void end_buffer_write_sync(struct buffer_head *bh, int uptodate);