2009-03-17 18:19:59

by Jan Kara

Subject: [PATCH 0/4] Fix bugs and possible data corruption if blocksize < pagesize


Hi,

lately, I've been tracking some problems with ext3 reported by HP on ia64 and
this is what I came up with. The first three patches fix real bugs - the first
two fix bugs in ext3 leading to false reports of EIO errors from JBD and an
aborted journal, and the third patch fixes a bug in block_write_full_page()
which can possibly lead to data corruption.
With these patches, I don't see the data corruption I was able to reproduce
with fsx-linux under UML. I'm not yet sure whether all the problems HP reported
are gone (they're testing now), but since this seems to be a nasty problem I'm
posting earlier rather than later. Could someone please have a look at whether
my fix and analysis look sane? Thanks.

Honza


2009-03-17 18:41:37

by Jan Kara

Subject: [PATCH 3/4] fs: Avoid data corruption with blocksize < pagesize

Assume the following situation:
Filesystem with blocksize < pagesize - suppose blocksize = 1024,
pagesize = 4096. File 'f' has first four blocks already allocated.
(line with "state:" contains the state of buffers in the page - m = mapped,
u = uptodate, d = dirty)

process 1: process 2:

write to 'f' bytes 0 - 1024
state: |mud,-,-,-|, page dirty
write to 'f' bytes 1024 - 4096:
__block_prepare_write() maps blocks
state: |mud,m,m,m|, page dirty
we fail to copy data -> copied = 0
block_write_end() does nothing
page gets unlocked
writepage() is called on the page
block_write_full_page() writes buffers with garbage
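
For context, the failed copy ("copied = 0" above) comes from the generic
write path; the core of the loop in generic_perform_write() looks roughly
like this (paraphrased from mm/filemap.c of that time, not verbatim):

        status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                                    &page, &fsdata);
        if (unlikely(status))
                break;

        pagefault_disable();
        /* the source page may have been reclaimed -> atomic copy returns 0 */
        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        pagefault_enable();
        flush_dcache_page(page);

        status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                  page, fsdata);

So ->write_end() (block_write_end() here) can legitimately see copied == 0
while ->write_begin() has already mapped, but not filled, the buffers.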

This patch fixes the problem by skipping !uptodate buffers in
block_write_full_page().

CC: Nick Piggin <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9f69741..22c0144 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1774,7 +1774,12 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
} while (bh != head);

do {
- if (!buffer_mapped(bh))
+ /*
+ * Parallel write could have already mapped the buffers but
+ * it then had to restart before copying in new data. We
+ * must avoid writing garbage so just skip the buffer.
+ */
+ if (!buffer_mapped(bh) || !buffer_uptodate(bh))
continue;
/*
* If it's a fully non-blocking write attempt and we cannot
--
1.6.0.2


2009-03-17 18:50:08

by Jan Kara

Subject: [PATCH 1/4] ext3: Fix false EIO errors

When the machine is under heavy memory pressure, it can happen that the page
we want to copy data from is paged out before we manage to copy the data, and
thus we are called with copied == 0. Generally, we should not file buffers
into journal lists if they don't really contain uptodate data. So restrict the
range of buffers we file to the journal list to only those buffers that
actually had data copied into them.
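
For reference, walk_page_buffers() applies the callback only to buffers that
intersect the [from, to) range, which is why passing 'copied' instead of 'len'
keeps the never-copied buffers out of the journal list. Roughly (paraphrased
from fs/ext3/inode.c of that time):

static int walk_page_buffers(handle_t *handle, struct buffer_head *head,
                             unsigned from, unsigned to, int *partial,
                             int (*fn)(handle_t *handle, struct buffer_head *bh))
{
        struct buffer_head *bh, *next;
        unsigned block_start, block_end, blocksize = head->b_size;
        int err, ret = 0;

        for (bh = head, block_start = 0;
             ret == 0 && (bh != head || !block_start);
             block_start = block_end, bh = next) {
                next = bh->b_this_page;
                block_end = block_start + blocksize;
                /* buffers entirely outside [from, to) are skipped */
                if (block_end <= from || block_start >= to) {
                        if (partial && !buffer_uptodate(bh))
                                *partial = 1;
                        continue;
                }
                err = (*fn)(handle, bh);
                if (!ret)
                        ret = err;
        }
        return ret;
}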

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/inode.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5fa453b..62005c0 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1261,7 +1261,7 @@ static int ext3_ordered_write_end(struct file *file,
int ret = 0, ret2;

from = pos & (PAGE_CACHE_SIZE - 1);
- to = from + len;
+ to = from + copied;

ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, ext3_journal_dirty_data);
--
1.6.0.2


2009-03-17 19:43:56

by Jan Kara

Subject: [PATCH 2/4] ext3: Avoid false EIO errors

Sometimes block_write_begin() can map buffers in a page but we later
fail to copy data into those buffers (because the source page has
been paged out in the meantime). We then end up with mapped but !uptodate
buffers. We must not file such buffers into a transaction - firstly, they
contain garbage, and secondly, it confuses the journaling code (it thinks
the write has failed and complains about IO errors).

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/inode.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 62005c0..d351eab 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1430,7 +1430,11 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)

static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
{
- if (buffer_mapped(bh))
+ /*
+ * Parallel write could have mapped the buffer but it didn't copy
+ * the data in yet. So avoid filing such buffer into a transaction.
+ */
+ if (buffer_mapped(bh) && buffer_uptodate(bh))
return ext3_journal_dirty_data(handle, bh);
return 0;
}
--
1.6.0.2


2009-03-17 17:33:55

by Jan Kara

Subject: [PATCH 4/4] fs: Warn about writing !uptodate buffers

Make submit_bh() warn about writing !uptodate buffers. Hopefully this
warns us about writing garbage (although bugs in write EIO handling
are going to trigger it as well, just as they already trigger the
warning in mark_buffer_dirty()).
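
The existing check referred to here sits at the top of mark_buffer_dirty()
and looks roughly like this:

void mark_buffer_dirty(struct buffer_head *bh)
{
        /* dirtying a !uptodate buffer already triggers this warning */
        WARN_ON_ONCE(!buffer_uptodate(bh));
        /* ... rest of mark_buffer_dirty() unchanged ... */
}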

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 22c0144..985f617 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2997,6 +2997,8 @@ int submit_bh(int rw, struct buffer_head * bh)
BUG_ON(!buffer_locked(bh));
BUG_ON(!buffer_mapped(bh));
BUG_ON(!bh->b_end_io);
+ if (rw & WRITE)
+ WARN_ON_ONCE(!buffer_uptodate(bh));

/*
* Mask in barrier bit for a write (could be either a WRITE or a
--
1.6.0.2

2009-03-18 12:03:20

by Nick Piggin

Subject: Re: [PATCH 3/4] fs: Avoid data corruption with blocksize < pagesize

On Tue, Mar 17, 2009 at 06:33:54PM +0100, Jan Kara wrote:
> Assume the following situation:
> Filesystem with blocksize < pagesize - suppose blocksize = 1024,
> pagesize = 4096. File 'f' has first four blocks already allocated.
> (line with "state:" contains the state of buffers in the page - m = mapped,
> u = uptodate, d = dirty)
>
> process 1: process 2:
>
> write to 'f' bytes 0 - 1024
> state: |mud,-,-,-|, page dirty
> write to 'f' bytes 1024 - 4096:
> __block_prepare_write() maps blocks
> state: |mud,m,m,m|, page dirty
> we fail to copy data -> copied = 0
> block_write_end() does nothing
> page gets unlocked
> writepage() is called on the page
> block_write_full_page() writes buffers with garbage
>
> This patch fixes the problem by skipping !uptodate buffers in
> block_write_full_page().
>
> CC: Nick Piggin <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/buffer.c | 7 ++++++-
> 1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 9f69741..22c0144 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1774,7 +1774,12 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> } while (bh != head);
>
> do {
> - if (!buffer_mapped(bh))
> + /*
> + * Parallel write could have already mapped the buffers but
> + * it then had to restart before copying in new data. We
> + * must avoid writing garbage so just skip the buffer.
> + */
> + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> continue;

I don't quite see how this can happen. Further down in this loop,
we do a test_clear_buffer_dirty(), which should exclude this I
think? And marking the buffer dirty if it is not uptodate should
be a bug.
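
The part of the loop I mean is, roughly:

        if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
                lock_buffer(bh);
        } else if (!trylock_buffer(bh)) {
                redirty_page_for_writepage(wbc, page);
                continue;
        }
        /* only buffers that were actually dirty get queued for writeout */
        if (test_clear_buffer_dirty(bh)) {
                mark_buffer_async_write(bh);
        } else {
                unlock_buffer(bh);
        }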


2009-03-18 12:08:07

by Nick Piggin

Subject: Re: [PATCH 4/4] fs: Warn about writing !uptodate buffers

On Wednesday 18 March 2009 04:33:55 Jan Kara wrote:
> Make submit_bh() warn about writing !uptodate buffers. Hopefully this
> warns us about writing garbage (although bugs in write EIO handling
> are going to trigger it as well, just as they already trigger the
> warning in mark_buffer_dirty()).
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/buffer.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 22c0144..985f617 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2997,6 +2997,8 @@ int submit_bh(int rw, struct buffer_head * bh)
> BUG_ON(!buffer_locked(bh));
> BUG_ON(!buffer_mapped(bh));
> BUG_ON(!bh->b_end_io);
> + if (rw & WRITE)
> + WARN_ON_ONCE(!buffer_uptodate(bh));

Yes, a very nice assertion to have. Arguably it should be a BUG_ON,
because it indicates some state corruption at least, and writing
garbage data to disk at worst. But a WARN_ON for now is probably best.

I have some patches to fix up some problems with EIO handling in the
VM (and I think they solve the buffer warning too). I have to get them
out and try to get them merged again... at which point this should
probably be turned into a BUG_ON.

Acked-by: Nick Piggin <[email protected]>

2009-03-18 14:13:35

by Jan Kara

Subject: Re: [PATCH 3/4] fs: Avoid data corruption with blocksize < pagesize

On Wed 18-03-09 13:00:23, Nick Piggin wrote:
> On Tue, Mar 17, 2009 at 06:33:54PM +0100, Jan Kara wrote:
> > Assume the following situation:
> > Filesystem with blocksize < pagesize - suppose blocksize = 1024,
> > pagesize = 4096. File 'f' has first four blocks already allocated.
> > (line with "state:" contains the state of buffers in the page - m = mapped,
> > u = uptodate, d = dirty)
> >
> > process 1: process 2:
> >
> > write to 'f' bytes 0 - 1024
> > state: |mud,-,-,-|, page dirty
> > write to 'f' bytes 1024 - 4096:
> > __block_prepare_write() maps blocks
> > state: |mud,m,m,m|, page dirty
> > we fail to copy data -> copied = 0
> > block_write_end() does nothing
> > page gets unlocked
> > writepage() is called on the page
> > block_write_full_page() writes buffers with garbage
> >
> > This patch fixes the problem by skipping !uptodate buffers in
> > block_write_full_page().
> >
> > CC: Nick Piggin <[email protected]>
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/buffer.c | 7 ++++++-
> > 1 files changed, 6 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 9f69741..22c0144 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1774,7 +1774,12 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> > } while (bh != head);
> >
> > do {
> > - if (!buffer_mapped(bh))
> > + /*
> > + * Parallel write could have already mapped the buffers but
> > + * it then had to restart before copying in new data. We
> > + * must avoid writing garbage so just skip the buffer.
> > + */
> > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > continue;
>
> I don't quite see how this can happen. Further down in this loop,
> we do a test_clear_buffer_dirty(), which should exclude this I
> think? And marking the buffer dirty if it is not uptodate should
> be a bug.
Hmm, this patch definitely does something important, because without it I
hit corruption in UML within ~20 minutes and with it no corruption happens
in ~3 hours. Maybe someone calls set_page_dirty() on the page and
__set_page_dirty_buffers() unconditionally dirties all the buffers the
page has? But I still don't see how the write could be lost, which is what
I observe in the fsx-linux test. I'm doing some more tests to understand
this better.
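
The loop I have in mind in __set_page_dirty_buffers() is roughly:

        spin_lock(&mapping->private_lock);
        if (page_has_buffers(page)) {
                struct buffer_head *head = page_buffers(page);
                struct buffer_head *bh = head;

                /* every buffer on the page gets dirtied, uptodate or not */
                do {
                        set_buffer_dirty(bh);
                        bh = bh->b_this_page;
                } while (bh != head);
        }
        newly_dirty = !TestSetPageDirty(page);
        spin_unlock(&mapping->private_lock);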

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-18 18:42:35

by Aneesh Kumar K.V

Subject: Re: [PATCH 3/4] fs: Avoid data corruption with blocksize < pagesize

On Tue, Mar 17, 2009 at 06:33:54PM +0100, Jan Kara wrote:
> Assume the following situation:
> Filesystem with blocksize < pagesize - suppose blocksize = 1024,
> pagesize = 4096. File 'f' has first four blocks already allocated.
> (line with "state:" contains the state of buffers in the page - m = mapped,
> u = uptodate, d = dirty)
>
> process 1: process 2:
>
> write to 'f' bytes 0 - 1024
> state: |mud,-,-,-|, page dirty
> write to 'f' bytes 1024 - 4096:
> __block_prepare_write() maps blocks
> state: |mud,m,m,m|, page dirty
> we fail to copy data -> copied = 0
> block_write_end() does nothing
> page gets unlocked


If copied == 0 then in block_write_end() we do

page_zero_new_buffers(page, start+copied, start+len);

which would mean we should not see garbage.
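
That is, the relevant part of block_write_end() is roughly:

        if (unlikely(copied < len)) {
                /*
                 * A short copy into a non-uptodate page is treated as a
                 * zero-length write and the caller redoes the whole copy.
                 */
                if (!PageUptodate(page))
                        copied = 0;

                page_zero_new_buffers(page, start+copied, start+len);
        }
        flush_dcache_page(page);

        /* this can be a short (even 0-byte) write */
        __block_commit_write(inode, page, start, start+copied);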


> writepage() is called on the page
> block_write_full_page() writes buffers with garbage
>

-aneesh

2009-03-18 18:50:20

by Jan Kara

Subject: Re: [PATCH 3/4] fs: Avoid data corruption with blocksize < pagesize

On Thu 19-03-09 00:12:22, Aneesh Kumar K.V wrote:
> On Tue, Mar 17, 2009 at 06:33:54PM +0100, Jan Kara wrote:
> > Assume the following situation:
> > Filesystem with blocksize < pagesize - suppose blocksize = 1024,
> > pagesize = 4096. File 'f' has first four blocks already allocated.
> > (line with "state:" contains the state of buffers in the page - m = mapped,
> > u = uptodate, d = dirty)
> >
> > process 1: process 2:
> >
> > write to 'f' bytes 0 - 1024
> > state: |mud,-,-,-|, page dirty
> > write to 'f' bytes 1024 - 4096:
> > __block_prepare_write() maps blocks
> > state: |mud,m,m,m|, page dirty
> > we fail to copy data -> copied = 0
> > block_write_end() does nothing
> > page gets unlocked
>
>
> If copied = 0 then in block_write_end we do
>
> page_zero_new_buffers(page, start+copied, start+len
>
> which would mean we should not see garbage.
But this will zero only *new* buffers - so if the blocks are already
allocated, get_block() won't set the new flag and the buffers won't be
zeroed...
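
For reference, page_zero_new_buffers() does roughly:

        bh = head = page_buffers(page);
        block_start = 0;
        do {
                block_end = block_start + bh->b_size;

                /* buffers without the 'new' flag are not touched at all */
                if (buffer_new(bh) && block_end > from && block_start < to) {
                        if (!PageUptodate(page)) {
                                start = max(from, block_start);
                                size = min(to, block_end) - start;
                                zero_user(page, start, size);
                                set_buffer_uptodate(bh);
                        }
                        clear_buffer_new(bh);
                        mark_buffer_dirty(bh);
                }

                block_start = block_end;
                bh = bh->b_this_page;
        } while (bh != head);
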
But I'm not saying I understand why this seems to help against the
corruption under UML, because we don't seem to be writing !uptodate
buffers there.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-18 18:57:41

by Jan Kara

Subject: Re: [PATCH 3/4] fs: Avoid data corruption with blocksize < pagesize

On Wed 18-03-09 13:00:23, Nick Piggin wrote:
> On Tue, Mar 17, 2009 at 06:33:54PM +0100, Jan Kara wrote:
> > Assume the following situation:
> > Filesystem with blocksize < pagesize - suppose blocksize = 1024,
> > pagesize = 4096. File 'f' has first four blocks already allocated.
> > (line with "state:" contains the state of buffers in the page - m = mapped,
> > u = uptodate, d = dirty)
> >
> > process 1: process 2:
> >
> > write to 'f' bytes 0 - 1024
> > state: |mud,-,-,-|, page dirty
> > write to 'f' bytes 1024 - 4096:
> > __block_prepare_write() maps blocks
> > state: |mud,m,m,m|, page dirty
> > we fail to copy data -> copied = 0
> > block_write_end() does nothing
> > page gets unlocked
> > writepage() is called on the page
> > block_write_full_page() writes buffers with garbage
> >
> > This patch fixes the problem by skipping !uptodate buffers in
> > block_write_full_page().
> >
> > CC: Nick Piggin <[email protected]>
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/buffer.c | 7 ++++++-
> > 1 files changed, 6 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 9f69741..22c0144 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1774,7 +1774,12 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> > } while (bh != head);
> >
> > do {
> > - if (!buffer_mapped(bh))
> > + /*
> > + * Parallel write could have already mapped the buffers but
> > + * it then had to restart before copying in new data. We
> > + * must avoid writing garbage so just skip the buffer.
> > + */
> > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > continue;
>
> I don't quite see how this can happen. Further down in this loop,
> we do a test_clear_buffer_dirty(), which should exclude this I
> think? And marking the buffer dirty if it is not uptodate should
> be a bug.
OK, I spoke too soon. Now I reproduced the corruption under UML even with
this patch. So it may be something different...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR