2009-04-28 18:51:04

by Aneesh Kumar K.V

Subject: [PATCH -V3] Fix sub-block zeroing for buffered writes into unwritten extents

We need to mark the buffer_head mapping prealloc space
as new during write_begin. Otherwise we don't zero out the
page cache content properly for a partial write. This will
cause file corruption with preallocation.

Also use block number -1 as the fake block number so that
unmap_underlying_metadata doesn't drop wrong buffer_head

Signed-off-by: Aneesh Kumar K.V <[email protected]>

---
fs/ext4/inode.c | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e91f978..0214389 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2318,11 +2318,20 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
 			/* not enough space to reserve */
 			return ret;

-		map_bh(bh_result, inode->i_sb, 0);
+		map_bh(bh_result, inode->i_sb, -1);
 		set_buffer_new(bh_result);
 		set_buffer_delay(bh_result);
 	} else if (ret > 0) {
 		bh_result->b_size = (ret << inode->i_blkbits);
+		bh_result->b_bdev = inode->i_sb->s_bdev;
+		bh->b_blocknr = -1;
+		/*
+		 * With sub-block writes into unwritten extents
+		 * we also need to mark the buffer as new so that
+		 * the unwritten parts of the buffer gets correctly zeroed.
+		 */
+		if (buffer_unwritten(bh_result))
+			set_buffer_new(bh_result);
 		ret = 0;
 	}

--
tg: (27b1833..) preallocate_corruption_quickfix (depends on: master)


2009-04-28 21:47:47

by Eric Sandeen

Subject: Re: [PATCH -V3] Fix sub-block zeroing for buffered writes into unwritten extents

Aneesh Kumar K.V wrote:
> We need to mark the buffer_head mapping prealloc space
> as new during write_begin. Otherwise we don't zero out the
> page cache content properly for a partial write. This will
> cause file corruption with preallocation.
>
> Also use block number -1 as the fake block number so that
> unmap_underlying_metadata doesn't drop wrong buffer_head
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> ---
> fs/ext4/inode.c | 11 ++++++++++-
> 1 files changed, 10 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e91f978..0214389 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2318,11 +2318,20 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> /* not enough space to reserve */
> return ret;
>
> - map_bh(bh_result, inode->i_sb, 0);
> + map_bh(bh_result, inode->i_sb, -1);

This seems fine, though unrelated, isn't it? But mapping delalloc
blocks to -1 temporarily rather than to 0 seems safer to me (could this
possibly be related to our low-block corruption cases?)

Oh, I guess this is for the unmap_underlying_metadata stuff, though I
don't know what that call is for in ext4, to be honest. :) At any rate
this should make it not findable there which is fine, I guess.
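
To illustrate why the fake block number matters for that lookup, here is a minimal userspace sketch, not kernel code. `struct cached_buf` and `fake_unmap_alias()` are hypothetical names invented for this example. It assumes a much simplified model in which the lookup scans the buffers cached on the block device for a matching block number and cancels the pending write-back of any match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffer cached on the block device. */
struct cached_buf {
	uint64_t blocknr;	/* on-disk block this buffer aliases */
	bool dirty;		/* true while write-back is still pending */
};

/*
 * Simplified model of the alias lookup: find a cached buffer with the
 * given block number and cancel its pending write-back.
 * Returns true if an alias was found and cleaned.
 */
static bool fake_unmap_alias(struct cached_buf *cache, size_t n,
			     uint64_t blocknr)
{
	assert(cache != NULL);
	for (size_t i = 0; i < n; i++) {
		if (cache[i].blocknr == blocknr) {
			cache[i].dirty = false;
			return true;
		}
	}
	return false;
}
```

In this model, a fake block number of 0 matches a dirty buffer that really lives at block 0 and drops its write-back, while (uint64_t)-1 matches nothing in the cache.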

> set_buffer_new(bh_result);
> set_buffer_delay(bh_result);
> } else if (ret > 0) {
> bh_result->b_size = (ret << inode->i_blkbits);
> + bh_result->b_bdev = inode->i_sb->s_bdev;
> + bh->b_blocknr = -1;

Mingming pointed out on irc that this sets the blocknr to -1 for every
mapping we find, which is probably not what we want. :) But if it's an
actually (pre)allocated block, why do we set it to a fake number at all?

I guess it seems to me that we should be setting up a preallocated
block/bh just about like any other, with a block nr, bdev, etc when we
create it or look it up - but with BH_Unwritten as well to flag it as
such. It may not actually matter but it just seems odd to me for it to
have a fake block nr.

If surrounding infrastructure still expects to call get_block each time
to split up an unwritten extent, ok for now to leave it unmapped, but
that needs work I think, as we mentioned on irc.

FWIW, setting it to -1 even under the if (buffer_unwritten()) test below
is probably redundant, I think it's already set that way from
alloc_page_buffers().

> + /*
> + * With sub-block writes into unwritten extents
> + * we also need to mark the buffer as new so that
> + * the unwritten parts of the buffer gets correctly zeroed.
> + */
> + if (buffer_unwritten(bh_result))
> + set_buffer_new(bh_result);
> ret = 0;
> }
>

This part still seems fine to me :)

-Eric

2009-04-29 01:30:28

by Mingming Cao

Subject: Re: [PATCH -V3] Fix sub-block zeroing for buffered writes into unwritten extents


On Wed, 2009-04-29 at 00:20 +0530, Aneesh Kumar K.V wrote:
> We need to mark the buffer_head mapping prealloc space
> as new during write_begin. Otherwise we don't zero out the
> page cache content properly for a partial write. This will
> cause file corruption with preallocation.
>
> Also use block number -1 as the fake block number so that
> unmap_underlying_metadata doesn't drop wrong buffer_head
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> ---
> fs/ext4/inode.c | 11 ++++++++++-
> 1 files changed, 10 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e91f978..0214389 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2318,11 +2318,20 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> /* not enough space to reserve */
> return ret;
>
> - map_bh(bh_result, inode->i_sb, 0);
> + map_bh(bh_result, inode->i_sb, -1);
> set_buffer_new(bh_result);
> set_buffer_delay(bh_result);
> } else if (ret > 0) {
> bh_result->b_size = (ret << inode->i_blkbits);
> + bh_result->b_bdev = inode->i_sb->s_bdev;
> + bh->b_blocknr = -1;

A small typo: this should be bh_result->b_blocknr.

But won't this incorrectly set b_blocknr for a normal successful
(allocated, not preallocated) get_block lookup as well?
ext4_get_blocks_wrap() returns 1 (> 0) when it finds the block allocated.
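
The concern can be shown with a small userspace sketch, not kernel code. `struct lookup_bh`, `lookup_done_v3()` and `lookup_done_guarded()` are hypothetical names made up for this example. The first function models the `ret > 0` branch as posted in V3 (with the `bh` typo corrected), the second models one possible way of restricting the fake block number to unwritten buffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t fake_sector_t;			/* stands in for sector_t */
#define FAKE_BLOCKNR ((fake_sector_t)-1)	/* the fake block number */

/* Hypothetical stand-in for the buffer_head fields used here. */
struct lookup_bh {
	fake_sector_t b_blocknr;
	bool unwritten;		/* models BH_Unwritten */
	bool is_new;		/* models BH_New */
};

/* Models the V3 branch: every successful lookup gets the fake number. */
static void lookup_done_v3(struct lookup_bh *bh)
{
	assert(bh);
	bh->b_blocknr = FAKE_BLOCKNR;
	if (bh->unwritten)
		bh->is_new = true;
}

/* Assumed alternative: only unwritten buffers get the fake number. */
static void lookup_done_guarded(struct lookup_bh *bh)
{
	assert(bh);
	if (bh->unwritten) {
		bh->b_blocknr = FAKE_BLOCKNR;
		bh->is_new = true;
	}
}
```

For an ordinary allocated block (say block 1234, not unwritten), `lookup_done_v3()` overwrites the real block number, whereas `lookup_done_guarded()` leaves it alone.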

> + /*
> + * With sub-block writes into unwritten extents
> + * we also need to mark the buffer as new so that
> + * the unwritten parts of the buffer gets correctly zeroed.
> + */
> + if (buffer_unwritten(bh_result))
> + set_buffer_new(bh_result);
> ret = 0;
> }
>

I think it would be nicer to set up the fake block number together with
set_buffer_new(), in ext4_ext_get_blocks(), where it handles the
preallocation lookup for delalloc. This avoids the
buffer_unwritten(bh_result) check on every bh returned by
ext4_get_blocks_wrap(), and makes the logic saner.

How about the attached patch? Tested with my testcase; the partial write
preallocation corruption is fixed.

But looking at the comment change, it looks like the original intention
was to set the buffer unwritten so that a read from that uninitialized
block returns 0. It turns out the VFS needs the buffer to be set new for
this purpose.
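
The effect of the new flag on a partial write can be sketched in userspace as follows. This is a simplified model, not the kernel's write_begin path. `struct page_bh`, `fake_write_begin()` and `fake_partial_write()` are hypothetical names, and the model assumes that a buffer which is not flagged new simply keeps whatever bytes the page already held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FAKE_BLOCK_SIZE 4096

/* Hypothetical stand-in for a buffer_head plus its page cache data. */
struct page_bh {
	bool is_new;				/* models BH_New */
	unsigned char data[FAKE_BLOCK_SIZE];	/* models page contents */
};

/*
 * Models the write_begin step for a write covering [from, to):
 * a buffer flagged new has everything outside that range zeroed,
 * any other buffer is left untouched in this simplified model.
 */
static void fake_write_begin(struct page_bh *bh, size_t from, size_t to)
{
	assert(from <= to && to <= FAKE_BLOCK_SIZE);
	if (bh->is_new) {
		memset(bh->data, 0, from);
		memset(bh->data + to, 0, FAKE_BLOCK_SIZE - to);
	}
}

/* Prepares the buffer, then copies len bytes of user data in at from. */
static void fake_partial_write(struct page_bh *bh, size_t from,
			       const void *src, size_t len)
{
	fake_write_begin(bh, from, from + len);
	memcpy(bh->data + from, src, len);
}
```

Starting from a page full of stale bytes, a 3-byte write at offset 100 leaves the stale bytes around the written range when `is_new` is false, and leaves zeroes there when `is_new` is true.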


-----------------------------------------------------------------------------

This patch fixes the garbage file data seen after a partial write into
preallocated space when delayed allocation is enabled.

The preallocated (uninitialized) buffer needs to be marked buffer_new()
so that a read of this uninitialized block returns 0. With delayed
allocation, the get_block() call from write_begin() (create = 0) only
does a lookup on the preallocated extent. The returned buffer did not
have the buffer_new flag set, so the page was filled with garbage that
later got written to disk.

Signed-off-by: Mingming Cao <[email protected]>
Index: linux-2.6.28-rc6/fs/ext4/extents.c
===================================================================
--- linux-2.6.28-rc6.orig/fs/ext4/extents.c 2009-04-28 11:52:05.000000000 -0700
+++ linux-2.6.28-rc6/fs/ext4/extents.c 2009-04-28 17:35:55.000000000 -0700
@@ -2767,15 +2767,28 @@ int ext4_ext_get_blocks(handle_t *handle
if (create == EXT4_CREATE_UNINITIALIZED_EXT)
goto out;
if (!create) {
+ if (allocated > max_blocks)
+ allocated = max_blocks;
/*
- * We have blocks reserved already. We
+ * We have blocks preallocated already. For
+ * lookup (create=0) at write_begin time we
* return allocated blocks so that delalloc
* won't do block reservation for us. But
- * the buffer head will be unmapped so that
- * a read from the block returns 0s.
+ * we need to mark the buffer head new so that
+ * a read from the block returns 0s. Fake the
+ * block number -1 so that the following call
+ * of unmap_underlying_metadata doesn't drop
+ * wrong buffer_head
*/
- if (allocated > max_blocks)
- allocated = max_blocks;
+ bh_result->b_blocknr = -1;
+ bh_result->b_bdev = inode->i_sb->s_bdev;
+ set_buffer_new(bh_result);
+
+ /*
+ * We also need to mark the buffer as
+ * unwritten so we don't write these
+ * uninitialized pages
+ */
set_buffer_unwritten(bh_result);
goto out2;
}
Index: linux-2.6.28-rc6/fs/ext4/inode.c
===================================================================
--- linux-2.6.28-rc6.orig/fs/ext4/inode.c 2009-04-28 11:52:05.000000000 -0700
+++ linux-2.6.28-rc6/fs/ext4/inode.c 2009-04-28 17:34:24.000000000 -0700
@@ -2173,7 +2173,7 @@ static int ext4_da_get_block_prep(struct
/* not enough space to reserve */
return ret;

- map_bh(bh_result, inode->i_sb, 0);
+ map_bh(bh_result, inode->i_sb, -1);
set_buffer_new(bh_result);
set_buffer_delay(bh_result);
} else if (ret > 0) {


2009-04-29 04:46:31

by Aneesh Kumar K.V

Subject: Re: [PATCH -V3] Fix sub-block zeroing for buffered writes into unwritten extents

On Tue, Apr 28, 2009 at 06:30:26PM -0700, Mingming wrote:
>
> On Wed, 2009-04-29 at 00:20 +0530, Aneesh Kumar K.V wrote:
> > We need to mark the buffer_head mapping prealloc space
> > as new during write_begin. Otherwise we don't zero out the
> > page cache content properly for a partial write. This will
> > cause file corruption with preallocation.
> >
> > Also use block number -1 as the fake block number so that
> > unmap_underlying_metadata doesn't drop wrong buffer_head
> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> >
> > ---
> > fs/ext4/inode.c | 11 ++++++++++-
> > 1 files changed, 10 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e91f978..0214389 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -2318,11 +2318,20 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> > /* not enough space to reserve */
> > return ret;
> >
> > - map_bh(bh_result, inode->i_sb, 0);
> > + map_bh(bh_result, inode->i_sb, -1);
> > set_buffer_new(bh_result);
> > set_buffer_delay(bh_result);
> > } else if (ret > 0) {
> > bh_result->b_size = (ret << inode->i_blkbits);
> > + bh_result->b_bdev = inode->i_sb->s_bdev;
> > + bh->b_blocknr = -1;
>
> A small typo: this should be bh_result->b_blocknr.
>
> But won't this incorrectly set b_blocknr for a normal successful
> (allocated, not preallocated) get_block lookup as well?
> ext4_get_blocks_wrap() returns 1 (> 0) when it finds the block allocated.
>
> > + /*
> > + * With sub-block writes into unwritten extents
> > + * we also need to mark the buffer as new so that
> > + * the unwritten parts of the buffer gets correctly zeroed.
> > + */
> > + if (buffer_unwritten(bh_result))
> > + set_buffer_new(bh_result);
> > ret = 0;
> > }
> >
>
> I think it would be nicer to set up the fake block number together with
> set_buffer_new(), in ext4_ext_get_blocks(), where it handles the
> preallocation lookup for delalloc. This avoids the
> buffer_unwritten(bh_result) check on every bh returned by
> ext4_get_blocks_wrap(), and makes the logic saner.
>
> How about the attached patch? Tested with my testcase; the partial write
> preallocation corruption is fixed.
>
> But looking at the comment change, it looks like the original intention
> was to set the buffer unwritten so that a read from that uninitialized
> block returns 0. It turns out the VFS needs the buffer to be set new for
> this purpose.

Should work. My only concern is that this change will affect the read
path and the non-delalloc case. For 2.6.30 I guess we can make the change
only for the delayed alloc case, which is less intrusive (i.e. change only
ext4_da_get_block_prep). I have split the patches into two and will send a
follow-up patch. For .31 we want to return the same buffer_head flags
that xfs sets for delayed and unwritten extents.

-aneesh