LinuxLists.cc - [PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten extents

2009-04-29 04:47:27

Subject: [PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten extents

We need to mark the buffer_head mapping prealloc space
as new during write_begin. Otherwise we don't zero out the
page cache content properly for a partial write. This will
cause file corruption with preallocation.

Also use block number -1 as the fake block number so that
unmap_underlying_metadata doesn't drop wrong buffer_head

Signed-off-by: Aneesh Kumar K.V <[email protected]>

---
fs/ext4/inode.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e91f978..12dcfab 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2323,6 +2323,16 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
set_buffer_delay(bh_result);
} else if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
+ /*
+ * With sub-block writes into unwritten extents
+ * we also need to mark the buffer as new so that
+ * the unwritten parts of the buffer gets correctly zeroed.
+ */
+ if (buffer_unwritten(bh_result)) {
+ bh_result->b_bdev = inode->i_sb->s_bdev;
+ set_buffer_new(bh_result);
+ bh_result->b_blocknr = -1;
+ }
ret = 0;
}

--
tg: (56a50ad..) preallocate_corruption_quickfix (depends on: master)

2009-04-29 04:47:29

by Aneesh Kumar K.V

[permalink] [raw]

Subject: [PATCH -V4 2/2] ext4: Use -1 as the fake block number for delayed new buffer_head

Block number '0' should not be used as the fake block number for
the delayed new buffer. This will result in vfs calling umap_underlying_metadata for
block number '0'. So use -1 instead.

Signed-off-by: Aneesh Kumar K.V <[email protected]>

---
fs/ext4/inode.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 12dcfab..c7c2156 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2318,7 +2318,7 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
/* not enough space to reserve */
return ret;

- map_bh(bh_result, inode->i_sb, 0);
+ map_bh(bh_result, inode->i_sb, -1);
set_buffer_new(bh_result);
set_buffer_delay(bh_result);
} else if (ret > 0) {
--
tg: (15262e5..) proper_mappingblock_for_delay (depends on: preallocate_corruption_quickfix)

2009-04-29 13:59:56

by Eric Sandeen

[permalink] [raw]

Subject: Re: [PATCH -V4 2/2] ext4: Use -1 as the fake block number for delayed new buffer_head

Aneesh Kumar K.V wrote:
> Block number '0' should not be used as the fake block number for
> the delayed new buffer. This will result in vfs calling umap_underlying_metadata for
> block number '0'. So use -1 instead.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> ---
> fs/ext4/inode.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 12dcfab..c7c2156 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2318,7 +2318,7 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> /* not enough space to reserve */
> return ret;
>
> - map_bh(bh_result, inode->i_sb, 0);
> + map_bh(bh_result, inode->i_sb, -1);
> set_buffer_new(bh_result);
> set_buffer_delay(bh_result);
> } else if (ret > 0) {

this seems fine too, thanks.

-Eric

2009-04-29 14:00:03

by Eric Sandeen

[permalink] [raw]

Subject: Re: [PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten extents

Aneesh Kumar K.V wrote:
> We need to mark the buffer_head mapping prealloc space
> as new during write_begin. Otherwise we don't zero out the
> page cache content properly for a partial write. This will
> cause file corruption with preallocation.
>
> Also use block number -1 as the fake block number so that
> unmap_underlying_metadata doesn't drop wrong buffer_head
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> ---
> fs/ext4/inode.c | 10 ++++++++++
> 1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e91f978..12dcfab 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2323,6 +2323,16 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> set_buffer_delay(bh_result);
> } else if (ret > 0) {
> bh_result->b_size = (ret << inode->i_blkbits);
> + /*
> + * With sub-block writes into unwritten extents
> + * we also need to mark the buffer as new so that
> + * the unwritten parts of the buffer gets correctly zeroed.
> + */
> + if (buffer_unwritten(bh_result)) {
> + bh_result->b_bdev = inode->i_sb->s_bdev;
> + set_buffer_new(bh_result);
> + bh_result->b_blocknr = -1;
> + }
> ret = 0;
> }
>

Ok, I guess this seems like the safest approach. Long term we should
look really hard at the state & block nr of these buffer heads, but I
agree that keeping the changes restricted to the preallocation path for
now is safest.

-Eric

2009-04-29 15:35:31

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH -V4 2/2] ext4: Use -1 as the fake block number for delayed new buffer_head

On Wed, Apr 29, 2009 at 10:17:21AM +0530, Aneesh Kumar K.V wrote:
> Block number '0' should not be used as the fake block number for
> the delayed new buffer. This will result in vfs calling umap_underlying_metadata for
> block number '0'. So use -1 instead.

sector_t is an unsigned type, so we probably want to use ~0 instead of
-1. I can fix this up before we apply into the patch queue.

Are we agreed both of these should probably be pushed to Linus for
2.6.30?

- Ted

2009-04-29 15:37:40

by Eric Sandeen

[permalink] [raw]

Subject: Re: [PATCH -V4 2/2] ext4: Use -1 as the fake block number for delayed new buffer_head

Theodore Tso wrote:
> On Wed, Apr 29, 2009 at 10:17:21AM +0530, Aneesh Kumar K.V wrote:
>> Block number '0' should not be used as the fake block number for
>> the delayed new buffer. This will result in vfs calling umap_underlying_metadata for
>> block number '0'. So use -1 instead.
>
> sector_t is an unsigned type, so we probably want to use ~0 instead of
> -1. I can fix this up before we apply into the patch queue.

I don't think that helps. The point is to have a block number which is
invalid, therefore won't get unmapped or accidentally written to ...

-Eric

> Are we agreed both of these should probably be pushed to Linus for
> 2.6.30?
>
> - Ted

2009-04-29 16:52:19

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH -V4 2/2] ext4: Use -1 as the fake block number for delayed new buffer_head

On Wed, Apr 29, 2009 at 10:37:34AM -0500, Eric Sandeen wrote:
> Theodore Tso wrote:
> > On Wed, Apr 29, 2009 at 10:17:21AM +0530, Aneesh Kumar K.V wrote:
> >> Block number '0' should not be used as the fake block number for
> >> the delayed new buffer. This will result in vfs calling umap_underlying_metadata for
> >> block number '0'. So use -1 instead.
> >
> > sector_t is an unsigned type, so we probably want to use ~0 instead of
> > -1. I can fix this up before we apply into the patch queue.
>
> I don't think that helps. The point is to have a block number which is
> invalid, therefore won't get unmapped or accidentally written to ...

This is more of a type-safety thing to eliminate compiler warnings.
We could use something like s_blocks_count instead, which has less
chance of wrapping, but by the time we get to the bh level, the risk
of wrapping should be minimal, and ~0 (or -1) is more distinctive when
debugging/tracing.

- Ted

2009-04-29 17:01:25

by Eric Sandeen

[permalink] [raw]

Subject: Re: [PATCH -V4 2/2] ext4: Use -1 as the fake block number for delayed new buffer_head

Theodore Tso wrote:
> On Wed, Apr 29, 2009 at 10:37:34AM -0500, Eric Sandeen wrote:
>> Theodore Tso wrote:
>>> On Wed, Apr 29, 2009 at 10:17:21AM +0530, Aneesh Kumar K.V wrote:
>>>> Block number '0' should not be used as the fake block number for
>>>> the delayed new buffer. This will result in vfs calling umap_underlying_metadata for
>>>> block number '0'. So use -1 instead.
>>> sector_t is an unsigned type, so we probably want to use ~0 instead of
>>> -1. I can fix this up before we apply into the patch queue.
>> I don't think that helps. The point is to have a block number which is
>> invalid, therefore won't get unmapped or accidentally written to ...
>
> This is more of a type-safety thing to eliminate compiler warnings.
> We could use something like s_blocks_count instead, which has less
> chance of wrapping, but by the time we get to the bh level, the risk
> of wrapping should be minimal, and ~0 (or -1) is more distinctive when
> debugging/tracing.

I'm sorry. Poor choice of fonts, or something, I read "-0" not "~0" and
wondered what on earth you were thinking. ;)

-Eric

2009-04-29 17:28:45

by Mingming Cao

[permalink] [raw]

Subject: Re: [PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten extents

On Wed, 2009-04-29 at 08:59 -0500, Eric Sandeen wrote:
> Aneesh Kumar K.V wrote:
> > We need to mark the buffer_head mapping prealloc space
> > as new during write_begin. Otherwise we don't zero out the
> > page cache content properly for a partial write. This will
> > cause file corruption with preallocation.
> >
> > Also use block number -1 as the fake block number so that
> > unmap_underlying_metadata doesn't drop wrong buffer_head
> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> >
> > ---
> > fs/ext4/inode.c | 10 ++++++++++
> > 1 files changed, 10 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e91f978..12dcfab 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -2323,6 +2323,16 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> > set_buffer_delay(bh_result);
> > } else if (ret > 0) {
> > bh_result->b_size = (ret << inode->i_blkbits);
> > + /*
> > + * With sub-block writes into unwritten extents
> > + * we also need to mark the buffer as new so that
> > + * the unwritten parts of the buffer gets correctly zeroed.
> > + */
> > + if (buffer_unwritten(bh_result)) {
> > + bh_result->b_bdev = inode->i_sb->s_bdev;
> > + set_buffer_new(bh_result);
> > + bh_result->b_blocknr = -1;
> > + }
> > ret = 0;
> > }
> >
>
> Ok, I guess this seems like the safest approach. Long term we should
> look really hard at the state & block nr of these buffer heads, but I
> agree that keeping the changes restricted to the preallocation path for
> now is safest.
>

This path (ret >0) this is the path where get_blocks() find the block
allocated or preallocated. The buffer_unwritten() is strict to the
preallocation case, but why not take care of the buffer_new() when we
set the buffer_unwritten() for preallocation in ext4_ext_get_blocks()
at the first place? That makes the "preallocation" case handling there
all together.

But both patch is correct, I have tested the prealloc,
prealloc->paritial write, prealloc->paritial long
write->partial-short-write, the content of the afterward read seems all
sane in both patch.

Any thoughts about the comments update I made in my previous patch? This
part of comment in preallocation handling in ext4_ext_get_blocks()
needs some cleanup.

Think this over, if we set the buffer new here(i.e. in the write_begin()
path), I wonder about the read case: where do we set the buffer_new()
for the read on preallocated space? the ext4_ext_get_blocks() with
create = 0 on preallocated extent will return bh unwritten, but not new.
However my read tests right after new preallocation returns all zeroed
data. I wonder what I am missing.

Mingming
> -Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2009-05-04 08:54:27

by Aneesh Kumar K.V

[permalink] [raw]

Subject: Re: [PATCH -V4 2/2] ext4: Use -1 as the fake block number for delayed new buffer_head

On Wed, Apr 29, 2009 at 11:35:21AM -0400, Theodore Tso wrote:
> On Wed, Apr 29, 2009 at 10:17:21AM +0530, Aneesh Kumar K.V wrote:
> > Block number '0' should not be used as the fake block number for
> > the delayed new buffer. This will result in vfs calling umap_underlying_metadata for
> > block number '0'. So use -1 instead.
>
> sector_t is an unsigned type, so we probably want to use ~0 instead of
> -1. I can fix this up before we apply into the patch queue.
>
> Are we agreed both of these should probably be pushed to Linus for
> 2.6.30?
>

With ABAT I am seeing the below error during fsstress run.

EXT4-fs: mounted filesystem sdb1 with ordered data mode
attempt to access beyond end of device
sdb1: rw=1, want=0, limit=136713087
Buffer I/O error on device sdb1, logical block 18446744073709551615
lost page write due to I/O error on sdb1
attempt to access beyond end of device
sdb1: rw=1, want=0, limit=136713087
Buffer I/O error on device sdb1, logical block 18446744073709551615
lost page write due to I/O error on sdb1
attempt to access beyond end of device
sdb1: rw=1, want=0, limit=136713087
Buffer I/O error on device sdb1, logical block 18446744073709551615
lost page write due to I/O error on sdb1
attempt to access beyond end of device
sdb1: rw=1, want=0, limit=136713087

That is the logical block number ~0. Well that would imply we are trying
to write the buffer_head which is marked delay.

-aneesh

2009-05-04 15:06:25

by Eric Sandeen

[permalink] [raw]

Subject: Re: [PATCH -V4 2/2] ext4: Use -1 as the fake block number for delayed new buffer_head

Aneesh Kumar K.V wrote:
> On Wed, Apr 29, 2009 at 11:35:21AM -0400, Theodore Tso wrote:
>> On Wed, Apr 29, 2009 at 10:17:21AM +0530, Aneesh Kumar K.V wrote:
>>> Block number '0' should not be used as the fake block number for
>>> the delayed new buffer. This will result in vfs calling umap_underlying_metadata for
>>> block number '0'. So use -1 instead.
>> sector_t is an unsigned type, so we probably want to use ~0 instead of
>> -1. I can fix this up before we apply into the patch queue.
>>
>> Are we agreed both of these should probably be pushed to Linus for
>> 2.6.30?
>>
>
> With ABAT I am seeing the below error during fsstress run.
>
> EXT4-fs: mounted filesystem sdb1 with ordered data mode
> attempt to access beyond end of device
> sdb1: rw=1, want=0, limit=136713087
> Buffer I/O error on device sdb1, logical block 18446744073709551615

Ok, I think this is actually good. Looks like we are leaking
uninitialized delalloc buffer heads... this may well explain some of the
corruptions we've seen. So now .... what's going on ... :)

-Eric

2009-05-12 02:42:30

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten extents

On Wed, Apr 29, 2009 at 10:17:20AM +0530, Aneesh Kumar K.V wrote:
> We need to mark the buffer_head mapping prealloc space
> as new during write_begin. Otherwise we don't zero out the
> page cache content properly for a partial write. This will
> cause file corruption with preallocation.
>
> Also use block number -1 as the fake block number so that
> unmap_underlying_metadata doesn't drop wrong buffer_head

The buffer_head code is starting to scare me more and more.

I'm looking at this code again and I can't figure out why it's safe
(or why we would need to) put in an invalid number into
bh_result->b_blocknr:

> @@ -2323,6 +2323,16 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> set_buffer_delay(bh_result);
> } else if (ret > 0) {
> bh_result->b_size = (ret << inode->i_blkbits);
> + /*
> + * With sub-block writes into unwritten extents
> + * we also need to mark the buffer as new so that
> + * the unwritten parts of the buffer gets correctly zeroed.
> + */
> + if (buffer_unwritten(bh_result)) {
> + bh_result->b_bdev = inode->i_sb->s_bdev;
> + set_buffer_new(bh_result);
> + bh_result->b_blocknr = -1;

Why do we need to avoid calling unmap_underlying_metadata()?

And after the buffer is zero'ed out, it leaves b_blocknr in a
buffer_head attached to the page at an invalid block number. Doesn't
that get us in trouble later on?

I see that this line is removed later on in the for-2.6.31 patch "Mark
the unwritten buffer_head as mapped during write_begin". But is it
safe for 2.6.30?

- Ted

2009-05-12 03:37:47

by Eric Sandeen

[permalink] [raw]

Subject: Re: [PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten extents

Theodore Tso wrote:
> On Wed, Apr 29, 2009 at 10:17:20AM +0530, Aneesh Kumar K.V wrote:
>> We need to mark the buffer_head mapping prealloc space
>> as new during write_begin. Otherwise we don't zero out the
>> page cache content properly for a partial write. This will
>> cause file corruption with preallocation.
>>
>> Also use block number -1 as the fake block number so that
>> unmap_underlying_metadata doesn't drop wrong buffer_head
>
> The buffer_head code is starting to scare me more and more.
>
> I'm looking at this code again and I can't figure out why it's safe
> (or why we would need to) put in an invalid number into
> bh_result->b_blocknr:

I don't know for sure why it should be invalid; I think a preallocated
block, since it has an *actual* *block* *allocated* after all, should
have that block number. But if it's going to be fake, let's not use a
"real" one like the superblock location...

A real block nr does eventually get assigned when we do getblock with
create=1 AFAICT.

>> @@ -2323,6 +2323,16 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
>> set_buffer_delay(bh_result);
>> } else if (ret > 0) {
>> bh_result->b_size = (ret << inode->i_blkbits);
>> + /*
>> + * With sub-block writes into unwritten extents
>> + * we also need to mark the buffer as new so that
>> + * the unwritten parts of the buffer gets correctly zeroed.
>> + */
>> + if (buffer_unwritten(bh_result)) {
>> + bh_result->b_bdev = inode->i_sb->s_bdev;
>> + set_buffer_new(bh_result);
>> + bh_result->b_blocknr = -1;
>
> Why do we need to avoid calling unmap_underlying_metadata()?

For that matter, why do we call unmap_underlying_metadata at all, ever?

> And after the buffer is zero'ed out, it leaves b_blocknr in a
> buffer_head attached to the page at an invalid block number. Doesn't
> that get us in trouble later on?
>
> I see that this line is removed later on in the for-2.6.31 patch "Mark
> the unwritten buffer_head as mapped during write_begin". But is it
> safe for 2.6.30?

I have this in F11 now, but it's giving me the heebie-jeebies still. At
least it's confined to preallocation (one of the great new ext4 features
I've been promoting recently... :)

-Eric

2009-05-12 15:16:59

by Aneesh Kumar K.V

[permalink] [raw]

Subject: [PATCH -V5] Fix sub-block zeroing for buffered writes intounwritten extents

ext3: Fix sub-block zeroing for writes into preallocated extents

From: Aneesh Kumar K.V <[email protected]>

We need to mark the buffer_head mapping preallocated space as new
during write_begin. Otherwise we don't zero out the page cache content
properly for a partial write. This will cause file corruption with
preallocation.

Now that we mark the buffer_head new we also need to have a valid
buffer_head blocknr so that unmap_underlying_metadata unmap the
right block.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/extents.c | 2 ++
fs/ext4/inode.c | 8 ++++++++
2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e403321..c3768cd 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2875,6 +2875,8 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
if (allocated > max_blocks)
allocated = max_blocks;
set_buffer_unwritten(bh_result);
+ bh_result->b_bdev = inode->i_sb->s_bdev;
+ bh_result->b_blocknr = newblock;
goto out2;
}

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e91f978..498cf8b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2323,6 +2323,14 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
set_buffer_delay(bh_result);
} else if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
+ /*
+ * With sub-block writes into unwritten extents
+ * we also need to mark the buffer as new so that
+ * the unwritten parts of the buffer gets correctly zeroed.
+ */
+ if (buffer_unwritten(bh_result)) {
+ set_buffer_new(bh_result);
+ }
ret = 0;
}

2009-05-12 15:18:06

by Aneesh Kumar K.V

[permalink] [raw]

Subject: [PATCH -V5] ext4: Use -1 as the fake block number for delayed new buffer_head

ext4: Use a fake block number for delayed new buffer_head

From: Aneesh Kumar K.V <[email protected]>

Use a very large unsigned number (~0xffff) as as the fake block number
for the delayed new buffer. The VFS should never try to write out this
number, but if it does, this will make it obvious.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/inode.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 498cf8b..4b1478c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2297,6 +2297,10 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
int ret = 0;
+ sector_t invalid_block = ~((sector_t) 0xffff);
+
+ if (invalid_block < ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es))
+ invalid_block = ~0;

BUG_ON(create == 0);
BUG_ON(bh_result->b_size != inode->i_sb->s_blocksize);
@@ -2318,7 +2322,7 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
/* not enough space to reserve */
return ret;

- map_bh(bh_result, inode->i_sb, 0);
+ map_bh(bh_result, inode->i_sb, invalid_block);
set_buffer_new(bh_result);
set_buffer_delay(bh_result);
} else if (ret > 0) {