2009-04-27 19:05:49

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

We need to mark the buffer_head mapping prealloc space
as new during write_begin. Otherwise we don't zero out the
page cache content properly for a partial write. This will
cause file corruption with preallocation.

Signed-off-by: Aneesh Kumar K.V <[email protected]>

---
fs/ext4/inode.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c6bd6ce..c7251ec 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2323,6 +2323,8 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
set_buffer_delay(bh_result);
} else if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
+ if (buffer_unwritten(bh_result))
+ set_buffer_new(bh_result);
ret = 0;
}

--
tg: (51bec03..) preallocate_corruption (depends on: ext4_lock_group_conversion)


2009-04-27 19:30:16

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

Aneesh Kumar K.V wrote:
> We need to mark the buffer_head mapping prealloc space
> as new during write_begin. Otherwise we don't zero out the
> page cache content properly for a partial write. This will
> cause file corruption with preallocation.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> ---
> fs/ext4/inode.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c6bd6ce..c7251ec 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2323,6 +2323,8 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> set_buffer_delay(bh_result);
> } else if (ret > 0) {
> bh_result->b_size = (ret << inode->i_blkbits);
> + if (buffer_unwritten(bh_result))
> + set_buffer_new(bh_result);
> ret = 0;
> }
>

Hm, yep, that sure does break things. For some future regression test:

[root@inode tmp]# /root/fallocate -l 8k testfile
[root@inode tmp]# dd if=/dev/zero of=testfile bs=1 count=10 conv=notrunc
10+0 records in
10+0 records out
10 bytes (10 B) copied, 5.1491e-05 s, 194 kB/s
[root@inode tmp]# hexdump -C testfile

<much garbage ensues>

This looks pretty reasonable; Aneesh & I talked online and found that
xfs has a somewhat similar fix:

commit 549054afadae44889c0b40d4c3bfb0207b98d5a0
Author: David Chinner <[email protected]>
Date: Sat Feb 10 18:36:35 2007 +1100

[XFS] Fix sub-block zeroing for buffered writes into unwritten extents.

When writing less than a filesystem block of data into an unwritten
extent
via buffered I/O, __xfs_get_blocks fails to set the buffer new flag.
As a
result, the generic code will not zero either edge of the block
resulting
in garbage being written to disk either side of the real data. Set the
buffer new state on bufferd writes to unwritten extents to ensure that
zeroing occurs.

-Eric

2009-04-27 23:04:58

by Mingming Cao

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation


在 2009-04-28二的 00:35 +0530,Aneesh Kumar K.V写道:
> We need to mark the buffer_head mapping prealloc space
> as new during write_begin. Otherwise we don't zero out the
> page cache content properly for a partial write. This will
> cause file corruption with preallocation.
>

> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> ---
> fs/ext4/inode.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c6bd6ce..c7251ec 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2323,6 +2323,8 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> set_buffer_delay(bh_result);
> } else if (ret > 0) {
> bh_result->b_size = (ret << inode->i_blkbits);
> + if (buffer_unwritten(bh_result))
> + set_buffer_new(bh_result);
> ret = 0;
> }
>

Thanks Aneesh.

Just to share with list, I have seen garbage content show up on a
preallocated but later partially written blocks. This only happens with
delayed allocation. The test simply preallocate 2blocks to a new file,
then write a few bytes to the beginning of file(less than a block), and
od shows the first block the written content followed by garbage filled
to the end of the first block.

After examing the code, we did set the buffer as new for nondelalloc, as
the create flag passed to ext4_ext_get_blocks() is 1, while for delalloc
case, ext4_get_blocks_prep() calling ext4_ext_get_block() with create
=0, which leads to the code path that forget to set the bh as new if the
block is preallocated.

This patch is mostly correct except forget to set the bh_result->bdev,
which caused the fs blow out.

The updated patch fixed the problem for me.

Signed-off-by: Mingming Cao <[email protected]>

Index: linux-2.6.28-rc6/fs/ext4/inode.c
===================================================================
--- linux-2.6.28-rc6.orig/fs/ext4/inode.c 2009-03-12 10:21:05.000000000 -0700
+++ linux-2.6.28-rc6/fs/ext4/inode.c 2009-04-27 14:35:21.000000000 -0700
@@ -2177,7 +2177,10 @@ static int ext4_da_get_block_prep(struct
set_buffer_new(bh_result);
set_buffer_delay(bh_result);
} else if (ret > 0) {
+ if (buffer_unwritten(bh_result))
+ set_buffer_new(bh_result);
bh_result->b_size = (ret << inode->i_blkbits);
+ bh_result->b_bdev = inode->i_sb->s_bdev;
ret = 0;
}



2009-04-28 03:03:39

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

Mingming Cao wrote:
> 在 2009-04-28二的 00:35 +0530,Aneesh Kumar K.V写道:
>> We need to mark the buffer_head mapping prealloc space
>> as new during write_begin. Otherwise we don't zero out the
>> page cache content properly for a partial write. This will
>> cause file corruption with preallocation.
>>
>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>>
>> ---
>> fs/ext4/inode.c | 2 ++
>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index c6bd6ce..c7251ec 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2323,6 +2323,8 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
>> set_buffer_delay(bh_result);
>> } else if (ret > 0) {
>> bh_result->b_size = (ret << inode->i_blkbits);
>> + if (buffer_unwritten(bh_result))
>> + set_buffer_new(bh_result);
>> ret = 0;
>> }
>>
>
> Thanks Aneesh.
>
> Just to share with list, I have seen garbage content show up on a
> preallocated but later partially written blocks. This only happens with
> delayed allocation. The test simply preallocate 2blocks to a new file,
> then write a few bytes to the beginning of file(less than a block), and
> od shows the first block the written content followed by garbage filled
> to the end of the first block.
>
> After examing the code, we did set the buffer as new for nondelalloc, as
> the create flag passed to ext4_ext_get_blocks() is 1, while for delalloc
> case, ext4_get_blocks_prep() calling ext4_ext_get_block() with create
> =0, which leads to the code path that forget to set the bh as new if the
> block is preallocated.
>
> This patch is mostly correct except forget to set the bh_result->bdev,
> which caused the fs blow out.

Yep, I saw the oops too.

> The updated patch fixed the problem for me.
>
> Signed-off-by: Mingming Cao <[email protected]>
>
> Index: linux-2.6.28-rc6/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.28-rc6.orig/fs/ext4/inode.c 2009-03-12 10:21:05.000000000 -0700
> +++ linux-2.6.28-rc6/fs/ext4/inode.c 2009-04-27 14:35:21.000000000 -0700
> @@ -2177,7 +2177,10 @@ static int ext4_da_get_block_prep(struct
> set_buffer_new(bh_result);
> set_buffer_delay(bh_result);
> } else if (ret > 0) {
> + if (buffer_unwritten(bh_result))
> + set_buffer_new(bh_result);
> bh_result->b_size = (ret << inode->i_blkbits);
> + bh_result->b_bdev = inode->i_sb->s_bdev;
> ret = 0;
> }

It may be just me, but I'd like to sort out why we now need to set
b_bdev here just because we set it as new, and it wasn't necessary
before...? If it's obvious I'm not yet seeing it :)

-Eric

2009-04-28 04:21:13

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

On Mon, Apr 27, 2009 at 04:04:54PM -0700, Mingming Cao wrote:
.....

>
> Index: linux-2.6.28-rc6/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.28-rc6.orig/fs/ext4/inode.c 2009-03-12 10:21:05.000000000 -0700
> +++ linux-2.6.28-rc6/fs/ext4/inode.c 2009-04-27 14:35:21.000000000 -0700
> @@ -2177,7 +2177,10 @@ static int ext4_da_get_block_prep(struct
> set_buffer_new(bh_result);
> set_buffer_delay(bh_result);
> } else if (ret > 0) {
> + if (buffer_unwritten(bh_result))
> + set_buffer_new(bh_result);
> bh_result->b_size = (ret << inode->i_blkbits);
> + bh_result->b_bdev = inode->i_sb->s_bdev;


Updated patch to set bh_result->b_dev. I also added comments in the
source to explain whey we need to mark buffer_head new. Also updated
single line patch summary. I will send the update (-v2) patch.

-aneesh

2009-04-28 09:31:54

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

On Tue, Apr 28, 2009 at 09:50:49AM +0530, Aneesh Kumar K.V wrote:
> On Mon, Apr 27, 2009 at 04:04:54PM -0700, Mingming Cao wrote:
> .....
>
> >
> > Index: linux-2.6.28-rc6/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.28-rc6.orig/fs/ext4/inode.c 2009-03-12 10:21:05.000000000 -0700
> > +++ linux-2.6.28-rc6/fs/ext4/inode.c 2009-04-27 14:35:21.000000000 -0700
> > @@ -2177,7 +2177,10 @@ static int ext4_da_get_block_prep(struct
> > set_buffer_new(bh_result);
> > set_buffer_delay(bh_result);
> > } else if (ret > 0) {
> > + if (buffer_unwritten(bh_result))
> > + set_buffer_new(bh_result);
> > bh_result->b_size = (ret << inode->i_blkbits);
> > + bh_result->b_bdev = inode->i_sb->s_bdev;
>
>
> Updated patch to set bh_result->b_dev. I also added comments in the
> source to explain whey we need to mark buffer_head new. Also updated
> single line patch summary. I will send the update (-v2) patch.

Looking at the source again i guess setting just b_dev is not enough.
unmap_underlying_metadata looks at the mapping block number, which we
don't have in case on unwritten buffer_head. How about the below patch ?
It involve vfs changes. But i guess it is correct with respect to the
meaning of BH_New (Disk mapping was newly created by get_block). I guess
BH_New implies BH_Mapped.

I haven't tested the patch yet. Also it should be split into multiple
patches. It also a fix a problem where we missed an
unamp_underlying_metadata in case of delayed allocated blocks. I guess
that can also cause corruption with delayed allocation.


From: Aneesh Kumar K.V <[email protected]>
Subject: [PATCH -V3] ext4: Fix sub-block zeroing for buffered writes into unwritten extents.

We need to mark the buffer_head mapping prealloc space
as new during write_begin. Otherwise we don't zero out the
page cache content properly for a partial write. This will
cause file corruption with preallocation.

Signed-off-by: Aneesh Kumar K.V <[email protected]>

---
fs/buffer.c | 9 ++++++++-
fs/ext4/inode.c | 8 +++++---
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7..13f0d52 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1867,15 +1867,22 @@ static int __block_prepare_write(struct inode *inode, struct page *page,
err = get_block(inode, block, bh, 1);
if (err)
break;
- if (buffer_new(bh)) {
+ if (buffer_new(bh))
unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr);
+ if (buffer_new(bh) || buffer_unwritten(bh) ||
+ buffer_delay(bh)) {
if (PageUptodate(page)) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
continue;
}
+ /*
+ * sub-block writes into unwritten or
+ * delayed buffer should result in zero out
+ * of the rest of the buffer
+ */
if (block_end > to || block_start < from)
zero_user_segments(page,
to, block_end,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e91f978..504afb7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1892,13 +1892,17 @@ static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, sector_t logical,
if (buffer_delay(bh)) {
bh->b_blocknr = pblock;
clear_buffer_delay(bh);
+ set_buffer_mapped(bh);
bh->b_bdev = inode->i_sb->s_bdev;
+ unmap_underlying_metadata(bh->b_bdev,
+ pblock);
} else if (buffer_unwritten(bh)) {
bh->b_blocknr = pblock;
clear_buffer_unwritten(bh);
set_buffer_mapped(bh);
- set_buffer_new(bh);
bh->b_bdev = inode->i_sb->s_bdev;
+ unmap_underlying_metadata(bh->b_bdev,
+ pblock);
} else if (buffer_mapped(bh))
BUG_ON(bh->b_blocknr != pblock);

@@ -2318,8 +2322,6 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
/* not enough space to reserve */
return ret;

- map_bh(bh_result, inode->i_sb, 0);
- set_buffer_new(bh_result);
set_buffer_delay(bh_result);
} else if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
--
tg: (2084a87..) preallocate_corruption (depends on: ext4_lock_group_conversion)


2009-04-28 12:48:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
>
> Looking at the source again i guess setting just b_dev is not enough.
> unmap_underlying_metadata looks at the mapping block number, which we
> don't have in case on unwritten buffer_head. How about the below patch ?
> It involve vfs changes. But i guess it is correct with respect to the
> meaning of BH_New (Disk mapping was newly created by get_block). I guess
> BH_New implies BH_Mapped.

Argh. So we have multiple problems going on here. One is the
original problem, namely that of a partial write into an preallocated
block can leave garbage behind in that unitialized block.

The other problem seems to be in the case of a delayed allocation
write, where we return a buffer_head which is marked new, and this
causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).

In theory this could cause problems if we try installing a new
bootloader in the filesystem's boot block while there's a delayed
writes happening in the background, since we could end up discarding
the write to the boot sector. We've lived with this for quite a wihle
though.

My concern with making the fs/buffer.c changes is that we need to make
sure it doesn't break any of the other filesystems, so that's going to
make it hard to try to slip this with 2.6.30-rc4 nearly upon us.
(Silly question; why doesn't XFS get caught by this?)

So the question is do we try to fix both bugs with one patch, and very
likely have to wait until 2.6.31 before the patch is incorporated? Or
do we fix the second bug using an ext4-only fix, with the knowledge
that post 2.6.30, we'll need undo most of it and fix it properly with
a change that involves fs/buffer.c?

My preference is for the former, unless we belive the 2nd bug is
serious enough that we really need to address it ASAP (in which case
we have a lot of work ahead of us in terms of coordinating with the
other filesystem developers). What do other folks think?

- Ted

2009-04-28 16:36:04

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote:
> On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
> >
> > Looking at the source again i guess setting just b_dev is not enough.
> > unmap_underlying_metadata looks at the mapping block number, which we
> > don't have in case on unwritten buffer_head. How about the below patch ?
> > It involve vfs changes. But i guess it is correct with respect to the
> > meaning of BH_New (Disk mapping was newly created by get_block). I guess
> > BH_New implies BH_Mapped.
>
> Argh. So we have multiple problems going on here. One is the
> original problem, namely that of a partial write into an preallocated
> block can leave garbage behind in that unitialized block.
>
> The other problem seems to be in the case of a delayed allocation
> write, where we return a buffer_head which is marked new, and this
> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).

Not just that. On block allocation we are not calling
unmap_underlying_metadata(dev, blocknumber) for delayed allocated
blocks. That would imply file corruption.

>
> In theory this could cause problems if we try installing a new
> bootloader in the filesystem's boot block while there's a delayed
> writes happening in the background, since we could end up discarding
> the write to the boot sector. We've lived with this for quite a wihle
> though.
>
> My concern with making the fs/buffer.c changes is that we need to make
> sure it doesn't break any of the other filesystems, so that's going to
> make it hard to try to slip this with 2.6.30-rc4 nearly upon us.
> (Silly question; why doesn't XFS get caught by this?)
>
> So the question is do we try to fix both bugs with one patch, and very
> likely have to wait until 2.6.31 before the patch is incorporated? Or
> do we fix the second bug using an ext4-only fix, with the knowledge
> that post 2.6.30, we'll need undo most of it and fix it properly with
> a change that involves fs/buffer.c?
>
> My preference is for the former, unless we belive the 2nd bug is
> serious enough that we really need to address it ASAP (in which case
> we have a lot of work ahead of us in terms of coordinating with the
> other filesystem developers). What do other folks think?

The original reported problem is something really easy to reproduce. So
i guess if we can have a ext4 local change that would fix the original
problem that would be good. Considering that map_bh(bdev, 0) didn't
create any issues till now, what we can do is to do a similar update
for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version
of the patch with the below addition
bh_result->b_blocknr = 0;

-aneesh

2009-04-28 16:38:05

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

Theodore Tso wrote:
> On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
>> Looking at the source again i guess setting just b_dev is not enough.
>> unmap_underlying_metadata looks at the mapping block number, which we
>> don't have in case on unwritten buffer_head. How about the below patch ?
>> It involve vfs changes. But i guess it is correct with respect to the
>> meaning of BH_New (Disk mapping was newly created by get_block). I guess
>> BH_New implies BH_Mapped.
>
> Argh. So we have multiple problems going on here. One is the
> original problem, namely that of a partial write into an preallocated
> block can leave garbage behind in that unitialized block.
>
> The other problem seems to be in the case of a delayed allocation
> write, where we return a buffer_head which is marked new, and this
> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
>
> In theory this could cause problems if we try installing a new
> bootloader in the filesystem's boot block while there's a delayed
> writes happening in the background, since we could end up discarding
> the write to the boot sector. We've lived with this for quite a wihle
> though.
>
> My concern with making the fs/buffer.c changes is that we need to make
> sure it doesn't break any of the other filesystems, so that's going to
> make it hard to try to slip this with 2.6.30-rc4 nearly upon us.
> (Silly question; why doesn't XFS get caught by this?)

I'm not sure offhand. All xfs does is this in the get_block path:

* With sub-block writes into unwritten extents we also need to mark
* the buffer as new so that the unwritten parts of the buffer gets
* correctly zeroed.
*/
if (create &&
((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
(offset >= i_size_read(inode)) ||
(iomap.iomap_flags & (IOMAP_NEW|IOMAP_UNWRITTEN))))
set_buffer_new(bh_result);

so it returns with BH_New as well.

> So the question is do we try to fix both bugs with one patch, and very
> likely have to wait until 2.6.31 before the patch is incorporated? Or
> do we fix the second bug using an ext4-only fix, with the knowledge
> that post 2.6.30, we'll need undo most of it and fix it properly with
> a change that involves fs/buffer.c?

I have the sense that this might need a bit more digging around, and I
finally got stuff out of the way to do so :)

-Eric


2009-04-28 17:00:58

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
> On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote:
> > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
> > >
> > > Looking at the source again i guess setting just b_dev is not enough.
> > > unmap_underlying_metadata looks at the mapping block number, which we
> > > don't have in case on unwritten buffer_head. How about the below patch ?
> > > It involve vfs changes. But i guess it is correct with respect to the
> > > meaning of BH_New (Disk mapping was newly created by get_block). I guess
> > > BH_New implies BH_Mapped.
> >
> > Argh. So we have multiple problems going on here. One is the
> > original problem, namely that of a partial write into an preallocated
> > block can leave garbage behind in that unitialized block.
> >
> > The other problem seems to be in the case of a delayed allocation
> > write, where we return a buffer_head which is marked new, and this
> > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
>
> Not just that. On block allocation we are not calling
> unmap_underlying_metadata(dev, blocknumber) for delayed allocated
> blocks. That would imply file corruption.

I don't think I'm following you . If we write into block that was
delayed allocated. Are you saying we might get in trouble of the
delayed allocation block is mmap'ed in?

> The original reported problem is something really easy to reproduce. So
> i guess if we can have a ext4 local change that would fix the original
> problem that would be good. Considering that map_bh(bdev, 0) didn't
> create any issues till now, what we can do is to do a similar update
> for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version
> of the patch with the below addition
> bh_result->b_blocknr = 0;

OK, I can put togehter a patch to do this. Whatever we do, I think
we're going to need a *lot* of testing.

- Ted

2009-04-28 18:58:01

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote:
> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
> > On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote:
> > > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
> > > >
> > > > Looking at the source again i guess setting just b_dev is not enough.
> > > > unmap_underlying_metadata looks at the mapping block number, which we
> > > > don't have in case on unwritten buffer_head. How about the below patch ?
> > > > It involve vfs changes. But i guess it is correct with respect to the
> > > > meaning of BH_New (Disk mapping was newly created by get_block). I guess
> > > > BH_New implies BH_Mapped.
> > >
> > > Argh. So we have multiple problems going on here. One is the
> > > original problem, namely that of a partial write into an preallocated
> > > block can leave garbage behind in that unitialized block.
> > >
> > > The other problem seems to be in the case of a delayed allocation
> > > write, where we return a buffer_head which is marked new, and this
> > > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
> >
> > Not just that. On block allocation we are not calling
> > unmap_underlying_metadata(dev, blocknumber) for delayed allocated
> > blocks. That would imply file corruption.
>
> I don't think I'm following you . If we write into block that was
> delayed allocated. Are you saying we might get in trouble of the
> delayed allocation block is mmap'ed in?

We allocate blocks for delayed buffer during writepage. Now we need to
make sure after getting the blocks we drop the old buffer_head mapping
that we may have with this particular block attached to the block
device. That is done by calling unmap_underlying_metadata. Now the
current code doesn't call unmap_underlying_metadata for delayed
allocated blocks. That would mean we can see corrupt files if old
buffer_head mapping gets synced to disk AFTER we write the new
buffer_head mapping.

>
> > The original reported problem is something really easy to reproduce. So
> > i guess if we can have a ext4 local change that would fix the original
> > problem that would be good. Considering that map_bh(bdev, 0) didn't
> > create any issues till now, what we can do is to do a similar update
> > for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version
> > of the patch with the below addition
> > bh_result->b_blocknr = 0;
>
> OK, I can put togehter a patch to do this. Whatever we do, I think
> we're going to need a *lot* of testing.

I just sent -v3 version

-aneesh

2009-04-28 19:36:03

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

Aneesh Kumar K.V wrote:
> On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote:
>> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
...

>>>> The other problem seems to be in the case of a delayed allocation
>>>> write, where we return a buffer_head which is marked new, and this
>>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
>>> Not just that. On block allocation we are not calling
>>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated
>>> blocks. That would imply file corruption.
>> I don't think I'm following you . If we write into block that was
>> delayed allocated. Are you saying we might get in trouble of the
>> delayed allocation block is mmap'ed in?
>
> We allocate blocks for delayed buffer during writepage. Now we need to
> make sure after getting the blocks we drop the old buffer_head mapping
> that we may have with this particular block attached to the block
> device. That is done by calling unmap_underlying_metadata. Now the
> current code doesn't call unmap_underlying_metadata for delayed
> allocated blocks. That would mean we can see corrupt files if old
> buffer_head mapping gets synced to disk AFTER we write the new
> buffer_head mapping.


Talking w/ Aneesh on IRC, I don't see how we can have stray dirty
mappings lying around for this block device unless someone is writing
directly to the mounted block device, which I don't think is ever
considered safe ...

I'm not quite sure what the call to __unmap_underlying_blocks() in
mpage_da_map_blocks() is for, I guess?

-Eric

2009-04-29 01:38:33

by Mingming Cao

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation


On Tue, 2009-04-28 at 13:00 -0400, Theodore Tso wrote:
> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
> > On Tue, Apr 28, 2009 at 08:48:21AM -0400, Theodore Tso wrote:
> > > On Tue, Apr 28, 2009 at 03:01:45PM +0530, Aneesh Kumar K.V wrote:
> > > >
> > > > Looking at the source again i guess setting just b_dev is not enough.
> > > > unmap_underlying_metadata looks at the mapping block number, which we
> > > > don't have in case on unwritten buffer_head. How about the below patch ?
> > > > It involve vfs changes. But i guess it is correct with respect to the
> > > > meaning of BH_New (Disk mapping was newly created by get_block). I guess
> > > > BH_New implies BH_Mapped.
> > >
> > > Argh. So we have multiple problems going on here. One is the
> > > original problem, namely that of a partial write into an preallocated
> > > block can leave garbage behind in that unitialized block.
> > >
> > > The other problem seems to be in the case of a delayed allocation
> > > write, where we return a buffer_head which is marked new, and this
> > > causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
> >
> > Not just that. On block allocation we are not calling
> > unmap_underlying_metadata(dev, blocknumber) for delayed allocated
> > blocks. That would imply file corruption.
>
> I don't think I'm following you . If we write into block that was
> delayed allocated. Are you saying we might get in trouble of the
> delayed allocation block is mmap'ed in?
>
> > The original reported problem is something really easy to reproduce. So
> > i guess if we can have a ext4 local change that would fix the original
> > problem that would be good. Considering that map_bh(bdev, 0) didn't
> > create any issues till now, what we can do is to do a similar update
> > for unwritten_buffer in ext4_da_block_write_prep. That's the v2 version
> > of the patch with the below addition
> > bh_result->b_blocknr = 0;
>
> OK, I can put togehter a patch to do this. Whatever we do, I think
> we're going to need a *lot* of testing.
>
> - Ted

Aneesh, Eric and I discussed this online today, we find a separate
issue, the lookup on the preallocated extent doesn't set the
buffer_mapped(), so loop up/write to the same preallocated block
multiple times (e.g. write 1 byte at a time, for 10 bytes total) will
end up calling ext4_get_blocks_wrap() multiple times.

It seems reasonable to set the buffer mapped for preallocated buffer,
with blocknr set to the real mapped block number (rather than faked -1
for the buffer blocknr in the V3 proposed fix for partial write garbage
issue), and later reply on unwritten flag to force the
writepage()/mpage_da_map_blocks calls get_block() to do the unintialized
extent split. But this change seems require more thoughts and heavy
auditing, and not as urgency as the data corruption problem.

Mingming


2009-04-29 11:57:28

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

> Aneesh Kumar K.V wrote:
> > On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote:
> >> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
> ...
> >>>> The other problem seems to be in the case of a delayed allocation
> >>>> write, where we return a buffer_head which is marked new, and this
> >>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
> >>> Not just that. On block allocation we are not calling
> >>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated
> >>> blocks. That would imply file corruption.
> >> I don't think I'm following you . If we write into block that was
> >> delayed allocated. Are you saying we might get in trouble of the
> >> delayed allocation block is mmap'ed in?
> >
> > We allocate blocks for delayed buffer during writepage. Now we need to
> > make sure after getting the blocks we drop the old buffer_head mapping
> > that we may have with this particular block attached to the block
> > device. That is done by calling unmap_underlying_metadata. Now the
> > current code doesn't call unmap_underlying_metadata for delayed
> > allocated blocks. That would mean we can see corrupt files if old
> > buffer_head mapping gets synced to disk AFTER we write the new
> > buffer_head mapping.
>
>
> Talking w/ Aneesh on IRC, I don't see how we can have stray dirty
> mappings lying around for this block device unless someone is writing
> directly to the mounted block device, which I don't think is ever
> considered safe ...
>
> I'm not quite sure what the call to __unmap_underlying_blocks() in
> mpage_da_map_blocks() is for, I guess?
For ext3 / ext4 I think we don't need unmap_underlying_blocks() since
before we reallocate a block, we make sure that the transaction freeing
the block is committed and clear all dirty bits from freed blocks.
But for more careless filesystems, if they reallocate metadata block
as a data block and don't clear the dirty bit in blockdev mapping,
unmap_underlying_blocks() does it for them.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-04-29 14:08:17

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

Jan Kara wrote:
>> Aneesh Kumar K.V wrote:
>>> On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote:
>>>> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
>> ...
>>>>>> The other problem seems to be in the case of a delayed allocation
>>>>>> write, where we return a buffer_head which is marked new, and this
>>>>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
>>>>> Not just that. On block allocation we are not calling
>>>>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated
>>>>> blocks. That would imply file corruption.
>>>> I don't think I'm following you . If we write into block that was
>>>> delayed allocated. Are you saying we might get in trouble of the
>>>> delayed allocation block is mmap'ed in?
>>> We allocate blocks for delayed buffer during writepage. Now we need to
>>> make sure after getting the blocks we drop the old buffer_head mapping
>>> that we may have with this particular block attached to the block
>>> device. That is done by calling unmap_underlying_metadata. Now the
>>> current code doesn't call unmap_underlying_metadata for delayed
>>> allocated blocks. That would mean we can see corrupt files if old
>>> buffer_head mapping gets synced to disk AFTER we write the new
>>> buffer_head mapping.
>>
>> Talking w/ Aneesh on IRC, I don't see how we can have stray dirty
>> mappings lying around for this block device unless someone is writing
>> directly to the mounted block device, which I don't think is ever
>> considered safe ...
>>
>> I'm not quite sure what the call to __unmap_underlying_blocks() in
>> mpage_da_map_blocks() is for, I guess?
> For ext3 / ext4 I think we don't need unmap_underlying_blocks() since
> before we reallocate a block, we make sure that the transaction freeing
> the block is committed and clear all dirty bits from freed blocks.
> But for more careless filesystems, if they reallocate metadata block
> as a data block and don't clear the dirty bit in blockdev mapping,
> unmap_underlying_blocks() does it for them.

That's what I thought - so I was wondering why we have specific calls to
this in ext4:

mpage_da_map_blocks
__unmap_underlying_blocks
for (i = 0; i < blocks; i++)
unmap_underlying_metadata

-Eric

2009-04-29 18:13:23

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH] mark buffer_head mapping preallocate area as new during write_begin with delayed allocation

On Wed 29-04-09 09:08:05, Eric Sandeen wrote:
> Jan Kara wrote:
> >> Aneesh Kumar K.V wrote:
> >>> On Tue, Apr 28, 2009 at 01:00:47PM -0400, Theodore Tso wrote:
> >>>> On Tue, Apr 28, 2009 at 10:05:54PM +0530, Aneesh Kumar K.V wrote:
> >> ...
> >>>>>> The other problem seems to be in the case of a delayed allocation
> >>>>>> write, where we return a buffer_head which is marked new, and this
> >>>>>> causes block_prepare_write() to call unmap_underlying_metadata(dev, 0).
> >>>>> Not just that. On block allocation we are not calling
> >>>>> unmap_underlying_metadata(dev, blocknumber) for delayed allocated
> >>>>> blocks. That would imply file corruption.
> >>>> I don't think I'm following you . If we write into block that was
> >>>> delayed allocated. Are you saying we might get in trouble of the
> >>>> delayed allocation block is mmap'ed in?
> >>> We allocate blocks for delayed buffer during writepage. Now we need to
> >>> make sure after getting the blocks we drop the old buffer_head mapping
> >>> that we may have with this particular block attached to the block
> >>> device. That is done by calling unmap_underlying_metadata. Now the
> >>> current code doesn't call unmap_underlying_metadata for delayed
> >>> allocated blocks. That would mean we can see corrupt files if old
> >>> buffer_head mapping gets synced to disk AFTER we write the new
> >>> buffer_head mapping.
> >>
> >> Talking w/ Aneesh on IRC, I don't see how we can have stray dirty
> >> mappings lying around for this block device unless someone is writing
> >> directly to the mounted block device, which I don't think is ever
> >> considered safe ...
> >>
> >> I'm not quite sure what the call to __unmap_underlying_blocks() in
> >> mpage_da_map_blocks() is for, I guess?
> > For ext3 / ext4 I think we don't need unmap_underlying_blocks() since
> > before we reallocate a block, we make sure that the transaction freeing
> > the block is committed and clear all dirty bits from freed blocks.
> > But for more careless filesystems, if they reallocate metadata block
> > as a data block and don't clear the dirty bit in blockdev mapping,
> > unmap_underlying_blocks() does it for them.
>
> That's what I thought - so I was wondering why we have specific calls to
> this in ext4:
>
> mpage_da_map_blocks
> __unmap_underlying_blocks
> for (i = 0; i < blocks; i++)
> unmap_underlying_metadata
Hmm, OK. So maybe change it warn on dirty blockdev buffer and if the warning
does not trigger we can believe that our theory is right ;).

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR