2009-12-10 17:28:25

by Curt Wohlgemuth

[permalink] [raw]
Subject: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

This fixes a bug in which new blocks returned from an extent created with
ext4_ext_zeroout() can have dirty metadata still associated with them.

Signed-off-by: Curt Wohlgemuth <[email protected]>
---

This is for the problem I reported on 23 Nov ("Bug in extent zeroout: blocks
not marked as new"). I'm not seeing the corruption with this fix that I was
seeing without it.

diff -uprN orig/fs/ext4/extents.c new/fs/ext4/extents.c
--- orig/fs/ext4/extents.c 2009-12-09 15:09:25.000000000 -0800
+++ new/fs/ext4/extents.c 2009-12-09 15:09:37.000000000 -0800
@@ -2474,9 +2474,21 @@ static int ext4_ext_zeroout(struct inode
submit_bio(WRITE, bio);
wait_for_completion(&event);

- if (test_bit(BIO_UPTODATE, &bio->bi_flags))
+ /* On success, we need to insure all metadata associated
+ * with each of these blocks is unmapped. */
+ if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+ sector_t block = ee_pblock;
+
ret = 0;
- else {
+ done = 0;
+ while (done < len) {
+ unmap_underlying_metadata(inode->i_sb->s_bdev,
+ block);
+
+ done++;
+ block++;
+ }
+ } else {
ret = -EIO;
break;
}


2009-12-10 17:44:45

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

Curt Wohlgemuth wrote:
> This fixes a bug in which new blocks returned from an extent created with
> ext4_ext_zeroout() can have dirty metadata still associated with them.

Do you have a testcase or at least end result details for this corruption?

> Signed-off-by: Curt Wohlgemuth <[email protected]>
> ---
>
> This is for the problem I reported on 23 Nov ("Bug in extent zeroout: blocks
> not marked as new"). I'm not seeing the corruption with this fix that I was
> seeing without it.
>
> diff -uprN orig/fs/ext4/extents.c new/fs/ext4/extents.c
> --- orig/fs/ext4/extents.c 2009-12-09 15:09:25.000000000 -0800
> +++ new/fs/ext4/extents.c 2009-12-09 15:09:37.000000000 -0800
> @@ -2474,9 +2474,21 @@ static int ext4_ext_zeroout(struct inode
> submit_bio(WRITE, bio);
> wait_for_completion(&event);
>
> - if (test_bit(BIO_UPTODATE, &bio->bi_flags))
> + /* On success, we need to insure all metadata associated

nitpick, "ensure" I think, although I guess they're mostly synonymous
today so do with that what you will :)

-Eric

> + * with each of these blocks is unmapped. */
> + if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
> + sector_t block = ee_pblock;
> +
> ret = 0;
> - else {
> + done = 0;
> + while (done < len) {
> + unmap_underlying_metadata(inode->i_sb->s_bdev,
> + block);
> +
> + done++;
> + block++;
> + }
> + } else {
> ret = -EIO;
> break;
> }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2009-12-10 18:01:17

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

On Thu, Dec 10, 2009 at 9:44 AM, Eric Sandeen <[email protected]> wrote:
> Curt Wohlgemuth wrote:
>> This fixes a bug in which new blocks returned from an extent created with
>> ext4_ext_zeroout() can have dirty metadata still associated with them.
>
> Do you have a testcase or at least end result details for this corruption?

<sigh> I wish I had a testcase. I do know some facts about the
workload(s) involved:

1. No journal
2. Nearly full ext2 partition, mounted as ext4 -- tune2fs used to turn
on "extent dir_index"
3. Existing non-extent based files are removed, new (extent-based)
files are created.
4. Files created with O_DIRECT, ~8MB, fallocate(KEEP_SIZE) used,
writes are generally in increasing offset order, ~64K usually.
5. Some (~1%) writes are to holes in the fallocate'd file; we're not
yet using Mingming's patches, so these writes fall back to buffered
writes.
6. Lots of processes, lots of threads.
7. End result is a block (or sometimes 2 blocks) of all zeros, at a
block-aligned offset. This zero'ed block is *always* somewhere in a
14-block extent created from using ext4_ext_zeroout().

Lots and lots of tracing showed that these blocks were originally
dirtied metadata blocks from unlinked (hence truncated) non-extent
based files. These indirect blocks in ext4 have their direct block
pointers turned to zero as the blocks are truncated, and these
indirect blocks are marked as dirty. Even though bforget() is called
on these metadata blocks, bforget() won't wait on the buffer. Waiting
on the buffer is meant to happen at allocate time, for "new" buffers.

>
>> ? ? ? Signed-off-by: Curt Wohlgemuth <[email protected]>
>> ---
>>
>> This is for the problem I reported on 23 Nov ("Bug in extent zeroout: blocks
>> not marked as new"). ?I'm not seeing the corruption with this fix that I was
>> seeing without it.
>>
>> diff -uprN orig/fs/ext4/extents.c new/fs/ext4/extents.c
>> --- orig/fs/ext4/extents.c ? ?2009-12-09 15:09:25.000000000 -0800
>> +++ new/fs/ext4/extents.c ? ? 2009-12-09 15:09:37.000000000 -0800
>> @@ -2474,9 +2474,21 @@ static int ext4_ext_zeroout(struct inode
>> ? ? ? ? ? ? ? submit_bio(WRITE, bio);
>> ? ? ? ? ? ? ? wait_for_completion(&event);
>>
>> - ? ? ? ? ? ? if (test_bit(BIO_UPTODATE, &bio->bi_flags))
>> + ? ? ? ? ? ? /* On success, we need to insure all metadata associated
>
> nitpick, "ensure" I think, although I guess they're mostly synonymous
> today so do with that what you will :)

I'll let Ted decide :-)

>
> -Eric
>
>> + ? ? ? ? ? ? ?* with each of these blocks is unmapped. */
>> + ? ? ? ? ? ? if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
>> + ? ? ? ? ? ? ? ? ? ? sector_t block = ee_pblock;
>> +
>> ? ? ? ? ? ? ? ? ? ? ? ret = 0;
>> - ? ? ? ? ? ? else {
>> + ? ? ? ? ? ? ? ? ? ? done = 0;
>> + ? ? ? ? ? ? ? ? ? ? while (done < len) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? unmap_underlying_metadata(inode->i_sb->s_bdev,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? block);
>> +
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? done++;
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? block++;
>> + ? ? ? ? ? ? ? ? ? ? }
>> + ? ? ? ? ? ? } else {
>> ? ? ? ? ? ? ? ? ? ? ? ret = -EIO;
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> ? ? ? ? ? ? ? }
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to [email protected]
>> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>
>

2009-12-11 09:09:06

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

On 2009-12-10, at 10:28, Curt Wohlgemuth wrote:
> This fixes a bug in which new blocks returned from an extent created
> with
> ext4_ext_zeroout() can have dirty metadata still associated with them.
>
> This is for the problem I reported on 23 Nov ("Bug in extent
> zeroout: blocks
> not marked as new"). I'm not seeing the corruption with this fix
> that I was
> seeing without it.
>
> @@ -2474,9 +2474,21 @@ static int ext4_ext_zeroout(struct inode
> + /* On success, we need to insure all metadata associated
> + * with each of these blocks is unmapped. */
> + if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
> + sector_t block = ee_pblock;
> +
> ret = 0;
> + done = 0;
> + while (done < len) {
> + unmap_underlying_metadata(inode->i_sb->s_bdev,


IIRC, during the discussion of this problem, Ted pointed out that this
can only happen if the filesystem is running in no-journal mode.
Otherwise, there are shadow allocation bitmaps that prevent metadata
blocks from being reallocated until the transaction that released them
has committed.

In that case, it makes sense to only do this extra work if the
filesystem is running in no-journal mode.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-12-11 22:01:46

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

> IIRC, during the discussion of this problem, Ted pointed out that this can
> only happen if the filesystem is running in no-journal mode. ?Otherwise,
> there are shadow allocation bitmaps that prevent metadata blocks from being
> reallocated until the transaction that released them has committed.
>
> In that case, it makes sense to only do this extra work if the filesystem is
> running in no-journal mode.

Good point, Andreas. I changed this to send in the handle to
ext4_ext_zeroout().

=========================================

This fixes a bug with no journal being used, in which new blocks returned
from an extent created with ext4_ext_zeroout() can have dirty metadata still
associated with them.

Signed-off-by: Curt Wohlgemuth <[email protected]>
---

This is for the problem I reported on 23 Nov ("Bug in extent zeroout: blocks
not marked as new"). I'm not seeing the corruption with this fix that I was
seeing without it.

diff -uprN orig/fs/ext4/extents.c new/fs/ext4/extents.c
--- orig/fs/ext4/extents.c 2009-12-09 15:09:25.000000000 -0800
+++ new/fs/ext4/extents.c 2009-12-11 13:56:11.000000000 -0800
@@ -2421,7 +2421,8 @@ static void bi_complete(struct bio *bio,
}

/* FIXME!! we need to try to merge to left or right after zero-out */
-static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
+static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex,
+ handle_t *handle)
{
int ret = -EIO;
struct bio *bio;
@@ -2474,9 +2475,28 @@ static int ext4_ext_zeroout(struct inode
submit_bio(WRITE, bio);
wait_for_completion(&event);

- if (test_bit(BIO_UPTODATE, &bio->bi_flags))
+ if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+
ret = 0;
- else {
+
+ /* On success, if there is no journal through which
+ * metadata is committed, we need to insure all
+ * metadata associated with each of these blocks is
+ * unmapped. */
+ if (!ext4_handle_valid(handle)) {
+ sector_t block = ee_pblock;
+
+ done = 0;
+ while (done < len) {
+ unmap_underlying_metadata(inode->i_sb->
+ s_bdev,
+ block);
+
+ done++;
+ block++;
+ }
+ }
+ } else {
ret = -EIO;
break;
}
@@ -2532,7 +2552,7 @@ static int ext4_ext_convert_to_initializ
goto out;
/* If extent has less than 2*EXT4_EXT_ZERO_LEN zerout directly */
if (ee_len <= 2*EXT4_EXT_ZERO_LEN) {
- err = ext4_ext_zeroout(inode, &orig_ex);
+ err = ext4_ext_zeroout(inode, &orig_ex, handle);
if (err)
goto fix_extent_len;
/* update the extent length and mark as initialized */
@@ -2583,7 +2603,8 @@ static int ext4_ext_convert_to_initializ
err = ext4_ext_insert_extent(handle, inode, path,
ex3, 0);
if (err == -ENOSPC) {
- err = ext4_ext_zeroout(inode, &orig_ex);
+ err = ext4_ext_zeroout(inode, &orig_ex,
+ handle);
if (err)
goto fix_extent_len;
ex->ee_block = orig_ex.ee_block;
@@ -2603,7 +2624,7 @@ static int ext4_ext_convert_to_initializ
* implies that we can leak some junk data to user
* space.
*/
- err = ext4_ext_zeroout(inode, ex3);
+ err = ext4_ext_zeroout(inode, ex3, handle);
if (err) {
/*
* We should actually mark the
@@ -2639,7 +2660,7 @@ static int ext4_ext_convert_to_initializ
ext4_ext_mark_uninitialized(ex3);
err = ext4_ext_insert_extent(handle, inode, path, ex3, 0);
if (err == -ENOSPC) {
- err = ext4_ext_zeroout(inode, &orig_ex);
+ err = ext4_ext_zeroout(inode, &orig_ex, handle);
if (err)
goto fix_extent_len;
/* update the extent length and mark as initialized */
@@ -2688,7 +2709,7 @@ static int ext4_ext_convert_to_initializ
*/
if (le16_to_cpu(orig_ex.ee_len) <= EXT4_EXT_ZERO_LEN &&
iblock != ee_block) {
- err = ext4_ext_zeroout(inode, &orig_ex);
+ err = ext4_ext_zeroout(inode, &orig_ex, handle);
if (err)
goto fix_extent_len;
/* update the extent length and mark as initialized */
@@ -2757,7 +2778,7 @@ static int ext4_ext_convert_to_initializ
insert:
err = ext4_ext_insert_extent(handle, inode, path, &newex, 0);
if (err == -ENOSPC) {
- err = ext4_ext_zeroout(inode, &orig_ex);
+ err = ext4_ext_zeroout(inode, &orig_ex, handle);
if (err)
goto fix_extent_len;
/* update the extent length and mark as initialized */
@@ -2871,7 +2892,7 @@ static int ext4_split_unwritten_extents(
ext4_ext_mark_uninitialized(ex3);
err = ext4_ext_insert_extent(handle, inode, path, ex3, flags);
if (err == -ENOSPC) {
- err = ext4_ext_zeroout(inode, &orig_ex);
+ err = ext4_ext_zeroout(inode, &orig_ex, handle);
if (err)
goto fix_extent_len;
/* update the extent length and mark as initialized */
@@ -2942,7 +2963,7 @@ static int ext4_split_unwritten_extents(
insert:
err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
if (err == -ENOSPC) {
- err = ext4_ext_zeroout(inode, &orig_ex);
+ err = ext4_ext_zeroout(inode, &orig_ex, handle);
if (err)
goto fix_extent_len;
/* update the extent length and mark as initialized */

2009-12-11 23:27:24

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

On 2009-12-11, at 15:01, Curt Wohlgemuth wrote:
> Andreas Dilger wrote:
>> IIRC, during the discussion of this problem, Ted pointed out that
>> this can
>> only happen if the filesystem is running in no-journal mode.
>> Otherwise,
>> there are shadow allocation bitmaps that prevent metadata blocks
>> from being
>> reallocated until the transaction that released them has committed.
>>
>> In that case, it makes sense to only do this extra work if the
>> filesystem is
>> running in no-journal mode.
>
> Good point, Andreas. I changed this to send in the handle to
> ext4_ext_zeroout().

I was just thinking of checking EXT4_SB(inode->i_sb)->s_journal ==
NULL. I also thought about checking EXT4_HAS_INCOMPAT_FEATURE(sb,
NEEDS_RECOVERY), but I don't know if that is 100% safe (i.e. is it
possible to mount such a filesystem with "norecovery"?).

I don't have any particular objection to this patch, though it gives
the slightly false impression that the blocks being zeroed are
journaled, and it needs to modify a lot more places in the code.

> =========================================
>
> This fixes a bug with no journal being used, in which new blocks
> returned
> from an extent created with ext4_ext_zeroout() can have dirty
> metadata still
> associated with them.
>
> Signed-off-by: Curt Wohlgemuth <[email protected]>
> ---
>
> This is for the problem I reported on 23 Nov ("Bug in extent
> zeroout: blocks
> not marked as new"). I'm not seeing the corruption with this fix
> that I was
> seeing without it.
>
> diff -uprN orig/fs/ext4/extents.c new/fs/ext4/extents.c
> --- orig/fs/ext4/extents.c 2009-12-09 15:09:25.000000000 -0800
> +++ new/fs/ext4/extents.c 2009-12-11 13:56:11.000000000 -0800
> @@ -2421,7 +2421,8 @@ static void bi_complete(struct bio *bio,
> }
>
> /* FIXME!! we need to try to merge to left or right after zero-out */
> -static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent
> *ex)
> +static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent
> *ex,
> + handle_t *handle)
> {
> int ret = -EIO;
> struct bio *bio;
> @@ -2474,9 +2475,28 @@ static int ext4_ext_zeroout(struct inode
> submit_bio(WRITE, bio);
> wait_for_completion(&event);
>
> - if (test_bit(BIO_UPTODATE, &bio->bi_flags))
> + if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
> +
> ret = 0;
> - else {
> +
> + /* On success, if there is no journal through which
> + * metadata is committed, we need to insure all
> + * metadata associated with each of these blocks is
> + * unmapped. */
> + if (!ext4_handle_valid(handle)) {
> + sector_t block = ee_pblock;
> +
> + done = 0;
> + while (done < len) {
> + unmap_underlying_metadata(inode->i_sb->
> + s_bdev,
> + block);
> +
> + done++;
> + block++;
> + }
> + }
> + } else {
> ret = -EIO;
> break;
> }
> @@ -2532,7 +2552,7 @@ static int ext4_ext_convert_to_initializ
> goto out;
> /* If extent has less than 2*EXT4_EXT_ZERO_LEN zerout directly */
> if (ee_len <= 2*EXT4_EXT_ZERO_LEN) {
> - err = ext4_ext_zeroout(inode, &orig_ex);
> + err = ext4_ext_zeroout(inode, &orig_ex, handle);
> if (err)
> goto fix_extent_len;
> /* update the extent length and mark as initialized */
> @@ -2583,7 +2603,8 @@ static int ext4_ext_convert_to_initializ
> err = ext4_ext_insert_extent(handle, inode, path,
> ex3, 0);
> if (err == -ENOSPC) {
> - err = ext4_ext_zeroout(inode, &orig_ex);
> + err = ext4_ext_zeroout(inode, &orig_ex,
> + handle);
> if (err)
> goto fix_extent_len;
> ex->ee_block = orig_ex.ee_block;
> @@ -2603,7 +2624,7 @@ static int ext4_ext_convert_to_initializ
> * implies that we can leak some junk data to user
> * space.
> */
> - err = ext4_ext_zeroout(inode, ex3);
> + err = ext4_ext_zeroout(inode, ex3, handle);
> if (err) {
> /*
> * We should actually mark the
> @@ -2639,7 +2660,7 @@ static int ext4_ext_convert_to_initializ
> ext4_ext_mark_uninitialized(ex3);
> err = ext4_ext_insert_extent(handle, inode, path, ex3, 0);
> if (err == -ENOSPC) {
> - err = ext4_ext_zeroout(inode, &orig_ex);
> + err = ext4_ext_zeroout(inode, &orig_ex, handle);
> if (err)
> goto fix_extent_len;
> /* update the extent length and mark as initialized */
> @@ -2688,7 +2709,7 @@ static int ext4_ext_convert_to_initializ
> */
> if (le16_to_cpu(orig_ex.ee_len) <= EXT4_EXT_ZERO_LEN &&
> iblock != ee_block) {
> - err = ext4_ext_zeroout(inode, &orig_ex);
> + err = ext4_ext_zeroout(inode, &orig_ex, handle);
> if (err)
> goto fix_extent_len;
> /* update the extent length and mark as initialized */
> @@ -2757,7 +2778,7 @@ static int ext4_ext_convert_to_initializ
> insert:
> err = ext4_ext_insert_extent(handle, inode, path, &newex, 0);
> if (err == -ENOSPC) {
> - err = ext4_ext_zeroout(inode, &orig_ex);
> + err = ext4_ext_zeroout(inode, &orig_ex, handle);
> if (err)
> goto fix_extent_len;
> /* update the extent length and mark as initialized */
> @@ -2871,7 +2892,7 @@ static int ext4_split_unwritten_extents(
> ext4_ext_mark_uninitialized(ex3);
> err = ext4_ext_insert_extent(handle, inode, path, ex3, flags);
> if (err == -ENOSPC) {
> - err = ext4_ext_zeroout(inode, &orig_ex);
> + err = ext4_ext_zeroout(inode, &orig_ex, handle);
> if (err)
> goto fix_extent_len;
> /* update the extent length and mark as initialized */
> @@ -2942,7 +2963,7 @@ static int ext4_split_unwritten_extents(
> insert:
> err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
> if (err == -ENOSPC) {
> - err = ext4_ext_zeroout(inode, &orig_ex);
> + err = ext4_ext_zeroout(inode, &orig_ex, handle);
> if (err)
> goto fix_extent_len;
> /* update the extent length and mark as initialized */
> --
> To unsubscribe from this list: send the line "unsubscribe linux-
> ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-12-15 22:33:34

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

On Fri, Dec 11, 2009 at 3:27 PM, Andreas Dilger <[email protected]> wrote:
> On 2009-12-11, at 15:01, Curt Wohlgemuth wrote:

>>
>> Good point, Andreas. ?I changed this to send in the handle to
>> ext4_ext_zeroout().

> I was just thinking of checking EXT4_SB(inode->i_sb)->s_journal == NULL. ?I
> also thought about checking EXT4_HAS_INCOMPAT_FEATURE(sb, NEEDS_RECOVERY),
> but I don't know if that is 100% safe (i.e. is it possible to mount such a
> filesystem with "norecovery"?).

Arggh. There are too many ways to check the journal use/mode... This
patch is considerably simpler.

=====================================================

This fixes a bug with no journal being used, in which new blocks returned
from an extent created with ext4_ext_zeroout() can have dirty metadata still
associated with them.

Signed-off-by: Curt Wohlgemuth <[email protected]>
---

This is for the problem I reported on 23 Nov ("Bug in extent zeroout: blocks
not marked as new"). I'm not seeing the corruption with this fix that I was
seeing without it.

diff -uprN orig/fs/ext4/extents.c new/fs/ext4/extents.c
--- orig/fs/ext4/extents.c 2009-12-09 15:09:25.000000000 -0800
+++ new/fs/ext4/extents.c 2009-12-15 13:26:29.000000000 -0800
@@ -2474,9 +2474,28 @@ static int ext4_ext_zeroout(struct inode
submit_bio(WRITE, bio);
wait_for_completion(&event);

- if (test_bit(BIO_UPTODATE, &bio->bi_flags))
+ if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+
ret = 0;
- else {
+
+ /* On success, if there is no journal through which
+ * metadata is committed, we need to insure all
+ * metadata associated with each of these blocks is
+ * unmapped. */
+ if (EXT4_SB(inode->i_sb)->s_journal == NULL) {
+ sector_t block = ee_pblock;
+
+ done = 0;
+ while (done < len) {
+ unmap_underlying_metadata(inode->i_sb->
+ s_bdev,
+ block);
+
+ done++;
+ block++;
+ }
+ }
+ } else {
ret = -EIO;
break;
}

2009-12-18 11:49:52

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

On Thu, Dec 10, 2009 at 09:28:28AM -0800, Curt Wohlgemuth wrote:
> This fixes a bug in which new blocks returned from an extent created with
> ext4_ext_zeroout() can have dirty metadata still associated with them.
>
> Signed-off-by: Curt Wohlgemuth <[email protected]>
> ---
>
> This is for the problem I reported on 23 Nov ("Bug in extent zeroout: blocks
> not marked as new"). I'm not seeing the corruption with this fix that I was
> seeing without it.
>
> diff -uprN orig/fs/ext4/extents.c new/fs/ext4/extents.c
> --- orig/fs/ext4/extents.c 2009-12-09 15:09:25.000000000 -0800
> +++ new/fs/ext4/extents.c 2009-12-09 15:09:37.000000000 -0800
> @@ -2474,9 +2474,21 @@ static int ext4_ext_zeroout(struct inode
> submit_bio(WRITE, bio);
> wait_for_completion(&event);
>
> - if (test_bit(BIO_UPTODATE, &bio->bi_flags))
> + /* On success, we need to insure all metadata associated
> + * with each of these blocks is unmapped. */
> + if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
> + sector_t block = ee_pblock;
> +
> ret = 0;
> - else {
> + done = 0;
> + while (done < len) {
> + unmap_underlying_metadata(inode->i_sb->s_bdev,
> + block);
> +
> + done++;
> + block++;
> + }
> + } else {
> ret = -EIO;
> break;
> }

a) We are zeroing out 'done' blocks but you are unmapping 'len' blocks ?
b) We are already doing a unmap in mpage_da_map_blocks so i guess
what you want is to unmap the extra block allocated
c) ee_pblock is in 512 byte units so the block number is wrong.

how about the patch below ?

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3a7928f..f9a735f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3023,6 +3023,14 @@ out:
return err;
}

+static void unmap_underlying_metadata_blocks(struct block_device *bdev,
+ sector_t block, int count)
+{
+ int i;
+ for (i = 0; i < count; i++)
+ unmap_underlying_metadata(bdev, block + i);
+}
+
static int
ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock, unsigned int max_blocks,
@@ -3098,6 +3106,18 @@ out:
} else
allocated = ret;
set_buffer_new(bh_result);
+ /*
+ * if we allocated more blocks than requested
+ * we need to make sure we unmap the extra block
+ * allocated. The actual needed block will get
+ * unmapped later when we find the buffer_head marked
+ * new.
+ */
+ if (allocated > max_blocks) {
+ unmap_underlying_metadata_blocks(inode->i_sb->s_bdev,
+ newblock + max_blocks,
+ allocated - max_blocks);
+ }
map_out:
set_buffer_mapped(bh_result);
out1:



2009-12-18 12:10:13

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

On Fri, Dec 18, 2009 at 05:19:46PM +0530, Aneesh Kumar K.V wrote:
> On Thu, Dec 10, 2009 at 09:28:28AM -0800, Curt Wohlgemuth wrote:
> > This fixes a bug in which new blocks returned from an extent created with
> > ext4_ext_zeroout() can have dirty metadata still associated with them.
> >
> > Signed-off-by: Curt Wohlgemuth <[email protected]>

A better option would be to do the unmap during fallocate.

commit 87b3121fd9d1223acb08326fc0c9711b0bc3cfeb
Author: Aneesh Kumar K.V <[email protected]>
Date: Fri Dec 18 17:38:15 2009 +0530

ext4: unmap the underlying metadata when allocating blocks via fallocate

This become important when we are running with nojournal mode. We
may end up allocating recently freed metablocks for fallocate. We
want to make sure we unmap the old mapping so that when we convert
the fallocated uninitialized extent to initialized extent we don't
have the old mapping around. Leaving the old mapping can cause
file system corruption

Signed-off-by: Aneesh Kumar K.V <[email protected]>

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ab31e65..7c0fcae 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1768,6 +1768,20 @@ static inline void set_bitmap_uptodate(struct buffer_head *bh)
set_bit(BH_BITMAP_UPTODATE, &(bh)->b_state);
}

+/*
+ * __unmap_underlying_bh_blocks - just a helper function to unmap
+ * set of blocks described by @bh
+ */
+static inline void __unmap_underlying_bh_blocks(struct inode *inode,
+ struct buffer_head *bh)
+{
+ struct block_device *bdev = inode->i_sb->s_bdev;
+ int blocks, i;
+
+ blocks = bh->b_size >> inode->i_blkbits;
+ for (i = 0; i < blocks; i++)
+ unmap_underlying_metadata(bdev, bh->b_blocknr + i);
+}
#endif /* __KERNEL__ */

#endif /* _EXT4_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3a7928f..4e646a5 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3508,6 +3508,8 @@ retry:
ret2 = ext4_journal_stop(handle);
break;
}
+ if (buffer_new(&map_bh))
+ __unmap_underlying_bh_blocks(inode, &map_bh);
if ((block + ret) >= (EXT4_BLOCK_ALIGN(offset + len,
blkbits) >> blkbits))
new_size = offset + len;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5352db1..7b44737 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2073,22 +2073,6 @@ static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, sector_t logical,
}
}

-
-/*
- * __unmap_underlying_blocks - just a helper function to unmap
- * set of blocks described by @bh
- */
-static inline void __unmap_underlying_blocks(struct inode *inode,
- struct buffer_head *bh)
-{
- struct block_device *bdev = inode->i_sb->s_bdev;
- int blocks, i;
-
- blocks = bh->b_size >> inode->i_blkbits;
- for (i = 0; i < blocks; i++)
- unmap_underlying_metadata(bdev, bh->b_blocknr + i);
-}

2009-12-18 23:11:59

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

On Fri, Dec 18, 2009 at 4:10 AM, Aneesh Kumar K.V
<[email protected]> wrote:
> On Fri, Dec 18, 2009 at 05:19:46PM +0530, Aneesh Kumar K.V wrote:
>> On Thu, Dec 10, 2009 at 09:28:28AM -0800, Curt Wohlgemuth wrote:
>> > This fixes a bug in which new blocks returned from an extent created with
>> > ext4_ext_zeroout() can have dirty metadata still associated with them.
>> >
>> > ? ? Signed-off-by: Curt Wohlgemuth <[email protected]>
>
> A better option would be to do the unmap during fallocate.

The problem here is that we'll also call unmap_underlying_metadata()
on these same blocks when they get written to, and the extents become
initialized. At that point, the buffers are marked as 'new' and so
__block_write_full_page() and friends will again try to clear out any
old metadata.

You could argue that since there won't be any metadata that this
second call will be fast, but still...

Curt

>
> commit 87b3121fd9d1223acb08326fc0c9711b0bc3cfeb
> Author: Aneesh Kumar K.V <[email protected]>
> Date: ? Fri Dec 18 17:38:15 2009 +0530
>
> ? ?ext4: unmap the underlying metadata when allocating blocks via fallocate
>
> ? ?This become important when we are running with nojournal mode. We
> ? ?may end up allocating recently freed metablocks for fallocate. We
> ? ?want to make sure we unmap the old mapping so that when we convert
> ? ?the fallocated uninitialized extent to initialized extent we don't
> ? ?have the old mapping around. Leaving the old mapping can cause
> ? ?file system corruption
>
> ? ?Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index ab31e65..7c0fcae 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1768,6 +1768,20 @@ static inline void set_bitmap_uptodate(struct buffer_head *bh)
> ? ? ? ?set_bit(BH_BITMAP_UPTODATE, &(bh)->b_state);
> ?}
>
> +/*
> + * __unmap_underlying_bh_blocks - just a helper function to unmap
> + * set of blocks described by @bh
> + */
> +static inline void __unmap_underlying_bh_blocks(struct inode *inode,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct buffer_head *bh)
> +{
> + ? ? ? struct block_device *bdev = inode->i_sb->s_bdev;
> + ? ? ? int blocks, i;
> +
> + ? ? ? blocks = bh->b_size >> inode->i_blkbits;
> + ? ? ? for (i = 0; i < blocks; i++)
> + ? ? ? ? ? ? ? unmap_underlying_metadata(bdev, bh->b_blocknr + i);
> +}
> ?#endif /* __KERNEL__ */
>
> ?#endif /* _EXT4_H */
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 3a7928f..4e646a5 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3508,6 +3508,8 @@ retry:
> ? ? ? ? ? ? ? ? ? ? ? ?ret2 = ext4_journal_stop(handle);
> ? ? ? ? ? ? ? ? ? ? ? ?break;
> ? ? ? ? ? ? ? ?}
> + ? ? ? ? ? ? ? if (buffer_new(&map_bh))
> + ? ? ? ? ? ? ? ? ? ? ? __unmap_underlying_bh_blocks(inode, &map_bh);
> ? ? ? ? ? ? ? ?if ((block + ret) >= (EXT4_BLOCK_ALIGN(offset + len,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?blkbits) >> blkbits))
> ? ? ? ? ? ? ? ? ? ? ? ?new_size = offset + len;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5352db1..7b44737 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2073,22 +2073,6 @@ static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd, sector_t logical,
> ? ? ? ?}
> ?}
>
> -
> -/*
> - * __unmap_underlying_blocks - just a helper function to unmap
> - * set of blocks described by @bh
> - */
> -static inline void __unmap_underlying_blocks(struct inode *inode,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct buffer_head *bh)
> -{
> - ? ? ? struct block_device *bdev = inode->i_sb->s_bdev;
> - ? ? ? int blocks, i;
> -
> - ? ? ? blocks = bh->b_size >> inode->i_blkbits;
> - ? ? ? for (i = 0; i < blocks; i++)
> - ? ? ? ? ? ? ? unmap_underlying_metadata(bdev, bh->b_blocknr + i);
> -}
> -
> ?static void ext4_da_block_invalidatepages(struct mpage_da_data *mpd,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sector_t logical, long blk_cnt)
> ?{
> @@ -2243,7 +2227,7 @@ static int mpage_da_map_blocks(struct mpage_da_data *mpd)
> ? ? ? ?new.b_size = (blks << mpd->inode->i_blkbits);
>
> ? ? ? ?if (buffer_new(&new))
> - ? ? ? ? ? ? ? __unmap_underlying_blocks(mpd->inode, &new);
> + ? ? ? ? ? ? ? __unmap_underlying_bh_blocks(mpd->inode, &new);
>
> ? ? ? ?/*
> ? ? ? ? * If blocks are delayed marked, we need to
>

2009-12-29 17:56:33

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

On Fri, Dec 18, 2009 at 3:49 AM, Aneesh Kumar K.V
<[email protected]> wrote:
> On Thu, Dec 10, 2009 at 09:28:28AM -0800, Curt Wohlgemuth wrote:
>> This fixes a bug in which new blocks returned from an extent created with
>> ext4_ext_zeroout() can have dirty metadata still associated with them.
>>
>> ? ? ? Signed-off-by: Curt Wohlgemuth <[email protected]>
>> ---
>>
>> This is for the problem I reported on 23 Nov ("Bug in extent zeroout: blocks
>> not marked as new"). ?I'm not seeing the corruption with this fix that I was
>> seeing without it.
>>
>> diff -uprN orig/fs/ext4/extents.c new/fs/ext4/extents.c
>> --- orig/fs/ext4/extents.c ? ?2009-12-09 15:09:25.000000000 -0800
>> +++ new/fs/ext4/extents.c ? ? 2009-12-09 15:09:37.000000000 -0800
>> @@ -2474,9 +2474,21 @@ static int ext4_ext_zeroout(struct inode
>> ? ? ? ? ? ? ? submit_bio(WRITE, bio);
>> ? ? ? ? ? ? ? wait_for_completion(&event);
>>
>> - ? ? ? ? ? ? if (test_bit(BIO_UPTODATE, &bio->bi_flags))
>> + ? ? ? ? ? ? /* On success, we need to insure all metadata associated
>> + ? ? ? ? ? ? ?* with each of these blocks is unmapped. */
>> + ? ? ? ? ? ? if (test_bit(BIO_UPTODATE, &bio->bi_flags)) {
>> + ? ? ? ? ? ? ? ? ? ? sector_t block = ee_pblock;
>> +
>> ? ? ? ? ? ? ? ? ? ? ? ret = 0;
>> - ? ? ? ? ? ? else {
>> + ? ? ? ? ? ? ? ? ? ? done = 0;
>> + ? ? ? ? ? ? ? ? ? ? while (done < len) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? unmap_underlying_metadata(inode->i_sb->s_bdev,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? block);
>> +
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? done++;
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? block++;
>> + ? ? ? ? ? ? ? ? ? ? }
>> + ? ? ? ? ? ? } else {
>> ? ? ? ? ? ? ? ? ? ? ? ret = -EIO;
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> ? ? ? ? ? ? ? }
>
> a) We are zeroing out 'done' blocks but you are unmapping 'len' blocks ?
> b) We are already doing a unmap in mpage_da_map_blocks so i guess
> ? what you want is to unmap the extra block allocated
> c) ee_pblock is in 512 byte units so the block number is wrong.
>
> how about the patch below ?

Thanks for pointing out the arithmetic problems in my patch. Yours
below looks good to me.
Curt


>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 3a7928f..f9a735f 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3023,6 +3023,14 @@ out:
> ? ? ? ?return err;
> ?}
>
> +static void unmap_underlying_metadata_blocks(struct block_device *bdev,
> + ? ? ? ? ? ? ? ? ? ? ? sector_t block, int count)
> +{
> + ? ? ? int i;
> + ? ? ? for (i = 0; i < count; i++)
> + ? ? ? ? ? ? ? ?unmap_underlying_metadata(bdev, block + i);
> +}
> +
> ?static int
> ?ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
> ? ? ? ? ? ? ? ? ? ? ? ?ext4_lblk_t iblock, unsigned int max_blocks,
> @@ -3098,6 +3106,18 @@ out:
> ? ? ? ?} else
> ? ? ? ? ? ? ? ?allocated = ret;
> ? ? ? ?set_buffer_new(bh_result);
> + ? ? ? /*
> + ? ? ? ?* if we allocated more blocks than requested
> + ? ? ? ?* we need to make sure we unmap the extra block
> + ? ? ? ?* allocated. The actual needed block will get
> + ? ? ? ?* unmapped later when we find the buffer_head marked
> + ? ? ? ?* new.
> + ? ? ? ?*/
> + ? ? ? if (allocated > max_blocks) {
> + ? ? ? ? ? ? ? unmap_underlying_metadata_blocks(inode->i_sb->s_bdev,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? newblock + max_blocks,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? allocated - max_blocks);
> + ? ? ? }
> ?map_out:
> ? ? ? ?set_buffer_mapped(bh_result);
> ?out1:
>
>
>

2009-12-29 18:53:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

On Tue, Dec 29, 2009 at 09:56:29AM -0800, Curt Wohlgemuth wrote:
>
> Thanks for pointing out the arithmetic problems in my patch. Yours
> below looks good to me.

Hi Curt,

Can you do me a favor and send out a full patch with Aneesh's
correction?

Thanks,

- Ted

2009-12-29 23:23:40

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH] ext4: Ensure zeroout blocks have no dirty metadata

Hi Ted:

On Tue, Dec 29, 2009 at 10:53 AM, <[email protected]> wrote:
> On Tue, Dec 29, 2009 at 09:56:29AM -0800, Curt Wohlgemuth wrote:
>>
>> Thanks for pointing out the arithmetic problems in my patch. ?Yours
>> below looks good to me.
>
> Hi Curt,
>
> Can you do me a favor and send out a full patch with Aneesh's
> correction?

Here it is. This is against 2.6.33-rc2, if that makes a difference. Thanks!

====================================================================

This fixes a bug in which new blocks returned from an extent created with
ext4_ext_zeroout() can have dirty metadata still associated with them.

This patch was originally by Aneesh Kumar <[email protected]>.

Signed-off-by: Curt Wohlgemuth <[email protected]>

---


diff -uprN orig/fs/ext4/extents.c new/fs/ext4/extents.c
--- orig/fs/ext4/extents.c 2009-12-29 14:59:12.000000000 -0800
+++ new/fs/ext4/extents.c 2009-12-29 15:18:43.000000000 -0800
@@ -3023,6 +3023,14 @@ out:
return err;
}

+static void unmap_underlying_metadata_blocks(struct block_device *bdev,
+ sector_t block, int count)
+{
+ int i;
+ for (i = 0; i < count; i++)
+ unmap_underlying_metadata(bdev, block + i);
+}
+
static int
ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock, unsigned int max_blocks,
@@ -3098,6 +3106,18 @@ out:
} else
allocated = ret;
set_buffer_new(bh_result);
+ /*
+ * if we allocated more blocks than requested
+ * we need to make sure we unmap the extra block
+ * allocated. The actual needed block will get
+ * unmapped later when we find the buffer_head marked
+ * new.
+ */
+ if (allocated > max_blocks) {
+ unmap_underlying_metadata_blocks(inode->i_sb->s_bdev,
+ newblock + max_blocks,
+ allocated - max_blocks);
+ }
map_out:
set_buffer_mapped(bh_result);
out1: