2010-11-30 23:22:43

by Jiaying Zhang

[permalink] [raw]
Subject: [PATCH] discard an inode's preallocated blocks after failed allocation

We have seen kernel crashes caused by the BUG_ON in
ext4_mb_return_to_preallocation() that checks whether the inode's
i_prealloc_list is empty. As I understand, the assumption is that
when ext4_mb_return_to_preallocation() is called from ext4_free_blocks(),
the inode's preallocation list should have been already discarded.
However, although we call ext4_discard_preallocations() during ext4
truncate, we don't always call that function in various failure
cases before calling ext4_free_blocks(). So it is likely to hit this
BUG_ON with disk errors or corrupted fs etc.

To fix the problem, the following patch adds ext4_discard_preallocation()
before ext4_free_blocks() in failed allocation cases. This will discard
any preallocated block extent attached to the inode, but I think it is
probably what we should be doing with failed allocation.

I am also curious whether we can drop the ext4_mb_return_to_preallocation()
call from ext4_free_blocks(). From the comments above that function, it
seems to intent to discard the specified blocks only but all it is currently
doing is the BUG_ON check on whether the inode's preallocation list is empty.
Is there any plan to extend this function later?


ext4: discard an inode's preallocated blocks after failed allocation so that
we won't hit on the BUG_ON check in ext4_mb_return_to_preallocation() later.

Signed-off-by: Jiaying Zhang <[email protected]>

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 29a4adf..2164df6 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1056,6 +1056,7 @@ cleanup:
}

if (err) {
+ ext4_discard_preallocations(inode);
/* free all allocated blocks in error case */
for (i = 0; i < depth; i++) {
if (!ablocks[i])
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4f4362c..456cb4a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -690,6 +690,7 @@ allocated:
*err = 0;
return ret;
failed_out:
+ ext4_discard_preallocations(inode);
for (i = 0; i < index; i++)
ext4_free_blocks(handle, inode, 0, new_blocks[i], 1, 0);
return ret;
@@ -787,6 +788,7 @@ static int ext4_alloc_branch(handle_t *handle, struct inode *inode,
return err;
failed:
/* Allocation failed, free what we already allocated */
+ ext4_discard_preallocations(inode);
ext4_free_blocks(handle, inode, 0, new_blocks[0], 1, 0);
for (i = 1; i <= n ; i++) {
/*
@@ -878,6 +880,7 @@ static int ext4_splice_branch(handle_t *handle, struct inode *inode,
return err;

err_out:
+ ext4_discard_preallocations(inode);
for (i = 1; i <= num; i++) {
/*
* branch[i].bh is newly allocated, so there is no


2011-01-09 03:04:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] discard an inode's preallocated blocks after failed allocation

On Tue, Nov 30, 2010 at 03:22:38PM -0800, Jiaying Zhang wrote:
> We have seen kernel crashes caused by the BUG_ON in
> ext4_mb_return_to_preallocation() that checks whether the inode's
> i_prealloc_list is empty. As I understand, the assumption is that
> when ext4_mb_return_to_preallocation() is called from ext4_free_blocks(),
> the inode's preallocation list should have been already discarded.
> However, although we call ext4_discard_preallocations() during ext4
> truncate, we don't always call that function in various failure
> cases before calling ext4_free_blocks(). So it is likely to hit this
> BUG_ON with disk errors or corrupted fs etc.
>
> To fix the problem, the following patch adds ext4_discard_preallocation()
> before ext4_free_blocks() in failed allocation cases. This will discard
> any preallocated block extent attached to the inode, but I think it is
> probably what we should be doing with failed allocation.
>
> I am also curious whether we can drop the ext4_mb_return_to_preallocation()
> call from ext4_free_blocks(). From the comments above that function, it
> seems to intent to discard the specified blocks only but all it is currently
> doing is the BUG_ON check on whether the inode's preallocation list is empty.
> Is there any plan to extend this function later?
>
>
> ext4: discard an inode's preallocated blocks after failed allocation so that
> we won't hit on the BUG_ON check in ext4_mb_return_to_preallocation() later.
>
> Signed-off-by: Jiaying Zhang <[email protected]>

Sorry, I haven't had a chance to go diving into the "legislative
intent" behind the code in question, until now. The reason why I
haven't is because it's the mballoc code is a bit of a mess (and some
of it is my fault).

The basic idea behind ext4_mb_return_to_preallocation() is to take the
blocks that had been allocated, and if they are adjacent to the
preallocation regions reserved for the inode, to add them to back to
the preallocation region. This is especially a good thing if the
blocks had just been taking from a preallocation region, and we
errored out trying to insert them into the extent tree, for example.
In that case, it would be good to be able to "put back" the blocks to
their original preallocation region.

There are only a few problems with this scheme:

1) It was never implemented.

2) There is a BUG_ON which triggers if the inode has any
preallocations. After staring at it for a long time, I've concluded
that it makes no sense whatsoever. If I'm right about why it was
there, then in the case where we had just allocated some blocks from
the preallocation region, and now we need to put them back, of course
we inode would have some prellocated regions.

3) If the journal is enabled, we treat all data blocks as if they were
metadata blocks (that is, we don't actually release the blocks until
the current transaction has committed); this is necessary since we
need to make sure we don't actually use some data block until we know
it has been released. This is fine, but (3a) in the code where we
finally get around to freeing the data blocks, we don't actually call
ext4_mb_return_to_preallcation(), which means the bug which Jiaying
was tracking down can never happen except in no-journal mod, and (3b)
if the data block has never been used (i.e., in the case where we
allocated the data block, but then ever got a chance to install the
blocks so they could have never have gotten used), we don't need to
wait until the next journal transaction --- and in fact in that case
there's no point calling trim on the blocks in question, since they
were never written into.

Argh; what a mess.

So it seems like the best thing to do for now is an alternate patch
which Jiaying had suggested to me privately, which is to simply remove
the BUG_ON in ext4_mb_return_to_preallocation(), since in the long
run, when we get aroundt to actually implementing
ext4_mb_return_to_preallocation(), it does the right thing.

I'm actually not sure how much of an optimization it is to implement
ext4_mb_return_to_preallocation(), so the other alternative is to
remove the function entirely. How often are we likely to return an
error to userspace, and then have the userspace continue writing to
the file?

- Ted

2011-01-09 03:43:35

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] ext4: remove ext4_mb_return_to_preallocation()

This function was never implemented, except for a BUG_ON which was
tripping when ext4 is run without a journal. The problem is that
although the comment asserts that "truncate (which is the only way to
free block) discards all preallocations", ext4_free_blocks() is also
called in various error recovery paths when blocks have been
allocated, but for various reasons, we were not able to use those data
blocks (for example, because we ran out of memory while trying to
manipulate the extent tree, or some other similar situation).

In addition to the fact that this function isn't implemented except
for the incorrect BUG_ON, the single caller of this function,
ext4_free_blocks(), doesn't use it all if the journal is enabled.

So remove the (stub) function entirely for now. If we decide it's
better to add it back, it's only going to be useful with a relatively
large number of code changes anyway.

Google-Bug-Id: 3236408

Cc: Jiaying Zhang <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/mballoc.c | 14 --------------
1 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5bbf6e3..85b6d44 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3881,19 +3881,6 @@ repeat:
}
}

-/*
- * finds all preallocated spaces and return blocks being freed to them
- * if preallocated space becomes full (no block is used from the space)
- * then the function frees space in buddy
- * XXX: at the moment, truncate (which is the only way to free blocks)
- * discards all preallocations
- */
-static void ext4_mb_return_to_preallocation(struct inode *inode,
- struct ext4_buddy *e4b,
- sector_t block, int count)
-{
- BUG_ON(!list_empty(&EXT4_I(inode)->i_prealloc_list));
-}
#ifdef CONFIG_EXT4_DEBUG
static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
{
@@ -4648,7 +4635,6 @@ do_more:
ext4_lock_group(sb, block_group);
mb_clear_bits(bitmap_bh->b_data, bit, count);
mb_free_blocks(inode, &e4b, bit, count);
- ext4_mb_return_to_preallocation(inode, &e4b, block, count);
}

ret = ext4_free_blks_count(sb, gdp) + count;
--
1.7.3.1


2011-01-10 16:45:35

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext4: remove ext4_mb_return_to_preallocation()

On 01/08/2011 09:43 PM, Theodore Ts'o wrote:
> This function was never implemented, except for a BUG_ON which was
> tripping when ext4 is run without a journal. The problem is that
> although the comment asserts that "truncate (which is the only way to
> free block) discards all preallocations", ext4_free_blocks() is also
> called in various error recovery paths when blocks have been
> allocated, but for various reasons, we were not able to use those data
> blocks (for example, because we ran out of memory while trying to
> manipulate the extent tree, or some other similar situation).
>
> In addition to the fact that this function isn't implemented except
> for the incorrect BUG_ON, the single caller of this function,
> ext4_free_blocks(), doesn't use it all if the journal is enabled.
>
> So remove the (stub) function entirely for now. If we decide it's
> better to add it back, it's only going to be useful with a relatively
> large number of code changes anyway.

Acked-by: Eric Sandeen <[email protected]>

This always bugged me. :)

> Google-Bug-Id: 3236408
>
> Cc: Jiaying Zhang <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/mballoc.c | 14 --------------
> 1 files changed, 0 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 5bbf6e3..85b6d44 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3881,19 +3881,6 @@ repeat:
> }
> }
>
> -/*
> - * finds all preallocated spaces and return blocks being freed to them
> - * if preallocated space becomes full (no block is used from the space)
> - * then the function frees space in buddy
> - * XXX: at the moment, truncate (which is the only way to free blocks)
> - * discards all preallocations
> - */
> -static void ext4_mb_return_to_preallocation(struct inode *inode,
> - struct ext4_buddy *e4b,
> - sector_t block, int count)
> -{
> - BUG_ON(!list_empty(&EXT4_I(inode)->i_prealloc_list));
> -}
> #ifdef CONFIG_EXT4_DEBUG
> static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
> {
> @@ -4648,7 +4635,6 @@ do_more:
> ext4_lock_group(sb, block_group);
> mb_clear_bits(bitmap_bh->b_data, bit, count);
> mb_free_blocks(inode, &e4b, bit, count);
> - ext4_mb_return_to_preallocation(inode, &e4b, block, count);
> }
>
> ret = ext4_free_blks_count(sb, gdp) + count;