2009-03-13 21:57:54

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH] fix bb_prealloc_list corruption due to wrong group locking

This is for Red Hat bug 490026,
EXT4 panic, list corruption in ext4_mb_new_inode_pa

(this was on backported ext4 from 2.6.29)

We hit a BUG() in __list_add from ext4_mb_new_inode_pa()
because the list head pointed to a removed item:

list_add corruption. next->prev should be ffff81042f2fe158,
but was 0000000000200200

(0000000000200200 is LIST_POISON2, set when the item is deleted)

ext4_lock_group(sb, group) is supposed to protect this list for
each group, and a common code flow is this:

ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
ext4_lock_group(sb, grp);
list_del(&pa->pa_group_list);
ext4_unlock_group(sb, grp);

so its critical that we get the right group number back for
this pa->pa_pstart block.

however, ext4_mb_put_pa passes in (pa->pa_pstart - 1) with a
comment, "-1 is to protect from crossing allocation group"

Other list-manipulators do not use the "-1" so we have the
potential to lock the wrong group and race. Given how the
ext4_get_group_no_and_offset() function works, it doesn't seem
to me that the subtraction is correct.

I've not been able to reproduce the bug, so this is by inspection.

Signed-off-by: Eric Sandeen <[email protected]>
---

Index: linux-2.6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.orig/fs/ext4/mballoc.c
+++ linux-2.6/fs/ext4/mballoc.c
@@ -3603,8 +3603,7 @@ static void ext4_mb_put_pa(struct ext4_a
pa->pa_deleted = 1;
spin_unlock(&pa->pa_lock);

- /* -1 is to protect from crossing allocation group */
- ext4_get_group_no_and_offset(sb, pa->pa_pstart - 1, &grp, NULL);
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);

/*
* possible race:



2009-03-13 22:20:31

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] fix bb_prealloc_list corruption due to wrong group locking

Eric Sandeen wrote:
> This is for Red Hat bug 490026,
> EXT4 panic, list corruption in ext4_mb_new_inode_pa
>
> (this was on backported ext4 from 2.6.29)
>
> We hit a BUG() in __list_add from ext4_mb_new_inode_pa()
> because the list head pointed to a removed item:
>
> list_add corruption. next->prev should be ffff81042f2fe158,
> but was 0000000000200200
>
> (0000000000200200 is LIST_POISON2, set when the item is deleted)
>
> ext4_lock_group(sb, group) is supposed to protect this list for
> each group, and a common code flow is this:
>
> ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
> ext4_lock_group(sb, grp);
> list_del(&pa->pa_group_list);
> ext4_unlock_group(sb, grp);
>
> so its critical that we get the right group number back for
> this pa->pa_pstart block.
>
> however, ext4_mb_put_pa passes in (pa->pa_pstart - 1) with a
> comment, "-1 is to protect from crossing allocation group"
>
> Other list-manipulators do not use the "-1" so we have the
> potential to lock the wrong group and race. Given how the
> ext4_get_group_no_and_offset() function works, it doesn't seem
> to me that the subtraction is correct.

Hm, unless pa_pstart gets advanced to the point where it's in the next
group when it's used up... might be more reading to do here.

-Eric

2009-03-14 04:41:05

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] fix bb_prealloc_list corruption due to wrong group locking

Eric Sandeen wrote:
> Eric Sandeen wrote:
>> This is for Red Hat bug 490026,
>> EXT4 panic, list corruption in ext4_mb_new_inode_pa
>>
>> (this was on backported ext4 from 2.6.29)
>>
>> We hit a BUG() in __list_add from ext4_mb_new_inode_pa()
>> because the list head pointed to a removed item:
>>
>> list_add corruption. next->prev should be ffff81042f2fe158,
>> but was 0000000000200200
>>
>> (0000000000200200 is LIST_POISON2, set when the item is deleted)
>>
>> ext4_lock_group(sb, group) is supposed to protect this list for
>> each group, and a common code flow is this:
>>
>> ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
>> ext4_lock_group(sb, grp);
>> list_del(&pa->pa_group_list);
>> ext4_unlock_group(sb, grp);
>>
>> so its critical that we get the right group number back for
>> this pa->pa_pstart block.
>>
>> however, ext4_mb_put_pa passes in (pa->pa_pstart - 1) with a
>> comment, "-1 is to protect from crossing allocation group"
>>
>> Other list-manipulators do not use the "-1" so we have the
>> potential to lock the wrong group and race. Given how the
>> ext4_get_group_no_and_offset() function works, it doesn't seem
>> to me that the subtraction is correct.
>
> Hm, unless pa_pstart gets advanced to the point where it's in the next
> group when it's used up... might be more reading to do here.

Ok I think I was on the right track here. It looks like for group_pa
(with pa_linear == 1), pa->pa_pstart is advanced as it's used (actually
in ext4_mb_release_context(), but that's a detail) so by the time it is
used up, pa->pa_pstart has advanced into the next group* and therefore
subtracting one to find the group it belong(ed) to is correct.

However, for inode_pa (with pa_linear == 0) only pa_free is decremented,
and pa_pstart does not move. Therefore subtracting one from pa_pstart
in ext4_mb_put_pa is actually grabbing the previous block group's lock
in the inode case, and we open a race with other threads which are
locking the correct group.

I'll do a bit more testing/reading but I think that what we probably
need is something like:

static void ext4_mb_put_pa(struct ext4_allocation_context *ac, ...)
{
...
/* -1 is to protect from crossing allocation group */
if (pa->pa_linear)
pa->pa_pstart--;
ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
...

Could probably come up with something clearer, but that's the gist of it.

-Eric


*i.e. if pa_start began at 0, and the group had 512 blocks, when all
blocks are used, pa_start has advanced by 512, and "512" is the first
block in the *next* group, so we need to trim one off in that case.

2009-03-16 05:44:34

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] fix bb_prealloc_list corruption due to wrong group locking

On Fri, Mar 13, 2009 at 04:57:45PM -0500, Eric Sandeen wrote:
> This is for Red Hat bug 490026,
> EXT4 panic, list corruption in ext4_mb_new_inode_pa
>
> (this was on backported ext4 from 2.6.29)
>
> We hit a BUG() in __list_add from ext4_mb_new_inode_pa()
> because the list head pointed to a removed item:
>
> list_add corruption. next->prev should be ffff81042f2fe158,
> but was 0000000000200200
>
> (0000000000200200 is LIST_POISON2, set when the item is deleted)
>
> ext4_lock_group(sb, group) is supposed to protect this list for
> each group, and a common code flow is this:
>
> ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
> ext4_lock_group(sb, grp);
> list_del(&pa->pa_group_list);
> ext4_unlock_group(sb, grp);
>
> so its critical that we get the right group number back for
> this pa->pa_pstart block.
>
> however, ext4_mb_put_pa passes in (pa->pa_pstart - 1) with a
> comment, "-1 is to protect from crossing allocation group"
>
> Other list-manipulators do not use the "-1" so we have the
> potential to lock the wrong group and race. Given how the
> ext4_get_group_no_and_offset() function works, it doesn't seem
> to me that the subtraction is correct.
>
> I've not been able to reproduce the bug, so this is by inspection.
>
> Signed-off-by: Eric Sandeen <[email protected]>
> ---
>
> Index: linux-2.6/fs/ext4/mballoc.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/mballoc.c
> +++ linux-2.6/fs/ext4/mballoc.c
> @@ -3603,8 +3603,7 @@ static void ext4_mb_put_pa(struct ext4_a
> pa->pa_deleted = 1;
> spin_unlock(&pa->pa_lock);
>
> - /* -1 is to protect from crossing allocation group */
> - ext4_get_group_no_and_offset(sb, pa->pa_pstart - 1, &grp, NULL);
> + ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
>
> /*
> * possible race:
>


But the change is needed for lg prealloc space because locality group
prealloc reduce pa_pstart on block allocation and once fully allocated
pa_pstart can point to the next block group. But what you found is also
correct for inode prealloc space. I guess the code broke due to FLEX_BG
because without FLEX_BG pa_pstart will never be the first block in the
group so even for inode prealloc space pa_pstart - 1 would be in the
same group. You may want to do

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4415bee..b4656f7 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3585,11 +3585,10 @@ static void ext4_mb_pa_callback(struct rcu_head *head)
* drops a reference to preallocated space descriptor
* if this was the last reference and the space is consumed
*/
-static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
- struct super_block *sb, struct ext4_prealloc_space *pa)
+static void ext4_mb_put_pa(struct super_block *sb,
+ ext4_group_t grp, struct ext4_prealloc_space *pa)
{
- ext4_group_t grp;
-
+
if (!atomic_dec_and_test(&pa->pa_count) || pa->pa_free != 0)
return;

@@ -3602,10 +3601,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,

pa->pa_deleted = 1;
spin_unlock(&pa->pa_lock);
-
- /* -1 is to protect from crossing allocation group */
- ext4_get_group_no_and_offset(sb, pa->pa_pstart - 1, &grp, NULL);
-
+
/*
* possible race:
*
@@ -4469,8 +4465,11 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
*/
static int ext4_mb_release_context(struct ext4_allocation_context *ac)
{
+ ext4_group_t grp;
struct ext4_prealloc_space *pa = ac->ac_pa;
if (pa) {
+ ext4_get_group_no_and_offset(ac->ac_sb,
+ pa->pa_pstart, &grp, NULL);
if (pa->pa_linear) {
/* see comment in ext4_mb_use_group_pa() */
spin_lock(&pa->pa_lock);
@@ -4497,7 +4496,7 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
spin_unlock(pa->pa_obj_lock);
ext4_mb_add_n_trim(ac);
}
- ext4_mb_put_pa(ac, ac->ac_sb, pa);
+ ext4_mb_put_pa(ac->ac_sb, grp, pa);
}
if (ac->ac_bitmap_page)
page_cache_release(ac->ac_bitmap_page);

2009-03-16 15:04:02

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] fix bb_prealloc_list corruption due to wrong group locking

Aneesh Kumar K.V wrote:
> On Fri, Mar 13, 2009 at 04:57:45PM -0500, Eric Sandeen wrote:
>> This is for Red Hat bug 490026,
>> EXT4 panic, list corruption in ext4_mb_new_inode_pa
>>
>> (this was on backported ext4 from 2.6.29)
>>
>> We hit a BUG() in __list_add from ext4_mb_new_inode_pa()
>> because the list head pointed to a removed item:
>>
>> list_add corruption. next->prev should be ffff81042f2fe158,
>> but was 0000000000200200
>>
>> (0000000000200200 is LIST_POISON2, set when the item is deleted)
>>
>> ext4_lock_group(sb, group) is supposed to protect this list for
>> each group, and a common code flow is this:
>>
>> ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
>> ext4_lock_group(sb, grp);
>> list_del(&pa->pa_group_list);
>> ext4_unlock_group(sb, grp);
>>
>> so its critical that we get the right group number back for
>> this pa->pa_pstart block.
>>
>> however, ext4_mb_put_pa passes in (pa->pa_pstart - 1) with a
>> comment, "-1 is to protect from crossing allocation group"
>>
>> Other list-manipulators do not use the "-1" so we have the
>> potential to lock the wrong group and race. Given how the
>> ext4_get_group_no_and_offset() function works, it doesn't seem
>> to me that the subtraction is correct.
>>
>> I've not been able to reproduce the bug, so this is by inspection.
>>
>> Signed-off-by: Eric Sandeen <[email protected]>
>> ---
>>
>> Index: linux-2.6/fs/ext4/mballoc.c
>> ===================================================================
>> --- linux-2.6.orig/fs/ext4/mballoc.c
>> +++ linux-2.6/fs/ext4/mballoc.c
>> @@ -3603,8 +3603,7 @@ static void ext4_mb_put_pa(struct ext4_a
>> pa->pa_deleted = 1;
>> spin_unlock(&pa->pa_lock);
>>
>> - /* -1 is to protect from crossing allocation group */
>> - ext4_get_group_no_and_offset(sb, pa->pa_pstart - 1, &grp, NULL);
>> + ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
>>
>> /*
>> * possible race:
>>
>
>
> But the change is needed for lg prealloc space because locality group
> prealloc reduce pa_pstart on block allocation and once fully allocated
> pa_pstart can point to the next block group.

Right, that's what I followed up with Friday, missed it the first go-round.

> But what you found is also
> correct for inode prealloc space. I guess the code broke due to FLEX_BG
> because without FLEX_BG pa_pstart will never be the first block in the
> group so even for inode prealloc space pa_pstart - 1 would be in the
> same group.

Hm, so for inode it's initialized as:

pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
regardless of flex? well, anyway...

You may want to do

<aneesh's patch moving the grp calculation>

The only thing I don't like about that is we're calculating grp a lot
when we don't actually need it. How about just:

Index: linux-2.6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.orig/fs/ext4/mballoc.c
+++ linux-2.6/fs/ext4/mballoc.c
@@ -3603,8 +3603,11 @@ static void ext4_mb_put_pa(struct ext4_a
pa->pa_deleted = 1;
spin_unlock(&pa->pa_lock);

- /* -1 is to protect from crossing allocation group */
- ext4_get_group_no_and_offset(sb, pa->pa_pstart - 1, &grp, NULL);
+ /* If linear, pa_pstart is in next block group when used up */
+ if (pa->pa_linear)
+ pa->pa_pstart--;
+
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);

/*
* possible race:

-Eric

2009-03-16 16:47:50

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH V2] fix bb_prealloc_list corruption due to wrong group locking

This is for Red Hat bug 490026,
EXT4 panic, list corruption in ext4_mb_new_inode_pa

ext4_lock_group(sb, group) is supposed to protect this list for
each group, and a common code flow to remove an album is like
this:

ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
ext4_lock_group(sb, grp);
list_del(&pa->pa_group_list);
ext4_unlock_group(sb, grp);

so it's critical that we get the right group number back for
this prealloc context, to lock the right group (the one
associated with this pa) and prevent concurrent list manipulation.

however, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a
comment, /* -1 is to protect from crossing allocation group */

This makes sense for the group_pa, where pa_pstart is advanced
by the length which has been used (in ext4_mb_release_context()),
and when the entire length has been used, pa_pstart has been
advanced to the first block of the next group.

However, for inode_pa, pa_pstart is never advanced; it's just
set once to the first block in the group and not moved after
that. So in this case, if we subtract one in ext4_mb_put_pa(),
we are actually locking the *previous* group, and opening the
race with the other threads which do not subtract off the extra
block.

Signed-off-by: Eric Sandeen <[email protected]>
---

ndex: linux-2.6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.orig/fs/ext4/mballoc.c
+++ linux-2.6/fs/ext4/mballoc.c
@@ -3589,6 +3589,7 @@ static void ext4_mb_put_pa(struct ext4_a
struct super_block *sb, struct ext4_prealloc_space *pa)
{
ext4_group_t grp;
+ ext4_fsblk_t grp_blk;

if (!atomic_dec_and_test(&pa->pa_count) || pa->pa_free != 0)
return;
@@ -3603,8 +3604,12 @@ static void ext4_mb_put_pa(struct ext4_a
pa->pa_deleted = 1;
spin_unlock(&pa->pa_lock);

- /* -1 is to protect from crossing allocation group */
- ext4_get_group_no_and_offset(sb, pa->pa_pstart - 1, &grp, NULL);
+ grp_blk = pa->pa_pstart;
+ /* If linear, pa_pstart is in the next block group when pa is used up */
+ if (pa->pa_linear)
+ grp_blk--;
+
+ ext4_get_group_no_and_offset(sb, grp_blk, &grp, NULL);

/*
* possible race:


2009-03-16 17:13:20

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH V2] fix bb_prealloc_list corruption due to wrong group locking

On Mon, Mar 16, 2009 at 11:47:37AM -0500, Eric Sandeen wrote:
> This is for Red Hat bug 490026,
> EXT4 panic, list corruption in ext4_mb_new_inode_pa
>
> ext4_lock_group(sb, group) is supposed to protect this list for
> each group, and a common code flow to remove an album is like
> this:
>
> ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
> ext4_lock_group(sb, grp);
> list_del(&pa->pa_group_list);
> ext4_unlock_group(sb, grp);
>
> so it's critical that we get the right group number back for
> this prealloc context, to lock the right group (the one
> associated with this pa) and prevent concurrent list manipulation.
>
> however, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a
> comment, /* -1 is to protect from crossing allocation group */
>
> This makes sense for the group_pa, where pa_pstart is advanced
> by the length which has been used (in ext4_mb_release_context()),
> and when the entire length has been used, pa_pstart has been
> advanced to the first block of the next group.
>
> However, for inode_pa, pa_pstart is never advanced; it's just
> set once to the first block in the group and not moved after
> that. So in this case, if we subtract one in ext4_mb_put_pa(),
> we are actually locking the *previous* group, and opening the
> race with the other threads which do not subtract off the extra
> block.
>
> Signed-off-by: Eric Sandeen <[email protected]>
> ---
>
> ndex: linux-2.6/fs/ext4/mballoc.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/mballoc.c
> +++ linux-2.6/fs/ext4/mballoc.c
> @@ -3589,6 +3589,7 @@ static void ext4_mb_put_pa(struct ext4_a
> struct super_block *sb, struct ext4_prealloc_space *pa)
> {
> ext4_group_t grp;
> + ext4_fsblk_t grp_blk;
>
> if (!atomic_dec_and_test(&pa->pa_count) || pa->pa_free != 0)
> return;
> @@ -3603,8 +3604,12 @@ static void ext4_mb_put_pa(struct ext4_a
> pa->pa_deleted = 1;
> spin_unlock(&pa->pa_lock);
>
> - /* -1 is to protect from crossing allocation group */
> - ext4_get_group_no_and_offset(sb, pa->pa_pstart - 1, &grp, NULL);
> + grp_blk = pa->pa_pstart;
> + /* If linear, pa_pstart is in the next block group when pa is used up */


/* If linear, pa_pstart may be in the next block group when pa is used up */
^^^^^^^^^^


> + if (pa->pa_linear)
> + grp_blk--;
> +
> + ext4_get_group_no_and_offset(sb, grp_blk, &grp, NULL);
>
> /*
> * possible race:
>

Reviewed-by: Aneesh Kumar K.V <[email protected]>

2009-03-16 17:28:29

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH V3] fix bb_prealloc_list corruption due to wrong group locking

This is for Red Hat bug 490026,
EXT4 panic, list corruption in ext4_mb_new_inode_pa

ext4_lock_group(sb, group) is supposed to protect this list for
each group, and a common code flow to remove an album is like
this:

ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
ext4_lock_group(sb, grp);
list_del(&pa->pa_group_list);
ext4_unlock_group(sb, grp);

so it's critical that we get the right group number back for
this prealloc context, to lock the right group (the one
associated with this pa) and prevent concurrent list manipulation.

however, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a
comment, /* -1 is to protect from crossing allocation group */

This makes sense for the group_pa, where pa_pstart is advanced
by the length which has been used (in ext4_mb_release_context()),
and when the entire length has been used, pa_pstart has been
advanced to the first block of the next group.

However, for inode_pa, pa_pstart is never advanced; it's just
set once to the first block in the group and not moved after
that. So in this case, if we subtract one in ext4_mb_put_pa(),
we are actually locking the *previous* group, and opening the
race with the other threads which do not subtract off the extra
block.

Signed-off-by: Eric Sandeen <[email protected]>
---

(fixed up comment per Aneesh's suggestion)

Index: linux-2.6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.orig/fs/ext4/mballoc.c
+++ linux-2.6/fs/ext4/mballoc.c
@@ -3589,6 +3589,7 @@ static void ext4_mb_put_pa(struct ext4_a
struct super_block *sb, struct ext4_prealloc_space *pa)
{
ext4_group_t grp;
+ ext4_fsblk_t grp_blk;

if (!atomic_dec_and_test(&pa->pa_count) || pa->pa_free != 0)
return;
@@ -3603,8 +3604,12 @@ static void ext4_mb_put_pa(struct ext4_a
pa->pa_deleted = 1;
spin_unlock(&pa->pa_lock);

- /* -1 is to protect from crossing allocation group */
- ext4_get_group_no_and_offset(sb, pa->pa_pstart - 1, &grp, NULL);
+ grp_blk = pa->pa_pstart;
+ /* If linear, pa_pstart may be in the next group when pa is used up */
+ if (pa->pa_linear)
+ grp_blk--;
+
+ ext4_get_group_no_and_offset(sb, grp_blk, &grp, NULL);

/*
* possible race:

2009-03-16 17:43:08

by Frank Mayhar

[permalink] [raw]
Subject: Re: [PATCH V3] fix bb_prealloc_list corruption due to wrong group locking

On Mon, 2009-03-16 at 12:28 -0500, Eric Sandeen wrote:
> This is for Red Hat bug 490026,
> EXT4 panic, list corruption in ext4_mb_new_inode_pa
>
> ext4_lock_group(sb, group) is supposed to protect this list for
> each group, and a common code flow to remove an album is like
> this:
>
> ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
> ext4_lock_group(sb, grp);
> list_del(&pa->pa_group_list);
> ext4_unlock_group(sb, grp);
>
> so it's critical that we get the right group number back for
> this prealloc context, to lock the right group (the one
> associated with this pa) and prevent concurrent list manipulation.

Eric, this may just be coincidence, but is it possible that this may be
related to our bitmap problem I described last week? We haven't tracked
it down yet but it certainly smells like a race and your fix corrects
just such a race in the same code.

The bitmap problem, btw, involves stuff apparently being marked as used
when it's really free (or something very much like that), ultimately
resulting in double frees.
--
Frank Mayhar <[email protected]>
Google, Inc.


2009-03-16 17:48:32

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH V3] fix bb_prealloc_list corruption due to wrong group locking

Frank Mayhar wrote:
> On Mon, 2009-03-16 at 12:28 -0500, Eric Sandeen wrote:
>> This is for Red Hat bug 490026,
>> EXT4 panic, list corruption in ext4_mb_new_inode_pa
>>
>> ext4_lock_group(sb, group) is supposed to protect this list for
>> each group, and a common code flow to remove an album is like
>> this:
>>
>> ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
>> ext4_lock_group(sb, grp);
>> list_del(&pa->pa_group_list);
>> ext4_unlock_group(sb, grp);
>>
>> so it's critical that we get the right group number back for
>> this prealloc context, to lock the right group (the one
>> associated with this pa) and prevent concurrent list manipulation.
>
> Eric, this may just be coincidence, but is it possible that this may be
> related to our bitmap problem I described last week? We haven't tracked
> it down yet but it certainly smells like a race and your fix corrects
> just such a race in the same code.
>
> The bitmap problem, btw, involves stuff apparently being marked as used
> when it's really free (or something very much like that), ultimately
> resulting in double frees.

Hi Frank - I don't *think* so just because deleted items are poisoned
and I would expect that we'd trip over a bad pointer in the corrupted
list item as the first indicator of trouble... but I could be wrong.

I think you said you could reproduce it, right? So certainly worth
testing with this fix I suppose.

Thanks,
-Eric

2009-03-16 17:53:32

by Frank Mayhar

[permalink] [raw]
Subject: Re: [PATCH V3] fix bb_prealloc_list corruption due to wrong group locking

On Mon, 2009-03-16 at 12:48 -0500, Eric Sandeen wrote:
> Hi Frank - I don't *think* so just because deleted items are poisoned
> and I would expect that we'd trip over a bad pointer in the corrupted
> list item as the first indicator of trouble... but I could be wrong.
>
> I think you said you could reproduce it, right? So certainly worth
> testing with this fix I suppose.

Yeah, we're working on doing that now. We're not running with a lot of
debugging turned on so we may or may not actually fall over if we try to
traverse freed (or just wrong) list entries. If it isn't the same race,
though, then there's still one lurking in this code and that would be a
hell of a coincidence.
--
Frank Mayhar <[email protected]>
Google, Inc.


2009-03-17 03:30:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH V3] fix bb_prealloc_list corruption due to wrong group locking

On Mon, Mar 16, 2009 at 12:28:16PM -0500, Eric Sandeen wrote:
> This is for Red Hat bug 490026,
> EXT4 panic, list corruption in ext4_mb_new_inode_pa

Thanks, applied to the ext4 tree.

- Ted


2009-03-18 16:11:56

by Frank Mayhar

[permalink] [raw]
Subject: Re: [PATCH V3] fix bb_prealloc_list corruption due to wrong group locking

On Mon, 2009-03-16 at 10:53 -0700, Frank Mayhar wrote:
> On Mon, 2009-03-16 at 12:48 -0500, Eric Sandeen wrote:
> > Hi Frank - I don't *think* so just because deleted items are poisoned
> > and I would expect that we'd trip over a bad pointer in the corrupted
> > list item as the first indicator of trouble... but I could be wrong.
> >
> > I think you said you could reproduce it, right? So certainly worth
> > testing with this fix I suppose.
>
> Yeah, we're working on doing that now. We're not running with a lot of
> debugging turned on so we may or may not actually fall over if we try to
> traverse freed (or just wrong) list entries. If it isn't the same race,
> though, then there's still one lurking in this code and that would be a
> hell of a coincidence.

FYI we're running now with this patch in both a test and a production
environment and haven't (yet) seen the bitmap inconsistency we were
encountering regularly. So far we have over 18 hours of runtime. So,
despite the enormous coincidence, our problem appears to be due to the
same cause.

Thanks for the fix, Eric!
--
Frank Mayhar <[email protected]>
Google, Inc.


2009-03-18 16:17:25

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH V3] fix bb_prealloc_list corruption due to wrong group locking

Frank Mayhar wrote:
> On Mon, 2009-03-16 at 10:53 -0700, Frank Mayhar wrote:
>> On Mon, 2009-03-16 at 12:48 -0500, Eric Sandeen wrote:
>>> Hi Frank - I don't *think* so just because deleted items are poisoned
>>> and I would expect that we'd trip over a bad pointer in the corrupted
>>> list item as the first indicator of trouble... but I could be wrong.
>>>
>>> I think you said you could reproduce it, right? So certainly worth
>>> testing with this fix I suppose.
>> Yeah, we're working on doing that now. We're not running with a lot of
>> debugging turned on so we may or may not actually fall over if we try to
>> traverse freed (or just wrong) list entries. If it isn't the same race,
>> though, then there's still one lurking in this code and that would be a
>> hell of a coincidence.
>
> FYI we're running now with this patch in both a test and a production
> environment and haven't (yet) seen the bitmap inconsistency we were
> encountering regularly. So far we have over 18 hours of runtime. So,
> despite the enormous coincidence, our problem appears to be due to the
> same cause.
>
> Thanks for the fix, Eric!

Good deal; I'll just trust that it's fixed, and maybe at some point I'll
sort out how they're related ;)

Thanks,
-Eric

2009-03-18 18:11:20

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH V3] fix bb_prealloc_list corruption due to wrong group locking

On Wed, Mar 18, 2009 at 11:17:09AM -0500, Eric Sandeen wrote:
>
> Good deal; I'll just trust that it's fixed, and maybe at some point I'll
> sort out how they're related ;)
>

... and the patch just appeared in mainline and so should be in
2.6.29. :-)

- Ted