2022-11-23 19:45:29

by Jan Kara

Subject: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

When manipulating xattr blocks, we can deadlock infinitely looping
inside ext4_xattr_block_set() where we constantly keep finding xattr
block for reuse in mbcache but we are unable to reuse it because its
reference count is too big. This happens because cache entry for the
xattr block is marked as reusable (e_reusable set) although its
reference count is too big. When this inconsistency happens, this
inconsistent state is kept indefinitely and so ext4_xattr_block_set()
keeps retrying indefinitely.

The inconsistent state is caused by non-atomic update of e_reusable bit.
e_reusable is part of a bitfield and e_reusable update can race with
update of e_referenced bit in the same bitfield resulting in loss of one
of the updates. Fix the problem by using atomic bitops instead.

CC: [email protected]
Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")
Reported-and-tested-by: Jeremi Piotrowski <[email protected]>
Reported-by: Thilo Fromm <[email protected]>
Link: https://lore.kernel.org/r/[email protected]/
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/xattr.c | 4 ++--
fs/mbcache.c | 14 ++++++++------
include/linux/mbcache.h | 9 +++++++--
3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 800ce5cdb9d2..08043aa72cf1 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1281,7 +1281,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
ce = mb_cache_entry_get(ea_block_cache, hash,
bh->b_blocknr);
if (ce) {
- ce->e_reusable = 1;
+ set_bit(MBE_REUSABLE_B, &ce->e_flags);
mb_cache_entry_put(ea_block_cache, ce);
}
}
@@ -2042,7 +2042,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
}
BHDR(new_bh)->h_refcount = cpu_to_le32(ref);
if (ref == EXT4_XATTR_REFCOUNT_MAX)
- ce->e_reusable = 0;
+ clear_bit(MBE_REUSABLE_B, &ce->e_flags);
ea_bdebug(new_bh, "reusing; refcount now=%d",
ref);
ext4_xattr_block_csum_set(inode, new_bh);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index e272ad738faf..2a4b8b549e93 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -100,8 +100,9 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
atomic_set(&entry->e_refcnt, 2);
entry->e_key = key;
entry->e_value = value;
- entry->e_reusable = reusable;
- entry->e_referenced = 0;
+ entry->e_flags = 0;
+ if (reusable)
+ set_bit(MBE_REUSABLE_B, &entry->e_flags);
head = mb_cache_entry_head(cache, key);
hlist_bl_lock(head);
hlist_bl_for_each_entry(dup, dup_node, head, e_hash_list) {
@@ -165,7 +166,8 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
while (node) {
entry = hlist_bl_entry(node, struct mb_cache_entry,
e_hash_list);
- if (entry->e_key == key && entry->e_reusable &&
+ if (entry->e_key == key &&
+ test_bit(MBE_REUSABLE_B, &entry->e_flags) &&
atomic_inc_not_zero(&entry->e_refcnt))
goto out;
node = node->next;
@@ -284,7 +286,7 @@ EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
void mb_cache_entry_touch(struct mb_cache *cache,
struct mb_cache_entry *entry)
{
- entry->e_referenced = 1;
+ set_bit(MBE_REFERENCED_B, &entry->e_flags);
}
EXPORT_SYMBOL(mb_cache_entry_touch);

@@ -309,9 +311,9 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
entry = list_first_entry(&cache->c_list,
struct mb_cache_entry, e_list);
/* Drop initial hash reference if there is no user */
- if (entry->e_referenced ||
+ if (test_bit(MBE_REFERENCED_B, &entry->e_flags) ||
atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {
- entry->e_referenced = 0;
+ clear_bit(MBE_REFERENCED_B, &entry->e_flags);
list_move_tail(&entry->e_list, &cache->c_list);
continue;
}
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 2da63fd7b98f..97e64184767d 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -10,6 +10,12 @@

struct mb_cache;

+/* Cache entry flags */
+enum {
+ MBE_REFERENCED_B = 0,
+ MBE_REUSABLE_B
+};
+
struct mb_cache_entry {
/* List of entries in cache - protected by cache->c_list_lock */
struct list_head e_list;
@@ -26,8 +32,7 @@ struct mb_cache_entry {
atomic_t e_refcnt;
/* Key in hash - stable during lifetime of the entry */
u32 e_key;
- u32 e_referenced:1;
- u32 e_reusable:1;
+ unsigned long e_flags;
/* User provided value - stable during lifetime of the entry */
u64 e_value;
};
--
2.35.3


2022-12-01 15:17:28

by Jeremi Piotrowski

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

On Wed, Nov 23, 2022 at 08:39:50PM +0100, Jan Kara wrote:
> When manipulating xattr blocks, we can deadlock infinitely looping
> inside ext4_xattr_block_set() where we constantly keep finding xattr
> block for reuse in mbcache but we are unable to reuse it because its
> reference count is too big. This happens because cache entry for the
> xattr block is marked as reusable (e_reusable set) although its
> reference count is too big. When this inconsistency happens, this
> inconsistent state is kept indefinitely and so ext4_xattr_block_set()
> keeps retrying indefinitely.
>
> The inconsistent state is caused by non-atomic update of e_reusable bit.
> e_reusable is part of a bitfield and e_reusable update can race with
> update of e_referenced bit in the same bitfield resulting in loss of one
> of the updates. Fix the problem by using atomic bitops instead.
>
> CC: [email protected]
> Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")
> Reported-and-tested-by: Jeremi Piotrowski <[email protected]>
> Reported-by: Thilo Fromm <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]/
> Signed-off-by: Jan Kara <[email protected]>

Hi Ted,

Could it be that you didn't see this email? We have users who are hitting this
and are very eager to see this bugfix get merged and backported to stable.

Thanks,
Jeremi

> ---
> fs/ext4/xattr.c | 4 ++--
> fs/mbcache.c | 14 ++++++++------
> include/linux/mbcache.h | 9 +++++++--
> 3 files changed, 17 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 800ce5cdb9d2..08043aa72cf1 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1281,7 +1281,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
> ce = mb_cache_entry_get(ea_block_cache, hash,
> bh->b_blocknr);
> if (ce) {
> - ce->e_reusable = 1;
> + set_bit(MBE_REUSABLE_B, &ce->e_flags);
> mb_cache_entry_put(ea_block_cache, ce);
> }
> }
> @@ -2042,7 +2042,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
> }
> BHDR(new_bh)->h_refcount = cpu_to_le32(ref);
> if (ref == EXT4_XATTR_REFCOUNT_MAX)
> - ce->e_reusable = 0;
> + clear_bit(MBE_REUSABLE_B, &ce->e_flags);
> ea_bdebug(new_bh, "reusing; refcount now=%d",
> ref);
> ext4_xattr_block_csum_set(inode, new_bh);
> diff --git a/fs/mbcache.c b/fs/mbcache.c
> index e272ad738faf..2a4b8b549e93 100644
> --- a/fs/mbcache.c
> +++ b/fs/mbcache.c
> @@ -100,8 +100,9 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
> atomic_set(&entry->e_refcnt, 2);
> entry->e_key = key;
> entry->e_value = value;
> - entry->e_reusable = reusable;
> - entry->e_referenced = 0;
> + entry->e_flags = 0;
> + if (reusable)
> + set_bit(MBE_REUSABLE_B, &entry->e_flags);
> head = mb_cache_entry_head(cache, key);
> hlist_bl_lock(head);
> hlist_bl_for_each_entry(dup, dup_node, head, e_hash_list) {
> @@ -165,7 +166,8 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
> while (node) {
> entry = hlist_bl_entry(node, struct mb_cache_entry,
> e_hash_list);
> - if (entry->e_key == key && entry->e_reusable &&
> + if (entry->e_key == key &&
> + test_bit(MBE_REUSABLE_B, &entry->e_flags) &&
> atomic_inc_not_zero(&entry->e_refcnt))
> goto out;
> node = node->next;
> @@ -284,7 +286,7 @@ EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
> void mb_cache_entry_touch(struct mb_cache *cache,
> struct mb_cache_entry *entry)
> {
> - entry->e_referenced = 1;
> + set_bit(MBE_REFERENCED_B, &entry->e_flags);
> }
> EXPORT_SYMBOL(mb_cache_entry_touch);
>
> @@ -309,9 +311,9 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
> entry = list_first_entry(&cache->c_list,
> struct mb_cache_entry, e_list);
> /* Drop initial hash reference if there is no user */
> - if (entry->e_referenced ||
> + if (test_bit(MBE_REFERENCED_B, &entry->e_flags) ||
> atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {
> - entry->e_referenced = 0;
> + clear_bit(MBE_REFERENCED_B, &entry->e_flags);
> list_move_tail(&entry->e_list, &cache->c_list);
> continue;
> }
> diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
> index 2da63fd7b98f..97e64184767d 100644
> --- a/include/linux/mbcache.h
> +++ b/include/linux/mbcache.h
> @@ -10,6 +10,12 @@
>
> struct mb_cache;
>
> +/* Cache entry flags */
> +enum {
> + MBE_REFERENCED_B = 0,
> + MBE_REUSABLE_B
> +};
> +
> struct mb_cache_entry {
> /* List of entries in cache - protected by cache->c_list_lock */
> struct list_head e_list;
> @@ -26,8 +32,7 @@ struct mb_cache_entry {
> atomic_t e_refcnt;
> /* Key in hash - stable during lifetime of the entry */
> u32 e_key;
> - u32 e_referenced:1;
> - u32 e_reusable:1;
> + unsigned long e_flags;
> /* User provided value - stable during lifetime of the entry */
> u64 e_value;
> };
> --
> 2.35.3

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

On 01.12.22 16:10, Jeremi Piotrowski wrote:
> On Wed, Nov 23, 2022 at 08:39:50PM +0100, Jan Kara wrote:
>> When manipulating xattr blocks, we can deadlock infinitely looping
>> inside ext4_xattr_block_set() where we constantly keep finding xattr
>> block for reuse in mbcache but we are unable to reuse it because its
>> reference count is too big. This happens because cache entry for the
>> xattr block is marked as reusable (e_reusable set) although its
>> reference count is too big. When this inconsistency happens, this
>> inconsistent state is kept indefinitely and so ext4_xattr_block_set()
>> keeps retrying indefinitely.
>>
>> The inconsistent state is caused by non-atomic update of e_reusable bit.
>> e_reusable is part of a bitfield and e_reusable update can race with
>> update of e_referenced bit in the same bitfield resulting in loss of one
>> of the updates. Fix the problem by using atomic bitops instead.
>>
>> CC: [email protected]
>> Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")
>> Reported-and-tested-by: Jeremi Piotrowski <[email protected]>
>> Reported-by: Thilo Fromm <[email protected]>
>> Link: https://lore.kernel.org/r/[email protected]/
>> Signed-off-by: Jan Kara <[email protected]>
>
> Could it be that you didn't see this email? We have users who are hitting this
> and are very eager to see this bugfix get merged and backported to stable.

Andreas, Ted, or any other trusted ext4 reviewer:

Jan's patch to fix the regression is now 12 days out and afaics hasn't
made any progress (or did I miss something?). Is there a reason why, or
did it simply fall through the cracks? Just asking, because it would be
good to finally get this resolved.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

>> fs/ext4/xattr.c | 4 ++--
>> fs/mbcache.c | 14 ++++++++------
>> include/linux/mbcache.h | 9 +++++++--
>> 3 files changed, 17 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
>> index 800ce5cdb9d2..08043aa72cf1 100644
>> --- a/fs/ext4/xattr.c
>> +++ b/fs/ext4/xattr.c
>> @@ -1281,7 +1281,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
>> ce = mb_cache_entry_get(ea_block_cache, hash,
>> bh->b_blocknr);
>> if (ce) {
>> - ce->e_reusable = 1;
>> + set_bit(MBE_REUSABLE_B, &ce->e_flags);
>> mb_cache_entry_put(ea_block_cache, ce);
>> }
>> }
>> @@ -2042,7 +2042,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>> }
>> BHDR(new_bh)->h_refcount = cpu_to_le32(ref);
>> if (ref == EXT4_XATTR_REFCOUNT_MAX)
>> - ce->e_reusable = 0;
>> + clear_bit(MBE_REUSABLE_B, &ce->e_flags);
>> ea_bdebug(new_bh, "reusing; refcount now=%d",
>> ref);
>> ext4_xattr_block_csum_set(inode, new_bh);
>> diff --git a/fs/mbcache.c b/fs/mbcache.c
>> index e272ad738faf..2a4b8b549e93 100644
>> --- a/fs/mbcache.c
>> +++ b/fs/mbcache.c
>> @@ -100,8 +100,9 @@ int mb_cache_entry_create(struct mb_cache *cache, gfp_t mask, u32 key,
>> atomic_set(&entry->e_refcnt, 2);
>> entry->e_key = key;
>> entry->e_value = value;
>> - entry->e_reusable = reusable;
>> - entry->e_referenced = 0;
>> + entry->e_flags = 0;
>> + if (reusable)
>> + set_bit(MBE_REUSABLE_B, &entry->e_flags);
>> head = mb_cache_entry_head(cache, key);
>> hlist_bl_lock(head);
>> hlist_bl_for_each_entry(dup, dup_node, head, e_hash_list) {
>> @@ -165,7 +166,8 @@ static struct mb_cache_entry *__entry_find(struct mb_cache *cache,
>> while (node) {
>> entry = hlist_bl_entry(node, struct mb_cache_entry,
>> e_hash_list);
>> - if (entry->e_key == key && entry->e_reusable &&
>> + if (entry->e_key == key &&
>> + test_bit(MBE_REUSABLE_B, &entry->e_flags) &&
>> atomic_inc_not_zero(&entry->e_refcnt))
>> goto out;
>> node = node->next;
>> @@ -284,7 +286,7 @@ EXPORT_SYMBOL(mb_cache_entry_delete_or_get);
>> void mb_cache_entry_touch(struct mb_cache *cache,
>> struct mb_cache_entry *entry)
>> {
>> - entry->e_referenced = 1;
>> + set_bit(MBE_REFERENCED_B, &entry->e_flags);
>> }
>> EXPORT_SYMBOL(mb_cache_entry_touch);
>>
>> @@ -309,9 +311,9 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
>> entry = list_first_entry(&cache->c_list,
>> struct mb_cache_entry, e_list);
>> /* Drop initial hash reference if there is no user */
>> - if (entry->e_referenced ||
>> + if (test_bit(MBE_REFERENCED_B, &entry->e_flags) ||
>> atomic_cmpxchg(&entry->e_refcnt, 1, 0) != 1) {
>> - entry->e_referenced = 0;
>> + clear_bit(MBE_REFERENCED_B, &entry->e_flags);
>> list_move_tail(&entry->e_list, &cache->c_list);
>> continue;
>> }
>> diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
>> index 2da63fd7b98f..97e64184767d 100644
>> --- a/include/linux/mbcache.h
>> +++ b/include/linux/mbcache.h
>> @@ -10,6 +10,12 @@
>>
>> struct mb_cache;
>>
>> +/* Cache entry flags */
>> +enum {
>> + MBE_REFERENCED_B = 0,
>> + MBE_REUSABLE_B
>> +};
>> +
>> struct mb_cache_entry {
>> /* List of entries in cache - protected by cache->c_list_lock */
>> struct list_head e_list;
>> @@ -26,8 +32,7 @@ struct mb_cache_entry {
>> atomic_t e_refcnt;
>> /* Key in hash - stable during lifetime of the entry */
>> u32 e_key;
>> - u32 e_referenced:1;
>> - u32 e_reusable:1;
>> + unsigned long e_flags;
>> /* User provided value - stable during lifetime of the entry */
>> u64 e_value;
>> };
>> --
>> 2.35.3

2022-12-06 01:57:35

by Andreas Dilger

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

On Wed, Nov 23, 2022 at 08:39:50PM +0100, Jan Kara wrote:
> When manipulating xattr blocks, we can deadlock infinitely looping
> inside ext4_xattr_block_set() where we constantly keep finding xattr
> block for reuse in mbcache but we are unable to reuse it because its
> reference count is too big. This happens because cache entry for the
> xattr block is marked as reusable (e_reusable set) although its
> reference count is too big. When this inconsistency happens, this
> inconsistent state is kept indefinitely and so ext4_xattr_block_set()
> keeps retrying indefinitely.
>
> The inconsistent state is caused by non-atomic update of e_reusable bit.
> e_reusable is part of a bitfield and e_reusable update can race with
> update of e_referenced bit in the same bitfield resulting in loss of one
> of the updates. Fix the problem by using atomic bitops instead.
>
> CC: [email protected]
> Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")
> Reported-and-tested-by: Jeremi Piotrowski <[email protected]>
> Reported-by: Thilo Fromm <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]/
> Signed-off-by: Jan Kara <[email protected]>

Reviewed-by: Andreas Dilger <[email protected]>

Cheers, Andreas







2022-12-08 05:59:18

by Theodore Ts'o

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

On Mon, Dec 05, 2022 at 04:41:49PM +0100, Thorsten Leemhuis wrote:
>
> Jan's patch to fix the regression is now 12 days out and afaics hasn't
> made any progress (or did I miss something?). Is there a reason why, or
> did it simply fall through the cracks? Just asking, because it would be
> good to finally get this resolved.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

This patch (a) showed up right before the Thanksgiving holiday, and (b)
just missed the Q/A cutoff for the ext4 bugfix pull request which I
sent to Linus right before I went on my Thanksgiving break.

Since Thanksgiving, I've been busy with the realities of corporate
life --- end of year performance evaluations, preparing for 2023
roadmap reviews with management, etc. So the next pull request I was
planning to send to Linus is when the merge window opens, and I'm
currently processing patches and running Q/A to be ready for the
opening of that merge window.


One thing which is completely unclear to me is how this relates to the
claimed regression. I understand that Jeremi and Thilo have asserted
that the hang goes away if a backport of commit 51ae846cff5 ("ext4: fix
warning in ext4_iomap_begin as race between bmap and write") is not in
their 5.15 product tree.

However, the stack traces point to a problem in the extended attribute
code, which has nothing to do with ext4_bmap(), and commit 51ae846cff5
only changes the ext4's bmap function --- which these days gets used
for the FIBMAP ioctl and very little else.

Furthermore, the fix which Jan provided, and which apparently fixes
the user's problem, (a) doesn't touch the ext4_bmap function, and (b)
has a fixes tag for the patch:

Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")

... which is a commit which dates back to 2016, and the v4.6 kernel. ?!?

So at this point, I have no idea whether or not this is a regression,
but we'll get the fix to Linus soon.

Cheers,

- Ted

2022-12-08 11:20:24

by Jan Kara

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

Hi Ted!

On Thu 08-12-22 00:55:55, Theodore Ts'o wrote:
> One thing which is completely unclear to me is how this relates to the
> claimed regression. I understand that Jeremi and Thilo have asserted
> that the hang goes away if a backport commit 51ae846cff5 ("ext4: fix
> warning in ext4_iomap_begin as race between bmap and write") is not in
> their 5.15 product tree.

IMHO the bisection was flawed and somehow the test of a revert (which the
guys claimed to have done) must have been lucky and didn't trip the race.
This is not that hard to imagine because, firstly, commit 51ae846cff5 got
included in the same stable kernel release as the ext4 xattr changes
(65f8b80053 ("ext4: fix race when reusing xattr blocks") and related
mbcache changes), which made the race more likely. Secondly, the mbcache
state corruption is not that easy to hit because you need an interaction
of slab reclaim on an mbcache entry with the ext4 xattr code adding a
reference to the xattr block and just hitting the reference limit.

> However, the stack traces point to a problem in the extended attribute
> code, which has nothing to do with ext4_bmap(), and commit 51ae846cff5
> only changes the ext4's bmap function --- which these days gets used
> for the FIBMAP ioctl and very little else.
>
> Furthermore, the fix which Jan provided, and which apparently fixes
> the user's problem, (a) doesn't touch the ext4_bmap function, and (b)
> has a fixes tag for the patch:
>
> Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")
>
> ... which is a commit which dates back to 2016, and the v4.6 kernel. ?!?

Yes. AFAICT the bitfield race in mbcache was introduced in this commit but
somehow ext4 was using mbcache in a way that wasn't tripping the race.
After 65f8b80053 ("ext4: fix race when reusing xattr blocks"), the race
became much more likely and users started to notice...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-12-08 15:22:42

by Jeremi Piotrowski

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

On Thu, Dec 08, 2022 at 10:15:23AM +0100, Jan Kara wrote:
> Hi Ted!
>
> On Thu 08-12-22 00:55:55, Theodore Ts'o wrote:
> > One thing which is completely unclear to me is how this relates to the
> > claimed regression. I understand that Jeremi and Thilo have asserted
> > that the hang goes away if a backport commit 51ae846cff5 ("ext4: fix
> > warning in ext4_iomap_begin as race between bmap and write") is not in
> > their 5.15 product tree.
>
> IMHO the bisection was flawed and somehow the test of a revert (which the
> guys claimed to have done) must have been lucky and didn't trip the race.
> This is not that hard to imagine because, firstly, commit 51ae846cff5 got
> included in the same stable kernel release as the ext4 xattr changes
> (65f8b80053 ("ext4: fix race when reusing xattr blocks") and related
> mbcache changes), which made the race more likely. Secondly, the mbcache
> state corruption is not that easy to hit because you need an interaction
> of slab reclaim on an mbcache entry with the ext4 xattr code adding a
> reference to the xattr block and just hitting the reference limit.
>

Yeah, sorry about that: there was never a bisect that led to 51ae846cff5.
It was just a guess, and at that point we were unable to reproduce the
issue ourselves, so all we had was information from a user stating that
when they reverted that commit in their own test build, the issue didn't
occur.

Once we were able to reproduce it ourselves, the actual bisect led to
65f8b80053, which, as Honza stated, makes sure the corruption/
inconsistency leads to a busy loop that is harder to miss.

> > However, the stack traces point to a problem in the extended attribute
> > code, which has nothing to do with ext4_bmap(), and commit 51ae846cff5
> > only changes the ext4's bmap function --- which these days gets used
> > for the FIBMAP ioctl and very little else.
> >
> > Furthermore, the fix which Jan provided, and which apparently fixes
> > the user's problem, (a) doesn't touch the ext4_bmap function, and (b)
> > has a fixes tag for the patch:
> >
> > Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")
> >
> > ... which is a commit which dates back to 2016, and the v4.6 kernel. ?!?
>
> Yes. AFAICT the bitfield race in mbcache was introduced in this commit but
> somehow ext4 was using mbcache in a way that wasn't tripping the race.
> After 65f8b80053 ("ext4: fix race when reusing xattr blocks"), the race
> became much more likely and users started to notice...
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-12-08 15:41:36

by Theodore Ts'o

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

On Thu, Dec 08, 2022 at 10:15:23AM +0100, Jan Kara wrote:
> > Furthermore, the fix which Jan provided, and which apparently fixes
> > the user's problem, (a) doesn't touch the ext4_bmap function, and (b)
> > has a fixes tag for the patch:
> >
> > Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")
> >
> > ... which is a commit which dates back to 2016, and the v4.6 kernel. ?!?
>
> Yes. AFAICT the bitfield race in mbcache was introduced in this commit but
> somehow ext4 was using mbcache in a way that wasn't tripping the race.
> After 65f8b80053 ("ext4: fix race when reusing xattr blocks"), the race
> became much more likely and users started to notice...

Ah, OK. And 65f8b80053 landed in 6.0, so while the bug may have been
around for much longer, this change made it much more likely that
folks would notice. That's the missing piece and why Microsoft
started noticing this in their "Flatcar" container kernel.

So I'll update the commit description so that this is more clear, and
then I can figure out how to tell the regression-bot that the
regression should be tracked using commit 65f8b80053 instead of
51ae846cff5 ("ext4: fix warning in ext4_iomap_begin as race between
bmap and write").

Cheers, and thanks for the clarification,

- Ted

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

On 08.12.22 16:39, Theodore Ts'o wrote:
> On Thu, Dec 08, 2022 at 10:15:23AM +0100, Jan Kara wrote:
>>> Furthermore, the fix which Jan provided, and which apparently fixes
>>> the user's problem, (a) doesn't touch the ext4_bmap function, and (b)
>>> has a fixes tag for the patch:
>>>
>>> Fixes: 6048c64b2609 ("mbcache: add reusable flag to cache entries")
>>>
>>> ... which is a commit which dates back to 2016, and the v4.6 kernel. ?!?
>>
>> Yes. AFAICT the bitfield race in mbcache was introduced in this commit but
>> somehow ext4 was using mbcache in a way that wasn't tripping the race.
>> After 65f8b80053 ("ext4: fix race when reusing xattr blocks"), the race
>> became much more likely and users started to notice...
>
> Ah, OK. And 65f8b80053 landed in 6.0, so while the bug may have been
> around for much longer, this change made it much more likely that
> folks would notice. That's the missing piece and why Microsoft
> started noticing this in their "Flatcar" container kernel.

Yeah, likely when 65f8b80053 was backported to 5.15.y in 1be97463696c

> So I'll update the commit description so that this is more clear,

Thx for taking care of this, I'm glad this is on track now.

Maybe I should talk to Greg again to revert backported changes like
1be97463696c until fixes for them are ready.

> and
> then I can figure out how to tell the regression-bot that the
> regression should be tracked using commit 65f8b80053 instead of
> 51ae846cff5 ("ext4: fix warning in ext4_iomap_begin as race between
> bmap and write").

FWIW, there is no strong need to, nobody looks at those details once the
regression is fixed. But yeah, that might change over time, so let me
take care of that:

#regzbot introduced: 65f8b80053

[normally things like that have to be done as a direct or indirect reply
to the report, but regzbot knows (famous last words...) how to associate
this command with the report, as the patch that started this thread
linked to the report using a Link: tag].

Ciao, Thorsten

2022-12-09 06:13:15

by Theodore Ts'o

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

On Thu, Dec 08, 2022 at 06:16:02PM +0100, Thorsten Leemhuis wrote:
>
> Maybe I should talk to Greg again to revert backported changes like
> 1be97463696c until fixes for them are ready.

The fix is in the ext4 git tree, and it's ready to be pushed to Linus
when the merge window opens --- presumably, on Sunday.

So it's probably not worth it to revert the backported change, only to
reapply immediately afterwards.

- Ted

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

On 09.12.22 07:12, Theodore Ts'o wrote:
> On Thu, Dec 08, 2022 at 06:16:02PM +0100, Thorsten Leemhuis wrote:
>>
>> Maybe I should talk to Greg again to revert backported changes like
>> 1be97463696c until fixes for them are ready.
>
> The fix is in the ext4 git tree, and it's ready to be pushed to Linus
> when the merge window opens --- presumably, on Sunday.

Thx!

> So it's probably not worth it to revert the backported change, only to
> reapply immediately afterwards.

Definitely agreed, I was more talking in the general sense (sorry, should
have been clearer), as it's not the first time some backport exposes
existing problems that take a while to get analyzed and fixed in
mainline. Which is just how it is sometimes, hence a revert and a
reapply of that backport (once the fix is available) in stable/longterm
sounds appropriate to me to prevent users from running into known problems.

Ciao, Thorsten

2022-12-09 06:54:51

by Sasha Levin

Subject: Re: [PATCH] ext4: Fix deadlock due to mbcache entry corruption

On Fri, Dec 09, 2022 at 07:31:03AM +0100, Thorsten Leemhuis wrote:
>On 09.12.22 07:12, Theodore Ts'o wrote:
>> On Thu, Dec 08, 2022 at 06:16:02PM +0100, Thorsten Leemhuis wrote:
>>>
>>> Maybe I should talk to Greg again to revert backported changes like
>>> 1be97463696c until fixes for them are ready.
>>
>> The fix is in the ext4 git tree, and it's ready to be pushed to Linus
>> when the merge window opens --- presumably, on Sunday.
>
>Thx!
>
>> So it's probably not worth it to revert the backported change, only to
>> reapply immediately afterwards.
>
>Definitely agreed, I was more taking in the general sense (sorry, should
>have been clearer), as it's not the first time some backport exposes
>existing problems that take a while to get analyzed and fixed in
>mainline. Which is just how it is sometimes, hence a revert and a
>reapply of that backport (once the fix is available) in stable/longterm
>sounds appropriate to me to prevent users from running into known problems.

It's a balancing act: reverting a fix would mean reintroducing to users
an issue that was previously fixed. It's not always the right thing to
do, and sometimes we won't.

--
Thanks,
Sasha