2021-07-20 06:20:24

by yangerkun

Subject: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super

Commit c92dc856848f ("ext4: defer saving error info from atomic
context") and commit 2d01ddc86606 ("ext4: save error info to sb through
journal if available") added s_error_work to fix a checksum error
problem. But the error path in ext4_fill_super can trigger the following
BUG_ON.

Our test case hit the following BUG:
[32031.759805] ------------[ cut here ]------------
[32031.759807] kernel BUG at fs/jbd2/transaction.c:373!
[32031.760075] invalid opcode: 0000 [#1] SMP PTI
[32031.760336] CPU: 5 PID: 1029268 Comm: kworker/5:1 Kdump: loaded Tainted: G OE --------- - - 4.18.0
...
[32031.766665] jbd2__journal_start+0xf1/0x1f0 [jbd2]
[32031.766934] jbd2_journal_start+0x19/0x20 [jbd2]
[32031.767218] flush_stashed_error_work+0x30/0x90 [ext4]
[32031.767487] process_one_work+0x195/0x390
[32031.767747] worker_thread+0x30/0x390
[32031.768007] ? process_one_work+0x390/0x390
[32031.768265] kthread+0x10d/0x130
[32031.768521] ? kthread_flush_work_fn+0x10/0x10
[32031.768778] ret_from_fork+0x35/0x40

static int start_this_handle(...)
        BUG_ON(journal->j_flags & JBD2_UNMOUNT); <---- Trigger this

flush_stashed_error_work() will try to access the journal, so we must
not flush s_error_work after the journal has been destroyed.
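
The race, schematically (an editorial sketch reconstructed from the trace
above; jbd2_journal_destroy() sets JBD2_UNMOUNT, and
flush_stashed_error_work() is the s_error_work handler in fs/ext4/super.c):

ext4_fill_super (error path)            kworker
----------------------------            ------------------------------------
__ext4_error()
  schedule_work(&sbi->s_error_work)
...
failed_mount_wq:
  jbd2_journal_destroy(sbi->s_journal)
  // JBD2_UNMOUNT now set
                                        flush_stashed_error_work()
                                          journal = sbi->s_journal
                                          // still non-NULL here
  sbi->s_journal = NULL
                                        jbd2_journal_start(journal, 1)
                                          start_this_handle()
                                            BUG_ON(journal->j_flags &
                                                   JBD2_UNMOUNT)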

Fixes: c92dc856848f ("ext4: defer saving error info from atomic context")
Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
Signed-off-by: yangerkun <[email protected]>
---
fs/ext4/super.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index dfa09a277b56..7db2be03848f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5173,15 +5173,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 
 	ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
 	sbi->s_ea_block_cache = NULL;
+failed_mount3a:
+	ext4_es_unregister_shrinker(sbi);
+failed_mount3:
+	flush_work(&sbi->s_error_work);
 
 	if (sbi->s_journal) {
 		jbd2_journal_destroy(sbi->s_journal);
 		sbi->s_journal = NULL;
 	}
-failed_mount3a:
-	ext4_es_unregister_shrinker(sbi);
-failed_mount3:
-	flush_work(&sbi->s_error_work);
 	del_timer_sync(&sbi->s_err_report);
 	ext4_stop_mmpd(sbi);
 failed_mount2:
--
2.31.1


2021-07-23 12:12:22

by Theodore Ts'o

Subject: Re: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super

On Tue, Jul 20, 2021 at 02:24:09PM +0800, yangerkun wrote:
> Commit c92dc856848f ("ext4: defer saving error info from atomic
> context") and commit 2d01ddc86606 ("ext4: save error info to sb through
> journal if available") added s_error_work to fix a checksum error
> problem. But the error path in ext4_fill_super can trigger the following
> BUG_ON.

Can you share with me your test case? Your patch will result in the
shrinker potentially not getting released in some error paths (which
will cause other kernel panics), and in any case, once the journal is
destroyed here:

> @@ -5173,15 +5173,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>
>  	ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
>  	sbi->s_ea_block_cache = NULL;
> +failed_mount3a:
> +	ext4_es_unregister_shrinker(sbi);
> +failed_mount3:
> +	flush_work(&sbi->s_error_work);
>
>  	if (sbi->s_journal) {
>  		jbd2_journal_destroy(sbi->s_journal);
>  		sbi->s_journal = NULL;
>  	}
> -failed_mount3a:
> -	ext4_es_unregister_shrinker(sbi);
> -failed_mount3:
> -	flush_work(&sbi->s_error_work);

sbi->s_journal is set to NULL, which means that in
flush_stashed_error_work(), journal will be NULL, which means we won't
call start_this_handle and so this change will not make a difference
given the kernel stack trace in the commit description.
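
For reference, the top of flush_stashed_error_work() (a sketch assembled
from the diff hunks quoted later in this thread; the exact code varies by
kernel version):

static void flush_stashed_error_work(struct work_struct *work)
{
	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
						s_error_work);
	journal_t *journal = sbi->s_journal;	/* read once, at entry */
	handle_t *handle;

	/*
	 * We use directly jbd2 functions here to avoid recursing back into
	 * ext4 error handling code during handling of previous errors.
	 */
	if (!sb_rdonly(sbi->s_sb) && journal) {
		struct buffer_head *sbh = sbi->s_sbh;

		/* This is the call that reaches start_this_handle(). */
		handle = jbd2_journal_start(journal, 1);
		...
	}
	...
}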

Thanks,

- Ted

2021-07-23 13:11:56

by yangerkun

Subject: Re: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super



On 2021/7/23 20:11, Theodore Ts'o wrote:
> On Tue, Jul 20, 2021 at 02:24:09PM +0800, yangerkun wrote:
>> Commit c92dc856848f ("ext4: defer saving error info from atomic
>> context") and commit 2d01ddc86606 ("ext4: save error info to sb through
>> journal if available") added s_error_work to fix a checksum error
>> problem. But the error path in ext4_fill_super can trigger the following
>> BUG_ON.
>
> Can you share with me your test case? Your patch will result in the
> shrinker potentially not getting released in some error paths (which
> will cause other kernel panics), and in any case, once the journal is
> destroyed here:
>
>> @@ -5173,15 +5173,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>
>>  	ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
>>  	sbi->s_ea_block_cache = NULL;
>> +failed_mount3a:
>> +	ext4_es_unregister_shrinker(sbi);
>> +failed_mount3:
>> +	flush_work(&sbi->s_error_work);
>>
>>  	if (sbi->s_journal) {
>>  		jbd2_journal_destroy(sbi->s_journal);
>>  		sbi->s_journal = NULL;
>>  	}
>> -failed_mount3a:
>> -	ext4_es_unregister_shrinker(sbi);
>> -failed_mount3:
>> -	flush_work(&sbi->s_error_work);
>
> sbi->s_journal is set to NULL, which means that in
> flush_stashed_error_work(), journal will be NULL, which means we won't
> call start_this_handle and so this change will not make a difference
> given the kernel stack trace in the commit description.


For example, before we goto failed_mount_wq, we may hit some error and
go to ext4_handle_error(), which can call
schedule_work(&EXT4_SB(sb)->s_error_work). So the work may run
concurrently with ext4_fill_super reaching failed_mount_wq. There is no
lock protecting the concurrent read and modification of sbi->s_journal.


We can inject a fault while mounting and add some delay between
jbd2_journal_destroy() and the clearing of sbi->s_journal, like the
following. Then we get a panic report easily...


failed_mount_wq:
	ext4_xattr_destroy_cache(sbi->s_ea_inode_cache);
	sbi->s_ea_inode_cache = NULL;

	ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
	sbi->s_ea_block_cache = NULL;

	if (sbi->s_journal) {
		jbd2_journal_destroy(sbi->s_journal);
		msleep(300); <---- add some delay
		sbi->s_journal = NULL;
	}


So we need to make sure the work has finished before we destroy the
journal. It may be better to fix this by moving the flush_work() before
destroying the journal.
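
The ordering rule can be illustrated with a userspace analogy (a loose
sketch only, not kernel code; pthread_join() plays the role of
flush_work(), free() plays jbd2_journal_destroy(), and all names here are
invented for the sketch):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct journal { int flags; };

static struct journal *journal;         /* stands in for sbi->s_journal */

/* Stands in for flush_stashed_error_work(): runs asynchronously and
 * dereferences the shared pointer without any lock. */
static void *error_worker(void *arg)
{
        struct journal *j = journal;

        (void)arg;
        if (j)
                printf("worker saw flags=%d\n", j->flags);
        return NULL;
}

int main(void)
{
        pthread_t worker;

        journal = calloc(1, sizeof(*journal));
        if (!journal)
                return 1;
        pthread_create(&worker, NULL, error_worker, NULL);

        /*
         * Safe order: wait for the worker *before* the teardown. Freeing
         * first and waiting afterwards is exactly the use-after-free shape
         * of the BUG_ON discussed in this thread.
         */
        pthread_join(worker, NULL);     /* the flush_work() analogue */
        free(journal);                  /* the jbd2_journal_destroy() analogue */
        journal = NULL;
        return 0;
}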


>
> Thanks,
>
> - Ted
> .
>

2021-07-23 13:25:34

by yangerkun

Subject: Re: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super



On 2021/7/23 20:11, Theodore Ts'o wrote:
> On Tue, Jul 20, 2021 at 02:24:09PM +0800, yangerkun wrote:
>> Commit c92dc856848f ("ext4: defer saving error info from atomic
>> context") and commit 2d01ddc86606 ("ext4: save error info to sb through
>> journal if available") added s_error_work to fix a checksum error
>> problem. But the error path in ext4_fill_super can trigger the following
>> BUG_ON.
>
> Can you share with me your test case? Your patch will result in the
> shrinker potentially not getting released in some error paths (which
> will cause other kernel panics), and in any case, once the journal is

Hi Ted,

The only logic we have changed is that we move the flush_work() before
we call jbd2_journal_destroy(). I have not seen the problem you
describe... Can you help explain more...

Thanks,
Kun.

> destroyed here:
>
>> @@ -5173,15 +5173,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>
>>  	ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
>>  	sbi->s_ea_block_cache = NULL;
>> +failed_mount3a:
>> +	ext4_es_unregister_shrinker(sbi);
>> +failed_mount3:
>> +	flush_work(&sbi->s_error_work);
>>
>>  	if (sbi->s_journal) {
>>  		jbd2_journal_destroy(sbi->s_journal);
>>  		sbi->s_journal = NULL;
>>  	}
>> -failed_mount3a:
>> -	ext4_es_unregister_shrinker(sbi);
>> -failed_mount3:
>> -	flush_work(&sbi->s_error_work);
>
> sbi->s_journal is set to NULL, which means that in
> flush_stashed_error_work(), journal will be NULL, which means we won't
> call start_this_handle and so this change will not make a difference
> given the kernel stack trace in the commit description.
>
> Thanks,
>
> - Ted
> .
>

2021-07-23 19:12:06

by Theodore Ts'o

Subject: Re: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super

On Fri, Jul 23, 2021 at 09:11:08PM +0800, yangerkun wrote:
>
> For example, before we goto failed_mount_wq, we may hit some error and
> go to ext4_handle_error(), which can call
> schedule_work(&EXT4_SB(sb)->s_error_work). So the work may run
> concurrently with ext4_fill_super reaching failed_mount_wq. There is no
> lock protecting the concurrent read and modification of sbi->s_journal.

Yes, and I'm asking *how* is this actually happening in practice?
I've been going through the code paths and I don't see any place where
ext4_error*() would be called. That's why I wanted to see your test
case which was reproducing it. (Not just where you added the msleep,
but how the error was getting triggered in the first place.)


On Fri, Jul 23, 2021 at 09:25:12PM +0800, yangerkun wrote:
>
> > Can you share with me your test case? Your patch will result in the
> > shrinker potentially not getting released in some error paths (which
> > will cause other kernel panics), and in any case, once the journal is
>
> The only logic we have changed is that we move the flush_work() before we
> call jbd2_journal_destroy(). I have not seen the problem you describe...
> Can you help explain more...

Sorry, I was mistaken. I thought you were moving the
ext4_es_unregister_shrinker() and flush_work() before the label for
failed_mount_wq; that was a misreading of your patch.

The other way we could fix this might be something like this:

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index dfa09a277b56..d663d11fa0de 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -693,7 +693,7 @@ static void flush_stashed_error_work(struct work_struct *work)
 {
 	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
 						s_error_work);
-	journal_t *journal = sbi->s_journal;
+	journal_t *journal = READ_ONCE(sbi->s_journal);
 	handle_t *handle;
 
 	/*
@@ -1184,9 +1184,11 @@ static void ext4_put_super(struct super_block *sb)
 	ext4_unregister_sysfs(sb);
 
 	if (sbi->s_journal) {
-		aborted = is_journal_aborted(sbi->s_journal);
-		err = jbd2_journal_destroy(sbi->s_journal);
-		sbi->s_journal = NULL;
+		journal_t *journal = sbi->s_journal;
+
+		WRITE_ONCE(sbi->s_journal, NULL);
+		aborted = is_journal_aborted(journal);
+		err = jbd2_journal_destroy(journal);
 		if ((err < 0) && !aborted) {
 			ext4_abort(sb, -err, "Couldn't clean up the journal");
 		}
@@ -5175,8 +5177,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_ea_block_cache = NULL;
 
 	if (sbi->s_journal) {
-		jbd2_journal_destroy(sbi->s_journal);
-		sbi->s_journal = NULL;
+		journal_t *journal = sbi->s_journal;
+
+		WRITE_ONCE(sbi->s_journal, NULL);
+		jbd2_journal_destroy(journal);
 	}
 failed_mount3a:
 	ext4_es_unregister_shrinker(sbi);
@@ -5487,7 +5491,7 @@ static int ext4_load_journal(struct super_block *sb,
 	EXT4_SB(sb)->s_journal = journal;
 	err = ext4_clear_journal_err(sb, es);
 	if (err) {
-		EXT4_SB(sb)->s_journal = NULL;
+		WRITE_ONCE(EXT4_SB(sb)->s_journal, NULL);
 		jbd2_journal_destroy(journal);
 		return err;
 	}

... and here's another possible fix:

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index dfa09a277b56..e9e122e52ce8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -704,7 +704,8 @@ static void flush_stashed_error_work(struct work_struct *work)
 	 * We use directly jbd2 functions here to avoid recursing back into
 	 * ext4 error handling code during handling of previous errors.
 	 */
-	if (!sb_rdonly(sbi->s_sb) && journal) {
+	if (!sb_rdonly(sbi->s_sb) && journal &&
+	    !(journal->j_flags & JBD2_UNMOUNT)) {
 		struct buffer_head *sbh = sbi->s_sbh;
 		handle = jbd2_journal_start(journal, 1);
 		if (IS_ERR(handle))



But I would be interested in understanding how we could be triggering
this problem in the first place before deciding what's the best fix.

Cheers,

- Ted

2021-07-26 07:14:18

by yangerkun

Subject: Re: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super



On 2021/7/24 3:11, Theodore Ts'o wrote:
> On Fri, Jul 23, 2021 at 09:11:08PM +0800, yangerkun wrote:
>>
>> For example, before we goto failed_mount_wq, we may hit some error and
>> go to ext4_handle_error(), which can call
>> schedule_work(&EXT4_SB(sb)->s_error_work). So the work may run
>> concurrently with ext4_fill_super reaching failed_mount_wq. There is no
>> lock protecting the concurrent read and modification of sbi->s_journal.
>
> Yes, and I'm asking *how* is this actually happening in practice?
> I've been going through the code paths and I don't see any place where
> ext4_error*() would be called. That's why I wanted to see your test
> case which was reproducing it. (Not just where you added the msleep,
> but how the error was getting triggered in the first place.)

Hi Ted,


The problem happened only once, early on, with a parallel LTP test case
(but we cannot reproduce it again with the same case...). The dmesg was:


[32031.739678] EXT4-fs error (device loop66): ext4_fill_super:4672: comm chdir01: inode #2: comm chdir01: iget: illegal inode #
[32031.740193] EXT4-fs (loop66): get root inode failed
[32031.740484] EXT4-fs (loop66): mount failed
[32031.758811] EXT4-fs error (device loop66): ext4_map_blocks:595: inode #8: block 532: comm chdir01: lblock 1 mapped to illegal pblock 532 (length 1)
[32031.759293] jbd2_journal_bmap: journal block not found at offset 1 on loop66-8
[32031.759805] ------------[ cut here ]------------
[32031.759807] kernel BUG at fs/jbd2/transaction.c:373!


ext4_fill_super
  ext4_load_journal
    EXT4_SB(sb)->s_journal = journal
  root = ext4_iget(sb, EXT4_ROOT_INO, EXT4_IGET_SPECIAL)
    // will fail and goto failed_mount4
    __ext4_iget
      __ext4_error
        ext4_handle_error
          schedule_work(&EXT4_SB(sb)->s_error_work)


And this triggers the concurrent read and modification of sbi->s_journal...

Thanks,
Kun.


>
>
> On Fri, Jul 23, 2021 at 09:25:12PM +0800, yangerkun wrote:
>>
>>> Can you share with me your test case? Your patch will result in the
>>> shrinker potentially not getting released in some error paths (which
>>> will cause other kernel panics), and in any case, once the journal is
>>
>> The only logic we have changed is that we move the flush_work() before we
>> call jbd2_journal_destroy(). I have not seen the problem you describe...
>> Can you help explain more...
>
> Sorry, I was mistaken. I thought you were moving the
> ext4_es_unregister_shrinker() and flush_work() before the label for
> failed_mount_wq; that was a misreading of your patch.
>
> The other way we could fix this might be something like this:
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index dfa09a277b56..d663d11fa0de 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -693,7 +693,7 @@ static void flush_stashed_error_work(struct work_struct *work)
>  {
>  	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
>  						s_error_work);
> -	journal_t *journal = sbi->s_journal;
> +	journal_t *journal = READ_ONCE(sbi->s_journal);
>  	handle_t *handle;
>
>  	/*
> @@ -1184,9 +1184,11 @@ static void ext4_put_super(struct super_block *sb)
>  	ext4_unregister_sysfs(sb);
>
>  	if (sbi->s_journal) {
> -		aborted = is_journal_aborted(sbi->s_journal);
> -		err = jbd2_journal_destroy(sbi->s_journal);
> -		sbi->s_journal = NULL;
> +		journal_t *journal = sbi->s_journal;
> +
> +		WRITE_ONCE(sbi->s_journal, NULL);
> +		aborted = is_journal_aborted(journal);
> +		err = jbd2_journal_destroy(journal);
>  		if ((err < 0) && !aborted) {
>  			ext4_abort(sb, -err, "Couldn't clean up the journal");
>  		}
> @@ -5175,8 +5177,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	sbi->s_ea_block_cache = NULL;
>
>  	if (sbi->s_journal) {
> -		jbd2_journal_destroy(sbi->s_journal);
> -		sbi->s_journal = NULL;
> +		journal_t *journal = sbi->s_journal;
> +
> +		WRITE_ONCE(sbi->s_journal, NULL);
> +		jbd2_journal_destroy(journal);
>  	}
>  failed_mount3a:
>  	ext4_es_unregister_shrinker(sbi);
> @@ -5487,7 +5491,7 @@ static int ext4_load_journal(struct super_block *sb,
>  	EXT4_SB(sb)->s_journal = journal;
>  	err = ext4_clear_journal_err(sb, es);
>  	if (err) {
> -		EXT4_SB(sb)->s_journal = NULL;
> +		WRITE_ONCE(EXT4_SB(sb)->s_journal, NULL);
>  		jbd2_journal_destroy(journal);
>  		return err;
>  	}
>
> ... and here's another possible fix:
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index dfa09a277b56..e9e122e52ce8 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -704,7 +704,8 @@ static void flush_stashed_error_work(struct work_struct *work)
>  	 * We use directly jbd2 functions here to avoid recursing back into
>  	 * ext4 error handling code during handling of previous errors.
>  	 */
> -	if (!sb_rdonly(sbi->s_sb) && journal) {
> +	if (!sb_rdonly(sbi->s_sb) && journal &&
> +	    !(journal->j_flags & JBD2_UNMOUNT)) {
>  		struct buffer_head *sbh = sbi->s_sbh;
>  		handle = jbd2_journal_start(journal, 1);
>  		if (IS_ERR(handle))
>
>
>
> But I would be interested in understanding how we could be triggering
> this problem in the first place before deciding what's the best fix.
>
> Cheers,
>
> - Ted
> .
>

2021-07-26 13:26:37

by Jan Kara

Subject: Re: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super

On Mon 26-07-21 15:13:34, yangerkun wrote:
>
>
> On 2021/7/24 3:11, Theodore Ts'o wrote:
> > On Fri, Jul 23, 2021 at 09:11:08PM +0800, yangerkun wrote:
> > >
> > > For example, before we goto failed_mount_wq, we may hit some error and
> > > go to ext4_handle_error(), which can call
> > > schedule_work(&EXT4_SB(sb)->s_error_work). So the work may run
> > > concurrently with ext4_fill_super reaching failed_mount_wq. There is no
> > > lock protecting the concurrent read and modification of sbi->s_journal.
> >
> > Yes, and I'm asking *how* is this actually happening in practice?
> > I've been going through the code paths and I don't see any place where
> > ext4_error*() would be called. That's why I wanted to see your test
> > case which was reproducing it. (Not just where you added the msleep,
> > but how the error was getting triggered in the first place.)
>
> Hi Ted,
>
>
> The problem happened only once, early on, with a parallel LTP test case
> (but we cannot reproduce it again with the same case...). The dmesg was:
>
>
> [32031.739678] EXT4-fs error (device loop66): ext4_fill_super:4672: comm chdir01: inode #2: comm chdir01: iget: illegal inode #
> [32031.740193] EXT4-fs (loop66): get root inode failed
> [32031.740484] EXT4-fs (loop66): mount failed

Oh, OK. I guess s_inodes_count was <= 1 which made a check in __ext4_iget()
trip. That is theoretically possible if s_inodes_per_block == 1,
s_inodes_per_group == 1 and s_groups_count == 1. I guess ext4_fill_super()
needs to check s_inodes_count is large enough at least for all reserved
inodes to fit. But it's still unclear to me how we could have succeeded in
loading the journal which apparently has inode number 8 from the error
message below.
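
The check being tripped is presumably the inode-number sanity check near
the top of __ext4_iget() (a paraphrased sketch; the exact condition and
the __ext4_error() signature differ across kernel versions):

	if ((!(flags & EXT4_IGET_SPECIAL) && ino < EXT4_FIRST_INO(sb)) ||
	    (ino < EXT4_ROOT_INO) ||
	    (ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))) {
		if (flags & EXT4_IGET_HANDLE)
			return ERR_PTR(-ESTALE);
		__ext4_error(sb, function, line, EFSCORRUPTED, 0,
			     "inode #%lu: comm %s: iget: illegal inode #",
			     ino, current->comm);
		return ERR_PTR(-EFSCORRUPTED);
	}

With s_inodes_count == 1, ino == 2 (the root inode) fails the last clause,
which matches the "iget: illegal inode #" message in the log above.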

In parallel LTP checks I've sometimes noticed strange errors in the past
that looked very much like filesystem being modified through the block
device while being in use by the kernel (maybe some bug in the test
framework). And this is something we just don't support and the kernel can
crash in this case.

In either case I agree it makes sense to make sure error work cannot race
with journal destruction either by flushing it before destroying the
journal or some other means...

Honza

> [32031.758811] EXT4-fs error (device loop66): ext4_map_blocks:595: inode #8: block 532: comm chdir01: lblock 1 mapped to illegal pblock 532 (length 1)
> [32031.759293] jbd2_journal_bmap: journal block not found at offset 1 on loop66-8
> [32031.759805] ------------[ cut here ]------------
> [32031.759807] kernel BUG at fs/jbd2/transaction.c:373!
>
>
> ext4_fill_super
>   ext4_load_journal
>     EXT4_SB(sb)->s_journal = journal
>   root = ext4_iget(sb, EXT4_ROOT_INO, EXT4_IGET_SPECIAL)
>     // will fail and goto failed_mount4
>     __ext4_iget
>       __ext4_error
>         ext4_handle_error
>           schedule_work(&EXT4_SB(sb)->s_error_work)
>
>
> And this triggers the concurrent read and modification of sbi->s_journal...
>
> Thanks,
> Kun.
>
>
> >
> >
> > On Fri, Jul 23, 2021 at 09:25:12PM +0800, yangerkun wrote:
> > >
> > > > Can you share with me your test case? Your patch will result in the
> > > > shrinker potentially not getting released in some error paths (which
> > > > will cause other kernel panics), and in any case, once the journal is
> > >
> > > The only logic we have changed is that we move the flush_work() before we
> > > call jbd2_journal_destroy(). I have not seen the problem you describe...
> > > Can you help explain more...
> >
> > Sorry, I was mistaken. I thought you were moving the
> > ext4_es_unregister_shrinker() and flush_work() before the label for
> > failed_mount_wq; that was a misreading of your patch.
> >
> > The other way we could fix this might be something like this:
> >
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index dfa09a277b56..d663d11fa0de 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -693,7 +693,7 @@ static void flush_stashed_error_work(struct work_struct *work)
> >  {
> >  	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
> >  						s_error_work);
> > -	journal_t *journal = sbi->s_journal;
> > +	journal_t *journal = READ_ONCE(sbi->s_journal);
> >  	handle_t *handle;
> >  	/*
> > @@ -1184,9 +1184,11 @@ static void ext4_put_super(struct super_block *sb)
> >  	ext4_unregister_sysfs(sb);
> >  	if (sbi->s_journal) {
> > -		aborted = is_journal_aborted(sbi->s_journal);
> > -		err = jbd2_journal_destroy(sbi->s_journal);
> > -		sbi->s_journal = NULL;
> > +		journal_t *journal = sbi->s_journal;
> > +
> > +		WRITE_ONCE(sbi->s_journal, NULL);
> > +		aborted = is_journal_aborted(journal);
> > +		err = jbd2_journal_destroy(journal);
> >  		if ((err < 0) && !aborted) {
> >  			ext4_abort(sb, -err, "Couldn't clean up the journal");
> >  		}
> > @@ -5175,8 +5177,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> >  	sbi->s_ea_block_cache = NULL;
> >  	if (sbi->s_journal) {
> > -		jbd2_journal_destroy(sbi->s_journal);
> > -		sbi->s_journal = NULL;
> > +		journal_t *journal = sbi->s_journal;
> > +
> > +		WRITE_ONCE(sbi->s_journal, NULL);
> > +		jbd2_journal_destroy(journal);
> >  	}
> >  failed_mount3a:
> >  	ext4_es_unregister_shrinker(sbi);
> > @@ -5487,7 +5491,7 @@ static int ext4_load_journal(struct super_block *sb,
> >  	EXT4_SB(sb)->s_journal = journal;
> >  	err = ext4_clear_journal_err(sb, es);
> >  	if (err) {
> > -		EXT4_SB(sb)->s_journal = NULL;
> > +		WRITE_ONCE(EXT4_SB(sb)->s_journal, NULL);
> >  		jbd2_journal_destroy(journal);
> >  		return err;
> >  	}
> >
> > ... and here's another possible fix:
> >
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index dfa09a277b56..e9e122e52ce8 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -704,7 +704,8 @@ static void flush_stashed_error_work(struct work_struct *work)
> >  	 * We use directly jbd2 functions here to avoid recursing back into
> >  	 * ext4 error handling code during handling of previous errors.
> >  	 */
> > -	if (!sb_rdonly(sbi->s_sb) && journal) {
> > +	if (!sb_rdonly(sbi->s_sb) && journal &&
> > +	    !(journal->j_flags & JBD2_UNMOUNT)) {
> >  		struct buffer_head *sbh = sbi->s_sbh;
> >  		handle = jbd2_journal_start(journal, 1);
> >  		if (IS_ERR(handle))
> >
> >
> >
> > But I would be interested in understanding how we could be triggering
> > this problem in the first place before deciding what's the best fix.
> >
> > Cheers,
> >
> > - Ted
> > .
> >
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-03 11:13:47

by yangerkun

Subject: Re: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super



On 2021/7/26 21:26, Jan Kara wrote:
> On Mon 26-07-21 15:13:34, yangerkun wrote:
>>
>>
>> On 2021/7/24 3:11, Theodore Ts'o wrote:
>>> On Fri, Jul 23, 2021 at 09:11:08PM +0800, yangerkun wrote:
>>>>
>>>> For example, before we goto failed_mount_wq, we may hit some error and
>>>> go to ext4_handle_error(), which can call
>>>> schedule_work(&EXT4_SB(sb)->s_error_work). So the work may run
>>>> concurrently with ext4_fill_super reaching failed_mount_wq. There is no
>>>> lock protecting the concurrent read and modification of sbi->s_journal.
>>>
>>> Yes, and I'm asking *how* is this actually happening in practice?
>>> I've been going through the code paths and I don't see any place where
>>> ext4_error*() would be called. That's why I wanted to see your test
>>> case which was reproducing it. (Not just where you added the msleep,
>>> but how the error was getting triggered in the first place.)
>>
>> Hi Ted,
>>
>>
>> The problem happened only once, early on, with a parallel LTP test case
>> (but we cannot reproduce it again with the same case...). The dmesg was:
>>
>>
>> [32031.739678] EXT4-fs error (device loop66): ext4_fill_super:4672: comm chdir01: inode #2: comm chdir01: iget: illegal inode #
>> [32031.740193] EXT4-fs (loop66): get root inode failed
>> [32031.740484] EXT4-fs (loop66): mount failed
>
> Oh, OK. I guess s_inodes_count was <= 1 which made a check in __ext4_iget()
> trip. That is theoretically possible if s_inodes_per_block == 1,
> s_inodes_per_group == 1 and s_groups_count == 1. I guess ext4_fill_super()
> needs to check s_inodes_count is large enough at least for all reserved
> inodes to fit. But it's still unclear to me how we could have succeeded in
> loading the journal which apparently has inode number 8 from the error
> message below.
>
> In parallel LTP checks I've sometimes noticed strange errors in the past
> that looked very much like filesystem being modified through the block
> device while being in use by the kernel (maybe some bug in the test
> framework). And this is something we just don't support and the kernel can
> crash in this case.
>
> In either case I agree it makes sense to make sure error work cannot race
> with journal destruction either by flushing it before destroying the
> journal or some other means...

Hi Ted,

Should we apply this patch first, or do you have a better solution...

Thanks,
Kun.

>
> Honza
>
>> [32031.758811] EXT4-fs error (device loop66): ext4_map_blocks:595: inode #8: block 532: comm chdir01: lblock 1 mapped to illegal pblock 532 (length 1)
>> [32031.759293] jbd2_journal_bmap: journal block not found at offset 1 on loop66-8
>> [32031.759805] ------------[ cut here ]------------
>> [32031.759807] kernel BUG at fs/jbd2/transaction.c:373!
>>
>>
>> ext4_fill_super
>>   ext4_load_journal
>>     EXT4_SB(sb)->s_journal = journal
>>   root = ext4_iget(sb, EXT4_ROOT_INO, EXT4_IGET_SPECIAL)
>>     // will fail and goto failed_mount4
>>     __ext4_iget
>>       __ext4_error
>>         ext4_handle_error
>>           schedule_work(&EXT4_SB(sb)->s_error_work)
>>
>>
>> And this triggers the concurrent read and modification of sbi->s_journal...
>>
>> Thanks,
>> Kun.
>>
>>
>>>
>>>
>>> On Fri, Jul 23, 2021 at 09:25:12PM +0800, yangerkun wrote:
>>>>
>>>>> Can you share with me your test case? Your patch will result in the
>>>>> shrinker potentially not getting released in some error paths (which
>>>>> will cause other kernel panics), and in any case, once the journal is
>>>>
>>>> The only logic we have changed is that we move the flush_work() before we
>>>> call jbd2_journal_destroy(). I have not seen the problem you describe...
>>>> Can you help explain more...
>>>
>>> Sorry, I was mistaken. I thought you were moving the
>>> ext4_es_unregister_shrinker() and flush_work() before the label for
>>> failed_mount_wq; that was a misreading of your patch.
>>>
>>> The other way we could fix this might be something like this:
>>>
>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>> index dfa09a277b56..d663d11fa0de 100644
>>> --- a/fs/ext4/super.c
>>> +++ b/fs/ext4/super.c
>>> @@ -693,7 +693,7 @@ static void flush_stashed_error_work(struct work_struct *work)
>>>  {
>>>  	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
>>>  						s_error_work);
>>> -	journal_t *journal = sbi->s_journal;
>>> +	journal_t *journal = READ_ONCE(sbi->s_journal);
>>>  	handle_t *handle;
>>>  	/*
>>> @@ -1184,9 +1184,11 @@ static void ext4_put_super(struct super_block *sb)
>>>  	ext4_unregister_sysfs(sb);
>>>  	if (sbi->s_journal) {
>>> -		aborted = is_journal_aborted(sbi->s_journal);
>>> -		err = jbd2_journal_destroy(sbi->s_journal);
>>> -		sbi->s_journal = NULL;
>>> +		journal_t *journal = sbi->s_journal;
>>> +
>>> +		WRITE_ONCE(sbi->s_journal, NULL);
>>> +		aborted = is_journal_aborted(journal);
>>> +		err = jbd2_journal_destroy(journal);
>>>  		if ((err < 0) && !aborted) {
>>>  			ext4_abort(sb, -err, "Couldn't clean up the journal");
>>>  		}
>>> @@ -5175,8 +5177,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>  	sbi->s_ea_block_cache = NULL;
>>>  	if (sbi->s_journal) {
>>> -		jbd2_journal_destroy(sbi->s_journal);
>>> -		sbi->s_journal = NULL;
>>> +		journal_t *journal = sbi->s_journal;
>>> +
>>> +		WRITE_ONCE(sbi->s_journal, NULL);
>>> +		jbd2_journal_destroy(journal);
>>>  	}
>>>  failed_mount3a:
>>>  	ext4_es_unregister_shrinker(sbi);
>>> @@ -5487,7 +5491,7 @@ static int ext4_load_journal(struct super_block *sb,
>>>  	EXT4_SB(sb)->s_journal = journal;
>>>  	err = ext4_clear_journal_err(sb, es);
>>>  	if (err) {
>>> -		EXT4_SB(sb)->s_journal = NULL;
>>> +		WRITE_ONCE(EXT4_SB(sb)->s_journal, NULL);
>>>  		jbd2_journal_destroy(journal);
>>>  		return err;
>>>  	}
>>>
>>> ... and here's another possible fix:
>>>
>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>> index dfa09a277b56..e9e122e52ce8 100644
>>> --- a/fs/ext4/super.c
>>> +++ b/fs/ext4/super.c
>>> @@ -704,7 +704,8 @@ static void flush_stashed_error_work(struct work_struct *work)
>>>  	 * We use directly jbd2 functions here to avoid recursing back into
>>>  	 * ext4 error handling code during handling of previous errors.
>>>  	 */
>>> -	if (!sb_rdonly(sbi->s_sb) && journal) {
>>> +	if (!sb_rdonly(sbi->s_sb) && journal &&
>>> +	    !(journal->j_flags & JBD2_UNMOUNT)) {
>>>  		struct buffer_head *sbh = sbi->s_sbh;
>>>  		handle = jbd2_journal_start(journal, 1);
>>>  		if (IS_ERR(handle))
>>>
>>>
>>>
>>> But I would be interested in understanding how we could be triggering
>>> this problem in the first place before deciding what's the best fix.
>>>
>>> Cheers,
>>>
>>> - Ted
>>> .
>>>

2021-08-24 10:13:17

by yangerkun

Subject: Re: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super

Sorry, this patch introduces a bug when the mount fails on arm64 (I had
verified it on x86, where it seemed OK...).

I will try to fix the problem another way...

[2021-08-24 17:43:29 INFO] [ 45.808127] Internal error: Oops: 96000004 [#1] SMP
[2021-08-24 17:43:29 INFO] [ 45.812986] Modules linked in: realtek hclge hns3 megaraid_sas hisi_sas_v3_hw hibmc_drm hisi_sas_main host_edma_drv hnae3
[2021-08-24 17:43:29 INFO] [ 45.823896] Process mount (pid: 1187, stack limit = 0x00000000e0bd7181)
[2021-08-24 17:43:29 INFO] [ 45.830481] CPU: 9 PID: 1187 Comm: mount Not tainted 4.19.90_4.19.90-2108.7.0+ #2
[2021-08-24 17:43:29 INFO] [ 45.837929] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V3.26.01 06/14/2019
[2021-08-24 17:43:29 INFO] [ 45.846587] pstate: 10400009 (nzcV daif +PAN -UAO)
[2021-08-24 17:43:29 INFO] [ 45.851359] pc : percpu_counter_add_batch+0x5c/0x358
[2021-08-24 17:43:29 INFO] [ 45.856303] lr : ext4_es_free_extent+0xd4/0x4a8
[2021-08-24 17:43:29 INFO] [ 45.860812] sp : ffff80262a60f170
[2021-08-24 17:43:29 INFO] [ 45.864112] x29: ffff80262a60f170 x28: dfff200000000000
[2021-08-24 17:43:29 INFO] [ 45.869399] x27: ffff8026403e65cc x26: 0000000000000000
[2021-08-24 17:43:29 INFO] [ 45.874686] x25: ffff80266c429e20 x24: 0000000000000100
[2021-08-24 17:43:29 INFO] [ 45.879973] x23: 1ffff004cd8853c4 x22: ffffffffffffffff
[2021-08-24 17:43:29 INFO] [ 45.885260] x21: 0000000000000000 x20: 00006026b7a31000
[2021-08-24 17:43:29 INFO] [ 45.890547] x19: ffff80266c429e00 x18: 0000000000000000
[2021-08-24 17:43:29 INFO] [ 45.895833] x17: 0000000000000000 x16: 0000000000000000
[2021-08-24 17:43:29 INFO] [ 45.901120] x15: 1fffe4000504bd02 x14: ffff200023ba1adc
[2021-08-24 17:43:29 INFO] [ 45.906406] x13: ffff200024332530 x12: ffff20002433240c
[2021-08-24 17:43:29 INFO] [ 45.911693] x11: ffff2000243304f4 x10: ffff200024328f6c
[2021-08-24 17:43:29 INFO] [ 45.916980] x9 : 1ffff004c807ccb9 x8 : ffff200023b75950
[2021-08-24 17:43:29 INFO] [ 45.922266] x7 : ffff200023ba1fe0 x6 : 0000000000000004
[2021-08-24 17:43:29 INFO] [ 45.927553] x5 : 0000000000000000 x4 : 0000000000000000
[2021-08-24 17:43:29 INFO] [ 45.932839] x3 : 00000c04d6f46200 x2 : dfff200000000000
[2021-08-24 17:43:29 INFO] [ 45.938126] x1 : 0000000000000003 x0 : 00006026b7a31000
[2021-08-24 17:43:29 INFO] [ 45.943413] Call trace:
[2021-08-24 17:43:29 INFO] [ 45.945851] percpu_counter_add_batch+0x5c/0x358
[2021-08-24 17:43:29 INFO] [ 45.950446] ext4_es_free_extent+0xd4/0x4a8
[2021-08-24 17:43:29 INFO] [ 45.954610] __es_remove_extent+0x37c/0x598
[2021-08-24 17:43:29 INFO] [ 45.958773] ext4_es_remove_extent+0x6c/0x270
[2021-08-24 17:43:29 INFO] [ 45.963110] ext4_clear_inode+0x50/0x188
[2021-08-24 17:43:29 INFO] [ 45.967016] ext4_evict_inode+0x314/0x1450
[2021-08-24 17:43:29 INFO] [ 45.971094] evict+0x238/0x5a0
[2021-08-24 17:43:29 INFO] [ 45.974136] iput+0x2bc/0x6b8
[2021-08-24 17:43:29 INFO] [ 45.977090] jbd2_journal_destroy+0x408/0x800
[2021-08-24 17:43:29 INFO] [ 45.981428] ext4_fill_super+0x5a04/0x8cc0
[2021-08-24 17:43:29 INFO] [ 45.985505] mount_bdev+0x268/0x328
[2021-08-24 17:43:29 INFO] [ 45.988978] ext4_mount+0x44/0x58
[2021-08-24 17:43:29 INFO] [ 45.992277] mount_fs+0x68/0x390
[2021-08-24 17:43:29 INFO] [ 45.995492] vfs_kern_mount.part.2+0x54/0x388
[2021-08-24 17:43:29 INFO] [ 45.999830] do_mount+0xa5c/0x20d0
[2021-08-24 17:43:29 INFO] [ 46.003216] ksys_mount+0x9c/0x118
[2021-08-24 17:43:29 INFO] [ 46.006603] __arm64_sys_mount+0xa8/0x108
[2021-08-24 17:43:29 INFO] [ 46.010595] el0_svc_common+0x10c/0x4a0
[2021-08-24 17:43:29 INFO] [ 46.014415] el0_svc_handler+0x170/0x248
[2021-08-24 17:43:29 INFO] [ 46.018320] el0_svc+0x10/0x218
[2021-08-24 17:43:29 INFO] [ 46.021448] Code: f2fbffe2 d343fc03 92400801 11000c21 (38e26862)
[2021-08-24 17:43:29 INFO] [ 46.027513] ---[ end trace 316fdb8833588599 ]---
[2021-08-24 17:43:29 INFO] [ 46.037646] Kernel panic - not syncing: Fatal exception
[2021-08-24 17:43:29 INFO] [ 46.042848] SMP: stopping secondary CPUs
[2021-08-24 17:43:29 INFO] [ 46.046779] Kernel Offset: 0x1baf0000 from 0xffff200008000000
[2021-08-24 17:43:29 INFO] [ 46.052500] CPU features: 0x12,a2a00a38
[2021-08-24 17:43:29 INFO] [ 46.056318] Memory Limit: none
[2021-08-24 17:43:29 INFO] [ 46.064887] Rebooting in 1 seconds..
[2021-08-24 17:43:30 INFO] [ 47.068489] SMP: stopping secondary CPUs
[2021-08-24 17:43:32 INFO] [ 48.158471] SMP: failed to stop secondary CPUs 0-127
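
A plausible reading of the trace (an inference from the backtrace; the
field names below are recalled from 4.19-era fs/ext4/extents_status.c and
may be inexact): with the patch applied, ext4_es_unregister_shrinker()
now runs before jbd2_journal_destroy(). Destroying the journal iput()s the
journal inode, whose eviction goes through ext4_es_remove_extent() and
ext4_es_free_extent(), decrementing extent-status percpu counters that
the shrinker teardown has already destroyed:

void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
{
	/*
	 * These counters are still decremented by ext4_es_free_extent()
	 * when the journal inode is evicted later, which would explain
	 * the percpu_counter_add_batch() oops above.
	 */
	percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
	percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
	unregister_shrinker(&sbi->s_es_shrinker);
}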

On 2021/7/23 21:25, yangerkun wrote:
>
>
> On 2021/7/23 20:11, Theodore Ts'o wrote:
>> On Tue, Jul 20, 2021 at 02:24:09PM +0800, yangerkun wrote:
>>> Commit c92dc856848f ("ext4: defer saving error info from atomic
>>> context") and commit 2d01ddc86606 ("ext4: save error info to sb through
>>> journal if available") added s_error_work to fix a checksum error
>>> problem. But the error path in ext4_fill_super can trigger the following
>>> BUG_ON.
>>
>> Can you share with me your test case?  Your patch will result in the
>> shrinker potentially not getting released in some error paths (which
>> will cause other kernel panics), and in any case, once the journal is
>
> Hi Ted,
>
> The only logic we have changed is that we move the flush_work() before we
> call jbd2_journal_destroy(). I have not seen the problem you describe...
> Can you help explain more...
>
> Thanks,
> Kun.
>
>> destroyed here:
>>
>>> @@ -5173,15 +5173,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>>       ext4_xattr_destroy_cache(sbi->s_ea_block_cache);
>>>       sbi->s_ea_block_cache = NULL;
>>> +failed_mount3a:
>>> +    ext4_es_unregister_shrinker(sbi);
>>> +failed_mount3:
>>> +    flush_work(&sbi->s_error_work);
>>>       if (sbi->s_journal) {
>>>           jbd2_journal_destroy(sbi->s_journal);
>>>           sbi->s_journal = NULL;
>>>       }
>>> -failed_mount3a:
>>> -    ext4_es_unregister_shrinker(sbi);
>>> -failed_mount3:
>>> -    flush_work(&sbi->s_error_work);
>>
>> sbi->s_journal is set to NULL, which means that in
>> flush_stashed_error_work(), journal will be NULL, which means we won't
>> call start_this_handle and so this change will not make a difference
>> given the kernel stack trace in the commit description.
>>
>> Thanks,
>>
>>                         - Ted
>> .
>>
> .