2021-04-06 13:05:47

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: Fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed

On Thu, Mar 25, 2021 at 10:29:25AM +0800, Ye Bin wrote:
> We got follow bug_on:
> [130747.323114] kernel BUG at fs/ext4/extents_status.c:762!
> [130747.323117] Internal error: Oops - BUG: 0 [#1] SMP
> ......
> [130747.334329] Call trace:
> [130747.334553] ext4_es_cache_extent+0x150/0x168 [ext4]
> [130747.334975] ext4_cache_extents+0x64/0xe8 [ext4]
> [130747.335368] ext4_find_extent+0x300/0x330 [ext4]
> [130747.335759] ext4_ext_map_blocks+0x74/0x1178 [ext4]
> [130747.336179] ext4_map_blocks+0x2f4/0x5f0 [ext4]
> [130747.336567] ext4_mpage_readpages+0x4a8/0x7a8 [ext4]
> [130747.336995] ext4_readpage+0x54/0x100 [ext4]
> [130747.337359] generic_file_buffered_read+0x410/0xae8
> [130747.337767] generic_file_read_iter+0x114/0x190
> [130747.338152] ext4_file_read_iter+0x5c/0x140 [ext4]
> [130747.338556] __vfs_read+0x11c/0x188
> [130747.338851] vfs_read+0x94/0x150
> [130747.339110] ksys_read+0x74/0xf0
>
> If call ext4_ext_insert_extent failed but new extent already inserted, we just
> update "ex->ee_len = orig_ex.ee_len", this will lead to extent overlap, then
> cause bug on when cache extent.

How did this happen in the first place? It sounds like if the extent
was already inserted, that would be casue there was an on-disk file
system corruption, no?

In that case, shouldn't we call ext4_error() to declare the file
system has an inconsistency, so it can be fixed by fsck?

> If call ext4_ext_insert_extent failed don't update ex->ee_len with old value.
> Maybe there will lead to block leak, but it can be fixed by fsck later.

- Ted


2021-04-07 13:48:51

by Ye Bin

[permalink] [raw]
Subject: Re: [PATCH] ext4: Fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed



On 2021/4/6 11:34, Theodore Ts'o wrote:
> On Thu, Mar 25, 2021 at 10:29:25AM +0800, Ye Bin wrote:
>> We got follow bug_on:
>> [130747.323114] kernel BUG at fs/ext4/extents_status.c:762!
>> [130747.323117] Internal error: Oops - BUG: 0 [#1] SMP
>> ......
>> [130747.334329] Call trace:
>> [130747.334553] ext4_es_cache_extent+0x150/0x168 [ext4]
>> [130747.334975] ext4_cache_extents+0x64/0xe8 [ext4]
>> [130747.335368] ext4_find_extent+0x300/0x330 [ext4]
>> [130747.335759] ext4_ext_map_blocks+0x74/0x1178 [ext4]
>> [130747.336179] ext4_map_blocks+0x2f4/0x5f0 [ext4]
>> [130747.336567] ext4_mpage_readpages+0x4a8/0x7a8 [ext4]
>> [130747.336995] ext4_readpage+0x54/0x100 [ext4]
>> [130747.337359] generic_file_buffered_read+0x410/0xae8
>> [130747.337767] generic_file_read_iter+0x114/0x190
>> [130747.338152] ext4_file_read_iter+0x5c/0x140 [ext4]
>> [130747.338556] __vfs_read+0x11c/0x188
>> [130747.338851] vfs_read+0x94/0x150
>> [130747.339110] ksys_read+0x74/0xf0
>>
>> If call ext4_ext_insert_extent failed but new extent already inserted, we just
>> update "ex->ee_len = orig_ex.ee_len", this will lead to extent overlap, then
>> cause bug on when cache extent.
> How did this happen in the first place? It sounds like if the extent
> was already inserted, that would be casue there was an on-disk file
> system corruption, no?
>
> In that case, shouldn't we call ext4_error() to declare the file
> system has an inconsistency, so it can be fixed by fsck?
We inject IO fault when runing fsstress, JBD detect IO error then
trigger JBD abort. At the same time,
if ext4_ext_insert_extent already insert new exntent then call
ext4_ext_dirty to dirty metadata , but
JBD already aborted , ext4_ext_dirty will return error.
In ext4_ext_dirty function call ext4_ext_check_inode check extent if
ok, if not, trigger BUG_ON and
also print extent detail information. I caught follow call trace and
extent information:
=================================================================================
[ 7597.436071] inode 1808 ext4_ext_check_inode
[ 7597.436074] inode[1808]: [4] lblk=76 len=5 orig_len=32773 prev=0
[ 7597.436077] inode[1808]: [3] lblk=81 len=25 orig_len=25 prev=81
[ 7597.436080] inode[1808]: [2] lblk=106 len=130 orig_len=32898 prev=106
[ 7597.436084] inode[1808]: [1] lblk=228 len=8 orig_len=32776 prev=236
[ 7597.436181] ------------[ cut here ]------------
[ 7597.436183] kernel BUG at fs/ext4/extents.c:221!
[ 7597.436216] invalid opcode: 0000 [#1] SMP KASAN
[ 7597.436219] Buffer I/O error on dev sda, logical block 131072, lost
sync page write
[ 7597.436224] CPU: 7 PID: 5157 Comm: kworker/u16:1 Not tainted
4.19.95-00020-gb095ded9163d-dirty #186
[ 7597.436230] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31
04/01/2014
[ 7597.436242] Workqueue: writeback wb_workfn (flush-8:16)
[ 7597.436248] RIP: 0010:__ext4_ext_dirty.cold+0x2ee/0x308
[ 7597.436251] Code: ff 48 89 df e8 c7 e1 d0 ff e9 48 ff ff ff 48 c7 c7
80 d1 f7 ae e8 87 45 84 ff 48 83 05 ee 3b ca 07 01 48 83 05 c6 3c ca 07
01 <0f> 04
[ 7597.436253] RSP: 0018:ffff88838bcaf050 EFLAGS: 00010202
[ 7597.436256] RAX: 0000000000000000 RBX: ffff888364b67783 RCX:
0000000000000000
[ 7597.436258] RDX: 0000000000000001 RSI: 0000000000000008 RDI:
ffffed1071795dfd
[ 7597.436259] RBP: 000000000000003b R08: 0000000000000000 R09:
0000000000000001
[ 7597.436261] R10: ffffed10744fbd6e R11: ffff8883a27deb77 R12:
dffffc0000000000
[ 7597.436262] R13: dffffc0000000000 R14: ffff888364b673f0 R15:
ffff888364b6730c
[ 7597.436264] FS: 0000000000000000(0000) GS:ffff8883a27c0000(0000)
knlGS:0000000000000000
[ 7597.436266] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7597.436267] CR2: 00007fca7a249000 CR3: 000000039ce75000 CR4:
00000000000006e0
[ 7597.436270] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7597.436271] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7597.436275] JBD2: Error -5 detected when updating journal superblock
for sda-8.
[ 7597.436277] Call Trace:
[ 7597.436294] ext4_split_extent_at+0x785/0xc80
[ 7597.466767] ext4_split_extent.isra.0+0x2af/0x550
[ 7597.472503] ext4_ext_convert_to_initialized+0x4b4/0x2510
[ 7597.536999] ext4_ext_map_blocks+0x37ef/0x50c0
[ 7597.542467] ext4_map_blocks+0x685/0x1910
[ 7597.544107] ext4_writepages+0x1809/0x32f0
[ 7597.547546] do_writepages+0x7f/0x1b0
[ 7597.548619] __writeback_single_inode+0xc6/0xc40
[ 7597.549203] writeback_sb_inodes+0x49f/0xd60
[ 7597.550748] wb_writeback+0x252/0x9d0
[ 7597.552335] wb_workfn+0x309/0xe60
[ 7597.557029] process_one_work+0x70b/0x1610
[ 7597.557566] worker_thread+0x5a9/0x1060
[ 7597.559091] kthread+0x35e/0x430
[ 7597.560601] ret_from_fork+0x1f/0x30
=================================================================================
Obviously, the follow extent is overlap:
[ 7597.436080] inode[1808]: [2] lblk=106 len=130 orig_len=32898 prev=106
[ 7597.436084] inode[1808]: [1] lblk=228 len=8 orig_len=32776 prev=236

I think this wouldn't cause file system inconsistency, as error data
will not be stored in disk.

>
>> If call ext4_ext_insert_extent failed don't update ex->ee_len with old value.
>> Maybe there will lead to block leak, but it can be fixed by fsck later.
> - Ted
> .
>

2021-04-09 01:48:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: Fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed

On Wed, Apr 07, 2021 at 09:41:57AM +0800, yebin wrote:
> > > If call ext4_ext_insert_extent failed but new extent already inserted, we just
> > > update "ex->ee_len = orig_ex.ee_len", this will lead to extent overlap, then
> > > cause bug on when cache extent.
> > How did this happen in the first place? It sounds like if the extent
> > was already inserted, that would be casue there was an on-disk file
> > system corruption, no?
> >
> > In that case, shouldn't we call ext4_error() to declare the file
> > system has an inconsistency, so it can be fixed by fsck?
> We inject IO fault when runing fsstress, JBD detect IO error then trigger
> JBD abort. At the same time,
> if ext4_ext_insert_extent already insert new exntent then call
> ext4_ext_dirty to dirty metadata , but
> JBD already aborted , ext4_ext_dirty will return error.
> In ext4_ext_dirty function call ext4_ext_check_inode check extent if ok, if
> not, trigger BUG_ON and
> also print extent detail information.

In this particular case, skipping the "ex->ee_len = orig_ex.ee_len"
may avoid the BUG_ON. But it's not clear that this is always the
right thing to do. The fundamental question is what should we do we
run into an error while we are in the middle of making changes to
on-disk and in-memory data structures?

In the ideal world, we should undo the changes that we were in the
middle of making before we return an error. That way, the semantics
are very clear; on success, the function has made the requested change
to the file system. If the function returns an error, then no changes
should be made.

That was the reasoning behind resetting ex->ee_len to orig_ex.ee_len
in the fix_extent_len inside ext4_split_extent_at(). Unofrtunately,
ext4_ext_insert_extent() does *not* always follow this convention, and
that's because it would be extremely difficult for it to do so --- the
mutations that it makes can be quite complex, including potentially
increasing the height of the extent tree.

However, I don't think your fix is by any means the ideal one, because
the much more common way that ext4_ext_insert_extent() is when it
needs to insert a new leaf node, or need to increase the height of the
extent tree --- and in it returns an ENOSPC failure. In that case, it
won't have made any changes changes in the extent tree, and so having
ext4_split_extent_at() undo the change to ex->ee_len is the right
thing to do.

Having blocks get leaked when there is an ENOSPC failure, requiring
fsck to be run --- and without giving the user any warning that this
has happened is *not* a good way to fail. So I don't think the
proposed patch is the right way to go.

A better way to go would be to teach ext4_ext_insert_extent() so if
there is a late failure, that it unwinds the leaf node back to its
original state (at least from a semantic value). Since the extent
leaf node could have been split, and/or adjacent extent entries may
have been merged, what it would need to do is to remember the starting
block number and length, and make whatever changes are necessaries to
the extent entries in that leaf node corresponding to that starting
block number and length.

If you don't want to do that, then a "do no harm" fix would be
something like this:

...
} else if (err == -EROFS) {
return err;
} else if (err)
goto fix_extent_len;

So in the journal abort case, when err is set to EROFS, we don't try
to reset the length, since in theory the file system is read-only
already anyway. However, in the ENOSPC case, we won't end up silently
leaking blocks that will be lost until the user somehow decides to run
fsck.

There are still times when this doesn't get things completely right
(e.g., what if we get a late ENOMEM error versus an early ENOMEM
failure), where the only real fix is to make ext4_ext_insert_extent()
obey the convention that if it returns an error, it must not result in
any user-visible state change.

Cheers,

- Ted