2019-05-11 12:43:56

by Richard Weinberger

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels

[CC'ing linux-ext4]

On Sat, May 11, 2019 at 1:47 PM Arthur Marsh
<[email protected]> wrote:
>
> I have yet to bisect, but have had trouble with recent, post 5.1.0 kernels built from Linus' git head on both i386 (Pentium 4 pc) and amd64 (Athlon II X4 640).
>
> The easiest way to trigger the problem is:
>
> git gc
>
> on the kernel source tree, although the problem can occur without doing a git gc.
>
> The filesystem with the kernel source tree is the root file system, ext3, mounted as:
>
> /dev/sdb7 on / type ext3 (rw,relatime,errors=remount-ro)
>
> After the "Compressing objects" stage, the following appears in dmesg:
>
> [ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block
> [ 849.077426] Aborting journal on device sdb7-8.
> [ 849.100963] EXT4-fs (sdb7): Remounting filesystem read-only
> [ 849.100976] jbd2_journal_bmap: journal block not found at offset 989 on sdb7-8
>
> fsck -yv
>
> then reports:
>
> # fsck -yv
> fsck from util-linux 2.33.1
> e2fsck 1.45.0 (6-Mar-2019)
> /dev/sdb7: recovering journal
> /dev/sdb7 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Free blocks count wrong (4619656, counted=4619444).
> Fix? yes
>
> Free inodes count wrong (15884075, counted=15884058).
> Fix? yes
>
>
> /dev/sdb7: ***** FILE SYSTEM WAS MODIFIED *****
>
> Other times, I have gotten:
>
> "Inodes that were part of a corrupted orphan linked list found."
> "Block bitmap differences:"
> "Free blocks count wrong for group"
>
> No problems have been observed with the 5.1.0 release kernel.
>
> Any suggestions for narrowing down the issue welcome.

Can you git-bisect it?

--
Thanks,
//richard


2019-05-11 22:09:17

by Theodore Ts'o

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels

On Sat, May 11, 2019 at 02:43:16PM +0200, Richard Weinberger wrote:
> [CC'in linux-ext4]
>
> On Sat, May 11, 2019 at 1:47 PM Arthur Marsh
> <[email protected]> wrote:
> >
> >
> > The filesystem with the kernel source tree is the root file system, ext3, mounted as:
> >
> > /dev/sdb7 on / type ext3 (rw,relatime,errors=remount-ro)
> >
> > After the "Compressing objects" stage, the following appears in dmesg:
> >
> > [ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block
> > [ 849.077426] Aborting journal on device sdb7-8.
> > [ 849.100963] EXT4-fs (sdb7): Remounting filesystem read-only
> > [ 849.100976] jbd2_journal_bmap: journal block not found at offset 989 on sdb7-8

This indicates that the extent tree blocks for the journal were found
to be corrupt, so the journal couldn't be found.

> > # fsck -yv
> > fsck from util-linux 2.33.1
> > e2fsck 1.45.0 (6-Mar-2019)
> > /dev/sdb7: recovering journal
> > /dev/sdb7 contains a file system with errors, check forced.

But e2fsck had no problem finding the journal.

> > Pass 1: Checking inodes, blocks, and sizes
> > Pass 2: Checking directory structure
> > Pass 3: Checking directory connectivity
> > Pass 4: Checking reference counts
> > Pass 5: Checking group summary information
> > Free blocks count wrong (4619656, counted=4619444).
> > Fix? yes
> >
> > Free inodes count wrong (15884075, counted=15884058).
> > Fix? yes

And no other significant problems were found. (Ext4 never updates or
relies on the summary number of free blocks and free inodes, since
updating it is a scalability bottleneck and these values can be
calculated from the per block group free block/inodes count. So the
fact that e2fsck needed to update them is not an issue.)
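
As a rough illustration of that point, the sketch below recomputes filesystem-wide totals by summing per-group counters. The `group_counts` and `fs_totals` types and the `sum_group_counts()` helper are hypothetical simplifications for this example, not the real ext4 on-disk structures or kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a per-block-group descriptor. */
struct group_counts {
	uint32_t free_blocks;
	uint32_t free_inodes;
};

/* Filesystem-wide totals derived from the per-group counters. */
struct fs_totals {
	uint64_t free_blocks;
	uint64_t free_inodes;
};

/* Recompute the summary totals by walking every block group once. */
static struct fs_totals sum_group_counts(const struct group_counts *groups,
					 size_t ngroups)
{
	struct fs_totals totals = { 0, 0 };

	assert(groups != NULL || ngroups == 0);
	for (size_t i = 0; i < ngroups; i++) {
		totals.free_blocks += groups[i].free_blocks;
		totals.free_inodes += groups[i].free_inodes;
	}
	return totals;
}
```

Because the totals can always be rebuilt this way, a stale summary value is cheap to repair and says little about the health of the rest of the file system.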

So that implies that we got one set of values when we read the journal
inode when attempting to mount the file system, and a *different* set
of values when e2fsck was run. Which means that we need to
consider the possibility that the problem is below the file system
layer (e.g., the block layer, device drivers, etc.).


> > /dev/sdb7: ***** FILE SYSTEM WAS MODIFIED *****
> >
> > Other times, I have gotten:
> >
> > "Inodes that were part of a corrupted orphan linked list found."
> > "Block bitmap differences:"
> > "Free blocks count wrong for group"
> >

This variety of issues also implies that the issue may be in the data
read by the file system, as opposed to an issue in the file system.

Arthur, can you give us the full details of your hardware
configuration and your kernel config file? Also, what kernel git
commit ID were you testing?

- Ted

2019-05-13 11:21:58

by Arthur Marsh

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels

After a git bisect reset and updating to the current Linus git head, the problem no longer occurs.

Thanks for the feedback on the problem that I experienced.

Arthur.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2019-05-14 02:00:15

by Arthur Marsh

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels

Apologies, I had forgotten to

git bisect - - hard origin/master

I am still seeing the corruption leading to the invalid block error on 5.1.0+ kernels on both my machines.

Arthur.

2019-05-15 02:59:51

by Arthur Marsh

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels



On 14 May 2019 11:29:37 am ACST, Arthur Marsh <[email protected]> wrote:
>Apologies, I had forgotten to
>
>git bisect - - hard origin/master
>
>I am still seeing the corruption leading to the invalid block error on
>5.1.0+ kernels on both my machines.
>
>Arthur.

After the mm commits, the 32 bit kernel on Pentium-D still exhibits the "invalid block" issue when running git gc on the kernel source.

The 64 bit kernel on Athlon II X4 640 has had fewer problems running git gc on the kernel source since the mm commits, but had an "invalid block" error after a second run of git gc.

Arthur.


2019-05-15 04:57:55

by Theodore Ts'o

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels

Ah, I think I see the problem. Sorry, this one was my fault. Does
this fix things for you?

- Ted

From 0c72924ef346d54e8627440e6d71257aa5b56105 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <[email protected]>
Date: Wed, 15 May 2019 00:51:19 -0400
Subject: [PATCH] ext4: fix block validity checks for journal inodes using indirect blocks

Commit 345c0dbf3a30 ("ext4: protect journal inode's blocks using
block_validity") failed to add an exception for the journal inode in
ext4_check_blockref(), which is the function used by ext4_get_branch()
for indirect blocks. This caused attempts to read from the ext3-style
journals to fail with:

[ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block

Fix this by adding the missing exception check.

Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity")
Reported-by: Arthur Marsh <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
---
fs/ext4/block_validity.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
index 8d03550aaae3..8e83741b02e0 100644
--- a/fs/ext4/block_validity.c
+++ b/fs/ext4/block_validity.c
@@ -277,6 +277,11 @@ int ext4_check_blockref(const char *function, unsigned int line,
 	__le32 *bref = p;
 	unsigned int blk;
 
+	if (ext4_has_feature_journal(inode->i_sb) &&
+	    (inode->i_ino ==
+	     le32_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_journal_inum)))
+		return 0;
+
 	while (bref < p+max) {
 		blk = le32_to_cpu(*bref++);
 		if (blk &&
--
2.19.1
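
To make the effect of the missing exception easier to see, here is a small userspace model of the check, written as a sketch only. It assumes that, after commit 345c0dbf3a30, the journal's own blocks fall inside a protected range, so references to them look invalid unless the inode being checked is the journal itself. The `model_fs` struct and both `model_*` functions are hypothetical names for this example and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a file system with one protected block range. */
struct model_fs {
	bool has_journal;       /* journal feature enabled */
	uint32_t journal_inum;  /* inode number of the journal */
	uint32_t zone_start;    /* first protected block (assumed to hold the journal) */
	uint32_t zone_len;      /* number of protected blocks */
	uint32_t total_blocks;  /* size of the file system in blocks */
};

/* A data block is valid if it is in range and outside the protected zone. */
static bool model_block_valid(const struct model_fs *fs, uint32_t blk)
{
	if (blk == 0 || blk >= fs->total_blocks)
		return false;
	return blk < fs->zone_start || blk >= fs->zone_start + fs->zone_len;
}

/*
 * Check an array of indirect block references for inode `ino`.
 * Returns 0 if all references pass, -1 on the first invalid one.
 * `journal_exception` toggles the early return that the patch adds.
 */
static int model_check_blockref(const struct model_fs *fs, uint32_t ino,
				const uint32_t *refs, size_t nrefs,
				bool journal_exception)
{
	assert(fs != NULL);
	if (journal_exception && fs->has_journal && ino == fs->journal_inum)
		return 0;

	for (size_t i = 0; i < nrefs; i++) {
		/* Zero means "no block mapped" and is skipped, as in the patch context. */
		if (refs[i] && !model_block_valid(fs, refs[i]))
			return -1;
	}
	return 0;
}
```

In this model, checking the journal inode's references without the exception fails because its blocks sit in the protected range, while the same call with the exception enabled succeeds. Other inodes pointing into that range are still rejected.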

2019-05-15 12:14:52

by Arthur Marsh

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels



On 15 May 2019 2:27:17 pm ACST, Theodore Ts'o <[email protected]> wrote:
>Ah, I think I see the problem. Sorry, this one was my fault. Does
>this fix things for you?
>
> - Ted
>
>From 0c72924ef346d54e8627440e6d71257aa5b56105 Mon Sep 17 00:00:00 2001
>From: Theodore Ts'o <[email protected]>
>Date: Wed, 15 May 2019 00:51:19 -0400
>Subject: [PATCH] ext4: fix block validity checks for journal inodes
>using indirect blocks
>
>Commit 345c0dbf3a30 ("ext4: protect journal inode's blocks using
>block_validity") failed to add an exception for the journal inode in
>ext4_check_blockref(), which is the function used by ext4_get_branch()
>for indirect blocks. This caused attempts to read from the ext3-style
>journals to fail with:
>
>[ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode
>#8: block 30343695: comm jbd2/sdb7-8: invalid block
>
>Fix this by adding the missing exception check.
>
>Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using
>block_validity")
>Reported-by: Arthur Marsh <[email protected]>
>Signed-off-by: Theodore Ts'o <[email protected]>
>---
> fs/ext4/block_validity.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
>diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
>index 8d03550aaae3..8e83741b02e0 100644
>--- a/fs/ext4/block_validity.c
>+++ b/fs/ext4/block_validity.c
>@@ -277,6 +277,11 @@ int ext4_check_blockref(const char *function,
>unsigned int line,
> __le32 *bref = p;
> unsigned int blk;
>
>+ if (ext4_has_feature_journal(inode->i_sb) &&
>+ (inode->i_ino ==
>+ le32_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_journal_inum)))
>+ return 0;
>+
> while (bref < p+max) {
> blk = le32_to_cpu(*bref++);
> if (blk &&

I have built kernels with the attached patch applied and run git gc on the patched kernels (both the 32 bit kernel on the Pentium-D and the 64 bit kernel on the Athlon II X4 640).

There were a couple of warnings from other processes being blocked while the git gc was taking place but no filesystem corruption detected. (I ran forced fsck checks on the root filesystems after the git gc runs to check for corruption).

Thanks for the patch!

Arthur.



Attachments:
ext4.tmp (506.00 B)
20190515iucode_tool.log (4.18 kB)

2019-05-16 02:57:21

by Theodore Ts'o

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels

On Wed, May 15, 2019 at 09:42:11PM +0930, Arthur Marsh wrote:
> I have built kernels with the attached patch applied and run git gc
> on the patched kernels (both the 32 bit kernel on the Pentium-D and
> the 64 bit kernel on the Athlon II X4 640).
>
> There were a couple of warnings from other processes being blocked
> while the git gc was taking place but no filesystem corruption
> detected. (I ran forced fsck checks on the root filesystems after
> the git gc runs to check for corruption).
>
> Thanks for the patch!

Thanks for the bug report! My apologies for the inconvenience; I'm
going to take a look at improving my regression test configurations so
I would have noticed this earlier.

Cheers,

- Ted

2019-05-17 09:44:44

by Geert Uytterhoeven

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels

Hi Ted,

On Sun, May 12, 2019 at 12:07 AM Theodore Ts'o <[email protected]> wrote:
> On Sat, May 11, 2019 at 02:43:16PM +0200, Richard Weinberger wrote:
> > [CC'in linux-ext4]
> >
> > On Sat, May 11, 2019 at 1:47 PM Arthur Marsh
> > <[email protected]> wrote:
> > >
> > >
> > > The filesystem with the kernel source tree is the root file system, ext3, mounted as:
> > >
> > > /dev/sdb7 on / type ext3 (rw,relatime,errors=remount-ro)
> > >
> > > After the "Compressing objects" stage, the following appears in dmesg:
> > >
> > > [ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block
> > > [ 849.077426] Aborting journal on device sdb7-8.
> > > [ 849.100963] EXT4-fs (sdb7): Remounting filesystem read-only
> > > [ 849.100976] jbd2_journal_bmap: journal block not found at offset 989 on sdb7-8
>
> This indicates that the extent tree blocks for the journal was found
> to be corrupt; so the journal couldn't be found.
>
> > > # fsck -yv
> > > fsck from util-linux 2.33.1
> > > e2fsck 1.45.0 (6-Mar-2019)
> > > /dev/sdb7: recovering journal
> > > /dev/sdb7 contains a file system with errors, check forced.
>
> But e2fsck had no problem finding the journal.
>
> > > Pass 1: Checking inodes, blocks, and sizes
> > > Pass 2: Checking directory structure
> > > Pass 3: Checking directory connectivity
> > > Pass 4: Checking reference counts
> > > Pass 5: Checking group summary information
> > > Free blocks count wrong (4619656, counted=4619444).
> > > Fix? yes
> > >
> > > Free inodes count wrong (15884075, counted=15884058).
> > > Fix? yes
>
> And no other significant problems were found. (Ext4 never updates or
> relies on the summary number of free blocks and free inodes, since
> updating it is a scalability bottleneck and these values can be
> calculated from the per block group free block/inodes count. So the
> fact that e2fsck needed to update them is not an issue.)
>
> So that implies that we got one set of values when we read the journal
> inode when attempting to mount the file system, and a *different* set
> of values when e2fsck was run. Which makes means that we need
> consider the possibility that the problem is below the file system
> layer (e.g., the block layer, device drivers, etc.).
>
>
> > > /dev/sdb7: ***** FILE SYSTEM WAS MODIFIED *****
> > >
> > > Other times, I have gotten:
> > >
> > > "Inodes that were part of a corrupted orphan linked list found."
> > > "Block bitmap differences:"
> > > "Free blocks count wrong for group"
> > >
>
> This variety of issues also implies that the issue may be in the data
> read by the file system, as opposed to an issue in the file system.
>
> Arthur, can you give us the full details of your hardware
> configuration and your kernel config file? Also, what kernel git
> commit ID were you testing?

I'm seeing similar things running post v5.1 on ARAnyM (Atari emulator):

EXT4-fs (sda1): mounting ext3 file system using the ext4 subsystem
...
EXT4-fs error (device sda1): ext4_get_branch:171: inode #1980:
block 27550: comm jbd2/sda1-1980: invalid block

and userspace hung somewhere during initial system startup, so I had to
kill the instance.

-----

EXT4-fs (sda1): mounting ext3 file system using the ext4 subsystem
EXT4-fs (sda1): INFO: recovery required on readonly filesystem
EXT4-fs (sda1): write access will be enabled during recovery
EXT4-fs warning (device sda1): ext4_clear_journal_err:5078:
Filesystem error recorded from previous mount: IO failure
EXT4-fs warning (device sda1): ext4_clear_journal_err:5079:
Marking fs in need of filesystem check.
EXT4-fs (sda1): recovery complete
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
VFS: Mounted root (ext3 filesystem) readonly on device 8:1.
...
Run /sbin/init as init process
random: fast init done
EXT4-fs (sda1): re-mounted. Opts:
random: crng init done
EXT4-fs (sda1): re-mounted. Opts: errors=remount-ro
EXT4-fs (sda1): error count since last fsck: 1
EXT4-fs (sda1): initial error at time 1557931133:
ext4_get_branch:171: inode 1980: block 27550
EXT4-fs (sda1): last error at time 1557931133:
ext4_get_branch:171: inode 1980: block 27550

-----

EXT4-fs (sda1): mounting ext3 file system using the ext4 subsystem
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
VFS: Mounted root (ext3 filesystem) readonly on device 8:1.
...
Run /sbin/init as init process
random: fast init done
EXT4-fs (sda1): re-mounted. Opts:
EXT4-fs (sda1): re-mounted. Opts: errors=remount-ro
random: crng init done
EXT4-fs error (device sda1): ext4_get_branch:171: inode #1980:
block 27550: comm jbd2/sda1-1980: invalid block
Aborting journal on device sda1-1980.
EXT4-fs (sda1): Remounting filesystem read-only
jbd2_journal_bmap: journal block not found at offset 426 on sda1-1980
EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected
aborted journal
EXT4-fs (sda1): error count since last fsck: 3
EXT4-fs (sda1): initial error at time 1557931133:
ext4_get_branch:171: inode 1980: block 27550
EXT4-fs (sda1): last error at time 1558083596:
ext4_journal_check_start:61: inode 1980: block 27550
EXT4-fs error (device sda1): ext4_remount:5328: Abort forced by user

---

EXT4-fs (sda1): mounting ext3 file system using the ext4 subsystem
EXT4-fs (sda1): INFO: recovery required on readonly filesystem
EXT4-fs (sda1): write access will be enabled during recovery
random: fast init done
EXT4-fs warning (device sda1): ext4_clear_journal_err:5078:
Filesystem error recorded from previous mount: IO failure
EXT4-fs warning (device sda1): ext4_clear_journal_err:5079:
Marking fs in need of filesystem check.
EXT4-fs (sda1): recovery complete
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
...
Run /sbin/init as init process
random: crng init done
EXT4-fs (sda1): re-mounted. Opts:
EXT4-fs (sda1): re-mounted. Opts: errors=remount-ro
EXT4-fs (sda1): error count since last fsck: 4
EXT4-fs (sda1): initial error at time 1557931133:
ext4_get_branch:171: inode 1980: block 27550
EXT4-fs (sda1): last error at time 1558083665: ext4_remount:5328:
inode 1980: block 27550

Notes:
- It's always the same block,
- Block device is an image file, accessed using
arch/m68k/emu/nfblock.c, which did not receive any recent (bvec)
updates.
- There are no reported errors for the device containing the image
file on the host,
- Given Arthur sees the issue on a different class of machines, it's
unlikely the issue is related to a problem with the block device
(driver). It may still be an issue with the block layer, though,
- Both Arthur and I are mounting an ext3 file system using the ext4
subsystem.

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2019-05-17 16:48:00

by Geert Uytterhoeven

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels

Hi Ted,

On Wed, May 15, 2019 at 6:57 AM Theodore Ts'o <[email protected]> wrote:
> Ah, I think I see the problem. Sorry, this one was my fault. Does
> this fix things for you?

Thanks!
Sorry for missing this patch in the thread before.

> From 0c72924ef346d54e8627440e6d71257aa5b56105 Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <[email protected]>
> Date: Wed, 15 May 2019 00:51:19 -0400
> Subject: [PATCH] ext4: fix block validity checks for journal inodes using indirect blocks
>
> Commit 345c0dbf3a30 ("ext4: protect journal inode's blocks using
> block_validity") failed to add an exception for the journal inode in
> ext4_check_blockref(), which is the function used by ext4_get_branch()
> for indirect blocks. This caused attempts to read from the ext3-style
> journals to fail with:
>
> [ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block
>
> Fix this by adding the missing exception check.
>
> Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity")
> Reported-by: Arthur Marsh <[email protected]>
> Signed-off-by: Theodore Ts'o <[email protected]>

Intermittent issue no longer seen in 10 test boots, so
Tested-by: Geert Uytterhoeven <[email protected]>

Gr{oetje,eeting}s,

Geert


2019-07-01 12:51:15

by Geert Uytterhoeven

Subject: Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels

Hi Ted,

On Fri, May 17, 2019 at 6:44 PM Geert Uytterhoeven <[email protected]> wrote:
> On Wed, May 15, 2019 at 6:57 AM Theodore Ts'o <[email protected]> wrote:
> > Ah, I think I see the problem. Sorry, this one was my fault. Does
> > this fix things for you?
>
> Thanks!
> Sorry for missing this patch in the thread before.
>
> > From 0c72924ef346d54e8627440e6d71257aa5b56105 Mon Sep 17 00:00:00 2001
> > From: Theodore Ts'o <[email protected]>
> > Date: Wed, 15 May 2019 00:51:19 -0400
> > Subject: [PATCH] ext4: fix block validity checks for journal inodes using indirect blocks
> >
> > Commit 345c0dbf3a30 ("ext4: protect journal inode's blocks using
> > block_validity") failed to add an exception for the journal inode in
> > ext4_check_blockref(), which is the function used by ext4_get_branch()
> > for indirect blocks. This caused attempts to read from the ext3-style
> > journals to fail with:
> >
> > [ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block
> >
> > Fix this by adding the missing exception check.
> >
> > Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity")
> > Reported-by: Arthur Marsh <[email protected]>
> > Signed-off-by: Theodore Ts'o <[email protected]>
>
> Intermittent issue no more seen in 10 test boots, so
> Tested-by: Geert Uytterhoeven <[email protected]>

Despite this fix having been applied upstream, the kernel prints from
time to time:

EXT4-fs (sda1): error count since last fsck: 5
EXT4-fs (sda1): initial error at time 1557931133:
ext4_get_branch:171: inode 1980: block 27550
EXT4-fs (sda1): last error at time 1558114349:
ext4_get_branch:171: inode 1980: block 27550

This happens even after a manual run of "e2fsck -f" (while it's mounted
RO), which reports a clean file system.

The inode and block numbers match the numbers printed due to the
previous bug.
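
The "time" values in those messages look like Unix timestamps, so decoding them shows when the recorded errors happened. The sketch below uses only standard C (`gmtime` and `strftime`). The helper names `decode_utc()` and `format_utc()` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Break a Unix timestamp down into UTC calendar fields. Returns 0 on success. */
static int decode_utc(time_t t, struct tm *out)
{
	const struct tm *utc;

	assert(out != NULL);
	utc = gmtime(&t);
	if (utc == NULL)
		return -1;
	*out = *utc;
	return 0;
}

/* Format a Unix timestamp as "YYYY-MM-DD HH:MM:SS" (UTC); returns bytes written or 0. */
static size_t format_utc(time_t t, char *buf, size_t len)
{
	struct tm utc;

	if (decode_utc(t, &utc) != 0)
		return 0;
	return strftime(buf, len, "%Y-%m-%d %H:%M:%S", &utc);
}
```

Fed the two values above, 1557931133 decodes to 2019-05-15 14:38:53 UTC and 1558114349 to 2019-05-17 17:32:29 UTC. Both are from mid-May, so the messages may be replaying error records stored earlier rather than reporting new failures. That is only an inference from the timestamps.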

Do you have an idea what's wrong?
Note that I run a very old version of e2fsck (from a decade ago).

Thanks!

Gr{oetje,eeting}s,

Geert
