2018-07-06 17:45:20

by dann frazier

[permalink] [raw]
Subject: [Bisect] ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap

Hi,
We're seeing a regression triggered by the stress-ng[*] "chdir" test
that I've bisected to:

044e6e3d74a3 ext4: don't update checksum of new initialized bitmaps

So far we've only seen failures on servers based on HiSilicon's family
of ARM64 SoCs (D05/Hi1616 SoC, D06/Hi1620 SoC). On these systems it is
very reproducible.

= Test Case =
#!/bin/sh
umount /tmp/mnt || /bin/true
mkdir -p /tmp/mnt
mkfs.ext4 -F /dev/sda1
mount /dev/sda1 /tmp/mnt
# Running directly under /tmp/mnt doesn't trigger the issue, we need
# this subdirectory for some reason
dir="/tmp/mnt/tmp/disk_stress_ng_f70f0f26-b332-4c48-9e07-67c529770e3d"
mkdir -p "$dir"
stress-ng --aggressive --verify --timeout 240 --temp-path "$dir" \
--chdir 0 --hdd-opts dsync --readahead-bytes 16M -k

= Result =
[70586.263840] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[70586.637085] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap - block_group = 43936, inode_bitmap = 1439694864
[70602.296551] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap - block_group = 48896, inode_bitmap = 1602224144
[70602.409291] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap - block_group = 48928, inode_bitmap = 1603272720

Full dmesg:
http://people.canonical.com/~dannf/d05-stress-ng/d05-sas-chdir-4.18.0-rc3+.dmesg

[*] http://kernel.ubuntu.com/~cking/stress-ng/


2018-07-07 04:11:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Bisect] ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap

On Fri, Jul 06, 2018 at 11:43:24AM -0600, dann frazier wrote:
> Hi,
> We're seeing a regression triggered by the stress-ng[*] "chdir" test
> that I've bisected to:
>
> 044e6e3d74a3 ext4: don't update checksum of new initialized bitmaps
>
> So far we've only seen failures on servers based on HiSilicon's family
> of ARM64 SoCs (D05/Hi1616 SoC, D06/Hi1620 SoC). On these systems it is
> very reproducible.

Thanks for the report. Can you verify whether or not this patch fixes
things for you?

- Ted

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index da6c10c1e37a..1cfb74bc4dca 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -90,6 +90,8 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
return -EFSCORRUPTED;

ext4_lock_group(sb, block_group);
+ if (buffer_verified(bh))
+ goto verified;
blk = ext4_inode_bitmap(sb, desc);
if (!ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh,
EXT4_INODES_PER_GROUP(sb) / 8)) {
@@ -101,6 +103,7 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
return -EFSBADCRC;
}
set_buffer_verified(bh);
+verified:
ext4_unlock_group(sb, block_group);
return 0;
}

2018-07-10 17:04:43

by dann frazier

[permalink] [raw]
Subject: Re: [Bisect] ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap

On Sat, Jul 07, 2018 at 12:10:18AM -0400, Theodore Y. Ts'o wrote:
> On Fri, Jul 06, 2018 at 11:43:24AM -0600, dann frazier wrote:
> > Hi,
> > We're seeing a regression triggered by the stress-ng[*] "chdir" test
> > that I've bisected to:
> >
> > 044e6e3d74a3 ext4: don't update checksum of new initialized bitmaps
> >
> > So far we've only seen failures on servers based on HiSilicon's family
> > of ARM64 SoCs (D05/Hi1616 SoC, D06/Hi1620 SoC). On these systems it is
> > very reproducible.
>
> Thanks for the report. Can you verify whether or not this patch fixes
> things for you?

hey Ted,
Sorry for the delayed response - was afk for a long weekend.
Your patch does seem to fix the issue for me - after applying the
patch, I was able to survive 20 iterations (and counting), where
previously it would always fail the first time.

However, I've received a conflicting report from a colleague who
appears to still be seeing errors. I'll get back to you ASAP once I am
able to (in-?)validate that observation.

-dann

> - Ted
>
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index da6c10c1e37a..1cfb74bc4dca 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -90,6 +90,8 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
> return -EFSCORRUPTED;
>
> ext4_lock_group(sb, block_group);
> + if (buffer_verified(bh))
> + goto verified;
> blk = ext4_inode_bitmap(sb, desc);
> if (!ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh,
> EXT4_INODES_PER_GROUP(sb) / 8)) {
> @@ -101,6 +103,7 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
> return -EFSBADCRC;
> }
> set_buffer_verified(bh);
> +verified:
> ext4_unlock_group(sb, block_group);
> return 0;
> }

2018-07-10 20:45:21

by dann frazier

[permalink] [raw]
Subject: Re: [Bisect] ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap

On Tue, Jul 10, 2018 at 10:51:43AM -0600, dann frazier wrote:
> On Sat, Jul 07, 2018 at 12:10:18AM -0400, Theodore Y. Ts'o wrote:
> > On Fri, Jul 06, 2018 at 11:43:24AM -0600, dann frazier wrote:
> > > Hi,
> > > We're seeing a regression triggered by the stress-ng[*] "chdir" test
> > > that I've bisected to:
> > >
> > > 044e6e3d74a3 ext4: don't update checksum of new initialized bitmaps
> > >
> > > So far we've only seen failures on servers based on HiSilicon's family
> > > of ARM64 SoCs (D05/Hi1616 SoC, D06/Hi1620 SoC). On these systems it is
> > > very reproducible.
> >
> > Thanks for the report. Can you verify whether or not this patch fixes
> > things for you?
>
> hey Ted,
> Sorry for the delayed response - was afk for a long weekend.
> Your patch does seem to fix the issue for me - after applying the
> patch, I was able to survive 20 iterations (and counting), where
> previously it would always fail the first time.
>
> However, I've received a conflicting report from a colleague who
> appears to still be seeing errors. I'll get back to you ASAP once I am
> able to (in-?)validate that observation.

OK - I believe I found an explanation for my colleague's continued
test failures after applying the patch. The filesystem being used may
have already been corrupted from a previous run, and the test w/ your
patch just tripped over it. Details are here starting in comment #9 if
you'd like to look them over:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137

-dann

2018-07-11 08:59:02

by Ike Panhc

[permalink] [raw]
Subject: Re: [Bisect] ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap

On 07/11/2018 04:43 AM, dann frazier wrote:
> On Tue, Jul 10, 2018 at 10:51:43AM -0600, dann frazier wrote:
>> On Sat, Jul 07, 2018 at 12:10:18AM -0400, Theodore Y. Ts'o wrote:
>>> On Fri, Jul 06, 2018 at 11:43:24AM -0600, dann frazier wrote:
>>>> Hi,
>>>> We're seeing a regression triggered by the stress-ng[*] "chdir" test
>>>> that I've bisected to:
>>>>
>>>> 044e6e3d74a3 ext4: don't update checksum of new initialized bitmaps
>>>>
>>>> So far we've only seen failures on servers based on HiSilicon's family
>>>> of ARM64 SoCs (D05/Hi1616 SoC, D06/Hi1620 SoC). On these systems it is
>>>> very reproducible.
>>>
>>> Thanks for the report. Can you verify whether or not this patch fixes
>>> things for you?
>>
>> hey Ted,
>> Sorry for the delayed response - was afk for a long weekend.
>> Your patch does seem to fix the issue for me - after applying the
>> patch, I was able to survive 20 iterations (and counting), where
>> previously it would always fail the first time.
>>
>> However, I've received a conflicting report from a colleague who
>> appears to still be seeing errors. I'll get back to you ASAP once I am
>> able to (in-?)validate that observation.
>
> OK - I believe I found an explanation for my colleague's continued
> test failures after applying the patch. The filesystem being used may
> have already been corrupted from a previous run, and the test w/ your
> patch just tripped over it. Details are here starting in comment #9 if
> you'd like to look them over:
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137
>
> -dann
>

Review console log and on each run I have filesystem rebuild. The problem
is that mke2fs I am using is 1.44.3-rc2. I am now reseting the environment
and re-test.

-- Ike

2018-07-12 23:09:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Bisect] ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap

>
> Review console log and on each run I have filesystem rebuild. The problem
> is that mke2fs I am using is 1.44.3-rc2. I am now reseting the environment
> and re-test.
>

Could it be that you saw the error in ext4_validate_block_bitmap()?
The patch which I sent Dann only fixed the problem for inode bitmaps;
I noticed today that we need to fix it for block allocation bitmaps as
well.

- Ted

commit 8d5a803c6a6ce4ec258e31f76059ea5153ba46ef
Author: Theodore Ts'o <[email protected]>
Date: Thu Jul 12 19:08:05 2018 -0400

ext4: check for allocation block validity with block group locked

With commit 044e6e3d74a3: "ext4: don't update checksum of new
initialized bitmaps" the buffer valid bit will get set without
actually setting up the checksum for the allocation bitmap, since the
checksum will get calculated once we actually allocate an inode or
block.

If we are doing this, then we need to (re-)check the verified bit
after we take the block group lock. Otherwise, we could race with
another process reading and verifying the bitmap, which would then
complain about the checksum being invalid.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137

Signed-off-by: Theodore Ts'o <[email protected]>
Cc: [email protected]

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index e68cefe08261..aa52d87985aa 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -368,6 +368,8 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
return -EFSCORRUPTED;

ext4_lock_group(sb, block_group);
+ if (buffer_verified(bh))
+ goto verified;
if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
desc, bh))) {
ext4_unlock_group(sb, block_group);
@@ -386,6 +388,7 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
return -EFSCORRUPTED;
}
set_buffer_verified(bh);
+verified:
ext4_unlock_group(sb, block_group);
return 0;
}
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index fb83750c1a14..e9d8e2667ab5 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -90,6 +90,8 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
return -EFSCORRUPTED;

ext4_lock_group(sb, block_group);
+ if (buffer_verified(bh))
+ goto verified;
blk = ext4_inode_bitmap(sb, desc);
if (!ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh,
EXT4_INODES_PER_GROUP(sb) / 8)) {
@@ -101,6 +103,7 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
return -EFSBADCRC;
}
set_buffer_verified(bh);
+verified:
ext4_unlock_group(sb, block_group);
return 0;
}

2018-07-14 11:22:42

by dann frazier

[permalink] [raw]
Subject: Re: [Bisect] ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap

On Thu, Jul 12, 2018 at 5:08 PM Theodore Y. Ts'o <[email protected]> wrote:
>
> >
> > Review console log and on each run I have filesystem rebuild. The problem
> > is that mke2fs I am using is 1.44.3-rc2. I am now reseting the environment
> > and re-test.
> >
>
> Could it be that you saw the error in ext4_validate_block_bitmap()?

Looks like it. From Ike's report:

# grep EXT4 d05-4-ipmi.log
[ 26.215587] EXT4-fs (sdb2): mounted filesystem with ordered data
mode. Opts: (null)
[ 29.844105] EXT4-fs (sdb2): re-mounted. Opts: errors=remount-ro
[ 3586.211348] EXT4-fs error (device sda2):
ext4_validate_block_bitmap:383: comm stress-ng: bg 4705: bad block
bitmap checksum
[ 8254.776992] EXT4-fs error (device sda2):
ext4_validate_block_bitmap:383: comm stress-ng: bg 4193: bad block
bitmap checksum

I've ran my test case for several days w/ just the inode bitmap fix
and haven't been able to reproduce it - but perhaps that's just the
nature of the chdir test.

> The patch which I sent Dann only fixed the problem for inode bitmaps;
> I noticed today that we need to fix it for block allocation bitmaps as
> well.

I've also now ran several iterations w/ the block bitmap fix as well,
and still no problems, so:

Tested-by: dann frazier <[email protected]>

> commit 8d5a803c6a6ce4ec258e31f76059ea5153ba46ef
> Author: Theodore Ts'o <[email protected]>
> Date: Thu Jul 12 19:08:05 2018 -0400
>
> ext4: check for allocation block validity with block group locked
>
> With commit 044e6e3d74a3: "ext4: don't update checksum of new
> initialized bitmaps" the buffer valid bit will get set without
> actually setting up the checksum for the allocation bitmap, since the
> checksum will get calculated once we actually allocate an inode or
> block.
>
> If we are doing this, then we need to (re-)check the verified bit
> after we take the block group lock. Otherwise, we could race with
> another process reading and verifying the bitmap, which would then
> complain about the checksum being invalid.
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137
>
> Signed-off-by: Theodore Ts'o <[email protected]>
> Cc: [email protected]

Would it also make sense to add the following?

Fixes: 044e6e3d74a3 ("ext4: don't update checksum of new initialized bitmaps")

-dann

> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index e68cefe08261..aa52d87985aa 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -368,6 +368,8 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
> return -EFSCORRUPTED;
>
> ext4_lock_group(sb, block_group);
> + if (buffer_verified(bh))
> + goto verified;
> if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
> desc, bh))) {
> ext4_unlock_group(sb, block_group);
> @@ -386,6 +388,7 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
> return -EFSCORRUPTED;
> }
> set_buffer_verified(bh);
> +verified:
> ext4_unlock_group(sb, block_group);
> return 0;
> }
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index fb83750c1a14..e9d8e2667ab5 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -90,6 +90,8 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
> return -EFSCORRUPTED;
>
> ext4_lock_group(sb, block_group);
> + if (buffer_verified(bh))
> + goto verified;
> blk = ext4_inode_bitmap(sb, desc);
> if (!ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh,
> EXT4_INODES_PER_GROUP(sb) / 8)) {
> @@ -101,6 +103,7 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
> return -EFSBADCRC;
> }
> set_buffer_verified(bh);
> +verified:
> ext4_unlock_group(sb, block_group);
> return 0;
> }

2018-07-16 23:15:08

by dann frazier

[permalink] [raw]
Subject: Re: [Bisect] ext4_validate_inode_bitmap:98: comm stress-ng: Corrupt inode bitmap

On Sat, Jul 14, 2018 at 5:21 AM dann frazier <[email protected]> wrote:
>
> On Thu, Jul 12, 2018 at 5:08 PM Theodore Y. Ts'o <[email protected]> wrote:
> >
> > >
> > > Review console log and on each run I have filesystem rebuild. The problem
> > > is that mke2fs I am using is 1.44.3-rc2. I am now reseting the environment
> > > and re-test.
> > >
> >
> > Could it be that you saw the error in ext4_validate_block_bitmap()?
>
> Looks like it. From Ike's report:
>
> # grep EXT4 d05-4-ipmi.log
> [ 26.215587] EXT4-fs (sdb2): mounted filesystem with ordered data
> mode. Opts: (null)
> [ 29.844105] EXT4-fs (sdb2): re-mounted. Opts: errors=remount-ro
> [ 3586.211348] EXT4-fs error (device sda2):
> ext4_validate_block_bitmap:383: comm stress-ng: bg 4705: bad block
> bitmap checksum
> [ 8254.776992] EXT4-fs error (device sda2):
> ext4_validate_block_bitmap:383: comm stress-ng: bg 4193: bad block
> bitmap checksum
>
> I've ran my test case for several days w/ just the inode bitmap fix
> and haven't been able to reproduce it - but perhaps that's just the
> nature of the chdir test.

hey Ted,

Turns out the stress-ng 'mknod' test and - less reliably - the
'dentry' test can tickle the "bad block bitmap checksum" bug pretty
easily. stress-ng wasn't *detecting* the error, but Colin has just
released a new version that does. We've been running with your updated
patch on 3 machines for several iterations, and have not seen another
occurrence.

-dann

> > The patch which I sent Dann only fixed the problem for inode bitmaps;
> > I noticed today that we need to fix it for block allocation bitmaps as
> > well.
>
> I've also now ran several iterations w/ the block bitmap fix as well,
> and still no problems, so:
>
> Tested-by: dann frazier <[email protected]>
>
> > commit 8d5a803c6a6ce4ec258e31f76059ea5153ba46ef
> > Author: Theodore Ts'o <[email protected]>
> > Date: Thu Jul 12 19:08:05 2018 -0400
> >
> > ext4: check for allocation block validity with block group locked
> >
> > With commit 044e6e3d74a3: "ext4: don't update checksum of new
> > initialized bitmaps" the buffer valid bit will get set without
> > actually setting up the checksum for the allocation bitmap, since the
> > checksum will get calculated once we actually allocate an inode or
> > block.
> >
> > If we are doing this, then we need to (re-)check the verified bit
> > after we take the block group lock. Otherwise, we could race with
> > another process reading and verifying the bitmap, which would then
> > complain about the checksum being invalid.
> >
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137
> >
> > Signed-off-by: Theodore Ts'o <[email protected]>
> > Cc: [email protected]
>
> Would it also make sense to add the following?
>
> Fixes: 044e6e3d74a3 ("ext4: don't update checksum of new initialized bitmaps")
>
> -dann
>
> > diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> > index e68cefe08261..aa52d87985aa 100644
> > --- a/fs/ext4/balloc.c
> > +++ b/fs/ext4/balloc.c
> > @@ -368,6 +368,8 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
> > return -EFSCORRUPTED;
> >
> > ext4_lock_group(sb, block_group);
> > + if (buffer_verified(bh))
> > + goto verified;
> > if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
> > desc, bh))) {
> > ext4_unlock_group(sb, block_group);
> > @@ -386,6 +388,7 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
> > return -EFSCORRUPTED;
> > }
> > set_buffer_verified(bh);
> > +verified:
> > ext4_unlock_group(sb, block_group);
> > return 0;
> > }
> > diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> > index fb83750c1a14..e9d8e2667ab5 100644
> > --- a/fs/ext4/ialloc.c
> > +++ b/fs/ext4/ialloc.c
> > @@ -90,6 +90,8 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
> > return -EFSCORRUPTED;
> >
> > ext4_lock_group(sb, block_group);
> > + if (buffer_verified(bh))
> > + goto verified;
> > blk = ext4_inode_bitmap(sb, desc);
> > if (!ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh,
> > EXT4_INODES_PER_GROUP(sb) / 8)) {
> > @@ -101,6 +103,7 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
> > return -EFSBADCRC;
> > }
> > set_buffer_verified(bh);
> > +verified:
> > ext4_unlock_group(sb, block_group);
> > return 0;
> > }