LinuxLists.cc - [ext4] 21175ca434: mdadm-selftests.enchmarks/mdadm-selftests/tests/01r1fail.fail

2021-04-27 08:00:51

Subject: [ext4] 21175ca434: mdadm-selftests.enchmarks/mdadm-selftests/tests/01r1fail.fail

Greetings,

FYI, we noticed the following commit (built with gcc-9):

commit: 21175ca434c5d49509b73cf473618b01b0b85437 ("ext4: make prefetch_block_bitmaps default")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

in testcase: mdadm-selftests
version: mdadm-selftests-x86_64-5d518de-1_20201008
with the following parameters:

disk: 1HDD
test_prefix: 01r1
ucode: 0x21

on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz with 8G memory

caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):

If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>

2021-04-26 16:59:25 mkdir -p /var/tmp
2021-04-26 16:59:25 mke2fs -t ext3 -b 4096 -J size=4 -q /dev/sda2
2021-04-26 17:00:21 mount -t ext3 /dev/sda2 /var/tmp
sed -e 's/{DEFAULT_METADATA}/1.2/g' \
-e 's,{MAP_PATH},/run/mdadm/map,g' mdadm.8.in > mdadm.8
/usr/bin/install -D -m 644 mdadm.8 /usr/share/man/man8/mdadm.8
/usr/bin/install -D -m 644 mdmon.8 /usr/share/man/man8/mdmon.8
/usr/bin/install -D -m 644 md.4 /usr/share/man/man4/md.4
/usr/bin/install -D -m 644 mdadm.conf.5 /usr/share/man/man5/mdadm.conf.5
/usr/bin/install -D -m 644 udev-md-raid-creating.rules /lib/udev/rules.d/01-md-raid-creating.rules
/usr/bin/install -D -m 644 udev-md-raid-arrays.rules /lib/udev/rules.d/63-md-raid-arrays.rules
/usr/bin/install -D -m 644 udev-md-raid-assembly.rules /lib/udev/rules.d/64-md-raid-assembly.rules
/usr/bin/install -D -m 644 udev-md-clustered-confirm-device.rules /lib/udev/rules.d/69-md-clustered-confirm-device.rules
/usr/bin/install -D -m 755 mdadm /sbin/mdadm
/usr/bin/install -D -m 755 mdmon /sbin/mdmon
Testing on linux-5.12.0-rc4-00017-g21175ca434c5 kernel
/lkp/benchmarks/mdadm-selftests/tests/01r1fail... FAILED - see /var/tmp/01r1fail.log and /var/tmp/fail01r1fail.log for details

To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml
bin/lkp run compatible-job.yaml

---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang

Attachments:

(No filename) (2.35 kB)
config-5.12.0-rc4-00017-g21175ca434c5 (175.52 kB)
job-script (5.96 kB)
mdadm-selftests (1.18 kB)
job.yaml (5.07 kB)
reproduce (100.00 B)
kmsg.xz (17.79 kB)

2021-04-28 17:35:16

by Theodore Ts'o

Subject: Re: [ext4] 21175ca434: mdadm-selftests.enchmarks/mdadm-selftests/tests/01r1fail.fail

(Hmm, why did you cc linux-km on this report? I would have thought
dm-devel would have made more sense?)

On Tue, Apr 27, 2021 at 04:15:39PM +0800, kernel test robot wrote:
>
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: 21175ca434c5d49509b73cf473618b01b0b85437 ("ext4: make prefetch_block_bitmaps default")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>

> in testcase: mdadm-selftests
> version: mdadm-selftests-x86_64-5d518de-1_20201008
> with the following parameters:
>
> disk: 1HDD
> test_prefix: 01r1
> ucode: 0x21

So this failure makes no sense to me. Looking at the kmsg failure
logs, it's failing in the md layer:

kern :info : [ 99.775514] md/raid1:md0: not clean -- starting background reconstruction
kern :info : [ 99.783372] md/raid1:md0: active with 3 out of 4 mirrors
kern :info : [ 99.789735] md0: detected capacity change from 0 to 37888
kern :info : [ 99.796216] md: resync of RAID array md0
kern :crit : [ 99.900450] md/raid1:md0: Disk failure on loop2, disabling device.
md/raid1:md0: Operation continuing on 2 devices.
kern :crit : [ 99.918281] md/raid1:md0: Disk failure on loop1, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
kern :info : [ 100.835833] md: md0: resync interrupted.
kern :info : [ 101.852898] md: resync of RAID array md0
kern :info : [ 101.858347] md: md0: resync done.
user :notice: [ 102.109684] /lkp/benchmarks/mdadm-selftests/tests/01r1fail... FAILED - see /var/tmp/01r1fail.log and /var/tmp/fail01r1fail.log for details

The referenced commit just turns on block bitmap prefetching in ext4
by default. This should not cause md to fail; if it does, that's an md
bug, not an ext4 bug. There should not be anything that the file system
is doing that would cause the kernel to think there is a disk failure.
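
For anyone reading the kmsg excerpt above, here is a minimal shell sketch for pulling the affected member devices out of a saved kernel log. The function name md_failed_devices and the sample file are illustrative only and are not part of lkp or mdadm; the pattern simply matches the "Disk failure on" lines quoted above, and a real log may need a looser pattern.

```shell
#!/bin/sh
# Illustrative helper (not part of lkp or mdadm): print the member device
# named in each md "Disk failure on <dev>, disabling device." message.
md_failed_devices() {
    sed -n 's/.*md\/raid[0-9]*:md[0-9]*: Disk failure on \([^,]*\),.*/\1/p' "$1"
}

# Demo on two lines copied from the log excerpt in this thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
kern :crit : [ 99.900450] md/raid1:md0: Disk failure on loop2, disabling device.
kern :crit : [ 99.918281] md/raid1:md0: Disk failure on loop1, disabling device.
EOF
md_failed_devices "$sample"   # prints loop2, then loop1
rm -f "$sample"
```

On the attached log this could be run after decompressing it, for example with "xz -dc kmsg.xz > kmsg" followed by "md_failed_devices kmsg".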

By the way, the reproduction instructions aren't working currently:

> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email

This fails because lkp is trying to apply a patch that does not apply
to the current version of the md tools.

> bin/lkp split-job --compatible job.yaml
> bin/lkp run compatible-job.yaml

And the current versions of lkp don't generate a compatible-job.yaml
file when you run "lkp split-job --compatible"; instead it generates a
new yaml file with a set of random characters to make the name unique.
(What in Multics parlance would be called a "shriek name"[1] :-)
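
One way to cope with the generated name is sketched below. It rests on an assumption that is not confirmed anywhere in this thread: that split-job writes its output as a .yaml file into the current directory. Under that assumption, the most recently modified .yaml file other than job.yaml is the one to run. The helper name newest_generated_yaml is made up for illustration.

```shell
#!/bin/sh
# Illustrative helper: print the newest *.yaml in directory $1, skipping
# job.yaml itself. Assumes the generated file lands in that directory and
# that file names contain no newlines.
newest_generated_yaml() {
    ls -t "$1"/*.yaml 2>/dev/null | grep -v '/job\.yaml$' | head -n 1
}

# Intended use (commands as given in the report; output location assumed):
#   bin/lkp split-job --compatible job.yaml
#   bin/lkp run "$(newest_generated_yaml .)"
```

If nothing but job.yaml is present the helper prints nothing, so it is worth checking for an empty result before handing it to "lkp run".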

Since I was having trouble running the reproduction, could you send
the /var/tmp/*fail.logs so we could have a bit more insight into what
is going on?

Thanks!

- Ted

2021-04-29 07:46:04

by Chen, Rong A

Subject: Re: [LKP] Re: [ext4] 21175ca434: mdadm-selftests.enchmarks/mdadm-selftests/tests/01r1fail.fail

On 4/28/21 10:03 PM, Theodore Ts'o wrote:
> (Hmm, why did you cc linux-km on this report? I would have thought
> dm-devel would have made more sense?)
>
> On Tue, Apr 27, 2021 at 04:15:39PM +0800, kernel test robot wrote:
>> FYI, we noticed the following commit (built with gcc-9):
>>
>> commit: 21175ca434c5d49509b73cf473618b01b0b85437 ("ext4: make prefetch_block_bitmaps default")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> in testcase: mdadm-selftests
>> version: mdadm-selftests-x86_64-5d518de-1_20201008
>> with the following parameters:
>>
>> disk: 1HDD
>> test_prefix: 01r1
>> ucode: 0x21
> So this failure makes no sense to me. Looking at the kmsg failure
> logs, it's failing in the md layer:
>
> kern :info : [ 99.775514] md/raid1:md0: not clean -- starting background reconstruction
> kern :info : [ 99.783372] md/raid1:md0: active with 3 out of 4 mirrors
> kern :info : [ 99.789735] md0: detected capacity change from 0 to 37888
> kern :info : [ 99.796216] md: resync of RAID array md0
> kern :crit : [ 99.900450] md/raid1:md0: Disk failure on loop2, disabling device.
> md/raid1:md0: Operation continuing on 2 devices.
> kern :crit : [ 99.918281] md/raid1:md0: Disk failure on loop1, disabling device.
> md/raid1:md0: Operation continuing on 1 devices.
> kern :info : [ 100.835833] md: md0: resync interrupted.
> kern :info : [ 101.852898] md: resync of RAID array md0
> kern :info : [ 101.858347] md: md0: resync done.
> user :notice: [ 102.109684] /lkp/benchmarks/mdadm-selftests/tests/01r1fail... FAILED - see /var/tmp/01r1fail.log and /var/tmp/fail01r1fail.log for details
>
> The referenced commit just turns on block bitmap prefetching in ext4
> by default. This should not cause md to fail; if it does, that's an md
> bug, not an ext4 bug. There should not be anything that the file system
> is doing that would cause the kernel to think there is a disk failure.
>
> By the way, the reproduction instructions aren't working currently:
>
>> To reproduce:
>>
>> git clone https://github.com/intel/lkp-tests.git
>> cd lkp-tests
>> bin/lkp install job.yaml # job file is attached in this email
> This fails because lkp is trying to apply a patch that does not apply
> to the current version of the md tools.

Hi Ted,

Thanks for the feedback. Yes, that patch has already been merged into
mdadm, so we have removed it from our code.

>
>> bin/lkp split-job --compatible job.yaml
>> bin/lkp run compatible-job.yaml
> And the current versions of lkp don't generate a compatible-job.yaml
> file when you run "lkp split-job --compatible"; instead it generates a
> new yaml file with a set of random characters to make the name unique.
> (What in Multics parlance would be called a "shriek name"[1] :-)

We have updated the steps to avoid misunderstanding.

>
> Since I was having trouble running the reproduction, could you send
> the /var/tmp/*fail.logs so we could have a bit more insight into what
> is going on?

I have attached the log file for your reference.
By the way, the test comes from
https://github.com/neilbrown/mdadm/blob/master/tests/01r1fail,
so you may want to run it directly.

Best Regards,
Rong Chen

>
> Thanks!
>
> - Ted

Attachments:

log (3.19 kB)

2021-05-13 21:17:47

by kernel test robot

Subject: Re: [ext4] 21175ca434: mdadm-selftests.enchmarks/mdadm-selftests/tests/01r1fail.fail

Hi Theodore,

On Wed, Apr 28, 2021 at 10:03:16AM -0400, Theodore Ts'o wrote:
> (Hmm, why did you cc linux-km on this report? I would have thought
> dm-devel would have made more sense?)
>
> On Tue, Apr 27, 2021 at 04:15:39PM +0800, kernel test robot wrote:
> >
> > FYI, we noticed the following commit (built with gcc-9):
> >
> > commit: 21175ca434c5d49509b73cf473618b01b0b85437 ("ext4: make prefetch_block_bitmaps default")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
>
> > in testcase: mdadm-selftests
> > version: mdadm-selftests-x86_64-5d518de-1_20201008
> > with the following parameters:
> >
> > disk: 1HDD
> > test_prefix: 01r1
> > ucode: 0x21
>
> So this failure makes no sense to me. Looking at the kmsg failure
> logs, it's failing in the md layer:

Just FYI: we reran the test on both the parent and this commit, up to 56
times each. The failure is reproducible on this commit, though not on
every run, whereas the test never failed on the parent.

f68f4063855903fd 21175ca434c5d49509b73cf4736
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :56           61%          34:56        mdadm-selftests.enchmarks/mdadm-selftests/tests/01r1fail.fail
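
As a quick check on the figures above, the 61% is consistent with 34 failures in 56 runs on the commit against none on the parent. A minimal shell sketch of that arithmetic follows; the function name fail_rate is illustrative and is not an lkp command, and it is an assumption that the column is computed this way.

```shell
#!/bin/sh
# Illustrative helper: percentage of failed runs, rounded to the nearest
# integer using integer arithmetic only. Assumes runs > 0.
fail_rate() {
    fails=$1
    runs=$2
    echo $(( (fails * 100 + runs / 2) / runs ))
}

parent=$(fail_rate 0 56)    # parent f68f4063855903fd: no failures in 56 runs
commit=$(fail_rate 34 56)   # commit 21175ca434c5: 34 failures in 56 runs
echo "parent: ${parent}%  commit: ${commit}%"   # parent: 0%  commit: 61%
```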

>
> kern :info : [ 99.775514] md/raid1:md0: not clean -- starting background reconstruction
> kern :info : [ 99.783372] md/raid1:md0: active with 3 out of 4 mirrors
> kern :info : [ 99.789735] md0: detected capacity change from 0 to 37888
> kern :info : [ 99.796216] md: resync of RAID array md0
> kern :crit : [ 99.900450] md/raid1:md0: Disk failure on loop2, disabling device.
> md/raid1:md0: Operation continuing on 2 devices.
> kern :crit : [ 99.918281] md/raid1:md0: Disk failure on loop1, disabling device.
> md/raid1:md0: Operation continuing on 1 devices.
> kern :info : [ 100.835833] md: md0: resync interrupted.
> kern :info : [ 101.852898] md: resync of RAID array md0
> kern :info : [ 101.858347] md: md0: resync done.
> user :notice: [ 102.109684] /lkp/benchmarks/mdadm-selftests/tests/01r1fail... FAILED - see /var/tmp/01r1fail.log and /var/tmp/fail01r1fail.log for details
>
> The referenced commit just turns on block bitmap prefetching in ext4
> by default. This should not cause md to fail; if it does, that's an md
> bug, not an ext4 bug. There should not be anything that the file system
> is doing that would cause the kernel to think there is a disk failure.
>
> By the way, the reproduction instructions aren't working currently:
>
> > To reproduce:
> >
> > git clone https://github.com/intel/lkp-tests.git
> > cd lkp-tests
> > bin/lkp install job.yaml # job file is attached in this email
>
> This fails because lkp is trying to apply a patch that does not apply
> to the current version of the md tools.
>
> > bin/lkp split-job --compatible job.yaml
> > bin/lkp run compatible-job.yaml
>
> And the current versions of lkp don't generate a compatible-job.yaml
> file when you run "lkp split-job --compatible"; instead it generates a
> new yaml file with a set of random characters to make the name unique.
> (What in Multics parlance would be called a "shriek name"[1] :-)
>
> Since I was having trouble running the reproduction, could you send
> the /var/tmp/*fail.logs so we could have a bit more insight into what
> is going on?
>
> Thanks!
>
> - Ted
>