LinuxLists.cc - Bug 215605 - [5.14 regression] BUG: unable to handle page fault while running badblocks (fsck.ext4 -c) on a raid5 md array

2022-02-22 09:07:21

Subject: Bug 215605 - [5.14 regression] BUG: unable to handle page fault while running badblocks (fsck.ext4 -c) on a raid5 md array

Hi, this is your Linux kernel regression tracker.

I noticed a regression report in bugzilla.kernel.org that afaics nobody
acted upon since it was reported about a week ago, that's why I decided
to forward it to the lists and add a few relevant people to the CC. To
quote from https://bugzilla.kernel.org/show_bug.cgi?id=215605

> Dominik Mierzejewski 2022-02-14 10:36:36 UTC
>
> Created attachment 300450 [details]
> kernel-5.16.8 dmesg with crash
>
> I'm experiencing kernel crash when running badblocks (fsck.ext4 -c) on a raid5 md array in my Intel Atom-based NAS box (Thecus N5550):
> [ 720.911993] kernel: BUG: unable to handle page fault for address: ffffdbc681023bc8
> [ 720.912073] kernel: #PF: supervisor read access in kernel mode
> [ 720.912120] kernel: #PF: error_code(0x0000) - not-present page
> [ 720.912166] kernel: PGD 11ffc6067 P4D 11ffc6067 PUD 0
> [ 720.912213] kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 720.912256] kernel: CPU: 1 PID: 1406 Comm: badblocks Not tainted 5.16.8-200.fc35.x86_64 #1
> [ 720.912321] kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
> [ 720.912400] kernel: RIP: 0010:kfree+0x58/0x3e0
> [ 720.912449] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 4
> 8 8b 55 08 48 8d 42 ff
> [ 720.912598] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
> [ 720.912648] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
> [ 720.912708] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
> [ 720.912767] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
> [ 720.912826] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
> [ 720.912885] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
> [ 720.912945] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
> [ 720.913012] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 720.913062] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
> [ 720.913122] kernel: Call Trace:
> [ 720.913150] kernel: <TASK>
> [ 720.913179] kernel: raid5_make_request+0xb31/0xb90 [raid456]
> [ 720.913247] kernel: ? do_wait_intr_irq+0xa0/0xa0
> [ 720.913292] kernel: ? __blk_queue_split+0x30a/0x470
> [ 720.913339] kernel: md_handle_request+0x119/0x180
> [ 720.913386] kernel: md_submit_bio+0x67/0xa0
> [ 720.913425] kernel: __submit_bio_fops+0x91/0x160
> [ 720.913468] kernel: submit_bio_noacct+0xd7/0x2c0
> [ 720.913510] kernel: __blkdev_direct_IO_simple+0x198/0x290
> [ 720.913576] kernel: ? __fpu_restore_sig+0x193/0x570
> [ 720.913623] kernel: ? sysvec_apic_timer_interrupt+0xaf/0xd0
> [ 720.913676] kernel: ? __blkdev_direct_IO_simple+0x290/0x290
> [ 720.913728] kernel: generic_file_read_iter+0x9b/0x160
> [ 720.913775] kernel: new_sync_read+0x105/0x180
> [ 720.913820] kernel: vfs_read+0xf1/0x190
> [ 720.913858] kernel: ksys_read+0x4f/0xc0
> [ 720.913896] kernel: do_syscall_64+0x38/0x90
> [ 720.913936] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 720.913985] kernel: RIP: 0033:0x7f73e676d772
> [ 720.914024] kernel: Code: c0 e9 b2 fe ff ff 50 48 8d 3d da 2e 0c 00 e8 b5 f9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> [ 720.914166] kernel: RSP: 002b:00007fff1b8fcbb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 720.914231] kernel: RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007f73e676d772
> [ 720.917149] kernel: RDX: 0000000000040000 RSI: 00007f73e65d3000 RDI: 0000000000000004
> [ 720.920078] kernel: RBP: 0000000000001000 R08: 00000000015105c0 R09: 0000000000000080
> [ 720.922980] kernel: R10: 00007fff1b8fca00 R11: 0000000000000246 R12: 00000015105c0000
> [ 720.925875] kernel: R13: 0000000000000004 R14: 00007f73e65d3000 R15: 0000000000040000
> [ 720.928795] kernel: </TASK>
> [ 720.931704] kernel: Modules linked in: sctp ip6_udp_tunnel udp_tunnel rpcrdma rdma_cm iw_cm ib_cm ib_core sit tunnel4 ip_tunnel rfkill ipt_REJECT nf_reject_ipv4 iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT iptable_raw xt_multiport xt_set ip6t_REJECT nf_reject_ipv6 xt_LOG nf_log_syslog xt_limit xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat iTCO_wdt intel_pmc_bxt iTCO_vendor_support at24 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx intel_powerclamp raid1 coretemp snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_usb_audio i2c_i801 gma500_gfx i2c_smbus snd_usbmidi_lib joydev snd_hwdep snd_rawmidi snd_seq_device mc snd_pcm lpc_ich i2c_algo_bit snd_timer drm_kms_helper snd cec soundcore nfsd auth_rpcgss
> [ 720.931885] kernel: nfs_acl lockd grace drm fuse sunrpc zram ip_tables hid_logitech_hidpp serio_raw r8152 sata_sil24 video mii hid_jabra e1000e hid_logitech_dj
> [ 720.952122] kernel: CR2: ffffdbc681023bc8
> [ 720.955651] kernel: ---[ end trace de2c3d5b971ae71d ]---
> [ 720.959186] kernel: RIP: 0010:kfree+0x58/0x3e0
> [ 720.962723] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 48 8b 55 08 48 8d 42 ff
> [ 720.966472] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
> [ 720.970238] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
> [ 720.973993] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
> [ 720.977723] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
> [ 720.981464] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
> [ 720.985228] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
> [ 720.988995] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
> [ 720.992774] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 720.996535] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
>
> There was a crash earlier which caused the array to become degraded, but I wasn't able to capture the backtrace due to journal corruption. After resyncing the array I tried running fsck on the filesystem and it's causing a kernel crash each time.
>
> I've reproduced the crash with the following kernels:
> 5.16.8-200.fc35.x86_64
> 5.15.18-200.fc35.x86_64
> 5.14.10-300.fc35.x86_64
>
> It doesn't happen with 5.13 kernels. I've been running fsck.ext4 -c on the array for the last 10 hours without a crash. It found 27 read errors which it corrected.
>
> Steps to reproduce:
> 1. fsck.ext4 -c -v /dev/md126
> 2. crash
>
> The array was degraded as a result of a previous crash, but it recovered itself with no errors reported. Obviously, there remained some read errors which badblocks found and which the kernel tripped over.
>
> Crash with 5.14.10 looks different:
> Feb 13 17:55:06 kernel: general protection fault, probably for non-canonical address 0xef6dfcf53c7d6fc3: 0000 [#1] SMP NOPTI
> Feb 13 17:55:06 kernel: CPU: 2 PID: 831 Comm: md126_raid5 Not tainted 5.14.10-300.fc35.x86_64 #1
> Feb 13 17:55:06 kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
> Feb 13 17:55:06 kernel: RIP: 0010:bio_endio+0x87/0x130
> Feb 13 17:55:06 kernel: Code: 02 00 48 8b 45 08 48 85 c0 74 09 0f b7 55 14 f6 c6 01 75 2f 48 81 7d 38 50 48 5e bb 75 38 0f b6 45 1a 48 8b 5d 40 84 c0 74 09 <80> 7b 1a 00 75 03 88 43 1a 48 89 ef 48 89 dd e8 d5 fe ff ff e9 6a
> Feb 13 17:55:06 kernel: RSP: 0000:ffffa5bb80597b48 EFLAGS: 00010206
> Feb 13 17:55:06 kernel: RAX: 000000000000000a RBX: ef6dfcf53c7d6fa9 RCX: ffff92b840865b50
> Feb 13 17:55:06 kernel: RDX: 0000000000000082 RSI: ffff92b8497f46c0 RDI: ffff92b84881f180
> Feb 13 17:55:06 kernel: RBP: ffff92b8497f46c0 R08: 0000000000000001 R09: 0000000000000000
> Feb 13 17:55:06 kernel: R10: ffff92b74a722c01 R11: 0000000000000001 R12: 0000000000000000
> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: ffff92b74a722cd0 R15: ffff92b844748860
> Feb 13 17:55:06 kernel: FS: 0000000000000000(0000) GS:ffff92b85bd00000(0000) knlGS:0000000000000000
> Feb 13 17:55:06 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 13 17:55:06 kernel: CR2: 0000557d787a2028 CR3: 000000003c194000 CR4: 00000000000006e0
> Feb 13 17:55:06 kernel: Call Trace:
> Feb 13 17:55:06 kernel: ops_complete_biofill+0x98/0xe0 [raid456]
> Feb 13 17:55:06 kernel: async_trigger_callback+0x7d/0xc1 [async_tx]
> Feb 13 17:55:06 kernel: raid_run_ops+0x98c/0x15b0 [raid456]
> Feb 13 17:55:06 kernel: ? handle_stripe_clean_event+0x400/0x400 [raid456]
> Feb 13 17:55:06 kernel: handle_stripe+0xe91/0x1dc0 [raid456]
> Feb 13 17:55:06 kernel: handle_active_stripes.constprop.0+0x390/0x560 [raid456]
> Feb 13 17:55:06 kernel: raid5d+0x39d/0x5d0 [raid456]
> Feb 13 17:55:06 kernel: ? prepare_to_wait_event+0x5d/0x170
> Feb 13 17:55:06 kernel: md_thread+0x98/0x140
> Feb 13 17:55:06 kernel: ? finish_wait+0x80/0x80
> Feb 13 17:55:06 kernel: ? md_write_inc+0x50/0x50
> Feb 13 17:55:06 kernel: kthread+0x124/0x150
> Feb 13 17:55:06 kernel: ? set_kthread_struct+0x40/0x40
> Feb 13 17:55:06 kernel: ret_from_fork+0x1f/0x30
> Feb 13 17:55:06 kernel: Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_set ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_limit xt_multiport xt_state xt_conntrack iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi iTCO_wdt snd_hda_intel intel_pmc_bxt snd_intel_dspcfg iTCO_vendor_support raid456 snd_intel_sdw_acpi snd_hda_codec async_raid6_recov at24 async_memcpy async_pq async_xor async_tx snd_hda_core gma500_gfx snd_hwdep i2c_algo_bit drm_kms_helper snd_pcm intel_powerclamp i2c_i801 snd_timer coretemp i2c_smbus snd cec soundcore lpc_ich nfsd auth_rpcgss nfs_acl lockd drm grace fuse sunrpc ip_tables raid1 serio_raw video sata_sil24 e1000e
> Feb 13 17:55:06 kernel: ---[ end trace 97447bed31e596e0 ]---
> Feb 13 17:55:06 kernel: RIP: 0010:bio_endio+0x87/0x130
> Feb 13 17:55:06 kernel: Code: 02 00 48 8b 45 08 48 85 c0 74 09 0f b7 55 14 f6 c6 01 75 2f 48 81 7d 38 50 48 5e bb 75 38 0f b6 45 1a 48 8b 5d 40 84 c0 74 09 <80> 7b 1a 00 75 03 88 43 1a 48 89 ef 48 89 dd e8 d5 fe ff ff e9 6a
> Feb 13 17:55:06 kernel: RSP: 0000:ffffa5bb80597b48 EFLAGS: 00010206
> Feb 13 17:55:06 kernel: RAX: 000000000000000a RBX: ef6dfcf53c7d6fa9 RCX: ffff92b840865b50
> Feb 13 17:55:06 kernel: RDX: 0000000000000082 RSI: ffff92b8497f46c0 RDI: ffff92b84881f180
> Feb 13 17:55:06 kernel: RBP: ffff92b8497f46c0 R08: 0000000000000001 R09: 0000000000000000
> Feb 13 17:55:06 kernel: R10: ffff92b74a722c01 R11: 0000000000000001 R12: 0000000000000000
> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: ffff92b74a722cd0 R15: ffff92b844748860
> Feb 13 17:55:06 kernel: FS: 0000000000000000(0000) GS:ffff92b85bd00000(0000) knlGS:0000000000000000
> Feb 13 17:55:06 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 13 17:55:06 kernel: CR2: 0000557d787a2028 CR3: 000000003c194000 CR4: 00000000000006e0
> Feb 13 17:55:06 kernel: ------------[ cut here ]------------
> Feb 13 17:55:06 kernel: WARNING: CPU: 2 PID: 831 at kernel/exit.c:739 do_exit+0x37/0xa90
> Feb 13 17:55:06 kernel: Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_set ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_limit xt_multiport xt_state xt_conntrack iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi iTCO_wdt snd_hda_intel intel_pmc_bxt snd_intel_dspcfg iTCO_vendor_support raid456 snd_intel_sdw_acpi snd_hda_codec async_raid6_recov at24 async_memcpy async_pq async_xor async_tx snd_hda_core gma500_gfx snd_hwdep i2c_algo_bit drm_kms_helper snd_pcm intel_powerclamp i2c_i801 snd_timer coretemp i2c_smbus snd cec soundcore lpc_ich nfsd auth_rpcgss nfs_acl lockd drm grace fuse sunrpc ip_tables raid1 serio_raw video sata_sil24 e1000e
> Feb 13 17:55:06 kernel: CPU: 2 PID: 831 Comm: md126_raid5 Tainted: G D 5.14.10-300.fc35.x86_64 #1
> Feb 13 17:55:06 kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
> Feb 13 17:55:06 kernel: RIP: 0010:do_exit+0x37/0xa90
> Feb 13 17:55:06 kernel: Code: 55 48 89 fd 53 65 48 8b 1c 25 c0 7b 01 00 48 83 ec 28 48 8b 83 a8 0c 00 00 48 85 c0 74 0e 48 8b 10 48 39 d0 0f 84 56 04 00 00 <0f> 0b 65 8b 0d a0 2c f3 44 89 c8 25 00 ff ff 00 89 44 24 0c 0f 85
> Feb 13 17:55:06 kernel: RSP: 0000:ffffa5bb80597ef8 EFLAGS: 00010212
> Feb 13 17:55:06 kernel: RAX: ffffa5bb80597e50 RBX: ffff92b8437d8000 RCX: 0000000000000000
> Feb 13 17:55:06 kernel: RDX: ffff92b844217548 RSI: ffff92b85bd18a00 RDI: 000000000000000b
> Feb 13 17:55:06 kernel: RBP: 000000000000000b R08: 0000000000000000 R09: ffffa5bb805977a0
> Feb 13 17:55:06 kernel: R10: ffffa5bb80597798 R11: ffffffffbcf47468 R12: 000000000000000b
> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: ffff92b8437d8000 R15: 0000000000000000
> Feb 13 17:55:06 kernel: FS: 0000000000000000(0000) GS:ffff92b85bd00000(0000) knlGS:0000000000000000
> Feb 13 17:55:06 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 13 17:55:06 kernel: CR2: 0000557d787a2028 CR3: 000000003c194000 CR4: 00000000000006e0
> Feb 13 17:55:06 kernel: Call Trace:
> Feb 13 17:55:06 kernel: ? kthread+0x124/0x150
> Feb 13 17:55:06 kernel: rewind_stack_do_exit+0x17/0x20
> Feb 13 17:55:06 kernel: RIP: 0000:0x0
> Feb 13 17:55:06 kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
> Feb 13 17:55:06 kernel: RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
> Feb 13 17:55:06 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> Feb 13 17:55:06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> Feb 13 17:55:06 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> Feb 13 17:55:06 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> Feb 13 17:55:06 kernel: ---[ end trace 97447bed31e596e1 ]---
>
> Downstream bug report: https://bugzilla.redhat.com/show_bug.cgi?id=2053936 .

Could somebody take a look into this? Or was this discussed somewhere
else already? Or even fixed?

FWIW, I searched the list and wondered if these might be related, but I
was unable to tell for sure:

https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/

Anyway, to get this tracked:

#regzbot introduced: v5.13..v5.14.10
#regzbot from: Dominik Mierzejewski <[email protected]>
#regzbot title: md: unable to handle page fault while running badblocks
(fsck.ext4 -c) on a raid5 md array
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=215605

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

--
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface, the
getting start guide, and the references documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if your want to.

Hint for reporters: when reporting a regression it's in your interest to
CC the regression list and tell regzbot about the issue, as that ensures
the regression makes it onto the radar of the Linux kernel's regression
tracker -- that's in your interest, as it ensures your report won't fall
through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include 'Link:' tag in the patch descriptions pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.

2022-02-28 10:49:49

by Thorsten Leemhuis

[permalink] [raw]

Subject: Re: Bug 215605 - [5.14 regression] BUG: unable to handle page fault while running badblocks (fsck.ext4 -c) on a raid5 md array

[CCing Jens]

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

What's up here? Below regression was reported two weeks ago and I
forwarded it nearly a week ago, nevertheless the reporter afaics didn't
get a single reply. Is the issue discussed somewhere else and I just
missed it? Is the report not accurate for some reason or missing
something important? Or did the report fall throug the cracks?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

#regzbot poke

On 22.02.22 09:59, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker.
>
> I noticed a regression report in bugzilla.kernel.org that afaics nobody
> acted upon since it was reported about a week ago, that's why I decided
> to forward it to the lists and add a few relevant people to the CC. To
> quote from https://bugzilla.kernel.org/show_bug.cgi?id=215605
>
>> Dominik Mierzejewski 2022-02-14 10:36:36 UTC
>>
>> Created attachment 300450 [details]
>> kernel-5.16.8 dmesg with crash
>>
>> I'm experiencing kernel crash when running badblocks (fsck.ext4 -c) on a raid5 md array in my Intel Atom-based NAS box (Thecus N5550):
>> [ 720.911993] kernel: BUG: unable to handle page fault for address: ffffdbc681023bc8
>> [ 720.912073] kernel: #PF: supervisor read access in kernel mode
>> [ 720.912120] kernel: #PF: error_code(0x0000) - not-present page
>> [ 720.912166] kernel: PGD 11ffc6067 P4D 11ffc6067 PUD 0
>> [ 720.912213] kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
>> [ 720.912256] kernel: CPU: 1 PID: 1406 Comm: badblocks Not tainted 5.16.8-200.fc35.x86_64 #1
>> [ 720.912321] kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
>> [ 720.912400] kernel: RIP: 0010:kfree+0x58/0x3e0
>> [ 720.912449] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 4
>> 8 8b 55 08 48 8d 42 ff
>> [ 720.912598] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
>> [ 720.912648] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
>> [ 720.912708] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
>> [ 720.912767] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
>> [ 720.912826] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
>> [ 720.912885] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
>> [ 720.912945] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
>> [ 720.913012] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 720.913062] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
>> [ 720.913122] kernel: Call Trace:
>> [ 720.913150] kernel: <TASK>
>> [ 720.913179] kernel: raid5_make_request+0xb31/0xb90 [raid456]
>> [ 720.913247] kernel: ? do_wait_intr_irq+0xa0/0xa0
>> [ 720.913292] kernel: ? __blk_queue_split+0x30a/0x470
>> [ 720.913339] kernel: md_handle_request+0x119/0x180
>> [ 720.913386] kernel: md_submit_bio+0x67/0xa0
>> [ 720.913425] kernel: __submit_bio_fops+0x91/0x160
>> [ 720.913468] kernel: submit_bio_noacct+0xd7/0x2c0
>> [ 720.913510] kernel: __blkdev_direct_IO_simple+0x198/0x290
>> [ 720.913576] kernel: ? __fpu_restore_sig+0x193/0x570
>> [ 720.913623] kernel: ? sysvec_apic_timer_interrupt+0xaf/0xd0
>> [ 720.913676] kernel: ? __blkdev_direct_IO_simple+0x290/0x290
>> [ 720.913728] kernel: generic_file_read_iter+0x9b/0x160
>> [ 720.913775] kernel: new_sync_read+0x105/0x180
>> [ 720.913820] kernel: vfs_read+0xf1/0x190
>> [ 720.913858] kernel: ksys_read+0x4f/0xc0
>> [ 720.913896] kernel: do_syscall_64+0x38/0x90
>> [ 720.913936] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [ 720.913985] kernel: RIP: 0033:0x7f73e676d772
>> [ 720.914024] kernel: Code: c0 e9 b2 fe ff ff 50 48 8d 3d da 2e 0c 00 e8 b5 f9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
>> [ 720.914166] kernel: RSP: 002b:00007fff1b8fcbb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> [ 720.914231] kernel: RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007f73e676d772
>> [ 720.917149] kernel: RDX: 0000000000040000 RSI: 00007f73e65d3000 RDI: 0000000000000004
>> [ 720.920078] kernel: RBP: 0000000000001000 R08: 00000000015105c0 R09: 0000000000000080
>> [ 720.922980] kernel: R10: 00007fff1b8fca00 R11: 0000000000000246 R12: 00000015105c0000
>> [ 720.925875] kernel: R13: 0000000000000004 R14: 00007f73e65d3000 R15: 0000000000040000
>> [ 720.928795] kernel: </TASK>
>> [ 720.931704] kernel: Modules linked in: sctp ip6_udp_tunnel udp_tunnel rpcrdma rdma_cm iw_cm ib_cm ib_core sit tunnel4 ip_tunnel rfkill ipt_REJECT nf_reject_ipv4 iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT iptable_raw xt_multiport xt_set ip6t_REJECT nf_reject_ipv6 xt_LOG nf_log_syslog xt_limit xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat iTCO_wdt intel_pmc_bxt iTCO_vendor_support at24 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx intel_powerclamp raid1 coretemp snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_usb_audio i2c_i801 gma500_gfx i2c_smbus snd_usbmidi_lib joydev snd_hwdep snd_rawmidi snd_seq_device mc snd_pcm lpc_ich i2c_algo_bit snd_timer drm_kms_helper snd cec soundcore nfsd auth_rpcgss
>> [ 720.931885] kernel: nfs_acl lockd grace drm fuse sunrpc zram ip_tables hid_logitech_hidpp serio_raw r8152 sata_sil24 video mii hid_jabra e1000e hid_logitech_dj
>> [ 720.952122] kernel: CR2: ffffdbc681023bc8
>> [ 720.955651] kernel: ---[ end trace de2c3d5b971ae71d ]---
>> [ 720.959186] kernel: RIP: 0010:kfree+0x58/0x3e0
>> [ 720.962723] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 48 8b 55 08 48 8d 42 ff
>> [ 720.966472] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
>> [ 720.970238] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
>> [ 720.973993] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
>> [ 720.977723] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
>> [ 720.981464] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
>> [ 720.985228] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
>> [ 720.988995] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
>> [ 720.992774] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 720.996535] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
>>
>> There was a crash earlier which caused the array to become degraded, but I wasn't able to capture the backtrace due to journal corruption. After resyncing the array I tried running fsck on the filesystem and it's causing a kernel crash each time.
>>
>> I've reproduced the crash with the following kernels:
>> 5.16.8-200.fc35.x86_64
>> 5.15.18-200.fc35.x86_64
>> 5.14.10-300.fc35.x86_64
>>
>> It doesn't happen with 5.13 kernels. I've been running fsck.ext4 -c on the array for the last 10 hours without a crash. It found 27 read errors which it corrected.
>>
>> Steps to reproduce:
>> 1. fsck.ext4 -c -v /dev/md126
>> 2. crash
>>
>> The array was degraded as a result of a previous crash, but it recovered itself with no errors reported. Obviously, there remained some read errors which badblocks found and which the kernel tripped over.
>>
>> Crash with 5.14.10 looks different:
>> Feb 13 17:55:06 kernel: general protection fault, probably for non-canonical address 0xef6dfcf53c7d6fc3: 0000 [#1] SMP NOPTI
>> Feb 13 17:55:06 kernel: CPU: 2 PID: 831 Comm: md126_raid5 Not tainted 5.14.10-300.fc35.x86_64 #1
>> Feb 13 17:55:06 kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
>> Feb 13 17:55:06 kernel: RIP: 0010:bio_endio+0x87/0x130
>> Feb 13 17:55:06 kernel: Code: 02 00 48 8b 45 08 48 85 c0 74 09 0f b7 55 14 f6 c6 01 75 2f 48 81 7d 38 50 48 5e bb 75 38 0f b6 45 1a 48 8b 5d 40 84 c0 74 09 <80> 7b 1a 00 75 03 88 43 1a 48 89 ef 48 89 dd e8 d5 fe ff ff e9 6a
>> Feb 13 17:55:06 kernel: RSP: 0000:ffffa5bb80597b48 EFLAGS: 00010206
>> Feb 13 17:55:06 kernel: RAX: 000000000000000a RBX: ef6dfcf53c7d6fa9 RCX: ffff92b840865b50
>> Feb 13 17:55:06 kernel: RDX: 0000000000000082 RSI: ffff92b8497f46c0 RDI: ffff92b84881f180
>> Feb 13 17:55:06 kernel: RBP: ffff92b8497f46c0 R08: 0000000000000001 R09: 0000000000000000
>> Feb 13 17:55:06 kernel: R10: ffff92b74a722c01 R11: 0000000000000001 R12: 0000000000000000
>> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: ffff92b74a722cd0 R15: ffff92b844748860
>> Feb 13 17:55:06 kernel: FS: 0000000000000000(0000) GS:ffff92b85bd00000(0000) knlGS:0000000000000000
>> Feb 13 17:55:06 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 13 17:55:06 kernel: CR2: 0000557d787a2028 CR3: 000000003c194000 CR4: 00000000000006e0
>> Feb 13 17:55:06 kernel: Call Trace:
>> Feb 13 17:55:06 kernel: ops_complete_biofill+0x98/0xe0 [raid456]
>> Feb 13 17:55:06 kernel: async_trigger_callback+0x7d/0xc1 [async_tx]
>> Feb 13 17:55:06 kernel: raid_run_ops+0x98c/0x15b0 [raid456]
>> Feb 13 17:55:06 kernel: ? handle_stripe_clean_event+0x400/0x400 [raid456]
>> Feb 13 17:55:06 kernel: handle_stripe+0xe91/0x1dc0 [raid456]
>> Feb 13 17:55:06 kernel: handle_active_stripes.constprop.0+0x390/0x560 [raid456]
>> Feb 13 17:55:06 kernel: raid5d+0x39d/0x5d0 [raid456]
>> Feb 13 17:55:06 kernel: ? prepare_to_wait_event+0x5d/0x170
>> Feb 13 17:55:06 kernel: md_thread+0x98/0x140
>> Feb 13 17:55:06 kernel: ? finish_wait+0x80/0x80
>> Feb 13 17:55:06 kernel: ? md_write_inc+0x50/0x50
>> Feb 13 17:55:06 kernel: kthread+0x124/0x150
>> Feb 13 17:55:06 kernel: ? set_kthread_struct+0x40/0x40
>> Feb 13 17:55:06 kernel: ret_from_fork+0x1f/0x30
>> Feb 13 17:55:06 kernel: Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_set ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_limit xt_multiport xt_state xt_conntrack iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi iTCO_wdt snd_hda_intel intel_pmc_bxt snd_intel_dspcfg iTCO_vendor_support raid456 snd_intel_sdw_acpi snd_hda_codec async_raid6_recov at24 async_memcpy async_pq async_xor async_tx snd_hda_core gma500_gfx snd_hwdep i2c_algo_bit drm_kms_helper snd_pcm intel_powerclamp i2c_i801 snd_timer coretemp i2c_smbus snd cec soundcore lpc_ich nfsd auth_rpcgss nfs_acl lockd drm grace fuse sunrpc ip_tables raid1 serio_raw video sata_sil24 e1000e
>> Feb 13 17:55:06 kernel: ---[ end trace 97447bed31e596e0 ]---
>> Feb 13 17:55:06 kernel: RIP: 0010:bio_endio+0x87/0x130
>> Feb 13 17:55:06 kernel: Code: 02 00 48 8b 45 08 48 85 c0 74 09 0f b7 55 14 f6 c6 01 75 2f 48 81 7d 38 50 48 5e bb 75 38 0f b6 45 1a 48 8b 5d 40 84 c0 74 09 <80> 7b 1a 00 75 03 88 43 1a 48 89 ef 48 89 dd e8 d5 fe ff ff e9 6a
>> Feb 13 17:55:06 kernel: RSP: 0000:ffffa5bb80597b48 EFLAGS: 00010206
>> Feb 13 17:55:06 kernel: RAX: 000000000000000a RBX: ef6dfcf53c7d6fa9 RCX: ffff92b840865b50
>> Feb 13 17:55:06 kernel: RDX: 0000000000000082 RSI: ffff92b8497f46c0 RDI: ffff92b84881f180
>> Feb 13 17:55:06 kernel: RBP: ffff92b8497f46c0 R08: 0000000000000001 R09: 0000000000000000
>> Feb 13 17:55:06 kernel: R10: ffff92b74a722c01 R11: 0000000000000001 R12: 0000000000000000
>> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: ffff92b74a722cd0 R15: ffff92b844748860
>> Feb 13 17:55:06 kernel: FS: 0000000000000000(0000) GS:ffff92b85bd00000(0000) knlGS:0000000000000000
>> Feb 13 17:55:06 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 13 17:55:06 kernel: CR2: 0000557d787a2028 CR3: 000000003c194000 CR4: 00000000000006e0
>> Feb 13 17:55:06 kernel: ------------[ cut here ]------------
>> Feb 13 17:55:06 kernel: WARNING: CPU: 2 PID: 831 at kernel/exit.c:739 do_exit+0x37/0xa90
>> Feb 13 17:55:06 kernel: Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_set ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_limit xt_multiport xt_state xt_conntrack iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi iTCO_wdt snd_hda_intel intel_pmc_bxt snd_intel_dspcfg iTCO_vendor_support raid456 snd_intel_sdw_acpi snd_hda_codec async_raid6_recov at24 async_memcpy async_pq async_xor async_tx snd_hda_core gma500_gfx snd_hwdep i2c_algo_bit drm_kms_helper snd_pcm intel_powerclamp i2c_i801 snd_timer coretemp i2c_smbus snd cec soundcore lpc_ich nfsd auth_rpcgss nfs_acl lockd drm grace fuse sunrpc ip_tables raid1 serio_raw video sata_sil24 e1000e
>> Feb 13 17:55:06 kernel: CPU: 2 PID: 831 Comm: md126_raid5 Tainted: G D 5.14.10-300.fc35.x86_64 #1
>> Feb 13 17:55:06 kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
>> Feb 13 17:55:06 kernel: RIP: 0010:do_exit+0x37/0xa90
>> Feb 13 17:55:06 kernel: Code: 55 48 89 fd 53 65 48 8b 1c 25 c0 7b 01 00 48 83 ec 28 48 8b 83 a8 0c 00 00 48 85 c0 74 0e 48 8b 10 48 39 d0 0f 84 56 04 00 00 <0f> 0b 65 8b 0d a0 2c f3 44 89 c8 25 00 ff ff 00 89 44 24 0c 0f 85
>> Feb 13 17:55:06 kernel: RSP: 0000:ffffa5bb80597ef8 EFLAGS: 00010212
>> Feb 13 17:55:06 kernel: RAX: ffffa5bb80597e50 RBX: ffff92b8437d8000 RCX: 0000000000000000
>> Feb 13 17:55:06 kernel: RDX: ffff92b844217548 RSI: ffff92b85bd18a00 RDI: 000000000000000b
>> Feb 13 17:55:06 kernel: RBP: 000000000000000b R08: 0000000000000000 R09: ffffa5bb805977a0
>> Feb 13 17:55:06 kernel: R10: ffffa5bb80597798 R11: ffffffffbcf47468 R12: 000000000000000b
>> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: ffff92b8437d8000 R15: 0000000000000000
>> Feb 13 17:55:06 kernel: FS: 0000000000000000(0000) GS:ffff92b85bd00000(0000) knlGS:0000000000000000
>> Feb 13 17:55:06 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 13 17:55:06 kernel: CR2: 0000557d787a2028 CR3: 000000003c194000 CR4: 00000000000006e0
>> Feb 13 17:55:06 kernel: Call Trace:
>> Feb 13 17:55:06 kernel: ? kthread+0x124/0x150
>> Feb 13 17:55:06 kernel: rewind_stack_do_exit+0x17/0x20
>> Feb 13 17:55:06 kernel: RIP: 0000:0x0
>> Feb 13 17:55:06 kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
>> Feb 13 17:55:06 kernel: RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
>> Feb 13 17:55:06 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>> Feb 13 17:55:06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>> Feb 13 17:55:06 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>> Feb 13 17:55:06 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>> Feb 13 17:55:06 kernel: ---[ end trace 97447bed31e596e1 ]---
>>
>> Downstream bug report: https://bugzilla.redhat.com/show_bug.cgi?id=2053936 .
>
> Could somebody take a look into this? Or was this discussed somewhere
> else already? Or even fixed?
>
> FWIW, I searched the list and wondered if these might be related, but I
> was unable to tell for sure:
>
> https://lore.kernel.org/all/[email protected]/
> https://lore.kernel.org/all/[email protected]/
>
> Anyway, to get this tracked:
>
> #regzbot introduced: v5.13..v5.14.10
> #regzbot from: Dominik Mierzejewski <[email protected]>
> #regzbot title: md: unable to handle page fault while running badblocks
> (fsck.ext4 -c) on a raid5 md array
> #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=215605
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them and lack
> knowledge about most of the areas they concern. I thus unfortunately
> will sometimes get things wrong or miss something important. I hope
> that's not the case here; if you think it is, don't hesitate to tell me
> in a public reply, it's in everyone's interest to set the public record
> straight.
>

2022-03-01 01:10:58

by Song Liu

[permalink] [raw]

Subject: Re: Bug 215605 - [5.14 regression] BUG: unable to handle page fault while running badblocks (fsck.ext4 -c) on a raid5 md array

On Mon, Feb 28, 2022 at 1:43 AM Thorsten Leemhuis
<[email protected]> wrote:
>
> [CCing Jens]
>
> Hi, this is your Linux kernel regression tracker. Top-posting for once,
> to make this easily accessible to everyone.
>
> What's up here? Below regression was reported two weeks ago and I
> forwarded it nearly a week ago, nevertheless the reporter afaics didn't
> get a single reply. Is the issue discussed somewhere else and I just
> missed it? Is the report not accurate for some reason or missing
> something important? Or did the report fall throug the cracks?

Sorry for the late reply. I was on vacation last week.

>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them and lack
> knowledge about most of the areas they concern. I thus unfortunately
> will sometimes get things wrong or miss something important. I hope
> that's not the case here; if you think it is, don't hesitate to tell me
> in a public reply, it's in everyone's interest to set the public record
> straight.
>
> #regzbot poke
>
> On 22.02.22 09:59, Thorsten Leemhuis wrote:
> > Hi, this is your Linux kernel regression tracker.
> >
> > I noticed a regression report in bugzilla.kernel.org that afaics nobody
> > acted upon since it was reported about a week ago, that's why I decided
> > to forward it to the lists and add a few relevant people to the CC. To
> > quote from https://bugzilla.kernel.org/show_bug.cgi?id=215605
> >
> >> Dominik Mierzejewski 2022-02-14 10:36:36 UTC
> >>
> >> Created attachment 300450 [details]
> >> kernel-5.16.8 dmesg with crash
> >>
> >> I'm experiencing kernel crash when running badblocks (fsck.ext4 -c) on a raid5 md array in my Intel Atom-based NAS box (Thecus N5550):
> >> [ 720.911993] kernel: BUG: unable to handle page fault for address: ffffdbc681023bc8
> >> [ 720.912073] kernel: #PF: supervisor read access in kernel mode
> >> [ 720.912120] kernel: #PF: error_code(0x0000) - not-present page
> >> [ 720.912166] kernel: PGD 11ffc6067 P4D 11ffc6067 PUD 0
> >> [ 720.912213] kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
> >> [ 720.912256] kernel: CPU: 1 PID: 1406 Comm: badblocks Not tainted 5.16.8-200.fc35.x86_64 #1
> >> [ 720.912321] kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
> >> [ 720.912400] kernel: RIP: 0010:kfree+0x58/0x3e0
> >> [ 720.912449] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 4
> >> 8 8b 55 08 48 8d 42 ff
> >> [ 720.912598] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
> >> [ 720.912648] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
> >> [ 720.912708] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
> >> [ 720.912767] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
> >> [ 720.912826] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
> >> [ 720.912885] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
> >> [ 720.912945] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
> >> [ 720.913012] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 720.913062] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
> >> [ 720.913122] kernel: Call Trace:
> >> [ 720.913150] kernel: <TASK>
> >> [ 720.913179] kernel: raid5_make_request+0xb31/0xb90 [raid456]
> >> [ 720.913247] kernel: ? do_wait_intr_irq+0xa0/0xa0
> >> [ 720.913292] kernel: ? __blk_queue_split+0x30a/0x470
> >> [ 720.913339] kernel: md_handle_request+0x119/0x180
> >> [ 720.913386] kernel: md_submit_bio+0x67/0xa0
> >> [ 720.913425] kernel: __submit_bio_fops+0x91/0x160
> >> [ 720.913468] kernel: submit_bio_noacct+0xd7/0x2c0
> >> [ 720.913510] kernel: __blkdev_direct_IO_simple+0x198/0x290
> >> [ 720.913576] kernel: ? __fpu_restore_sig+0x193/0x570
> >> [ 720.913623] kernel: ? sysvec_apic_timer_interrupt+0xaf/0xd0
> >> [ 720.913676] kernel: ? __blkdev_direct_IO_simple+0x290/0x290
> >> [ 720.913728] kernel: generic_file_read_iter+0x9b/0x160
> >> [ 720.913775] kernel: new_sync_read+0x105/0x180
> >> [ 720.913820] kernel: vfs_read+0xf1/0x190
> >> [ 720.913858] kernel: ksys_read+0x4f/0xc0
> >> [ 720.913896] kernel: do_syscall_64+0x38/0x90
> >> [ 720.913936] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> >> [ 720.913985] kernel: RIP: 0033:0x7f73e676d772
> >> [ 720.914024] kernel: Code: c0 e9 b2 fe ff ff 50 48 8d 3d da 2e 0c 00 e8 b5 f9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> >> [ 720.914166] kernel: RSP: 002b:00007fff1b8fcbb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> >> [ 720.914231] kernel: RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007f73e676d772
> >> [ 720.917149] kernel: RDX: 0000000000040000 RSI: 00007f73e65d3000 RDI: 0000000000000004
> >> [ 720.920078] kernel: RBP: 0000000000001000 R08: 00000000015105c0 R09: 0000000000000080
> >> [ 720.922980] kernel: R10: 00007fff1b8fca00 R11: 0000000000000246 R12: 00000015105c0000
> >> [ 720.925875] kernel: R13: 0000000000000004 R14: 00007f73e65d3000 R15: 0000000000040000
> >> [ 720.928795] kernel: </TASK>
> >> [ 720.931704] kernel: Modules linked in: sctp ip6_udp_tunnel udp_tunnel rpcrdma rdma_cm iw_cm ib_cm ib_core sit tunnel4 ip_tunnel rfkill ipt_REJECT nf_reject_ipv4 iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT iptable_raw xt_multiport xt_set ip6t_REJECT nf_reject_ipv6 xt_LOG nf_log_syslog xt_limit xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat iTCO_wdt intel_pmc_bxt iTCO_vendor_support at24 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx intel_powerclamp raid1 coretemp snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_usb_audio i2c_i801 gma500_gfx i2c_smbus snd_usbmidi_lib joydev snd_hwdep snd_rawmidi snd_seq_device mc snd_pcm lpc_ich i2c_algo_bit snd_timer drm_kms_helper snd cec soundcore nfsd auth_rpcgss
> >> [ 720.931885] kernel: nfs_acl lockd grace drm fuse sunrpc zram ip_tables hid_logitech_hidpp serio_raw r8152 sata_sil24 video mii hid_jabra e1000e hid_logitech_dj
> >> [ 720.952122] kernel: CR2: ffffdbc681023bc8
> >> [ 720.955651] kernel: ---[ end trace de2c3d5b971ae71d ]---
> >> [ 720.959186] kernel: RIP: 0010:kfree+0x58/0x3e0
> >> [ 720.962723] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 48 8b 55 08 48 8d 42 ff
> >> [ 720.966472] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
> >> [ 720.970238] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
> >> [ 720.973993] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
> >> [ 720.977723] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
> >> [ 720.981464] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
> >> [ 720.985228] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
> >> [ 720.988995] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
> >> [ 720.992774] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 720.996535] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0

I tried a few things (injecting badblocks, etc.) but still could not
reproduce this issue.

Hi Dominik,

Could you please share more information about the array?

mdadm --detail /dev/mdXXX
mdadm --examine /dev/<devices>

Thanks,
Song

> >>
> >> There was a crash earlier which caused the array to become degraded, but I wasn't able to capture the backtrace due to journal corruption. After resyncing the array I tried running fsck on the filesystem and it's causing a kernel crash each time.
> >>
> >> I've reproduced the crash with the following kernels:
> >> 5.16.8-200.fc35.x86_64
> >> 5.15.18-200.fc35.x86_64
> >> 5.14.10-300.fc35.x86_64
> >>
> >> It doesn't happen with 5.13 kernels. I've been running fsck.ext4 -c on the array for the last 10 hours without a crash. It found 27 read errors which it corrected.
> >>
> >> Steps to reproduce:
> >> 1. fsck.ext4 -c -v /dev/md126
> >> 2. crash
> >>
> >> The array was degraded as a result of a previous crash, but it recovered itself with no errors reported. Obviously, there remained some read errors which badblocks found and which the kernel tripped over.
> >>
> >> Crash with 5.14.10 looks different:
> >> Feb 13 17:55:06 kernel: general protection fault, probably for non-canonical address 0xef6dfcf53c7d6fc3: 0000 [#1] SMP NOPTI
> >> Feb 13 17:55:06 kernel: CPU: 2 PID: 831 Comm: md126_raid5 Not tainted 5.14.10-300.fc35.x86_64 #1
> >> Feb 13 17:55:06 kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
> >> Feb 13 17:55:06 kernel: RIP: 0010:bio_endio+0x87/0x130
> >> Feb 13 17:55:06 kernel: Code: 02 00 48 8b 45 08 48 85 c0 74 09 0f b7 55 14 f6 c6 01 75 2f 48 81 7d 38 50 48 5e bb 75 38 0f b6 45 1a 48 8b 5d 40 84 c0 74 09 <80> 7b 1a 00 75 03 88 43 1a 48 89 ef 48 89 dd e8 d5 fe ff ff e9 6a
> >> Feb 13 17:55:06 kernel: RSP: 0000:ffffa5bb80597b48 EFLAGS: 00010206
> >> Feb 13 17:55:06 kernel: RAX: 000000000000000a RBX: ef6dfcf53c7d6fa9 RCX: ffff92b840865b50
> >> Feb 13 17:55:06 kernel: RDX: 0000000000000082 RSI: ffff92b8497f46c0 RDI: ffff92b84881f180
> >> Feb 13 17:55:06 kernel: RBP: ffff92b8497f46c0 R08: 0000000000000001 R09: 0000000000000000
> >> Feb 13 17:55:06 kernel: R10: ffff92b74a722c01 R11: 0000000000000001 R12: 0000000000000000
> >> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: ffff92b74a722cd0 R15: ffff92b844748860
> >> Feb 13 17:55:06 kernel: FS: 0000000000000000(0000) GS:ffff92b85bd00000(0000) knlGS:0000000000000000
> >> Feb 13 17:55:06 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> Feb 13 17:55:06 kernel: CR2: 0000557d787a2028 CR3: 000000003c194000 CR4: 00000000000006e0
> >> Feb 13 17:55:06 kernel: Call Trace:
> >> Feb 13 17:55:06 kernel: ops_complete_biofill+0x98/0xe0 [raid456]
> >> Feb 13 17:55:06 kernel: async_trigger_callback+0x7d/0xc1 [async_tx]
> >> Feb 13 17:55:06 kernel: raid_run_ops+0x98c/0x15b0 [raid456]
> >> Feb 13 17:55:06 kernel: ? handle_stripe_clean_event+0x400/0x400 [raid456]
> >> Feb 13 17:55:06 kernel: handle_stripe+0xe91/0x1dc0 [raid456]
> >> Feb 13 17:55:06 kernel: handle_active_stripes.constprop.0+0x390/0x560 [raid456]
> >> Feb 13 17:55:06 kernel: raid5d+0x39d/0x5d0 [raid456]
> >> Feb 13 17:55:06 kernel: ? prepare_to_wait_event+0x5d/0x170
> >> Feb 13 17:55:06 kernel: md_thread+0x98/0x140
> >> Feb 13 17:55:06 kernel: ? finish_wait+0x80/0x80
> >> Feb 13 17:55:06 kernel: ? md_write_inc+0x50/0x50
> >> Feb 13 17:55:06 kernel: kthread+0x124/0x150
> >> Feb 13 17:55:06 kernel: ? set_kthread_struct+0x40/0x40
> >> Feb 13 17:55:06 kernel: ret_from_fork+0x1f/0x30
> >> Feb 13 17:55:06 kernel: Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_set ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_limit xt_multiport xt_state xt_conntrack iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi iTCO_wdt snd_hda_intel intel_pmc_bxt snd_intel_dspcfg iTCO_vendor_support raid456 snd_intel_sdw_acpi snd_hda_codec async_raid6_recov at24 async_memcpy async_pq async_xor async_tx snd_hda_core gma500_gfx snd_hwdep i2c_algo_bit drm_kms_helper snd_pcm intel_powerclamp i2c_i801 snd_timer coretemp i2c_smbus snd cec soundcore lpc_ich nfsd auth_rpcgss nfs_acl lockd drm grace fuse sunrpc ip_tables raid1 serio_raw video sata_sil24 e1000e
> >> Feb 13 17:55:06 kernel: ---[ end trace 97447bed31e596e0 ]---
> >> Feb 13 17:55:06 kernel: RIP: 0010:bio_endio+0x87/0x130
> >> Feb 13 17:55:06 kernel: Code: 02 00 48 8b 45 08 48 85 c0 74 09 0f b7 55 14 f6 c6 01 75 2f 48 81 7d 38 50 48 5e bb 75 38 0f b6 45 1a 48 8b 5d 40 84 c0 74 09 <80> 7b 1a 00 75 03 88 43 1a 48 89 ef 48 89 dd e8 d5 fe ff ff e9 6a
> >> Feb 13 17:55:06 kernel: RSP: 0000:ffffa5bb80597b48 EFLAGS: 00010206
> >> Feb 13 17:55:06 kernel: RAX: 000000000000000a RBX: ef6dfcf53c7d6fa9 RCX: ffff92b840865b50
> >> Feb 13 17:55:06 kernel: RDX: 0000000000000082 RSI: ffff92b8497f46c0 RDI: ffff92b84881f180
> >> Feb 13 17:55:06 kernel: RBP: ffff92b8497f46c0 R08: 0000000000000001 R09: 0000000000000000
> >> Feb 13 17:55:06 kernel: R10: ffff92b74a722c01 R11: 0000000000000001 R12: 0000000000000000
> >> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: ffff92b74a722cd0 R15: ffff92b844748860
> >> Feb 13 17:55:06 kernel: FS: 0000000000000000(0000) GS:ffff92b85bd00000(0000) knlGS:0000000000000000
> >> Feb 13 17:55:06 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> Feb 13 17:55:06 kernel: CR2: 0000557d787a2028 CR3: 000000003c194000 CR4: 00000000000006e0
> >> Feb 13 17:55:06 kernel: ------------[ cut here ]------------
> >> Feb 13 17:55:06 kernel: WARNING: CPU: 2 PID: 831 at kernel/exit.c:739 do_exit+0x37/0xa90
> >> Feb 13 17:55:06 kernel: Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core rfkill ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_set ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_limit xt_multiport xt_state xt_conntrack iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi iTCO_wdt snd_hda_intel intel_pmc_bxt snd_intel_dspcfg iTCO_vendor_support raid456 snd_intel_sdw_acpi snd_hda_codec async_raid6_recov at24 async_memcpy async_pq async_xor async_tx snd_hda_core gma500_gfx snd_hwdep i2c_algo_bit drm_kms_helper snd_pcm intel_powerclamp i2c_i801 snd_timer coretemp i2c_smbus snd cec soundcore lpc_ich nfsd auth_rpcgss nfs_acl lockd drm grace fuse sunrpc ip_tables raid1 serio_raw video sata_sil24 e1000e
> >> Feb 13 17:55:06 kernel: CPU: 2 PID: 831 Comm: md126_raid5 Tainted: G D 5.14.10-300.fc35.x86_64 #1
> >> Feb 13 17:55:06 kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
> >> Feb 13 17:55:06 kernel: RIP: 0010:do_exit+0x37/0xa90
> >> Feb 13 17:55:06 kernel: Code: 55 48 89 fd 53 65 48 8b 1c 25 c0 7b 01 00 48 83 ec 28 48 8b 83 a8 0c 00 00 48 85 c0 74 0e 48 8b 10 48 39 d0 0f 84 56 04 00 00 <0f> 0b 65 8b 0d a0 2c f3 44 89 c8 25 00 ff ff 00 89 44 24 0c 0f 85
> >> Feb 13 17:55:06 kernel: RSP: 0000:ffffa5bb80597ef8 EFLAGS: 00010212
> >> Feb 13 17:55:06 kernel: RAX: ffffa5bb80597e50 RBX: ffff92b8437d8000 RCX: 0000000000000000
> >> Feb 13 17:55:06 kernel: RDX: ffff92b844217548 RSI: ffff92b85bd18a00 RDI: 000000000000000b
> >> Feb 13 17:55:06 kernel: RBP: 000000000000000b R08: 0000000000000000 R09: ffffa5bb805977a0
> >> Feb 13 17:55:06 kernel: R10: ffffa5bb80597798 R11: ffffffffbcf47468 R12: 000000000000000b
> >> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: ffff92b8437d8000 R15: 0000000000000000
> >> Feb 13 17:55:06 kernel: FS: 0000000000000000(0000) GS:ffff92b85bd00000(0000) knlGS:0000000000000000
> >> Feb 13 17:55:06 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> Feb 13 17:55:06 kernel: CR2: 0000557d787a2028 CR3: 000000003c194000 CR4: 00000000000006e0
> >> Feb 13 17:55:06 kernel: Call Trace:
> >> Feb 13 17:55:06 kernel: ? kthread+0x124/0x150
> >> Feb 13 17:55:06 kernel: rewind_stack_do_exit+0x17/0x20
> >> Feb 13 17:55:06 kernel: RIP: 0000:0x0
> >> Feb 13 17:55:06 kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
> >> Feb 13 17:55:06 kernel: RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
> >> Feb 13 17:55:06 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> >> Feb 13 17:55:06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> >> Feb 13 17:55:06 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> >> Feb 13 17:55:06 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> >> Feb 13 17:55:06 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> >> Feb 13 17:55:06 kernel: ---[ end trace 97447bed31e596e1 ]---
> >>
> >> Downstream bug report: https://bugzilla.redhat.com/show_bug.cgi?id=2053936 .
> >
> > Could somebody take a look into this? Or was this discussed somewhere
> > else already? Or even fixed?
> >
> > FWIW, I searched the list and wondered if these might be related, but I
> > was unable to tell for sure:
> >
> > https://lore.kernel.org/all/[email protected]/
> > https://lore.kernel.org/all/[email protected]/
> >
> > Anyway, to get this tracked:
> >
> > #regzbot introduced: v5.13..v5.14.10
> > #regzbot from: Dominik Mierzejewski <[email protected]>
> > #regzbot title: md: unable to handle page fault while running badblocks
> > (fsck.ext4 -c) on a raid5 md array
> > #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=215605
> >
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >
> > P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> > reports on my table. I can only look briefly into most of them and lack
> > knowledge about most of the areas they concern. I thus unfortunately
> > will sometimes get things wrong or miss something important. I hope
> > that's not the case here; if you think it is, don't hesitate to tell me
> > in a public reply, it's in everyone's interest to set the public record
> > straight.
> >

2022-03-02 22:25:38

by Dominik Mierzejewski

[permalink] [raw]

Subject: Re: Bug 215605 - [5.14 regression] BUG: unable to handle page fault while running badblocks (fsck.ext4 -c) on a raid5 md array

Hello!

On Tuesday, 01 March 2022 at 01:24, Song Liu wrote:
> On Mon, Feb 28, 2022 at 1:43 AM Thorsten Leemhuis
> <[email protected]> wrote:
> >
> > [CCing Jens]
> >
> > Hi, this is your Linux kernel regression tracker. Top-posting for once,
> > to make this easily accessible to everyone.
> >
> > What's up here? Below regression was reported two weeks ago and I
> > forwarded it nearly a week ago, nevertheless the reporter afaics didn't
> > get a single reply. Is the issue discussed somewhere else and I just
> > missed it? Is the report not accurate for some reason or missing
> > something important? Or did the report fall throug the cracks?
>
> Sorry for the late reply. I was on vacation last week.

No problem, thanks for responding.

[...]
> > On 22.02.22 09:59, Thorsten Leemhuis wrote:
> > > Hi, this is your Linux kernel regression tracker.
> > >
> > > I noticed a regression report in bugzilla.kernel.org that afaics nobody
> > > acted upon since it was reported about a week ago, that's why I decided
> > > to forward it to the lists and add a few relevant people to the CC. To
> > > quote from https://bugzilla.kernel.org/show_bug.cgi?id=215605
> > >
> > >> Dominik Mierzejewski 2022-02-14 10:36:36 UTC
> > >>
> > >> Created attachment 300450 [details]
> > >> kernel-5.16.8 dmesg with crash
> > >>
> > >> I'm experiencing kernel crash when running badblocks (fsck.ext4 -c) on a raid5 md array in my Intel Atom-based NAS box (Thecus N5550):
> > >> [ 720.911993] kernel: BUG: unable to handle page fault for address: ffffdbc681023bc8
> > >> [ 720.912073] kernel: #PF: supervisor read access in kernel mode
> > >> [ 720.912120] kernel: #PF: error_code(0x0000) - not-present page
> > >> [ 720.912166] kernel: PGD 11ffc6067 P4D 11ffc6067 PUD 0
> > >> [ 720.912213] kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
> > >> [ 720.912256] kernel: CPU: 1 PID: 1406 Comm: badblocks Not tainted 5.16.8-200.fc35.x86_64 #1
> > >> [ 720.912321] kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
> > >> [ 720.912400] kernel: RIP: 0010:kfree+0x58/0x3e0
> > >> [ 720.912449] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 4
> > >> 8 8b 55 08 48 8d 42 ff
> > >> [ 720.912598] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
> > >> [ 720.912648] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
> > >> [ 720.912708] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
> > >> [ 720.912767] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
> > >> [ 720.912826] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
> > >> [ 720.912885] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
> > >> [ 720.912945] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
> > >> [ 720.913012] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [ 720.913062] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
> > >> [ 720.913122] kernel: Call Trace:
> > >> [ 720.913150] kernel: <TASK>
> > >> [ 720.913179] kernel: raid5_make_request+0xb31/0xb90 [raid456]
> > >> [ 720.913247] kernel: ? do_wait_intr_irq+0xa0/0xa0
> > >> [ 720.913292] kernel: ? __blk_queue_split+0x30a/0x470
> > >> [ 720.913339] kernel: md_handle_request+0x119/0x180
> > >> [ 720.913386] kernel: md_submit_bio+0x67/0xa0
> > >> [ 720.913425] kernel: __submit_bio_fops+0x91/0x160
> > >> [ 720.913468] kernel: submit_bio_noacct+0xd7/0x2c0
> > >> [ 720.913510] kernel: __blkdev_direct_IO_simple+0x198/0x290
> > >> [ 720.913576] kernel: ? __fpu_restore_sig+0x193/0x570
> > >> [ 720.913623] kernel: ? sysvec_apic_timer_interrupt+0xaf/0xd0
> > >> [ 720.913676] kernel: ? __blkdev_direct_IO_simple+0x290/0x290
> > >> [ 720.913728] kernel: generic_file_read_iter+0x9b/0x160
> > >> [ 720.913775] kernel: new_sync_read+0x105/0x180
> > >> [ 720.913820] kernel: vfs_read+0xf1/0x190
> > >> [ 720.913858] kernel: ksys_read+0x4f/0xc0
> > >> [ 720.913896] kernel: do_syscall_64+0x38/0x90
> > >> [ 720.913936] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> > >> [ 720.913985] kernel: RIP: 0033:0x7f73e676d772
> > >> [ 720.914024] kernel: Code: c0 e9 b2 fe ff ff 50 48 8d 3d da 2e 0c 00 e8 b5 f9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> > >> [ 720.914166] kernel: RSP: 002b:00007fff1b8fcbb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> > >> [ 720.914231] kernel: RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007f73e676d772
> > >> [ 720.917149] kernel: RDX: 0000000000040000 RSI: 00007f73e65d3000 RDI: 0000000000000004
> > >> [ 720.920078] kernel: RBP: 0000000000001000 R08: 00000000015105c0 R09: 0000000000000080
> > >> [ 720.922980] kernel: R10: 00007fff1b8fca00 R11: 0000000000000246 R12: 00000015105c0000
> > >> [ 720.925875] kernel: R13: 0000000000000004 R14: 00007f73e65d3000 R15: 0000000000040000
> > >> [ 720.928795] kernel: </TASK>
> > >> [ 720.931704] kernel: Modules linked in: sctp ip6_udp_tunnel udp_tunnel rpcrdma rdma_cm iw_cm ib_cm ib_core sit tunnel4 ip_tunnel rfkill ipt_REJECT nf_reject_ipv4 iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT iptable_raw xt_multiport xt_set ip6t_REJECT nf_reject_ipv6 xt_LOG nf_log_syslog xt_limit xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat iTCO_wdt intel_pmc_bxt iTCO_vendor_support at24 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx intel_powerclamp raid1 coretemp snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_usb_audio i2c_i801 gma500_gfx i2c_smbus snd_usbmidi_lib joydev snd_hwdep snd_rawmidi snd_seq_device mc snd_pcm lpc_ich i2c_algo_bit snd_timer drm_kms_helper snd cec soundcore nfsd auth_rpcgss
> > >> [ 720.931885] kernel: nfs_acl lockd grace drm fuse sunrpc zram ip_tables hid_logitech_hidpp serio_raw r8152 sata_sil24 video mii hid_jabra e1000e hid_logitech_dj
> > >> [ 720.952122] kernel: CR2: ffffdbc681023bc8
> > >> [ 720.955651] kernel: ---[ end trace de2c3d5b971ae71d ]---
> > >> [ 720.959186] kernel: RIP: 0010:kfree+0x58/0x3e0
> > >> [ 720.962723] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 48 8b 55 08 48 8d 42 ff
> > >> [ 720.966472] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
> > >> [ 720.970238] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
> > >> [ 720.973993] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
> > >> [ 720.977723] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
> > >> [ 720.981464] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
> > >> [ 720.985228] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
> > >> [ 720.988995] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
> > >> [ 720.992774] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [ 720.996535] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
>
> I tried a few things (injecting badblocks, etc.) but still could not
> reproduce this issue.
>
> Hi Dominik,
>
> Could you please share more information about the array?

Please note that I repaired the array by stopping it and reassembling
without badblocks maps:
# mdadm --assemble -U force-no-bbl /dev/md126 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1

I ran fsck on it and repaired multiple-owned blocks. It's been running
smoothly since then, even under kernel 5.16.9. I'm not sure the
following information will be useful now.

> mdadm --detail /dev/mdXXX

# mdadm --detail /dev/md126
/dev/md126:
Version : 1.2
Creation Time : Sat Oct 19 22:39:14 2019
Raid Level : raid5
Array Size : 11720531968 (10.92 TiB 12.00 TB)
Used Dev Size : 2930132992 (2.73 TiB 3.00 TB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Wed Mar 2 22:29:09 2022
State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 512K

Consistency Policy : bitmap

Name : any:nas
UUID : 4fef0683:8f39a12d:eb69f742:b501ef53
Events : 71011

Number Major Minor RaidDevice State
6 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
7 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
5 8 81 4 active sync /dev/sdf1

> mdadm --examine /dev/<devices>

# mdadm --examine /dev/sdb1
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 4fef0683:8f39a12d:eb69f742:b501ef53
Name : any:nas
Creation Time : Sat Oct 19 22:39:14 2019
Raid Level : raid5
Raid Devices : 5

Avail Dev Size : 5860266895 sectors (2.73 TiB 3.00 TB)
Array Size : 11720531968 KiB (10.92 TiB 12.00 TB)
Used Dev Size : 5860265984 sectors (2.73 TiB 3.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=911 sectors
State : clean
Device UUID : 506215d3:7893b583:1b956e6a:532d7751

Internal Bitmap : 8 sectors from superblock
Update Time : Wed Mar 2 22:29:09 2022
Checksum : b8ce0df8 - correct
Events : 71011

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 0
Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)
# mdadm --examine /dev/sdc1
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 4fef0683:8f39a12d:eb69f742:b501ef53
Name : any:nas
Creation Time : Sat Oct 19 22:39:14 2019
Raid Level : raid5
Raid Devices : 5

Avail Dev Size : 5860266895 sectors (2.73 TiB 3.00 TB)
Array Size : 11720531968 KiB (10.92 TiB 12.00 TB)
Used Dev Size : 5860265984 sectors (2.73 TiB 3.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=911 sectors
State : clean
Device UUID : 9ab16a47:b23435f3:9231fd6b:e7d709d4

Internal Bitmap : 8 sectors from superblock
Update Time : Wed Mar 2 22:29:09 2022
Checksum : 20c43ebb - correct
Events : 71011

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 1
Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)
# mdadm --examine /dev/sdd1
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 4fef0683:8f39a12d:eb69f742:b501ef53
Name : any:nas
Creation Time : Sat Oct 19 22:39:14 2019
Raid Level : raid5
Raid Devices : 5

Avail Dev Size : 5860266895 sectors (2.73 TiB 3.00 TB)
Array Size : 11720531968 KiB (10.92 TiB 12.00 TB)
Used Dev Size : 5860265984 sectors (2.73 TiB 3.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=911 sectors
State : clean
Device UUID : 44265974:c83d899d:9bab17f4:59a0786e

Internal Bitmap : 8 sectors from superblock
Update Time : Wed Mar 2 22:29:09 2022
Checksum : 1a8ffefc - correct
Events : 71011

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 2
Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)
# mdadm --examine /dev/sde1
/dev/sde1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 4fef0683:8f39a12d:eb69f742:b501ef53
Name : any:nas
Creation Time : Sat Oct 19 22:39:14 2019
Raid Level : raid5
Raid Devices : 5

Avail Dev Size : 5860266895 sectors (2.73 TiB 3.00 TB)
Array Size : 11720531968 KiB (10.92 TiB 12.00 TB)
Used Dev Size : 5860265984 sectors (2.73 TiB 3.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=911 sectors
State : clean
Device UUID : 722420f5:3d9c05e0:71bad4ca:add63b08

Internal Bitmap : 8 sectors from superblock
Update Time : Wed Mar 2 22:29:09 2022
Checksum : 4e53a0c5 - correct
Events : 71011

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 3
Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)
# mdadm --examine /dev/sdf1
/dev/sdf1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 4fef0683:8f39a12d:eb69f742:b501ef53
Name : any:nas
Creation Time : Sat Oct 19 22:39:14 2019
Raid Level : raid5
Raid Devices : 5

Avail Dev Size : 5860266895 sectors (2.73 TiB 3.00 TB)
Array Size : 11720531968 KiB (10.92 TiB 12.00 TB)
Used Dev Size : 5860265984 sectors (2.73 TiB 3.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=911 sectors
State : clean
Device UUID : 6c2badc9:f7803c8b:9ee398c0:b4e7497d

Internal Bitmap : 8 sectors from superblock
Update Time : Wed Mar 2 22:29:09 2022
Checksum : 38e9c6af - correct
Events : 71011

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 4
Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)

Regards,
Dominik

--
Fedora https://getfedora.org | RPM Fusion http://rpmfusion.org
There should be a science of discontent. People need hard times and
oppression to develop psychic muscles.
-- from "Collected Sayings of Muad'Dib" by the Princess Irulan

2022-03-04 08:31:19

by Song Liu

[permalink] [raw]

Subject: Re: Bug 215605 - [5.14 regression] BUG: unable to handle page fault while running badblocks (fsck.ext4 -c) on a raid5 md array

On Wed, Mar 2, 2022 at 1:54 PM Dominik Mierzejewski
<[email protected]> wrote:
>
> Hello!
>
> On Tuesday, 01 March 2022 at 01:24, Song Liu wrote:
> > On Mon, Feb 28, 2022 at 1:43 AM Thorsten Leemhuis
> > <[email protected]> wrote:
> > >
> > > [CCing Jens]
> > >
> > > Hi, this is your Linux kernel regression tracker. Top-posting for once,
> > > to make this easily accessible to everyone.
> > >
> > > What's up here? Below regression was reported two weeks ago and I
> > > forwarded it nearly a week ago, nevertheless the reporter afaics didn't
> > > get a single reply. Is the issue discussed somewhere else and I just
> > > missed it? Is the report not accurate for some reason or missing
> > > something important? Or did the report fall throug the cracks?
> >
> > Sorry for the late reply. I was on vacation last week.
>
> No problem, thanks for responding.
>
> [...]
> > > On 22.02.22 09:59, Thorsten Leemhuis wrote:
> > > > Hi, this is your Linux kernel regression tracker.
> > > >
> > > > I noticed a regression report in bugzilla.kernel.org that afaics nobody
> > > > acted upon since it was reported about a week ago, that's why I decided
> > > > to forward it to the lists and add a few relevant people to the CC. To
> > > > quote from https://bugzilla.kernel.org/show_bug.cgi?id=215605
> > > >
> > > >> Dominik Mierzejewski 2022-02-14 10:36:36 UTC
> > > >>
> > > >> Created attachment 300450 [details]
> > > >> kernel-5.16.8 dmesg with crash
> > > >>
> > > >> I'm experiencing kernel crash when running badblocks (fsck.ext4 -c) on a raid5 md array in my Intel Atom-based NAS box (Thecus N5550):
> > > >> [ 720.911993] kernel: BUG: unable to handle page fault for address: ffffdbc681023bc8
> > > >> [ 720.912073] kernel: #PF: supervisor read access in kernel mode
> > > >> [ 720.912120] kernel: #PF: error_code(0x0000) - not-present page
> > > >> [ 720.912166] kernel: PGD 11ffc6067 P4D 11ffc6067 PUD 0
> > > >> [ 720.912213] kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > >> [ 720.912256] kernel: CPU: 1 PID: 1406 Comm: badblocks Not tainted 5.16.8-200.fc35.x86_64 #1
> > > >> [ 720.912321] kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
> > > >> [ 720.912400] kernel: RIP: 0010:kfree+0x58/0x3e0
> > > >> [ 720.912449] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 4
> > > >> 8 8b 55 08 48 8d 42 ff
> > > >> [ 720.912598] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
> > > >> [ 720.912648] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
> > > >> [ 720.912708] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
> > > >> [ 720.912767] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
> > > >> [ 720.912826] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
> > > >> [ 720.912885] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
> > > >> [ 720.912945] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
> > > >> [ 720.913012] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> [ 720.913062] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
> > > >> [ 720.913122] kernel: Call Trace:
> > > >> [ 720.913150] kernel: <TASK>
> > > >> [ 720.913179] kernel: raid5_make_request+0xb31/0xb90 [raid456]
> > > >> [ 720.913247] kernel: ? do_wait_intr_irq+0xa0/0xa0
> > > >> [ 720.913292] kernel: ? __blk_queue_split+0x30a/0x470
> > > >> [ 720.913339] kernel: md_handle_request+0x119/0x180
> > > >> [ 720.913386] kernel: md_submit_bio+0x67/0xa0
> > > >> [ 720.913425] kernel: __submit_bio_fops+0x91/0x160
> > > >> [ 720.913468] kernel: submit_bio_noacct+0xd7/0x2c0
> > > >> [ 720.913510] kernel: __blkdev_direct_IO_simple+0x198/0x290
> > > >> [ 720.913576] kernel: ? __fpu_restore_sig+0x193/0x570
> > > >> [ 720.913623] kernel: ? sysvec_apic_timer_interrupt+0xaf/0xd0
> > > >> [ 720.913676] kernel: ? __blkdev_direct_IO_simple+0x290/0x290
> > > >> [ 720.913728] kernel: generic_file_read_iter+0x9b/0x160
> > > >> [ 720.913775] kernel: new_sync_read+0x105/0x180
> > > >> [ 720.913820] kernel: vfs_read+0xf1/0x190
> > > >> [ 720.913858] kernel: ksys_read+0x4f/0xc0
> > > >> [ 720.913896] kernel: do_syscall_64+0x38/0x90
> > > >> [ 720.913936] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > >> [ 720.913985] kernel: RIP: 0033:0x7f73e676d772
> > > >> [ 720.914024] kernel: Code: c0 e9 b2 fe ff ff 50 48 8d 3d da 2e 0c 00 e8 b5 f9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> > > >> [ 720.914166] kernel: RSP: 002b:00007fff1b8fcbb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> > > >> [ 720.914231] kernel: RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007f73e676d772
> > > >> [ 720.917149] kernel: RDX: 0000000000040000 RSI: 00007f73e65d3000 RDI: 0000000000000004
> > > >> [ 720.920078] kernel: RBP: 0000000000001000 R08: 00000000015105c0 R09: 0000000000000080
> > > >> [ 720.922980] kernel: R10: 00007fff1b8fca00 R11: 0000000000000246 R12: 00000015105c0000
> > > >> [ 720.925875] kernel: R13: 0000000000000004 R14: 00007f73e65d3000 R15: 0000000000040000
> > > >> [ 720.928795] kernel: </TASK>
> > > >> [ 720.931704] kernel: Modules linked in: sctp ip6_udp_tunnel udp_tunnel rpcrdma rdma_cm iw_cm ib_cm ib_core sit tunnel4 ip_tunnel rfkill ipt_REJECT nf_reject_ipv4 iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT iptable_raw xt_multiport xt_set ip6t_REJECT nf_reject_ipv6 xt_LOG nf_log_syslog xt_limit xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat iTCO_wdt intel_pmc_bxt iTCO_vendor_support at24 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx intel_powerclamp raid1 coretemp snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_usb_audio i2c_i801 gma500_gfx i2c_smbus snd_usbmidi_lib joydev snd_hwdep snd_rawmidi snd_seq_device mc snd_pcm lpc_ich i2c_algo_bit snd_timer drm_kms_helper snd cec soundcore nfsd auth_rpcgss
> > > >> [ 720.931885] kernel: nfs_acl lockd grace drm fuse sunrpc zram ip_tables hid_logitech_hidpp serio_raw r8152 sata_sil24 video mii hid_jabra e1000e hid_logitech_dj
> > > >> [ 720.952122] kernel: CR2: ffffdbc681023bc8
> > > >> [ 720.955651] kernel: ---[ end trace de2c3d5b971ae71d ]---
> > > >> [ 720.959186] kernel: RIP: 0010:kfree+0x58/0x3e0
> > > >> [ 720.962723] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 48 8b 55 08 48 8d 42 ff
> > > >> [ 720.966472] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
> > > >> [ 720.970238] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
> > > >> [ 720.973993] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
> > > >> [ 720.977723] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
> > > >> [ 720.981464] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
> > > >> [ 720.985228] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
> > > >> [ 720.988995] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
> > > >> [ 720.992774] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> [ 720.996535] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
> >
> > I tried a few things (injecting badblocks, etc.) but still could not
> > reproduce this issue.
> >
> > Hi Dominik,
> >
> > Could you please share more information about the array?
>
> Please note that I repaired the array by stopping it and reassembling
> without badblocks maps:
> # mdadm --assemble -U force-no-bbl /dev/md126 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
>
> I ran fsck on it and repaired multiple-owned blocks. It's been running
> smoothly since then, even under kernel 5.16.9. I'm not sure the
> following information will be useful now.

Thanks for these information. I will try to repro with some similar setup.

Song

[..]

2022-03-17 20:43:13

by Thorsten Leemhuis

[permalink] [raw]

Subject: Re: Bug 215605 - [5.14 regression] BUG: unable to handle page fault while running badblocks (fsck.ext4 -c) on a raid5 md array

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

Song, two weeks ago you said you would try to reproduce this. Any
success? Or was the discussion moved somewhere else and I just missed it?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

#regzbot poke

On 04.03.22 07:26, Song Liu wrote:
> On Wed, Mar 2, 2022 at 1:54 PM Dominik Mierzejewski
> <[email protected]> wrote:
>>
>> Hello!
>>
>> On Tuesday, 01 March 2022 at 01:24, Song Liu wrote:
>>> On Mon, Feb 28, 2022 at 1:43 AM Thorsten Leemhuis
>>> <[email protected]> wrote:
>>>>
>>>> [CCing Jens]
>>>>
>>>> Hi, this is your Linux kernel regression tracker. Top-posting for once,
>>>> to make this easily accessible to everyone.
>>>>
>>>> What's up here? Below regression was reported two weeks ago and I
>>>> forwarded it nearly a week ago, nevertheless the reporter afaics didn't
>>>> get a single reply. Is the issue discussed somewhere else and I just
>>>> missed it? Is the report not accurate for some reason or missing
>>>> something important? Or did the report fall throug the cracks?
>>>
>>> Sorry for the late reply. I was on vacation last week.
>>
>> No problem, thanks for responding.
>>
>> [...]
>>>> On 22.02.22 09:59, Thorsten Leemhuis wrote:
>>>>> Hi, this is your Linux kernel regression tracker.
>>>>>
>>>>> I noticed a regression report in bugzilla.kernel.org that afaics nobody
>>>>> acted upon since it was reported about a week ago, that's why I decided
>>>>> to forward it to the lists and add a few relevant people to the CC. To
>>>>> quote from https://bugzilla.kernel.org/show_bug.cgi?id=215605
>>>>>
>>>>>> Dominik Mierzejewski 2022-02-14 10:36:36 UTC
>>>>>>
>>>>>> Created attachment 300450 [details]
>>>>>> kernel-5.16.8 dmesg with crash
>>>>>>
>>>>>> I'm experiencing kernel crash when running badblocks (fsck.ext4 -c) on a raid5 md array in my Intel Atom-based NAS box (Thecus N5550):
>>>>>> [ 720.911993] kernel: BUG: unable to handle page fault for address: ffffdbc681023bc8
>>>>>> [ 720.912073] kernel: #PF: supervisor read access in kernel mode
>>>>>> [ 720.912120] kernel: #PF: error_code(0x0000) - not-present page
>>>>>> [ 720.912166] kernel: PGD 11ffc6067 P4D 11ffc6067 PUD 0
>>>>>> [ 720.912213] kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
>>>>>> [ 720.912256] kernel: CPU: 1 PID: 1406 Comm: badblocks Not tainted 5.16.8-200.fc35.x86_64 #1
>>>>>> [ 720.912321] kernel: Hardware name: Intel Corporation Milstead Platform/Granite Well, BIOS CDV W Series 05 08/27/2015
>>>>>> [ 720.912400] kernel: RIP: 0010:kfree+0x58/0x3e0
>>>>>> [ 720.912449] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 4
>>>>>> 8 8b 55 08 48 8d 42 ff
>>>>>> [ 720.912598] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
>>>>>> [ 720.912648] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
>>>>>> [ 720.912708] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
>>>>>> [ 720.912767] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
>>>>>> [ 720.912826] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
>>>>>> [ 720.912885] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
>>>>>> [ 720.912945] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
>>>>>> [ 720.913012] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [ 720.913062] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
>>>>>> [ 720.913122] kernel: Call Trace:
>>>>>> [ 720.913150] kernel: <TASK>
>>>>>> [ 720.913179] kernel: raid5_make_request+0xb31/0xb90 [raid456]
>>>>>> [ 720.913247] kernel: ? do_wait_intr_irq+0xa0/0xa0
>>>>>> [ 720.913292] kernel: ? __blk_queue_split+0x30a/0x470
>>>>>> [ 720.913339] kernel: md_handle_request+0x119/0x180
>>>>>> [ 720.913386] kernel: md_submit_bio+0x67/0xa0
>>>>>> [ 720.913425] kernel: __submit_bio_fops+0x91/0x160
>>>>>> [ 720.913468] kernel: submit_bio_noacct+0xd7/0x2c0
>>>>>> [ 720.913510] kernel: __blkdev_direct_IO_simple+0x198/0x290
>>>>>> [ 720.913576] kernel: ? __fpu_restore_sig+0x193/0x570
>>>>>> [ 720.913623] kernel: ? sysvec_apic_timer_interrupt+0xaf/0xd0
>>>>>> [ 720.913676] kernel: ? __blkdev_direct_IO_simple+0x290/0x290
>>>>>> [ 720.913728] kernel: generic_file_read_iter+0x9b/0x160
>>>>>> [ 720.913775] kernel: new_sync_read+0x105/0x180
>>>>>> [ 720.913820] kernel: vfs_read+0xf1/0x190
>>>>>> [ 720.913858] kernel: ksys_read+0x4f/0xc0
>>>>>> [ 720.913896] kernel: do_syscall_64+0x38/0x90
>>>>>> [ 720.913936] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> [ 720.913985] kernel: RIP: 0033:0x7f73e676d772
>>>>>> [ 720.914024] kernel: Code: c0 e9 b2 fe ff ff 50 48 8d 3d da 2e 0c 00 e8 b5 f9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
>>>>>> [ 720.914166] kernel: RSP: 002b:00007fff1b8fcbb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>>>>>> [ 720.914231] kernel: RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007f73e676d772
>>>>>> [ 720.917149] kernel: RDX: 0000000000040000 RSI: 00007f73e65d3000 RDI: 0000000000000004
>>>>>> [ 720.920078] kernel: RBP: 0000000000001000 R08: 00000000015105c0 R09: 0000000000000080
>>>>>> [ 720.922980] kernel: R10: 00007fff1b8fca00 R11: 0000000000000246 R12: 00000015105c0000
>>>>>> [ 720.925875] kernel: R13: 0000000000000004 R14: 00007f73e65d3000 R15: 0000000000040000
>>>>>> [ 720.928795] kernel: </TASK>
>>>>>> [ 720.931704] kernel: Modules linked in: sctp ip6_udp_tunnel udp_tunnel rpcrdma rdma_cm iw_cm ib_cm ib_core sit tunnel4 ip_tunnel rfkill ipt_REJECT nf_reject_ipv4 iptable_filter xt_nat iptable_nat nf_nat iptable_mangle nf_conntrack_pptp xt_CT iptable_raw xt_multiport xt_set ip6t_REJECT nf_reject_ipv6 xt_LOG nf_log_syslog xt_limit xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables ip_set_hash_netport ip_set_hash_net ip_set drivetemp it87 nfnetlink hwmon_vid vfat fat iTCO_wdt intel_pmc_bxt iTCO_vendor_support at24 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx intel_powerclamp raid1 coretemp snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_usb_audio i2c_i801 gma500_gfx i2c_smbus snd_usbmidi_lib joydev snd_hwdep snd_rawmidi snd_seq_device mc snd_pcm lpc_ich i2c_algo_bit snd_timer drm_kms_helper snd cec soundcore
nfsd auth_rpcgss
>>>>>> [ 720.931885] kernel: nfs_acl lockd grace drm fuse sunrpc zram ip_tables hid_logitech_hidpp serio_raw r8152 sata_sil24 video mii hid_jabra e1000e hid_logitech_dj
>>>>>> [ 720.952122] kernel: CR2: ffffdbc681023bc8
>>>>>> [ 720.955651] kernel: ---[ end trace de2c3d5b971ae71d ]---
>>>>>> [ 720.959186] kernel: RIP: 0010:kfree+0x58/0x3e0
>>>>>> [ 720.962723] kernel: Code: 80 4c 01 e5 0f 82 84 03 00 00 48 c7 c0 00 00 00 80 48 2b 05 4a 96 3b 01 48 01 c5 48 c1 ed 0c 48 c1 e5 06 48 03 2d 28 96 3b 01 <48> 8b 45 08 48 8d 50 ff a8 01 48 0f 45 ea 48 8b 55 08 48 8d 42 ff
>>>>>> [ 720.966472] kernel: RSP: 0018:ffff9db4008efaf8 EFLAGS: 00010286
>>>>>> [ 720.970238] kernel: RAX: 00006d7bc0000000 RBX: ffff9284c5214800 RCX: ffff9284c3758ff8
>>>>>> [ 720.973993] kernel: RDX: ffff9283c1102740 RSI: ffffffffc07af091 RDI: ffff9db4008efd58
>>>>>> [ 720.977723] kernel: RBP: ffffdbc681023bc0 R08: ffff9db4008efb88 R09: ffff9284c3759000
>>>>>> [ 720.981464] kernel: R10: 0000000000000028 R11: ffff9284c213db48 R12: ffff9db4008efd58
>>>>>> [ 720.985228] kernel: R13: ffff9284c213da00 R14: ffff9284c375f000 R15: ffff9db4008efd58
>>>>>> [ 720.988995] kernel: FS: 00007f73e6669740(0000) GS:ffff9284dbc80000(0000) knlGS:0000000000000000
>>>>>> [ 720.992774] kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [ 720.996535] kernel: CR2: ffffdbc681023bc8 CR3: 000000005c2cc000 CR4: 00000000000006e0
>>>
>>> I tried a few things (injecting badblocks, etc.) but still could not
>>> reproduce this issue.
>>>
>>> Hi Dominik,
>>>
>>> Could you please share more information about the array?
>>
>> Please note that I repaired the array by stopping it and reassembling
>> without badblocks maps:
>> # mdadm --assemble -U force-no-bbl /dev/md126 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
>>
>> I ran fsck on it and repaired multiple-owned blocks. It's been running
>> smoothly since then, even under kernel 5.16.9. I'm not sure the
>> following information will be useful now.
>
> Thanks for these information. I will try to repro with some similar setup.
>
> Song
>
> [..]
>
>

2022-03-17 23:27:34

by Song Liu

[permalink] [raw]

Subject: Re: Bug 215605 - [5.14 regression] BUG: unable to handle page fault while running badblocks (fsck.ext4 -c) on a raid5 md array

On Thu, Mar 17, 2022 at 6:14 AM Thorsten Leemhuis
<[email protected]> wrote:
>
> Hi, this is your Linux kernel regression tracker. Top-posting for once,
> to make this easily accessible to everyone.
>
> Song, two weeks ago you said you would try to reproduce this. Any
> success? Or was the discussion moved somewhere else and I just missed it?
>

I haven't got luck to reproduce the issue yet. I will keep trying and
update when
there is progress.

Thanks,
Song

> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them and lack
> knowledge about most of the areas they concern. I thus unfortunately
> will sometimes get things wrong or miss something important. I hope
> that's not the case here; if you think it is, don't hesitate to tell me
> in a public reply, it's in everyone's interest to set the public record
> straight.
>