LinuxLists.cc - Re: Kernel OOPS while creating a NVMe Namespace

2024-06-10 09:58:40

Subject: Re: Kernel OOPS while creating a NVMe Namespace

On 10/06/2024 10:51, Venkat Rao Bagalkote wrote:
> Greetings!!!
>
> Observing Kernel OOPS, while creating namespace on a NVMe device.
>
> [ 140.209777] BUG: Unable to handle kernel data access at
> 0x18d7003065646fee
> [ 140.209792] Faulting instruction address: 0xc00000000023b45c
> [ 140.209798] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 140.209802] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
> [ 140.209809] Modules linked in: rpadlpar_io rpaphp xsk_diag
> nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
> bonding nf_conntrack tls nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set
> nf_tables nfnetlink vmx_crypto pseries_rng binfmt_misc fuse xfs
> libcrc32c sd_mod sg ibmvscsi scsi_transport_srp ibmveth nvme nvme_core
> t10_pi crc64_rocksoft_generic crc64_rocksoft crc64
> [ 140.209864] CPU: 2 PID: 129 Comm: kworker/u65:3 Kdump: loaded Not
> tainted 6.10.0-rc3 #2
> [ 140.209870] Hardware name: IBM,9009-42A POWER9 (raw) 0x4e0202
> 0xf000005 of:IBM,FW950.A0 (VL950_141) hv:phyp pSeries
> [ 140.209876] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [ 140.209889] NIP: c00000000023b45c LR: c008000006a96b20 CTR:
> c00000000023b42c
> [ 140.209894] REGS: c0000000506078a0 TRAP: 0380 Not tainted
> (6.10.0-rc3)
> [ 140.209899] MSR: 800000000280b033
> <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24000244 XER: 00000000
> [ 140.209915] CFAR: c008000006aa80ac IRQMASK: 0
> [ 140.209915] GPR00: c008000006a96b20 c000000050607b40
> c000000001573700 c000000004291ee0
> [ 140.209915] GPR04: 0000000000000000 c000000006150080
> 00000000c0080005 fffffffffffe0000
> [ 140.209915] GPR08: 0000000000000000 18d7003065646f6e
> 0000000000000000 c008000006aa8098
> [ 140.209915] GPR12: c00000000023b42c c00000000f7cdf00
> c0000000001a151c c000000004f2be80
> [ 140.209915] GPR16: 0000000000000000 0000000000000000
> 0000000000000000 0000000000000000
> [ 140.209915] GPR20: c000000004dbcc00 0000000000000006
> 0000000000000002 c000000004911270
> [ 140.209915] GPR24: 0000000000000000 0000000000000000
> c0000000ee254ffc c0000000049111f0
> [ 140.209915] GPR28: 0000000000000000 c000000004911260
> c000000004291ee0 c000000004911260
> [ 140.209975] NIP [c00000000023b45c] synchronize_srcu+0x30/0x1c0
> [ 140.209984] LR [c008000006a96b20] nvme_ns_remove+0x80/0x2d8
> [nvme_core]
> [ 140.209994] Call Trace:
> [ 140.209997] [c000000050607b90] [c008000006a96b20]
> nvme_ns_remove+0x80/0x2d8 [nvme_core]
> [ 140.210008] [c000000050607bd0] [c008000006a972b4]
> nvme_remove_invalid_namespaces+0x144/0x1ac [nvme_core]
> [ 140.210020] [c000000050607c60] [c008000006a9dbd4]
> nvme_scan_ns_list+0x19c/0x370 [nvme_core]
> [ 140.210032] [c000000050607d70] [c008000006a9dfc8]
> nvme_scan_work+0xc8/0x278 [nvme_core]
> [ 140.210043] [c000000050607e40] [c00000000019414c]
> process_one_work+0x20c/0x4f4
> [ 140.210051] [c000000050607ef0] [c0000000001950cc]
> worker_thread+0x378/0x544
> [ 140.210058] [c000000050607f90] [c0000000001a164c] kthread+0x138/0x140
> [ 140.210065] [c000000050607fe0] [c00000000000df98]
> start_kernel_thread+0x14/0x18
> [ 140.210072] Code: 3c4c0134 384282d4 7c0802a6 60000000 7c0802a6
> fbc1fff0 fba1ffe8 fbe1fff8 7c7e1b78 f8010010 f821ffb1 e9230010
> <e9290080> 7c2004ac 71290003 41820008
> [ 140.210093] ---[ end trace 0000000000000000 ]---
>
>
> Issue is introduced by the patch:
> be647e2c76b27f409cdd520f66c95be888b553a3.

Exactly this was the concern when introducing a behavior change in a
sensitive area of the code
to silence lockdep...

I'm assuming that the bad dereference is:
synchronize_srcu(&ns->ctrl->srcu);

btw, looking at the code again, I'm assuming that synchronizing srcu in
every ns remove
slows down batch removal of many namespaces greatly...

>
>
> Reverting it, issue is not seen.
>
>
> Regards,
>
> Venkat.
>
>

2024-06-10 15:31:55

by Keith Busch

[permalink] [raw]

Subject: Re: Kernel OOPS while creating a NVMe Namespace

On Mon, Jun 10, 2024 at 12:57:02PM +0300, Sagi Grimberg wrote:
> On 10/06/2024 10:51, Venkat Rao Bagalkote wrote:
> >
> > [? 140.209777] BUG: Unable to handle kernel data access at 0x18d7003065646fee
> > [? 140.209792] Faulting instruction address: 0xc00000000023b45c
> > [? 140.209798] Oops: Kernel access of bad area, sig: 11 [#1]
> > [? 140.209802] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
> > [? 140.209864] CPU: 2 PID: 129 Comm: kworker/u65:3 Kdump: loaded Not tainted 6.10.0-rc3 #2
> > [? 140.209870] Hardware name: IBM,9009-42A POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.A0 (VL950_141) hv:phyp pSeries
> > [? 140.209876] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> > [? 140.209889] NIP:? c00000000023b45c LR: c008000006a96b20 CTR: c00000000023b42c
> > [? 140.209894] REGS: c0000000506078a0 TRAP: 0380?? Not tainted (6.10.0-rc3)
> > [? 140.209899] MSR:? 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>? CR: 24000244? XER: 00000000
> > [? 140.209975] NIP [c00000000023b45c] synchronize_srcu+0x30/0x1c0
> > [? 140.209984] LR [c008000006a96b20] nvme_ns_remove+0x80/0x2d8 [nvme_core]
> > [? 140.209994] Call Trace:
> > [? 140.209997] [c000000050607b90] [c008000006a96b20] nvme_ns_remove+0x80/0x2d8 [nvme_core]
> > [? 140.210008] [c000000050607bd0] [c008000006a972b4] nvme_remove_invalid_namespaces+0x144/0x1ac [nvme_core]
> > [? 140.210020] [c000000050607c60] [c008000006a9dbd4] nvme_scan_ns_list+0x19c/0x370 [nvme_core]
> > [? 140.210032] [c000000050607d70] [c008000006a9dfc8] nvme_scan_work+0xc8/0x278 [nvme_core]
> > [? 140.210043] [c000000050607e40] [c00000000019414c] process_one_work+0x20c/0x4f4
> > [? 140.210051] [c000000050607ef0] [c0000000001950cc] worker_thread+0x378/0x544
> > [? 140.210058] [c000000050607f90] [c0000000001a164c] kthread+0x138/0x140
> > [? 140.210065] [c000000050607fe0] [c00000000000df98] start_kernel_thread+0x14/0x18
> > [? 140.210072] Code: 3c4c0134 384282d4 7c0802a6 60000000 7c0802a6 fbc1fff0 fba1ffe8 fbe1fff8 7c7e1b78 f8010010 f821ffb1 e9230010 e9290080> 7c2004ac 71290003 41820008
> > [? 140.210093] ---[ end trace 0000000000000000 ]---
> >
> > Issue is introduced by the patch:
> > be647e2c76b27f409cdd520f66c95be888b553a3.
>
> Exactly this was the concern when introducing a behavior change in a
> sensitive area of the code
> to silence lockdep...

No risk, no reward. :)

If we really can't figure this out, we can always revert and revisit for
the next merge.

> I'm assuming that the bad dereference is:
> ??? ??? synchronize_srcu(&ns->ctrl->srcu);

That would have to be it based on the report. Not sure what the problem
could be, though, the ns->ctrl must have been valid or we would have
failed earlier, and the srcu struct can't be released while the
controller is still in use by any namespace.

Anyway, I tested this path quite a bit, but I'll revisit with dynamic
attachments instead and see if that helps.

> btw, looking at the code again, I'm assuming that synchronizing srcu in
> every ns remove
> slows down batch removal of many namespaces greatly...

I may need to test this out, but I thought srcu sync was pretty quick if
there were no active readers, which there shouldn't be here outside
unusual cases.