LinuxLists.cc - Kernel Panic in skb_release

2021-05-24 13:05:21

Subject: Kernel Panic in skb_release_data using genet

Hi Doug, Florian,

I've been running a RaspberryPi4 with a mainline kernel for a while,
booting from NFS. Every once in a while (I'd say ~20-30% of all boots),
I'm getting a kernel panic around the time init is started.

I was debugging a kernel based on drm-misc-next-2021-05-17 today with
KASAN enabled and got this, which looks related:

[ 6.109454] mmc0: SDHCI controller on fe300000.sdhci [fe300000.sdhci] using PIO
[ 6.124819] bcmgenet fd580000.ethernet: configuring instance for external RGMII (RX delay)
[ 6.133391] ==================================================================
[ 6.140736] BUG: KASAN: user-memory-access in skb_release_data+0x14c/0x1fc
[ 6.147748] Read of size 4 at addr 1c8befdc by task swapper/0/0
[ 6.153776]
[ 6.155300] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-rc1-v7l #165
[ 6.162214] Hardware name: BCM2711
[ 6.165679] Backtrace:
[ 6.168183] [<c110f5a8>] (dump_backtrace) from [<c110f930>] (show_stack+0x20/0x24)
[ 6.175931] r7:c1e00000 r6:00000193 r5:00000000 r4:c837f8e0
[ 6.181683] [<c110f910>] (show_stack) from [<c11156c0>] (dump_stack+0xb8/0xdc)
[ 6.189051] [<c1115608>] (dump_stack) from [<c0514b30>] (kasan_report+0x11c/0x1c0)
[ 6.196789] r9:cc97ff02 r8:cc57e400 r7:c0ea3628 r6:00000000 r5:00000000 r4:1c8befdc
[ 6.204655] [<c0514a14>] (kasan_report) from [<c05154d4>] (__asan_load4+0x74/0x90)
[ 6.212393] r7:cc97ff00 r6:00000000 r5:cc97ff28 r4:1c8befd4
[ 6.218144] [<c0515460>] (__asan_load4) from [<c0ea3628>] (skb_release_data+0x14c/0x1fc)
[ 6.226395] [<c0ea34dc>] (skb_release_data) from [<c0ea9d2c>] (consume_skb+0x60/0x134)
[ 6.234479] r10:0000a8d8 r9:cc560000 r8:00000000 r7:cc560580 r6:00000001 r5:cc57e4ac
[ 6.242438] r4:cc57e400 r3:cc97f680
[ 6.246074] [<c0ea9ccc>] (consume_skb) from [<c0ec0d74>] (__dev_kfree_skb_any+0x60/0x64)
[ 6.254337] r9:cc560000 r8:00000000 r7:cc560580 r6:00000001 r5:cc57e400 r4:c1e00000
[ 6.262203] [<c0ec0d14>] (__dev_kfree_skb_any) from [<c0c814d4>] (bcmgenet_rx_poll+0x578/0x770)
[ 6.271081] r7:cc560580 r6:a8d81759 r5:cc57e400 r4:cc563ed8
[ 6.276831] [<c0c80f5c>] (bcmgenet_rx_poll) from [<c0ed3f0c>] (__napi_poll+0x60/0x2b8)
[ 6.284925] r10:c1e03d20 r9:c1e05d00 r8:cc563ee0 r7:c1e03d10 r6:00000040 r5:00000001
[ 6.292881] r4:cc563ed8
[ 6.295460] [<c0ed3eac>] (__napi_poll) from [<c0ed4a14>] (net_rx_action+0x580/0x620)
[ 6.303377] r10:c1e03d20 r9:c1e05d00 r8:0000012c r7:cc563edc r6:cc560000 r5:cc563ed8
[ 6.311333] r4:c1e03d80
[ 6.313911] [<c0ed4494>] (net_rx_action) from [<c02012e8>] (__do_softirq+0x1f0/0x69c)
[ 6.321916] r10:c1e00000 r9:00000008 r8:16b2f000 r7:00000003 r6:00000004 r5:c18b9360
[ 6.329872] r4:c1e0508c
[ 6.332449] [<c02010f8>] (__do_softirq) from [<c02367a4>] (irq_exit+0x188/0x1b0)
[ 6.340012] r10:16b2f000 r9:c1e03ec0 r8:16b2f000 r7:c1e03e28 r6:ffffc000 r5:c1cc0940
[ 6.347969] r4:c1e06ea4
[ 6.350546] [<c023661c>] (irq_exit) from [<c02c75fc>] (__handle_domain_irq+0xc4/0x128)
[ 6.353302] bcmgenet fd580000.ethernet eth0: Link is Down
[ 6.358635] r9:c1e03ec0 r8:00000001 r7:00000000 r6:c1e00000 r5:00000000 r4:c1cbfe80
[ 6.371956] [<c02c7538>] (__handle_domain_irq) from [<c09ef2b4>] (gic_handle_irq+0x9c/0xb4)
[ 6.380496] r10:f080200c r9:f0802000 r8:c1e03ec0 r7:c1e07878 r6:c1cbfe8c r5:000000bd
[ 6.388452] r4:000000bd
[ 6.391030] [<c09ef218>] (gic_handle_irq) from [<c0200abc>] (__irq_svc+0x5c/0x80)
[ 6.398666] Exception stack(0xc1e03ec0 to 0xc1e03f08)
[ 6.403821] 3ec0: c175a018 d87f0614 00000000 c0222bc0 c1e00000 c1e06e1c 00000000 c1e06e6c
[ 6.412145] 3ee0: c84ff712 c121e120 30c5387d c1e03f1c c175a018 c1e03f10 c020a204 c020a208
[ 6.420459] 3f00: 60000013 ffffffff
[ 6.424026] r10:30c5387d r9:c1e00000 r8:c84ff712 r7:c1e03ef4 r6:ffffffff r5:60000013
[ 6.431983] r4:c020a208
[ 6.434561] [<c020a1b8>] (arch_cpu_idle) from [<c112af34>] (default_idle_call+0x48/0x188)
[ 6.442906] [<c112aeec>] (default_idle_call) from [<c0287578>] (do_idle+0x11c/0x180)
[ 6.450816] r9:c121e120 r8:c84ff712 r7:c1e06e6c r6:00000000 r5:c1e06e1c r4:c1e00000
[ 6.458681] [<c028745c>] (do_idle) from [<c0287a00>] (cpu_startup_entry+0x28/0x2c)
[ 6.466416] r9:410fd083 r8:c187df68 r7:c1e00000 r6:ca9d6000 r5:c85201e0 r4:000000e1
[ 6.474283] [<c02879d8>] (cpu_startup_entry) from [<c111d7b8>] (rest_init+0x148/0x150)
[ 6.482358] [<c111d670>] (rest_init) from [<c1801534>] (arch_call_rest_init+0x18/0x1c)
[ 6.490450] r7:c1e06dc0 r6:c1e00000 r5:c1e00000 r4:c851d5c0
[ 6.496202] [<c180151c>] (arch_call_rest_init) from [<c1801990>] (start_kernel+0x3e0/0x424)
[ 6.504723] [<c18015b0>] (start_kernel) from [<00000000>] (0x0)
[ 6.510776] r8:2eff9400 r7:00000c42 r6:30c0387d r5:00000000 r4:c1800334
[ 6.517584] ==================================================================
[ 6.524921] Disabling lock debugging due to kernel taint
[ 6.530467] 8<--- cut here ---
[ 6.533628] Unable to handle kernel paging request at virtual address 1c8befdc
[ 6.541025] pgd = (ptrval)
[ 6.543837] [1c8befdc] *pgd=80000000004003, *pmd=00000000
[ 6.549431] Internal error: Oops: 206 [#1] SMP ARM
[ 6.554311] Modules linked in:
[ 6.557433] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B 5.13.0-rc1-v7l #165
[ 6.565755] Hardware name: BCM2711
[ 6.569217] PC is at skb_release_data+0x14c/0x1fc
[ 6.574015] LR is at end_report+0x6c/0xf0
[ 6.578109] pc : [<c0ea3628>] lr : [<c05148ac>] psr: 60000113
[ 6.584484] sp : c1e03ac8 ip : c1e03a60 fp : c1e03af4
[ 6.589801] r10: cc57e462 r9 : cc97ff02 r8 : cc57e400
[ 6.595116] r7 : cc97ff00 r6 : 00000000 r5 : cc97ff28 r4 : 1c8befd4
[ 6.601755] r3 : 00000000 r2 : c1e0ccc0 r1 : c0514884 r0 : 00000001
[ 6.608393] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 6.615657] Control: 30c5383d Table: 00003000 DAC: fffffffd
[ 6.621497] Register r0 information: non-paged memory
[ 6.626651] Register r1 information: non-slab/vmalloc memory
[ 6.632418] Register r2 information: non-slab/vmalloc memory
[ 6.638185] Register r3 information: NULL pointer
[ 6.642981] Register r4 information: non-paged memory
[ 6.648129] Register r5 information: non-slab/vmalloc memory
[ 6.653895] Register r6 information: NULL pointer
[ 6.658690] Register r7 information: non-slab/vmalloc memory
[ 6.664455] Register r8 information: slab skbuff_head_cache start cc57e400 pointer offset 0 size 48
[ 6.673715] Register r9 information: non-slab/vmalloc memory
[ 6.679481] Register r10 information: slab skbuff_head_cache start cc57e400 pointer offset 98 size 48
[ 6.688914] Register r11 information: non-slab/vmalloc memory
[ 6.694768] Register r12 information: non-slab/vmalloc memory
[ 6.700621] Process swapper/0 (pid: 0, stack limit = 0x(ptrval))
[ 6.706730] Stack: (0xc1e03ac8 to 0xc1e04000)
[ 6.711177] 3ac0: cc97f680 cc57e400 cc57e4ac 00000001 cc560580 00000000
[ 6.719502] 3ae0: cc560000 0000a8d8 c1e03b1c c1e03af8 c0ea9d2c c0ea34e8 c1e00000 cc57e400
[ 6.727825] 3b00: 00000001 cc560580 00000000 cc560000 c1e03b3c c1e03b20 c0ec0d74 c0ea9cd8
[ 6.736149] 3b20: cc563ed8 cc57e400 a8d81759 cc560580 c1e03c54 c1e03b40 c0c814d4 c0ec0d20
[ 6.744473] 3b40: c0210414 c05154fc c02103d8 ffffc000 c1e03b84 c1e03be0 c1e03c20 b73c0778
[ 6.752795] 3b60: 00000040 cc5640e8 cc560588 cc560088 cc561944 cc563fe0 cc561580 cc563fd8
[ 6.761118] 3b80: ca5c3c00 cc563fc8 cc563fd4 cc564078 0000000c 00000000 c1e03be4 00000000
[ 6.769441] 3ba0: 00000000 cc563580 c0210430 00000000 00000000 cc920374 c1e03be4 c1e03c88
[ 6.777764] 3bc0: 41b58ab3 c1730000 c0c80f5c cc920374 cc920340 00000001 c1e03d04 c1e03be8
[ 6.786084] 3be0: 00000000 00000000 00000000 00000000 00000000 00000000 c1e03c24 c1e03c08
[ 6.794406] 3c00: 41b58ab3 c16ca308 c02a73bc d87efd80 cab4d000 d87f0318 c1e03c4c 0147adf0
[ 6.802729] 3c20: c175a5d0 b5ed3f2f c02012e8 cc563ed8 00000001 00000040 c1e03d10 cc563ee0
[ 6.811052] 3c40: c1e05d00 c1e03d20 c1e03c94 c1e03c58 c0ed3f0c c0c80f68 c03a9ed0 c05154fc
[ 6.819376] 3c60: c03aa120 cc563ed8 60000113 c1e03d80 cc563ed8 cc560000 cc563edc 0000012c
[ 6.827700] 3c80: c1e05d00 c1e03d20 c1e03db4 c1e03c98 c0ed4a14 c0ed3eb8 d87f0740 b73c079c
[ 6.836023] 3ca0: c02104fc c1e00000 c1e00010 c1e00000 c1e03d40 16b2f000 c1cc1740 c1e05d00
[ 6.844347] 3cc0: ffff8d38 c1e03d20 c0210430 0000004c c1e03d04 c1e03ce0 c0c7e6e4 c051546c
[ 6.852670] 3ce0: 41b58ab3 c17447f0 c0ed4494 cb17fc00 c1e03da0 0000004c c1e03d54 c1e03d08
[ 6.860994] 3d00: c1e03e4c c1e03e28 c03a9eb0 c02367a4 c02cdd00 c051546c d87efdc0 cb17fc44
[ 6.869316] 3d20: c1e03d20 c1e03d20 d87efdf0 cb17fc00 c1e03d54 c1e03d40 c112b730 c051546c
[ 6.877640] 3d40: c1e03d40 c1e03d40 c02367a4 c0201240 c02c8548 c1e00000 c1cbec50 c02367a4
[ 6.885965] 3d60: c0201240 c1e00004 16b2f000 c1e00000 c1e03db4 c1e03d80 c03a9ed0 c05154fc
[ 6.894287] 3d80: 41b58ab3 b5ed3f2f c1e03db4 c1e0508c c18b9360 00000004 00000003 16b2f000
[ 6.902610] 3da0: 00000008 c1e00000 c1e03e24 c1e03db8 c02012e8 c0ed44a0 c1e03de4 c1e03dc8
[ 6.910932] 3dc0: 00000001 00200002 c1213840 c1e05d00 ffff8d37 c18b92d4 0000000a c1cc0940
[ 6.919256] 3de0: c09eed7c c18b9350 c1e05080 c1e03db8 00000101 c1e06e1c c1e03e24 c1e06ea4
[ 6.927580] 3e00: c1cc0940 ffffc000 c1e03e28 16b2f000 c1e03ec0 16b2f000 c1e03e4c c1e03e28
[ 6.935901] 3e20: c02367a4 c0201104 c1cbfe80 00000000 c1e00000 00000000 00000001 c1e03ec0
[ 6.944224] 3e40: c1e03e84 c1e03e50 c02c75fc c0236628 c112af34 ca91f000 c1e03ebc 000000bd
[ 6.952547] 3e60: 000000bd c1cbfe8c c1e07878 c1e03ec0 f0802000 f080200c c1e03ebc c1e03e88
[ 6.960872] 3e80: c09ef2b4 c02c7544 c03aa120 c021043c c020a204 c020a208 60000013 ffffffff
[ 6.969195] 3ea0: c1e03ef4 c84ff712 c1e00000 30c5387d c1e03f1c c1e03ec0 c0200abc c09ef224
[ 6.977516] 3ec0: c175a018 d87f0614 00000000 c0222bc0 c1e00000 c1e06e1c 00000000 c1e06e6c
[ 6.985840] 3ee0: c84ff712 c121e120 30c5387d c1e03f1c c175a018 c1e03f10 c020a204 c020a208
[ 6.994163] 3f00: 60000013 ffffffff c020a1f4 00000000 c1e03f44 c1e03f20 c112af34 c020a1c4
[ 7.002486] 3f20: c1e00000 c1e06e1c 00000000 c1e06e6c c84ff712 c121e120 c1e03f6c c1e03f48
[ 7.010810] 3f40: c0287578 c112aef8 000000e1 c85201e0 ca9d6000 c1e00000 c187df68 410fd083
[ 7.019134] 3f60: c1e03f7c c1e03f70 c0287a00 c0287468 c1e03f9c c1e03f80 c111d7b8 c02879e4
[ 7.027458] 3f80: c851d5c0 c1e00000 c1e00000 c1e06dc0 c1e03fac c1e03fa0 c1801534 c111d67c
[ 7.035781] 3fa0: c1e03ff4 c1e03fb0 c1801990 c1801528 ffffffff ffffffff 00000000 c18006b8
[ 7.044103] 3fc0: 00000000 c187df68 b5e8322f 00000000 410fd083 c1800334 00000000 30c0387d
[ 7.052425] 3fe0: 00000c42 2eff9400 00000000 c1e03ff8 00000000 c18015bc 00000000 00000000
[ 7.060734] Backtrace:
[ 7.063237] [<c0ea34dc>] (skb_release_data) from [<c0ea9d2c>] (consume_skb+0x60/0x134)
[ 7.071327] r10:0000a8d8 r9:cc560000 r8:00000000 r7:cc560580 r6:00000001 r5:cc57e4ac
[ 7.079286] r4:cc57e400 r3:cc97f680
[ 7.082923] [<c0ea9ccc>] (consume_skb) from [<c0ec0d74>] (__dev_kfree_skb_any+0x60/0x64)
[ 7.091187] r9:cc560000 r8:00000000 r7:cc560580 r6:00000001 r5:cc57e400 r4:c1e00000
[ 7.099054] [<c0ec0d14>] (__dev_kfree_skb_any) from [<c0c814d4>] (bcmgenet_rx_poll+0x578/0x770)
[ 7.107934] r7:cc560580 r6:a8d81759 r5:cc57e400 r4:cc563ed8
[ 7.113686] [<c0c80f5c>] (bcmgenet_rx_poll) from [<c0ed3f0c>] (__napi_poll+0x60/0x2b8)
[ 7.121778] r10:c1e03d20 r9:c1e05d00 r8:cc563ee0 r7:c1e03d10 r6:00000040 r5:00000001
[ 7.129735] r4:cc563ed8
[ 7.132313] [<c0ed3eac>] (__napi_poll) from [<c0ed4a14>] (net_rx_action+0x580/0x620)
[ 7.140233] r10:c1e03d20 r9:c1e05d00 r8:0000012c r7:cc563edc r6:cc560000 r5:cc563ed8
[ 7.148189] r4:c1e03d80
[ 7.150767] [<c0ed4494>] (net_rx_action) from [<c02012e8>] (__do_softirq+0x1f0/0x69c)
[ 7.158772] r10:c1e00000 r9:00000008 r8:16b2f000 r7:00000003 r6:00000004 r5:c18b9360
[ 7.166728] r4:c1e0508c
[ 7.169306] [<c02010f8>] (__do_softirq) from [<c02367a4>] (irq_exit+0x188/0x1b0)
[ 7.176870] r10:16b2f000 r9:c1e03ec0 r8:16b2f000 r7:c1e03e28 r6:ffffc000 r5:c1cc0940
[ 7.184827] r4:c1e06ea4
[ 7.187405] [<c023661c>] (irq_exit) from [<c02c75fc>] (__handle_domain_irq+0xc4/0x128)
[ 7.195497] r9:c1e03ec0 r8:00000001 r7:00000000 r6:c1e00000 r5:00000000 r4:c1cbfe80
[ 7.203363] [<c02c7538>] (__handle_domain_irq) from [<c09ef2b4>] (gic_handle_irq+0x9c/0xb4)
[ 7.211902] r10:f080200c r9:f0802000 r8:c1e03ec0 r7:c1e07878 r6:c1cbfe8c r5:000000bd
[ 7.219858] r4:000000bd
[ 7.222436] [<c09ef218>] (gic_handle_irq) from [<c0200abc>] (__irq_svc+0x5c/0x80)
[ 7.230074] Exception stack(0xc1e03ec0 to 0xc1e03f08)
[ 7.235228] 3ec0: c175a018 d87f0614 00000000 c0222bc0 c1e00000 c1e06e1c 00000000 c1e06e6c
[ 7.243552] 3ee0: c84ff712 c121e120 30c5387d c1e03f1c c175a018 c1e03f10 c020a204 c020a208
[ 7.251865] 3f00: 60000013 ffffffff
[ 7.255432] r10:30c5387d r9:c1e00000 r8:c84ff712 r7:c1e03ef4 r6:ffffffff r5:60000013
[ 7.263389] r4:c020a208
[ 7.265967] [<c020a1b8>] (arch_cpu_idle) from [<c112af34>] (default_idle_call+0x48/0x188)
[ 7.274309] [<c112aeec>] (default_idle_call) from [<c0287578>] (do_idle+0x11c/0x180)
[ 7.282219] r9:c121e120 r8:c84ff712 r7:c1e06e6c r6:00000000 r5:c1e06e1c r4:c1e00000
[ 7.290085] [<c028745c>] (do_idle) from [<c0287a00>] (cpu_startup_entry+0x28/0x2c)
[ 7.297821] r9:410fd083 r8:c187df68 r7:c1e00000 r6:ca9d6000 r5:c85201e0 r4:000000e1
[ 7.305687] [<c02879d8>] (cpu_startup_entry) from [<c111d7b8>] (rest_init+0x148/0x150)
[ 7.313764] [<c111d670>] (rest_init) from [<c1801534>] (arch_call_rest_init+0x18/0x1c)
[ 7.321855] r7:c1e06dc0 r6:c1e00000 r5:c1e00000 r4:c851d5c0
[ 7.327606] [<c180151c>] (arch_call_rest_init) from [<c1801990>] (start_kernel+0x3e0/0x424)
[ 7.336126] [<c18015b0>] (start_kernel) from [<00000000>] (0x0)
[ 7.342179] r8:2eff9400 r7:00000c42 r6:30c0387d r5:00000000 r4:c1800334
[ 7.349000] Code: ebd9c790 e5954000 e2840008 ebd9c78d (e5943008)
[ 7.355247] ---[ end trace 38b3df6838c109c3 ]---

Let me know if you need any other information, thanks!
Maxime
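
The final "Code:" line of the oops can be cross-checked against the register dump. The sketch below, in Python purely for illustration, decodes the parenthesised word (the instruction at the faulting PC) using the standard A32 LDR/STR immediate-offset layout and recomputes the accessed address from the reported value of r4. The helper name decode_ldr_imm is made up for this example; the instruction words and register values are copied from the report above.

```python
# Decode the faulting ARM (A32) instruction from the oops "Code:" line and
# recompute the address it accessed. decode_ldr_imm is a hypothetical helper
# written for this example, not a kernel or library function.

def decode_ldr_imm(word):
    """Split an A32 LDR/STR (immediate offset) instruction word into fields."""
    if (word >> 25) & 0b111 != 0b010:
        raise ValueError("not an LDR/STR immediate-offset encoding")
    return {
        "pre_index": bool((word >> 24) & 1),  # P bit
        "add": bool((word >> 23) & 1),        # U bit: add or subtract offset
        "byte": bool((word >> 22) & 1),       # B bit: byte or word access
        "load": bool((word >> 20) & 1),       # L bit: LDR or STR
        "rn": (word >> 16) & 0xF,             # base register
        "rt": (word >> 12) & 0xF,             # destination register
        "imm": word & 0xFFF,                  # 12-bit immediate offset
    }

# Values copied from the register dump in the report.
regs = {4: 0x1C8BEFD4, 5: 0xCC97FF28}
fault_addr = 0x1C8BEFDC

# The parenthesised word in "Code: ... (e5943008)" is the instruction at PC.
insn = decode_ldr_imm(0xE5943008)
offset = insn["imm"] if insn["add"] else -insn["imm"]
effective = (regs[insn["rn"]] + offset) & 0xFFFFFFFF

# An earlier word on the same "Code:" line, decoded the same way.
prev = decode_ldr_imm(0xE5954000)

print(f"faulting: ldr r{insn['rt']}, [r{insn['rn']}, #{insn['imm']}] -> {effective:#010x}")
print(f"earlier:  ldr r{prev['rt']}, [r{prev['rn']}, #{prev['imm']}]")
print("matches reported fault address:", effective == fault_addr)
```

Under this decoding the faulting instruction is a 4-byte load from r4 + 8, which with r4 = 1c8befd4 gives the reported address 1c8befdc. The earlier word e5954000 decodes as a load of r4 from [r5], which suggests the bad pointer was read from memory at r5 (cc97ff28) rather than computed.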


2021-05-24 15:14:12

by Florian Fainelli

Subject: Re: Kernel Panic in skb_release_data using genet

Hi Maxime,

On 5/24/2021 6:01 AM, Maxime Ripard wrote:
> Hi Doug, Florian,
>
> I've been running a RaspberryPi4 with a mainline kernel for a while,
> booting from NFS. Every once in a while (I'd say ~20-30% of all boots),
> I'm getting a kernel panic around the time init is started.
>
> I was debugging a kernel based on drm-misc-next-2021-05-17 today with
> KASAN enabled and got this, which looks related:

Is there a known good version that could be used for bisection, or did
you just start running this test and have no reference point?

How stable in terms of clocking is the configuration that you are using?
I could try to fire up a similar test on a Pi4 at home, or use one of
our 72112 systems which is the closest we have to a Pi4 and see if that
happens there as well.

>
>
> Let me know if you need any other information, thanks!
> Maxime
>

--
Florian

2021-05-24 15:25:06

by Maxime Ripard

Subject: Re: Kernel Panic in skb_release_data using genet

Hi Florian,

On Mon, May 24, 2021 at 07:49:25AM -0700, Florian Fainelli wrote:
> Hi Maxime,
>
> On 5/24/2021 6:01 AM, Maxime Ripard wrote:
> > Hi Doug, Florian,
> >
> > I've been running a RaspberryPi4 with a mainline kernel for a while,
> > booting from NFS. Every once in a while (I'd say ~20-30% of all boots),
> > I'm getting a kernel panic around the time init is started.
> >
> > I was debugging a kernel based on drm-misc-next-2021-05-17 today with
> > KASAN enabled and got this, which looks related:
>
> Is there a known good version that could be used for bisection or you
> just started to do this test and you have no reference point?

I've had this issue for over a year and never (I think?) got a good
version, so while it might be a regression, it's not a recent one.

> How stable in terms of clocking is the configuration that you are using?
> I could try to fire up a similar test on a Pi4 at home, or use one of
> our 72112 systems which is the closest we have to a Pi4 and see if that
> happens there as well.

I'm not really sure about the clocking. Is there any clock you want to
look at in particular?

My setup is fairly simple: the firmware and kernel are loaded over TFTP
and the rootfs is mounted over NFS, and the crash always occurs around
init start, so I guess that's when it actually starts to transmit a
decent amount of data?

Maxime
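
Because the crash only shows up on roughly 20-30% of boots, a single clean boot says little about whether a given kernel is good, which matters for any bisection attempt. A rough sketch of the arithmetic, assuming boots are independent and the crash rate is constant (neither is established in this thread): the chance of n clean boots in a row on a bad kernel is (1 - p)^n, so n can be chosen to push that below a threshold. The function name and the 5% threshold are illustrative choices.

```python
import math

def clean_boots_needed(crash_rate, alpha=0.05):
    """Smallest n such that (1 - crash_rate)**n <= alpha.

    crash_rate: assumed per-boot crash probability on a bad kernel.
    alpha: accepted chance of calling a bad kernel good by luck.
    """
    if not 0.0 < crash_rate < 1.0:
        raise ValueError("crash_rate must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - crash_rate))

# The two ends of the "~20-30% of all boots" estimate.
for rate in (0.20, 0.30):
    n = clean_boots_needed(rate)
    print(f"crash rate {rate:.0%}: {n} clean boots in a row "
          f"(residual chance {(1 - rate) ** n:.3f})")
```

With these assumptions, about 14 consecutive clean boots are needed at a 20% crash rate and about 9 at 30% before a kernel can be treated as good with 95% confidence.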


2021-05-24 16:12:29

by Florian Fainelli

Subject: Re: Kernel Panic in skb_release_data using genet

On 5/24/2021 8:13 AM, Maxime Ripard wrote:
> Hi Florian,
>
> On Mon, May 24, 2021 at 07:49:25AM -0700, Florian Fainelli wrote:
>> Hi Maxime,
>>
>> On 5/24/2021 6:01 AM, Maxime Ripard wrote:
>>> Hi Doug, Florian,
>>>
>>> I've been running a RaspberryPi4 with a mainline kernel for a while,
>>> booting from NFS. Every once in a while (I'd say ~20-30% of all boots),
>>> I'm getting a kernel panic around the time init is started.
>>>
>>> I was debugging a kernel based on drm-misc-next-2021-05-17 today with
>>> KASAN enabled and got this, which looks related:
>>
>> Is there a known good version that could be used for bisection or you
>> just started to do this test and you have no reference point?
>
> I've had this issue for over a year and never (I think?) got a good
> version, so while it might be a regression, it's not a recent one.

OK, this helps and does not really help.

>
>> How stable in terms of clocking is the configuration that you are using?
>> I could try to fire up a similar test on a Pi4 at home, or use one of
>> our 72112 systems which is the closest we have to a Pi4 and see if that
>> happens there as well.
>
> I'm not really sure about the clocking. Is there any clock you want to
> look at in particular?

ARM, DDR, AXI: essentially anything that could cause memory corruption.
The GENET clocks are fairly fixed: a 250MHz clock and a 125MHz clock
feed the data path.

>
> My setup is fairly simple: the firmware and kernel are loaded over TFTP
> and the rootfs is mounted over NFS, and the crash always occurs around
> init start, so I guess when it actually starts to transmit a decent
> amount of data?

Can you reproduce this problem with KASAN disabled and, if so, does the
crash eventually point back to the same location?

I have a suspicion that this is all Pi4-specific because we regularly
run the GENET driver through various kernel versions (4.9, 5.4, 5.10
and mainline) and have not run into that.
--
Florian
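
One way to check whether crashes keep pointing back to the same location is to tally the "PC is at" line across saved boot logs. This is a minimal sketch, assuming the logs are available as text in the format shown in the original report; the function name crash_sites and the inline sample are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as "PC is at skb_release_data+0x14c/0x1fc" and
# captures the symbol name and the offset into it.
PC_LINE = re.compile(r"PC is at (\S+?)\+(0x[0-9a-f]+)/0x[0-9a-f]+")

def crash_sites(log_text):
    """Count (symbol, offset) pairs reported as the faulting PC."""
    return Counter(match.groups() for match in PC_LINE.finditer(log_text))

# Excerpt copied from the oops in the original report.
sample = """\
[ 6.565755] Hardware name: BCM2711
[ 6.569217] PC is at skb_release_data+0x14c/0x1fc
[ 6.574015] LR is at end_report+0x6c/0xf0
"""

for (symbol, offset), count in crash_sites(sample).most_common():
    print(f"{symbol}+{offset}: {count}")
```

Feeding it the concatenated logs from several crashing boots, with and without KASAN, gives a quick view of whether the faults cluster at one site or are scattered, as random memory corruption would tend to be.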

2021-05-28 17:16:33

by Florian Fainelli

Subject: Re: Kernel Panic in skb_release_data using genet

On 5/24/21 8:37 AM, Florian Fainelli wrote:
>
>
> On 5/24/2021 8:13 AM, Maxime Ripard wrote:
>> Hi Florian,
>>
>> On Mon, May 24, 2021 at 07:49:25AM -0700, Florian Fainelli wrote:
>>> Hi Maxime,
>>>
>>> On 5/24/2021 6:01 AM, Maxime Ripard wrote:
>>>> Hi Doug, Florian,
>>>>
>>>> I've been running a RaspberryPi4 with a mainline kernel for a while,
>>>> booting from NFS. Every once in a while (I'd say ~20-30% of all boots),
>>>> I'm getting a kernel panic around the time init is started.
>>>>
>>>> I was debugging a kernel based on drm-misc-next-2021-05-17 today with
>>>> KASAN enabled and got this, which looks related:
>>>
>>> Is there a known good version that could be used for bisection or you
>>> just started to do this test and you have no reference point?
>>
>> I've had this issue for over a year and never (I think?) got a good
>> version, so while it might be a regression, it's not a recent one.
>
> OK, this helps and does not really help.
>
>>
>>> How stable in terms of clocking is the configuration that you are using?
>>> I could try to fire up a similar test on a Pi4 at home, or use one of
>>> our 72112 systems which is the closest we have to a Pi4 and see if that
>>> happens there as well.
>>
>> I'm not really sure about the clocking. Is there any clock you want to
>> look at in particular?
>
> ARM, DDR, AXI, anything that could cause some memory corruption to occur
> essentially. GENET clocks are fairly fixed, you have a 250MHz clock and
> a 125MHz clock feeding the data path.
>
>>
>> My setup is fairly simple: the firmware and kernel are loaded over TFTP
>> and the rootfs is mounted over NFS, and the crash always occur around
>> init start, so I guess when it actually starts to transmit a decent
>> amount of data?
>
> Do you reproduce this problem with KASAN disabled, do you eventually
> have a crash pointing back to the same location?
>
> I have a suspicion that this is all Pi4 specific because we regularly
> run the GENET driver through various kernel versions (4.9, 5.4 and 5.10
> and mainline) and did not run into that.

I have not had time to get a set-up to reproduce what you are seeing.
Could you share your .config in the meantime? Thanks
--
Florian

2021-05-28 17:18:52

by Maxime Ripard

Subject: Re: Kernel Panic in skb_release_data using genet

Hi Florian,

On Fri, May 28, 2021 at 09:21:27AM -0700, Florian Fainelli wrote:
> On 5/24/21 8:37 AM, Florian Fainelli wrote:
> >
> >
> > On 5/24/2021 8:13 AM, Maxime Ripard wrote:
> >> Hi Florian,
> >>
> >> On Mon, May 24, 2021 at 07:49:25AM -0700, Florian Fainelli wrote:
> >>> Hi Maxime,
> >>>
> >>> On 5/24/2021 6:01 AM, Maxime Ripard wrote:
> >>>> Hi Doug, Florian,
> >>>>
> >>>> I've been running a RaspberryPi4 with a mainline kernel for a while,
> >>>> booting from NFS. Every once in a while (I'd say ~20-30% of all boots),
> >>>> I'm getting a kernel panic around the time init is started.
> >>>>
> >>>> I was debugging a kernel based on drm-misc-next-2021-05-17 today with
> >>>> KASAN enabled and got this, which looks related:
> >>>
> >>> Is there a known good version that could be used for bisection or you
> >>> just started to do this test and you have no reference point?
> >>
> >> I've had this issue for over a year and never (I think?) got a good
> >> version, so while it might be a regression, it's not a recent one.
> >
> > OK, this helps and does not really help.
> >
> >>
> >>> How stable in terms of clocking is the configuration that you are using?
> >>> I could try to fire up a similar test on a Pi4 at home, or use one of
> >>> our 72112 systems which is the closest we have to a Pi4 and see if that
> >>> happens there as well.
> >>
> >> I'm not really sure about the clocking. Is there any clock you want to
> >> look at in particular?
> >
> > ARM, DDR, AXI, anything that could cause some memory corruption to occur
> > essentially. GENET clocks are fairly fixed, you have a 250MHz clock and
> > a 125MHz clock feeding the data path.
> >
> >>
> >> My setup is fairly simple: the firmware and kernel are loaded over TFTP
> >> and the rootfs is mounted over NFS, and the crash always occur around
> >> init start, so I guess when it actually starts to transmit a decent
> >> amount of data?
> >
> > Do you reproduce this problem with KASAN disabled, do you eventually
> > have a crash pointing back to the same location?
> >
> > I have a suspicion that this is all Pi4 specific because we regularly
> > run the GENET driver through various kernel versions (4.9, 5.4 and 5.10
> > and mainline) and did not run into that.
>
> I have not had time to get a set-up to reproduce what you are seeing,
> could you share your .config meanwhile? Thanks

Sorry, I didn't have the time to check how the clocks were behaving.

You'll find attached my config.txt file and .config

I'm booting the board entirely from TFTP (which might introduce some
issues in the "handoff" from the bootloader to the kernel). You'll find
a guide here:

https://www.raspberrypi.org/documentation/hardware/raspberrypi/bootmodes/net_tutorial.md

Maxime

2021-05-28 17:19:26

by Florian Fainelli

Subject: Re: Kernel Panic in skb_release_data using genet

On 5/28/21 9:32 AM, Maxime Ripard wrote:
> hi Florian,
>
> On Fri, May 28, 2021 at 09:21:27AM -0700, Florian Fainelli wrote:
>> On 5/24/21 8:37 AM, Florian Fainelli wrote:
>>>
>>>
>>> On 5/24/2021 8:13 AM, Maxime Ripard wrote:
>>>> Hi Florian,
>>>>
>>>> On Mon, May 24, 2021 at 07:49:25AM -0700, Florian Fainelli wrote:
>>>>> Hi Maxime,
>>>>>
>>>>> On 5/24/2021 6:01 AM, Maxime Ripard wrote:
>>>>>> Hi Doug, Florian,
>>>>>>
>>>>>> I've been running a RaspberryPi4 with a mainline kernel for a while,
>>>>>> booting from NFS. Every once in a while (I'd say ~20-30% of all boots),
>>>>>> I'm getting a kernel panic around the time init is started.
>>>>>>
>>>>>> I was debugging a kernel based on drm-misc-next-2021-05-17 today with
>>>>>> KASAN enabled and got this, which looks related:
>>>>>
>>>>> Is there a known good version that could be used for bisection or you
>>>>> just started to do this test and you have no reference point?
>>>>
>>>> I've had this issue for over a year and never (I think?) got a good
>>>> version, so while it might be a regression, it's not a recent one.
>>>
>>> OK, this helps and does not really help.
>>>
>>>>
>>>>> How stable in terms of clocking is the configuration that you are using?
>>>>> I could try to fire up a similar test on a Pi4 at home, or use one of
>>>>> our 72112 systems which is the closest we have to a Pi4 and see if that
>>>>> happens there as well.
>>>>
>>>> I'm not really sure about the clocking. Is there any clock you want to
>>>> look at in particular?
>>>
>>> ARM, DDR, AXI, anything that could cause some memory corruption to occur
>>> essentially. GENET clocks are fairly fixed, you have a 250MHz clock and
>>> a 125MHz clock feeding the data path.
>>>
>>>>
>>>> My setup is fairly simple: the firmware and kernel are loaded over TFTP
>>>> and the rootfs is mounted over NFS, and the crash always occur around
>>>> init start, so I guess when it actually starts to transmit a decent
>>>> amount of data?
>>>
>>> Do you reproduce this problem with KASAN disabled, do you eventually
>>> have a crash pointing back to the same location?
>>>
>>> I have a suspicion that this is all Pi4 specific because we regularly
>>> run the GENET driver through various kernel versions (4.9, 5.4 and 5.10
>>> and mainline) and did not run into that.
>>
>> I have not had time to get a set-up to reproduce what you are seeing,
>> could you share your .config meanwhile? Thanks
>
> Sorry, I didn't have the time to check how the clock were behaving.
>
> You'll find attached my config.txt file and .config
>
> I'm booting the board entirely from TFTP (which might introduce some
> issues in the "handoff" from the bootloader to the kernel), you'll find
> some guide there:
>
> https://www.raspberrypi.org/documentation/hardware/raspberrypi/bootmodes/net_tutorial.md

That is also how I boot my Pi4 at home, and I suspect you are right. If
the VPU does not shut down GENET's DMA and leaves buffer addresses in
the on-chip descriptors that point to an address space that Linux
manages totally differently, then we can have a serious problem and
create some memory corruption when the ring is being reclaimed. I will
run a few experiments to test that theory. There may be a solution
using the SW_INIT reset controller to do a full reset of the controller
before handing it over to the Linux driver.
--
Florian

2021-06-01 02:37:24

by Florian Fainelli

Subject: Re: Kernel Panic in skb_release_data using genet

On 5/28/2021 9:48 AM, Florian Fainelli wrote:
> On 5/28/21 9:32 AM, Maxime Ripard wrote:
>> hi Florian,
>>
>> On Fri, May 28, 2021 at 09:21:27AM -0700, Florian Fainelli wrote:
>>> On 5/24/21 8:37 AM, Florian Fainelli wrote:
>>>>
>>>>
>>>> On 5/24/2021 8:13 AM, Maxime Ripard wrote:
>>>>> Hi Florian,
>>>>>
>>>>> On Mon, May 24, 2021 at 07:49:25AM -0700, Florian Fainelli wrote:
>>>>>> Hi Maxime,
>>>>>>
>>>>>> On 5/24/2021 6:01 AM, Maxime Ripard wrote:
>>>>>>> Hi Doug, Florian,
>>>>>>>
>>>>>>> I've been running a RaspberryPi4 with a mainline kernel for a while,
>>>>>>> booting from NFS. Every once in a while (I'd say ~20-30% of all boots),
>>>>>>> I'm getting a kernel panic around the time init is started.
>>>>>>>
>>>>>>> I was debugging a kernel based on drm-misc-next-2021-05-17 today with
>>>>>>> KASAN enabled and got this, which looks related:
>>>>>>
>>>>>> Is there a known good version that could be used for bisection or you
>>>>>> just started to do this test and you have no reference point?
>>>>>
>>>>> I've had this issue for over a year and never (I think?) got a good
>>>>> version, so while it might be a regression, it's not a recent one.
>>>>
>>>> OK, this helps and does not really help.
>>>>
>>>>>
>>>>>> How stable in terms of clocking is the configuration that you are using?
>>>>>> I could try to fire up a similar test on a Pi4 at home, or use one of
>>>>>> our 72112 systems which is the closest we have to a Pi4 and see if that
>>>>>> happens there as well.
>>>>>
>>>>> I'm not really sure about the clocking. Is there any clock you want to
>>>>> look at in particular?
>>>>
>>>> ARM, DDR, AXI, anything that could cause some memory corruption to occur
>>>> essentially. GENET clocks are fairly fixed, you have a 250MHz clock and
>>>> a 125MHz clock feeding the data path.
>>>>
>>>>>
>>>>> My setup is fairly simple: the firmware and kernel are loaded over TFTP
>>>>> and the rootfs is mounted over NFS, and the crash always occur around
>>>>> init start, so I guess when it actually starts to transmit a decent
>>>>> amount of data?
>>>>
>>>> Do you reproduce this problem with KASAN disabled, do you eventually
>>>> have a crash pointing back to the same location?
>>>>
>>>> I have a suspicion that this is all Pi4 specific because we regularly
>>>> run the GENET driver through various kernel versions (4.9, 5.4 and 5.10
>>>> and mainline) and did not run into that.
>>>
>>> I have not had time to get a set-up to reproduce what you are seeing,
>>> could you share your .config meanwhile? Thanks
>>
>> Sorry, I didn't have the time to check how the clock were behaving.
>>
>> You'll find attached my config.txt file and .config
>>
>> I'm booting the board entirely from TFTP (which might introduce some
>> issues in the "handoff" from the bootloader to the kernel), you'll find
>> some guide there:
>>
>> https://www.raspberrypi.org/documentation/hardware/raspberrypi/bootmodes/net_tutorial.md
>
> That is also how I boot my Pi4 at home, and I suspect you are right, if
> the VPU does not shut down GENET's DMA, and leaves buffer addresses in
> the on-chip descriptors that point to an address space that is managed
> totally differently by Linux, then we can have a serious problem and
> create some memory corruption when the ring is being reclaimed. I will
> run a few experiments to test that theory and there may be a solution
> using the SW_INIT reset controller to have a big reset of the controller
> before handing it over to the Linux driver.

Adding a WARN_ON(reg & DMA_EN) in bcmgenet_dma_disable() has not shown
the TX or RX DMA being left running during the handover from the VPU
to the kernel. I checked out drm-misc-next-2021-05-17 to minimize the
differences between your set-up and mine, but so far I have not been
able to reproduce the crash by booting from NFS repeatedly. I will try
again.
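
For readers following along, here is a minimal userspace sketch of the
kind of check described above. It is not the driver code: struct
fake_dma and dma_disable_and_check() are hypothetical names, and the
DMA_EN value is an assumption modelled on the driver's global enable
bit. Note that a check written this way only looks at the global
enable bit, not at any per-ring enable bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for the global DMA enable bit (illustration only). */
#define DMA_EN (1u << 0)

/* Hypothetical stand-in for one memory-mapped DMA_CTRL register. */
struct fake_dma {
	uint32_t ctrl;
};

/*
 * Model of "WARN_ON(reg & DMA_EN)" followed by the disable:
 * returns true if the global enable bit was still set when we
 * arrived (the case the WARN_ON would flag), then clears it.
 */
static bool dma_disable_and_check(struct fake_dma *dma)
{
	uint32_t reg;
	bool was_running;

	assert(dma);
	reg = dma->ctrl;
	was_running = (reg & DMA_EN) != 0;
	reg &= ~DMA_EN;
	dma->ctrl = reg;
	return was_running;
}
```

Calling dma_disable_and_check() on a register image taken right after
the bootloader hands over would report whether the previous owner left
the engine globally enabled; all other bits are left untouched.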
--
Florian

2021-06-01 09:35:19

by nicolas saenz julienne

Subject: Re: Kernel Panic in skb_release_data using genet

On Mon, 2021-05-31 at 19:36 -0700, Florian Fainelli wrote:
> > That is also how I boot my Pi4 at home, and I suspect you are right, if
> > the VPU does not shut down GENET's DMA, and leaves buffer addresses in
> > the on-chip descriptors that point to an address space that is managed
> > totally differently by Linux, then we can have a serious problem and
> > create some memory corruption when the ring is being reclaimed. I will
> > run a few experiments to test that theory and there may be a solution
> > using the SW_INIT reset controller to have a big reset of the controller
> > before handing it over to the Linux driver.
>
> Adding a WARN_ON(reg & DMA_EN) in bcmgenet_dma_disable() has not shown
> that the TX or RX DMA have been left running during the hand over from
> the VPU to the kernel. I checked out drm-misc-next-2021-05-17 to reduce
> as much as possible the differences between your set-up and my set-up
> but so far have not been able to reproduce the crash in booting from NFS
> repeatedly, I will try again.

FWIW I can reproduce the error too. That said, it's rather hard to
reproduce: something on the order of 1 failure every 20 tries.

Regards,
Nicolas

2021-06-02 13:31:52

by Maxime Ripard

Subject: Re: Kernel Panic in skb_release_data using genet

On Tue, Jun 01, 2021 at 11:33:18AM +0200, nicolas saenz julienne wrote:
> On Mon, 2021-05-31 at 19:36 -0700, Florian Fainelli wrote:
> > > That is also how I boot my Pi4 at home, and I suspect you are right, if
> > > the VPU does not shut down GENET's DMA, and leaves buffer addresses in
> > > the on-chip descriptors that point to an address space that is managed
> > > totally differently by Linux, then we can have a serious problem and
> > > create some memory corruption when the ring is being reclaimed. I will
> > > run a few experiments to test that theory and there may be a solution
> > > using the SW_INIT reset controller to have a big reset of the controller
> > > before handing it over to the Linux driver.
> >
> > Adding a WARN_ON(reg & DMA_EN) in bcmgenet_dma_disable() has not shown
> > that the TX or RX DMA have been left running during the hand over from
> > the VPU to the kernel. I checked out drm-misc-next-2021-05-17 to reduce
> > as much as possible the differences between your set-up and my set-up
> > but so far have not been able to reproduce the crash in booting from NFS
> > repeatedly, I will try again.
>
> FWIW I can reproduce the error too. That said it's rather hard to reproduce,
> something in the order of 1 failure every 20 tries.

Yeah, it looks like it's only from a cold boot and comes in "bursts",
where you would get like 5 in a row and be done with it for a while.

Maxime

2021-06-10 21:36:16

by Florian Fainelli

Subject: Re: Kernel Panic in skb_release_data using genet

On 6/2/2021 6:28 AM, Maxime Ripard wrote:
> On Tue, Jun 01, 2021 at 11:33:18AM +0200, nicolas saenz julienne wrote:
>> On Mon, 2021-05-31 at 19:36 -0700, Florian Fainelli wrote:
>>>> That is also how I boot my Pi4 at home, and I suspect you are right, if
>>>> the VPU does not shut down GENET's DMA, and leaves buffer addresses in
>>>> the on-chip descriptors that point to an address space that is managed
>>>> totally differently by Linux, then we can have a serious problem and
>>>> create some memory corruption when the ring is being reclaimed. I will
>>>> run a few experiments to test that theory and there may be a solution
>>>> using the SW_INIT reset controller to have a big reset of the controller
>>>> before handing it over to the Linux driver.
>>>
>>> Adding a WARN_ON(reg & DMA_EN) in bcmgenet_dma_disable() has not shown
>>> that the TX or RX DMA have been left running during the hand over from
>>> the VPU to the kernel. I checked out drm-misc-next-2021-05-17 to reduce
>>> as much as possible the differences between your set-up and my set-up
>>> but so far have not been able to reproduce the crash in booting from NFS
>>> repeatedly, I will try again.
>>
>> FWIW I can reproduce the error too. That said it's rather hard to reproduce,
>> something in the order of 1 failure every 20 tries.
>
> Yeah, it looks like it's only from a cold boot and comes in "bursts",
> where you would get like 5 in a row and be done with it for a while.

Here are two patches that you could try independently of one another:

1) Limit GENET to a single queue

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index fcca023f22e5..e400c12e6868 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -3652,6 +3652,12 @@ static int bcmgenet_change_carrier(struct net_device *dev, bool new_carrier)
 	return 0;
 }
 
+static u16 bcmgenet_select_queue(struct net_device *dev, struct sk_buff *skb,
+				 struct net_device *sb_dev)
+{
+	return 0;
+}
+
 static const struct net_device_ops bcmgenet_netdev_ops = {
 	.ndo_open = bcmgenet_open,
 	.ndo_stop = bcmgenet_close,
@@ -3666,6 +3672,7 @@ static const struct net_device_ops bcmgenet_netdev_ops = {
 #endif
 	.ndo_get_stats = bcmgenet_get_stats,
 	.ndo_change_carrier = bcmgenet_change_carrier,
+	.ndo_select_queue = bcmgenet_select_queue,
 };
 
 /* Array of GENET hardware parameters/characteristics */

2) Ensure that all TX/RX queues are disabled upon DMA initialization

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index fcca023f22e5..7f8a5996fbbb 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -3237,15 +3237,21 @@ static void bcmgenet_get_hw_addr(struct bcmgenet_priv *priv,
 /* Returns a reusable dma control register value */
 static u32 bcmgenet_dma_disable(struct bcmgenet_priv *priv)
 {
+	unsigned int i;
 	u32 reg;
 	u32 dma_ctrl;
 
 	/* disable DMA */
 	dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
+	for (i = 0; i < priv->hw_params->tx_queues; i++)
+		dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
 	reg = bcmgenet_tdma_readl(priv, DMA_CTRL);
 	reg &= ~dma_ctrl;
 	bcmgenet_tdma_writel(priv, reg, DMA_CTRL);
 
+	dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
+	for (i = 0; i < priv->hw_params->rx_queues; i++)
+		dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
 	reg = bcmgenet_rdma_readl(priv, DMA_CTRL);
 	reg &= ~dma_ctrl;
 	bcmgenet_rdma_writel(priv, reg, DMA_CTRL);
--
Florian

2021-06-25 13:00:35

by Maxime Ripard

Subject: Re: Kernel Panic in skb_release_data using genet

Hi Florian,

Sorry for the late reply

On Thu, Jun 10, 2021 at 02:33:17PM -0700, Florian Fainelli wrote:
> On 6/2/2021 6:28 AM, Maxime Ripard wrote:
> > On Tue, Jun 01, 2021 at 11:33:18AM +0200, nicolas saenz julienne wrote:
> >> On Mon, 2021-05-31 at 19:36 -0700, Florian Fainelli wrote:
> >>>> That is also how I boot my Pi4 at home, and I suspect you are right, if
> >>>> the VPU does not shut down GENET's DMA, and leaves buffer addresses in
> >>>> the on-chip descriptors that point to an address space that is managed
> >>>> totally differently by Linux, then we can have a serious problem and
> >>>> create some memory corruption when the ring is being reclaimed. I will
> >>>> run a few experiments to test that theory and there may be a solution
> >>>> using the SW_INIT reset controller to have a big reset of the controller
> >>>> before handing it over to the Linux driver.
> >>>
> >>> Adding a WARN_ON(reg & DMA_EN) in bcmgenet_dma_disable() has not shown
> >>> that the TX or RX DMA have been left running during the hand over from
> >>> the VPU to the kernel. I checked out drm-misc-next-2021-05-17 to reduce
> >>> as much as possible the differences between your set-up and my set-up
> >>> but so far have not been able to reproduce the crash in booting from NFS
> >>> repeatedly, I will try again.
> >>
> >> FWIW I can reproduce the error too. That said it's rather hard to reproduce,
> >> something in the order of 1 failure every 20 tries.
> >
> > Yeah, it looks like it's only from a cold boot and comes in "bursts",
> > where you would get like 5 in a row and be done with it for a while.
>
> Here are two patches that you could try exclusive from one another
>
> 1) Limit GENET to a single queue
>
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index fcca023f22e5..e400c12e6868 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -3652,6 +3652,12 @@ static int bcmgenet_change_carrier(struct
> net_device *dev, bool new_carrier)
> return 0;
> }
>
> +static u16 bcmgenet_select_queue(struct net_device *dev, struct sk_buff
> *skb,
> + struct net_device *sb_dev)
> +{
> + return 0;
> +}
> +
> static const struct net_device_ops bcmgenet_netdev_ops = {
> .ndo_open = bcmgenet_open,
> .ndo_stop = bcmgenet_close,
> @@ -3666,6 +3672,7 @@ static const struct net_device_ops
> bcmgenet_netdev_ops = {
> #endif
> .ndo_get_stats = bcmgenet_get_stats,
> .ndo_change_carrier = bcmgenet_change_carrier,
> + .ndo_select_queue = bcmgenet_select_queue,
> };
>
> /* Array of GENET hardware parameters/characteristics */
>
> 2) Ensure that all TX/RX queues are disabled upon DMA initialization
>
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index fcca023f22e5..7f8a5996fbbb 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -3237,15 +3237,21 @@ static void bcmgenet_get_hw_addr(struct
> bcmgenet_priv *priv,
> /* Returns a reusable dma control register value */
> static u32 bcmgenet_dma_disable(struct bcmgenet_priv *priv)
> {
> + unsigned int i;
> u32 reg;
> u32 dma_ctrl;
>
> /* disable DMA */
> dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
> + for (i = 0; i < priv->hw_params->tx_queues; i++)
> + dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
> reg = bcmgenet_tdma_readl(priv, DMA_CTRL);
> reg &= ~dma_ctrl;
> bcmgenet_tdma_writel(priv, reg, DMA_CTRL);
>
> + dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
> + for (i = 0; i < priv->hw_params->rx_queues; i++)
> + dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
> reg = bcmgenet_rdma_readl(priv, DMA_CTRL);
> reg &= ~dma_ctrl;
> bcmgenet_rdma_writel(priv, reg, DMA_CTRL);

I had a bunch of issues popping up today so I took the opportunity to
test those patches. The first one doesn't change anything: I still had
the crash occurring with it. With the second applied (in addition), it
seems like it's fixed. I'll keep testing and will let you know.

Maxime

2021-07-02 16:55:09

by Florian Fainelli

Subject: Re: Kernel Panic in skb_release_data using genet

Hey Maxime,

On 6/25/2021 5:59 AM, Maxime Ripard wrote:
> Hi Florian,
>
> Sorry for the late reply
>
> On Thu, Jun 10, 2021 at 02:33:17PM -0700, Florian Fainelli wrote:
>> On 6/2/2021 6:28 AM, Maxime Ripard wrote:
>>> On Tue, Jun 01, 2021 at 11:33:18AM +0200, nicolas saenz julienne wrote:
>>>> On Mon, 2021-05-31 at 19:36 -0700, Florian Fainelli wrote:
>>>>>> That is also how I boot my Pi4 at home, and I suspect you are right, if
>>>>>> the VPU does not shut down GENET's DMA, and leaves buffer addresses in
>>>>>> the on-chip descriptors that point to an address space that is managed
>>>>>> totally differently by Linux, then we can have a serious problem and
>>>>>> create some memory corruption when the ring is being reclaimed. I will
>>>>>> run a few experiments to test that theory and there may be a solution
>>>>>> using the SW_INIT reset controller to have a big reset of the controller
>>>>>> before handing it over to the Linux driver.
>>>>>
>>>>> Adding a WARN_ON(reg & DMA_EN) in bcmgenet_dma_disable() has not shown
>>>>> that the TX or RX DMA have been left running during the hand over from
>>>>> the VPU to the kernel. I checked out drm-misc-next-2021-05-17 to reduce
>>>>> as much as possible the differences between your set-up and my set-up
>>>>> but so far have not been able to reproduce the crash in booting from NFS
>>>>> repeatedly, I will try again.
>>>>
>>>> FWIW I can reproduce the error too. That said it's rather hard to reproduce,
>>>> something in the order of 1 failure every 20 tries.
>>>
>>> Yeah, it looks like it's only from a cold boot and comes in "bursts",
>>> where you would get like 5 in a row and be done with it for a while.
>>
>> Here are two patches that you could try exclusive from one another
>>
>> 1) Limit GENET to a single queue
>>
>> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> index fcca023f22e5..e400c12e6868 100644
>> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> @@ -3652,6 +3652,12 @@ static int bcmgenet_change_carrier(struct
>> net_device *dev, bool new_carrier)
>> return 0;
>> }
>>
>> +static u16 bcmgenet_select_queue(struct net_device *dev, struct sk_buff
>> *skb,
>> + struct net_device *sb_dev)
>> +{
>> + return 0;
>> +}
>> +
>> static const struct net_device_ops bcmgenet_netdev_ops = {
>> .ndo_open = bcmgenet_open,
>> .ndo_stop = bcmgenet_close,
>> @@ -3666,6 +3672,7 @@ static const struct net_device_ops
>> bcmgenet_netdev_ops = {
>> #endif
>> .ndo_get_stats = bcmgenet_get_stats,
>> .ndo_change_carrier = bcmgenet_change_carrier,
>> + .ndo_select_queue = bcmgenet_select_queue,
>> };
>>
>> /* Array of GENET hardware parameters/characteristics */
>>
>> 2) Ensure that all TX/RX queues are disabled upon DMA initialization
>>
>> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> index fcca023f22e5..7f8a5996fbbb 100644
>> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> @@ -3237,15 +3237,21 @@ static void bcmgenet_get_hw_addr(struct
>> bcmgenet_priv *priv,
>> /* Returns a reusable dma control register value */
>> static u32 bcmgenet_dma_disable(struct bcmgenet_priv *priv)
>> {
>> + unsigned int i;
>> u32 reg;
>> u32 dma_ctrl;
>>
>> /* disable DMA */
>> dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
>> + for (i = 0; i < priv->hw_params->tx_queues; i++)
>> + dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
>> reg = bcmgenet_tdma_readl(priv, DMA_CTRL);
>> reg &= ~dma_ctrl;
>> bcmgenet_tdma_writel(priv, reg, DMA_CTRL);
>>
>> + dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
>> + for (i = 0; i < priv->hw_params->rx_queues; i++)
>> + dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
>> reg = bcmgenet_rdma_readl(priv, DMA_CTRL);
>> reg &= ~dma_ctrl;
>> bcmgenet_rdma_writel(priv, reg, DMA_CTRL);
>
> I had a bunch of issues popping up today so I took the occasion to test
> those patches. The first one doesn't change anything, I still had the
> crash occurring with it. With the second applied (in addition), it seems
> like it's fixed. I'll keep testing and will let you know.

Did this patch survive more days of testing? I am tempted to send it
regardless of your testing because it is a correctness issue that is
being fixed. There is a global DMA enable bit which should "cut" any
TX/RX queues, but still, for symmetry with other code paths all queues
should be disabled.
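
To make the difference concrete, here is a small userspace sketch of
the mask arithmetic, not the driver itself. The three constants are
assumptions chosen for illustration (check bcmgenet.h for the real
values), and dma_disable_mask() and dma_disable() are hypothetical
helper names. Passing 0 queues models a mask that covers only the
global enable bit and the default ring; passing the queue count models
a mask that also covers every per-queue ring enable bit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define DMA_EN                 (1u << 0)
#define DMA_RING_BUF_EN_SHIFT  1
#define DESC_INDEX             16

/* Build the set of DMA_CTRL bits to clear for nqueues priority queues. */
static uint32_t dma_disable_mask(unsigned int nqueues)
{
	/* Global enable bit plus the default (DESC_INDEX) ring. */
	uint32_t mask = (1u << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT)) | DMA_EN;
	unsigned int i;

	assert(nqueues <= DESC_INDEX);
	/* One ring-buffer enable bit per additional queue. */
	for (i = 0; i < nqueues; i++)
		mask |= 1u << (i + DMA_RING_BUF_EN_SHIFT);
	return mask;
}

/* Apply the mask to a register image and return the new value. */
static uint32_t dma_disable(uint32_t reg, unsigned int nqueues)
{
	return reg & ~dma_disable_mask(nqueues);
}
```

With these assumed constants, a register image that has four per-queue
ring bits still set (0x1E) is left unchanged by dma_disable(reg, 0) and
cleared to 0 by dma_disable(reg, 4).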

Thanks!
--
Florian

2021-07-06 08:18:10

by Maxime Ripard

Subject: Re: Kernel Panic in skb_release_data using genet

Hi Florian,

On Fri, Jul 02, 2021 at 09:49:31AM -0700, Florian Fainelli wrote:
> On 6/25/2021 5:59 AM, Maxime Ripard wrote:
> > Hi Florian,
> >
> > Sorry for the late reply
> >
> > On Thu, Jun 10, 2021 at 02:33:17PM -0700, Florian Fainelli wrote:
> > > On 6/2/2021 6:28 AM, Maxime Ripard wrote:
> > > > On Tue, Jun 01, 2021 at 11:33:18AM +0200, nicolas saenz julienne wrote:
> > > > > On Mon, 2021-05-31 at 19:36 -0700, Florian Fainelli wrote:
> > > > > > > That is also how I boot my Pi4 at home, and I suspect you are right, if
> > > > > > > the VPU does not shut down GENET's DMA, and leaves buffer addresses in
> > > > > > > the on-chip descriptors that point to an address space that is managed
> > > > > > > totally differently by Linux, then we can have a serious problem and
> > > > > > > create some memory corruption when the ring is being reclaimed. I will
> > > > > > > run a few experiments to test that theory and there may be a solution
> > > > > > > using the SW_INIT reset controller to have a big reset of the controller
> > > > > > > before handing it over to the Linux driver.
> > > > > >
> > > > > > Adding a WARN_ON(reg & DMA_EN) in bcmgenet_dma_disable() has not shown
> > > > > > that the TX or RX DMA have been left running during the hand over from
> > > > > > the VPU to the kernel. I checked out drm-misc-next-2021-05-17 to reduce
> > > > > > as much as possible the differences between your set-up and my set-up
> > > > > > but so far have not been able to reproduce the crash in booting from NFS
> > > > > > repeatedly, I will try again.
> > > > >
> > > > > FWIW I can reproduce the error too. That said it's rather hard to reproduce,
> > > > > something in the order of 1 failure every 20 tries.
> > > >
> > > > Yeah, it looks like it's only from a cold boot and comes in "bursts",
> > > > where you would get like 5 in a row and be done with it for a while.
> > >
> > > Here are two patches that you could try exclusive from one another
> > >
> > > 1) Limit GENET to a single queue
> > >
> > > diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> > > b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> > > index fcca023f22e5..e400c12e6868 100644
> > > --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> > > +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> > > @@ -3652,6 +3652,12 @@ static int bcmgenet_change_carrier(struct
> > > net_device *dev, bool new_carrier)
> > > return 0;
> > > }
> > >
> > > +static u16 bcmgenet_select_queue(struct net_device *dev, struct sk_buff
> > > *skb,
> > > + struct net_device *sb_dev)
> > > +{
> > > + return 0;
> > > +}
> > > +
> > > static const struct net_device_ops bcmgenet_netdev_ops = {
> > > .ndo_open = bcmgenet_open,
> > > .ndo_stop = bcmgenet_close,
> > > @@ -3666,6 +3672,7 @@ static const struct net_device_ops
> > > bcmgenet_netdev_ops = {
> > > #endif
> > > .ndo_get_stats = bcmgenet_get_stats,
> > > .ndo_change_carrier = bcmgenet_change_carrier,
> > > + .ndo_select_queue = bcmgenet_select_queue,
> > > };
> > >
> > > /* Array of GENET hardware parameters/characteristics */
> > >
> > > 2) Ensure that all TX/RX queues are disabled upon DMA initialization
> > >
> > > diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> > > index fcca023f22e5..7f8a5996fbbb 100644
> > > --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> > > +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> > > @@ -3237,15 +3237,21 @@ static void bcmgenet_get_hw_addr(struct bcmgenet_priv *priv,
> > >  /* Returns a reusable dma control register value */
> > >  static u32 bcmgenet_dma_disable(struct bcmgenet_priv *priv)
> > >  {
> > > +	unsigned int i;
> > >  	u32 reg;
> > >  	u32 dma_ctrl;
> > >
> > >  	/* disable DMA */
> > >  	dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
> > > +	for (i = 0; i < priv->hw_params->tx_queues; i++)
> > > +		dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
> > >  	reg = bcmgenet_tdma_readl(priv, DMA_CTRL);
> > >  	reg &= ~dma_ctrl;
> > >  	bcmgenet_tdma_writel(priv, reg, DMA_CTRL);
> > >
> > > +	dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
> > > +	for (i = 0; i < priv->hw_params->rx_queues; i++)
> > > +		dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
> > >  	reg = bcmgenet_rdma_readl(priv, DMA_CTRL);
> > >  	reg &= ~dma_ctrl;
> > >  	bcmgenet_rdma_writel(priv, reg, DMA_CTRL);
> >
> > I had a bunch of issues popping up today so I took the occasion to test
> > those patches. The first one doesn't change anything, I still had the
> > crash occurring with it. With the second applied (in addition), it seems
> > like it's fixed. I'll keep testing and will let you know.
>
> Did this patch survive more days of testing? I am tempted to send it
> regardless of your testing because it is a correctness issue that is being
> fixed. There is a global DMA enable bit which should "cut" any TX/RX queues,
> but still, for symmetry with other code paths all queues should be disabled.

Unfortunately, I haven't spent much time working on mainline
recently, so I haven't really had the chance to test that patch
further.

It seems to make sense anyway like you said, so you can definitely send
it, with my Tested-by :)

Maxime
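
For readers following the second patch: it builds one bit mask per direction and clears it from DMA_CTRL. The sketch below reproduces only that mask arithmetic in user space. The constant values (DMA_EN as bit 0, DMA_RING_BUF_EN_SHIFT of 1, DESC_INDEX of 16) and the queue count of 4 are assumptions chosen for illustration, and both helpers are hypothetical names that do not exist in the driver.

```c
#include <stdint.h>

/* Assumed values for illustration only; check bcmgenet.h for the real ones. */
#define DMA_EN                 (1u << 0)
#define DMA_RING_BUF_EN_SHIFT  1
#define DESC_INDEX             16

/*
 * Hypothetical helper: mask of every bit the patch clears in DMA_CTRL
 * for a direction with `queues` priority rings plus the default ring
 * at DESC_INDEX.
 */
static uint32_t genet_dma_disable_mask(unsigned int queues)
{
	uint32_t mask = (1u << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT)) | DMA_EN;
	unsigned int i;

	for (i = 0; i < queues; i++)
		mask |= 1u << (i + DMA_RING_BUF_EN_SHIFT);

	return mask;
}

/* Read-modify-write step of the patch: clear the mask from a register value. */
static uint32_t genet_dma_ctrl_after_disable(uint32_t reg, unsigned int queues)
{
	return reg & ~genet_dma_disable_mask(queues);
}
```

Under these assumed values, the unpatched code clears only 0x20001 (DMA_EN plus the default ring), while a direction with 4 queues clears 0x2001f, so the per-queue enable bits 1 to 4 no longer survive the disable.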

2022-05-14 01:47:57

by Maxime Ripard

Subject: Re: Kernel Panic in skb_release_data using genet

Hi Florian,

Sorry for reviving this old thread...

On Thu, Jun 10, 2021 at 02:33:17PM -0700, Florian Fainelli wrote:
> On 6/2/2021 6:28 AM, Maxime Ripard wrote:
> > On Tue, Jun 01, 2021 at 11:33:18AM +0200, nicolas saenz julienne wrote:
> >> On Mon, 2021-05-31 at 19:36 -0700, Florian Fainelli wrote:
> >>>> That is also how I boot my Pi4 at home, and I suspect you are right, if
> >>>> the VPU does not shut down GENET's DMA, and leaves buffer addresses in
> >>>> the on-chip descriptors that point to an address space that is managed
> >>>> totally differently by Linux, then we can have a serious problem and
> >>>> create some memory corruption when the ring is being reclaimed. I will
> >>>> run a few experiments to test that theory and there may be a solution
> >>>> using the SW_INIT reset controller to have a big reset of the controller
> >>>> before handing it over to the Linux driver.
> >>>
> >>> Adding a WARN_ON(reg & DMA_EN) in bcmgenet_dma_disable() has not shown
> >>> that the TX or RX DMA have been left running during the hand over from
> >>> the VPU to the kernel. I checked out drm-misc-next-2021-05-17 to reduce
> >>> as much as possible the differences between your set-up and my set-up
> >>> but so far have not been able to reproduce the crash in booting from NFS
> >>> repeatedly, I will try again.
> >>
> >> FWIW I can reproduce the error too. That said it's rather hard to reproduce,
> >> something in the order of 1 failure every 20 tries.
> >
> > Yeah, it looks like it's only from a cold boot and comes in "bursts",
> > where you would get like 5 in a row and be done with it for a while.
>
> Here are two patches that you could try independently of one another
>
> 1) Limit GENET to a single queue
>
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index fcca023f22e5..e400c12e6868 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -3652,6 +3652,12 @@ static int bcmgenet_change_carrier(struct net_device *dev, bool new_carrier)
>  	return 0;
>  }
>
> +static u16 bcmgenet_select_queue(struct net_device *dev, struct sk_buff *skb,
> +				 struct net_device *sb_dev)
> +{
> +	return 0;
> +}
> +
>  static const struct net_device_ops bcmgenet_netdev_ops = {
>  	.ndo_open = bcmgenet_open,
>  	.ndo_stop = bcmgenet_close,
> @@ -3666,6 +3672,7 @@ static const struct net_device_ops bcmgenet_netdev_ops = {
>  #endif
>  	.ndo_get_stats = bcmgenet_get_stats,
>  	.ndo_change_carrier = bcmgenet_change_carrier,
> +	.ndo_select_queue = bcmgenet_select_queue,
>  };
>
>  /* Array of GENET hardware parameters/characteristics */
>
> 2) Ensure that all TX/RX queues are disabled upon DMA initialization
>
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index fcca023f22e5..7f8a5996fbbb 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -3237,15 +3237,21 @@ static void bcmgenet_get_hw_addr(struct bcmgenet_priv *priv,
>  /* Returns a reusable dma control register value */
>  static u32 bcmgenet_dma_disable(struct bcmgenet_priv *priv)
>  {
> +	unsigned int i;
>  	u32 reg;
>  	u32 dma_ctrl;
>
>  	/* disable DMA */
>  	dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
> +	for (i = 0; i < priv->hw_params->tx_queues; i++)
> +		dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
>  	reg = bcmgenet_tdma_readl(priv, DMA_CTRL);
>  	reg &= ~dma_ctrl;
>  	bcmgenet_tdma_writel(priv, reg, DMA_CTRL);
>
> +	dma_ctrl = 1 << (DESC_INDEX + DMA_RING_BUF_EN_SHIFT) | DMA_EN;
> +	for (i = 0; i < priv->hw_params->rx_queues; i++)
> +		dma_ctrl |= (1 << (i + DMA_RING_BUF_EN_SHIFT));
>  	reg = bcmgenet_rdma_readl(priv, DMA_CTRL);
>  	reg &= ~dma_ctrl;
>  	bcmgenet_rdma_writel(priv, reg, DMA_CTRL);

It looks like current upstream still has this issue, which also upsets KASAN:

[ 16.798433] ==================================================================
[ 16.809347] BUG: KASAN: wild-memory-access in skb_release_data+0x124/0x270
[ 16.816379] Read of size 8 at addr 80800000807f2e0c by task swapper/0/0
[ 16.823122]
[ 16.824655] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.18.0-rc5-v8+ #210
[ 16.831581] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
[ 16.837525] Call trace:
[ 16.840025] dump_backtrace.part.0+0x1dc/0x1f0
[ 16.844576] show_stack+0x24/0x80
[ 16.847974] dump_stack_lvl+0x8c/0xb8
[ 16.851735] print_report+0x1cc/0x240
[ 16.855494] kasan_report+0xb4/0x120
[ 16.859161] __asan_load8+0xa0/0xc4
[ 16.862743] skb_release_data+0x124/0x270
[ 16.866849] consume_skb+0x74/0xe0
[ 16.870337] __dev_kfree_skb_any+0x74/0x90
[ 16.874538] bcmgenet_desc_rx+0x4b4/0x620
[ 16.878642] bcmgenet_rx_poll+0x78/0x150
[ 16.882657] __napi_poll.constprop.0+0x64/0x240
[ 16.887290] net_rx_action+0x4d4/0x590
[ 16.891127] __do_softirq+0x228/0x4d8
[ 16.894875] __irq_exit_rcu+0x1e4/0x24c
[ 16.898806] irq_exit_rcu+0x20/0x54
[ 16.902382] el1_interrupt+0x38/0x50
[ 16.906051] el1h_64_irq_handler+0x18/0x2c
[ 16.910250] el1h_64_irq+0x64/0x68
[ 16.913733] arch_local_irq_enable+0xc/0x20
[ 16.918010] default_idle_call+0x80/0x114
[ 16.922118] cpuidle_idle_call+0x1e0/0x224
[ 16.926310] do_idle+0x104/0x14c
[ 16.929616] cpu_startup_entry+0x34/0x3c
[ 16.933630] rest_init+0x180/0x200
[ 16.937113] arch_post_acpi_subsys_init+0x0/0x30
[ 16.941840] start_kernel+0x3c8/0x400
[ 16.945592] __primary_switched+0xa8/0xb0
[ 16.949699] ==================================================================
[ 16.957052] Disabling lock debugging due to kernel taint
[ 16.962507] Unable to handle kernel paging request at virtual address 80800000807f2e0c
[ 16.970590] Mem abort info:
[ 16.973461] ESR = 0x96000004
[ 16.976602] EC = 0x25: DABT (current EL), IL = 32 bits
[ 16.982038] SET = 0, FnV = 0
[ 16.985176] EA = 0, S1PTW = 0
[ 16.988403] FSC = 0x04: level 0 translation fault
[ 16.993440] Data abort info:
[ 16.996401] ISV = 0, ISS = 0x00000004
[ 17.000333] CM = 0, WnR = 0
[ 17.003384] [80800000807f2e0c] address between user and kernel address ranges
[ 17.010674] Internal error: Oops: 96000004 [#1] SMP
[ 17.015651] Modules linked in:
[ 17.018784] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B 5.18.0-rc5-v8+ #210
[ 17.027115] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
[ 17.033052] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 17.040148] pc : skb_release_data+0x124/0x270
[ 17.044603] lr : skb_release_data+0x124/0x270
[ 17.049055] sp : ffffffc00a477690
[ 17.052434] x29: ffffffc00a477690 x28: 0000000000000000 x27: ffffff8043cfeb42
[ 17.059744] x26: 0000000000000001 x25: ffffff8040a5d5be x24: 00000000ffffffff
[ 17.067049] x23: ffffff8043cfeb40 x22: 0000000000000000 x21: ffffff8040a5d540
[ 17.074355] x20: ffffff8043cfeb70 x19: 80800000807f2e04 x18: 00000000ee6397dc
[ 17.081661] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 17.088961] x14: 0000000000000000 x13: 746e696174206c65 x12: ffffffb801567cf9
[ 17.096265] x11: 1ffffff801567cf8 x10: ffffffb801567cf8 x9 : dfffffc000000000
[ 17.103572] x8 : ffffffc00ab3e7c7 x7 : 0000000000000001 x6 : ffffffb801567cf8
[ 17.110875] x5 : ffffffc00ab3e7c0 x4 : ffffffb801567cf9 x3 : 0000000000000000
[ 17.118178] x2 : 0000000000000020 x1 : ffffffc00a48e7c0 x0 : 0000000000000001
[ 17.125481] Call trace:
[ 17.127978] skb_release_data+0x124/0x270
[ 17.132083] consume_skb+0x74/0xe0
[ 17.135567] __dev_kfree_skb_any+0x74/0x90
[ 17.139764] bcmgenet_desc_rx+0x4b4/0x620
[ 17.143863] bcmgenet_rx_poll+0x78/0x150
[ 17.147873] __napi_poll.constprop.0+0x64/0x240
[ 17.152503] net_rx_action+0x4d4/0x590
[ 17.156338] __do_softirq+0x228/0x4d8
[ 17.160083] __irq_exit_rcu+0x1e4/0x24c
[ 17.164008] irq_exit_rcu+0x20/0x54
[ 17.167580] el1_interrupt+0x38/0x50
[ 17.171247] el1h_64_irq_handler+0x18/0x2c
[ 17.175444] el1h_64_irq+0x64/0x68
[ 17.178923] arch_local_irq_enable+0xc/0x20
[ 17.183197] default_idle_call+0x80/0x114
[ 17.187301] cpuidle_idle_call+0x1e0/0x224
[ 17.191490] do_idle+0x104/0x14c
[ 17.194793] cpu_startup_entry+0x34/0x3c
[ 17.198803] rest_init+0x180/0x200
[ 17.202283] arch_post_acpi_subsys_init+0x0/0x30
[ 17.207006] start_kernel+0x3c8/0x400
[ 17.210755] __primary_switched+0xa8/0xb0
[ 17.214872] Code: 72001c1f 540001e1 91002260 97d3086f (f9400660)
[ 17.221083] ---[ end trace 0000000000000000 ]---
[ 17.225791] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 17.232785] SMP: stopping secondary CPUs
[ 17.236795] Kernel Offset: disabled
[ 17.240348] CPU features: 0x100,00000d08,00001086
[ 17.245143] Memory Limit: none
[ 17.248273] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---

This is at boot, over TFTP and NFS on a RaspberryPi4

Maxime
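
The faulting address in the report can be sanity-checked by hand. The sketch below assumes a 39-bit kernel virtual address layout, which is only a guess based on the ffffffc0... stack pointer in the dump, and classify_va() is a hypothetical helper written for this note rather than kernel code. It shows that 80800000807f2e0c falls in neither the user nor the kernel range, matching the "address between user and kernel address ranges" line.

```c
#include <stdint.h>

#define VA_BITS 39  /* assumption, inferred from sp = ffffffc00a477690 */

enum va_kind { VA_USER, VA_KERNEL, VA_INVALID };

/*
 * Hypothetical helper: classify an address by its upper (64 - VA_BITS)
 * bits. All zeros means user space, all ones means kernel space, and
 * anything else lies in the hole between the two ranges.
 */
static enum va_kind classify_va(uint64_t addr)
{
	uint64_t top = addr >> VA_BITS;
	uint64_t all_ones = UINT64_MAX >> VA_BITS;

	if (top == 0)
		return VA_USER;
	if (top == all_ones)
		return VA_KERNEL;
	return VA_INVALID;
}
```

Note also that the faulting address equals x19 + 8 (80800000807f2e04 + 8), which is consistent with a load at offset 8 from a garbage pointer held in x19.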

2022-05-15 09:48:42

by Florian Fainelli

Subject: Re: Kernel Panic in skb_release_data using genet

On 5/13/2022 7:56 AM, Maxime Ripard wrote:
> [...]
> This is at boot, over TFTP and NFS on a RaspberryPi4

How do I reproduce this reliably? What version of GCC did you build your
kernel with? How often does that happen? What config.txt file are you
using for your Pi4 B?
--
Florian

2022-05-17 16:55:30

by Maxime Ripard

Subject: Re: Kernel Panic in skb_release_data using genet

On Sat, May 14, 2022 at 09:35:42AM -0700, Florian Fainelli wrote:
> On 5/13/2022 7:56 AM, Maxime Ripard wrote:
> > [...]
> > This is at boot, over TFTP and NFS on a RaspberryPi4
>
> How do I reproduce this reliably?

It's not really 100% reliable, but it happens 30%-50% of the time at boot
when KASAN is enabled. Enabling KASAN seems to increase the likelihood,
though: it went unnoticed for some time before I started having those
issues again when I enabled KASAN for something unrelated.

It looks like it happens in bursts though, so I would get 10-15 boots
fine, and then 4-5 boots with that crash.

Cold boot vs reboot doesn't seem to affect it in one way or the other.

> What version of GCC did you build your kernel with?

The arm64 cross-compiler packaged by Fedora, which is GCC 11.2
at the moment.

> How often does that happen? What config.txt file are you using
> for your Pi4 B?

You'll find my config.txt and kernel .config attached

Thanks!
Maxime
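
Given how intermittent this is, it may help to estimate how many clean boots are needed before trusting a fix. The sketch below assumes every boot fails independently with probability p. The bursts described above clearly violate that assumption, so the result is a lower bound at best. boots_needed() is a hypothetical helper written for this note.

```c
/*
 * Smallest n such that n consecutive clean boots would occur with
 * probability at most (1 - confidence) if the bug were still present
 * and hit each boot independently with probability p.
 */
static unsigned int boots_needed(double p, double confidence)
{
	double still_clean = 1.0;  /* chance of n clean boots with the bug present */
	unsigned int n = 0;

	if (p <= 0.0 || p >= 1.0 || confidence <= 0.0 || confidence >= 1.0)
		return 0;  /* outside the range this sketch handles */

	while (still_clean > 1.0 - confidence) {
		still_clean *= 1.0 - p;
		n++;
	}
	return n;
}
```

At the 1-in-20 rate mentioned earlier in the thread this gives 59 boots for 95% confidence; at a 30% failure rate it gives 9.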

2022-08-12 04:10:20

by Florian Fainelli

Subject: Re: Kernel Panic in skb_release_data using genet

On 5/17/2022 12:52 AM, Maxime Ripard wrote:
> It's not really 100% reliable, but it happens 30%-50% of the time at boot
> when KASAN is enabled. Enabling KASAN seems to increase the likelihood,
> though: it went unnoticed for some time before I started having those
> issues again when I enabled KASAN for something unrelated.
>
> It looks like it happens in bursts though, so I would get 10-15 boots
> fine, and then 4-5 boots with that crash.
>
> Cold boot vs reboot doesn't seem to affect it in one way or the other.
>
>> What version of GCC did you build your kernel with?
>
> The arm64 cross-compiler packaged by Fedora, which is GCC 11.2
> at the moment.
>
>> How often does that happen? What config.txt file are you using
>> for your Pi4 B?
>
> You'll find my config.txt and kernel .config attached

OK, so this is what I have been able to reproduce so far, but it does
not reproduce very reliably. I will try my best to hold on to that
lead though. Thanks for your patience.

# udhcpc -i eth0
udhcpc: started, v1.35.0
[ 34.355086] bcmgenet fd580000.ethernet: configuring instance for external RGMII (RX delay)
[ 34.363758] ==================================================================
[ 34.371106] BUG: KASAN: user-memory-access in put_page+0x10/0x64
[ 34.377227] Read of size 4 at addr 01000085 by task ifconfig/165
[ 34.383338]
[ 34.384857] CPU: 0 PID: 165 Comm: ifconfig Tainted: G W 5.19.0 #43
[ 34.392560] Hardware name: BCM2711
[ 34.396020] unwind_backtrace from show_stack+0x18/0x1c
[ 34.401354] show_stack from dump_stack_lvl+0x40/0x4c
[ 34.406502] dump_stack_lvl from kasan_report+0x8c/0xa4
[ 34.411825] kasan_report from put_page+0x10/0x64
[ 34.416615] put_page from skb_release_data+0x84/0x13c
[ 34.421847] skb_release_data from __kfree_skb+0x14/0x20
[ 34.427256] __kfree_skb from bcmgenet_rx_poll+0x504/0x6f8
[ 34.432846] bcmgenet_rx_poll from __napi_poll.constprop.0+0x50/0x1c0
[ 34.439407] __napi_poll.constprop.0 from net_rx_action+0x278/0x488
[ 34.445787] net_rx_action from __do_softirq+0x268/0x390
[ 34.451197] __do_softirq from __irq_exit_rcu+0x88/0xf8
[ 34.456521] __irq_exit_rcu from irq_exit+0x10/0x18
[ 34.461492] irq_exit from call_with_stack+0x18/0x20
[ 34.466553] call_with_stack from __irq_svc+0x84/0x94
[ 34.471696] Exception stack(0xf0d337f8 to 0xf0d33840)
[ 34.476835] 37e0: c5548580 00000003
[ 34.485156] 3800: 00002000 f0a40808 c5548000 c5548580 00000000 c554b000 c5548580 c554bdd0
[ 34.493474] 3820: 00000000 00000004 c5548580 f0d33848 c094329c c09432bc 00070013 ffffffff
[ 34.501788] __irq_svc from bcmgenet_open+0xe1c/0x1094
[ 34.507023] bcmgenet_open from __dev_open+0x1e4/0x21c
[ 34.512258] __dev_open from __dev_change_flags+0x228/0x25c
[ 34.517931] __dev_change_flags from dev_change_flags+0x48/0x88
[ 34.523958] dev_change_flags from devinet_ioctl+0x3ac/0x834
[ 34.529723] devinet_ioctl from inet_ioctl+0x250/0x2a4
[ 34.534956] inet_ioctl from sock_ioctl+0x1dc/0x410
[ 34.539927] sock_ioctl from vfs_ioctl+0x50/0x64
[ 34.544632] vfs_ioctl from sys_ioctl+0x134/0xa7c
[ 34.549422] sys_ioctl from ret_fast_syscall+0x0/0x4c
[ 34.554565] Exception stack(0xf0d33fa8 to 0xf0d33ff0)
[ 34.559705] 3fa0: 0051fd98 0053f9dc 00000003 00008914 b6dc5c4c b6dc5bd0
[ 34.568025] 3fc0: 0051fd98 0053f9dc b6dc5f55 00000036 b6dc5e48
00000003 aed11d00 aed12010
[ 34.576341] 3fe0: 00000036 b6dc5bb8 aec4c2f3 aebdda66
[ 34.581475]
==================================================================
[ 34.588882] Disabling lock debugging due to kernel taint
[ 34.594288] 8<--- cut here ---
[ 34.597412] Unable to handle kernel paging request at virtual address
01000085
[ 34.604775] [01000085] *pgd=01982003, *pmd=00000000
[ 34.609751] Internal error: Oops: 206 [#1] SMP ARM
[ 34.614624] Modules linked in:
[ 34.617734] CPU: 0 PID: 165 Comm: ifconfig Tainted: G B W
5.19.0 #43
[ 34.625435] Hardware name: BCM2711
[ 34.628892] PC is at put_page+0x14/0x64
[ 34.632800] LR is at kasan_report+0x98/0xa4
[ 34.637056] pc : [<c0b4bee4>] lr : [<c047ea5c>] psr: 60070113
[ 34.643427] sp : f0803d50 ip : 00000000 fp : c554bfd8
[ 34.648739] r10: 00007f5e r9 : c694f582 r8 : c1fef15e
[ 34.654052] r7 : c694f5b8 r6 : c694f580 r5 : 01000081 r4 : c1fef100
[ 34.660689] r3 : 00000000 r2 : c1f047c0 r1 : 00000004 r0 : 00000001
[ 34.667325] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM
Segment user
[ 34.674582] Control: 30c5383d Table: 0606b700 DAC: fffffffd
[ 34.680422] Register r0 information: non-paged memory
[ 34.685565] Register r1 information: non-paged memory
[ 34.690705] Register r2 information: slab task_struct start c1f047c0
pointer offset 0
[ 34.698690] Register r3 information: NULL pointer
[ 34.703477] Register r4 information: slab skbuff_head_cache start
c1fef100 pointer offset 0 size 48
[ 34.712699] Register r5 information: non-paged memory
[ 34.717839] Register r6 information: non-slab/vmalloc memory
[ 34.723595] Register r7 information: non-slab/vmalloc memory
[ 34.729352] Register r8 information: slab skbuff_head_cache start
c1fef100 pointer offset 94 size 48
[ 34.738662] Register r9 information: non-slab/vmalloc memory
[ 34.744419] Register r10 information: non-paged memory
[ 34.749646] Register r11 information: non-slab/vmalloc memory
[ 34.755492] Register r12 information: NULL pointer
[ 34.760366] Process ifconfig (pid: 165, stack limit = 0xf517d551)
[ 34.766573] Stack: (0xf0803d50 to 0xf0804000)
[ 34.771005] 3d40: c1fef100
00000001 c694f580 c0b4dc74
[ 34.779325] 3d60: c1fef100 c5548000 c5548580 c1fef100 f0803e40
7f5e0001 00007f5e c0b4db24
[ 34.787644] 3d80: c554bdd0 c0940f84 0bc80000 b4c23195 c2cb12c0
c0efdab0 c2cb12c0 00000001
[ 34.795963] 3da0: 00000000 00000040 00000004 c554bec4 1e1007bc
c554beb8 c5548588 00000004
[ 34.804282] 3dc0: c55498bc c554bec8 c02d5684 00000003 00000000
c02b6e10 e7df0980 c02bf390
[ 34.812601] 3de0: 41b58ab3 c15fec7a c0940a80 c1f047c0 00070113
257ac000 e7de97cc ffff982d
[ 34.820919] 3e00: 00000000 00000000 00000000 00000000 00000000
00000000 00000000 b4c23195
[ 34.829237] 3e20: c1f047c0 e7de8680 00000000 c1f047c0 00000000
c076733c e7de9ad8 00000000
[ 34.837556] 3e40: e7de97d4 c613e0a0 00000001 c554bdd0 00000001
00000040 f0803ef0 c554bdd8
[ 34.845875] 3e60: 257ac000 c2805d40 e7df0d00 c0b70f24 c554bdd0
f0803ef0 00000000 e7df0b40
[ 34.854195] 3e80: f0803f60 bd1007d8 c554bdd0 c2644b40 257ac000
c0b7130c 0000012c e7df0d0c
[ 34.862513] 3ea0: ffff9839 f0803ef0 81d99054 c554bdd4 0000002c
257ac000 c26433c8 c0840554
[ 34.870832] 3ec0: 41b58ab3 c1612850 c0b71094 c2cb12c0 e7df0980
c02d8a5c ea8ed400 c02d8ae0
[ 34.879150] 3ee0: 41b58ab3 c15f3580 c08403c4 00000010 c554bd00
c554bdd8 00000000 00000010
[ 34.887470] 3f00: f0803f00 f0803f00 c5548580 00002000 c554bdd0
c554b580 0000010a c093e0b8
[ 34.895788] 3f20: f0803f20 f0803f20 0000002c c093df98 c2806f18
c029f4ac 00000000 00000007
[ 34.904108] 3f40: e7de9780 c02a4218 00000104 c4dca800 00000001
c4dca824 c4dca86c c4dca86c
[ 34.912427] 3f60: c4dca848 f0803fc8 f0d337f0 b4c23195 c4dca800
c1f047c0 c280508c 00000008
[ 34.920747] 3f80: c2643dc0 c1f047c4 00000003 00000100 c1f049d4
c02014d8 c4dca800 c1f047c0
[ 34.929066] 3fa0: 00400100 0000000a ffff9838 00000004 c263c3c8
257ac000 c26433c0 c1f047c0
[ 34.937385] 3fc0: c2643dc0 c1f047c4 257ac000 257ac000 c1f047c0
00000000 f0d337f0 c02312c4
[ 34.945704] 3fe0: c09432bc 00070013 ffffffff f0d3382c c5548580
c0231418 c09432bc c07559fc
[ 34.954019] put_page from skb_release_data+0x84/0x13c
[ 34.959252] skb_release_data from __kfree_skb+0x14/0x20
[ 34.964660] __kfree_skb from bcmgenet_rx_poll+0x504/0x6f8
[ 34.970250] bcmgenet_rx_poll from __napi_poll.constprop.0+0x50/0x1c0
[ 34.976812] __napi_poll.constprop.0 from net_rx_action+0x278/0x488
[ 34.983192] net_rx_action from __do_softirq+0x268/0x390
[ 34.988602] __do_softirq from __irq_exit_rcu+0x88/0xf8
[ 34.993927] __irq_exit_rcu from irq_exit+0x10/0x18
[ 34.998899] irq_exit from call_with_stack+0x18/0x20
[ 35.003958] call_with_stack from __irq_svc+0x84/0x94
[ 35.009101] Exception stack(0xf0d337f8 to 0xf0d33840)
[ 35.014238] 37e0:
c5548580 00000003
[ 35.022557] 3800: 00002000 f0a40808 c5548000 c5548580 00000000
c554b000 c5548580 c554bdd0
[ 35.030877] 3820: 00000000 00000004 c5548580 f0d33848 c094329c
c09432bc 00070013 ffffffff
[ 35.039192] __irq_svc from bcmgenet_open+0xe1c/0x1094
[ 35.044427] bcmgenet_open from __dev_open+0x1e4/0x21c
[ 35.049661] __dev_open from __dev_change_flags+0x228/0x25c
[ 35.055334] __dev_change_flags from dev_change_flags+0x48/0x88
[ 35.061361] dev_change_flags from devinet_ioctl+0x3ac/0x834
[ 35.067125] devinet_ioctl from inet_ioctl+0x250/0x2a4
[ 35.072359] inet_ioctl from sock_ioctl+0x1dc/0x410
[ 35.077330] sock_ioctl from vfs_ioctl+0x50/0x64
[ 35.082034] vfs_ioctl from sys_ioctl+0x134/0xa7c
[ 35.086825] sys_ioctl from ret_fast_syscall+0x0/0x4c
[ 35.091969] Exception stack(0xf0d33fa8 to 0xf0d33ff0)
[ 35.097109] 3fa0: 0051fd98 0053f9dc 00000003
00008914 b6dc5c4c b6dc5bd0
[ 35.105428] 3fc0: 0051fd98 0053f9dc b6dc5f55 00000036 b6dc5e48
00000003 aed11d00 aed12010
[ 35.113744] 3fe0: 00000036 b6dc5bb8 aec4c2f3 aebdda66
[ 35.118883] Code: e1a05000 e2800004 ebe4cca7 e3a01004 (e5953004)
[ 35.125104] ---[ end trace 0000000000000000 ]---
[ 35.129801] Kernel panic - not syncing: Fatal exception in interrupt
[ 35.136260] CPU3: stopping
[ 35.139009] CPU: 3 PID: 27 Comm: migration/3 Tainted: G B D W
5.19.0 #43
[ 35.146872] Hardware name: BCM2711
[ 35.150318] Stopper: multi_cpu_stop+0x0/0x140 <-
stop_machine_cpuslocked+0x180/0x1e4
[ 35.158197] unwind_backtrace from show_stack+0x18/0x1c
[ 35.163509] show_stack from dump_stack_lvl+0x40/0x4c
[ 35.168643] dump_stack_lvl from do_handle_IPI+0x150/0x2a8
[ 35.174218] do_handle_IPI from ipi_handler+0x1c/0x28
[ 35.179351] ipi_handler from handle_percpu_devid_irq+0x94/0x150
[ 35.185454] handle_percpu_devid_irq from handle_irq_desc+0x38/0x48
[ 35.191820] handle_irq_desc from gic_handle_irq+0x6c/0x78
[ 35.197393] gic_handle_irq from generic_handle_arch_irq+0x28/0x3c
[ 35.203671] generic_handle_arch_irq from call_with_stack+0x18/0x20
[ 35.210038] call_with_stack from __irq_svc+0x84/0x94
[ 35.215168] Exception stack(0xf0913e98 to 0xf0913ee0)
[ 35.220293] 3e80:
e7e20a10 00000000
[ 35.228594] 3ea0: 00000000 257dc000 e7e1ec68 f0913ee8 257dc000
00000000 c2806f18 60070013
[ 35.236896] 3ec0: f0863d70 f0863d74 f0863d70 f0913ee8 c02bebd4
c02bebe8 60070013 ffffffff
[ 35.245192] __irq_svc from rcu_momentary_dyntick_idle+0x2c/0x9c
[ 35.251296] rcu_momentary_dyntick_idle from multi_cpu_stop+0xd4/0x140
[ 35.257931] multi_cpu_stop from cpu_stopper_thread+0x120/0x1d8
[ 35.263947] cpu_stopper_thread from smpboot_thread_fn+0x25c/0x264
[ 35.270228] smpboot_thread_fn from kthread+0x12c/0x140
[ 35.275539] kthread from ret_from_fork+0x14/0x1c
[ 35.280317] Exception stack(0xf0913fb0 to 0xf0913ff8)
[ 35.285441] 3fa0: 00000000
00000000 00000000 00000000
[ 35.293739] 3fc0: 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[ 35.302037] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 35.308746] CPU2: stopping
[ 35.311492] CPU: 2 PID: 22 Comm: migration/2 Tainted: G B D W
5.19.0 #43
[ 35.319355] Hardware name: BCM2711
[ 35.322803] Stopper: multi_cpu_stop+0x0/0x140 <-
stop_machine_cpuslocked+0x180/0x1e4
[ 35.330677] unwind_backtrace from show_stack+0x18/0x1c
[ 35.335988] show_stack from dump_stack_lvl+0x40/0x4c
[ 35.341122] dump_stack_lvl from do_handle_IPI+0x150/0x2a8
[ 35.346697] do_handle_IPI from ipi_handler+0x1c/0x28
[ 35.351830] ipi_handler from handle_percpu_devid_irq+0x94/0x150
[ 35.357932] handle_percpu_devid_irq from handle_irq_desc+0x38/0x48
[ 35.364298] handle_irq_desc from gic_handle_irq+0x6c/0x78
[ 35.369870] gic_handle_irq from generic_handle_arch_irq+0x28/0x3c
[ 35.376148] generic_handle_arch_irq from call_with_stack+0x18/0x20
[ 35.382515] call_with_stack from __irq_svc+0x84/0x94
[ 35.387646] Exception stack(0xf08ebea8 to 0xf08ebef0)
[ 35.392773] bea0: f0863d70 00000003 00000000
00000001 f0863d60 00000000
[ 35.401074] bec0: 00000001 00000000 c2806f18 600c0013 f0863d70
f0863d74 f0863d70 f08ebef8
[ 35.409372] bee0: c030acac c02bebbc 600c0013 ffffffff
[ 35.414495] __irq_svc from rcu_momentary_dyntick_idle+0x0/0x9c
[ 35.420511] rcu_momentary_dyntick_idle from 0xc31d0000
[ 35.425820] CPU1: stopping
[ 35.428568] CPU: 1 PID: 17 Comm: migration/1 Tainted: G B D W
5.19.0 #43
[ 35.436430] Hardware name: BCM2711
[ 35.439879] Stopper: multi_cpu_stop+0x0/0x140 <-
stop_machine_cpuslocked+0x180/0x1e4
[ 35.447752] unwind_backtrace from show_stack+0x18/0x1c
[ 35.453064] show_stack from dump_stack_lvl+0x40/0x4c
[ 35.458198] dump_stack_lvl from do_handle_IPI+0x150/0x2a8
[ 35.463772] do_handle_IPI from ipi_handler+0x1c/0x28
[ 35.468905] ipi_handler from handle_percpu_devid_irq+0x94/0x150
[ 35.475006] handle_percpu_devid_irq from handle_irq_desc+0x38/0x48
[ 35.481373] handle_irq_desc from gic_handle_irq+0x6c/0x78
[ 35.486945] gic_handle_irq from generic_handle_arch_irq+0x28/0x3c
[ 35.493222] generic_handle_arch_irq from call_with_stack+0x18/0x20
[ 35.499590] call_with_stack from __irq_svc+0x84/0x94
[ 35.504721] Exception stack(0xf08c3e98 to 0xf08c3ee0)
[ 35.509847] 3e80:
e7e00a10 00000000
[ 35.518148] 3ea0: 00000000 257bc000 e7dfec68 f08c3ee8 257bc000
00000000 c2806f18 600f0013
[ 35.526449] 3ec0: f0863d70 f0863d74 f0863d70 f08c3ee8 c02bebd4
c02bebe8 600f0013 ffffffff
[ 35.534745] __irq_svc from rcu_momentary_dyntick_idle+0x2c/0x9c
[ 35.540849] rcu_momentary_dyntick_idle from multi_cpu_stop+0xd4/0x140
[ 35.547483] multi_cpu_stop from cpu_stopper_thread+0x120/0x1d8
[ 35.553499] cpu_stopper_thread from smpboot_thread_fn+0x25c/0x264
[ 35.559780] smpboot_thread_fn from kthread+0x12c/0x140
[ 35.565090] kthread from ret_from_fork+0x14/0x1c
[ 35.569868] Exception stack(0xf08c3fb0 to 0xf08c3ff8)
[ 35.574992] 3fa0: 00000000
00000000 00000000 00000000
[ 35.583292] 3fc0: 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000
[ 35.591589] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 35.599291] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt ]---
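
As a reading aid for the oops above, the faulting address can be
cross-checked against the register dump. This is a minimal sketch in
Python, assuming the parenthesized word on the "Code:" line is the
instruction at PC and that it is a plain A32 immediate-offset load. The
helper name decode_ldr_imm is made up for illustration.

```python
def decode_ldr_imm(word):
    """Decode an A32 LDR/STR with immediate offset: Rd, [Rn, #+/-imm12]."""
    # Bits 27:26 must be 0b01 and bit 25 (register offset) must be clear.
    if (word >> 26) & 0b11 != 0b01 or (word >> 25) & 1:
        raise ValueError("not an immediate-offset load/store word")
    sign = 1 if (word >> 23) & 1 else -1   # U bit: add or subtract offset
    return {
        "load": bool((word >> 20) & 1),    # L bit: 1 = LDR, 0 = STR
        "byte": bool((word >> 22) & 1),    # B bit: 1 = byte, 0 = word
        "rn": (word >> 16) & 0xF,          # base register
        "rd": (word >> 12) & 0xF,          # destination register
        "offset": sign * (word & 0xFFF),   # signed 12-bit immediate
    }


# Values copied from the oops: Code: ... (e5953004), r5 : 01000081
insn = decode_ldr_imm(0xE5953004)
r5 = 0x01000081
fault = r5 + insn["offset"]

print(f"ldr r{insn['rd']}, [r{insn['rn']}, #{insn['offset']}]")
print(f"computed fault address: {fault:#010x}")
```

Under those assumptions the word decodes to a 4-byte load through r5
with an offset of 4, and r5 + 4 matches the reported address 01000085.
The base register already held 0x01000081, so the bad value was in the
pointer itself rather than produced by the offset.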
--
Florian

2022-08-15 07:17:19

by Maxime Ripard

Subject: Re: Kernel Panic in skb_release_data using genet

Hi Florian,

On Thu, Aug 11, 2022 at 08:33:58PM -0700, Florian Fainelli wrote:
>
>
> On 5/17/2022 12:52 AM, Maxime Ripard wrote:
> > It's not really 100% reliable, but happens 30%-50% of the time at boot
> > when KASAN is enabled. It seems like enabling KASAN increases that
> > likelihood though, it went unnoticed for some time before I started
> > having those issues again when I enabled it for something unrelated.
> >
> > It looks like it happens in bursts though, so I would get 10-15 boots
> > fine, and then 4-5 boots with that crash.
> >
> > Cold boot vs reboot doesn't seem to affect it in one way or the other.
> >
> > > What version of GCC did you build your kernel with?
> >
> > The arm64 cross-compiler packaged by Fedora, which is GCC 11.2
> > at the moment.
> >
> > > How often does that happen? What config.txt file are you using
> > > for your Pi4 B?
> >
> > You'll find my config.txt and kernel .config attached
>
> OK, so this is what I have been able to reproduce so far, although I
> cannot reproduce it reliably. I will try my best to hold on to that
> lead though; thanks for your patience.
>
> # udhcpc -i eth0
> udhcpc: started, v1.35.0
> [ 34.355086] bcmgenet fd580000.ethernet: configuring instance for external
> RGMII (RX delay)
> [ 34.363758]
> ==================================================================
> [ 34.371106] BUG: KASAN: user-memory-access in put_page+0x10/0x64
> [ 34.377227] Read of size 4 at addr 01000085 by task ifconfig/165
> [ 34.383338]
> [ 34.384857] CPU: 0 PID: 165 Comm: ifconfig Tainted: G W 5.19.0
> #43
> [ 34.392560] Hardware name: BCM2711
> [ 34.396020] unwind_backtrace from show_stack+0x18/0x1c
> [ 34.401354] show_stack from dump_stack_lvl+0x40/0x4c
> [ 34.406502] dump_stack_lvl from kasan_report+0x8c/0xa4
> [ 34.411825] kasan_report from put_page+0x10/0x64
> [ 34.416615] put_page from skb_release_data+0x84/0x13c
> [ 34.421847] skb_release_data from __kfree_skb+0x14/0x20
> [ 34.427256] __kfree_skb from bcmgenet_rx_poll+0x504/0x6f8
> [ 34.432846] bcmgenet_rx_poll from __napi_poll.constprop.0+0x50/0x1c0
> [ 34.439407] __napi_poll.constprop.0 from net_rx_action+0x278/0x488
> [ 34.445787] net_rx_action from __do_softirq+0x268/0x390
> [ 34.451197] __do_softirq from __irq_exit_rcu+0x88/0xf8
> [ 34.456521] __irq_exit_rcu from irq_exit+0x10/0x18
> [ 34.461492] irq_exit from call_with_stack+0x18/0x20
> [ 34.466553] call_with_stack from __irq_svc+0x84/0x94
> [ 34.471696] Exception stack(0xf0d337f8 to 0xf0d33840)

It looks fairly close indeed.

There are a few notable differences though (user-memory-access vs
wild-memory-access, and the read size). The type of memory access error
could simply depend on which bogus address we happen to try to access,
and the read of 4 vs 8 bytes could be because you're running on ARM and
I'm running on arm64?
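
To illustrate the first point: as far as I understand the generic KASAN
report code, when the faulting address has no shadow memory behind it,
the label is picked purely from where the address falls relative to
PAGE_SIZE and TASK_SIZE. Here is a rough sketch of that logic in Python.
The PAGE_SIZE and TASK_SIZE values are assumptions for a 32-bit ARM
kernel with a 3G/1G split, not values taken from the attached .config,
and classify_bad_address is a made-up name.

```python
PAGE_SIZE = 0x1000        # assumed 4 KiB pages
TASK_SIZE = 0xBF000000    # assumed top of user space (32-bit ARM, 3G/1G split)


def classify_bad_address(addr, page_size=PAGE_SIZE, task_size=TASK_SIZE):
    """Return the label for an address that has no KASAN shadow memory."""
    if addr < page_size:
        return "null-ptr-deref"
    if addr < task_size:
        return "user-memory-access"
    return "wild-memory-access"


# 0x01000085 is the address from Florian's report; the other two are
# arbitrary values chosen to hit the remaining branches.
for addr in (0x00000010, 0x01000085, 0xDEAD0000):
    print(f"{addr:#010x}: {classify_bad_address(addr)}")
```

If that reading is right, the same stray pointer would be reported as
user-memory-access or wild-memory-access depending only on its value, so
the difference in label does not by itself point to a different bug.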

Thanks again for looking into it

Maxime
