2022-04-27 08:05:20

by H. Nikolaus Schaller

[permalink] [raw]
Subject: kernel panic with v5.18-rc1 on OpenPandora (only)

Hi Tony,
I ran across a new issue on the OpenPandora (omap3530) first appearing with v5.18-rc1.
It seems as if there is something happening within the omap3 L3 irq handler which
is used (only?) for the wl1251. And this triggers the timeout BUG_ON() in
omap3_l3_app_irq().

I have not seen this issue on the GTA04.

It goes away if I remove the wlan interrupt from omap3-pandora-common.dtsi.
Interestingly, removing the wl1251.ko does NOT stop it. So it is not the driver.

git bisect reported:

a1c510d0adc604bb143c86052bc5be48cbcfa17c is the first bad commit
commit a1c510d0adc604bb143c86052bc5be48cbcfa17c
Author: Ard Biesheuvel <[email protected]>
Date: Thu Sep 23 09:15:53 2021 +0200

ARM: implement support for vmap'ed stacks

Any ideas?

BR and thanks,
Nikolaus

...
[ 27.357666] omap_hsmmc 480ad000.mmc: found wl1251
[ 27.465332] OF: graph: no port node found in /ocp@68000000/isp@480bc000/ports
[ 27.535552] wl1251: using dedicated interrupt line
[ 27.597839] usb0: HOST MAC 32:70:05:18:ff:78
[ 27.751281] ------------[ cut here ]------------
[ 27.756164] kernel BUG at drivers/bus/omap_l3_smx.c:177!
[ 27.761749] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[ 27.767913] Modules linked in: omap3_isp(+) v4l2_fwnode v4l2_async wl1251_sdio(+) videobuf2_dma_contig videobuf2_memops wl1251 usb_f_ecm videobuf2_v4l2 videobuf2_common bq27xxx_battery_hdq omap_sham videodev mac80211 mc omap_aes_driver libarc4 crypto_engine omap_crypto g_ether usb_f_rndis u_ether omap_hdq libcomposite configfs panel_tpo_td043mtea1 omap2430(+) phy_twl4030_usb musb_hdrc twl4030_pwrbutton twl4030_charger twl4030_madc snd_soc_twl4030 industrialio twl4030_keypad ehci_omap bq27xxx_battery_i2c bq27xxx_battery pandora_nub omapdrm cfg80211 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm drm_panel_orientation_quirks cec ip_tables x_tables ipv6 autofs4
[ 27.831207] CPU: 0 PID: 1812 Comm: systemd-udevd Not tainted 5.16.0-rc1-letux+ #9435
[ 27.839355] Hardware name: Generic OMAP3 (Flattened Device Tree)
[ 27.845672] PC is at omap3_l3_app_irq+0x3c/0x120
[ 27.850555] LR is at __handle_irq_event_percpu+0xb0/0x1f8
[ 27.856262] pc : [<c053f5a4>] lr : [<c017d36c>] psr: 20070193
[ 27.862854] sp : e0001eb0 ip : 0dcf2f92 fp : c0d04014
[ 27.868377] r10: e0001f00 r9 : c1032200 r8 : 00000000
[ 27.873870] r7 : e0001f84 r6 : 0000001a r5 : 00000000 r4 : f8000000
[ 27.880767] r3 : 00000000 r2 : 00080000 r1 : 00080000 r0 : 00080000
[ 27.887634] Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none
[ 27.895263] Control: 10c5387d Table: 829a8019 DAC: 00000051
[ 27.901306] Register r0 information: non-paged memory
[ 27.906646] Register r1 information: non-paged memory
[ 27.911987] Register r2 information: non-paged memory
[ 27.917297] Register r3 information: NULL pointer
[ 27.922271] Register r4 information: 0-page vmalloc region starting at 0xf8000000 allocated at iotable_init+0x0/0xf4
[ 27.933380] Register r5 information: NULL pointer
[ 27.938354] Register r6 information: non-paged memory
[ 27.943695] Register r7 information: 2-page vmalloc region starting at 0xe0000000 allocated at start_kernel+0x5c0/0x918
[ 27.955047] Register r8 information: NULL pointer
[ 27.959991] Register r9 information: slab kmalloc-256 start c1032200 pointer offset 0 size 256
[ 27.969085] Register r10 information: 2-page vmalloc region starting at 0xe0000000 allocated at start_kernel+0x5c0/0x918
[ 27.980529] Register r11 information: non-slab/vmalloc memory
[ 27.986572] Register r12 information: non-paged memory
[ 27.991973] Process systemd-udevd (pid: 1812, stack limit = 0x9cb9744d)
[ 27.998931] Stack: (0xe0001eb0 to 0xe0002000)
[ 28.003509] 1ea0: 0000001a c111a880 00000000 c111a800
[ 28.012115] 1ec0: c1032200 0000001a e0001f84 c017d36c c1032200 e0001f00 00000000 c0d03fc8
[ 28.020721] 1ee0: c1032200 c1032218 e0001f84 c0c79374 c29a15c0 c0deb720 e0001f50 c017d4d8
[ 28.029296] 1f00: 00000000 ea298583 c1032200 c103226c c1032218 c017d55c c1032200 c103226c
[ 28.037902] 1f20: c1032218 c0181c64 c0c7f684 e0001f50 00000000 c017ccbc c0c7f684 c0901018
[ 28.046508] 1f40: c0101314 60070113 ffffffff c0100b60 c29a15c0 c0c805c0 00000000 1ed14000
[ 28.055114] 1f60: c0c805c0 c0c805c0 00000040 c0d02d00 c0c79374 c29a15c0 c0deb720 c0d02080
[ 28.063690] 1f80: 0dcf2f92 e0001fa0 c01012e0 c0101314 60070113 ffffffff 00000051 c01012e0
[ 28.072265] 1fa0: c29a15c0 e01ddb8c e01dda50 00400140 0000000a ffff95a7 c1005c50 c0c805c0
[ 28.080871] 1fc0: c0d02d00 c0c805c0 c29a15c0 c29a15c0 60070013 ffffffff e01dda84 e01ddab0
[ 28.089447] 1fe0: c29a15c0 e01ddb8c e01dda50 c0136f10 c0905de0 c0137008 c0905de0 c0100b60
[ 28.098052] [<c053f5a4>] (omap3_l3_app_irq) from [<c017d36c>] (__handle_irq_event_percpu+0xb0/0x1f8)
[ 28.107666] [<c017d36c>] (__handle_irq_event_percpu) from [<c017d4d8>] (handle_irq_event_percpu+0x24/0x70)
[ 28.117797] [<c017d4d8>] (handle_irq_event_percpu) from [<c017d55c>] (handle_irq_event+0x38/0x5c)
[ 28.127136] [<c017d55c>] (handle_irq_event) from [<c0181c64>] (handle_level_irq+0x7c/0xb4)
[ 28.135864] [<c0181c64>] (handle_level_irq) from [<c017ccbc>] (handle_irq_desc+0x1c/0x2c)
[ 28.144470] [<c017ccbc>] (handle_irq_desc) from [<c0901018>] (generic_handle_arch_irq+0x2c/0x44)
[ 28.153717] [<c0901018>] (generic_handle_arch_irq) from [<c0100b60>] (__irq_svc+0x80/0xb8)
[ 28.162414] Exception stack(0xe0001fa0 to 0xe0001fe8)
[ 28.167755] 1fa0: c29a15c0 e01ddb8c e01dda50 00400140 0000000a ffff95a7 c1005c50 c0c805c0
[ 28.176330] 1fc0: c0d02d00 c0c805c0 c29a15c0 c29a15c0 60070013 ffffffff e01dda84 e01ddab0
[ 28.184906] 1fe0: c29a15c0 e01ddb8c
[ 28.188598] [<c0100b60>] (__irq_svc) from [<c0101314>] (__do_softirq+0x84/0x304)
[ 28.196380] [<c0101314>] (__do_softirq) from [<c0136f10>] (__irq_exit_rcu+0x8c/0xd4)
[ 28.204528] [<c0136f10>] (__irq_exit_rcu) from [<c0137008>] (irq_exit+0x8/0x10)
[ 28.212219] [<c0137008>] (irq_exit) from [<c0100b60>] (__irq_svc+0x80/0xb8)
[ 28.219543] Exception stack(0xe01ddaa0 to 0xe01ddae8)
[ 28.224884] daa0: 7fffffff c0902e94 00000004 00000000 c29a15c0 e01ddb90 e01ddb90 ea298583
[ 28.233459] dac0: 00000000 e01ddb8c e01ddb88 7fffffff 00000002 00000010 00000100 c202a000
[ 28.242065] dae0: 97ff8010 c0902f54
[ 28.245727] [<c0100b60>] (__irq_svc) from [<c0905de0>] (_raw_spin_unlock_irq+0x20/0x4c)
[ 28.254150] [<c0905de0>] (_raw_spin_unlock_irq) from [<c0902e94>] (do_wait_for_common+0xa8/0x138)
[ 28.263488] [<c0902e94>] (do_wait_for_common) from [<c0902f54>] (wait_for_common+0x30/0x48)
[ 28.272277] [<c0902f54>] (wait_for_common) from [<c074edf8>] (mmc_wait_for_req_done+0x1c/0x90)
[ 28.281341] [<c074edf8>] (mmc_wait_for_req_done) from [<c075a72c>] (mmc_io_rw_extended+0x1c0/0x2f4)
[ 28.290893] [<c075a72c>] (mmc_io_rw_extended) from [<c075bd00>] (sdio_io_rw_ext_helper+0x118/0x140)
[ 28.300415] [<c075bd00>] (sdio_io_rw_ext_helper) from [<c075bdd0>] (sdio_memcpy_toio+0x18/0x20)
[ 28.309570] [<c075bdd0>] (sdio_memcpy_toio) from [<bf3de1ec>] (wl1251_sdio_write+0x34/0x54 [wl1251_sdio])
[ 28.319702] [<bf3de1ec>] (wl1251_sdio_write [wl1251_sdio]) from [<bf40fc8c>] (wl1251_set_partition+0x90/0x404 [wl1251])
[ 28.331207] [<bf40fc8c>] (wl1251_set_partition [wl1251]) from [<bf4074ec>] (wl1251_init_ieee80211+0x1c0/0x3dc [wl1251])
[ 28.342712] [<bf4074ec>] (wl1251_init_ieee80211 [wl1251]) from [<bf3de474>] (wl1251_sdio_probe+0x1a0/0x250 [wl1251_sdio])
[ 28.354339] [<bf3de474>] (wl1251_sdio_probe [wl1251_sdio]) from [<c075aee8>] (sdio_bus_probe+0x128/0x138)
[ 28.364410] [<c075aee8>] (sdio_bus_probe) from [<c061c7d4>] (really_probe+0x170/0x2fc)
[ 28.372741] [<c061c7d4>] (really_probe) from [<c061ca24>] (__driver_probe_device+0xc4/0xd8)
[ 28.381530] [<c061ca24>] (__driver_probe_device) from [<c061ca68>] (driver_probe_device+0x30/0xac)
[ 28.390960] [<c061ca68>] (driver_probe_device) from [<c061ce98>] (__driver_attach+0xc4/0xd8)
[ 28.399841] [<c061ce98>] (__driver_attach) from [<c061acec>] (bus_for_each_dev+0x64/0xa0)
[ 28.408447] [<c061acec>] (bus_for_each_dev) from [<c061bc2c>] (bus_add_driver+0x148/0x1a4)
[ 28.417144] [<c061bc2c>] (bus_add_driver) from [<c061d700>] (driver_register+0xb4/0xf8)
[ 28.425567] [<c061d700>] (driver_register) from [<bf29d00c>] (wl1251_sdio_init+0xc/0x1000 [wl1251_sdio])
[ 28.435577] [<bf29d00c>] (wl1251_sdio_init [wl1251_sdio]) from [<c0101f94>] (do_one_initcall+0x90/0x1c8)
[ 28.445587] [<c0101f94>] (do_one_initcall) from [<c08f4e00>] (do_init_module+0x4c/0x204)
[ 28.454132] [<c08f4e00>] (do_init_module) from [<c01b5148>] (load_module+0x13f0/0x1928)
[ 28.462524] [<c01b5148>] (load_module) from [<c01b58b8>] (sys_finit_module+0xa0/0xc0)
[ 28.470794] [<c01b58b8>] (sys_finit_module) from [<c0100270>] (__sys_trace_return+0x0/0x10)
[ 28.479583] Code: e0000002 e0011003 e1901001 0a000002 (e7f001f2)
[ 28.485992] ---[ end trace 6855b0c4a2214070 ]---
[ 28.490844] Kernel panic - not syncing: Fatal exception in interrupt
[ 28.497528] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---


2022-04-27 08:55:52

by Tony Lindgren

[permalink] [raw]
Subject: Re: kernel panic with v5.18-rc1 on OpenPandora (only)

Hi,

* H. Nikolaus Schaller <[email protected]> [220426 20:16]:
> Hi Tony,
> I ran across a new issue on the OpenPandora (omap3530) first appearing with v5.18-rc1.
> It seems as if there is something happening within the omap3 L3 irq handler which
> is used (only?) for the wl1251. And this triggers the timeout BUG_ON() in
> omap3_l3_app_irq().
>
> I have not seen this issue on the GTA04.
>
> It goes away if I remove the wlan interrupt from omap3-pandora-common.dtsi.
> Interestingly, removing the wl1251.ko does NOT stop it. So it is not the driver.
>
> git bisect reported:
>
> a1c510d0adc604bb143c86052bc5be48cbcfa17c is the first bad commit
> commit a1c510d0adc604bb143c86052bc5be48cbcfa17c
> Author: Ard Biesheuvel <[email protected]>
> Date: Thu Sep 23 09:15:53 2021 +0200
>
> ARM: implement support for vmap'ed stacks
>
> Any ideas?

We had to add commit 8cf8df89678a ("ARM: OMAP2+: Fix regression for smc
calls for vmap stack") to fix vmap related issues in case you have not
seen that one yet. This seems different though, it's like the L3 interrupt
handler is not reading the right register?

Not sure why the L3 interrupt is triggering, that could be caused by
another issue with the wl1251 somewhere.

Ard and Arnd, any ideas?

Regards,

Tony

> ...
> [ 27.357666] omap_hsmmc 480ad000.mmc: found wl1251
> [ 27.465332] OF: graph: no port node found in /ocp@68000000/isp@480bc000/ports
> [ 27.535552] wl1251: using dedicated interrupt line
> [ 27.597839] usb0: HOST MAC 32:70:05:18:ff:78
> [ 27.751281] ------------[ cut here ]------------
> [ 27.756164] kernel BUG at drivers/bus/omap_l3_smx.c:177!
> [ 27.761749] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
> [ 27.767913] Modules linked in: omap3_isp(+) v4l2_fwnode v4l2_async wl1251_sdio(+) videobuf2_dma_contig videobuf2_memops wl1251 usb_f_ecm videobuf2_v4l2 videobuf2_common bq27xxx_battery_hdq omap_sham videodev mac80211 mc omap_aes_driver libarc4 crypto_engine omap_crypto g_ether usb_f_rndis u_ether omap_hdq libcomposite configfs panel_tpo_td043mtea1 omap2430(+) phy_twl4030_usb musb_hdrc twl4030_pwrbutton twl4030_charger twl4030_madc snd_soc_twl4030 industrialio twl4030_keypad ehci_omap bq27xxx_battery_i2c bq27xxx_battery pandora_nub omapdrm cfg80211 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm drm_panel_orientation_quirks cec ip_tables x_tables ipv6 autofs4
> [ 27.831207] CPU: 0 PID: 1812 Comm: systemd-udevd Not tainted 5.16.0-rc1-letux+ #9435
> [ 27.839355] Hardware name: Generic OMAP3 (Flattened Device Tree)
> [ 27.845672] PC is at omap3_l3_app_irq+0x3c/0x120
> [ 27.850555] LR is at __handle_irq_event_percpu+0xb0/0x1f8
> [ 27.856262] pc : [<c053f5a4>] lr : [<c017d36c>] psr: 20070193
> [ 27.862854] sp : e0001eb0 ip : 0dcf2f92 fp : c0d04014
> [ 27.868377] r10: e0001f00 r9 : c1032200 r8 : 00000000
> [ 27.873870] r7 : e0001f84 r6 : 0000001a r5 : 00000000 r4 : f8000000
> [ 27.880767] r3 : 00000000 r2 : 00080000 r1 : 00080000 r0 : 00080000
> [ 27.887634] Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none
> [ 27.895263] Control: 10c5387d Table: 829a8019 DAC: 00000051
> [ 27.901306] Register r0 information: non-paged memory
> [ 27.906646] Register r1 information: non-paged memory
> [ 27.911987] Register r2 information: non-paged memory
> [ 27.917297] Register r3 information: NULL pointer
> [ 27.922271] Register r4 information: 0-page vmalloc region starting at 0xf8000000 allocated at iotable_init+0x0/0xf4
> [ 27.933380] Register r5 information: NULL pointer
> [ 27.938354] Register r6 information: non-paged memory
> [ 27.943695] Register r7 information: 2-page vmalloc region starting at 0xe0000000 allocated at start_kernel+0x5c0/0x918
> [ 27.955047] Register r8 information: NULL pointer
> [ 27.959991] Register r9 information: slab kmalloc-256 start c1032200 pointer offset 0 size 256
> [ 27.969085] Register r10 information: 2-page vmalloc region starting at 0xe0000000 allocated at start_kernel+0x5c0/0x918
> [ 27.980529] Register r11 information: non-slab/vmalloc memory
> [ 27.986572] Register r12 information: non-paged memory
> [ 27.991973] Process systemd-udevd (pid: 1812, stack limit = 0x9cb9744d)
> [ 27.998931] Stack: (0xe0001eb0 to 0xe0002000)
> [ 28.003509] 1ea0: 0000001a c111a880 00000000 c111a800
> [ 28.012115] 1ec0: c1032200 0000001a e0001f84 c017d36c c1032200 e0001f00 00000000 c0d03fc8
> [ 28.020721] 1ee0: c1032200 c1032218 e0001f84 c0c79374 c29a15c0 c0deb720 e0001f50 c017d4d8
> [ 28.029296] 1f00: 00000000 ea298583 c1032200 c103226c c1032218 c017d55c c1032200 c103226c
> [ 28.037902] 1f20: c1032218 c0181c64 c0c7f684 e0001f50 00000000 c017ccbc c0c7f684 c0901018
> [ 28.046508] 1f40: c0101314 60070113 ffffffff c0100b60 c29a15c0 c0c805c0 00000000 1ed14000
> [ 28.055114] 1f60: c0c805c0 c0c805c0 00000040 c0d02d00 c0c79374 c29a15c0 c0deb720 c0d02080
> [ 28.063690] 1f80: 0dcf2f92 e0001fa0 c01012e0 c0101314 60070113 ffffffff 00000051 c01012e0
> [ 28.072265] 1fa0: c29a15c0 e01ddb8c e01dda50 00400140 0000000a ffff95a7 c1005c50 c0c805c0
> [ 28.080871] 1fc0: c0d02d00 c0c805c0 c29a15c0 c29a15c0 60070013 ffffffff e01dda84 e01ddab0
> [ 28.089447] 1fe0: c29a15c0 e01ddb8c e01dda50 c0136f10 c0905de0 c0137008 c0905de0 c0100b60
> [ 28.098052] [<c053f5a4>] (omap3_l3_app_irq) from [<c017d36c>] (__handle_irq_event_percpu+0xb0/0x1f8)
> [ 28.107666] [<c017d36c>] (__handle_irq_event_percpu) from [<c017d4d8>] (handle_irq_event_percpu+0x24/0x70)
> [ 28.117797] [<c017d4d8>] (handle_irq_event_percpu) from [<c017d55c>] (handle_irq_event+0x38/0x5c)
> [ 28.127136] [<c017d55c>] (handle_irq_event) from [<c0181c64>] (handle_level_irq+0x7c/0xb4)
> [ 28.135864] [<c0181c64>] (handle_level_irq) from [<c017ccbc>] (handle_irq_desc+0x1c/0x2c)
> [ 28.144470] [<c017ccbc>] (handle_irq_desc) from [<c0901018>] (generic_handle_arch_irq+0x2c/0x44)
> [ 28.153717] [<c0901018>] (generic_handle_arch_irq) from [<c0100b60>] (__irq_svc+0x80/0xb8)
> [ 28.162414] Exception stack(0xe0001fa0 to 0xe0001fe8)
> [ 28.167755] 1fa0: c29a15c0 e01ddb8c e01dda50 00400140 0000000a ffff95a7 c1005c50 c0c805c0
> [ 28.176330] 1fc0: c0d02d00 c0c805c0 c29a15c0 c29a15c0 60070013 ffffffff e01dda84 e01ddab0
> [ 28.184906] 1fe0: c29a15c0 e01ddb8c
> [ 28.188598] [<c0100b60>] (__irq_svc) from [<c0101314>] (__do_softirq+0x84/0x304)
> [ 28.196380] [<c0101314>] (__do_softirq) from [<c0136f10>] (__irq_exit_rcu+0x8c/0xd4)
> [ 28.204528] [<c0136f10>] (__irq_exit_rcu) from [<c0137008>] (irq_exit+0x8/0x10)
> [ 28.212219] [<c0137008>] (irq_exit) from [<c0100b60>] (__irq_svc+0x80/0xb8)
> [ 28.219543] Exception stack(0xe01ddaa0 to 0xe01ddae8)
> [ 28.224884] daa0: 7fffffff c0902e94 00000004 00000000 c29a15c0 e01ddb90 e01ddb90 ea298583
> [ 28.233459] dac0: 00000000 e01ddb8c e01ddb88 7fffffff 00000002 00000010 00000100 c202a000
> [ 28.242065] dae0: 97ff8010 c0902f54
> [ 28.245727] [<c0100b60>] (__irq_svc) from [<c0905de0>] (_raw_spin_unlock_irq+0x20/0x4c)
> [ 28.254150] [<c0905de0>] (_raw_spin_unlock_irq) from [<c0902e94>] (do_wait_for_common+0xa8/0x138)
> [ 28.263488] [<c0902e94>] (do_wait_for_common) from [<c0902f54>] (wait_for_common+0x30/0x48)
> [ 28.272277] [<c0902f54>] (wait_for_common) from [<c074edf8>] (mmc_wait_for_req_done+0x1c/0x90)
> [ 28.281341] [<c074edf8>] (mmc_wait_for_req_done) from [<c075a72c>] (mmc_io_rw_extended+0x1c0/0x2f4)
> [ 28.290893] [<c075a72c>] (mmc_io_rw_extended) from [<c075bd00>] (sdio_io_rw_ext_helper+0x118/0x140)
> [ 28.300415] [<c075bd00>] (sdio_io_rw_ext_helper) from [<c075bdd0>] (sdio_memcpy_toio+0x18/0x20)
> [ 28.309570] [<c075bdd0>] (sdio_memcpy_toio) from [<bf3de1ec>] (wl1251_sdio_write+0x34/0x54 [wl1251_sdio])
> [ 28.319702] [<bf3de1ec>] (wl1251_sdio_write [wl1251_sdio]) from [<bf40fc8c>] (wl1251_set_partition+0x90/0x404 [wl1251])
> [ 28.331207] [<bf40fc8c>] (wl1251_set_partition [wl1251]) from [<bf4074ec>] (wl1251_init_ieee80211+0x1c0/0x3dc [wl1251])
> [ 28.342712] [<bf4074ec>] (wl1251_init_ieee80211 [wl1251]) from [<bf3de474>] (wl1251_sdio_probe+0x1a0/0x250 [wl1251_sdio])
> [ 28.354339] [<bf3de474>] (wl1251_sdio_probe [wl1251_sdio]) from [<c075aee8>] (sdio_bus_probe+0x128/0x138)
> [ 28.364410] [<c075aee8>] (sdio_bus_probe) from [<c061c7d4>] (really_probe+0x170/0x2fc)
> [ 28.372741] [<c061c7d4>] (really_probe) from [<c061ca24>] (__driver_probe_device+0xc4/0xd8)
> [ 28.381530] [<c061ca24>] (__driver_probe_device) from [<c061ca68>] (driver_probe_device+0x30/0xac)
> [ 28.390960] [<c061ca68>] (driver_probe_device) from [<c061ce98>] (__driver_attach+0xc4/0xd8)
> [ 28.399841] [<c061ce98>] (__driver_attach) from [<c061acec>] (bus_for_each_dev+0x64/0xa0)
> [ 28.408447] [<c061acec>] (bus_for_each_dev) from [<c061bc2c>] (bus_add_driver+0x148/0x1a4)
> [ 28.417144] [<c061bc2c>] (bus_add_driver) from [<c061d700>] (driver_register+0xb4/0xf8)
> [ 28.425567] [<c061d700>] (driver_register) from [<bf29d00c>] (wl1251_sdio_init+0xc/0x1000 [wl1251_sdio])
> [ 28.435577] [<bf29d00c>] (wl1251_sdio_init [wl1251_sdio]) from [<c0101f94>] (do_one_initcall+0x90/0x1c8)
> [ 28.445587] [<c0101f94>] (do_one_initcall) from [<c08f4e00>] (do_init_module+0x4c/0x204)
> [ 28.454132] [<c08f4e00>] (do_init_module) from [<c01b5148>] (load_module+0x13f0/0x1928)
> [ 28.462524] [<c01b5148>] (load_module) from [<c01b58b8>] (sys_finit_module+0xa0/0xc0)
> [ 28.470794] [<c01b58b8>] (sys_finit_module) from [<c0100270>] (__sys_trace_return+0x0/0x10)
> [ 28.479583] Code: e0000002 e0011003 e1901001 0a000002 (e7f001f2)
> [ 28.485992] ---[ end trace 6855b0c4a2214070 ]---
> [ 28.490844] Kernel panic - not syncing: Fatal exception in interrupt
> [ 28.497528] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>

2022-04-27 11:27:59

by Tony Lindgren

[permalink] [raw]
Subject: Re: kernel panic with v5.18-rc1 on OpenPandora (only)

* Tony Lindgren <[email protected]> [220427 08:40]:
> Hi,
>
> * H. Nikolaus Schaller <[email protected]> [220426 20:16]:
> > Hi Tony,
> > I ran across a new issue on the OpenPandora (omap3530) first appearing with v5.18-rc1.
> > It seems as if there is something happening within the omap3 L3 irq handler which
> > is used (only?) for the wl1251. And this triggers the timeout BUG_ON() in
> > omap3_l3_app_irq().
> >
> > I have not seen this issue on the GTA04.
> >
> > It goes away if I remove the wlan interrupt from omap3-pandora-common.dtsi.
> > Interestingly, removing the wl1251.ko does NOT stop it. So it is not the driver.
> >
> > git bisect reported:
> >
> > a1c510d0adc604bb143c86052bc5be48cbcfa17c is the first bad commit
> > commit a1c510d0adc604bb143c86052bc5be48cbcfa17c
> > Author: Ard Biesheuvel <[email protected]>
> > Date: Thu Sep 23 09:15:53 2021 +0200
> >
> > ARM: implement support for vmap'ed stacks
> >
> > Any ideas?
>
> We had to add commit 8cf8df89678a ("ARM: OMAP2+: Fix regression for smc
> calls for vmap stack") to fix vmap related issues in case you have not
> seen that one yet. This seems different though, it's like the L3 interrupt
> handler is not reading the right register?
>
> Not sure why the L3 interrupt is triggering, that could be caused by
> another issue with the wl1251 somewhere.

Can you maybe try to temporarily disable the L3 driver and see if the stack
trace is more accurate on what goes wrong? Just comment out the line for
postcore_initcall_sync(omap3_l3_init) in drivers/bus/omap_l3_smx.c. The
system will hang on the invalid access rather than triggering the L3
error first.

Regards,

Tony

2022-04-27 11:44:19

by Arnd Bergmann

[permalink] [raw]
Subject: Re: kernel panic with v5.18-rc1 on OpenPandora (only)

On Wed, Apr 27, 2022 at 10:38 AM Tony Lindgren <[email protected]> wrote:
> * H. Nikolaus Schaller <[email protected]> [220426 20:16]:
> > [ 28.245727] [<c0100b60>] (__irq_svc) from [<c0905de0>] (_raw_spin_unlock_irq+0x20/0x4c)
> > [ 28.254150] [<c0905de0>] (_raw_spin_unlock_irq) from [<c0902e94>] (do_wait_for_common+0xa8/0x138)
> > [ 28.263488] [<c0902e94>] (do_wait_for_common) from [<c0902f54>] (wait_for_common+0x30/0x48)
> > [ 28.272277] [<c0902f54>] (wait_for_common) from [<c074edf8>] (mmc_wait_for_req_done+0x1c/0x90)
> > [ 28.281341] [<c074edf8>] (mmc_wait_for_req_done) from [<c075a72c>] (mmc_io_rw_extended+0x1c0/0x2f4)
> > [ 28.290893] [<c075a72c>] (mmc_io_rw_extended) from [<c075bd00>] (sdio_io_rw_ext_helper+0x118/0x140)
> > [ 28.300415] [<c075bd00>] (sdio_io_rw_ext_helper) from [<c075bdd0>] (sdio_memcpy_toio+0x18/0x20)
> > [ 28.309570] [<c075bdd0>] (sdio_memcpy_toio) from [<bf3de1ec>] (wl1251_sdio_write+0x34/0x54 [wl1251_sdio])
> > [ 28.319702] [<bf3de1ec>] (wl1251_sdio_write [wl1251_sdio]) from [<bf40fc8c>] (wl1251_set_partition+0x90/0x404 [wl1251])
> > [ 28.331207] [<bf40fc8c>] (wl1251_set_partition [wl1251]) from [<bf4074ec>] (wl1251_init_ieee80211+0x1c0/0x3dc [wl1251])

I think the problem is here: wl1251_set_partition() passes a local
stack variable into
an SDIO API function that is given to the hardware. This was never
safe and could
cause a corrupted stack because of the cache management, but with vmap stacks
it turns into a reliable DMA error, which I guess is what the l3
interrupt is about.

Can you change wl1251_set_partition() to use kmalloc()/kfree() to allocate the
partitions[] array? You said that it still crashes without the wl1251
driver, so I assume
there is at least one more related bug. If you get a different call
chain without the
driver, or with the kmalloc() call, can you post that as well?

Arnd

2022-05-02 20:12:24

by H. Nikolaus Schaller

[permalink] [raw]
Subject: Re: kernel panic with v5.18-rc1 on OpenPandora (only)

Hi Arnd,

> Am 30.04.2022 um 17:36 schrieb Arnd Bergmann <[email protected]>:
>
> On Sat, Apr 30, 2022 at 3:16 PM H. Nikolaus Schaller <[email protected]> wrote:
>>> Am 27.04.2022 um 11:37 schrieb Arnd Bergmann <[email protected]>:
>>> On Wed, Apr 27, 2022 at 10:38 AM Tony Lindgren <[email protected]> wrote:
>>
>>> You said that it still crashes without the wl1251
>>> driver, so I assume
>>> there is at least one more related bug. If you get a different call
>>> chain without the
>>> driver, or with the kmalloc() call, can you post that as well?
>>
>> For some time it did disappear but reported e.g.
>>
>> [ 29.457946] omap_hsmmc 480ad000.mmc: found wl1251
>> [ 29.516174] wl1251: ERROR unsupported chip id: 0xdb0aea56
>
> That does point to invalid DMA addresses.

Ok.

>
>> (value did change a little randomly), but just before sending out this mail I tried
>> again (now with v5.18-rc4) and got this log (with wl1251 driver fixed as below):
>>
>> [ 31.069580] 1ec0: 00000000 c11198c0 c103226c 0000001a 00000000 c017d654 c1032200 c37c0a40
>> [ 31.078155] 1ee0: 00000000 c1032200 c103226c c1032218 e0001f84 c0c77370 c37c0a40 c0def3c0
>> [ 31.086761] 1f00: c0d02080 c017d78c c1032200 c103226c c1032218 c017d7f0 c1032200 c103226c
>> [ 31.095336] 1f20: c1032218 c0181ee0 e0001f50 00000000 ffffffff c017cf6c e0001f50 c08b4d7c
>> [ 31.103942] 1f40: c01013c4 600f0113 ffffffff c0100bec c37c0a40 c0c7e6c0 00000000 1ed15000
>> [ 31.112548] 1f60: c0c7e6c0 c0c7e6c0 00000040 c0d02d00 c0c77370 c37c0a40 c0def3c0 c0d02080
>> [ 31.121154] 1f80: c0c7d850 e0001fa0 c0101390 c01013c4 600f0113 ffffffff 00000051 c0101390
>> [ 31.129730] 1fa0: e01b9e94 c37c0a40 c37c0a40 00400000 0000000a ffff96d9 c1037850 c0c7e6c0
>> [ 31.138336] 1fc0: c0d02d00 c0c7e6c0 c37c0a40 c37c0a40 600f0113 ffffffff e01b9e94 c37c0a40
>> [ 31.146911] 1fe0: c37c0a40 e01b9f60 e01b9e58 c0137314 c0158434 c013740c c0158434 c04c9c6c
>> [ 31.155517] omap3_l3_app_irq from __handle_irq_event_percpu+0xb0/0x1dc
>> [ 31.162475] __handle_irq_event_percpu from handle_irq_event_percpu+0xc/0x38
>> [ 31.169891] handle_irq_event_percpu from handle_irq_event+0x38/0x5c
>> [ 31.176605] handle_irq_event from handle_level_irq+0x7c/0xb4
>> [ 31.182647] handle_level_irq from handle_irq_desc+0x1c/0x2c
>> [ 31.188629] handle_irq_desc from generic_handle_arch_irq+0x28/0x3c
>> [ 31.195220] generic_handle_arch_irq from __irq_svc+0x8c/0xcc
>> [ 31.201263] Exception stack(0xe0001f50 to 0xe0001f98)
>> [ 31.206604] 1f40: c37c0a40 c0c7e6c0 00000000 1ed15000
>> [ 31.215179] 1f60: c0c7e6c0 c0c7e6c0 00000040 c0d02d00 c0c77370 c37c0a40 c0def3c0 c0d02080
>> [ 31.223785] 1f80: c0c7d850 e0001fa0 c0101390 c01013c4 600f0113 ffffffff
>> [ 31.230743] __irq_svc from __do_softirq+0x84/0x304
>> [ 31.235870] __do_softirq from __irq_exit_rcu+0x8c/0xd4
>> [ 31.241363] __irq_exit_rcu from irq_exit+0x8/0x10
>> [ 31.246429] irq_exit from call_with_stack+0x18/0x20
>> [ 31.251647] call_with_stack from __irq_svc+0x98/0xcc
>> [ 31.256988] Exception stack(0xe01b9e60 to 0xe01b9ea8)
>> [ 31.262298] 9e60: df993a40 c37c0a40 00000000 00000001 df993a40 c3245000 c133c2c0 00000002
>> [ 31.270904] 9e80: c37c0a40 00000000 e01b9f60 e01b9edc e01b9ee0 e01b9eb0 c08ba55c c0158434
>> [ 31.279479] 9ea0: 600f0113 ffffffff
>> [ 31.283172] __irq_svc from finish_task_switch+0x12c/0x1ec
>> [ 31.288940] finish_task_switch from __schedule+0x3cc/0x558
>> [ 31.294799] __schedule from schedule+0x70/0xc0
>> [ 31.299591] schedule from do_work_pending+0x30/0x3dc
>> [ 31.304901] do_work_pending from slow_work_pending+0xc/0x20
>> [ 31.310852] Exception stack(0xe01b9fb0 to 0xe01b9ff8)
>> [ 31.316192] 9fa0: 00002cf8 00000000 50000000 b6f99000
>> [ 31.324768] 9fc0: b6f9bcfc b6f9bcf8 00000000 00000000 00000010 00000000 00001e94 00000000
>> [ 31.333374] 9fe0: b6f9bcf8 bea66f80 b6f9bcfc 004bfc6a 40070030 ffffffff
>> [ 31.340332] Code: e0000002 e0011003 e1901001 0a000002 (e7f001f2)
>> [ 31.346740] ---[ end trace 0000000000000000 ]---
>
>
> I suppose this could be anywhere then. The backtrace seems to point
> to re-enabling interupts in do_work_pending, so something probably
> accessed DMA memory asynchronously.

Yes. I now (or still) sometimes see the same omap l3 irq issue when plugging in/out the USB/OTG
cable. Not with a kernel panic, but in the same driver omap_l3_smx.c.
This happens even if the wl1251 driver is removed.

root@letux:~# [ 1010.707519] In-band Error seen by USB_OTG at address 0
[ 1010.713195] ------------[ cut here ]------------
[ 1010.718109] WARNING: CPU: 0 PID: 1860 at drivers/bus/omap_l3_smx.c:152 omap3_l3_app_irq+0xdc/0x120
[ 1010.727691] Modules linked in: pvrsrvkm_omap3_sgx530_121 snd_soc_pcm1773 snd_soc_omap3pandora leds_gpio pandora_bl display_connector snd_soc_omap_mcbsp snd_soc_ti_sdma omap_sham omap3_isp usb_f_ecm videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common v4l2_fwnode v4l2_async videodev bq27xxx_battery_hdq mc g_ether usb_f_rndis u_ether libcomposite configfs omap_aes_driver crypto_engine omap_crypto omap_hdq omap2430 panel_tpo_td043mtea1 ads7846 phy_twl4030_usb musb_hdrc twl4030_pwrbutton twl4030_madc twl4030_charger snd_soc_twl4030 industrialio twl4030_keypad bq27xxx_battery_i2c bq27xxx_battery pandora_nub ehci_omap omapdrm drm_kms_helper syscopyarea sysfillrect cfg80211 sysimgblt fb_sys_fops drm drm_panel_orientation_quirks cec ip_tables x_tables ipv6 autofs4
[ 1010.801666] CPU: 0 PID: 1860 Comm: kworker/0:5 Not tainted 5.18.0-rc4-letux+ #9499
[ 1010.809753] Hardware name: Generic OMAP3 (Flattened Device Tree)
[ 1010.816131] Workqueue: events omap_musb_mailbox_work [omap2430]
[ 1010.822570] unwind_backtrace from show_stack+0x10/0x14
[ 1010.828186] show_stack from dump_stack_lvl+0x40/0x4c
[ 1010.833648] dump_stack_lvl from __warn+0xb0/0x168
[ 1010.838806] __warn from warn_slowpath_fmt+0x70/0x9c
[ 1010.844146] warn_slowpath_fmt from omap3_l3_app_irq+0xdc/0x120
[ 1010.850494] omap3_l3_app_irq from __handle_irq_event_percpu+0xb0/0x1dc
[ 1010.857574] __handle_irq_event_percpu from handle_irq_event_percpu+0xc/0x38
[ 1010.865081] handle_irq_event_percpu from handle_irq_event+0x38/0x5c
[ 1010.871887] handle_irq_event from handle_level_irq+0x7c/0xb4
[ 1010.878051] handle_level_irq from handle_irq_desc+0x1c/0x2c
[ 1010.884124] handle_irq_desc from generic_handle_arch_irq+0x28/0x3c
[ 1010.890838] generic_handle_arch_irq from call_with_stack+0x18/0x20
[ 1010.897552] call_with_stack from __irq_svc+0x98/0xcc
[ 1010.902954] Exception stack(0xe0135de8 to 0xe0135e30)
[ 1010.908355] 5de0: fa0ab000 0000040c 00000001 00000000 c3708040 c36882c0
[ 1010.917083] 5e00: 00000000 c1247094 e0135e8c 00000000 c1247094 c1115094 00000000 e0135e38
[ 1010.925781] 5e20: bf1b6548 bf1b6550 60070013 ffffffff
[ 1010.931152] __irq_svc from omap2430_runtime_resume+0x58/0x94 [omap2430]
[ 1010.938354] omap2430_runtime_resume [omap2430] from __rpm_callback+0x3c/0x11c
[ 1010.946105] __rpm_callback from rpm_callback+0x2c/0x54
[ 1010.951721] rpm_callback from rpm_resume+0x3d4/0x4e0
[ 1010.957122] rpm_resume from rpm_resume+0x37c/0x4e0
[ 1010.962341] rpm_resume from __pm_runtime_resume+0x38/0x50
[ 1010.968200] __pm_runtime_resume from omap_musb_set_mailbox+0x1c/0x14c [omap2430]
[ 1010.976226] omap_musb_set_mailbox [omap2430] from process_one_work+0x1ac/0x2e8
[ 1010.984069] process_one_work from worker_thread+0x21c/0x2cc
[ 1010.990112] worker_thread from kthread+0xf0/0xfc
[ 1010.995208] kthread from ret_from_fork+0x14/0x2c
[ 1011.000244] Exception stack(0xe0135fb0 to 0xe0135ff8)
[ 1011.005645] 5fa0: 00000000 00000000 00000000 00000000
[ 1011.014343] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 1011.023040] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 1011.030059] ---[ end trace 0000000000000000 ]---

>
>
>>
>> rm -rf lib/modules/5.18.0-rc4-letux+/kernel/drivers/net/wireless/ti/wl1251
>>
>> done on the SD card makes the problems go away.
>
> Good, so I guess that means there is another bug in wl1251 DMA handling,
> while everything else is fine.

It looks as if there are two bugs. One inside wl1251 and one outside. The latter
seems to be omap3530 or Pandora specific since I don't see it on dm3730 / GTA04.
If I find time I'll try to reproduce the usb/otg issue on the oldest BeagleBoard.
Then we can find out if it is really an omap3530 issue or only OpenPandora specific.

>
>> diff --git a/drivers/net/wireless/ti/wl1251/io.c b/drivers/net/wireless/ti/wl1251/io.c
>> index 5ebe7958ed5c7..76aceecc281fb 100644
>> --- a/drivers/net/wireless/ti/wl1251/io.c
>> +++ b/drivers/net/wireless/ti/wl1251/io.c
>> @@ -121,7 +121,13 @@ void wl1251_set_partition(struct wl1251 *wl,
>> u32 mem_start, u32 mem_size,
>> u32 reg_start, u32 reg_size)
>> {
>> - struct wl1251_partition partition[2];
>> + struct wl1251_partition_set *partition;
>> +
>> + partition = kmalloc(sizeof(*partition), GFP_KERNEL);
>> + if (!partition) {
>> + wl1251_error("can not set partition");
>> + return;
>> + }
>>
>> wl1251_debug(DEBUG_SPI, "mem_start %08X mem_size %08X",
>> mem_start, mem_size);
>> @@ -164,10 +170,10 @@ void wl1251_set_partition(struct wl1251 *wl,
>> reg_start, reg_size);
>> }
>>
>> - partition[0].start = mem_start;
>> - partition[0].size = mem_size;
>> - partition[1].start = reg_start;
>> - partition[1].size = reg_size;
>> + partition->mem.start = mem_start;
>> + partition->mem.size = mem_size;
>> + partition->reg.start = reg_start;
>> + partition->reg.size = reg_size;
>>
>> wl->physical_mem_addr = mem_start;
>> wl->physical_reg_addr = reg_start;
>> @@ -176,5 +182,7 @@ void wl1251_set_partition(struct wl1251 *wl,
>> wl->virtual_reg_addr = mem_size;
>>
>> wl->if_ops->write(wl, HW_ACCESS_PART0_SIZE_ADDR, partition,
>> - sizeof(partition));
>> + sizeof(*partition));
>> +
>
> Changing the type of the structure looks a bit odd, but it does seem
> like a valid transformation otherwise.

Well, the new struct has two substructs of the original type which
essentially changes from array size 2 to named substruct access.

The definition of the new struct wl1251_partition_set already exists
in the header since 2009 but was nowhere actively used.

>
> I see more callers of wl1251_mem_write() or wl1251_mem_read() with
> on-stack arguments in wl1251_tx_complete(), wl1251_event_wait(),
> and wl1251_event_handle(). Those will need the same kmalloc()
> change as your wl1251_set_partition() fix I think.

I had searched for ops->write() calls but didn't think about more indirect
calls... Yes, they likely need fixes as well. I'll test/add before I
submit any patch.

> If that's not enough, try enabling CONFIG_DMA_API_DEBUG
> to get an is_vmalloc_address() check on every DMA operation.

Ok.

BR and thanks,
Nikolaus

2022-05-02 23:11:19

by Arnd Bergmann

[permalink] [raw]
Subject: Re: kernel panic with v5.18-rc1 on OpenPandora (only)

On Sat, Apr 30, 2022 at 3:16 PM H. Nikolaus Schaller <[email protected]> wrote:
> > Am 27.04.2022 um 11:37 schrieb Arnd Bergmann <[email protected]>:
> > On Wed, Apr 27, 2022 at 10:38 AM Tony Lindgren <[email protected]> wrote:
>
> > You said that it still crashes without the wl1251
> > driver, so I assume
> > there is at least one more related bug. If you get a different call
> > chain without the
> > driver, or with the kmalloc() call, can you post that as well?
>
> For some time it did disappear but reported e.g.
>
> [ 29.457946] omap_hsmmc 480ad000.mmc: found wl1251
> [ 29.516174] wl1251: ERROR unsupported chip id: 0xdb0aea56

That does point to invalid DMA addresses.

> (value did change a little randomly), but just before sending out this mail I tried
> again (now with v5.18-rc4) and got this log (with wl1251 driver fixed as below):
>
> [ 31.069580] 1ec0: 00000000 c11198c0 c103226c 0000001a 00000000 c017d654 c1032200 c37c0a40
> [ 31.078155] 1ee0: 00000000 c1032200 c103226c c1032218 e0001f84 c0c77370 c37c0a40 c0def3c0
> [ 31.086761] 1f00: c0d02080 c017d78c c1032200 c103226c c1032218 c017d7f0 c1032200 c103226c
> [ 31.095336] 1f20: c1032218 c0181ee0 e0001f50 00000000 ffffffff c017cf6c e0001f50 c08b4d7c
> [ 31.103942] 1f40: c01013c4 600f0113 ffffffff c0100bec c37c0a40 c0c7e6c0 00000000 1ed15000
> [ 31.112548] 1f60: c0c7e6c0 c0c7e6c0 00000040 c0d02d00 c0c77370 c37c0a40 c0def3c0 c0d02080
> [ 31.121154] 1f80: c0c7d850 e0001fa0 c0101390 c01013c4 600f0113 ffffffff 00000051 c0101390
> [ 31.129730] 1fa0: e01b9e94 c37c0a40 c37c0a40 00400000 0000000a ffff96d9 c1037850 c0c7e6c0
> [ 31.138336] 1fc0: c0d02d00 c0c7e6c0 c37c0a40 c37c0a40 600f0113 ffffffff e01b9e94 c37c0a40
> [ 31.146911] 1fe0: c37c0a40 e01b9f60 e01b9e58 c0137314 c0158434 c013740c c0158434 c04c9c6c
> [ 31.155517] omap3_l3_app_irq from __handle_irq_event_percpu+0xb0/0x1dc
> [ 31.162475] __handle_irq_event_percpu from handle_irq_event_percpu+0xc/0x38
> [ 31.169891] handle_irq_event_percpu from handle_irq_event+0x38/0x5c
> [ 31.176605] handle_irq_event from handle_level_irq+0x7c/0xb4
> [ 31.182647] handle_level_irq from handle_irq_desc+0x1c/0x2c
> [ 31.188629] handle_irq_desc from generic_handle_arch_irq+0x28/0x3c
> [ 31.195220] generic_handle_arch_irq from __irq_svc+0x8c/0xcc
> [ 31.201263] Exception stack(0xe0001f50 to 0xe0001f98)
> [ 31.206604] 1f40: c37c0a40 c0c7e6c0 00000000 1ed15000
> [ 31.215179] 1f60: c0c7e6c0 c0c7e6c0 00000040 c0d02d00 c0c77370 c37c0a40 c0def3c0 c0d02080
> [ 31.223785] 1f80: c0c7d850 e0001fa0 c0101390 c01013c4 600f0113 ffffffff
> [ 31.230743] __irq_svc from __do_softirq+0x84/0x304
> [ 31.235870] __do_softirq from __irq_exit_rcu+0x8c/0xd4
> [ 31.241363] __irq_exit_rcu from irq_exit+0x8/0x10
> [ 31.246429] irq_exit from call_with_stack+0x18/0x20
> [ 31.251647] call_with_stack from __irq_svc+0x98/0xcc
> [ 31.256988] Exception stack(0xe01b9e60 to 0xe01b9ea8)
> [ 31.262298] 9e60: df993a40 c37c0a40 00000000 00000001 df993a40 c3245000 c133c2c0 00000002
> [ 31.270904] 9e80: c37c0a40 00000000 e01b9f60 e01b9edc e01b9ee0 e01b9eb0 c08ba55c c0158434
> [ 31.279479] 9ea0: 600f0113 ffffffff
> [ 31.283172] __irq_svc from finish_task_switch+0x12c/0x1ec
> [ 31.288940] finish_task_switch from __schedule+0x3cc/0x558
> [ 31.294799] __schedule from schedule+0x70/0xc0
> [ 31.299591] schedule from do_work_pending+0x30/0x3dc
> [ 31.304901] do_work_pending from slow_work_pending+0xc/0x20
> [ 31.310852] Exception stack(0xe01b9fb0 to 0xe01b9ff8)
> [ 31.316192] 9fa0: 00002cf8 00000000 50000000 b6f99000
> [ 31.324768] 9fc0: b6f9bcfc b6f9bcf8 00000000 00000000 00000010 00000000 00001e94 00000000
> [ 31.333374] 9fe0: b6f9bcf8 bea66f80 b6f9bcfc 004bfc6a 40070030 ffffffff
> [ 31.340332] Code: e0000002 e0011003 e1901001 0a000002 (e7f001f2)
> [ 31.346740] ---[ end trace 0000000000000000 ]---


I suppose this could be anywhere then. The backtrace seems to point
to re-enabling interupts in do_work_pending, so something probably
accessed DMA memory asynchronously.


>
> rm -rf lib/modules/5.18.0-rc4-letux+/kernel/drivers/net/wireless/ti/wl1251
>
> done on the SD card makes the problems go away.

Good, so I guess that means there is another bug in wl1251 DMA handling,
while everything else is fine.

> diff --git a/drivers/net/wireless/ti/wl1251/io.c b/drivers/net/wireless/ti/wl1251/io.c
> index 5ebe7958ed5c7..76aceecc281fb 100644
> --- a/drivers/net/wireless/ti/wl1251/io.c
> +++ b/drivers/net/wireless/ti/wl1251/io.c
> @@ -121,7 +121,13 @@ void wl1251_set_partition(struct wl1251 *wl,
> u32 mem_start, u32 mem_size,
> u32 reg_start, u32 reg_size)
> {
> - struct wl1251_partition partition[2];
> + struct wl1251_partition_set *partition;
> +
> + partition = kmalloc(sizeof(*partition), GFP_KERNEL);
> + if (!partition) {
> + wl1251_error("can not set partition");
> + return;
> + }
>
> wl1251_debug(DEBUG_SPI, "mem_start %08X mem_size %08X",
> mem_start, mem_size);
> @@ -164,10 +170,10 @@ void wl1251_set_partition(struct wl1251 *wl,
> reg_start, reg_size);
> }
>
> - partition[0].start = mem_start;
> - partition[0].size = mem_size;
> - partition[1].start = reg_start;
> - partition[1].size = reg_size;
> + partition->mem.start = mem_start;
> + partition->mem.size = mem_size;
> + partition->reg.start = reg_start;
> + partition->reg.size = reg_size;
>
> wl->physical_mem_addr = mem_start;
> wl->physical_reg_addr = reg_start;
> @@ -176,5 +182,7 @@ void wl1251_set_partition(struct wl1251 *wl,
> wl->virtual_reg_addr = mem_size;
>
> wl->if_ops->write(wl, HW_ACCESS_PART0_SIZE_ADDR, partition,
> - sizeof(partition));
> + sizeof(*partition));
> +

Changing the type of the structure looks a bit odd, but it does seem
like a valid transformation otherwise.

I see more callers of wl1251_mem_write() or wl1251_mem_read() with
on-stack arguments in wl1251_tx_complete(), wl1251_event_wait(),
and wl1251_event_handle(). Those will need the same kmalloc()
change as your wl1251_set_partition() fix I think.

If that's not enough, try enabling CONFIG_DMA_API_DEBUG
to get an is_vmalloc_address() check on every DMA operation.

Arnd

2022-05-03 00:20:53

by H. Nikolaus Schaller

[permalink] [raw]
Subject: Re: kernel panic with v5.18-rc1 on OpenPandora (only)

Hi,

> Am 27.04.2022 um 11:37 schrieb Arnd Bergmann <[email protected]>:
>
> On Wed, Apr 27, 2022 at 10:38 AM Tony Lindgren <[email protected]> wrote:
>> * H. Nikolaus Schaller <[email protected]> [220426 20:16]:
>>> [ 28.245727] [<c0100b60>] (__irq_svc) from [<c0905de0>] (_raw_spin_unlock_irq+0x20/0x4c)
>>> [ 28.254150] [<c0905de0>] (_raw_spin_unlock_irq) from [<c0902e94>] (do_wait_for_common+0xa8/0x138)
>>> [ 28.263488] [<c0902e94>] (do_wait_for_common) from [<c0902f54>] (wait_for_common+0x30/0x48)
>>> [ 28.272277] [<c0902f54>] (wait_for_common) from [<c074edf8>] (mmc_wait_for_req_done+0x1c/0x90)
>>> [ 28.281341] [<c074edf8>] (mmc_wait_for_req_done) from [<c075a72c>] (mmc_io_rw_extended+0x1c0/0x2f4)
>>> [ 28.290893] [<c075a72c>] (mmc_io_rw_extended) from [<c075bd00>] (sdio_io_rw_ext_helper+0x118/0x140)
>>> [ 28.300415] [<c075bd00>] (sdio_io_rw_ext_helper) from [<c075bdd0>] (sdio_memcpy_toio+0x18/0x20)
>>> [ 28.309570] [<c075bdd0>] (sdio_memcpy_toio) from [<bf3de1ec>] (wl1251_sdio_write+0x34/0x54 [wl1251_sdio])
>>> [ 28.319702] [<bf3de1ec>] (wl1251_sdio_write [wl1251_sdio]) from [<bf40fc8c>] (wl1251_set_partition+0x90/0x404 [wl1251])
>>> [ 28.331207] [<bf40fc8c>] (wl1251_set_partition [wl1251]) from [<bf4074ec>] (wl1251_init_ieee80211+0x1c0/0x3dc [wl1251])
>
> I think the problem is here: wl1251_set_partition() passes a local
> stack variable into
> an SDIO API function that is given to the hardware. This was never
> safe and could
> cause a corrupted stack because of the cache management, but with vmap stacks
> it turns into a reliable DMA error, which I guess is what the l3
> interrupt is about.

This seems to me to be a fitting explanation. Thanks!

>
> Can you change wl1251_set_partition() to use kmalloc()/kfree() to allocate the
> partitions[] array?

Yes, I have tried and made it use struct wl1251_partition_set. Diff is attached.

> You said that it still crashes without the wl1251
> driver, so I assume
> there is at least one more related bug. If you get a different call
> chain without the
> driver, or with the kmalloc() call, can you post that as well?

For some time it did disappear but reported e.g.

[ 29.457946] omap_hsmmc 480ad000.mmc: found wl1251
[ 29.516174] wl1251: ERROR unsupported chip id: 0xdb0aea56

(value did change a little randomly), but just before sending out this mail I tried
again (now with v5.18-rc4) and got this log (with wl1251 driver fixed as below):

[ 29.813232] omap_hsmmc 480ad000.mmc: found wl1251
[ 30.131164] omap3isp 480bc000.isp: Entity type for entity OMAP3 ISP CCP2 was not initialized!
[ 30.432006] omap3isp 480bc000.isp: Entity type for entity OMAP3 ISP CSI2a was not initialized!
[ 30.660888] omap3isp 480bc000.isp: Entity type for entity OMAP3 ISP CCDC was not initialized!
[ 30.791229] wl1251: 151 tx blocks at 0x3b788, 35 rx blocks at 0x3a780
[ 30.811218] wl1251: firmware booted (Rev 4.0.4.3.7)
[ 30.817047] ------------[ cut here ]------------
[ 30.821899] kernel BUG at drivers/bus/omap_l3_smx.c:177!
[ 30.827484] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[ 30.833618] Modules linked in: snd_soc_omap_mcbsp snd_soc_ti_sdma omap3_isp(+) wl1251_sdio videobuf2_dma_contig videobuf2_memops wl1251 videobuf2_v4l2 usb_f_ecm videobuf2_common v4l2_fwnode g_ether v4l2_async usb_f_rndis u_ether libcomposite configfs mac80211 bq27xxx_battery_hdq videodev omap_sham libarc4 mc omap_aes_driver crypto_engine omap_hdq panel_tpo_td043mtea1 ads7846 omap_crypto omap2430 phy_twl4030_usb musb_hdrc twl4030_pwrbutton twl4030_charger twl4030_madc snd_soc_twl4030 industrialio twl4030_keypad ehci_omap bq27xxx_battery_i2c pandora_nub bq27xxx_battery omapdrm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cfg80211 drm drm_panel_orientation_quirks cec ip_tables x_tables ipv6 autofs4
[ 30.900390] CPU: 0 PID: 2143 Comm: depmod Not tainted 5.18.0-rc4-letux+ #9499
[ 30.907867] Hardware name: Generic OMAP3 (Flattened Device Tree)
[ 30.914184] PC is at omap3_l3_app_irq+0x3c/0x120
[ 30.919067] LR is at __handle_irq_event_percpu+0xb0/0x1dc
[ 30.924743] pc : [<c04e6e9c>] lr : [<c017d654>] psr: 200f0193
[ 30.931304] sp : e0001eb8 ip : c0c7d850 fp : c0def3e0
[ 30.936828] r10: c0d04014 r9 : c1032200 r8 : c0c77370
[ 30.942291] r7 : 00000000 r6 : 0000001a r5 : 00000000 r4 : f8000000
[ 30.949157] r3 : 00000000 r2 : 00400000 r1 : 00400000 r0 : 00400000
[ 30.956024] Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none
[ 30.963623] Control: 10c5387d Table: 83618019 DAC: 00000051
[ 30.969665] Register r0 information: non-paged memory
[ 30.974975] Register r1 information: non-paged memory
[ 30.980285] Register r2 information: non-paged memory
[ 30.985626] Register r3 information: NULL pointer
[ 30.990570] Register r4 information: 0-page vmalloc region starting at 0xf8000000 allocated at iotable_init+0x0/0xf4
[ 31.001647] Register r5 information: NULL pointer
[ 31.006591] Register r6 information: non-paged memory
[ 31.011901] Register r7 information: NULL pointer
[ 31.016845] Register r8 information: non-slab/vmalloc memory
[ 31.022827] Register r9 information: slab kmalloc-256 start c1032200 pointer offset 0 size 256
[ 31.031921] Register r10 information: non-slab/vmalloc memory
[ 31.037994] Register r11 information: non-slab/vmalloc memory
[ 31.044036] Register r12 information: non-slab/vmalloc memory
[ 31.050079] Process depmod (pid: 2143, stack limit = 0x2035257a)
[ 31.056396] Stack: (0xe0001eb8 to 0xe0002000)
[ 31.060974] 1ea0: 0000001a c1119940
[ 31.069580] 1ec0: 00000000 c11198c0 c103226c 0000001a 00000000 c017d654 c1032200 c37c0a40
[ 31.078155] 1ee0: 00000000 c1032200 c103226c c1032218 e0001f84 c0c77370 c37c0a40 c0def3c0
[ 31.086761] 1f00: c0d02080 c017d78c c1032200 c103226c c1032218 c017d7f0 c1032200 c103226c
[ 31.095336] 1f20: c1032218 c0181ee0 e0001f50 00000000 ffffffff c017cf6c e0001f50 c08b4d7c
[ 31.103942] 1f40: c01013c4 600f0113 ffffffff c0100bec c37c0a40 c0c7e6c0 00000000 1ed15000
[ 31.112548] 1f60: c0c7e6c0 c0c7e6c0 00000040 c0d02d00 c0c77370 c37c0a40 c0def3c0 c0d02080
[ 31.121154] 1f80: c0c7d850 e0001fa0 c0101390 c01013c4 600f0113 ffffffff 00000051 c0101390
[ 31.129730] 1fa0: e01b9e94 c37c0a40 c37c0a40 00400000 0000000a ffff96d9 c1037850 c0c7e6c0
[ 31.138336] 1fc0: c0d02d00 c0c7e6c0 c37c0a40 c37c0a40 600f0113 ffffffff e01b9e94 c37c0a40
[ 31.146911] 1fe0: c37c0a40 e01b9f60 e01b9e58 c0137314 c0158434 c013740c c0158434 c04c9c6c
[ 31.155517] omap3_l3_app_irq from __handle_irq_event_percpu+0xb0/0x1dc
[ 31.162475] __handle_irq_event_percpu from handle_irq_event_percpu+0xc/0x38
[ 31.169891] handle_irq_event_percpu from handle_irq_event+0x38/0x5c
[ 31.176605] handle_irq_event from handle_level_irq+0x7c/0xb4
[ 31.182647] handle_level_irq from handle_irq_desc+0x1c/0x2c
[ 31.188629] handle_irq_desc from generic_handle_arch_irq+0x28/0x3c
[ 31.195220] generic_handle_arch_irq from __irq_svc+0x8c/0xcc
[ 31.201263] Exception stack(0xe0001f50 to 0xe0001f98)
[ 31.206604] 1f40: c37c0a40 c0c7e6c0 00000000 1ed15000
[ 31.215179] 1f60: c0c7e6c0 c0c7e6c0 00000040 c0d02d00 c0c77370 c37c0a40 c0def3c0 c0d02080
[ 31.223785] 1f80: c0c7d850 e0001fa0 c0101390 c01013c4 600f0113 ffffffff
[ 31.230743] __irq_svc from __do_softirq+0x84/0x304
[ 31.235870] __do_softirq from __irq_exit_rcu+0x8c/0xd4
[ 31.241363] __irq_exit_rcu from irq_exit+0x8/0x10
[ 31.246429] irq_exit from call_with_stack+0x18/0x20
[ 31.251647] call_with_stack from __irq_svc+0x98/0xcc
[ 31.256988] Exception stack(0xe01b9e60 to 0xe01b9ea8)
[ 31.262298] 9e60: df993a40 c37c0a40 00000000 00000001 df993a40 c3245000 c133c2c0 00000002
[ 31.270904] 9e80: c37c0a40 00000000 e01b9f60 e01b9edc e01b9ee0 e01b9eb0 c08ba55c c0158434
[ 31.279479] 9ea0: 600f0113 ffffffff
[ 31.283172] __irq_svc from finish_task_switch+0x12c/0x1ec
[ 31.288940] finish_task_switch from __schedule+0x3cc/0x558
[ 31.294799] __schedule from schedule+0x70/0xc0
[ 31.299591] schedule from do_work_pending+0x30/0x3dc
[ 31.304901] do_work_pending from slow_work_pending+0xc/0x20
[ 31.310852] Exception stack(0xe01b9fb0 to 0xe01b9ff8)
[ 31.316192] 9fa0: 00002cf8 00000000 50000000 b6f99000
[ 31.324768] 9fc0: b6f9bcfc b6f9bcf8 00000000 00000000 00000010 00000000 00001e94 00000000
[ 31.333374] 9fe0: b6f9bcf8 bea66f80 b6f9bcfc 004bfc6a 40070030 ffffffff
[ 31.340332] Code: e0000002 e0011003 e1901001 0a000002 (e7f001f2)
[ 31.346740] ---[ end trace 0000000000000000 ]---
[ 31.351593] Kernel panic - not syncing: Fatal exception in interrupt
[ 31.358276] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

>
> Arnd

rm -rf lib/modules/5.18.0-rc4-letux+/kernel/drivers/net/wireless/ti/wl1251

done on the SD card makes the problems go away.

BR and thanks,
Nikolaus


diff --git a/drivers/net/wireless/ti/wl1251/io.c b/drivers/net/wireless/ti/wl1251/io.c
index 5ebe7958ed5c7..76aceecc281fb 100644
--- a/drivers/net/wireless/ti/wl1251/io.c
+++ b/drivers/net/wireless/ti/wl1251/io.c
@@ -121,7 +121,13 @@ void wl1251_set_partition(struct wl1251 *wl,
u32 mem_start, u32 mem_size,
u32 reg_start, u32 reg_size)
{
- struct wl1251_partition partition[2];
+ struct wl1251_partition_set *partition;
+
+ partition = kmalloc(sizeof(*partition), GFP_KERNEL);
+ if (!partition) {
+ wl1251_error("can not set partition");
+ return;
+ }

wl1251_debug(DEBUG_SPI, "mem_start %08X mem_size %08X",
mem_start, mem_size);
@@ -164,10 +170,10 @@ void wl1251_set_partition(struct wl1251 *wl,
reg_start, reg_size);
}

- partition[0].start = mem_start;
- partition[0].size = mem_size;
- partition[1].start = reg_start;
- partition[1].size = reg_size;
+ partition->mem.start = mem_start;
+ partition->mem.size = mem_size;
+ partition->reg.start = reg_start;
+ partition->reg.size = reg_size;

wl->physical_mem_addr = mem_start;
wl->physical_reg_addr = reg_start;
@@ -176,5 +182,7 @@ void wl1251_set_partition(struct wl1251 *wl,
wl->virtual_reg_addr = mem_size;

wl->if_ops->write(wl, HW_ACCESS_PART0_SIZE_ADDR, partition,
- sizeof(partition));
+ sizeof(*partition));
+
+ kfree(partition);
}


2022-05-03 00:46:44

by Arnd Bergmann

[permalink] [raw]
Subject: Re: kernel panic with v5.18-rc1 on OpenPandora (only)

On Sat, Apr 30, 2022 at 7:18 PM H. Nikolaus Schaller <[email protected]> wrote:
> > Am 30.04.2022 um 17:36 schrieb Arnd Bergmann <[email protected]>:
> >
> >
> > I suppose this could be anywhere then. The backtrace seems to point
> > to re-enabling interupts in do_work_pending, so something probably
> > accessed DMA memory asynchronously.
>
> Yes. I now (or still) sometimes see the same omap l3 irq issue when plugging in/out the USB/OTG
> cable. Not with a kernel panic, but in the same driver omap_l3_smx.c.
> This happens even if the wl1251 driver is removed.

Is this also a regression, or did it happen before the vmap-stack
change? If this only
appeared now, then this points to another bug somewhere that you
should find using
CONFIG_DMA_API_DEBUG.

I think what is going on here is that your platform is able to detect
the broken DMA
because of the l3 interrupt handler telling the kernel about it, when
on other platforms
we would see either silent data corruption or a DMA that never reaches
its target.

Arnd

2022-05-03 08:46:31

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: kernel panic with v5.18-rc1 on OpenPandora (only)

On Sat, 30 Apr 2022 at 20:48, Arnd Bergmann <[email protected]> wrote:
>
> On Sat, Apr 30, 2022 at 7:18 PM H. Nikolaus Schaller <[email protected]> wrote:
> > > Am 30.04.2022 um 17:36 schrieb Arnd Bergmann <[email protected]>:
> > >
> > >
> > > I suppose this could be anywhere then. The backtrace seems to point
> > > to re-enabling interupts in do_work_pending, so something probably
> > > accessed DMA memory asynchronously.
> >
> > Yes. I now (or still) sometimes see the same omap l3 irq issue when plugging in/out the USB/OTG
> > cable. Not with a kernel panic, but in the same driver omap_l3_smx.c.
> > This happens even if the wl1251 driver is removed.
>
> Is this also a regression, or did it happen before the vmap-stack
> change? If this only
> appeared now, then this points to another bug somewhere that you
> should find using
> CONFIG_DMA_API_DEBUG.
>
> I think what is going on here is that your platform is able to detect
> the broken DMA
> because of the l3 interrupt handler telling the kernel about it, when
> on other platforms
> we would see either silent data corruption or a DMA that never reaches
> its target.
>

I wonder if we could narrow this down by adding the possibility to use
IRQ stacks in the linear map, while using vmap'ed task stacks.

2022-05-03 13:02:31

by Arnd Bergmann

[permalink] [raw]
Subject: Re: kernel panic with v5.18-rc1 on OpenPandora (only)

On Tue, May 3, 2022 at 9:28 AM Ard Biesheuvel <[email protected]> wrote:
> On Sat, 30 Apr 2022 at 20:48, Arnd Bergmann <[email protected]> wrote:
> >
> > I think what is going on here is that your platform is able to detect
> > the broken DMA because of the l3 interrupt handler telling the kernel
> > about it, when on other platforms we would see either silent data corruption
> > or a DMA that never reaches its target.
> >
>
> I wonder if we could narrow this down by adding the possibility to use
> IRQ stacks in the linear map, while using vmap'ed task stacks.

I don't think we have actual DMA attempts to the IRQ stack, so this should
not make a difference. What might help is to print some more information
in omap3_l3_app_irq() that is likely provided by the hardware. The BUG_ON()
happens for any timeout error, and that is most of the possible errors.

Simply dumping the L3 registers should at least show the exact type of
timeout, and maybe the DMA master ID and physical address that can
be traced back into a virtual address.
Setting CONFIG_DMA_API_DEBUG=y should get the same information
I think, but it can't hurt to do both.

Arnd