2023-08-01 23:03:46

by Rafał Miłecki

[permalink] [raw]
Subject: ARM board lockups/hangs triggered by locks and mutexes

Hi,

Years ago I added support for Broadcom's BCM53573 SoCs. We released
firmwares based on Linux 4.4 (and later on 4.14) that worked almost
fine. There was one little issue we couldn't debug or fix: random hangs
and reboots. They were too rare to deal with (most devices worked fine
for weeks or months).

Recently I updated my stable kernel 5.4 and I started experiencing
stability issues on my own! After some uptime (usually from 0 to 20
minutes of close to zero activity) serial console hangs. I can't type
anything and I stop getting any messages. I've to wait about a minute
for watchdog to kick in and reboot device.

#####

I took that great chance and decided to track the regression.

Linux 5.4 stable branch worked stable up to the release v5.4.197.
Starting with v5.4.198 I started experiencing those stability issues. I
bisected it down to the commit 4460066eb248 ("ipv6: fix locking issues
with loops over idev->addr_list"):
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4460066eb2480b9e203c73755e12e2efc820a27e

With above commit reverted I was able to use stable 5.4 branch up to the
release v5.4.207. Starting with v5.4.208 it got unstable again. I
bisected it down to:
commit d0d583484d2e ("locking/refcount: Consolidate implementations of
refcount_t")
commit dab787c73f6e ("locking/refcount: Consolidate
REFCOUNT_{MAX,SATURATED} definitions")
commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
commit 809554147d60 ("locking/refcount: Improve performance of generic
REFCOUNT_FULL code")
commit 9c9269977f03 ("locking/refcount: Move the bulk of the
REFCOUNT_FULL implementation into the <linux/refcount.h> header")
commit 04bff7d7b808 ("locking/refcount: Remove unused
refcount_*_checked() variants")
commit 513b19a43bec ("locking/refcount: Ensure integer operands are
treated as signed")
commit 68b4ee68e8c8 ("locking/refcount: Define constants for
saturation and max refcount values")
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=809554147d609163cfbaf815c443c575b538a7ef
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=04bff7d7b8081c4bb2e8171be31d33df297eee5b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=513b19a43becee5f7af6d283bb9d3d241a8a21a8
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
(I didn't actually check above commits individually).

Reverting above locking/refcount commits worked fine for few releases:
up to the v5.4.219. Starting with v5.4.220 I got hangs again. I bisected
that down to the commit 131287ff833d ("once: add DO_ONCE_SLOW() for
sleepable contexts").

Reverting that extra commit from v5.4.238 allows me to run Linux for
hours again (currently 3 devices x 6 hours and counting). So I need in
total 10+1 reverts from 5.4 branch to get a stable kernel.

#####

I'm clueless at this point. Is that possible kernel has some locking bug
I can hit only using this specific SoC? BCM53573s have a single ARM
Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
can think of is a slow arch timer running at 36,8 kHz.

I tried compiling kernel with:
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_WQ_WATCHDOG=y
but it didn't change or report anything.

Unfortunately enabling *any* of following options:
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
seems to make locksup/hangs go away. I tried for few hours.

Sadly I don't have access to JTAG or any low level debugging interface.

Does looking at commits I reported above give anyone a hint on what may
be going on maybe?

--
Rafał


Attachments:
dmesg.txt (14.66 kB)
config.gz (22.83 kB)
Download all attachments

2023-08-01 23:42:33

by Russell King (Oracle)

[permalink] [raw]
Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

On Wed, Aug 02, 2023 at 12:10:24AM +0200, Rafał Miłecki wrote:
> Hi,
>
> Years ago I added support for Broadcom's BCM53573 SoCs. We released
> firmwares based on Linux 4.4 (and later on 4.14) that worked almost
> fine. There was one little issue we couldn't debug or fix: random hangs
> and reboots. They were too rare to deal with (most devices worked fine
> for weeks or months).
>
> Recently I updated my stable kernel 5.4 and I started experiencing
> stability issues on my own! After some uptime (usually from 0 to 20
> minutes of close to zero activity) serial console hangs. I can't type
> anything and I stop getting any messages. I've to wait about a minute
> for watchdog to kick in and reboot device.
>
> #####
>
> I took that great chance and decided to track the regression.
>
> Linux 5.4 stable branch worked stable up to the release v5.4.197.
> Starting with v5.4.198 I started experiencing those stability issues. I
> bisected it down to the commit 4460066eb248 ("ipv6: fix locking issues
> with loops over idev->addr_list"):
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4460066eb2480b9e203c73755e12e2efc820a27e
>
> With above commit reverted I was able to use stable 5.4 branch up to the
> release v5.4.207. Starting with v5.4.208 it got unstable again. I
> bisected it down to:
> commit d0d583484d2e ("locking/refcount: Consolidate implementations of
> refcount_t")
> commit dab787c73f6e ("locking/refcount: Consolidate
> REFCOUNT_{MAX,SATURATED} definitions")
> commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
> commit 809554147d60 ("locking/refcount: Improve performance of generic
> REFCOUNT_FULL code")
> commit 9c9269977f03 ("locking/refcount: Move the bulk of the
> REFCOUNT_FULL implementation into the <linux/refcount.h> header")
> commit 04bff7d7b808 ("locking/refcount: Remove unused
> refcount_*_checked() variants")
> commit 513b19a43bec ("locking/refcount: Ensure integer operands are
> treated as signed")
> commit 68b4ee68e8c8 ("locking/refcount: Define constants for
> saturation and max refcount values")
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=809554147d609163cfbaf815c443c575b538a7ef
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=04bff7d7b8081c4bb2e8171be31d33df297eee5b
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=513b19a43becee5f7af6d283bb9d3d241a8a21a8
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
> (I didn't actually check above commits individually).
>
> Reverting above locking/refcount commits worked fine for few releases:
> up to the v5.4.219. Starting with v5.4.220 I got hangs again. I bisected
> that down to the commit 131287ff833d ("once: add DO_ONCE_SLOW() for
> sleepable contexts").
>
> Reverting that extra commit from v5.4.238 allows me to run Linux for
> hours again (currently 3 devices x 6 hours and counting). So I need in
> total 10+1 reverts from 5.4 branch to get a stable kernel.
>
> #####
>
> I'm clueless at this point. Is that possible kernel has some locking bug
> I can hit only using this specific SoC? BCM53573s have a single ARM
> Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
> can think of is a slow arch timer running at 36,8 kHz.
>
> I tried compiling kernel with:
> CONFIG_SOFTLOCKUP_DETECTOR=y
> CONFIG_DETECT_HUNG_TASK=y
> CONFIG_WQ_WATCHDOG=y
> but it didn't change or report anything.
>
> Unfortunately enabling *any* of following options:
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> seems to make locksup/hangs go away. I tried for few hours.
>
> Sadly I don't have access to JTAG or any low level debugging interface.
>
> Does looking at commits I reported above give anyone a hint on what may
> be going on maybe?

If you suspect locking issues, make sure you have lockdep enabled which
will detect locking errors. You will want CONFIG_PROVE_LOCKING enabled.

I will say that I use IPv6, and I run 32-bit kernels here both on real
ARMv7 hardware (Armada 388 and iMX6 based stuff) and also in KVM based
VMs, and these have run virtually every release of the kernel (not
stable kernels though) and I haven't ever seen the behaviour that you
describe.

If it is specific to stable kernels, then that would be rather
disappointing.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

2023-08-02 07:22:53

by Rafał Miłecki

[permalink] [raw]
Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

On 2.08.2023 00:21, Russell King (Oracle) wrote:
> On Wed, Aug 02, 2023 at 12:10:24AM +0200, Rafał Miłecki wrote:
>> Years ago I added support for Broadcom's BCM53573 SoCs. We released
>> firmwares based on Linux 4.4 (and later on 4.14) that worked almost
>> fine. There was one little issue we couldn't debug or fix: random hangs
>> and reboots. They were too rare to deal with (most devices worked fine
>> for weeks or months).
>>
>> Recently I updated my stable kernel 5.4 and I started experiencing
>> stability issues on my own! After some uptime (usually from 0 to 20
>> minutes of close to zero activity) serial console hangs. I can't type
>> anything and I stop getting any messages. I've to wait about a minute
>> for watchdog to kick in and reboot device.
>>
>> (...)
>>
>> I'm clueless at this point. Is that possible kernel has some locking bug
>> I can hit only using this specific SoC? BCM53573s have a single ARM
>> Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
>> can think of is a slow arch timer running at 36,8 kHz.
>>
>> I tried compiling kernel with:
>> CONFIG_SOFTLOCKUP_DETECTOR=y
>> CONFIG_DETECT_HUNG_TASK=y
>> CONFIG_WQ_WATCHDOG=y
>> but it didn't change or report anything.
>>
>> Unfortunately enabling *any* of following options:
>> CONFIG_DEBUG_RT_MUTEXES=y
>> CONFIG_DEBUG_SPINLOCK=y
>> CONFIG_DEBUG_MUTEXES=y
>> seems to make locksup/hangs go away. I tried for few hours.
>>
>> Sadly I don't have access to JTAG or any low level debugging interface.
>>
>> Does looking at commits I reported above give anyone a hint on what may
>> be going on maybe?
>
> If you suspect locking issues, make sure you have lockdep enabled which
> will detect locking errors. You will want CONFIG_PROVE_LOCKING enabled.
>
> I will say that I use IPv6, and I run 32-bit kernels here both on real
> ARMv7 hardware (Armada 388 and iMX6 based stuff) and also in KVM based
> VMs, and these have run virtually every release of the kernel (not
> stable kernels though) and I haven't ever seen the behaviour that you
> describe.
>
> If it is specific to stable kernels, then that would be rather
> disappointing.

I wrote above that with any of: CONFIG_DEBUG_RT_MUTEXES,
CONFIG_DEBUG_SPINLOCK or CONFIG_DEBUG_MUTEXES enabled I can't reproduce
the issue anymore. Right? Well I swear it was true for some random 5.4
release I tested before.

With your comment I decided to try CONFIG_PROVE_LOCKING anyway / again
and this time on 1 of my BCM53573 devices I got something very
interesting on the first boot.

FWIW following error:
Broadcom B53 (2) bcma_mdio-0-0:1e: failed to register switch: -517
is caused by invalid DT I sent fixes for just recently.

Please scroll through the first booting lines for the WARNING:

[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 5.4.238 (ubuntu@nat) (gcc version 8.4.0 (OpenWrt GCC 8.4.0 r15234+1-d89a7f0120)) #0 SMP Fri Jul 14 12:56:51 2023
[ 0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
[ 0.000000] CPU: div instructions available: patching division code
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[ 0.000000] OF: fdt: Machine model: Tenda AC9
[ 0.000000] earlycon: ns16550a0 at MMIO 0x18000300 (options '115200n8')
[ 0.000000] printk: bootconsole [ns16550a0] enabled
[ 0.000000] Memory policy: Data cache writealloc
[ 0.000000] Hit pending asynchronous external abort (FSR=0x00000c06) during first unmask, this is most likely caused by a firmware/bootloader bug.
[ 0.000000] percpu: Embedded 14 pages/cpu s27944 r8192 d21208 u57344
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 32480
[ 0.000000] Kernel command line: console=ttyS0,115200 earlycon
[ 0.000000] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes, linear)
[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.000000] Memory: 118164K/131072K available (5531K kernel code, 201K rwdata, 1960K rodata, 1024K init, 2106K bss, 12908K reserved, 0K cma-reserved, 0K highmem)
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[ 0.000000] rcu: Hierarchical RCU implementation.
[ 0.000000] rcu: RCU restricting CPUs from NR_CPUS=2 to nr_cpu_ids=1.
[ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[ 0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[ 0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[ 0.000000] arch_timer: cp15 timer(s) running at 0.03MHz (virt).
[ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x10eb00226, max_idle_ns: 56421785894076 ns
[ 0.000027] sched_clock: 56 bits at 35kHz, resolution 27918ns, wraps every 70368744165810ns
[ 0.008654] Ignoring delay timer arch_delay_timer, which has insufficient resolution of 27918ns
[ 0.017951] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[ 0.025936] ... MAX_LOCKDEP_SUBCLASSES: 8
[ 0.030180] ... MAX_LOCK_DEPTH: 48
[ 0.034479] ... MAX_LOCKDEP_KEYS: 8192
[ 0.039002] ... CLASSHASH_SIZE: 4096
[ 0.043525] ... MAX_LOCKDEP_ENTRIES: 32768
[ 0.048104] ... MAX_LOCKDEP_CHAINS: 65536
[ 0.052683] ... CHAINHASH_SIZE: 32768
[ 0.057289] memory used by lock dependency info: 1797 kB
[ 0.062873] per task-struct memory footprint: 1536 bytes
[ 0.068485] Calibrating delay loop... 1790.77 BogoMIPS (lpj=8953856)
[ 0.134987] pid_max: default: 32768 minimum: 301
[ 0.140097] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.147663] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.157937] CPU: Testing write buffer coherency: ok
[ 0.164079] /cpus/cpu@0 missing clock-frequency property
[ 0.169691] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[ 0.177787] Setting up static identity map for 0x100000 - 0x10003c
[ 0.184767] rcu: Hierarchical SRCU implementation.
[ 0.191077] smp: Bringing up secondary CPUs ...
[ 0.195879] smp: Brought up 1 node, 1 CPU
[ 0.200039] SMP: Total of 1 processors activated (1790.77 BogoMIPS).
[ 0.206655] CPU: All CPU(s) started in SVC mode.
[ 0.217572] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[ 0.227846] futex hash table entries: 256 (order: 2, 16384 bytes, linear)
[ 0.235244] pinctrl core: initialized pinctrl subsystem
[ 0.242699] NET: Registered protocol family 16
[ 0.249958] DMA: preallocated 256 KiB pool for atomic coherent allocations
[ 0.333156] clocksource: Switched to clocksource arch_sys_counter
[ 0.342062] thermal_sys: Registered thermal governor 'step_wise'
[ 0.342732] NET: Registered protocol family 2
[ 0.353984] IP idents hash table entries: 2048 (order: 2, 16384 bytes, linear)
[ 0.362694] tcp_listen_portaddr_hash hash table entries: 128 (order: 0, 5120 bytes, linear)
[ 0.371433] TCP established hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.379446] TCP bind hash table entries: 1024 (order: 3, 36864 bytes, linear)
[ 0.386984] TCP: Hash tables configured (established 1024 bind 1024)
[ 0.393908] UDP hash table entries: 256 (order: 2, 20480 bytes, linear)
[ 0.400831] UDP-Lite hash table entries: 256 (order: 2, 20480 bytes, linear)
[ 0.408733] NET: Registered protocol family 1
[ 0.413423] PCI: CLS 0 bytes, default 64
[ 0.427857] workingset: timestamp_bits=14 max_order=15 bucket_order=1
[ 0.449187] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 0.455301] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[ 0.501005] Serial: 8250/16550 driver, 2 ports, IRQ sharing disabled
[ 0.522195] bgmac_bcma: Broadcom 47xx GBit MAC driver loaded
[ 0.529063] bcma-host-soc 18000000.axi: bus0: Found chip with id 53573, rev 0x02 and package 0x01
[ 0.538583] bcma-host-soc 18000000.axi: bus0: Core 0 found: ChipCommon (manuf 0x4BF, id 0x800, rev 0x36, class 0x0)
[ 0.549835] bcma-host-soc 18000000.axi: bus0: Core 1 found: IEEE 802.11 (manuf 0x4BF, id 0x812, rev 0x38, class 0x0)
[ 0.561030] bcma-host-soc 18000000.axi: bus0: Core 2 found: PCIe Gen 2 (manuf 0x4BF, id 0x501, rev 0x05, class 0x0)
[ 0.572142] bcma-host-soc 18000000.axi: bus0: Core 3 found: ARM CA7 (manuf 0x4BF, id 0x847, rev 0x00, class 0x0)
[ 0.583030] bcma-host-soc 18000000.axi: bus0: Core 4 found: USB 2.0 Host (manuf 0x4BF, id 0x819, rev 0x05, class 0x0)
[ 0.594421] bcma-host-soc 18000000.axi: bus0: Core 5 found: GBit MAC (manuf 0x4BF, id 0x82D, rev 0x08, class 0x0)
[ 0.605338] bcma-host-soc 18000000.axi: bus0: Core 6 found: I2S (manuf 0x4BF, id 0x834, rev 0x06, class 0x0)
[ 0.615807] bcma-host-soc 18000000.axi: bus0: Core 7 found: CNDS DDR2/3 memory controller (manuf 0x4BF, id 0x846, rev 0x00, class 0x0)
[ 0.628594] bcma-host-soc 18000000.axi: bus0: Core 8 found: NAND flash controller (manuf 0x4BF, id 0x509, rev 0x01, class 0x0)
[ 0.640767] bcma-host-soc 18000000.axi: bus0: Core 9 found: IEEE 802.11 (manuf 0x4BF, id 0x812, rev 0x38, class 0x0)
[ 0.652046] bcma-host-soc 18000000.axi: bus0: Core 10 found: GBit MAC (manuf 0x4BF, id 0x82D, rev 0x08, class 0x0)
[ 0.663074] bcma-host-soc 18000000.axi: bus0: Core 11 found: I2S (manuf 0x4BF, id 0x834, rev 0x06, class 0x0)
[ 0.673599] bcma-host-soc 18000000.axi: bus0: Core 12 found: GCI (manuf 0x4BF, id 0x840, rev 0x08, class 0x0)
[ 0.684153] bcma-host-soc 18000000.axi: bus0: Core 13 found: PMU (manuf 0x4BF, id 0x827, rev 0x1C, class 0x0)
[ 0.709559] bcma-host-soc 18000000.axi: bus0: Found M25P64 serial flash (size: 8192KiB, blocksize: 0x10000, blocks: 128)
[ 0.723602] printk: console [ttyS0] disabled
[ 0.728293] 18000300.serial: ttyS0 at MMIO 0x18000300 (irq = 20, base_baud = 2500000) is a 16550A
[ 0.737645] printk: console [ttyS0] enabled
[ 0.737645] printk: console [ttyS0] enabled
[ 0.746272] printk: bootconsole [ns16550a0] disabled
[ 0.746272] printk: bootconsole [ns16550a0] disabled
[ 0.776648] 4 bcm47xxpart partitions found on MTD device bcm47xxsflash
[ 0.783321] Creating 4 MTD partitions on "bcm47xxsflash":
[ 0.788905] 0x000000000000-0x000000040000 : "boot"
[ 0.797085] 0x000000040000-0x0000007b0000 : "firmware"
[ 0.804400] 2 trx partitions found on MTD device firmware
[ 0.809900] Creating 2 MTD partitions on "firmware":
[ 0.815065] 0x00000000001c-0x000000285400 : "linux"
[ 0.822100] 0x000000285400-0x000000770000 : "rootfs"
[ 0.829219] mtd: device 3 (rootfs) set to be root filesystem
[ 0.835082] 1 squashfs-split partitions found on MTD device rootfs
[ 0.841392] 0x000000540000-0x000000770000 : "rootfs_data"
[ 0.848958] 0x0000007b0000-0x0000007f0000 : "boot"
[ 0.855966] 0x0000007f0000-0x000000800000 : "nvram"
[ 1.022586] pcie_iproc_bcma bcma0:2: link: UP
[ 1.027555] pcie_iproc_bcma bcma0:2: PCI host bridge to bus 0000:00
[ 1.034033] pci_bus 0000:00: root bus resource [mem 0x10000000-0x17ffffff]
[ 1.041012] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
[ 1.049304] pci 0000:00:00.0: [14e4:d145] type 01 class 0x060400
[ 1.055502] pci_bus 0000:00: 2-byte config write to 0000:00:00.0 offset 0x4 may corrupt adjacent RW1C bits
[ 1.065553] pci 0000:00:00.0: PME# supported from D0 D3hot D3cold
[ 1.072924] PCI: bus0: Fast back to back transfers disabled
[ 1.078619] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 1.087134] pci 0000:01:00.0: [14e4:a8db] type 00 class 0x028000
[ 1.093360] pci_bus 0000:01: 2-byte config write to 0000:01:00.0 offset 0x4 may corrupt adjacent RW1C bits
[ 1.103244] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00007fff 64bit]
[ 1.110195] pci 0000:01:00.0: enabling Extended Tags
[ 1.115444] pci 0000:01:00.0: supports D1 D2
[ 1.120637] PCI: bus1: Fast back to back transfers disabled
[ 1.126416] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[ 1.133201] pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 01
[ 1.139957] pci 0000:00:00.0: BAR 8: assigned [mem 0x10000000-0x100fffff]
[ 1.146909] pci 0000:01:00.0: BAR 0: assigned [mem 0x10000000-0x10007fff 64bit]
[ 1.154391] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 1.159444] pci 0000:00:00.0: bridge window [mem 0x10000000-0x100fffff]
[ 1.167234] bgmac_bcma bcma0:5: Found PHY addr: 30 (NOREGS)
[ 1.173655] ------------[ cut here ]------------
[ 1.178374] WARNING: CPU: 0 PID: 1 at kernel/locking/mutex.c:950 __mutex_lock+0x6b4/0x8a0
[ 1.186721] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[ 1.186721] Modules linked in:
[ 1.194902] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.238 #0
[ 1.200988] Hardware name: Generic DT based system
[ 1.205874] [<c010a240>] (unwind_backtrace) from [<c01064d0>] (show_stack+0x10/0x20)
[ 1.213747] [<c01064d0>] (show_stack) from [<c0648e3c>] (dump_stack+0x94/0xb8)
[ 1.221062] [<c0648e3c>] (dump_stack) from [<c011ad4c>] (__warn+0xa0/0xbc)
[ 1.228041] [<c011ad4c>] (__warn) from [<c011adec>] (warn_slowpath_fmt+0x84/0x98)
[ 1.235635] [<c011adec>] (warn_slowpath_fmt) from [<c0662fd4>] (__mutex_lock+0x6b4/0x8a0)
[ 1.243955] [<c0662fd4>] (__mutex_lock) from [<c06631dc>] (mutex_lock_nested+0x1c/0x40)
[ 1.252079] [<c06631dc>] (mutex_lock_nested) from [<c04a31cc>] (phy_probe+0x4c/0x1c0)
[ 1.260008] [<c04a31cc>] (phy_probe) from [<c0442d30>] (really_probe+0x1f0/0x360)
[ 1.267602] [<c0442d30>] (really_probe) from [<c0440fb0>] (bus_for_each_drv+0x70/0xa0)
[ 1.275643] [<c0440fb0>] (bus_for_each_drv) from [<c0442ab0>] (__device_attach+0xd0/0x140)
[ 1.284019] [<c0442ab0>] (__device_attach) from [<c0441d48>] (bus_probe_device+0x88/0xa0)
[ 1.292311] [<c0441d48>] (bus_probe_device) from [<c043f064>] (device_add+0x3c4/0x5e0)
[ 1.300351] [<c043f064>] (device_add) from [<c04a4344>] (mdio_device_register+0x24/0x60)
[ 1.308559] [<c04a4344>] (mdio_device_register) from [<c04c9c44>] (of_mdiobus_register+0x1e4/0x2e0)
[ 1.317745] [<c04c9c44>] (of_mdiobus_register) from [<c04aed54>] (bcma_mdio_mii_register+0xb4/0x100)
[ 1.327014] [<c04aed54>] (bcma_mdio_mii_register) from [<c04ae6f4>] (bgmac_probe+0x154/0x5a0)
[ 1.335669] [<c04ae6f4>] (bgmac_probe) from [<c04ca308>] (bcma_device_probe+0x28/0x60)
[ 1.343681] [<c04ca308>] (bcma_device_probe) from [<c0442d30>] (really_probe+0x1f0/0x360)
[ 1.352001] [<c0442d30>] (really_probe) from [<c0440fb0>] (bus_for_each_drv+0x70/0xa0)
[ 1.360014] [<c0440fb0>] (bus_for_each_drv) from [<c0442ab0>] (__device_attach+0xd0/0x140)
[ 1.368390] [<c0442ab0>] (__device_attach) from [<c0441d48>] (bus_probe_device+0x88/0xa0)
[ 1.376682] [<c0441d48>] (bus_probe_device) from [<c043f064>] (device_add+0x3c4/0x5e0)
[ 1.384722] [<c043f064>] (device_add) from [<c04cae60>] (bcma_bus_register+0x480/0x640)
[ 1.392847] [<c04cae60>] (bcma_bus_register) from [<c04d09a8>] (bcma_host_soc_probe+0x68/0xa0)
[ 1.401585] [<c04d09a8>] (bcma_host_soc_probe) from [<c0444b74>] (platform_drv_probe+0x34/0x80)
[ 1.410408] [<c0444b74>] (platform_drv_probe) from [<c0442d30>] (really_probe+0x1f0/0x360)
[ 1.418783] [<c0442d30>] (really_probe) from [<c0443358>] (device_driver_attach+0x58/0x60)
[ 1.427187] [<c0443358>] (device_driver_attach) from [<c04433c0>] (__driver_attach+0x60/0xe0)
[ 1.435814] [<c04433c0>] (__driver_attach) from [<c0440ee8>] (bus_for_each_dev+0x68/0xa0)
[ 1.444134] [<c0440ee8>] (bus_for_each_dev) from [<c0442004>] (bus_add_driver+0x1a4/0x200)
[ 1.452509] [<c0442004>] (bus_add_driver) from [<c04439f4>] (driver_register+0x74/0x120)
[ 1.460718] [<c04439f4>] (driver_register) from [<c091d8f8>] (bcma_modinit+0x14/0x4c)
[ 1.468674] [<c091d8f8>] (bcma_modinit) from [<c01027e4>] (do_one_initcall+0x84/0x200)
[ 1.476715] [<c01027e4>] (do_one_initcall) from [<c0901178>] (kernel_init_freeable+0x210/0x2b8)
[ 1.485537] [<c0901178>] (kernel_init_freeable) from [<c065f4d0>] (kernel_init+0x8/0x118)
[ 1.493829] [<c065f4d0>] (kernel_init) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
[ 1.501507] Exception stack(0xc7823fb0 to 0xc7823ff8)
[ 1.506616] 3fa0: 00000000 00000000 00000000 00000000
[ 1.514908] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 1.523200] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 1.530180] ---[ end trace 335ee604b627e32a ]---
[ 1.535792] Broadcom B53 (2) bcma_mdio-0-0:1e: failed to register switch: -517
[ 1.543190] 8<--- cut here ---
[ 1.546289] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 1.554497] pgd = (ptrval)
[ 1.557233] [00000000] *pgd=00000000
[ 1.560863] Internal error: Oops: 805 [#1] SMP ARM
[ 1.565721] Modules linked in:
[ 1.568820] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 5.4.238 #0
[ 1.576330] Hardware name: Generic DT based system
[ 1.581188] PC is at release_nodes+0x5c/0x1e0
[ 1.585599] LR is at 0x0
[ 1.588167] pc : [<c0446fdc>] lr : [<00000000>] psr: 80000093
[ 1.594533] sp : c7823c10 ip : c7b7c700 fp : c07f7460
[ 1.599838] r10: 00000000 r9 : 00000000 r8 : c0447240
[ 1.605142] r7 : c04461c0 r6 : c7b73800 r5 : c7823c10 r4 : 00000001
[ 1.611759] r3 : a0000013 r2 : c7b73940 r1 : c7b7c700 r0 : c7b73a00
[ 1.618376] Flags: Nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none
[ 1.625690] Control: 10c5387d Table: 0000406a DAC: 00000051
[ 1.631526] Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
[ 1.637612] Stack: (0xc7823c10 to 0xc7824000)
[ 1.642023] 3c00: c7823c10 c7823c10 c0c2b504 c7b73800
[ 1.650315] 3c20: c0c2b4fc 00000000 fffffdfb c0a26f64 00000000 c0442c54 00000000 c7823c70
[ 1.658607] 3c40: c0443220 00000000 00000001 00000000 c7b73800 c0440fb0 c7a214d4 c7918c54
[ 1.666899] 3c60: c7b73800 c7b73844 c0a26970 c0442ab0 c7b73800 00000001 c7b73800 c7b73800
[ 1.675191] 3c80: c0a26970 c7b790a8 00000000 c0441d48 c7b73800 c0a08648 00000000 c043f064
[ 1.683483] 3ca0: c7b738a8 c7b738a8 c07e5f54 c0c2b3b0 0000001e 98eaeeb8 c7b73800 00000000
[ 1.691775] 3cc0: c7b73800 c7ed660c 0000001e c07f5e88 c0a283e8 c04a4344 c7b79000 c7ed6500
[ 1.700067] 3ce0: c7ed660c c04c9c44 00000000 80000013 c07d8a38 00000001 c7b790a8 00000000
[ 1.708358] 3d00: 00000000 c07ff530 c7ed6500 0000001e c07d8a38 c7b79000 c7964800 c7b98580
[ 1.716650] 3d20: c7964810 00000000 c0a283e8 00000000 c08001f8 c04aed54 00000000 c7823d50
[ 1.724942] 3d40: c07f9e98 c7b98580 c7964800 c7b9d580 c79f0840 c04ae6f4 c0a283e8 c7964810
[ 1.733234] 3d60: c0c2b4fc 00000000 00000000 c04ca308 c0c2b504 c7964810 c0c2b4fc c0442d30
[ 1.741526] 3d80: 00000000 c7823db8 c0443220 00000000 00000001 00000000 00000001 c0440fb0
[ 1.749818] 3da0: c796c0d4 c78ad654 c7964810 c7964854 c0a2995c c0442ab0 c7964810 00000001
[ 1.758110] 3dc0: c7964810 c7964810 c0a2995c c789d610 00000000 c0441d48 c7964810 c0a08648
[ 1.766402] 3de0: 00000000 c043f064 c08011c0 00000000 0000d145 c7823dec 000a005a 98eaeeb8
[ 1.774694] 3e00: c79f0870 c79f0864 c79f0840 c79f0870 00000000 c7964800 0000052d c04cae60
[ 1.782986] 3e20: 98eaeeb8 c79f0840 c789d600 c789d610 c7ed4d2c 00000000 c0a29d8c 00000000
[ 1.791278] 3e40: c7ffeec0 c04d09a8 c789d610 c0a29d8c c0c2b4fc 00000000 00000000 c0444b74
[ 1.799570] 3e60: c0c2b504 c789d610 c0c2b4fc c0442d30 c789d610 00000000 c0a29d8c 00000000
[ 1.807861] 3e80: c797b1d4 c0928850 00000007 c0443358 c0a29d8c c789d610 c0443360 c04433c0
[ 1.816153] 3ea0: c789d610 c0a29d8c c0443360 c0440ee8 c781c6a4 c7877590 c0a29d8c c797b180
[ 1.824445] 3ec0: c0a23308 c0442004 c0801be0 c0a33014 c0a29d8c c0a33014 ffffe000 c091d8e4
[ 1.832737] 3ee0: 00000000 c04439f4 00000000 c0a33014 ffffe000 c091d8f8 c0a08648 c01027e4
[ 1.841029] 3f00: c7ffeed5 00000000 c0850948 00000000 c7ffeed5 c0139de8 c0a0f7f8 c0819c24
[ 1.849321] 3f20: 00000006 00000006 c079a9f8 00000000 c07a7304 c079aa6c c7ffeed5 00000000
[ 1.857613] 3f40: 00000000 00000006 c09005e8 98eaeeb8 c0930370 c09005e8 c0a33028 c0a33010
[ 1.865905] 3f60: c0928870 c0901178 00000006 00000006 00000000 c09005e8 c7824000 00000056
[ 1.874197] 3f80: 00000001 00000000 c065f4c8 00000000 00000000 00000000 00000000 00000000
[ 1.882489] 3fa0: 00000000 c065f4d0 00000000 c01010e8 00000000 00000000 00000000 00000000
[ 1.890781] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 1.899073] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[ 1.907365] [<c0446fdc>] (release_nodes) from [<c0442c54>] (really_probe+0x114/0x360)
[ 1.915321] [<c0442c54>] (really_probe) from [<c0440fb0>] (bus_for_each_drv+0x70/0xa0)
[ 1.923362] [<c0440fb0>] (bus_for_each_drv) from [<c0442ab0>] (__device_attach+0xd0/0x140)
[ 1.931738] [<c0442ab0>] (__device_attach) from [<c0441d48>] (bus_probe_device+0x88/0xa0)
[ 1.940030] [<c0441d48>] (bus_probe_device) from [<c043f064>] (device_add+0x3c4/0x5e0)
[ 1.948070] [<c043f064>] (device_add) from [<c04a4344>] (mdio_device_register+0x24/0x60)
[ 1.956278] [<c04a4344>] (mdio_device_register) from [<c04c9c44>] (of_mdiobus_register+0x1e4/0x2e0)
[ 1.965464] [<c04c9c44>] (of_mdiobus_register) from [<c04aed54>] (bcma_mdio_mii_register+0xb4/0x100)
[ 1.974733] [<c04aed54>] (bcma_mdio_mii_register) from [<c04ae6f4>] (bgmac_probe+0x154/0x5a0)
[ 1.983360] [<c04ae6f4>] (bgmac_probe) from [<c04ca308>] (bcma_device_probe+0x28/0x60)
[ 1.991400] [<c04ca308>] (bcma_device_probe) from [<c0442d30>] (really_probe+0x1f0/0x360)
[ 1.999720] [<c0442d30>] (really_probe) from [<c0440fb0>] (bus_for_each_drv+0x70/0xa0)
[ 2.007733] [<c0440fb0>] (bus_for_each_drv) from [<c0442ab0>] (__device_attach+0xd0/0x140)
[ 2.016137] [<c0442ab0>] (__device_attach) from [<c0441d48>] (bus_probe_device+0x88/0xa0)
[ 2.024429] [<c0441d48>] (bus_probe_device) from [<c043f064>] (device_add+0x3c4/0x5e0)
[ 2.032441] [<c043f064>] (device_add) from [<c04cae60>] (bcma_bus_register+0x480/0x640)
[ 2.040566] [<c04cae60>] (bcma_bus_register) from [<c04d09a8>] (bcma_host_soc_probe+0x68/0xa0)
[ 2.049304] [<c04d09a8>] (bcma_host_soc_probe) from [<c0444b74>] (platform_drv_probe+0x34/0x80)
[ 2.058155] [<c0444b74>] (platform_drv_probe) from [<c0442d30>] (really_probe+0x1f0/0x360)
[ 2.066530] [<c0442d30>] (really_probe) from [<c0443358>] (device_driver_attach+0x58/0x60)
[ 2.074906] [<c0443358>] (device_driver_attach) from [<c04433c0>] (__driver_attach+0x60/0xe0)
[ 2.083561] [<c04433c0>] (__driver_attach) from [<c0440ee8>] (bus_for_each_dev+0x68/0xa0)
[ 2.091853] [<c0440ee8>] (bus_for_each_dev) from [<c0442004>] (bus_add_driver+0x1a4/0x200)
[ 2.100256] [<c0442004>] (bus_add_driver) from [<c04439f4>] (driver_register+0x74/0x120)
[ 2.108465] [<c04439f4>] (driver_register) from [<c091d8f8>] (bcma_modinit+0x14/0x4c)
[ 2.116393] [<c091d8f8>] (bcma_modinit) from [<c01027e4>] (do_one_initcall+0x84/0x200)
[ 2.124434] [<c01027e4>] (do_one_initcall) from [<c0901178>] (kernel_init_freeable+0x210/0x2b8)
[ 2.133257] [<c0901178>] (kernel_init_freeable) from [<c065f4d0>] (kernel_init+0x8/0x118)
[ 2.141576] [<c065f4d0>] (kernel_init) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
[ 2.149254] Exception stack(0xc7823fb0 to 0xc7823ff8)
[ 2.154363] 3fa0: 00000000 00000000 00000000 00000000
[ 2.162655] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 2.170947] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 2.177648] Code: e2844001 05901000 e152000c e58ce004 (e58ec000)
[ 2.183846] ---[ end trace 335ee604b627e32b ]---
[ 2.188508] Kernel panic - not syncing: Fatal exception
[ 2.193841] Rebooting in 1 seconds..
[ 4.187419] Reboot failed -- System halted

2023-08-02 09:36:16

by Rafał Miłecki

[permalink] [raw]
Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

On 2.08.2023 09:00, Rafał Miłecki wrote:
> With your comment I decided to try CONFIG_PROVE_LOCKING anyway / again
> and this time on 1 of my BCM53573 devices I got something very
> interesting on the first boot.
>
> FWIW following error:
> Broadcom B53 (2) bcma_mdio-0-0:1e: failed to register switch: -517
> is caused by invalid DT I sent fixes for just recently.
>
> Please scroll through the first booting lines for the WARNING:
>
> (...)
> [    1.167234] bgmac_bcma bcma0:5: Found PHY addr: 30 (NOREGS)
> [    1.173655] ------------[ cut here ]------------
> [    1.178374] WARNING: CPU: 0 PID: 1 at kernel/locking/mutex.c:950 __mutex_lock+0x6b4/0x8a0
> [    1.186721] DEBUG_LOCKS_WARN_ON(lock->magic != lock)

Ah, that mutex WARNING comes from my Tenda AC9 device which happens to
use a hacky OpenWrt downstream b53 driver. That driver uses wrong API
(it behaves as PHY driver instead of MDIO driver). It results in probing
against PHY device which isn't properly initialized.

Long story short: above WARNING is just a noise. Ignore it please. Sorry
for that.

Kernel compiled with CONFIG_PROVE_LOCKING still works fine on other
devices and on Tenda AC9 after fixing PHY<->MDIO thing. That kernel
option hides actual bug whatever it is.

2023-08-04 11:40:34

by Rafał Miłecki

[permalink] [raw]
Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

On 2.08.2023 00:10, Rafał Miłecki wrote:
> Reverting that extra commit from v5.4.238 allows me to run Linux for
> hours again (currently 3 devices x 6 hours and counting). So I need in
> total 10+1 reverts from 5.4 branch to get a stable kernel.

I switched back to OpenWrt's kernel 5.4 and applied all those reverts I
found. Nothing. I was still getting hangs / lockups + reboots.

After more bisecting and I found out it's because OpenWrt backported
commit ad9b10d1eaad ("mtd: core: introduce of support for dynamic
partitions"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9b10d1eaada169bd764abcab58f08538877e26

It didn't make any sense to me. That patch does nothing on my device and
its code is only executed when booting.

It makes even less sense to me. Why such changes that should not affect
anything actually break stability for BCM53573?

I narrowed above patch even furher. It's actually enough to apply below
diff to break kernel stability:

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index f69c5b94e..f10dd3af1 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -590,6 +590,25 @@ static int mtd_nvmem_add(struct mtd_info *mtd)
return 0;
}

+static void mtd_check_of_node(struct mtd_info *mtd)
+{
+ struct device_node *partitions, *parent_dn;
+ struct mtd_info *parent;
+
+ /* Check if MTD already has a device node */
+ if (dev_of_node(&mtd->dev))
+ return;
+
+ /* Check if a partitions node exist */
+ parent = mtd_get_master(mtd);
+ parent_dn = dev_of_node(&parent->dev);
+ pr_info("[%s] mtd->name:%s parent_dn:%pOF\n", __func__, mtd->name, parent_dn);
+ if (!parent_dn)
+ return;
+
+ of_node_put(parent_dn);
+}
+
/**
* add_mtd_device - register an MTD device
* @mtd: pointer to new MTD device info structure
@@ -673,6 +692,7 @@ int add_mtd_device(struct mtd_info *mtd)
mtd->dev.devt = MTD_DEVT(i);
dev_set_name(&mtd->dev, "mtd%d", i);
dev_set_drvdata(&mtd->dev, mtd);
+ mtd_check_of_node(mtd);
of_node_get(mtd_get_of_node(mtd));
error = device_register(&mtd->dev);
if (error) {


2023-08-04 11:50:25

by Rafał Miłecki

[permalink] [raw]
Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

On 2.08.2023 00:10, Rafał Miłecki wrote:
> Unfortunately enabling *any* of following options:
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> seems to make locksup/hangs go away. I tried for few hours.

I decided to find out why enabling CONFIG_DEBUG_MUTEXES "fixes" kernel /
device stability for me. I tried enabling manually code that normally
hides behind the #ifdev CONFIG_DEBUG_MUTEXES.

Attached to this e-mail is a small patch that is enough to make my
kernel stable (mutex-fix-bcm53573.diff).

#####

It's not what's the most interesting thought. What really doesn't make
sense anymore is that below diff (on top of attached one) brings back
hangs/lockups.

I triple checked that. Dropping a single unused function breaks kernel /
device stability on BCM53573!

AFAIK the only thing below diff actually affects is location of symbols
(I actually verified that by comparing System.map before and after -
over 22'000 of relocated symbols).

Can some unfortunate location of symbols cause those hangs/lockups?


diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 4fe40910f..c440222a4 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -34,6 +34,8 @@ void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
INIT_LIST_HEAD(&waiter->list);
}

+/* Dropping below function brings back hangs/lockups & reboots */
+#if 0
void debug_mutex_wake_waiter(struct mutex *lock, struct mutex_waiter *waiter)
{
lockdep_assert_held(&lock->wait_lock);
@@ -41,6 +43,7 @@ void debug_mutex_wake_waiter(struct mutex *lock, struct mutex_waiter *waiter)
DEBUG_LOCKS_WARN_ON(waiter->magic != waiter);
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
}
+#endif

void debug_mutex_free_waiter(struct mutex_waiter *waiter)
{


Attachments:
mutex-fix-bcm53573.diff (2.19 kB)

2023-08-07 13:16:05

by Rafał Miłecki

[permalink] [raw]
Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

On 4.08.2023 13:07, Rafał Miłecki wrote:
> I triple checked that. Dropping a single unused function breaks kernel /
> device stability on BCM53573!
>
> AFAIK the only thing below diff actually affects is location of symbols
> (I actually verified that by comparing System.map before and after -
> over 22'000 of relocated symbols).
>
> Can some unfortunate location of symbols cause those hangs/lockups?

I performed another experiment. First I dropped mtd_check_of_node() to
bring kernel back to the stable state.

Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
ended up adding just enough to make sure all post-mtd symbols in
System.map got the same offset as in case of backporting
mtd_check_of_node().

I started experiencing lockups/hangs again.

I repeated the same test with adding dumb code to the brcm_nvram_probe()
and verifying symbols offsets following brcm_nvram_probe one.

I believe this confirms that this problem is about offset or alignment
of some specific symbol(s). The remaining question is what symbols and
how to fix or workaround that.

Following dump change brings back lockups/hangs:

diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index ee437af41..0a24dec55 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -1028,6 +1028,22 @@ static long mtdchar_unlocked_ioctl(struct file *file, u_int cmd, u_long arg)
{
int ret;

+ if (!file)
+ pr_info("Missing\n");
+ WARN_ON(!file);
+ WARN_ON(cmd == 1234);
+ WARN_ON(cmd == 5678);
+ WARN_ON(cmd == 1234);
+ WARN_ON(cmd == 5678);
+ WARN_ON(cmd == 1234);
+ WARN_ON(cmd == 5678);
+ WARN_ON(cmd == 1234);
+ WARN_ON(cmd == 5678);
+ WARN_ON(cmd == 1234);
+ WARN_ON(cmd == 5678);
+ WARN_ON(cmd == 1234);
+ WARN_ON(cmd == 5678);
+
mutex_lock(&mtd_mutex);
ret = mtdchar_ioctl(file, cmd, arg);
mutex_unlock(&mtd_mutex);


2023-08-07 20:36:01

by Florian Fainelli

[permalink] [raw]
Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

On 8/7/23 04:10, Rafał Miłecki wrote:
> On 4.08.2023 13:07, Rafał Miłecki wrote:
>> I triple checked that. Dropping a single unused function breaks kernel /
>> device stability on BCM53573!
>>
>> AFAIK the only thing below diff actually affects is location of symbols
>> (I actually verified that by comparing System.map before and after -
>> over 22'000 of relocated symbols).
>>
>> Can some unfortunate location of symbols cause those hangs/lockups?
>
> I performed another experiment. First I dropped mtd_check_of_node() to
> bring kernel back to the stable state.
>
> Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
> ended up adding just enough to make sure all post-mtd symbols in
> System.map got the same offset as in case of backporting
> mtd_check_of_node().
>
> I started experiencing lockups/hangs again.
>
> I repeated the same test with adding dumb code to the brcm_nvram_probe()
> and verifying symbols offsets following brcm_nvram_probe one.
>
> I believe this confirms that this problem is about offset or alignment
> of some specific symbol(s). The remaining question is what symbols and
> how to fix or workaround that.

In the config.gz file you attached in your first email, both
CONFIG_MTD_* and CONFIG_NVMEM_* so it is not like we are reaching into
module space for code and/or data and need veneers or anything, it is
part of the kernel image so we can assert the maximum distance between
instructions etc.

Now is it just that specific mutex that is an issue, or do other mutexes
through the system do cause problems as well?

Do we suspect the toolchain to be possibly problematic?

>
> Following dump change brings back lockups/hangs:
>
> diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
> index ee437af41..0a24dec55 100644
> --- a/drivers/mtd/mtdchar.c
> +++ b/drivers/mtd/mtdchar.c
> @@ -1028,6 +1028,22 @@ static long mtdchar_unlocked_ioctl(struct file
> *file, u_int cmd, u_long arg)
>  {
>      int ret;
>
> +    if (!file)
> +        pr_info("Missing\n");
> +    WARN_ON(!file);
> +    WARN_ON(cmd == 1234);
> +    WARN_ON(cmd == 5678);
> +    WARN_ON(cmd == 1234);
> +    WARN_ON(cmd == 5678);
> +    WARN_ON(cmd == 1234);
> +    WARN_ON(cmd == 5678);
> +    WARN_ON(cmd == 1234);
> +    WARN_ON(cmd == 5678);
> +    WARN_ON(cmd == 1234);
> +    WARN_ON(cmd == 5678);
> +    WARN_ON(cmd == 1234);
> +    WARN_ON(cmd == 5678);
> +
>      mutex_lock(&mtd_mutex);
>      ret = mtdchar_ioctl(file, cmd, arg);
>      mutex_unlock(&mtd_mutex);
>

--
Florian


Attachments:
smime.p7s (4.12 kB)
S/MIME Cryptographic Signature

2023-08-11 12:56:37

by Rafał Miłecki

[permalink] [raw]
Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

On 7.08.2023 20:34, Florian Fainelli wrote:
> On 8/7/23 04:10, Rafał Miłecki wrote:
>> On 4.08.2023 13:07, Rafał Miłecki wrote:
>>> I triple checked that. Dropping a single unused function breaks kernel /
>>> device stability on BCM53573!
>>>
>>> AFAIK the only thing below diff actually affects is location of symbols
>>> (I actually verified that by comparing System.map before and after -
>>> over 22'000 of relocated symbols).
>>>
>>> Can some unfortunate location of symbols cause those hangs/lockups?
>>
>> I performed another experiment. First I dropped mtd_check_of_node() to
>> bring kernel back to the stable state.
>>
>> Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
>> ended up adding just enough to make sure all post-mtd symbols in
>> System.map got the same offset as in case of backporting
>> mtd_check_of_node().
>>
>> I started experiencing lockups/hangs again.
>>
>> I repeated the same test with adding dumb code to the brcm_nvram_probe()
>> and verifying symbols offsets following brcm_nvram_probe one.
>>
>> I believe this confirms that this problem is about offset or alignment
>> of some specific symbol(s). The remaining question is what symbols and
>> how to fix or workaround that.
>
> In the config.gz file you attached in your first email, both CONFIG_MTD_* and CONFIG_NVMEM_* so it is not like we are reaching into module space for code and/or data and need veneers or anything, it is part of the kernel image so we can assert the maximum distance between instructions etc.
>
> Now is it just that specific mutex that is an issue, or do other mutexes through the system do cause problems as well?

If you mean mtd mutex, I'm quite sure it's not the one to blame. It just
happened modified function was using a mutex. Could be any other.


> Do we suspect the toolchain to be possibly problematic?

Maybe, I really don't know much such low level stuff.


>>
>> Following dump change brings back lockups/hangs:
>>
>> diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
>> index ee437af41..0a24dec55 100644
>> --- a/drivers/mtd/mtdchar.c
>> +++ b/drivers/mtd/mtdchar.c
>> @@ -1028,6 +1028,22 @@ static long mtdchar_unlocked_ioctl(struct file *file, u_int cmd, u_long arg)
>>   {
>>       int ret;
>>
>> +    if (!file)
>> +        pr_info("Missing\n");
>> +    WARN_ON(!file);
>> +    WARN_ON(cmd == 1234);
>> +    WARN_ON(cmd == 5678);
>> +    WARN_ON(cmd == 1234);
>> +    WARN_ON(cmd == 5678);
>> +    WARN_ON(cmd == 1234);
>> +    WARN_ON(cmd == 5678);
>> +    WARN_ON(cmd == 1234);
>> +    WARN_ON(cmd == 5678);
>> +    WARN_ON(cmd == 1234);
>> +    WARN_ON(cmd == 5678);
>> +    WARN_ON(cmd == 1234);
>> +    WARN_ON(cmd == 5678);
>> +
>>       mutex_lock(&mtd_mutex);
>>       ret = mtdchar_ioctl(file, cmd, arg);
>>       mutex_unlock(&mtd_mutex);
>>
>


2023-08-14 09:38:09

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

Hi Rafal,

On Mon, Aug 7, 2023 at 1:11 PM Rafał Miłecki <[email protected]> wrote:
> On 4.08.2023 13:07, Rafał Miłecki wrote:
> > I triple checked that. Dropping a single unused function breaks kernel /
> > device stability on BCM53573!
> >
> > AFAIK the only thing below diff actually affects is location of symbols
> > (I actually verified that by comparing System.map before and after -
> > over 22'000 of relocated symbols).
> >
> > Can some unfortunate location of symbols cause those hangs/lockups?
>
> I performed another experiment. First I dropped mtd_check_of_node() to
> bring kernel back to the stable state.
>
> Then I started adding useless code to the mtdchar_unlocked_ioctl(). I
> ended up adding just enough to make sure all post-mtd symbols in
> System.map got the same offset as in case of backporting
> mtd_check_of_node().
>
> I started experiencing lockups/hangs again.
>
> I repeated the same test with adding dumb code to the brcm_nvram_probe()
> and verifying symbols offsets following brcm_nvram_probe one.
>
> I believe this confirms that this problem is about offset or alignment
> of some specific symbol(s). The remaining question is what symbols and
> how to fix or workaround that.

I had similar experiences on other ARM platforms many years ago:
bisection lead to something completely bogus, and it turned out
adding a single line of innocent code made the system lock-up or crash
unexpectedly. It was definitely related to alignment, as adding the
right extra amount of innocent code would fix the problem. Until some
later change changing alignment again...
I never found the real cause, but the problems went away over time.
I am not sure I did enable all required errata config options, so I
may have missed some...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds