2023-08-01 22:46:26

by Florian Fainelli

Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

Hi Rafal,

On 8/1/23 15:10, Rafał Miłecki wrote:
> Hi,
>
> Years ago I added support for Broadcom's BCM53573 SoCs. We released
> firmwares based on Linux 4.4 (and later on 4.14) that worked almost
> fine. There was one little issue we couldn't debug or fix: random hangs
> and reboots. They were too rare to deal with (most devices worked fine
> for weeks or months).
>
> Recently I updated my stable 5.4 kernel and started experiencing
> stability issues myself! After some uptime (usually from 0 to 20
> minutes of close to zero activity) the serial console hangs. I can't type
> anything and I stop getting any messages. I have to wait about a minute
> for the watchdog to kick in and reboot the device.
>
> #####
>
> I took that great chance and decided to track the regression.
>
> The Linux 5.4 stable branch was stable up to release v5.4.197.
> Starting with v5.4.198 I began experiencing those stability issues. I
> bisected it down to the commit 4460066eb248 ("ipv6: fix locking issues
> with loops over idev->addr_list"):
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4460066eb2480b9e203c73755e12e2efc820a27e
>
> With above commit reverted I was able to use stable 5.4 branch up to the
> release v5.4.207. Starting with v5.4.208 it got unstable again. I
> bisected it down to:
> commit d0d583484d2e ("locking/refcount: Consolidate implementations of
> refcount_t")
> commit dab787c73f6e ("locking/refcount: Consolidate
> REFCOUNT_{MAX,SATURATED} definitions")
> commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
> commit 809554147d60 ("locking/refcount: Improve performance of generic
> REFCOUNT_FULL code")
> commit 9c9269977f03 ("locking/refcount: Move the bulk of the
> REFCOUNT_FULL implementation into the <linux/refcount.h> header")
> commit 04bff7d7b808 ("locking/refcount: Remove unused
> refcount_*_checked() variants")
> commit 513b19a43bec ("locking/refcount: Ensure integer operands are
> treated as signed")
> commit 68b4ee68e8c8 ("locking/refcount: Define constants for
> saturation and max refcount values")
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=809554147d609163cfbaf815c443c575b538a7ef
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=04bff7d7b8081c4bb2e8171be31d33df297eee5b
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=513b19a43becee5f7af6d283bb9d3d241a8a21a8
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
> (I didn't actually check above commits individually).
>
> Reverting the above locking/refcount commits worked fine for a few releases:
> up to the v5.4.219. Starting with v5.4.220 I got hangs again. I bisected
> that down to the commit 131287ff833d ("once: add DO_ONCE_SLOW() for
> sleepable contexts").
>
> Reverting that extra commit from v5.4.238 allows me to run Linux for
> hours again (currently 3 devices x 6 hours and counting). So I need in
> total 10+1 reverts from 5.4 branch to get a stable kernel.
>
> #####
>
> I'm clueless at this point. Is it possible that the kernel has some locking
> bug I can hit only on this specific SoC? BCM53573s have a single ARM
> Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
> can think of is a slow arch timer running at 36.8 kHz.

From the look of it, it seems like the CPU might have bugs with atomics?

Your log indicates that your Cortex-A7 is r0p5, a revision described as
susceptible to ARM_ERRATA_814220. Do you have it enabled by any chance?
If not, can you enable it and see if it makes any difference?
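
For readers following along, here is a minimal sketch of how a revision such as r0p5 can be read out of the ARM Main ID Register (MIDR) value that the kernel typically prints at boot. The value 0x410fc075 is an illustrative assumption for a Cortex-A7 r0p5, not a number taken from the log in this thread, and the function names are hypothetical.

```python
def decode_midr(midr: int) -> dict:
    """Split an ARM Main ID Register value into its architectural fields."""
    return {
        "implementer": (midr >> 24) & 0xFF,   # 0x41 is Arm Ltd.
        "variant": (midr >> 20) & 0xF,        # the "r" number
        "architecture": (midr >> 16) & 0xF,
        "part": (midr >> 4) & 0xFFF,          # 0xC07 is Cortex-A7
        "revision": midr & 0xF,               # the "p" number
    }


def revision_string(midr: int) -> str:
    """Format the variant/revision pair in the usual rNpM notation."""
    fields = decode_midr(midr)
    return f"r{fields['variant']}p{fields['revision']}"


# Assumed example value, for illustration only.
print(revision_string(0x410FC075))
```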

>
> I tried compiling kernel with:
> CONFIG_SOFTLOCKUP_DETECTOR=y
> CONFIG_DETECT_HUNG_TASK=y
> CONFIG_WQ_WATCHDOG=y
> but it didn't change or report anything.
>
> Unfortunately enabling *any* of the following options:
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> seems to make the lockups/hangs go away. I tested for a few hours.
>
> Sadly I don't have access to JTAG or any low level debugging interface.
>
> Does looking at the commits I reported above give anyone a hint about
> what may be going on?
>

--
Florian


2023-08-02 07:23:29

by Rafał Miłecki

Subject: Re: ARM board lockups/hangs triggered by locks and mutexes

On 2.08.2023 00:25, Florian Fainelli wrote:
> Hi Rafal,
>
> On 8/1/23 15:10, Rafał Miłecki wrote:
>> Hi,
>>
>> Years ago I added support for Broadcom's BCM53573 SoCs. We released
>> firmwares based on Linux 4.4 (and later on 4.14) that worked almost
>> fine. There was one little issue we couldn't debug or fix: random hangs
>> and reboots. They were too rare to deal with (most devices worked fine
>> for weeks or months).
>>
>> Recently I updated my stable 5.4 kernel and started experiencing
>> stability issues myself! After some uptime (usually from 0 to 20
>> minutes of close to zero activity) the serial console hangs. I can't type
>> anything and I stop getting any messages. I have to wait about a minute
>> for the watchdog to kick in and reboot the device.
>>
>> #####
>>
>> I took that great chance and decided to track the regression.
>>
>> The Linux 5.4 stable branch was stable up to release v5.4.197.
>> Starting with v5.4.198 I began experiencing those stability issues. I
>> bisected it down to the commit 4460066eb248 ("ipv6: fix locking issues
>> with loops over idev->addr_list"):
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4460066eb2480b9e203c73755e12e2efc820a27e
>>
>> With above commit reverted I was able to use stable 5.4 branch up to the
>> release v5.4.207. Starting with v5.4.208 it got unstable again. I
>> bisected it down to:
>> commit d0d583484d2e ("locking/refcount: Consolidate implementations of
>> refcount_t")
>> commit dab787c73f6e ("locking/refcount: Consolidate
>> REFCOUNT_{MAX,SATURATED} definitions")
>> commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
>> commit 809554147d60 ("locking/refcount: Improve performance of generic
>> REFCOUNT_FULL code")
>> commit 9c9269977f03 ("locking/refcount: Move the bulk of the
>> REFCOUNT_FULL implementation into the <linux/refcount.h> header")
>> commit 04bff7d7b808 ("locking/refcount: Remove unused
>> refcount_*_checked() variants")
>> commit 513b19a43bec ("locking/refcount: Ensure integer operands are
>> treated as signed")
>> commit 68b4ee68e8c8 ("locking/refcount: Define constants for
>> saturation and max refcount values")
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=809554147d609163cfbaf815c443c575b538a7ef
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=04bff7d7b8081c4bb2e8171be31d33df297eee5b
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=513b19a43becee5f7af6d283bb9d3d241a8a21a8
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
>> (I didn't actually check above commits individually).
>>
>> Reverting the above locking/refcount commits worked fine for a few releases:
>> up to the v5.4.219. Starting with v5.4.220 I got hangs again. I bisected
>> that down to the commit 131287ff833d ("once: add DO_ONCE_SLOW() for
>> sleepable contexts").
>>
>> Reverting that extra commit from v5.4.238 allows me to run Linux for
>> hours again (currently 3 devices x 6 hours and counting). So I need in
>> total 10+1 reverts from 5.4 branch to get a stable kernel.
>>
>> #####
>>
>> I'm clueless at this point. Is it possible that the kernel has some locking
>> bug I can hit only on this specific SoC? BCM53573s have a single ARM
>> Cortex-A7 CPU running at 900 MHz. The only unusual thing about this hw I
>> can think of is a slow arch timer running at 36.8 kHz.
>
> From the look of it, it seems like the CPU might have bugs with atomics?
>
> Your log indicates that your Cortex-A7 is r0p5, a revision described as susceptible to ARM_ERRATA_814220. Do you have it enabled by any chance? If not, can you enable it and see if it makes any difference?

I had it disabled. Unfortunately CONFIG_ARM_ERRATA_814220=y doesn't help.
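
When comparing builds like this, it can help to confirm what a given kernel `.config` actually contains before drawing conclusions. Below is a minimal sketch, assuming the usual `.config` conventions (`CONFIG_FOO=y` for enabled options, `# CONFIG_FOO is not set` for disabled ones). The `sample` text and the helper name `kconfig_state` are hypothetical; in practice the text would be read from the build's own `.config` file.

```python
import re


def kconfig_state(config_text: str, option: str) -> str:
    """Return the value assigned to a Kconfig option, or 'not set'."""
    pattern = re.compile(rf"^{re.escape(option)}=(.*)$", re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else "not set"


# Hypothetical .config excerpt, for illustration only.
sample = """\
CONFIG_ARM_ERRATA_814220=y
# CONFIG_DEBUG_SPINLOCK is not set
CONFIG_SOFTLOCKUP_DETECTOR=y
"""

for option in (
    "CONFIG_ARM_ERRATA_814220",
    "CONFIG_DEBUG_SPINLOCK",
    "CONFIG_DEBUG_MUTEXES",
    "CONFIG_SOFTLOCKUP_DETECTOR",
):
    print(f"{option}: {kconfig_state(sample, option)}")
```

An option that is absent from the file is reported the same way as one explicitly marked "is not set", which is usually what matters when checking whether a debug option or erratum workaround made it into the build.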