#regzbot introduced: 0c5ffc3d7b15
#regzbot title: CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and slow-stuck reboots
The upstream changes are being merged into android-mainline repo and at some point we started to observe kernel panics on reboot or long reboot times.
The merge commit is 1f2d9ffc7a5f Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.
Before, the reboot usually took significantly less than 15 seconds and after merge the reboot time fall in the range of 60-100 seconds.
At some point watchdog-like functionality or softdog start to worry about the system stuck somewhere nd panic the system.
The delay is found to be in device's ->shutdown() methods called from kernel_restart():
void kernel_restart_prepare(char *cmd)
{
blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
system_state = SYSTEM_RESTART;
usermodehelper_disable();
device_shutdown(); <---- here
}
The driver in question is ufshcd and its ufshcd_wl_shutdown() shutdown method. It often blocks on scsi_device_quiesce() and upon manual checking it seems that it sleeps on blk_mq_freeze_queue_wait()/wait_event() in blk_freeze_queue():
scsi_device_quiesce(struct scsi_device *sdev)
{
...
blk_mq_freeze_queue(q);
...
}
||
V
void blk_freeze_queue(struct request_queue *q)
{
/*
* In the !blk_mq case we are only calling this to kill the
* q_usage_counter, otherwise this increases the freeze depth
* and waits for it to return to zero. For this reason there is
* no blk_unfreeze_queue(), and blk_freeze_queue() is not
* exported to drivers as the only user for unfreeze is blk_mq.
*/
blk_freeze_queue_start(q);
blk_mq_freeze_queue_wait(q); <--- sleeps on wait_event() here
}
Or in other words:
[ 34.785050][ C4] sysrq: Show Blocked State
[ 34.785132][ C4] task:init state:D stack:9680 pid:1 ppid:0 flags:0x04000008
[ 34.785301][ C4] Call trace:
[ 34.785360][ C4] __switch_to+0x180/0x308
[ 34.785452][ C4] __schedule+0x61c/0x9f0
[ 34.785530][ C4] schedule+0x84/0xf4
[ 34.785602][ C4] blk_mq_freeze_queue_wait+0x78/0xbc
[ 34.785707][ C4] blk_freeze_queue+0x74/0x8c
[ 34.785850][ C4] blk_mq_freeze_queue+0x18/0x2c
[ 34.786033][ C4] scsi_device_quiesce+0x54/0xec
[ 34.786216][ C4] ufshcd_wl_shutdown+0x98/0xc0
[ 34.786396][ C4] device_shutdown+0x1a8/0x264
[ 34.786572][ C4] kernel_restart+0x48/0x11c
[ 34.786742][ C4] __arm64_sys_reboot+0x1a8/0x27c
[ 34.786927][ C4] invoke_syscall+0x60/0x130
[ 34.787096][ C4] el0_svc_common+0xbc/0x100
[ 34.787266][ C4] do_el0_svc+0x38/0xc4
[ 34.787420][ C4] el0_svc+0x34/0xc4
[ 34.787563][ C4] el0t_64_sync_handler+0x8c/0xfc
[ 34.787749][ C4] el0t_64_sync+0x1a4/0x1a8
However, bisect pointed to 0c5ffc3d7b15 cpuidle, dt: Push RCU-idle into driver
->BAD 924aed1646bf cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}()
->BAD a01353cf1896 cpuidle: Fix ct_idle_*() usage
->BAD (doesn't compile, needs one missing header file) 0c5ffc3d7b15 cpuidle, dt: Push RCU-idle into driver
->good c3d42418dca5 cpuidle, OMAP4: Push RCU-idle into driver
Looks like adding CPUIDLE_FLAG_RCU_IDLE flag to idle driver caused this behaviour.
The minimal change that is required for this system to avoid the regression would be one liner that removes the flag (below).
But if it is a real regression, then other idle drivers if used will likely cause this regression too withe same ufshcd driver. There is also a suspicion that CPUIDLE_FLAG_RCU_IDLE
just revealed or uncovered some other problem.
Any thoughts on this? Some missing __cpuidle or noinstr annotations?
Best regards,
Alexey
---
From: Alexey Klimov <[email protected]>
Date: Tue, 14 Mar 2023 17:04:06 +0000
Subject: [PATCH] cpuidle, dt: partially revert "0c5ffc3d7b15 cpuidle, dt: Push RCU-idle into driver"
This partially reverts commit 0c5ffc3d7b15978c6b184938cd6b8af06e436424
as pointed by bisect.
The introduced CPUIDLE_FLAG_RCU_IDLE flag causes much longer reboots:
60..100 seconds vs less than 15 seconds before this change.
The reboot flow seems to block in ufshcd driver ->shutdown() method:
[ 34.785050][ C4] sysrq: Show Blocked State
[ 34.785132][ C4] task:init state:D stack:9680 pid:1 ppid:0 flags:0x04000008
[ 34.785301][ C4] Call trace:
[ 34.785360][ C4] __switch_to+0x180/0x308
[ 34.785452][ C4] __schedule+0x61c/0x9f0
[ 34.785530][ C4] schedule+0x84/0xf4
[ 34.785602][ C4] blk_mq_freeze_queue_wait+0x78/0xbc
[ 34.785707][ C4] blk_freeze_queue+0x74/0x8c
[ 34.785850][ C4] blk_mq_freeze_queue+0x18/0x2c
[ 34.786033][ C4] scsi_device_quiesce+0x54/0xec
[ 34.786216][ C4] ufshcd_wl_shutdown+0x98/0xc0
[ 34.786396][ C4] device_shutdown+0x1a8/0x264
[ 34.786572][ C4] kernel_restart+0x48/0x11c
[ 34.786742][ C4] __arm64_sys_reboot+0x1a8/0x27c
[ 34.786927][ C4] invoke_syscall+0x60/0x130
[ 34.787096][ C4] el0_svc_common+0xbc/0x100
[ 34.787266][ C4] do_el0_svc+0x38/0xc4
[ 34.787420][ C4] el0_svc+0x34/0xc4
[ 34.787563][ C4] el0t_64_sync_handler+0x8c/0xfc
[ 34.787749][ C4] el0t_64_sync+0x1a4/0x1a8
Signed-off-by: Alexey Klimov <[email protected]>
---
drivers/cpuidle/dt_idle_states.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/cpuidle/dt_idle_states.c b/drivers/cpuidle/dt_idle_states.c
index 02aa0b39af9d..7ca3d7d9b5ea 100644
--- a/drivers/cpuidle/dt_idle_states.c
+++ b/drivers/cpuidle/dt_idle_states.c
@@ -77,7 +77,7 @@ static int init_state_node(struct cpuidle_state *idle_state,
if (err)
desc = state_node->name;
- idle_state->flags = CPUIDLE_FLAG_RCU_IDLE;
+ idle_state->flags = 0;
if (of_property_read_bool(state_node, "local-timer-stop"))
idle_state->flags |= CPUIDLE_FLAG_TIMER_STOP;
/*
--
2.40.0.rc1.284.g88254d51c5-goog
On 3/14/23 16:00, Alexey Klimov wrote:
> The delay is found to be in device's ->shutdown() methods called from kernel_restart():
> void kernel_restart_prepare(char *cmd)
> {
> blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
> system_state = SYSTEM_RESTART;
> usermodehelper_disable();
> device_shutdown(); <---- here
> }
>
> The driver in question is ufshcd and its ufshcd_wl_shutdown() shutdown method. It often blocks on scsi_device_quiesce() and upon manual checking it seems that it sleeps on blk_mq_freeze_queue_wait()/wait_event() in blk_freeze_queue():
>
> scsi_device_quiesce(struct scsi_device *sdev)
> {
> ...
> blk_mq_freeze_queue(q);
> ...
> }
> ||
> V
> void blk_freeze_queue(struct request_queue *q)
> {
> /*
> * In the !blk_mq case we are only calling this to kill the
> * q_usage_counter, otherwise this increases the freeze depth
> * and waits for it to return to zero. For this reason there is
> * no blk_unfreeze_queue(), and blk_freeze_queue() is not
> * exported to drivers as the only user for unfreeze is blk_mq.
> */
> blk_freeze_queue_start(q);
> blk_mq_freeze_queue_wait(q); <--- sleeps on wait_event() here
> }
Please let me know if you want me to resubmit patch "scsi: ufs: Remove
the LUN quiescing code from ufshcd_wl_shutdown()"
(https://lore.kernel.org/linux-scsi/[email protected]/).
Thanks,
Bart.
(could you wrap your email please)
On Tue, Mar 14, 2023 at 11:00:04PM +0000, Alexey Klimov wrote:
> #regzbot introduced: 0c5ffc3d7b15 #regzbot title:
> CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and slow-stuck
> reboots
>
> The upstream changes are being merged into android-mainline repo and
> at some point we started to observe kernel panics on reboot or long
> reboot times.
On what hardware? I find it somewhat hard to follow this DT code :/
> Looks like adding CPUIDLE_FLAG_RCU_IDLE flag to idle driver caused
> this behaviour. The minimal change that is required for this system
> to avoid the regression would be one liner that removes the flag
> (below).
>
> But if it is a real regression, then other idle drivers if used will
> likely cause this regression too withe same ufshcd driver. There is
> also a suspicion that CPUIDLE_FLAG_RCU_IDLE just revealed or uncovered
> some other problem.
>
> Any thoughts on this?
So ARM has a weird 'rule' in that idle state 0 (wfi) should not have
RCU_IDLE set, while others should have.
Of the dt_init_idle_driver() users:
- cpuidle-arm: arm_enter_idle_state()
- cpuidle-big_little: bl_enter_powerdown() does ct_cpuidle_{enter,exit}()
- cpuidle-psci: psci_enter_idle_state() uses CPU_PM_CPU_IDLE_ENTER_PARAM_RCU()
- cpuidle-qcom-spm: spm_enter_idle_state() uses CPU_PM_CPU_IDLE_ENTER_PARAM()
- cpuidle-riscv-sbi: sbi_cpuidle_enter_state() uses CPU_PM_CPU_IDLE_ENTER_*_PARAM()
All of them start on index 1 and hence should have RCU_IDLE set, but at
least the arm, qcom-spm and riscv-sbi don't actually appear to abide by
the rules.
Fixing that gives me the below; does that help?
---
diff --git a/drivers/cpuidle/cpuidle-arm.c b/drivers/cpuidle/cpuidle-arm.c
index 7cfb980a357d..58fa81f0fa7d 100644
--- a/drivers/cpuidle/cpuidle-arm.c
+++ b/drivers/cpuidle/cpuidle-arm.c
@@ -39,7 +39,7 @@ static __cpuidle int arm_enter_idle_state(struct cpuidle_device *dev,
* will call the CPU ops suspend protocol with idle index as a
* parameter.
*/
- return CPU_PM_CPU_IDLE_ENTER(arm_cpuidle_suspend, idx);
+ return CPU_PM_CPU_IDLE_ENTER_RCU(arm_cpuidle_suspend, idx);
}
static struct cpuidle_driver arm_idle_driver __initdata = {
diff --git a/drivers/cpuidle/cpuidle-qcom-spm.c b/drivers/cpuidle/cpuidle-qcom-spm.c
index c6e2e91bb4c3..429db2d40114 100644
--- a/drivers/cpuidle/cpuidle-qcom-spm.c
+++ b/drivers/cpuidle/cpuidle-qcom-spm.c
@@ -64,7 +64,7 @@ static __cpuidle int spm_enter_idle_state(struct cpuidle_device *dev,
struct cpuidle_qcom_spm_data *data = container_of(drv, struct cpuidle_qcom_spm_data,
cpuidle_driver);
- return CPU_PM_CPU_IDLE_ENTER_PARAM(qcom_cpu_spc, idx, data->spm);
+ return CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(qcom_cpu_spc, idx, data->spm);
}
static struct cpuidle_driver qcom_spm_idle_driver = {
diff --git a/drivers/cpuidle/cpuidle-riscv-sbi.c b/drivers/cpuidle/cpuidle-riscv-sbi.c
index be383f4b6855..04a601cda06b 100644
--- a/drivers/cpuidle/cpuidle-riscv-sbi.c
+++ b/drivers/cpuidle/cpuidle-riscv-sbi.c
@@ -100,10 +100,9 @@ static __cpuidle int sbi_cpuidle_enter_state(struct cpuidle_device *dev,
u32 state = states[idx];
if (state & SBI_HSM_SUSP_NON_RET_BIT)
- return CPU_PM_CPU_IDLE_ENTER_PARAM(sbi_suspend, idx, state);
- else
- return CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM(sbi_suspend,
- idx, state);
+ return CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(sbi_suspend, idx, state);
+
+ return CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM_RCU(sbi_suspend, idx, state);
}
static __cpuidle int __sbi_enter_domain_idle_state(struct cpuidle_device *dev,
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 3183aeb7f5b4..dd92bdafe2d3 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -334,6 +334,9 @@ extern s64 cpuidle_governor_latency_req(unsigned int cpu);
#define CPU_PM_CPU_IDLE_ENTER(low_level_idle_enter, idx) \
__CPU_PM_CPU_IDLE_ENTER(low_level_idle_enter, idx, idx, 0, 0)
+#define CPU_PM_CPU_IDLE_ENTER_RCU(low_level_idle_enter, idx) \
+ __CPU_PM_CPU_IDLE_ENTER(low_level_idle_enter, idx, idx, 0, 1)
+
#define CPU_PM_CPU_IDLE_ENTER_RETENTION(low_level_idle_enter, idx) \
__CPU_PM_CPU_IDLE_ENTER(low_level_idle_enter, idx, idx, 1, 0)
On Tue, 14 Mar 2023 at 23:21, Bart Van Assche <[email protected]> wrote:
>
> On 3/14/23 16:00, Alexey Klimov wrote:
> > The delay is found to be in device's ->shutdown() methods called from kernel_restart():
> > void kernel_restart_prepare(char *cmd)
> > {
> > blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
> > system_state = SYSTEM_RESTART;
> > usermodehelper_disable();
> > device_shutdown(); <---- here
[..]
>
> Please let me know if you want me to resubmit patch "scsi: ufs: Remove
> the LUN quiescing code from ufshcd_wl_shutdown()"
> (https://lore.kernel.org/linux-scsi/[email protected]/).
This indeed works and helps to reduce reboot time. Thanks!
If you decide to resubmit, feel free to ping me to test it.
However I have no knowledge how comments from Adrian can be addressed.
With that patch the reboot time decreases from 60-100 seconds to
~10-20 seconds. The next
"slow" thing is wlan driver which callback is called from
blocking_notifier_call_chain(&reboot_notifier_list, ...) but obviously
it is not a part of ufs code.
Best regards,
Alexey
On Wed, 15 Mar 2023 at 11:16, Peter Zijlstra <[email protected]> wrote:
>
>
> (could you wrap your email please)
Ouch. Sorry.
> On Tue, Mar 14, 2023 at 11:00:04PM +0000, Alexey Klimov wrote:
> > #regzbot introduced: 0c5ffc3d7b15 #regzbot title:
> > CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and slow-stuck
> > reboots
> >
> > The upstream changes are being merged into android-mainline repo and
> > at some point we started to observe kernel panics on reboot or long
> > reboot times.
>
> On what hardware? I find it somewhat hard to follow this DT code :/
Pixel 6.
> > Looks like adding CPUIDLE_FLAG_RCU_IDLE flag to idle driver caused
> > this behaviour. The minimal change that is required for this system
> > to avoid the regression would be one liner that removes the flag
> > (below).
> >
> > But if it is a real regression, then other idle drivers if used will
> > likely cause this regression too withe same ufshcd driver. There is
> > also a suspicion that CPUIDLE_FLAG_RCU_IDLE just revealed or uncovered
> > some other problem.
> >
> > Any thoughts on this?
>
> So ARM has a weird 'rule' in that idle state 0 (wfi) should not have
> RCU_IDLE set, while others should have.
>
> Of the dt_init_idle_driver() users:
>
> - cpuidle-arm: arm_enter_idle_state()
> - cpuidle-big_little: bl_enter_powerdown() does ct_cpuidle_{enter,exit}()
> - cpuidle-psci: psci_enter_idle_state() uses CPU_PM_CPU_IDLE_ENTER_PARAM_RCU()
> - cpuidle-qcom-spm: spm_enter_idle_state() uses CPU_PM_CPU_IDLE_ENTER_PARAM()
> - cpuidle-riscv-sbi: sbi_cpuidle_enter_state() uses CPU_PM_CPU_IDLE_ENTER_*_PARAM()
>
> All of them start on index 1 and hence should have RCU_IDLE set, but at
> least the arm, qcom-spm and riscv-sbi don't actually appear to abide by
> the rules.
>
> Fixing that gives me the below; does that help?
Double-checked and it seems, unfortunately, the patch doesn't change
the behaviour at all.
The first problematic driver is ufshcd that slows down the reboot the most.
The another one is wlan bcm driver which callback is called from
blocking_notifier_call_chain(...).
Backtraces from it, when it is stuck/slow, involve pci and net
subsystems but I didn't yet narrow it
down to exact function or specific flow.
The patch from Bart helps with ufshcd driver but still reboot times
are 10-20 seconds.
The removing of RCU IDLE flag helps with both drivers.
Is there any debug data I can collect to help with this or any other
patches to test please?
Thanks,
Alexey
On Fri, Mar 17, 2023 at 02:11:25AM +0000, Alexey Klimov wrote:
> On Wed, 15 Mar 2023 at 11:16, Peter Zijlstra <[email protected]> wrote:
> >
> >
> > (could you wrap your email please)
>
> Ouch. Sorry.
>
> > On Tue, Mar 14, 2023 at 11:00:04PM +0000, Alexey Klimov wrote:
> > > #regzbot introduced: 0c5ffc3d7b15 #regzbot title:
> > > CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and slow-stuck
> > > reboots
> > >
> > > The upstream changes are being merged into android-mainline repo and
> > > at some point we started to observe kernel panics on reboot or long
> > > reboot times.
> >
> > On what hardware? I find it somewhat hard to follow this DT code :/
>
> Pixel 6.
What actual cpuidle driver is that thing using? Is there any out-of-tree
code involved? Mark tells me anything arm64 should be using PSCI, so let
me to stare hard at that again.
On Wed, Mar 15, 2023 at 12:16:06PM +0100, Peter Zijlstra wrote:
> So ARM has a weird 'rule' in that idle state 0 (wfi) should not have
> RCU_IDLE set, while others should have.
>
> Of the dt_init_idle_driver() users:
>
> - cpuidle-arm: arm_enter_idle_state()
> - cpuidle-big_little: bl_enter_powerdown() does ct_cpuidle_{enter,exit}()
> - cpuidle-psci: psci_enter_idle_state() uses CPU_PM_CPU_IDLE_ENTER_PARAM_RCU()
> - cpuidle-qcom-spm: spm_enter_idle_state() uses CPU_PM_CPU_IDLE_ENTER_PARAM()
> - cpuidle-riscv-sbi: sbi_cpuidle_enter_state() uses CPU_PM_CPU_IDLE_ENTER_*_PARAM()
>
> All of them start on index 1 and hence should have RCU_IDLE set, but at
> least the arm, qcom-spm and riscv-sbi don't actually appear to abide by
> the rules.
>
> Fixing that gives me the below; does that help?
FWIW.. I got the meaning of CPU_PM_CPU_IDLE_*'s _RCU thing inverted
last week; they're actually ok as is.
Specifically _RCU wants the @low_level_idle_enter to do
ct_cpuidle_{enter,exit}(). And since the ones I audited didn't in fact
do that, they should not be using the _RCU version.
On Mon, Mar 20, 2023 at 10:05:58AM +0100, Peter Zijlstra wrote:
> On Fri, Mar 17, 2023 at 02:11:25AM +0000, Alexey Klimov wrote:
> > On Wed, 15 Mar 2023 at 11:16, Peter Zijlstra <[email protected]> wrote:
> > >
> > >
> > > (could you wrap your email please)
> >
> > Ouch. Sorry.
> >
> > > On Tue, Mar 14, 2023 at 11:00:04PM +0000, Alexey Klimov wrote:
> > > > #regzbot introduced: 0c5ffc3d7b15 #regzbot title:
> > > > CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and slow-stuck
> > > > reboots
> > > >
> > > > The upstream changes are being merged into android-mainline repo and
> > > > at some point we started to observe kernel panics on reboot or long
> > > > reboot times.
> > >
> > > On what hardware? I find it somewhat hard to follow this DT code :/
> >
> > Pixel 6.
>
> What actual cpuidle driver is that thing using? Is there any out-of-tree
> code involved? Mark tells me anything arm64 should be using PSCI, so let
> me to stare hard at that again.
So specifically, your problem sounds like rcu_synchronize() is taking
very much longer than it used to. Specifically combined with the patch
that makes it 'go-away' this seems to indicate you lost a
ct_cpuidle_enter() call, which is what ends up telling RCU the cpu is
idle and no longer partakes in the whole grace period machinery. Not
telling RCU this results in RCU waiting for an idle cpu to report back
on it's RCU progress, but it being idle means it's not going to be doing
that and things sorta wait around until RCU gets fed up and starts
spraying IPIs to try and get things moving.
Now... if a driver sets CPUIDLE_FLAG_RCU_IDLE it promises to call
ct_cpuidle_{enter,exit}() itself. Hence for any driver that does *NOT*
set that flag, cpuidle_enter_state() calls these functions.
Now, fo PSCI, the DT handler is psci_enter_idle_state(), which uses
CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(), which per the other email, means that
it's low_level_idle_enter := psci_cpu_suspend_enter(), *will* call
ct_cpuidle_{enter,exit}().
Then if we look at psci_cpu_suspend_enter(), it has two cases depending
on psci_power_state_loses_context(). If it doesn't lose context it does
ct_cpuidle_enter() right there and proceeds to call
psci_ops.cpu_suspend() -- whatever that does.
If it does lose state, then it depends on CONFIG_ARM64, on arm64 we do
not call ct_cpuidle_{enter,exit}() but proceed into cpu_suspend().
We can find that function in arch/arm64/kernel/suspend.c, and if you
look at it, you'll note it does in fact call ct_cpuidle_{enter,exit}()
as per promises made.
So AFAICT every path into idle will pass through ct_cpuidle_enter().
On Tue, Mar 14, 2023 at 11:00:04PM +0000, Alexey Klimov wrote:
> #regzbot introduced: 0c5ffc3d7b15
> #regzbot title: CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and slow-stuck reboots
>
> The upstream changes are being merged into android-mainline repo and at some
> point we started to observe kernel panics on reboot or long reboot times.
Where can I find the android-mainline repo, and which specific branch/commit
from that repo is being merged in?
> The merge commit is 1f2d9ffc7a5f Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.
> Before, the reboot usually took significantly less than 15 seconds and after merge the reboot time fall in the range of 60-100 seconds.
> At some point watchdog-like functionality or softdog start to worry about the system stuck somewhere nd panic the system.
>
> The delay is found to be in device's ->shutdown() methods called from kernel_restart():
> void kernel_restart_prepare(char *cmd)
> {
> blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
> system_state = SYSTEM_RESTART;
> usermodehelper_disable();
> device_shutdown(); <---- here
> }
>
> The driver in question is ufshcd and its ufshcd_wl_shutdown() shutdown method. It often blocks on scsi_device_quiesce() and upon manual checking it seems that it sleeps on blk_mq_freeze_queue_wait()/wait_event() in blk_freeze_queue():
>
> scsi_device_quiesce(struct scsi_device *sdev)
> {
> ...
> blk_mq_freeze_queue(q);
> ...
> }
> ||
> V
> void blk_freeze_queue(struct request_queue *q)
> {
> /*
> * In the !blk_mq case we are only calling this to kill the
> * q_usage_counter, otherwise this increases the freeze depth
> * and waits for it to return to zero. For this reason there is
> * no blk_unfreeze_queue(), and blk_freeze_queue() is not
> * exported to drivers as the only user for unfreeze is blk_mq.
> */
> blk_freeze_queue_start(q);
> blk_mq_freeze_queue_wait(q); <--- sleeps on wait_event() here
> }
>
> Or in other words:
>
> [ 34.785050][ C4] sysrq: Show Blocked State
> [ 34.785132][ C4] task:init state:D stack:9680 pid:1 ppid:0 flags:0x04000008
> [ 34.785301][ C4] Call trace:
> [ 34.785360][ C4] __switch_to+0x180/0x308
> [ 34.785452][ C4] __schedule+0x61c/0x9f0
> [ 34.785530][ C4] schedule+0x84/0xf4
> [ 34.785602][ C4] blk_mq_freeze_queue_wait+0x78/0xbc
> [ 34.785707][ C4] blk_freeze_queue+0x74/0x8c
> [ 34.785850][ C4] blk_mq_freeze_queue+0x18/0x2c
> [ 34.786033][ C4] scsi_device_quiesce+0x54/0xec
> [ 34.786216][ C4] ufshcd_wl_shutdown+0x98/0xc0
> [ 34.786396][ C4] device_shutdown+0x1a8/0x264
> [ 34.786572][ C4] kernel_restart+0x48/0x11c
> [ 34.786742][ C4] __arm64_sys_reboot+0x1a8/0x27c
> [ 34.786927][ C4] invoke_syscall+0x60/0x130
> [ 34.787096][ C4] el0_svc_common+0xbc/0x100
> [ 34.787266][ C4] do_el0_svc+0x38/0xc4
> [ 34.787420][ C4] el0_svc+0x34/0xc4
> [ 34.787563][ C4] el0t_64_sync_handler+0x8c/0xfc
> [ 34.787749][ C4] el0t_64_sync+0x1a4/0x1a8
>
>
> However, bisect pointed to 0c5ffc3d7b15 cpuidle, dt: Push RCU-idle into driver
>
> ->BAD 924aed1646bf cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}()
> ->BAD a01353cf1896 cpuidle: Fix ct_idle_*() usage
> ->BAD (doesn't compile, needs one missing header file) 0c5ffc3d7b15 cpuidle, dt: Push RCU-idle into driver
> ->good c3d42418dca5 cpuidle, OMAP4: Push RCU-idle into driver
>
> Looks like adding CPUIDLE_FLAG_RCU_IDLE flag to idle driver caused this behaviour.
> The minimal change that is required for this system to avoid the regression
> would be one liner that removes the flag (below).
>
> But if it is a real regression, then other idle drivers if used will likely
> cause this regression too withe same ufshcd driver. There is also a suspicion
> that CPUIDLE_FLAG_RCU_IDLE just revealed or uncovered some other problem.
>
> Any thoughts on this? Some missing __cpuidle or noinstr annotations?
As Peter has suggested in another reply, this might not be an *upstream*
regression, but rather an interaction with those out-of-tree patches. If you
can provide a pointer to the branch above that, it'll make it much easier to
figure out what's going on.
The code in mainline is unfortunately convoluted, but it doesn't look to be
obviously wrong.
FWIW, I've just tried v6.3-rc3 on Juno R2 in a few configurations (ACPI, DT,
with PROVE_LOCKING=y and DEBUG_LOCKDEP=y, with PSCI_CHECKER=y) and I'm not
seeing similar issues. So either I'm not tickling the same code path, or
something else is involved.
Thanks,
Mark.
On Mon, Mar 20, 2023 at 01:52:47PM +0000, Mark Rutland wrote:
> On Tue, Mar 14, 2023 at 11:00:04PM +0000, Alexey Klimov wrote:
> > #regzbot introduced: 0c5ffc3d7b15
> > #regzbot title: CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and slow-stuck reboots
> >
> > The upstream changes are being merged into android-mainline repo and at some
> > point we started to observe kernel panics on reboot or long reboot times.
>
> Where can I find the android-mainline repo, and which specific branch/commit
> from that repo is being merged in?
I assume that was the android-mainline branch in:
https://android.googlesource.com/kernel/common/
... and I had a go with commit:
8338670fd5bdf8d7 ("Merge "Merge 36289a03bcd3 ("Merge tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6") into android-mainline" into android-mainline")
... as that was the commit immediately before your local revert:
a32cec8e3f2253bc ("ANDROID: Revert "cpuidle, dt: Push RCU-idle into driver")
Testing on Juno R2 with defconfig + PROVE_LOCKING=y + DEBUG_LOCKDEP=y, I cannot
reproduce the reboot issue; everything seems to work just fine.
Can you say which config you're using?
Just to check: are you using a pristine version of that tree, or do you have
any vendor hooks present? I note that there are special hooks added to the
cpuidle and PSCI code, and I can imagine those might expect the old behaviour
w.r.t. RCU and idle, so ruling those out would help.
Thanks,
Mark.
>
> > The merge commit is 1f2d9ffc7a5f Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.
> > Before, the reboot usually took significantly less than 15 seconds and after merge the reboot time fall in the range of 60-100 seconds.
> > At some point watchdog-like functionality or softdog start to worry about the system stuck somewhere nd panic the system.
> >
> > The delay is found to be in device's ->shutdown() methods called from kernel_restart():
> > void kernel_restart_prepare(char *cmd)
> > {
> > blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
> > system_state = SYSTEM_RESTART;
> > usermodehelper_disable();
> > device_shutdown(); <---- here
> > }
> >
> > The driver in question is ufshcd and its ufshcd_wl_shutdown() shutdown method. It often blocks on scsi_device_quiesce() and upon manual checking it seems that it sleeps on blk_mq_freeze_queue_wait()/wait_event() in blk_freeze_queue():
> >
> > scsi_device_quiesce(struct scsi_device *sdev)
> > {
> > ...
> > blk_mq_freeze_queue(q);
> > ...
> > }
> > ||
> > V
> > void blk_freeze_queue(struct request_queue *q)
> > {
> > /*
> > * In the !blk_mq case we are only calling this to kill the
> > * q_usage_counter, otherwise this increases the freeze depth
> > * and waits for it to return to zero. For this reason there is
> > * no blk_unfreeze_queue(), and blk_freeze_queue() is not
> > * exported to drivers as the only user for unfreeze is blk_mq.
> > */
> > blk_freeze_queue_start(q);
> > blk_mq_freeze_queue_wait(q); <--- sleeps on wait_event() here
> > }
> >
> > Or in other words:
> >
> > [ 34.785050][ C4] sysrq: Show Blocked State
> > [ 34.785132][ C4] task:init state:D stack:9680 pid:1 ppid:0 flags:0x04000008
> > [ 34.785301][ C4] Call trace:
> > [ 34.785360][ C4] __switch_to+0x180/0x308
> > [ 34.785452][ C4] __schedule+0x61c/0x9f0
> > [ 34.785530][ C4] schedule+0x84/0xf4
> > [ 34.785602][ C4] blk_mq_freeze_queue_wait+0x78/0xbc
> > [ 34.785707][ C4] blk_freeze_queue+0x74/0x8c
> > [ 34.785850][ C4] blk_mq_freeze_queue+0x18/0x2c
> > [ 34.786033][ C4] scsi_device_quiesce+0x54/0xec
> > [ 34.786216][ C4] ufshcd_wl_shutdown+0x98/0xc0
> > [ 34.786396][ C4] device_shutdown+0x1a8/0x264
> > [ 34.786572][ C4] kernel_restart+0x48/0x11c
> > [ 34.786742][ C4] __arm64_sys_reboot+0x1a8/0x27c
> > [ 34.786927][ C4] invoke_syscall+0x60/0x130
> > [ 34.787096][ C4] el0_svc_common+0xbc/0x100
> > [ 34.787266][ C4] do_el0_svc+0x38/0xc4
> > [ 34.787420][ C4] el0_svc+0x34/0xc4
> > [ 34.787563][ C4] el0t_64_sync_handler+0x8c/0xfc
> > [ 34.787749][ C4] el0t_64_sync+0x1a4/0x1a8
> >
> >
> > However, bisect pointed to 0c5ffc3d7b15 cpuidle, dt: Push RCU-idle into driver
> >
> > ->BAD 924aed1646bf cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}()
> > ->BAD a01353cf1896 cpuidle: Fix ct_idle_*() usage
> > ->BAD (doesn't compile, needs one missing header file) 0c5ffc3d7b15 cpuidle, dt: Push RCU-idle into driver
> > ->good c3d42418dca5 cpuidle, OMAP4: Push RCU-idle into driver
> >
> > Looks like adding CPUIDLE_FLAG_RCU_IDLE flag to idle driver caused this behaviour.
> > The minimal change that is required for this system to avoid the regression
> > would be one liner that removes the flag (below).
> >
> > But if it is a real regression, then other idle drivers if used will likely
> > cause this regression too withe same ufshcd driver. There is also a suspicion
> > that CPUIDLE_FLAG_RCU_IDLE just revealed or uncovered some other problem.
> >
> > Any thoughts on this? Some missing __cpuidle or noinstr annotations?
>
> As Peter has suggested in another reply, this might not be an *upstream*
> regression, but rather an interaction with those out-of-tree patches. If you
> can provide a pointer to the branch above that, it'll make it much easier to
> figure out what's going on.
>
> The code in mainline is unfortunately convoluted, but it doesn't look to be
> obviously wrong.
>
> FWIW, I've just tried v6.3-rc3 on Juno R2 in a few configurations (ACPI, DT,
> with PROVE_LOCKING=y and DEBUG_LOCKDEP=y, with PSCI_CHECKER=y) and I'm not
> seeing similar issues. So either I'm not tickling the same code path, or
> something else is involved.
>
> Thanks,
> Mark.
[TLDR: This mail in primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]
On 20.03.23 17:04, Mark Rutland wrote:
> On Mon, Mar 20, 2023 at 01:52:47PM +0000, Mark Rutland wrote:
>> On Tue, Mar 14, 2023 at 11:00:04PM +0000, Alexey Klimov wrote:
>>> #regzbot introduced: 0c5ffc3d7b15
>>> #regzbot title: CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and slow-stuck reboots
>>>
>>> The upstream changes are being merged into android-mainline repo and at some
>>> point we started to observe kernel panics on reboot or long reboot times.
>>
>> Where can I find the android-mainline repo, and which specific branch/commit
>> from that repo is being merged in?
>
> I assume that was the android-mainline branch in:
>
> https://android.googlesource.com/kernel/common/
>
> ... and I had a go with commit:
>
> 8338670fd5bdf8d7 ("Merge "Merge 36289a03bcd3 ("Merge tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6") into android-mainline" into android-mainline")
>
> ... as that was the commit immediately before your local revert:
>
> a32cec8e3f2253bc ("ANDROID: Revert "cpuidle, dt: Push RCU-idle into driver")
>
> Testing on Juno R2 with defconfig + PROVE_LOCKING=y + DEBUG_LOCKDEP=y, I cannot
> reproduce the reboot issue; everything seems to work just fine.
>
> Can you say which config you're using?
>
> Just to check: are you using a pristine version of that tree, or do you have
> any vendor hooks present? I note that there are special hooks added to the
> cpuidle and PSCI code, and I can imagine those might expect the old behaviour
> w.r.t. RCU and idle, so ruling those out would help.
No reply to this since more than 10 days afaics, hence for now assuming
that this is not really a regression.
#regzbot inconclusive: another dev could not reproduce issue; reporter
MIA and might have used a patched tree
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
On Mon, 20 Mar 2023 10:36:14 +0100
Peter Zijlstra <[email protected]> wrote:
Sorry for the delay.
(adding Mark in c/c)
> On Mon, Mar 20, 2023 at 10:05:58AM +0100, Peter Zijlstra wrote:
> > On Fri, Mar 17, 2023 at 02:11:25AM +0000, Alexey Klimov wrote:
> > > On Wed, 15 Mar 2023 at 11:16, Peter Zijlstra
> > > <[email protected]> wrote:
> > > >
> > > >
> > > > (could you wrap your email please)
> > >
> > > Ouch. Sorry.
> > >
> > > > On Tue, Mar 14, 2023 at 11:00:04PM +0000, Alexey Klimov wrote:
> > > > > #regzbot introduced: 0c5ffc3d7b15 #regzbot title:
> > > > > CPUIDLE_FLAG_RCU_IDLE, blk_mq_freeze_queue_wait() and
> > > > > slow-stuck reboots
> > > > >
> > > > > The upstream changes are being merged into android-mainline
> > > > > repo and at some point we started to observe kernel panics on
> > > > > reboot or long reboot times.
> > > >
> > > > On what hardware? I find it somewhat hard to follow this DT
> > > > code :/
> > >
> > > Pixel 6.
> >
> > What actual cpuidle driver is that thing using? Is there any
> > out-of-tree code involved? Mark tells me anything arm64 should be
> > using PSCI, so let me to stare hard at that again.
Yep, it uses PSCI (psci_idle). But there is some out-of-tree code
involved.
> So specifically, your problem sounds like rcu_synchronize() is taking
> very much longer than it used to. Specifically combined with the patch
> that makes it 'go-away' this seems to indicate you lost a
> ct_cpuidle_enter() call, which is what ends up telling RCU the cpu is
> idle and no longer partakes in the whole grace period machinery. Not
> telling RCU this results in RCU waiting for an idle cpu to report back
> on it's RCU progress, but it being idle means it's not going to be
> doing that and things sorta wait around until RCU gets fed up and
> starts spraying IPIs to try and get things moving.
rcu_barrier().
Eventually managed to make sysrq work and blocked tasks point to
rcu_barrier():
[ 454.626264][ C4] task:init state:D stack:9728 pid:1 ppid:0
flags:0x04000008
[ 454.626463][ C4] Call trace:
[ 454.626570][ C4] __switch_to+0x180/0x308
[ 454.626733][ C4] __schedule+0x620/0x9e4
[ 454.626892][ C4] schedule+0x84/0xf4
[ 454.627039][ C4] schedule_timeout+0x48/0x1d8
[ 454.627215][ C4] wait_for_common+0xe0/0x1a8
[ 454.627388][ C4] wait_for_completion+0x20/0x30
[ 454.627571][ C4] rcu_barrier+0x3f8/0x674
[ 454.627734][ C4] netdev_run_todo+0x74/0x5d8
[ 454.627907][ C4] rtnl_unlock+0x18/0x2c
[ 454.628063][ C4] dhd_unregister_net+0x30/0x54 [bcmdhd4389]
[ 454.629084][ C4] wl_cfg80211_unregister_static_if+0x98/0x18c
[bcmdhd4389]
[ 454.630101][ C4] wl_cfg80211_detach+0xcc/0x344
[bcmdhd4389]
[ 454.631091][ C4] dhd_detach+0x4d0/0xc78 [bcmdhd4389]
[ 454.632064][ C4] dhdpcie_bus_release+0x28c/0x6bc [bcmdhd4389]
[ 454.633052][ C4] dhdpcie_pci_remove+0xb4/0x1a4 [bcmdhd4389]
[ 454.634036][ C4] pci_device_remove+0x60/0x118
[ 454.634152][ C4] device_release_driver_internal+0x19c/0x2f0
[ 454.634291][ C4] driver_detach+0x98/0xe0
[ 454.634395][ C4] bus_remove_driver+0x80/0xbc
[ 454.634504][ C4] driver_unregister+0x38/0x64
[ 454.634616][ C4] pci_unregister_driver+0x2c/0xb8
[ 454.634735][ C4] dhdpcie_bus_unregister+0xd8/0x1b0 [bcmdhd4389]
[ 454.635739][ C4] dhd_bus_unregister+0x24/0x50 [bcmdhd4389]
[ 454.636722][ C4] dhd_reboot_callback+0x19c/0x1d8 [bcmdhd4389]
[ 454.637709][ C4] blocking_notifier_call_chain+0x78/0xc8
[ 454.637836][ C4] kernel_power_off+0x2c/0xfc
[ 454.637946][ C4] __arm64_sys_reboot+0x270/0x27c
[ 454.638060][ C4] invoke_syscall+0x60/0x130
[ 454.638169][ C4] el0_svc_common+0xbc/0x100
[ 454.638278][ C4] do_el0_svc+0x38/0xb0
[ 454.638377][ C4] el0_svc+0x34/0xc4
[ 454.638464][ C4] el0t_64_sync_handler+0x8c/0xfc
[ 454.638573][ C4] el0t_64_sync+0x1a8/0x1ac
and
[ 13.799675][ C2] task:kworker/1:1 state:D stack:12784 pid:97
ppid:2 flags:0x00000008
[ 13.799879][ C2] Workqueue: events
slab_caches_to_rcu_destroy_workfn
[ 13.800112][ C2] Call trace:
[ 13.800231][ C2] __switch_to+0x180/0x308
[ 13.800394][ C2] __schedule+0x620/0x9e4
[ 13.800553][ C2] schedule+0x84/0xf4
[ 13.800700][ C2] schedule_timeout+0x48/0x1d8
[ 13.800876][ C2] wait_for_common+0xe0/0x1a8
[ 13.801049][ C2] wait_for_completion+0x20/0x30
[ 13.801232][ C2] rcu_barrier+0x3f8/0x674
[ 13.801394][ C2] slab_caches_to_rcu_destroy_workfn+0x90/0x108
[ 13.801627][ C2] process_one_work+0x1f0/0x454
[ 13.801807][ C2] worker_thread+0x330/0x43c
[ 13.801977][ C2] kthread+0x10c/0x1b8
[ 13.802127][ C2] ret_from_fork+0x10/0x20
> Now... if a driver sets CPUIDLE_FLAG_RCU_IDLE it promises to call
> ct_cpuidle_{enter,exit}() itself. Hence for any driver that does *NOT*
> set that flag, cpuidle_enter_state() calls these functions.
>
> Now, fo PSCI, the DT handler is psci_enter_idle_state(), which uses
> CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(), which per the other email, means
> that it's low_level_idle_enter := psci_cpu_suspend_enter(), *will*
> call ct_cpuidle_{enter,exit}().
>
> Then if we look at psci_cpu_suspend_enter(), it has two cases
> depending on psci_power_state_loses_context(). If it doesn't lose
> context it does ct_cpuidle_enter() right there and proceeds to call
> psci_ops.cpu_suspend() -- whatever that does.
>
> If it does lose state, then it depends on CONFIG_ARM64, on arm64 we do
> not call ct_cpuidle_{enter,exit}() but proceed into cpu_suspend().
>
> We can find that function in arch/arm64/kernel/suspend.c, and if you
> look at it, you'll note it does in fact call ct_cpuidle_{enter,exit}()
> as per promises made.
>
> So AFAICT every path into idle will pass through ct_cpuidle_enter().
I tried to understand where ct_cpuidle_{enter,exit}() is missing and
narrow it down until found out that the problem is cpu pm notifier
returning NOTIFY_BAD during reboot and calls to ct_cpuidle_{enter,exit}
are correct. Thanks for checking.
In __CPU_PM_CPU_IDLE_ENTER() macro we have:
if (!is_retention)
__ret = cpu_pm_enter();
if (!__ret) {
if (!is_rcu)
ct_cpuidle_enter();
__ret = low_level_idle_enter(state);
if (!is_rcu)
ct_cpuidle_exit();
if (!is_retention)
cpu_pm_exit();
}
While rebooting cpu_pm_enter() returns an error since one of the
notifiers starts to return NOTIFY_BAD. Looks like kernel starts to
"spin" trying to put CPUs in idle state and always gets an error from
cpu_pm_enter() never reaching the section covered by
ct_cpuidle_{enter,exit}. Before the mentioned commit all of this were
under ct_cpuidle_{enter,exit}. Hence, yes, in some sense we lost a
ct_cpuidle_enter+exit() call. As long as I manually cover this small
piece with ct_cpuidle_{enter,exit} for testing, the reboot times return
back to as it was before the commit. Currently I removed returning
NOTIFY_BAD from notifier when rebooting.
It seems RCU waits for these CPUs to attend RCU machinery.
Also it probably does not count as an upstream regression since it
involves out of tree code but in theory this could happen with any
other notifier. Probably WARN_ONCE() will be too bad too.
The cpu pm notifier is this one, line 972,
exynos_cpu_pm_notify_callback():
https://android.googlesource.com/kernel/gs/+/refs/tags/android-t-qpr3-beta-2_r0.5/drivers/soc/google/exynos-cpupm.c#972
Thank you for checking all this!
Best regards,
Alexey