2018-09-04 19:04:57

by Niklas Cassel

Subject: stop_machine() soft lockup

Hello Peter,

I'm seeing some lockups when booting linux-next on a db820c arm64 board.
I've tried to analyze, but I'm currently stuck.


[ 30.974764] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 31.010121] rcu: 2-...0: (1 GPs behind) idle=3f6/1/0x4000000000000002 softirq=53/53 fqs=3099
[ 31.035352] rcu: (t=6518 jiffies g=-1119 q=1279)
[ 31.054464] Task dump for CPU 2:
[ 31.065349] migration/2 R running task 0 22 2 0x0000002a
[ 31.091937] Call trace:
[ 31.103357] dump_backtrace+0x0/0x200
[ 31.114780] show_stack+0x24/0x30
[ 31.126183] sched_show_task+0x20c/0x2d8
[ 31.137604] dump_cpu_task+0x48/0x58
[ 31.149111] rcu_dump_cpu_stacks+0xa0/0xe0
[ 31.160620] rcu_check_callbacks+0x85c/0xb60
[ 31.172151] update_process_times+0x34/0x60
[ 31.183575] tick_periodic+0x58/0x110
[ 31.195048] tick_handle_periodic+0x94/0xc8
[ 31.206573] arch_timer_handler_virt+0x38/0x58
[ 31.218058] handle_percpu_devid_irq+0xe4/0x458
[ 31.229544] generic_handle_irq+0x34/0x50
[ 31.240952] __handle_domain_irq+0x8c/0xf8
[ 31.252375] gic_handle_irq+0x84/0x180
[ 31.263776] el1_irq+0xec/0x198
[ 31.275117] multi_cpu_stop+0x198/0x220
[ 31.286574] cpu_stopper_thread+0xfc/0x188
[ 31.298050] smpboot_thread_fn+0x1bc/0x2c0
[ 31.309562] kthread+0x134/0x138
[ 31.320888] ret_from_fork+0x10/0x1c
[ 31.332075] Task dump for CPU 3:
[ 31.342925] migration/3 R running task 0 27 2 0x0000002a
[ 31.369347] Call trace:
[ 31.380732] __switch_to+0xa8/0xf0
[ 31.391599] (null)
[ 32.813449] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[ 32.853356] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[ 32.996597] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [migration/3:27]
[ 33.008710] Modules linked in:
[ 33.026913] irq event stamp: 702
[ 33.038360] hardirqs last enabled at (701): [<ffff0000081bb60c>] multi_cpu_stop+0x194/0x220
[ 33.050062] hardirqs last disabled at (702): [<ffff000008083e2c>] el1_irq+0xac/0x198
[ 33.061600] softirqs last enabled at (396): [<ffff000008082164>] __do_softirq+0x47c/0x574
[ 33.073284] softirqs last disabled at (389): [<ffff0000080ca41c>] irq_exit+0x134/0x148
[ 33.084852] CPU: 3 PID: 27 Comm: migration/3 Tainted: G W 4.19.0-rc2-next-20180904-00001-g32563c06c8db-dirty #42
[ 33.095929] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[ 33.107086] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 33.118542] pc : multi_cpu_stop+0x198/0x220
[ 33.129799] lr : multi_cpu_stop+0x194/0x220
[ 33.140522] sp : ffff00000ae5bd50
[ 33.151104] x29: ffff00000ae5bd50 x28: 0000000000000000
[ 33.176154] x27: ffff00000803bc80 x26: 0000000000000000
[ 33.201057] x25: 0000000000000000 x24: 0000000000000000
[ 33.226014] x23: ffff0000090b4000 x22: 0000000000000003
[ 33.250973] x21: ffff00000803bca4 x20: ffff00000803bc80
[ 33.275806] x19: 0000000000000004 x18: ffffffffffffffff
[ 33.300799] x17: 0000000000000727 x16: ffff000009fcfd00
[ 33.325856] x15: ffff000009cbe1c8 x14: 31203a64656c6261
[ 33.350852] x13: 000000006f1527a4 x12: ffff8000d958e3b0
[ 33.375706] x11: 0000000000000000 x10: 0000000000000428
[ 33.400629] x9 : ffff8000d9591000 x8 : 000000000000001c
[ 33.425566] x7 : ffff8000d9c08400 x6 : 0000000000000444
[ 33.450410] x5 : 0000000000007045 x4 : 0000000000000000
[ 33.475399] x3 : 0000000000000000 x2 : ffff00000aabb000
[ 33.500394] x1 : 0000000000000001 x0 : ffff8000d958db00
[ 33.525460] Call trace:
[ 33.536707] multi_cpu_stop+0x198/0x220
[ 33.548035] cpu_stopper_thread+0xfc/0x188
[ 33.559444] smpboot_thread_fn+0x1bc/0x2c0
[ 33.570610] kthread+0x134/0x138
[ 33.581944] ret_from_fork+0x10/0x1c
[ 36.813418] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[ 36.853359] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[ 40.813415] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[ 40.853352] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[ 44.813415] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[ 44.853342] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[ 48.813418] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[ 48.853344] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[ 52.813407] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[ 52.853341] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[ 56.813414] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[ 56.853345] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[ 56.921016] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [migration/2:22]
[ 56.933363] Modules linked in:
[ 56.951546] irq event stamp: 604
[ 56.963173] hardirqs last enabled at (603): [<ffff0000081bb60c>] multi_cpu_stop+0x194/0x220
[ 56.974744] hardirqs last disabled at (604): [<ffff000008083e2c>] el1_irq+0xac/0x198
[ 56.986463] softirqs last enabled at (450): [<ffff000008082164>] __do_softirq+0x47c/0x574
[ 56.998266] softirqs last disabled at (443): [<ffff0000080ca41c>] irq_exit+0x134/0x148
[ 57.009651] CPU: 2 PID: 22 Comm: migration/2 Tainted: G W L 4.19.0-rc2-next-20180904-00001-g32563c06c8db-dirty #42
[ 57.021070] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[ 57.032279] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 57.043733] pc : multi_cpu_stop+0x198/0x220
[ 57.055152] lr : multi_cpu_stop+0x194/0x220
[ 57.065929] sp : ffff00000ae33d50
[ 57.076598] x29: ffff00000ae33d50 x28: 0000000000000000
[ 57.101700] x27: ffff00000803bc80 x26: 0000000000000000
[ 57.126947] x25: 0000000000000000 x24: 0000000000000000
[ 57.152160] x23: ffff0000090b4000 x22: 0000000000000002
[ 57.177351] x21: ffff00000803bca4 x20: ffff00000803bc80
[ 57.202378] x19: 0000000000000004 x18: ffffffffffffffff
[ 57.227500] x17: 0000000000000708 x16: ffff000009fcfbe0
[ 57.252639] x15: ffff000009cbe1c8 x14: 31203a64656c6261
[ 57.277712] x13: 000000006f1527a4 x12: ffff8000d9fc63b0
[ 57.302892] x11: 0000000000000000 x10: 0000000000000268
[ 57.328157] x9 : ffff8000d9590400 x8 : 000000000000001c
[ 57.353240] x7 : ffff8000d9c08400 x6 : 0000000000000284
[ 57.378364] x5 : 0000000000004f5d x4 : 0000000000000000
[ 57.403477] x3 : 0000000000000000 x2 : ffff00000aabb000
[ 57.428601] x1 : 0000000000000001 x0 : ffff8000d9fc5b00
[ 57.453865] Call trace:
[ 57.465264] multi_cpu_stop+0x198/0x220
[ 57.476690] cpu_stopper_thread+0xfc/0x188
[ 57.487921] smpboot_thread_fn+0x1bc/0x2c0
[ 57.499416] kthread+0x134/0x138
[ 57.510752] ret_from_fork+0x10/0x1c
[ 60.813418] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[ 60.853344] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[ 63.137327] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 60s!
[ 63.138392] Showing busy workqueues and worker pools:
[ 63.145567] workqueue mm_percpu_wq: flags=0x8
[ 63.150564] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 63.154568] pending: vmstat_update
[ 64.813418] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 0
[ 64.853348] cpu_stopper_thread: calling softlockup_fn+0x0/0x78 for cpu: 1
[ 67.233334] INFO: task swapper/0:1 blocked for more than 30 seconds.
[ 67.233610] Tainted: G W L 4.19.0-rc2-next-20180904-00001-g32563c06c8db-dirty #42
[ 67.239301] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 67.248525] swapper/0 D 0 1 0 0x00000028
[ 67.256236] Call trace:
[ 67.261631] __switch_to+0xa8/0xf0
[ 67.263546] __schedule+0x34c/0xc90
[ 67.267279] schedule+0x34/0x98
[ 67.270662] schedule_timeout+0x250/0x540
[ 67.273785] wait_for_common+0xe0/0x1a0
[ 67.277955] wait_for_completion+0x28/0x38
[ 67.281602] __stop_cpus+0xd4/0xf8
[ 67.285766] stop_cpus+0x70/0xa8
[ 67.289145] stop_machine_cpuslocked+0x124/0x130
[ 67.292276] stop_machine+0x54/0x70
[ 67.297299] timekeeping_notify+0x44/0x70
[ 67.300088] __clocksource_select+0xa8/0x1d8
[ 67.304526] clocksource_done_booting+0x4c/0x64
[ 67.309124] do_one_initcall+0x94/0x3f8
[ 67.312847] kernel_init_freeable+0x47c/0x528
[ 67.316671] kernel_init+0x18/0x110
[ 67.321603] ret_from_fork+0x10/0x1c
[ 67.324490]
[ 67.324490] Showing all locks held in the system:
[ 67.328566] 3 locks held by swapper/0/1:
[ 67.334709] #0: (____ptrval____) (clocksource_mutex){+.+.}, at: clocksource_done_booting+0x2c/0x64
[ 67.338641] #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x28/0x70
[ 67.347569] #2: (____ptrval____) (stop_cpus_mutex){+.+.}, at: stop_cpus+0x60/0xa8
[ 67.356260] 1 lock held by khungtaskd/543:
[ 67.363757] #0: (____ptrval____) (rcu_read_lock){....}, at: debug_show_all_locks+0x14/0x1b0
[ 67.367628]
[ 67.376368] =============================================

These lines:
[ 33.038360] hardirqs last enabled at (701): [<ffff0000081bb60c>] multi_cpu_stop+0x194/0x220
[ 33.050062] hardirqs last disabled at (702): [<ffff000008083e2c>] el1_irq+0xac/0x198
seem to indicate that an irq comes directly after local_irq_restore() in multi_cpu_stop().


I've added some debug prints. In the first stop_machine() call, everything
completes as it should. In the second call, it looks like an irq comes
directly after local_irq_restore() in multi_cpu_stop(), and after that the
stopper work for cpu2 and cpu3 is never marked as completed, so
wait_for_completion() never returns.


[ 0.418197] stop_machine: calling stop_machine_cpuslocked with fn: __enable_cpu_capability+0x0/0x38
[ 0.420051] stop_machine_cpuslocked: calling stop_cpus
[ 0.429017] >>>>>>>>>>>>>>>>>>>>>>>> stop_cpus:450 taking stop_cpus_mutex
[ 0.434205] __stop_cpus: queueing work
[ 0.441027] queueing multi_cpu_stop+0x0/0x220 for cpu: 0
[ 0.444641] queueing multi_cpu_stop+0x0/0x220 for cpu: 1
[ 0.450145] queueing multi_cpu_stop+0x0/0x220 for cpu: 2
[ 0.450188] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 1
[ 0.455438] queueing multi_cpu_stop+0x0/0x220 for cpu: 3
[ 0.458486] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 2
[ 0.459101] multi_cpu_stop: cpu: 2 before local_save_flags, in atomic: 1 irqs_disabled: 0
[ 0.459660] multi_cpu_stop: cpu: 2 after local_save_flags, in atomic: 1 irqs_disabled: 0
[ 0.460713] multi_cpu_stop: cpu: 1 before local_save_flags, in atomic: 1 irqs_disabled: 0
[ 0.467456] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 0
[ 0.469026] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 3
[ 0.469635] multi_cpu_stop: cpu: 3 before local_save_flags, in atomic: 1 irqs_disabled: 0
[ 0.470190] multi_cpu_stop: cpu: 3 after local_save_flags, in atomic: 1 irqs_disabled: 0
[ 0.472945] multi_cpu_stop: cpu: 1 after local_save_flags, in atomic: 1 irqs_disabled: 0
[ 0.479629] multi_cpu_stop: cpu: 0 before local_save_flags, in atomic: 1 irqs_disabled: 0
[ 0.542478] multi_cpu_stop: cpu: 0 after local_save_flags, in atomic: 1 irqs_disabled: 0
[ 0.555518] multi_cpu_stop: cpu: 1 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[ 0.555523] multi_cpu_stop: cpu: 0 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[ 0.555830] multi_cpu_stop: cpu: 2 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[ 0.556074] multi_cpu_stop: cpu: 3 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[ 0.556366] multi_cpu_stop: cpu: 2 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[ 0.556624] multi_cpu_stop: cpu: 3 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[ 0.557133] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 2
[ 0.558722] multi_cpu_stop: cpu: 1 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[ 0.559213] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 3
[ 0.567138] multi_cpu_stop: cpu: 0 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[ 0.575551] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 1
[ 0.583972] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 0
[ 0.651730] <<<<<<<<<<<<<<<<<<<<<< __stop_cpus:411 now waiting for completion
[ 0.660642] <<<<<<<<<<<<<<<<<<<<<< __stop_cpus:413 completion done
[ 0.667836] >>>>>>>>>>>>>>>>>>>>>>>> stop_cpus:454 returning stop_cpus_mutex


[ 5.016096] stop_machine: calling stop_machine_cpuslocked with fn: change_clocksource+0x0/0x118
[ 5.016188] stop_machine_cpuslocked: calling stop_cpus
[ 5.016271] >>>>>>>>>>>>>>>>>>>>>>>> stop_cpus:450 taking stop_cpus_mutex
[ 5.016365] __stop_cpus: queueing work
[ 5.016471] queueing multi_cpu_stop+0x0/0x220 for cpu: 0
[ 5.016699] queueing multi_cpu_stop+0x0/0x220 for cpu: 1
[ 5.016858] queueing multi_cpu_stop+0x0/0x220 for cpu: 2
[ 5.016901] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 0
[ 5.016984] multi_cpu_stop: cpu: 0 before local_save_flags, in atomic: 1 irqs_disabled: 0
[ 5.017004] queueing multi_cpu_stop+0x0/0x220 for cpu: 3
[ 5.017063] multi_cpu_stop: cpu: 0 after local_save_flags, in atomic: 1 irqs_disabled: 0
[ 5.017270] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 1
[ 5.017351] multi_cpu_stop: cpu: 1 before local_save_flags, in atomic: 1 irqs_disabled: 0
[ 5.017429] multi_cpu_stop: cpu: 1 after local_save_flags, in atomic: 1 irqs_disabled: 0
[ 5.046491] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 2
[ 5.079044] multi_cpu_stop: cpu: 2 before local_save_flags, in atomic: 1 irqs_disabled: 0
[ 5.097884] multi_cpu_stop: cpu: 2 after local_save_flags, in atomic: 1 irqs_disabled: 0
[ 5.518726] cpu_stopper_thread: calling multi_cpu_stop+0x0/0x220 for cpu: 3
[ 5.571321] multi_cpu_stop: cpu: 3 before local_save_flags, in atomic: 1 irqs_disabled: 0
[ 5.624144] multi_cpu_stop: cpu: 3 after local_save_flags, in atomic: 1 irqs_disabled: 0
[ 5.680879] multi_cpu_stop: cpu: 1 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[ 5.680891] multi_cpu_stop: cpu: 0 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[ 5.681227] multi_cpu_stop: cpu: 0 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[ 5.682256] multi_cpu_stop: cpu: 3 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[ 5.682986] multi_cpu_stop: cpu: 2 before local_irq_restore, in atomic: 1 irqs_disabled: 128
[ 5.688761] multi_cpu_stop: cpu: 1 after local_irq_restore, in atomic: 1 irqs_disabled: 0
[ 5.697559] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 0
[ 5.705384] cpu_stopper_thread: call to multi_cpu_stop+0x0/0x220 complete, setting done for cpu: 1
[ 5.705896] <<<<<<<<<<<<<<<<<<<<<< __stop_cpus:411 now waiting for completion



[ 33.038360] hardirqs last enabled at (701): [<ffff0000081bb60c>] multi_cpu_stop+0x194/0x220
translates to:
arch_local_irq_restore at /home/nks/linux-next/./arch/arm64/include/asm/irqflags.h:88 (discriminator 2)
(inlined by) multi_cpu_stop at /home/nks/linux-next/kernel/stop_machine.c:238 (discriminator 2)

[ 33.118542] pc : multi_cpu_stop+0x198/0x220
translates to:
get_current at /home/nks/linux-next/./arch/arm64/include/asm/current.h:19
(inlined by) preempt_count at /home/nks/linux-next/./include/asm-generic/preempt.h:11
(inlined by) multi_cpu_stop at /home/nks/linux-next/kernel/stop_machine.c:239
which equals:
asm ("mrs %0, sp_el0" : "=r" (sp_el0));
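
(For reference, these symbol+offset translations can be reproduced with the kernel's scripts/faddr2line helper, assuming a vmlinux built with CONFIG_DEBUG_INFO:)

```shell
# decode a symbol+offset/size from the trace into file:line,
# including inlined frames (run from the kernel build directory)
./scripts/faddr2line vmlinux multi_cpu_stop+0x198/0x220
```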


If we got an irq directly after enabling irqs, I would expect execution to
continue once the irq handler was done. Any thoughts?


Kind regards,
Niklas


2018-09-05 08:44:19

by Peter Zijlstra

Subject: Re: stop_machine() soft lockup

On Tue, Sep 04, 2018 at 09:03:22PM +0200, Niklas Cassel wrote:
> Hello Peter,
>
> I'm seeing some lockups when booting linux-next on a db820c arm64 board.
> I've tried to analyze, but I'm currently stuck.

Please see (should be in your Inbox too):

https://lkml.kernel.org/r/[email protected]

2018-09-05 11:49:26

by Niklas Cassel

Subject: Re: stop_machine() soft lockup

On Wed, Sep 05, 2018 at 10:42:41AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 04, 2018 at 09:03:22PM +0200, Niklas Cassel wrote:
> > Hello Peter,
> >
> > I'm seeing some lockups when booting linux-next on a db820c arm64 board.
> > I've tried to analyze, but I'm currently stuck.
>
> Please see (should be in your Inbox too):
>
> https://lkml.kernel.org/r/[email protected]

I'm sorry if I misled you by replying to your other mail thread;
both of them have timekeeping_notify() in the call trace,
but my problem has this call trace:

[ 128.747853] wait_for_common+0xe0/0x1a0
[ 128.752023] wait_for_completion+0x28/0x38
[ 128.755677] __stop_cpus+0xd4/0xf8
[ 128.759837] stop_cpus+0x70/0xa8
[ 128.762958] stop_machine_cpuslocked+0x124/0x130
[ 128.766345] stop_machine+0x54/0x70
[ 128.771373] timekeeping_notify+0x44/0x70
[ 128.774158] __clocksource_select+0xa8/0x1d8
[ 128.778605] clocksource_done_booting+0x4c/0x64
[ 128.782931] do_one_initcall+0x94/0x3f8
[ 128.786921] kernel_init_freeable+0x47c/0x528
[ 128.790742] kernel_init+0x18/0x110
[ 128.795673] ret_from_fork+0x10/0x1c


while your other mail thread has this call trace:

* stop_machine()
* timekeeping_notify()
* __clocksource_select()
* clocksource_select()
* clocksource_watchdog_work()


So my problem is not related to the watchdog. I tried your revert anyway,
but unfortunately the problem persists.


In my case, what appears to happen is that at the end of multi_cpu_stop(),
local_irq_restore() is called, and an irq comes directly after that;
then, after 22 seconds, the soft lockup detector kicks in (so I appear to still
be getting timer ticks). The PC that the soft lockup detector prints is the line
after local_irq_restore(). Without any debug prints, that is simply the closing
brace of the function.

I'm booting with the kernel command line: "earlycon ftrace=irqsoff".
I can't seem to reproduce the problem without ftrace=irqsoff;
I'm not sure if that is because of timing or because ftrace itself is involved.

I managed to get another call trace with ftrace included;
unfortunately both CPUs appear to be printing at the same time,
so their output is interleaved.

[ 32.703910] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:22]
[ 32.728589] Modules linked in:
[ 32.747176] irq event stamp: 154
[ 32.758851] hardirqs last enabled at (153): [<ffff0000081bb59c>] multi_cpu_stop+0xfc/0x168
[ 32.770736] hardirqs last disabled at (154): [<ffff000008083e2c>] el1_irq+0xac/0x198
[ 32.774911] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:0H:30]
[ 32.779946] Modules linked in:
[ 32.786425] softirqs last enabled at (114): [<ffff000008082164>] __do_softirq+0x47c/0x574
[ 32.795450] irq event stamp: 62
[ 32.805549] softirqs last disabled at (95): [<ffff0000080ca41c>] irq_exit+0x134/0x148
[ 32.810450] hardirqs last enabled at (61): [<ffff00000908efe8>] _raw_spin_unlock_irq+0x38/0x78
[ 32.816075] hardirqs last disabled at (62): [<ffff000008083e2c>] el1_irq+0xac/0x198
[ 32.821812] softirqs last enabled at (0): [<ffff0000080bdb68>] copy_process.isra.5.part.6+0x338/0x1ac$
[ 32.829394] CPU: 2 PID: 22 Comm: migration/2 Tainted: G W 4.19.0-rc2-next-20180904-00002-gd2e5ea917660 #47
[ 32.836772] softirqs last disabled at (0): [<0000000000000000>] (null)
[ 32.846193] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[ 32.857546] CPU: 3 PID: 30 Comm: kworker/3:0H Tainted: G W 4.19.0-rc2-next-20180904-00002-gd2e5ea917660 #47
[ 32.864532] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 32.870789] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[ 32.881909] pc : multi_cpu_stop+0x100/0x168
[ 32.887583] lr : multi_cpu_stop+0xfc/0x168
[ 32.892723] sp : ffff00000ae33d70
[ 32.897717] x29: ffff00000ae33d70 x28: 0000000000000000
[ 32.906981] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 32.916202] x27: 0000000000000002 x26: ffff00000803bc80
[ 32.923256] pc : _raw_spin_unlock_irq+0x3c/0x78
[ 32.933817] x25: ffff0000081bb4a0
[ 32.937340] lr : _raw_spin_unlock_irq+0x38/0x78
[ 32.942465] sp : ffff00000ae73c60
[ 32.947455] x29: ffff00000ae73c60 x28: ffff00000978b000
[ 32.953060] x24: 0000000000000000
[ 32.967484] x23: 0000000000000000
[ 32.971265] x27: ffff8000d958db00
[ 32.976343] x22: 0000000000000000
[ 32.982508] x26: 0000000000000001
[ 32.996862] x21: ffff00000803bca4
[ 33.000630] x25: ffff8000d9e8a328
[ 33.005754] x20: ffff00000803bc80
[ 33.011888] x24: ffff000009087f88
[ 33.026057] x19: 0000000000000004
[ 33.029982] x23: ffff8000d971ad80
[ 33.035127] x18: 00000000000008bd
[ 33.041263] x22: ffff000009cbf000
[ 33.055312] x17: 00000000000008bc
[ 33.059310] x21: ffff8000d958db00
[ 33.064303] x16: ffff000009fd1360
[ 33.070605] x20: ffff0000080fc20c
[ 33.084645] x15: ffff000009f2a300
[ 33.088605] x19: ffff8000da7f4d40
[ 33.093572] x14: 00000000000017fc
[ 33.099943] x18: ffffffffffffffff
[ 33.114000] x13: 00000000c04846ac
[ 33.117948] x17: 0000000000000693
[ 33.122855] x12: ffff8000d9fc63b0
[ 33.129195] x16: ffff000009fcf7e0
[ 33.143370] x11: 0000000000000000
[ 33.147294] x15: ffff000009cbe1c8
[ 33.152152] x10: 0000000000000348
[ 33.158535] x14: 0000000000002fff
[ 33.172812] x9 : ffff8000d9590400
[ 33.176620] x13: 0000000016163b60
[ 33.181506] x8 : 000000000000001c
[ 33.187866] x12: ffff8000d971b630
[ 33.202251] x7 : ffff8000d9c08400
[ 33.205880] x11: 0000000000000000
[ 33.210939] x6 : 0000000000000364
[ 33.217236] x10: 0000000000000a48
[ 33.231721] x5 : 0000000000005356
[ 33.235281] x9 : ffff8000d9590c00
[ 33.240397] x4 : 0000000000000000
[ 33.246592] x8 : 000000000000001c
[ 33.261065] x3 : 0000000000000000
[ 33.264749] x7 : ffff8000d9c08400
[ 33.269874] x2 : ffff00000aabb000
[ 33.276019] x6 : 0000000000000a64
[ 33.290389] x1 : 0000000000000001
[ 33.294192] x5 : 000000000000cf1a
[ 33.299235] x0 : ffff8000d9fc5b00
[ 33.305452] x4 : 0000000000000000
[ 33.319989] Call trace:
[ 33.324715] x3 : 0000000000000000 x2 : ffff00000aabb000
[ 33.331717] multi_cpu_stop+0x100/0x168
[ 33.341973] x1 : 0000000000000001
[ 33.345891] cpu_stopper_thread+0xa8/0x118
[ 33.351777] smpboot_thread_fn+0x1bc/0x2c0
[ 33.357845] x0 : ffff8000d971ad80
[ 33.362176] kthread+0x134/0x138
[ 33.371281] Call trace:
[ 33.377420] ret_from_fork+0x10/0x1c
[ 33.382212] _raw_spin_unlock_irq+0x3c/0x78
[ 33.387685] finish_task_switch+0xa4/0x200
[ 33.396875] __schedule+0x350/0xc90
[ 33.408272] preempt_schedule_notrace+0x5c/0x130
[ 33.419484] ftrace_ops_no_ops+0xf4/0x180
[ 33.430795] ftrace_graph_call+0x0/0xc
[ 33.442095] preempt_count_add+0x1c/0x130
[ 33.453259] schedule+0x2c/0x98
[ 33.464586] worker_thread+0xdc/0x478
[ 33.475886] kthread+0x134/0x138
[ 33.487099] ret_from_fork+0x10/0x1c
[ 34.464807] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 34.487238] rcu: 2-...0: (2757 ticks this GP) idle=20e/1/0x4000000000000002 softirq=120/120 fqs=3165
[ 34.509364] rcu: (t=6514 jiffies g=-1095 q=1250)
[ 34.528505] Task dump for CPU 2:
[ 34.539317] migration/2 R running task 0 22 2 0x0000002a
[ 34.565716] Call trace:
[ 34.577387] dump_backtrace+0x0/0x200
[ 34.588782] show_stack+0x24/0x30
[ 34.600098] sched_show_task+0x20c/0x2d8
[ 34.611726] dump_cpu_task+0x48/0x58
[ 34.623109] rcu_dump_cpu_stacks+0xa0/0xe0
[ 34.634637] rcu_check_callbacks+0x85c/0xb60
[ 34.646324] update_process_times+0x34/0x60
[ 34.657569] tick_periodic+0x58/0x110
[ 34.669169] tick_handle_periodic+0x94/0xc8
[ 34.680719] arch_timer_handler_virt+0x38/0x58
[ 34.692090] handle_percpu_devid_irq+0xe4/0x458
[ 34.703783] generic_handle_irq+0x34/0x50
[ 34.715121] __handle_domain_irq+0x8c/0xf8
[ 34.726529] gic_handle_irq+0x84/0x180
[ 34.738131] el1_irq+0xec/0x198
[ 34.749285] multi_cpu_stop+0x100/0x168
[ 34.760840] cpu_stopper_thread+0xa8/0x118
[ 34.772384] smpboot_thread_fn+0x1bc/0x2c0
[ 34.783587] kthread+0x134/0x138
[ 34.795188] ret_from_fork+0x10/0x1c



Could perhaps this line:
[ 32.821812] softirqs last enabled at (0): [<ffff0000080bdb68>] copy_process.isra.5.part.6+0x338/0x1ac
be involved in why execution never appears to continue after the irq?


Kind regards,
Niklas

2018-09-05 13:17:50

by Peter Zijlstra

Subject: Re: stop_machine() soft lockup

On Wed, Sep 05, 2018 at 01:47:49PM +0200, Niklas Cassel wrote:
> On Wed, Sep 05, 2018 at 10:42:41AM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 04, 2018 at 09:03:22PM +0200, Niklas Cassel wrote:
> > > Hello Peter,
> > >
> > > I'm seeing some lockups when booting linux-next on a db820c arm64 board.
> > > I've tried to analyze, but I'm currently stuck.
> >
> > Please see (should be in your Inbox too):
> >
> > https://lkml.kernel.org/r/[email protected]
>
> I'm sorry if I misled you by replying to your other mail thread;
> both of them have timekeeping_notify() in the call trace,
> but my problem has this call trace:
>
> [ 128.747853] wait_for_common+0xe0/0x1a0
> [ 128.752023] wait_for_completion+0x28/0x38
> [ 128.755677] __stop_cpus+0xd4/0xf8
> [ 128.759837] stop_cpus+0x70/0xa8
> [ 128.762958] stop_machine_cpuslocked+0x124/0x130
> [ 128.766345] stop_machine+0x54/0x70
> [ 128.771373] timekeeping_notify+0x44/0x70
> [ 128.774158] __clocksource_select+0xa8/0x1d8
> [ 128.778605] clocksource_done_booting+0x4c/0x64
> [ 128.782931] do_one_initcall+0x94/0x3f8
> [ 128.786921] kernel_init_freeable+0x47c/0x528
> [ 128.790742] kernel_init+0x18/0x110
> [ 128.795673] ret_from_fork+0x10/0x1c
>
>
> while your other mail thread has this call trace:
>
> * stop_machine()
> * timekeeping_notify()
> * __clocksource_select()
> * clocksource_select()
> * clocksource_watchdog_work()
>
>
> So my problem is not related to the watchdog. I tried your revert anyway,
> but unfortunately the problem persists.

Oh, right, missed that distinction. And this is new?

I'll try and have a look. Lockdep doesn't suggest anything?

2018-09-05 13:47:33

by Niklas Cassel

Subject: Re: stop_machine() soft lockup

On Wed, Sep 05, 2018 at 03:14:10PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 05, 2018 at 01:47:49PM +0200, Niklas Cassel wrote:
> > On Wed, Sep 05, 2018 at 10:42:41AM +0200, Peter Zijlstra wrote:
> > > On Tue, Sep 04, 2018 at 09:03:22PM +0200, Niklas Cassel wrote:
> > > > Hello Peter,
> > > >
> > > > I'm seeing some lockups when booting linux-next on a db820c arm64 board.
> > > > I've tried to analyze, but I'm currently stuck.
> > >
> > > Please see (should be in your Inbox too):
> > >
> > > https://lkml.kernel.org/r/[email protected]
> >
> > I'm sorry if I misled you by replying to your other mail thread;
> > both of them have timekeeping_notify() in the call trace,
> > but my problem has this call trace:
> >
> > [ 128.747853] wait_for_common+0xe0/0x1a0
> > [ 128.752023] wait_for_completion+0x28/0x38
> > [ 128.755677] __stop_cpus+0xd4/0xf8
> > [ 128.759837] stop_cpus+0x70/0xa8
> > [ 128.762958] stop_machine_cpuslocked+0x124/0x130
> > [ 128.766345] stop_machine+0x54/0x70
> > [ 128.771373] timekeeping_notify+0x44/0x70
> > [ 128.774158] __clocksource_select+0xa8/0x1d8
> > [ 128.778605] clocksource_done_booting+0x4c/0x64
> > [ 128.782931] do_one_initcall+0x94/0x3f8
> > [ 128.786921] kernel_init_freeable+0x47c/0x528
> > [ 128.790742] kernel_init+0x18/0x110
> > [ 128.795673] ret_from_fork+0x10/0x1c
> >
> >
> > while your other mail thread has this call trace:
> >
> > * stop_machine()
> > * timekeeping_notify()
> > * __clocksource_select()
> > * clocksource_select()
> > * clocksource_watchdog_work()
> >
> >
> > So my problem is not related to the watchdog. I tried your revert anyway,
> > but unfortunately the problem persists.
>
> Oh, right, missed that distinction. And this is new?

I can reproduce it on 4.14.15, so it's probably not that new.

>
> I'll try and have a look. Lockdep doesn't suggest anything?

Thanks.

Usually, when seeing these soft lockups, no.

However, I just managed to get this lockdep splat when booting without
earlycon and ftrace=irqsoff.
I'm not sure that it is related, but I'm planning on looking into it anyway:

[ 70.644248] Bluetooth: hci0: QCA Failed to request file: qca/rampatch_00440302.bin (-11)
[ 70.644299] Bluetooth: hci0: QCA Failed to download patch (-11)
[ 70.644475] cfg80211: failed to load regulatory.db
[ 70.658037] remoteproc remoteproc0: powering up adsp-pil
[ 70.663080] remoteproc remoteproc0: Direct firmware load for adsp.mdt failed with error -2
[ 70.667438]
[ 70.667445] ======================================================
[ 70.667450] WARNING: possible circular locking dependency detected
[ 70.667459] 4.19.0-rc2-next-20180904-00002-gd2e5ea917660-dirty #44 Tainted: G W
[ 70.667464] ------------------------------------------------------
[ 70.667471] kworker/0:1/14 is trying to acquire lock:
[ 70.667478] (____ptrval____) (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0x3d4/0x6a0
[ 70.667512]
[ 70.667512] but task is already holding lock:
[ 70.667517] (____ptrval____) (&port_lock_key){-.-.}, at: msm_uart_irq+0x38/0x710
[ 70.667545]
[ 70.667545] which lock already depends on the new lock.
[ 70.667545]
[ 70.667550]
[ 70.667550] the existing dependency chain (in reverse order) is:
[ 70.667555]
[ 70.667555] -> #2 (&port_lock_key){-.-.}:
[ 70.667585] _raw_spin_lock+0x44/0x58
[ 70.667591] __msm_console_write+0x84/0x1e8
[ 70.667598] msm_console_write+0x64/0x78
[ 70.667608] console_unlock+0x400/0x610
[ 70.667615] register_console+0x290/0x3b8
[ 70.667622] uart_add_one_port+0x4cc/0x4d8
[ 70.667628] msm_serial_probe+0x158/0x1d8
[ 70.667639] platform_drv_probe+0x58/0xa8
[ 70.667645] really_probe+0x280/0x3d8
[ 70.667651] driver_probe_device+0x60/0x148
[ 70.667658] __driver_attach+0x144/0x148
[ 70.667665] bus_for_each_dev+0x84/0xd8
[ 70.667671] driver_attach+0x30/0x40
[ 70.667677] bus_add_driver+0x234/0x2a8
[ 70.667684] driver_register+0x64/0x110
[ 70.667691] __platform_driver_register+0x54/0x60
[ 70.667700] msm_serial_init+0x40/0x70
[ 70.667709] do_one_initcall+0x94/0x3f8
[ 70.667717] kernel_init_freeable+0x47c/0x528
[ 70.667726] kernel_init+0x18/0x110
[ 70.667732] ret_from_fork+0x10/0x1c
[ 70.667737]
[ 70.667737]
[ 70.667737] -> #1 (console_owner){-.-.}:
[ 70.667762] console_unlock+0x298/0x610
[ 70.667769] vprintk_emit+0x110/0x298
[ 70.667776] vprintk_default+0x48/0x58
[ 70.667782] vprintk_func+0x100/0x200
[ 70.667789] printk+0x74/0x94
[ 70.667797] __warn_printk+0x48/0xa0
[ 70.667803] check_flush_dependency+0xf8/0x150
[ 70.667809] __flush_work+0xc0/0x2e0
[ 70.667815] __cancel_work_timer+0x154/0x1c8
[ 70.667821] cancel_delayed_work_sync+0x24/0x30
[ 70.667832] ufshcd_ungate_work+0x30/0x100
[ 70.667838] process_one_work+0x2a0/0x710
[ 70.667844] worker_thread+0x48/0x478
[ 70.667852] kthread+0x134/0x138
[ 70.667858] ret_from_fork+0x10/0x1c
[ 70.667863]
[ 70.667863] -> #0 (&(&pool->lock)->rlock){-.-.}:
[ 70.667890] lock_acquire+0xc0/0x230
[ 70.667897] _raw_spin_lock+0x44/0x58
[ 70.667903] __queue_work+0x3d4/0x6a0
[ 70.667909] queue_work_on+0xc8/0xd0
[ 70.667920] hci_uart_tx_wakeup+0x188/0x228
[ 70.667926] hci_uart_write_wakeup+0x70/0x78
[ 70.667933] ttyport_write_wakeup+0xdc/0xe8
[ 70.667940] tty_port_tty_wakeup+0x28/0x38
[ 70.667946] uart_write_wakeup+0x24/0x38
[ 70.667952] msm_handle_tx_pio+0x94/0x1d0
[ 70.667958] msm_handle_tx+0x10c/0x3c0
[ 70.667964] msm_uart_irq+0x310/0x710
[ 70.667972] __handle_irq_event_percpu+0xb8/0x440
[ 70.667978] handle_irq_event_percpu+0x40/0x98
[ 70.667983] handle_irq_event+0x50/0x80
[ 70.667990] handle_fasteoi_irq+0xc8/0x1a0
[ 70.667998] generic_handle_irq+0x34/0x50
[ 70.668004] __handle_domain_irq+0x8c/0xf8
[ 70.668010] gic_handle_irq+0x84/0x180
[ 70.668016] el1_irq+0xec/0x198
[ 70.668022] console_unlock+0x4e0/0x610
[ 70.668029] vprintk_emit+0x110/0x298
[ 70.668036] dev_vprintk_emit+0x150/0x248
[ 70.668042] dev_printk_emit+0x84/0xa8
[ 70.668048] __dev_printk+0x5c/0xa0
[ 70.668054] _dev_warn+0x74/0x98
[ 70.668062] _request_firmware+0x428/0x5d8
[ 70.668068] request_firmware+0x40/0x50
[ 70.668076] rproc_boot+0xfc/0x480
[ 70.668081] rproc_auto_boot_callback+0x24/0x38
[ 70.668087] request_firmware_work_func+0x50/0x88
[ 70.668094] process_one_work+0x2a0/0x710
[ 70.668100] worker_thread+0x48/0x478
[ 70.668106] kthread+0x134/0x138
[ 70.668113] ret_from_fork+0x10/0x1c
[ 70.668118]
[ 70.668124] Chain exists of:
[ 70.668124] &(&pool->lock)->rlock --> console_owner --> &port_lock_key
[ 70.668124]
[ 70.668155] Possible unsafe locking scenario:
[ 70.668155]
[ 70.668160] CPU0 CPU1
[ 70.668164] ---- ----
[ 70.668169] lock(&port_lock_key);
[ 70.668182] lock(console_owner);
[ 70.668195] lock(&port_lock_key);
[ 70.668209] lock(&(&pool->lock)->rlock);
[ 70.668222]
[ 70.668222] *** DEADLOCK ***
[ 70.668222]
[ 70.668229] 6 locks held by kworker/0:1/14:
[ 70.668234] #0: (____ptrval____) ((wq_completion)"events"){+.+.}, at: process_one_work+0x1f8/0x710
[ 70.668263] #1: (____ptrval____) ((work_completion)(&fw_work->work)){+.+.}, at: process_one_work+0x1f8/0x710
[ 70.668293] #2: (____ptrval____) (&rproc->lock){+.+.}, at: rproc_boot+0x4c/0x480
[ 70.668322] #3: (____ptrval____) (console_lock){+.+.}, at: vprintk_emit+0x108/0x298
[ 70.668351] #4: (____ptrval____) (&port_lock_key){-.-.}, at: msm_uart_irq+0x38/0x710
[ 70.668379] #5: (____ptrval____) (&hu->proto_lock){.+.+}, at: hci_uart_write_wakeup+0x70/0x78
[ 70.668409]
[ 70.668409] stack backtrace:
[ 70.668418] CPU: 0 PID: 14 Comm: kworker/0:1 Tainted: G W 4.19.0-rc2-next-20180904-00002-gd2e5ea917660-dirty #44
[ 70.668423] Hardware name: Qualcomm Technologies, Inc. DB820c (DT)
[ 70.668435] Workqueue: events request_firmware_work_func
[ 70.668446] Call trace:
[ 70.668454] dump_backtrace+0x0/0x200
[ 70.668460] show_stack+0x24/0x30
[ 70.668469] dump_stack+0xac/0xe4
[ 70.668476] print_circular_bug.isra.19+0x1d4/0x2e8
[ 70.668482] __lock_acquire+0x1814/0x1878
[ 70.668489] lock_acquire+0xc0/0x230
[ 70.668495] _raw_spin_lock+0x44/0x58
[ 70.668501] __queue_work+0x3d4/0x6a0
[ 70.668507] queue_work_on+0xc8/0xd0
[ 70.668514] hci_uart_tx_wakeup+0x188/0x228
[ 70.668520] hci_uart_write_wakeup+0x70/0x78
[ 70.668527] ttyport_write_wakeup+0xdc/0xe8
[ 70.668533] tty_port_tty_wakeup+0x28/0x38
[ 70.668539] uart_write_wakeup+0x24/0x38
[ 70.668544] msm_handle_tx_pio+0x94/0x1d0
[ 70.668550] msm_handle_tx+0x10c/0x3c0
[ 70.668557] msm_uart_irq+0x310/0x710
[ 70.668563] __handle_irq_event_percpu+0xb8/0x440
[ 70.668568] handle_irq_event_percpu+0x40/0x98
[ 70.668574] handle_irq_event+0x50/0x80
[ 70.668581] handle_fasteoi_irq+0xc8/0x1a0
[ 70.668588] generic_handle_irq+0x34/0x50
[ 70.668594] __handle_domain_irq+0x8c/0xf8
[ 70.668599] gic_handle_irq+0x84/0x180
[ 70.668606] el1_irq+0xec/0x198
[ 70.668612] console_unlock+0x4e0/0x610
[ 70.668619] vprintk_emit+0x110/0x298
[ 70.668625] dev_vprintk_emit+0x150/0x248
[ 70.668631] dev_printk_emit+0x84/0xa8
[ 70.668637] __dev_printk+0x5c/0xa0
[ 70.668643] _dev_warn+0x74/0x98
[ 70.668649] _request_firmware+0x428/0x5d8
[ 70.668656] request_firmware+0x40/0x50
[ 70.668662] rproc_boot+0xfc/0x480
[ 70.668668] rproc_auto_boot_callback+0x24/0x38
[ 70.668674] request_firmware_work_func+0x50/0x88
[ 70.668680] process_one_work+0x2a0/0x710
[ 70.668686] worker_thread+0x48/0x478
[ 70.668693] kthread+0x134/0x138
[ 70.668699] ret_from_fork+0x10/0x1c


Kind regards,
Niklas

2018-09-06 13:07:42

by Peter Zijlstra

Subject: Re: stop_machine() soft lockup

On Wed, Sep 05, 2018 at 03:45:53PM +0200, Niklas Cassel wrote:

> However, I just managed to get this lockdep splat when booting without
> earlycon and ftrace=irqsoff.
> I'm not sure that it is related, but I'm planning on looking into it anyway:

That looks like your generic "printk is crap" splat.

I tend to use this patch-set:

https://lkml.kernel.org/r/[email protected]

and have (on x86):

earlyprintk=serial,ttyS0,115200 force_early_printk debug ignore_loglevel

That results in a printk that does pure UART bit-banging and works from
any context without issue.