2017-07-24 14:34:42

by Leo Yan

[permalink] [raw]
Subject: ARM64 board Hikey960 boot failure due to f2545b2d4ce1 (jump_label: Reorder hotplug lock and jump_label_lock)

Hi all,

We found the mainline arm64 kernel boot failure on Hikey960 board,
this is caused by patch f2545b2d4ce1 (jump_label: Reorder hotplug lock
and jump_label_lock), this patch adds locking cpus_read_lock() in
function static_key_slow_inc() and introduce the dead lock issue by
acquiring lock twice. Below are detailed flow:

arch_timer_register()
`> cpuhp_setup_state()
`> __cpuhp_setup_state()
cpus_read_lock()
`> __cpuhp_setup_state_cpuslocked()
`> cpuhp_issue_call()
`> arch_timer_starting_cpu()
`> __arch_timer_setup()
`> arch_timer_check_ool_workaround()
`> arch_timer_enable_workaround()
`> static_branch_enable()
`> static_key_enable()
`> static_key_slow_inc()
`> cpus_read_lock()

So finally there have called cpus_read_lock() twice, and kernel report
log as below. So I am not sure what's the best way to fix this issue,
could you give some suggestion for this? Thanks.

---8<---

** 510118 printk messages dropped ** [ 1.299405] [<ffff000008831140>] arch_timer_check_ool_workaround+0xc8/0x1fc
** 5516 printk messages dropped ** [ 1.309965] [<ffff00000899d3a4>] schedule+0x38/0x9c
** 4289 printk messages dropped ** [ 1.317961] [<ffff00000899d3a4>] schedule+0x38/0x9c
** 4293 printk messages dropped ** [ 1.325961] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
** 4857 printk messages dropped ** [ 1.335010] [<ffff00000899d3a4>] schedule+0x38/0x9c
** 4290 printk messages dropped ** [ 1.343008] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
** 4999 printk messages dropped ** [ 1.352325] [<ffff000008831584>] arch_timer_starting_cpu+0xf8/0x2c4
** 5048 printk messages dropped ** [ 1.361726] [<ffff000008989e70>] dump_stack+0x90/0xb0
** 4381 printk messages dropped ** [ 1.369892] [<ffff00000899d3a4>] schedule+0x38/0x9c
** 4291 printk messages dropped ** [ 1.377895] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4811 printk messages dropped ** [ 1.386856] [<ffff000008989e70>] dump_stack+0x90/0xb0
** 4384 printk messages dropped ** [ 1.395026] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
** 4575 printk messages dropped ** [ 1.403553] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
** 4572 printk messages dropped ** [ 1.412071] [<ffff00000899d3a4>] schedule+0x38/0x9c
** 4291 printk messages dropped ** [ 1.420070] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4812 printk messages dropped ** [ 1.429034] [<ffff0000080eacb4>] dequeue_task_idle+0x24/0x38
** 4714 printk messages dropped ** [ 1.437821] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
** 4855 printk messages dropped ** [ 1.446871] [<ffff0000080e53f8>] deactivate_task+0x70/0xac
** 4620 printk messages dropped ** [ 1.455480] [<ffff00000899cfec>] __schedule+0x244/0x5c4
** 4480 printk messages dropped ** [ 1.463828] Hardware name: HiKey960 (DT)
** 3771 printk messages dropped ** [ 1.470855] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4808 printk messages dropped ** [ 1.479815] Call trace:
** 2965 printk messages dropped ** [ 1.485352] [<00000000009a81b4>] 0x9a81b4
** 3818 printk messages dropped ** [ 1.492473] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4808 printk messages dropped ** [ 1.501436] Call trace:
** 2968 printk messages dropped ** [ 1.506971] Hardware name: HiKey960 (DT)
** 3769 printk messages dropped ** [ 1.514005] [<ffff00000899d3a4>] schedule+0x38/0x9c
** 4288 printk messages dropped ** [ 1.522001] [<ffff00000899cfec>] __schedule+0x244/0x5c4
** 4473 printk messages dropped ** [ 1.530347] [<ffff000008831584>] arch_timer_starting_cpu+0xf8/0x2c4
** 5041 printk messages dropped ** [ 1.539747] [<00000000009a81b4>] 0x9a81b4
** 3819 printk messages dropped ** [ 1.546867] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
** 4572 printk messages dropped ** [ 1.555393] [<ffff00000899d3a4>] schedule+0x38/0x9c
** 4288 printk messages dropped ** [ 1.563390] [<ffff00000899cfec>] __schedule+0x244/0x5c4
** 4473 printk messages dropped ** [ 1.571734] [<ffff000008831584>] arch_timer_starting_cpu+0xf8/0x2c4
** 5043 printk messages dropped ** [ 1.581138] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S W 4.13.0-rc1-linaro-hikey960+ #30
** 6551 printk messages dropped ** [ 1.593352] [<ffff0000080c3640>] notify_cpu_starting+0x68/0x88
** 4807 printk messages dropped ** [ 1.602312] [<ffff00000899cfec>] __schedule+0x244/0x5c4
** 4476 printk messages dropped ** [ 1.610663] [<ffff00000808e9d0>] secondary_start_kernel+0xb0/0x118
** 4996 printk messages dropped ** [ 1.619972] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S W 4.13.0-rc1-linaro-hikey960+ #30
** 6553 printk messages dropped ** [ 1.632182] [<00000000009a81b4>] 0x9a81b4
** 3820 printk messages dropped ** [ 1.639301] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
** 4854 printk messages dropped ** [ 1.648349] [<ffff0000080eacb4>] dequeue_task_idle+0x24/0x38
** 4712 printk messages dropped ** [ 1.657139] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4804 printk messages dropped ** [ 1.666098] [<00000000009a81b4>] 0x9a81b4
** 3821 printk messages dropped ** [ 1.673223] [<ffff0000081723ec>] static_key_enable+0x20/0x34
** 4713 printk messages dropped ** [ 1.682006] [<00000000009a81b4>] 0x9a81b4
** 3821 printk messages dropped ** [ 1.689128] [<ffff0000081723ec>] static_key_enable+0x20/0x34
** 4713 printk messages dropped ** [ 1.697911] [<00000000009a81b4>] 0x9a81b4
** 3820 printk messages dropped ** [ 1.705033] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
** 4854 printk messages dropped ** [ 1.714082] [<ffff0000080eacb4>] dequeue_task_idle+0x24/0x38
** 4711 printk messages dropped ** [ 1.722868] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
** 4998 printk messages dropped ** [ 1.732185] [<ffff000008831140>] arch_timer_check_ool_workaround+0xc8/0x1fc
** 5419 printk messages dropped ** [ 1.742283] Hardware name: HiKey960 (DT)
** 3773 printk messages dropped ** [ 1.749320] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
** 4856 printk messages dropped ** [ 1.758369] [<ffff00000899cfec>] __schedule+0x244/0x5c4
** 4477 printk messages dropped ** [ 1.766714] [<00000000009a81b4>] 0x9a81b4
** 3821 printk messages dropped ** [ 1.773834] [<ffff0000081723ec>] static_key_enable+0x20/0x34
** 4713 printk messages dropped ** [ 1.782619] [<00000000009a81b4>] 0x9a81b4
** 3820 printk messages dropped ** [ 1.789743] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
** 4857 printk messages dropped ** [ 1.798790] [<ffff00000899d3a4>] schedule+0x38/0x9c
** 4290 printk messages dropped ** [ 1.806785] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
** 4995 printk messages dropped ** [ 1.816100] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
** 4574 printk messages dropped ** [ 1.824627] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4811 printk messages dropped ** [ 1.833590] [<ffff000008989e70>] dump_stack+0x90/0xb0
** 4385 printk messages dropped ** [ 1.841760] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
** 4856 printk messages dropped ** [ 1.850811] [<ffff00000899cfec>] __schedule+0x244/0x5c4
** 4476 printk messages dropped ** [ 1.859155] [<ffff00000808e9d0>] secondary_start_kernel+0xb0/0x118
** 4997 printk messages dropped ** [ 1.868468] Hardware name: HiKey960 (DT)
** 3771 printk messages dropped ** [ 1.875501] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4810 printk messages dropped ** [ 1.884467] [<ffff000008088590>] show_stack+0x14/0x1c
** 4383 printk messages dropped ** [ 1.892638] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
** 4997 printk messages dropped ** [ 1.901951] [<ffff0000081723ec>] static_key_enable+0x20/0x34
** 4713 printk messages dropped ** [ 1.910736] [<00000000009a81b4>] 0x9a81b4
** 3822 printk messages dropped ** [ 1.917857] [<ffff000008831140>] arch_timer_check_ool_workaround+0xc8/0x1fc
** 5423 printk messages dropped ** [ 1.927961] [<ffff000008989e70>] dump_stack+0x90/0xb0
** 4386 printk messages dropped ** [ 1.936135] [<ffff0000081723ec>] static_key_enable+0x20/0x34
** 4713 printk messages dropped ** [ 1.944918] [<00000000009a81b4>] 0x9a81b4
** 3819 printk messages dropped ** [ 1.952036] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
** 4575 printk messages dropped ** [ 1.960563] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
** 4569 printk messages dropped ** [ 1.969080] [<ffff0000080eacb4>] dequeue_task_idle+0x24/0x38
** 4713 printk messages dropped ** [ 1.977869] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
** 4572 printk messages dropped ** [ 1.986395] [<ffff00000899d3a4>] schedule+0x38/0x9c
** 4291 printk messages dropped ** [ 1.994392] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4810 printk messages dropped ** [ 2.003355] [<ffff000008088590>] show_stack+0x14/0x1c
** 4385 printk messages dropped ** [ 2.011528] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
** 4572 printk messages dropped ** [ 2.020048] [<ffff00000899d3a4>] schedule+0x38/0x9c
** 4291 printk messages dropped ** [ 2.028045] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4810 printk messages dropped ** [ 2.037012] [<ffff000008088590>] show_stack+0x14/0x1c
** 4385 printk messages dropped ** [ 2.045184] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
** 4573 printk messages dropped ** [ 2.053709] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
** 4998 printk messages dropped ** [ 2.063026] [<ffff000008831140>] arch_timer_check_ool_workaround+0xc8/0x1fc
** 5425 printk messages dropped ** [ 2.073128] [<ffff0000080e53f8>] deactivate_task+0x70/0xac
** 4620 printk messages dropped ** [ 2.081738] [<ffff00000899cfec>] __schedule+0x244/0x5c4
** 4478 printk messages dropped ** [ 2.090083] bad: scheduling from the idle thread!
** 4196 printk messages dropped ** [ 2.097910] [<ffff000008831584>] arch_timer_starting_cpu+0xf8/0x2c4
** 5044 printk messages dropped ** [ 2.107306] Hardware name: HiKey960 (DT)
** 3771 printk messages dropped ** [ 2.114340] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4810 printk messages dropped ** [ 2.123304] [<ffff000008088590>] show_stack+0x14/0x1c
** 4383 printk messages dropped ** [ 2.131475] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
** 4998 printk messages dropped ** [ 2.140793] [<ffff000008831140>] arch_timer_check_ool_workaround+0xc8/0x1fc
** 5424 printk messages dropped ** [ 2.150895] [<ffff0000080eacb4>] dequeue_task_idle+0x24/0x38
** 4715 printk messages dropped ** [ 2.159680] [<ffff0000081723ec>] static_key_enable+0x20/0x34
** 4716 printk messages dropped ** [ 2.168467] Hardware name: HiKey960 (DT)
** 3771 printk messages dropped ** [ 2.175497] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4804 printk messages dropped ** [ 2.184458] [<00000000009a81b4>] 0x9a81b4
** 3823 printk messages dropped ** [ 2.191583] [<ffff000008831584>] arch_timer_starting_cpu+0xf8/0x2c4
** 5041 printk messages dropped ** [ 2.200979] [<00000000009a81b4>] 0x9a81b4
** 3821 printk messages dropped ** [ 2.208105] [<ffff0000081723ec>] static_key_enable+0x20/0x34
** 4713 printk messages dropped ** [ 2.216886] [<00000000009a81b4>] 0x9a81b4
** 3819 printk messages dropped ** [ 2.224004] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
** 4574 printk messages dropped ** [ 2.232530] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
** 4807 printk messages dropped ** [ 2.241489] Hardware name: HiKey960 (DT


2017-07-25 10:47:05

by Robin Murphy

[permalink] [raw]
Subject: Re: ARM64 board Hikey960 boot failure due to f2545b2d4ce1 (jump_label: Reorder hotplug lock and jump_label_lock)

On 24/07/17 15:34, Leo Yan wrote:
> Hi all,
>
> We found the mainline arm64 kernel boot failure on Hikey960 board,
> this is caused by patch f2545b2d4ce1 (jump_label: Reorder hotplug lock
> and jump_label_lock), this patch adds locking cpus_read_lock() in
> function static_key_slow_inc() and introduce the dead lock issue by
> acquiring lock twice. Below are detailed flow:
>
> arch_timer_register()
> `> cpuhp_setup_state()
> `> __cpuhp_setup_state()
> cpus_read_lock()
> `> __cpuhp_setup_state_cpuslocked()
> `> cpuhp_issue_call()
> `> arch_timer_starting_cpu()
> `> __arch_timer_setup()
> `> arch_timer_check_ool_workaround()
> `> arch_timer_enable_workaround()
> `> static_branch_enable()
> `> static_key_enable()
> `> static_key_slow_inc()
> `> cpus_read_lock()
>
> So finally there have called cpus_read_lock() twice, and kernel report
> log as below. So I am not sure what's the best way to fix this issue,
> could you give some suggestion for this? Thanks.

FWIW, I've reproduced this on Juno by hacking the erratum 858921
workaround to match Cortex-A53 or A57 instead - interestingly, when it's
applied for the boot CPU things seem OK; only when it's applied on
secondary bringup do I see the explosion as below.

Robin.

> ---8<---
>
> ** 510118 printk messages dropped ** [ 1.299405] [<ffff000008831140>] arch_timer_check_ool_workaround+0xc8/0x1fc
> ** 5516 printk messages dropped ** [ 1.309965] [<ffff00000899d3a4>] schedule+0x38/0x9c
> ** 4289 printk messages dropped ** [ 1.317961] [<ffff00000899d3a4>] schedule+0x38/0x9c
> ** 4293 printk messages dropped ** [ 1.325961] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
> ** 4857 printk messages dropped ** [ 1.335010] [<ffff00000899d3a4>] schedule+0x38/0x9c
> ** 4290 printk messages dropped ** [ 1.343008] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
> ** 4999 printk messages dropped ** [ 1.352325] [<ffff000008831584>] arch_timer_starting_cpu+0xf8/0x2c4
> ** 5048 printk messages dropped ** [ 1.361726] [<ffff000008989e70>] dump_stack+0x90/0xb0
> ** 4381 printk messages dropped ** [ 1.369892] [<ffff00000899d3a4>] schedule+0x38/0x9c
> ** 4291 printk messages dropped ** [ 1.377895] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4811 printk messages dropped ** [ 1.386856] [<ffff000008989e70>] dump_stack+0x90/0xb0
> ** 4384 printk messages dropped ** [ 1.395026] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
> ** 4575 printk messages dropped ** [ 1.403553] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
> ** 4572 printk messages dropped ** [ 1.412071] [<ffff00000899d3a4>] schedule+0x38/0x9c
> ** 4291 printk messages dropped ** [ 1.420070] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4812 printk messages dropped ** [ 1.429034] [<ffff0000080eacb4>] dequeue_task_idle+0x24/0x38
> ** 4714 printk messages dropped ** [ 1.437821] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
> ** 4855 printk messages dropped ** [ 1.446871] [<ffff0000080e53f8>] deactivate_task+0x70/0xac
> ** 4620 printk messages dropped ** [ 1.455480] [<ffff00000899cfec>] __schedule+0x244/0x5c4
> ** 4480 printk messages dropped ** [ 1.463828] Hardware name: HiKey960 (DT)
> ** 3771 printk messages dropped ** [ 1.470855] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4808 printk messages dropped ** [ 1.479815] Call trace:
> ** 2965 printk messages dropped ** [ 1.485352] [<00000000009a81b4>] 0x9a81b4
> ** 3818 printk messages dropped ** [ 1.492473] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4808 printk messages dropped ** [ 1.501436] Call trace:
> ** 2968 printk messages dropped ** [ 1.506971] Hardware name: HiKey960 (DT)
> ** 3769 printk messages dropped ** [ 1.514005] [<ffff00000899d3a4>] schedule+0x38/0x9c
> ** 4288 printk messages dropped ** [ 1.522001] [<ffff00000899cfec>] __schedule+0x244/0x5c4
> ** 4473 printk messages dropped ** [ 1.530347] [<ffff000008831584>] arch_timer_starting_cpu+0xf8/0x2c4
> ** 5041 printk messages dropped ** [ 1.539747] [<00000000009a81b4>] 0x9a81b4
> ** 3819 printk messages dropped ** [ 1.546867] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
> ** 4572 printk messages dropped ** [ 1.555393] [<ffff00000899d3a4>] schedule+0x38/0x9c
> ** 4288 printk messages dropped ** [ 1.563390] [<ffff00000899cfec>] __schedule+0x244/0x5c4
> ** 4473 printk messages dropped ** [ 1.571734] [<ffff000008831584>] arch_timer_starting_cpu+0xf8/0x2c4
> ** 5043 printk messages dropped ** [ 1.581138] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S W 4.13.0-rc1-linaro-hikey960+ #30
> ** 6551 printk messages dropped ** [ 1.593352] [<ffff0000080c3640>] notify_cpu_starting+0x68/0x88
> ** 4807 printk messages dropped ** [ 1.602312] [<ffff00000899cfec>] __schedule+0x244/0x5c4
> ** 4476 printk messages dropped ** [ 1.610663] [<ffff00000808e9d0>] secondary_start_kernel+0xb0/0x118
> ** 4996 printk messages dropped ** [ 1.619972] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G S W 4.13.0-rc1-linaro-hikey960+ #30
> ** 6553 printk messages dropped ** [ 1.632182] [<00000000009a81b4>] 0x9a81b4
> ** 3820 printk messages dropped ** [ 1.639301] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
> ** 4854 printk messages dropped ** [ 1.648349] [<ffff0000080eacb4>] dequeue_task_idle+0x24/0x38
> ** 4712 printk messages dropped ** [ 1.657139] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4804 printk messages dropped ** [ 1.666098] [<00000000009a81b4>] 0x9a81b4
> ** 3821 printk messages dropped ** [ 1.673223] [<ffff0000081723ec>] static_key_enable+0x20/0x34
> ** 4713 printk messages dropped ** [ 1.682006] [<00000000009a81b4>] 0x9a81b4
> ** 3821 printk messages dropped ** [ 1.689128] [<ffff0000081723ec>] static_key_enable+0x20/0x34
> ** 4713 printk messages dropped ** [ 1.697911] [<00000000009a81b4>] 0x9a81b4
> ** 3820 printk messages dropped ** [ 1.705033] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
> ** 4854 printk messages dropped ** [ 1.714082] [<ffff0000080eacb4>] dequeue_task_idle+0x24/0x38
> ** 4711 printk messages dropped ** [ 1.722868] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
> ** 4998 printk messages dropped ** [ 1.732185] [<ffff000008831140>] arch_timer_check_ool_workaround+0xc8/0x1fc
> ** 5419 printk messages dropped ** [ 1.742283] Hardware name: HiKey960 (DT)
> ** 3773 printk messages dropped ** [ 1.749320] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
> ** 4856 printk messages dropped ** [ 1.758369] [<ffff00000899cfec>] __schedule+0x244/0x5c4
> ** 4477 printk messages dropped ** [ 1.766714] [<00000000009a81b4>] 0x9a81b4
> ** 3821 printk messages dropped ** [ 1.773834] [<ffff0000081723ec>] static_key_enable+0x20/0x34
> ** 4713 printk messages dropped ** [ 1.782619] [<00000000009a81b4>] 0x9a81b4
> ** 3820 printk messages dropped ** [ 1.789743] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
> ** 4857 printk messages dropped ** [ 1.798790] [<ffff00000899d3a4>] schedule+0x38/0x9c
> ** 4290 printk messages dropped ** [ 1.806785] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
> ** 4995 printk messages dropped ** [ 1.816100] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
> ** 4574 printk messages dropped ** [ 1.824627] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4811 printk messages dropped ** [ 1.833590] [<ffff000008989e70>] dump_stack+0x90/0xb0
> ** 4385 printk messages dropped ** [ 1.841760] [<ffff000008172370>] static_key_slow_inc+0xc8/0x124
> ** 4856 printk messages dropped ** [ 1.850811] [<ffff00000899cfec>] __schedule+0x244/0x5c4
> ** 4476 printk messages dropped ** [ 1.859155] [<ffff00000808e9d0>] secondary_start_kernel+0xb0/0x118
> ** 4997 printk messages dropped ** [ 1.868468] Hardware name: HiKey960 (DT)
> ** 3771 printk messages dropped ** [ 1.875501] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4810 printk messages dropped ** [ 1.884467] [<ffff000008088590>] show_stack+0x14/0x1c
> ** 4383 printk messages dropped ** [ 1.892638] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
> ** 4997 printk messages dropped ** [ 1.901951] [<ffff0000081723ec>] static_key_enable+0x20/0x34
> ** 4713 printk messages dropped ** [ 1.910736] [<00000000009a81b4>] 0x9a81b4
> ** 3822 printk messages dropped ** [ 1.917857] [<ffff000008831140>] arch_timer_check_ool_workaround+0xc8/0x1fc
> ** 5423 printk messages dropped ** [ 1.927961] [<ffff000008989e70>] dump_stack+0x90/0xb0
> ** 4386 printk messages dropped ** [ 1.936135] [<ffff0000081723ec>] static_key_enable+0x20/0x34
> ** 4713 printk messages dropped ** [ 1.944918] [<00000000009a81b4>] 0x9a81b4
> ** 3819 printk messages dropped ** [ 1.952036] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
> ** 4575 printk messages dropped ** [ 1.960563] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
> ** 4569 printk messages dropped ** [ 1.969080] [<ffff0000080eacb4>] dequeue_task_idle+0x24/0x38
> ** 4713 printk messages dropped ** [ 1.977869] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
> ** 4572 printk messages dropped ** [ 1.986395] [<ffff00000899d3a4>] schedule+0x38/0x9c
> ** 4291 printk messages dropped ** [ 1.994392] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4810 printk messages dropped ** [ 2.003355] [<ffff000008088590>] show_stack+0x14/0x1c
> ** 4385 printk messages dropped ** [ 2.011528] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
> ** 4572 printk messages dropped ** [ 2.020048] [<ffff00000899d3a4>] schedule+0x38/0x9c
> ** 4291 printk messages dropped ** [ 2.028045] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4810 printk messages dropped ** [ 2.037012] [<ffff000008088590>] show_stack+0x14/0x1c
> ** 4385 printk messages dropped ** [ 2.045184] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
> ** 4573 printk messages dropped ** [ 2.053709] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
> ** 4998 printk messages dropped ** [ 2.063026] [<ffff000008831140>] arch_timer_check_ool_workaround+0xc8/0x1fc
> ** 5425 printk messages dropped ** [ 2.073128] [<ffff0000080e53f8>] deactivate_task+0x70/0xac
> ** 4620 printk messages dropped ** [ 2.081738] [<ffff00000899cfec>] __schedule+0x244/0x5c4
> ** 4478 printk messages dropped ** [ 2.090083] bad: scheduling from the idle thread!
> ** 4196 printk messages dropped ** [ 2.097910] [<ffff000008831584>] arch_timer_starting_cpu+0xf8/0x2c4
> ** 5044 printk messages dropped ** [ 2.107306] Hardware name: HiKey960 (DT)
> ** 3771 printk messages dropped ** [ 2.114340] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4810 printk messages dropped ** [ 2.123304] [<ffff000008088590>] show_stack+0x14/0x1c
> ** 4383 printk messages dropped ** [ 2.131475] [<ffff0000089a0260>] rwsem_down_read_failed+0xdc/0x108
> ** 4998 printk messages dropped ** [ 2.140793] [<ffff000008831140>] arch_timer_check_ool_workaround+0xc8/0x1fc
> ** 5424 printk messages dropped ** [ 2.150895] [<ffff0000080eacb4>] dequeue_task_idle+0x24/0x38
> ** 4715 printk messages dropped ** [ 2.159680] [<ffff0000081723ec>] static_key_enable+0x20/0x34
> ** 4716 printk messages dropped ** [ 2.168467] Hardware name: HiKey960 (DT)
> ** 3771 printk messages dropped ** [ 2.175497] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4804 printk messages dropped ** [ 2.184458] [<00000000009a81b4>] 0x9a81b4
> ** 3823 printk messages dropped ** [ 2.191583] [<ffff000008831584>] arch_timer_starting_cpu+0xf8/0x2c4
> ** 5041 printk messages dropped ** [ 2.200979] [<00000000009a81b4>] 0x9a81b4
> ** 3821 printk messages dropped ** [ 2.208105] [<ffff0000081723ec>] static_key_enable+0x20/0x34
> ** 4713 printk messages dropped ** [ 2.216886] [<00000000009a81b4>] 0x9a81b4
> ** 3819 printk messages dropped ** [ 2.224004] [<ffff0000080c1e44>] cpus_read_lock+0x70/0x74
> ** 4574 printk messages dropped ** [ 2.232530] [<ffff0000080ffe60>] __percpu_down_read+0xfc/0x110
> ** 4807 printk messages dropped ** [ 2.241489] Hardware name: HiKey960 (DT
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

2017-07-26 15:13:53

by Marc Zyngier

[permalink] [raw]
Subject: Re: ARM64 board Hikey960 boot failure due to f2545b2d4ce1 (jump_label: Reorder hotplug lock and jump_label_lock)

[+Mark]

Hi Leo,

On 24/07/17 15:34, Leo Yan wrote:
> Hi all,
>
> We found the mainline arm64 kernel boot failure on Hikey960 board,
> this is caused by patch f2545b2d4ce1 (jump_label: Reorder hotplug lock
> and jump_label_lock), this patch adds locking cpus_read_lock() in
> function static_key_slow_inc() and introduce the dead lock issue by
> acquiring lock twice. Below are detailed flow:
>
> arch_timer_register()
> `> cpuhp_setup_state()
> `> __cpuhp_setup_state()
> cpus_read_lock()
> `> __cpuhp_setup_state_cpuslocked()
> `> cpuhp_issue_call()
> `> arch_timer_starting_cpu()
> `> __arch_timer_setup()
> `> arch_timer_check_ool_workaround()
> `> arch_timer_enable_workaround()
> `> static_branch_enable()
> `> static_key_enable()
> `> static_key_slow_inc()
> `> cpus_read_lock()
>
> So finally there have called cpus_read_lock() twice, and kernel report
> log as below. So I am not sure what's the best way to fix this issue,
> could you give some suggestion for this? Thanks.

[...]

Thanks for this. Unfortunately, there is no easy fix for this.
Can you give the patch below a go and let us know if that solves
the issue you observed? I only tested in on a model...

Should this be considered an acceptable solution, I'll split that
into individual patches and repost it as a proper series.

Thanks,

M.

diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index aae87c4c546e..44232f378543 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -455,7 +455,11 @@ void arch_timer_enable_workaround(const struct arch_timer_erratum_workaround *wa
per_cpu(timer_unstable_counter_workaround, i) = wa;
}

- static_branch_enable(&arch_timer_read_ool_enabled);
+ /*
+ * Use the _locked version, as we're called from the CPU
+ * hotplug framework. Otherwise, we end-up in deadlock-land.
+ */
+ static_branch_enable_locked(&arch_timer_read_ool_enabled);

/*
* Don't use the vdso fastpath if errata require using the
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 2afd74b9d844..5cfbf9ff3fe8 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -159,10 +159,14 @@ extern void arch_jump_label_transform_static(struct jump_entry *entry,
extern int jump_label_text_reserved(void *start, void *end);
extern void static_key_slow_inc(struct static_key *key);
extern void static_key_slow_dec(struct static_key *key);
+extern void static_key_slow_inc_locked(struct static_key *key);
+extern void static_key_slow_dec_locked(struct static_key *key);
extern void jump_label_apply_nops(struct module *mod);
extern int static_key_count(struct static_key *key);
extern void static_key_enable(struct static_key *key);
extern void static_key_disable(struct static_key *key);
+extern void static_key_enable_locked(struct static_key *key);
+extern void static_key_disable_locked(struct static_key *key);

/*
* We should be using ATOMIC_INIT() for initializing .enabled, but
@@ -213,12 +217,16 @@ static inline void static_key_slow_inc(struct static_key *key)
atomic_inc(&key->enabled);
}

+#define static_key_slow_inc_locked(k) static_key_slow_inc((k))
+
static inline void static_key_slow_dec(struct static_key *key)
{
STATIC_KEY_CHECK_USE();
atomic_dec(&key->enabled);
}

+#define static_key_slow_dec_locked(k) static_key_slow_inc((k))
+
static inline int jump_label_text_reserved(void *start, void *end)
{
return 0;
@@ -415,6 +423,8 @@ extern bool ____wrong_branch_error(void);

#define static_branch_enable(x) static_key_enable(&(x)->key)
#define static_branch_disable(x) static_key_disable(&(x)->key)
+#define static_branch_enable_locked(x) static_key_enable_locked(&(x)->key)
+#define static_branch_disable_locked(x) static_key_disable_locked(&(x)->key)

#endif /* __ASSEMBLY__ */

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index d11c506a6ac3..f543f765a738 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -57,6 +57,11 @@ jump_label_sort_entries(struct jump_entry *start, struct jump_entry *stop)
}

static void jump_label_update(struct static_key *key);
+static void static_key_slow_inc_with_lock(struct static_key *key, bool lock);
+static void __static_key_slow_dec_with_lock(struct static_key *key,
+ bool lock,
+ unsigned long rate_limit,
+ struct delayed_work *work);

/*
* There are similar definitions for the !HAVE_JUMP_LABEL case in jump_label.h.
@@ -79,29 +84,53 @@ int static_key_count(struct static_key *key)
}
EXPORT_SYMBOL_GPL(static_key_count);

-void static_key_enable(struct static_key *key)
+static void static_key_enable_with_lock(struct static_key *key, bool lock)
{
int count = static_key_count(key);

WARN_ON_ONCE(count < 0 || count > 1);

if (!count)
- static_key_slow_inc(key);
+ static_key_slow_inc_with_lock(key, lock);
+}
+
+void static_key_enable(struct static_key *key)
+{
+ static_key_enable_with_lock(key, true);
}
EXPORT_SYMBOL_GPL(static_key_enable);

-void static_key_disable(struct static_key *key)
+void static_key_enable_locked(struct static_key *key)
+{
+ lockdep_assert_cpus_held();
+ static_key_enable_with_lock(key, false);
+}
+EXPORT_SYMBOL_GPL(static_key_enable_locked);
+
+static void static_key_disable_with_lock(struct static_key *key, bool lock)
{
int count = static_key_count(key);

WARN_ON_ONCE(count < 0 || count > 1);

if (count)
- static_key_slow_dec(key);
+ __static_key_slow_dec_with_lock(key, lock, 0, NULL);
+}
+
+void static_key_disable(struct static_key *key)
+{
+ static_key_disable_with_lock(key, true);
}
EXPORT_SYMBOL_GPL(static_key_disable);

-void static_key_slow_inc(struct static_key *key)
+void static_key_disable_locked(struct static_key *key)
+{
+ lockdep_assert_cpus_held();
+ static_key_disable_with_lock(key, false);
+}
+EXPORT_SYMBOL_GPL(static_key_disable_locked);
+
+static void static_key_slow_inc_with_lock(struct static_key *key, bool lock)
{
int v, v1;

@@ -125,7 +154,8 @@ void static_key_slow_inc(struct static_key *key)
return;
}

- cpus_read_lock();
+ if (lock)
+ cpus_read_lock();
jump_label_lock();
if (atomic_read(&key->enabled) == 0) {
atomic_set(&key->enabled, -1);
@@ -135,14 +165,30 @@ void static_key_slow_inc(struct static_key *key)
atomic_inc(&key->enabled);
}
jump_label_unlock();
- cpus_read_unlock();
+ if (lock)
+ cpus_read_unlock();
+}
+
+void static_key_slow_inc(struct static_key *key)
+{
+ static_key_slow_inc_with_lock(key, true);
}
EXPORT_SYMBOL_GPL(static_key_slow_inc);

-static void __static_key_slow_dec(struct static_key *key,
- unsigned long rate_limit, struct delayed_work *work)
+void static_key_slow_inc_locked(struct static_key *key)
{
- cpus_read_lock();
+ lockdep_assert_cpus_held();
+ static_key_slow_inc_with_lock(key, false);
+}
+EXPORT_SYMBOL_GPL(static_key_slow_inc_locked);
+
+static void __static_key_slow_dec_with_lock(struct static_key *key,
+ bool lock,
+ unsigned long rate_limit,
+ struct delayed_work *work)
+{
+ if (lock)
+ cpus_read_lock();
/*
* The negative count check is valid even when a negative
* key->enabled is in use by static_key_slow_inc(); a
@@ -153,7 +199,8 @@ static void __static_key_slow_dec(struct static_key *key,
if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) {
WARN(atomic_read(&key->enabled) < 0,
"jump label: negative count!\n");
- cpus_read_unlock();
+ if (lock)
+ cpus_read_unlock();
return;
}

@@ -164,27 +211,36 @@ static void __static_key_slow_dec(struct static_key *key,
jump_label_update(key);
}
jump_label_unlock();
- cpus_read_unlock();
+ if (lock)
+ cpus_read_unlock();
}

static void jump_label_update_timeout(struct work_struct *work)
{
struct static_key_deferred *key =
container_of(work, struct static_key_deferred, work.work);
- __static_key_slow_dec(&key->key, 0, NULL);
+ __static_key_slow_dec_with_lock(&key->key, true, 0, NULL);
}

void static_key_slow_dec(struct static_key *key)
{
STATIC_KEY_CHECK_USE();
- __static_key_slow_dec(key, 0, NULL);
+ __static_key_slow_dec_with_lock(key, true, 0, NULL);
}
EXPORT_SYMBOL_GPL(static_key_slow_dec);

+void static_key_slow_dec_locked(struct static_key *key)
+{
+ lockdep_assert_cpus_held();
+ STATIC_KEY_CHECK_USE();
+ __static_key_slow_dec_with_lock(key, false, 0, NULL);
+}
+EXPORT_SYMBOL_GPL(static_key_slow_dec_locked);
+
void static_key_slow_dec_deferred(struct static_key_deferred *key)
{
STATIC_KEY_CHECK_USE();
- __static_key_slow_dec(&key->key, key->timeout, &key->work);
+ __static_key_slow_dec_with_lock(&key->key, true, key->timeout, &key->work);
}
EXPORT_SYMBOL_GPL(static_key_slow_dec_deferred);


--
Jazz is not dead. It just smells funny...

2017-07-27 02:08:51

by Leo Yan

[permalink] [raw]
Subject: Re: ARM64 board Hikey960 boot failure due to f2545b2d4ce1 (jump_label: Reorder hotplug lock and jump_label_lock)

On Wed, Jul 26, 2017 at 04:13:49PM +0100, Marc Zyngier wrote:
> [+Mark]
>
> Hi Leo,
>
> On 24/07/17 15:34, Leo Yan wrote:
> > Hi all,
> >
> > We found the mainline arm64 kernel boot failure on Hikey960 board,
> > this is caused by patch f2545b2d4ce1 (jump_label: Reorder hotplug lock
> > and jump_label_lock), this patch adds locking cpus_read_lock() in
> > function static_key_slow_inc() and introduce the dead lock issue by
> > acquiring lock twice. Below are detailed flow:
> >
> > arch_timer_register()
> > `> cpuhp_setup_state()
> > `> __cpuhp_setup_state()
> > cpus_read_lock()
> > `> __cpuhp_setup_state_cpuslocked()
> > `> cpuhp_issue_call()
> > `> arch_timer_starting_cpu()
> > `> __arch_timer_setup()
> > `> arch_timer_check_ool_workaround()
> > `> arch_timer_enable_workaround()
> > `> static_branch_enable()
> > `> static_key_enable()
> > `> static_key_slow_inc()
> > `> cpus_read_lock()
> >
> > So finally there have called cpus_read_lock() twice, and kernel report
> > log as below. So I am not sure what's the best way to fix this issue,
> > could you give some suggestion for this? Thanks.
>
> [...]
>
> Thanks for this. Unfortunately, there is no easy fix for this.
> Can you give the patch below a go and let us know if that solves
> the issue you observed? I only tested in on a model...
>
> Should this be considered an acceptable solution, I'll split that
> into individual patches and repost it as a proper series.

Thanks, Marc.

I confirm below patch can fix the booting failure issue on Hikey960;
after generate formal patch set, also welcome to send me for testing.

> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
> index aae87c4c546e..44232f378543 100644
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -455,7 +455,11 @@ void arch_timer_enable_workaround(const struct arch_timer_erratum_workaround *wa
> per_cpu(timer_unstable_counter_workaround, i) = wa;
> }
>
> - static_branch_enable(&arch_timer_read_ool_enabled);
> + /*
> + * Use the _locked version, as we're called from the CPU
> + * hotplug framework. Otherwise, we end-up in deadlock-land.
> + */
> + static_branch_enable_locked(&arch_timer_read_ool_enabled);
>
> /*
> * Don't use the vdso fastpath if errata require using the
> diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
> index 2afd74b9d844..5cfbf9ff3fe8 100644
> --- a/include/linux/jump_label.h
> +++ b/include/linux/jump_label.h
> @@ -159,10 +159,14 @@ extern void arch_jump_label_transform_static(struct jump_entry *entry,
> extern int jump_label_text_reserved(void *start, void *end);
> extern void static_key_slow_inc(struct static_key *key);
> extern void static_key_slow_dec(struct static_key *key);
> +extern void static_key_slow_inc_locked(struct static_key *key);
> +extern void static_key_slow_dec_locked(struct static_key *key);
> extern void jump_label_apply_nops(struct module *mod);
> extern int static_key_count(struct static_key *key);
> extern void static_key_enable(struct static_key *key);
> extern void static_key_disable(struct static_key *key);
> +extern void static_key_enable_locked(struct static_key *key);
> +extern void static_key_disable_locked(struct static_key *key);
>
> /*
> * We should be using ATOMIC_INIT() for initializing .enabled, but
> @@ -213,12 +217,16 @@ static inline void static_key_slow_inc(struct static_key *key)
> atomic_inc(&key->enabled);
> }
>
> +#define static_key_slow_inc_locked(k) static_key_slow_inc((k))
> +
> static inline void static_key_slow_dec(struct static_key *key)
> {
> STATIC_KEY_CHECK_USE();
> atomic_dec(&key->enabled);
> }
>
> +#define static_key_slow_dec_locked(k) static_key_slow_inc((k))
> +
> static inline int jump_label_text_reserved(void *start, void *end)
> {
> return 0;
> @@ -415,6 +423,8 @@ extern bool ____wrong_branch_error(void);
>
> #define static_branch_enable(x) static_key_enable(&(x)->key)
> #define static_branch_disable(x) static_key_disable(&(x)->key)
> +#define static_branch_enable_locked(x) static_key_enable_locked(&(x)->key)
> +#define static_branch_disable_locked(x) static_key_disable_locked(&(x)->key)
>
> #endif /* __ASSEMBLY__ */
>
> diff --git a/kernel/jump_label.c b/kernel/jump_label.c
> index d11c506a6ac3..f543f765a738 100644
> --- a/kernel/jump_label.c
> +++ b/kernel/jump_label.c
> @@ -57,6 +57,11 @@ jump_label_sort_entries(struct jump_entry *start, struct jump_entry *stop)
> }
>
> static void jump_label_update(struct static_key *key);
> +static void static_key_slow_inc_with_lock(struct static_key *key, bool lock);
> +static void __static_key_slow_dec_with_lock(struct static_key *key,
> + bool lock,
> + unsigned long rate_limit,
> + struct delayed_work *work);
>
> /*
> * There are similar definitions for the !HAVE_JUMP_LABEL case in jump_label.h.
> @@ -79,29 +84,53 @@ int static_key_count(struct static_key *key)
> }
> EXPORT_SYMBOL_GPL(static_key_count);
>
> -void static_key_enable(struct static_key *key)
> +static void static_key_enable_with_lock(struct static_key *key, bool lock)
> {
> int count = static_key_count(key);
>
> WARN_ON_ONCE(count < 0 || count > 1);
>
> if (!count)
> - static_key_slow_inc(key);
> + static_key_slow_inc_with_lock(key, lock);
> +}
> +
> +void static_key_enable(struct static_key *key)
> +{
> + static_key_enable_with_lock(key, true);
> }
> EXPORT_SYMBOL_GPL(static_key_enable);
>
> -void static_key_disable(struct static_key *key)
> +void static_key_enable_locked(struct static_key *key)
> +{
> + lockdep_assert_cpus_held();
> + static_key_enable_with_lock(key, false);
> +}
> +EXPORT_SYMBOL_GPL(static_key_enable_locked);
> +
> +static void static_key_disable_with_lock(struct static_key *key, bool lock)
> {
> int count = static_key_count(key);
>
> WARN_ON_ONCE(count < 0 || count > 1);
>
> if (count)
> - static_key_slow_dec(key);
> + __static_key_slow_dec_with_lock(key, lock, 0, NULL);
> +}
> +
> +void static_key_disable(struct static_key *key)
> +{
> + static_key_disable_with_lock(key, true);
> }
> EXPORT_SYMBOL_GPL(static_key_disable);
>
> -void static_key_slow_inc(struct static_key *key)
> +void static_key_disable_locked(struct static_key *key)
> +{
> + lockdep_assert_cpus_held();
> + static_key_disable_with_lock(key, false);
> +}
> +EXPORT_SYMBOL_GPL(static_key_disable_locked);
> +
> +static void static_key_slow_inc_with_lock(struct static_key *key, bool lock)
> {
> int v, v1;
>
> @@ -125,7 +154,8 @@ void static_key_slow_inc(struct static_key *key)
> return;
> }
>
> - cpus_read_lock();
> + if (lock)
> + cpus_read_lock();
> jump_label_lock();
> if (atomic_read(&key->enabled) == 0) {
> atomic_set(&key->enabled, -1);
> @@ -135,14 +165,30 @@ void static_key_slow_inc(struct static_key *key)
> atomic_inc(&key->enabled);
> }
> jump_label_unlock();
> - cpus_read_unlock();
> + if (lock)
> + cpus_read_unlock();
> +}
> +
> +void static_key_slow_inc(struct static_key *key)
> +{
> + static_key_slow_inc_with_lock(key, true);
> }
> EXPORT_SYMBOL_GPL(static_key_slow_inc);
>
> -static void __static_key_slow_dec(struct static_key *key,
> - unsigned long rate_limit, struct delayed_work *work)
> +void static_key_slow_inc_locked(struct static_key *key)
> {
> - cpus_read_lock();
> + lockdep_assert_cpus_held();
> + static_key_slow_inc_with_lock(key, false);
> +}
> +EXPORT_SYMBOL_GPL(static_key_slow_inc_locked);
> +
> +static void __static_key_slow_dec_with_lock(struct static_key *key,
> + bool lock,
> + unsigned long rate_limit,
> + struct delayed_work *work)
> +{
> + if (lock)
> + cpus_read_lock();
> /*
> * The negative count check is valid even when a negative
> * key->enabled is in use by static_key_slow_inc(); a
> @@ -153,7 +199,8 @@ static void __static_key_slow_dec(struct static_key *key,
> if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) {
> WARN(atomic_read(&key->enabled) < 0,
> "jump label: negative count!\n");
> - cpus_read_unlock();
> + if (lock)
> + cpus_read_unlock();
> return;
> }
>
> @@ -164,27 +211,36 @@ static void __static_key_slow_dec(struct static_key *key,
> jump_label_update(key);
> }
> jump_label_unlock();
> - cpus_read_unlock();
> + if (lock)
> + cpus_read_unlock();
> }
>
> static void jump_label_update_timeout(struct work_struct *work)
> {
> struct static_key_deferred *key =
> container_of(work, struct static_key_deferred, work.work);
> - __static_key_slow_dec(&key->key, 0, NULL);
> + __static_key_slow_dec_with_lock(&key->key, true, 0, NULL);
> }
>
> void static_key_slow_dec(struct static_key *key)
> {
> STATIC_KEY_CHECK_USE();
> - __static_key_slow_dec(key, 0, NULL);
> + __static_key_slow_dec_with_lock(key, true, 0, NULL);
> }
> EXPORT_SYMBOL_GPL(static_key_slow_dec);
>
> +void static_key_slow_dec_locked(struct static_key *key)
> +{
> + lockdep_assert_cpus_held();
> + STATIC_KEY_CHECK_USE();
> + __static_key_slow_dec_with_lock(key, false, 0, NULL);
> +}
> +EXPORT_SYMBOL_GPL(static_key_slow_dec_locked);
> +
> void static_key_slow_dec_deferred(struct static_key_deferred *key)
> {
> STATIC_KEY_CHECK_USE();
> - __static_key_slow_dec(&key->key, key->timeout, &key->work);
> + __static_key_slow_dec_with_lock(&key->key, true, key->timeout, &key->work);
> }
> EXPORT_SYMBOL_GPL(static_key_slow_dec_deferred);
>
>
> --
> Jazz is not dead. It just smells funny...

2017-07-27 07:44:52

by Marc Zyngier

[permalink] [raw]
Subject: Re: ARM64 board Hikey960 boot failure due to f2545b2d4ce1 (jump_label: Reorder hotplug lock and jump_label_lock)

On 27/07/17 03:08, Leo Yan wrote:
> On Wed, Jul 26, 2017 at 04:13:49PM +0100, Marc Zyngier wrote:
>> [+Mark]
>>
>> Hi Leo,
>>
>> On 24/07/17 15:34, Leo Yan wrote:
>>> Hi all,
>>>
>>> We found the mainline arm64 kernel boot failure on Hikey960 board,
>>> this is caused by patch f2545b2d4ce1 (jump_label: Reorder hotplug lock
>>> and jump_label_lock), this patch adds locking cpus_read_lock() in
>>> function static_key_slow_inc() and introduce the dead lock issue by
>>> acquiring lock twice. Below are detailed flow:
>>>
>>> arch_timer_register()
>>> `> cpuhp_setup_state()
>>> `> __cpuhp_setup_state()
>>> cpus_read_lock()
>>> `> __cpuhp_setup_state_cpuslocked()
>>> `> cpuhp_issue_call()
>>> `> arch_timer_starting_cpu()
>>> `> __arch_timer_setup()
>>> `> arch_timer_check_ool_workaround()
>>> `> arch_timer_enable_workaround()
>>> `> static_branch_enable()
>>> `> static_key_enable()
>>> `> static_key_slow_inc()
>>> `> cpus_read_lock()
>>>
>>> So finally there have called cpus_read_lock() twice, and kernel report
>>> log as below. So I am not sure what's the best way to fix this issue,
>>> could you give some suggestion for this? Thanks.
>>
>> [...]
>>
>> Thanks for this. Unfortunately, there is no easy fix for this.
>> Can you give the patch below a go and let us know if that solves
>> the issue you observed? I only tested in on a model...
>>
>> Should this be considered an acceptable solution, I'll split that
>> into individual patches and repost it as a proper series.
>
> Thanks, Marc.
>
> I confirm below patch can fix the booting failure issue on Hikey960;
> after generate formal patch set, also welcome to send me for testing.

Thanks for testing this. There is a couple of issues in this patch
which I'm ironing out at the moment.

It turns out that the above call stack is only one part of the problem.
The other part is on the secondary boot path, where the CPU is not yet
in a context where we can take the rwsem:

[ 1.151153] [<ffff000008089de8>] dump_backtrace+0x0/0x278
[ 1.151153] [<ffff00000808a144>] show_stack+0x24/0x30
[ 1.151153] [<ffff000008c22d8c>] dump_stack+0x8c/0xb0
[ 1.151253] [<ffff000008106010>] dequeue_task_idle+0x30/0x48
[ 1.151253] [<ffff0000080fed80>] deactivate_task+0xa8/0xf0
[ 1.151384] [<ffff000008c3935c>] __schedule+0x41c/0x8e0
[ 1.151432] [<ffff000008c39854>] schedule+0x34/0x98
[ 1.151466] [<ffff000008c3cd5c>] rwsem_down_read_failed+0xcc/0x110
[ 1.151466] [<ffff0000081249c4>] __percpu_down_read+0xe4/0x110
[ 1.151573] [<ffff0000080d33b8>] cpus_read_lock+0x70/0xa0
[ 1.151630] [<ffff0000081de864>] static_key_slow_inc_with_lock+0x14c/0x150
[ 1.151679] [<ffff0000081de8a4>] static_key_enable_with_lock+0x3c/0x58
[ 1.151753] [<ffff0000081de8e4>] static_key_enable+0x24/0x30
[ 1.151794] [<ffff000008a59364>] arch_timer_check_ool_workaround+0x204/0x248
[ 1.151853] [<ffff000008a596f8>] arch_timer_starting_cpu+0xe0/0x2b0
[ 1.151893] [<ffff0000080d2828>] cpuhp_invoke_callback+0x98/0x5c8
[ 1.151958] [<ffff0000080d4af8>] notify_cpu_starting+0x78/0x98
[ 1.152006] [<ffff000008090810>] secondary_start_kernel+0xb8/0x120
[ 1.152040] [<0000000080c441b4>] 0x80c441b4

I'll cc you on the updated patches.

Thanks,

M.
--
Jazz is not dead. It just smells funny...