2020-07-21 08:33:19

by Naresh Kamboju

[permalink] [raw]
Subject: stable-rc 4.14: arm64: Internal error: Oops: clk_reparent __clk_set_parent_before on db410c

Kernel Internal oops while booting stable-rc 4.14 kernel on qcom db410c device
this problem happened only once on this specific platform.
and rcu_preempt detected stalls on CPUs/tasks detected after this and
board hung.

metadata:
git branch: linux-4.14.y
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
git commit: 5b1e982af0f810358664827a6333affb4f5d8eb5
git describe: v4.14.188-126-g5b1e982af0f8
make_kernelversion: 4.14.189-rc1
kernel-config:
https://builds.tuxbuild.com/12PM71zBW-5EAp5ztC_yxg/kernel.config

Crash dump:

[ 5.424958] Unable to handle kernel paging request at virtual
address 00001008
[ 5.435485] Mem abort info:
[ 5.442509] Exception class = DABT (current EL), IL = 32 bits
[ 5.445203] SET = 0, FnV = 0
[[ 5.451101] EA = 0, S1PTW =[ 5.454226] Data abort info:
[ 5.457264] ISV = 0, ISS = 0x00000044
[ 5.460390] CM = 0, WnR = 1
[ 5.463951] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80003d66d000
[ 5.467078] [0000000000001008] *pgd=0000000000000000
[ 5.473503] Internal error: Oops: 96000044 [#1] PREEMPT SMP
[ 5.479838] Modules linked in: adv7511 msm mdt_loader msm_rng
drm_kms_helper rng_core drm fuse
[ 5.485405] Process kworker/2:0 (pid: 21, stack limit = 0xffff000009450000)
[ 5.494090] CPU: 2 PID: 21 Comm: kworker/2:0 Not tainted 4.14.189-rc1 #1
[ 5.501036] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
[ 5.507996] Workqueue: events deferred_probe_work_func
[ 5.514935] task: ffff80003d454380 task.stack: ffff000009450000
[ 5.520155] pc : clk_reparent+0x60/0xe8
[ 5.526058] lr : __clk_set_parent_before+0x40/0x88
[ 5.529882] sp : ffff000009453640 pstate : 800001c5
[ 5.534748] x29: ffff000009453640 x28: ffff0000090b7000
[ 5.539615] x27: ffff80003fe7c478 x26: ffff0000094537a8
[ 5.545175] x25: 0000000000000001 x24: ffff000009239038
[ 5.550736] x23: ffff80003b6be688 x22: 0000000000000000
[ 5.556297] x21: 0000000000000000 x20: ffff80003c9d8c00
[ 5.561858] x19: ffff80003d798900 x18: 00000000fffffffe
[ 5.567419] x17: 0000ffff7fdbb6a0 x16: ffff00000821ad98
[ 5.572980] x15: 0000000000000001 x14: ffffffffffffffff
[ 5.578540] x13: ffff0000094537c8 x12: 0000000000000010
[ 5.584102] x11: 0000000000000010 x10: 0101010101010101
[ 5.589663] x9 : 0000000000000000 x8 : 7f7f7f7f7f7f7f7f
[ 5.595223] x7 : fefefefefeff6e77 x6 : 0000000000000140
[ 5.600784] x5 : 0000000000000001 x4 : ffff80003c9d8c00
[ 5.606344] x3 : ffff80003d798900 x2 : 0000000000000004
[ 5.611905] x1 : ffff80003d7989a8 x0 : 0000000000001000
[ 5.617467] Call trace:
[ 5.623030] clk_reparent+0x60/0xe8
[ 5.625465] __clk_set_parent_before+0x40/0x88
[ 5.628943] clk_register+0x330/0x618
[ 5.633668] pll_28nm_register+0xa4/0x340 [msm]
[ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm]
[ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm]
[ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm]
[ 5.651196] platform_drv_probe+0x58/0xb8
[ 5.656060] driver_probe_device+0x228/0x2d0
[ 5.660231] __device_attach_driver+0xb8/0xe8
[ 5.664750] bus_for_each_drv+0x64/0xa0
[ 5.669269] __device_attach+0xcc/0x138
[ 5.673093] device_initial_probe+0x10/0x18
[ 5.676918] bus_probe_device+0x90/0x98
[ 5.681088] device_add+0x3c4/0x5a8
[ 5.684915] of_device_add+0x40/0x58
[ 5.688392] of_platform_device_create_pdata+0x80/0xe8
[ 5.692219] of_platform_bus_create+0xd4/0x308
[ 5.697432] of_platform_populate+0x48/0xb8
[ 5.702143] msm_pdev_probe+0x3c/0x328 [msm]
[ 5.706125] platform_drv_probe+0x58/0xb8
[ 5.710642] driver_probe_device+0x228/0x2d0
[ 5.714814] __device_attach_driver+0xb8/0xe8
[ 5.719334] bus_for_each_drv+0x64/0xa0
[ 5.723852] __device_attach+0xcc/0x138
[ 5.727677] device_initial_probe+0x10/0x18
[ 5.731502] bus_probe_device+0x90/0x98
[ 5.735675] deferred_probe_work_func+0xa4/0x140
[ 5.739502] process_one_work+0x19c/0x300
[ 5.744366] worker_thread+0x4c/0x420
[ 5.748539] kthread+0x100/0x130
[ 5.752362] ret_from_fork+0x10/0x1c
[ 5.755842] Code: 54000260 f9405080 f9005460 b4000040 (f9000401)
[ 5.759669] ---[ end trace 6d70d7dd8a236384 ]---
[ 5.765922] note: kworker/2:0[21] exited with preempt_count 1
[ 26.777168] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 26.777204] 0-...: (1 GPs behind) idle=2fa/140000000000001/0
softirq=1679/1723 fqs=2625
[ 26.783112] 1-...: (1 GPs behind) idle=53a/140000000000000/0
softirq=1946/1958 fqs=2625
[ 26.791444] (detected by 3, t=5252 jiffies, g=58, c=57, q=2362)
[ 26.799781] Task dump for CPU 0:
[ 26.806033] systemd-udevd R running task 0 2533 1 0x00000202
[ 26.809515] Call trace:
[ 26.816814] __switch_to+0xe8/0x148
[ 26.819248] __wake_up_common+0x80/0x170
[ 26.822724] __wake_up_common_lock+0x7c/0xa8
[ 26.826897] __wake_up_sync_key+0x1c/0x28
[ 26.831416] sock_def_readable+0x40/0x88
[ 26.835584] Task dump for CPU 1:
[ 26.839755] systemd R running task 0 1 0 0x0000000a
[ 26.843237] Call trace:
[ 26.850531] __switch_to+0xe8/0x148
[ 26.852968] 0xffff800009fc8000
[ 26.856467] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 26.859584] 0-...: (1 GPs behind) idle=2fa/140000000000001/0
softirq=1722/1723 fqs=2625
[ 26.865145] 1-...: (1 GPs behind) idle=53a/140000000000000/0
softirq=1957/1958 fqs=2625
[ 26.873477] (detected by 3, t=5270 jiffies, g=-150, c=-151, q=3)
[ 26.881815] Task dump for CPU 0:
[ 26.888067] systemd-udevd R running task 0 2533 1 0x00000202
[ 26.891549] Call trace:
[ 26.898845] __switch_to+0xe8/0x148
[ 26.901281] __wake_up_common+0x80/0x170
[ 26.904758] __wake_up_common_lock+0x7c/0xa8
[ 26.908931] __wake_up_sync_key+0x1c/0x28
[ 26.913449] sock_def_readable+0x40/0x88
[ 26.917618] Task dump for CPU 1:
[ 26.921790] systemd R running task 0 1 0 0x0000000a
[ 26.925271] Call trace:
[ 26.932565] __switch_to+0xe8/0x148
[ 26.935002] 0xffff800009fc8000
[ 41.449201] random: crng init done
[ 41.449221] random: 7 urandom warning(s) missed due to ratelimiting
[ 89.797164] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 89.797195] 0-...: (1 GPs behind) idle=2fa/140000000000001/0
softirq=1679/1723 fqs=10482
[ 89.803103] 1-...: (1 GPs behind) idle=53a/140000000000000/0
softirq=1946/1958 fqs=10482
[ 89.811437] (detected by 3, t=21007 jiffies, g=58, c=57, q=2578)
[ 89.819773] Task dump for CPU 0:
[ 89.826027] systemd-udevd R running task 0 2533 1 0x00000202
[ 89.829508] Call trace:
[ 89.836807] __switch_to+0xe8/0x148
[ 89.839241] __wake_up_common+0x80/0x170
[ 89.842717] __wake_up_common_lock+0x7c/0xa8
[ 89.846891] __wake_up_sync_key+0x1c/0x28
[ 89.851410] sock_def_readable+0x40/0x88
<System Hung>

Full test log,
https://qa-reports.linaro.org/lkft/linux-stable-rc-4.14-oe/build/v4.14.188-126-g5b1e982af0f8/testrun/2969667/suite/linux-log-parser/test/check-kernel-oops-1592908/log


--
Linaro LKFT
https://lkft.linaro.org


2020-07-21 09:54:30

by Arnd Bergmann

[permalink] [raw]
Subject: Re: stable-rc 4.14: arm64: Internal error: Oops: clk_reparent __clk_set_parent_before on db410c

On Tue, Jul 21, 2020 at 10:32 AM Naresh Kamboju
<[email protected]> wrote:
>
> Kernel Internal oops while booting stable-rc 4.14 kernel on qcom db410c device
> this problem happened only once on this specific platform.
> and rcu_preempt detected stalls on CPUs/tasks detected after this and
> board hung.
>
> metadata:
> git branch: linux-4.14.y
> git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> git commit: 5b1e982af0f810358664827a6333affb4f5d8eb5
> git describe: v4.14.188-126-g5b1e982af0f8
> make_kernelversion: 4.14.189-rc1
> kernel-config:
> https://builds.tuxbuild.com/12PM71zBW-5EAp5ztC_yxg/kernel.config

I looked at the object file at
https://builds.tuxbuild.com/12PM71zBW-5EAp5ztC_yxg/vmlinux.xz

> Crash dump:
>
> [ 5.424958] Unable to handle kernel paging request at virtual
> address 00001008
> [ 5.435485] Mem abort info:
> [ 5.442509] Exception class = DABT (current EL), IL = 32 bits
> [ 5.445203] SET = 0, FnV = 0
> [[ 5.451101] EA = 0, S1PTW =[ 5.454226] Data abort info:
> [ 5.457264] ISV = 0, ISS = 0x00000044
> [ 5.460390] CM = 0, WnR = 1
> [ 5.463951] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80003d66d000
> [ 5.467078] [0000000000001008] *pgd=0000000000000000
> [ 5.473503] Internal error: Oops: 96000044 [#1] PREEMPT SMP
> [ 5.479838] Modules linked in: adv7511 msm mdt_loader msm_rng
> drm_kms_helper rng_core drm fuse
> [ 5.485405] Process kworker/2:0 (pid: 21, stack limit = 0xffff000009450000)
> [ 5.494090] CPU: 2 PID: 21 Comm: kworker/2:0 Not tainted 4.14.189-rc1 #1
> [ 5.501036] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> [ 5.507996] Workqueue: events deferred_probe_work_func
> [ 5.514935] task: ffff80003d454380 task.stack: ffff000009450000
> [ 5.520155] pc : clk_reparent+0x60/0xe8
> [ 5.526058] lr : __clk_set_parent_before+0x40/0x88
> [ 5.529882] sp : ffff000009453640 pstate : 800001c5
> [ 5.534748] x29: ffff000009453640 x28: ffff0000090b7000
> [ 5.539615] x27: ffff80003fe7c478 x26: ffff0000094537a8
> [ 5.545175] x25: 0000000000000001 x24: ffff000009239038
> [ 5.550736] x23: ffff80003b6be688 x22: 0000000000000000
> [ 5.556297] x21: 0000000000000000 x20: ffff80003c9d8c00
> [ 5.561858] x19: ffff80003d798900 x18: 00000000fffffffe
> [ 5.567419] x17: 0000ffff7fdbb6a0 x16: ffff00000821ad98
> [ 5.572980] x15: 0000000000000001 x14: ffffffffffffffff
> [ 5.578540] x13: ffff0000094537c8 x12: 0000000000000010
> [ 5.584102] x11: 0000000000000010 x10: 0101010101010101
> [ 5.589663] x9 : 0000000000000000 x8 : 7f7f7f7f7f7f7f7f
> [ 5.595223] x7 : fefefefefeff6e77 x6 : 0000000000000140
> [ 5.600784] x5 : 0000000000000001 x4 : ffff80003c9d8c00
> [ 5.606344] x3 : ffff80003d798900 x2 : 0000000000000004
> [ 5.611905] x1 : ffff80003d7989a8 x0 : 0000000000001000
> [ 5.617467] Call trace:
> [ 5.623030] clk_reparent+0x60/0xe8

We are here:

static void clk_reparent(struct clk_core *core, struct clk_core *new_parent)
{
bool was_orphan = core->orphan;

hlist_del(&core->child_node);

if (new_parent) {
bool becomes_orphan = new_parent->orphan;

/* avoid duplicate POST_RATE_CHANGE notifications */
if (new_parent->new_child == core)
new_parent->new_child = NULL;

hlist_add_head(&core->child_node, &new_parent->children);


and "new_parent->children.first" points to 0x1000, which is not
a valid pointer, so the "first->pprev = &n->next" in hlist_add_head()
causes a crash.

> [ 5.625465] __clk_set_parent_before+0x40/0x88
> [ 5.628943] clk_register+0x330/0x618

(speculation) the most likely call chain here would be __clk_core_init(),
implying that there is some random other orphan clock that has a
broken parent:

/*
* walk the list of orphan clocks and reparent any that newly finds a
* parent.
*/
hlist_for_each_entry_safe(orphan, tmp2, &clk_orphan_list, child_node) {
struct clk_core *parent = __clk_init_parent(orphan);

/*
* We need to use __clk_set_parent_before() and _after() to
* to properly migrate any prepare/enable count of the orphan
* clock. This is important for CLK_IS_CRITICAL clocks, which
* are enabled during init but might not have a parent yet.
*/
if (parent) {
/* update the clk tree topology */
__clk_set_parent_before(orphan, parent);

None of the above have changed in stable kernels.

> [ 5.633668] pll_28nm_register+0xa4/0x340 [msm]
> [ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm]
> [ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm]
> [ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm]

The only changes to the dsi driver in v4.14-stable were:

89e30bb46074 drm/msm/dsi: save pll state before dsi host is powered off
892afde0f4a1 drm: msm: Fix return type of dsi_mgr_connector_mode_valid for kCFI
35ff594b0da2 drm/msm/dsi: Implement reset correctly
5151a0c8d730 drm/msm/dsi: use correct enum in dsi_get_cmd_fmt
e6bc3a4b0c23 clk: divider: fix incorrect usage of container_of

None of these look suspicious to me.

Arnd

> [ 5.651196] platform_drv_probe+0x58/0xb8
> [ 5.656060] driver_probe_device+0x228/0x2d0
> [ 5.660231] __device_attach_driver+0xb8/0xe8
> [ 5.664750] bus_for_each_drv+0x64/0xa0
> [ 5.669269] __device_attach+0xcc/0x138
> [ 5.673093] device_initial_probe+0x10/0x18
> [ 5.676918] bus_probe_device+0x90/0x98
> [ 5.681088] device_add+0x3c4/0x5a8
> [ 5.684915] of_device_add+0x40/0x58
> [ 5.688392] of_platform_device_create_pdata+0x80/0xe8
> [ 5.692219] of_platform_bus_create+0xd4/0x308
> [ 5.697432] of_platform_populate+0x48/0xb8
> [ 5.702143] msm_pdev_probe+0x3c/0x328 [msm]
> [ 5.706125] platform_drv_probe+0x58/0xb8
> [ 5.710642] driver_probe_device+0x228/0x2d0
> [ 5.714814] __device_attach_driver+0xb8/0xe8
> [ 5.719334] bus_for_each_drv+0x64/0xa0
> [ 5.723852] __device_attach+0xcc/0x138
> [ 5.727677] device_initial_probe+0x10/0x18
> [ 5.731502] bus_probe_device+0x90/0x98
> [ 5.735675] deferred_probe_work_func+0xa4/0x140

It's worth noting that this is a deferred probe, so maybe
there is a race in the clk registration that gets hit if two drivers
register a clock at the same time?

> [ 5.739502] process_one_work+0x19c/0x300
> [ 5.744366] worker_thread+0x4c/0x420
> [ 5.748539] kthread+0x100/0x130
> [ 5.752362] ret_from_fork+0x10/0x1c
> [ 5.755842] Code: 54000260 f9405080 f9005460 b4000040 (f9000401)
> [ 5.759669] ---[ end trace 6d70d7dd8a236384 ]---
> [ 5.765922] note: kworker/2:0[21] exited with preempt_count 1
> [ 26.777168] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 26.777204] 0-...: (1 GPs behind) idle=2fa/140000000000001/0
> softirq=1679/1723 fqs=2625
> [ 26.783112] 1-...: (1 GPs behind) idle=53a/140000000000000/0
> softirq=1946/1958 fqs=2625
> [ 26.791444] (detected by 3, t=5252 jiffies, g=58, c=57, q=2362)
> [ 26.799781] Task dump for CPU 0:
> [ 26.806033] systemd-udevd R running task 0 2533 1 0x00000202
> [ 26.809515] Call trace:
> [ 26.816814] __switch_to+0xe8/0x148
> [ 26.819248] __wake_up_common+0x80/0x170
> [ 26.822724] __wake_up_common_lock+0x7c/0xa8
> [ 26.826897] __wake_up_sync_key+0x1c/0x28
> [ 26.831416] sock_def_readable+0x40/0x88
> [ 26.835584] Task dump for CPU 1:
> [ 26.839755] systemd R running task 0 1 0 0x0000000a
> [ 26.843237] Call trace:
> [ 26.850531] __switch_to+0xe8/0x148
> [ 26.852968] 0xffff800009fc8000
> [ 26.856467] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 26.859584] 0-...: (1 GPs behind) idle=2fa/140000000000001/0
> softirq=1722/1723 fqs=2625
> [ 26.865145] 1-...: (1 GPs behind) idle=53a/140000000000000/0
> softirq=1957/1958 fqs=2625
> [ 26.873477] (detected by 3, t=5270 jiffies, g=-150, c=-151, q=3)
> [ 26.881815] Task dump for CPU 0:
> [ 26.888067] systemd-udevd R running task 0 2533 1 0x00000202
> [ 26.891549] Call trace:
> [ 26.898845] __switch_to+0xe8/0x148
> [ 26.901281] __wake_up_common+0x80/0x170
> [ 26.904758] __wake_up_common_lock+0x7c/0xa8
> [ 26.908931] __wake_up_sync_key+0x1c/0x28
> [ 26.913449] sock_def_readable+0x40/0x88
> [ 26.917618] Task dump for CPU 1:
> [ 26.921790] systemd R running task 0 1 0 0x0000000a
> [ 26.925271] Call trace:
> [ 26.932565] __switch_to+0xe8/0x148
> [ 26.935002] 0xffff800009fc8000
> [ 41.449201] random: crng init done
> [ 41.449221] random: 7 urandom warning(s) missed due to ratelimiting
> [ 89.797164] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 89.797195] 0-...: (1 GPs behind) idle=2fa/140000000000001/0
> softirq=1679/1723 fqs=10482
> [ 89.803103] 1-...: (1 GPs behind) idle=53a/140000000000000/0
> softirq=1946/1958 fqs=10482
> [ 89.811437] (detected by 3, t=21007 jiffies, g=58, c=57, q=2578)
> [ 89.819773] Task dump for CPU 0:
> [ 89.826027] systemd-udevd R running task 0 2533 1 0x00000202
> [ 89.829508] Call trace:
> [ 89.836807] __switch_to+0xe8/0x148
> [ 89.839241] __wake_up_common+0x80/0x170
> [ 89.842717] __wake_up_common_lock+0x7c/0xa8
> [ 89.846891] __wake_up_sync_key+0x1c/0x28
> [ 89.851410] sock_def_readable+0x40/0x88
> <System Hung>
>
> Full test log,
> https://qa-reports.linaro.org/lkft/linux-stable-rc-4.14-oe/build/v4.14.188-126-g5b1e982af0f8/testrun/2969667/suite/linux-log-parser/test/check-kernel-oops-1592908/log
>
>
> --
> Linaro LKFT
> https://lkft.linaro.org

2020-07-23 08:07:22

by Stephen Boyd

[permalink] [raw]
Subject: Re: stable-rc 4.14: arm64: Internal error: Oops: clk_reparent __clk_set_parent_before on db410c

Quoting Arnd Bergmann (2020-07-21 02:51:32)
> __clk_set_parent_before(orphan, parent);
>
> None of the above have changed in stable kernels.
>
> > [ 5.633668] pll_28nm_register+0xa4/0x340 [msm]
> > [ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm]
> > [ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm]
> > [ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm]
>
> The only changes to the dsi driver in v4.14-stable were:
>
> 89e30bb46074 drm/msm/dsi: save pll state before dsi host is powered off
> 892afde0f4a1 drm: msm: Fix return type of dsi_mgr_connector_mode_valid for kCFI
> 35ff594b0da2 drm/msm/dsi: Implement reset correctly
> 5151a0c8d730 drm/msm/dsi: use correct enum in dsi_get_cmd_fmt
> e6bc3a4b0c23 clk: divider: fix incorrect usage of container_of
>
> None of these look suspicious to me.
>

It sounds like maybe you need this patch?

bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")

or

4368a1539c6b ("drm/msm: Depopulate platform on probe failure")

I vaguelly recall that the display driver wasn't removing clks becaues
it wasn't removing devices when probe defer happened and then we had
dangling clks in the parent cache confusing things.

2020-07-23 10:11:25

by Naresh Kamboju

[permalink] [raw]
Subject: Re: stable-rc 4.14: arm64: Internal error: Oops: clk_reparent __clk_set_parent_before on db410c

On Thu, 23 Jul 2020 at 13:36, Stephen Boyd <[email protected]> wrote:
>
> Quoting Arnd Bergmann (2020-07-21 02:51:32)
> > __clk_set_parent_before(orphan, parent);
> >
> > None of the above have changed in stable kernels.
> >
> > > [ 5.633668] pll_28nm_register+0xa4/0x340 [msm]
> > > [ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm]
> > > [ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm]
> > > [ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm]
> >
> > The only changes to the dsi driver in v4.14-stable were:
> >
> > 89e30bb46074 drm/msm/dsi: save pll state before dsi host is powered off
> > 892afde0f4a1 drm: msm: Fix return type of dsi_mgr_connector_mode_valid for kCFI
> > 35ff594b0da2 drm/msm/dsi: Implement reset correctly
> > 5151a0c8d730 drm/msm/dsi: use correct enum in dsi_get_cmd_fmt
> > e6bc3a4b0c23 clk: divider: fix incorrect usage of container_of
> >
> > None of these look suspicious to me.
> >
>
> It sounds like maybe you need this patch?
>
> bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")

Cherry-pick did not work on stable-rc 4.14
this patch might need backporting.
I am not sure.

>
> or
>
> 4368a1539c6b ("drm/msm: Depopulate platform on probe failure")

This commit already is in stable-rc 4.14 branch.
drm/msm: Depopulate platform on probe failure

[ Upstream commit 4368a1539c6b41ac3cddc06f5a5117952998804c ]

> I vaguelly recall that the display driver wasn't removing clks becaues
> it wasn't removing devices when probe defer happened and then we had
> dangling clks in the parent cache confusing things.

Thanks for your email.

- Naresh

2020-07-23 10:26:33

by Stephen Boyd

[permalink] [raw]
Subject: Re: stable-rc 4.14: arm64: Internal error: Oops: clk_reparent __clk_set_parent_before on db410c

Quoting Naresh Kamboju (2020-07-23 03:10:37)
> On Thu, 23 Jul 2020 at 13:36, Stephen Boyd <[email protected]> wrote:
> >
> > Quoting Arnd Bergmann (2020-07-21 02:51:32)
> > > __clk_set_parent_before(orphan, parent);
> > >
> > > None of the above have changed in stable kernels.
> > >
> > > > [ 5.633668] pll_28nm_register+0xa4/0x340 [msm]
> > > > [ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm]
> > > > [ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm]
> > > > [ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm]
> > >
> > > The only changes to the dsi driver in v4.14-stable were:
> > >
> > > 89e30bb46074 drm/msm/dsi: save pll state before dsi host is powered off
> > > 892afde0f4a1 drm: msm: Fix return type of dsi_mgr_connector_mode_valid for kCFI
> > > 35ff594b0da2 drm/msm/dsi: Implement reset correctly
> > > 5151a0c8d730 drm/msm/dsi: use correct enum in dsi_get_cmd_fmt
> > > e6bc3a4b0c23 clk: divider: fix incorrect usage of container_of
> > >
> > > None of these look suspicious to me.
> > >
> >
> > It sounds like maybe you need this patch?
> >
> > bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")
>
> Cherry-pick did not work on stable-rc 4.14
> this patch might need backporting.
> I am not sure.
>

Ok. That commit fixes a regression in the 3.x series of the kernel so it
should go back to any LTS kernels. It looks like at least on 4.14 it's a
trivial conflict. Here's a backport to 4.14

---8<---
From: Stephen Boyd <[email protected]>
Date: Wed, 28 Aug 2019 11:19:59 -0700
Subject: [PATCH] clk: Evict unregistered clks from parent caches

We leave a dangling pointer in each clk_core::parents array that has an
unregistered clk as a potential parent when that clk_core pointer is
freed by clk{_hw}_unregister(). It is impossible for the true parent of
a clk to be set with clk_set_parent() once the dangling pointer is left
in the cache because we compare parent pointers in
clk_fetch_parent_index() instead of checking for a matching clk name or
clk_hw pointer.

Before commit ede77858473a ("clk: Remove global clk traversal on fetch
parent index"), we would check clk_hw pointers, which has a higher
chance of being the same between registration and unregistration, but it
can still be allocated and freed by the clk provider. In fact, this has
been a long standing problem since commit da0f0b2c3ad2 ("clk: Correct
lookup logic in clk_fetch_parent_index()") where we stopped trying to
compare clk names and skipped over entries in the cache that weren't
NULL.

There are good (performance) reasons to not do the global tree lookup in
cases where the cache holds dangling pointers to parents that have been
unregistered. Let's take the performance hit on the uncommon
registration path instead. Loop through all the clk_core::parents arrays
when a clk is unregistered and set the entry to NULL when the parent
cache entry and clk being unregistered are the same pointer. This will
fix this problem and avoid the overhead for the "normal" case.

Based on a patch by Bjorn Andersson.

Fixes: da0f0b2c3ad2 ("clk: Correct lookup logic in clk_fetch_parent_index()")
Reviewed-by: Bjorn Andersson <[email protected]>
Tested-by: Sai Prakash Ranjan <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
drivers/clk/clk.c | 52 +++++++++++++++++++++++++++++++++++++----------
1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index 44b6f23cc851..4289c519af1b 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -39,6 +39,17 @@ static HLIST_HEAD(clk_root_list);
static HLIST_HEAD(clk_orphan_list);
static LIST_HEAD(clk_notifier_list);

+static struct hlist_head *all_lists[] = {
+ &clk_root_list,
+ &clk_orphan_list,
+ NULL,
+};
+
+static struct hlist_head *orphan_list[] = {
+ &clk_orphan_list,
+ NULL,
+};
+
/*** private data structures ***/

struct clk_core {
@@ -1993,17 +2004,6 @@ static int inited = 0;
static DEFINE_MUTEX(clk_debug_lock);
static HLIST_HEAD(clk_debug_list);

-static struct hlist_head *all_lists[] = {
- &clk_root_list,
- &clk_orphan_list,
- NULL,
-};
-
-static struct hlist_head *orphan_list[] = {
- &clk_orphan_list,
- NULL,
-};
-
static void clk_summary_show_one(struct seq_file *s, struct clk_core *c,
int level)
{
@@ -2735,6 +2735,34 @@ static const struct clk_ops clk_nodrv_ops = {
.set_parent = clk_nodrv_set_parent,
};

+static void clk_core_evict_parent_cache_subtree(struct clk_core *root,
+ struct clk_core *target)
+{
+ int i;
+ struct clk_core *child;
+
+ for (i = 0; i < root->num_parents; i++)
+ if (root->parents[i] == target)
+ root->parents[i] = NULL;
+
+ hlist_for_each_entry(child, &root->children, child_node)
+ clk_core_evict_parent_cache_subtree(child, target);
+}
+
+/* Remove this clk from all parent caches */
+static void clk_core_evict_parent_cache(struct clk_core *core)
+{
+ struct hlist_head **lists;
+ struct clk_core *root;
+
+ lockdep_assert_held(&prepare_lock);
+
+ for (lists = all_lists; *lists; lists++)
+ hlist_for_each_entry(root, *lists, child_node)
+ clk_core_evict_parent_cache_subtree(root, core);
+
+}
+
/**
* clk_unregister - unregister a currently registered clock
* @clk: clock to unregister
@@ -2773,6 +2801,8 @@ void clk_unregister(struct clk *clk)
clk_core_set_parent(child, NULL);
}

+ clk_core_evict_parent_cache(clk->core);
+
hlist_del_init(&clk->core->child_node);

if (clk->core->prepare_count)
--
https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git/

2020-08-12 17:45:42

by Stephen Boyd

[permalink] [raw]
Subject: Re: stable-rc 4.14: arm64: Internal error: Oops: clk_reparent __clk_set_parent_before on db410c

Quoting Stephen Boyd (2020-07-23 03:26:02)
> Quoting Naresh Kamboju (2020-07-23 03:10:37)
> > On Thu, 23 Jul 2020 at 13:36, Stephen Boyd <[email protected]> wrote:
> > >
> > > It sounds like maybe you need this patch?
> > >
> > > bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")
> >
> > Cherry-pick did not work on stable-rc 4.14
> > this patch might need backporting.
> > I am not sure.
> >
>
> Ok. That commit fixes a regression in the 3.x series of the kernel so it
> should go back to any LTS kernels. It looks like at least on 4.14 it's a
> trivial conflict. Here's a backport to 4.14

Did this help?

2020-08-13 14:02:21

by Naresh Kamboju

[permalink] [raw]
Subject: Re: stable-rc 4.14: arm64: Internal error: Oops: clk_reparent __clk_set_parent_before on db410c

On Wed, 12 Aug 2020 at 23:14, Stephen Boyd <[email protected]> wrote:
>
> Quoting Stephen Boyd (2020-07-23 03:26:02)
> > Quoting Naresh Kamboju (2020-07-23 03:10:37)
> > > On Thu, 23 Jul 2020 at 13:36, Stephen Boyd <[email protected]> wrote:
> > > >
> > > > It sounds like maybe you need this patch?
> > > >
> > > > bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")
> > >
> > > Cherry-pick did not work on stable-rc 4.14
> > > this patch might need backporting.
> > > I am not sure.
> > >
> >
> > Ok. That commit fixes a regression in the 3.x series of the kernel so it
> > should go back to any LTS kernels. It looks like at least on 4.14 it's a
> > trivial conflict. Here's a backport to 4.14
>
> Did this help?

Thanks for your patch.
Sorry for the late reply.
The reported issue is not always reproducible However,
I have tested this for 100 cycles and did not notice the reported problem.

Tested-by: Naresh Kamboju <[email protected]>

- Naresh

2020-08-24 08:19:23

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: stable-rc 4.14: arm64: Internal error: Oops: clk_reparent __clk_set_parent_before on db410c

On Thu, Aug 13, 2020 at 07:30:40PM +0530, Naresh Kamboju wrote:
> On Wed, 12 Aug 2020 at 23:14, Stephen Boyd <[email protected]> wrote:
> >
> > Quoting Stephen Boyd (2020-07-23 03:26:02)
> > > Quoting Naresh Kamboju (2020-07-23 03:10:37)
> > > > On Thu, 23 Jul 2020 at 13:36, Stephen Boyd <[email protected]> wrote:
> > > > >
> > > > > It sounds like maybe you need this patch?
> > > > >
> > > > > bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")
> > > >
> > > > Cherry-pick did not work on stable-rc 4.14
> > > > this patch might need backporting.
> > > > I am not sure.
> > > >
> > >
> > > Ok. That commit fixes a regression in the 3.x series of the kernel so it
> > > should go back to any LTS kernels. It looks like at least on 4.14 it's a
> > > trivial conflict. Here's a backport to 4.14
> >
> > Did this help?
>
> Thanks for your patch.
> Sorry for the late reply.
> The reported issue is not always reproducible However,
> I have tested this for 100 cycles and did not notice the reported problem.
>
> Tested-by: Naresh Kamboju <[email protected]>

Great, now queued up to 4.14.y and 4.19.y.

greg k-h