2023-08-03 18:42:49

by Rob Clark

Subject: [PATCH] drm/msm/a6xx: Fix GMU lockdep splat

From: Rob Clark <[email protected]>

For normal GPU devfreq, we need to acquire the GMU lock while already
holding devfreq locks. But in the teardown path, we were calling
dev_pm_domain_detach() while already holding the GMU lock, resulting in
this lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
6.4.3-debug+ #3 Not tainted
------------------------------------------------------
ring0/391 is trying to acquire lock:
ffffff80a025c078 (&devfreq->lock){+.+.}-{3:3}, at: qos_notifier_call+0x30/0x74

but task is already holding lock:
ffffff809b8c1ce8 (&(c->notifiers)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x34/0x78

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&(c->notifiers)->rwsem){++++}-{3:3}:
down_write+0x58/0x74
__blocking_notifier_chain_register+0x64/0x84
blocking_notifier_chain_register+0x1c/0x28
freq_qos_add_notifier+0x5c/0x7c
dev_pm_qos_add_notifier+0xd4/0xf0
devfreq_add_device+0x42c/0x560
devm_devfreq_add_device+0x6c/0xb8
msm_devfreq_init+0xa8/0x16c [msm]
msm_gpu_init+0x368/0x54c [msm]
adreno_gpu_init+0x248/0x2b0 [msm]
a6xx_gpu_init+0x2d0/0x384 [msm]
adreno_bind+0x264/0x2bc [msm]
component_bind_all+0x124/0x1f4
msm_drm_bind+0x2d0/0x5f4 [msm]
try_to_bring_up_aggregate_device+0x88/0x1a4
__component_add+0xd4/0x128
component_add+0x1c/0x28
dp_display_probe+0x37c/0x3c0 [msm]
platform_probe+0x70/0xc0
really_probe+0x148/0x280
__driver_probe_device+0xfc/0x114
driver_probe_device+0x44/0x100
__device_attach_driver+0x64/0xdc
bus_for_each_drv+0xb0/0xd8
__device_attach+0xe4/0x140
device_initial_probe+0x1c/0x28
bus_probe_device+0x44/0xb0
deferred_probe_work_func+0xb0/0xc8
process_one_work+0x288/0x3d8
worker_thread+0x1f0/0x260
kthread+0xf0/0x100
ret_from_fork+0x10/0x20

-> #3 (dev_pm_qos_mtx){+.+.}-{3:3}:
__mutex_lock+0xc8/0x388
mutex_lock_nested+0x2c/0x38
dev_pm_qos_remove_notifier+0x3c/0xc8
genpd_remove_device+0x40/0x11c
genpd_dev_pm_detach+0x88/0x130
dev_pm_domain_detach+0x2c/0x3c
a6xx_gmu_remove+0x44/0xdc [msm]
a6xx_destroy+0x7c/0xa4 [msm]
adreno_unbind+0x50/0x64 [msm]
component_unbind+0x44/0x64
component_unbind_all+0xb4/0xbc
msm_drm_uninit.isra.0+0x124/0x17c [msm]
msm_drm_bind+0x340/0x5f4 [msm]
try_to_bring_up_aggregate_device+0x88/0x1a4
__component_add+0xd4/0x128
component_add+0x1c/0x28
dp_display_probe+0x37c/0x3c0 [msm]
platform_probe+0x70/0xc0
really_probe+0x148/0x280
__driver_probe_device+0xfc/0x114
driver_probe_device+0x44/0x100
__device_attach_driver+0x64/0xdc
bus_for_each_drv+0xb0/0xd8
__device_attach+0xe4/0x140
device_initial_probe+0x1c/0x28
bus_probe_device+0x44/0xb0
deferred_probe_work_func+0xb0/0xc8
process_one_work+0x288/0x3d8
worker_thread+0x1f0/0x260
kthread+0xf0/0x100
ret_from_fork+0x10/0x20

-> #2 (&a6xx_gpu->gmu.lock){+.+.}-{3:3}:
__mutex_lock+0xc8/0x388
mutex_lock_nested+0x2c/0x38
a6xx_gpu_set_freq+0x38/0x64 [msm]
msm_devfreq_target+0x170/0x18c [msm]
devfreq_set_target+0x90/0x1e4
devfreq_update_target+0xb4/0xf0
update_devfreq+0x1c/0x28
devfreq_monitor+0x3c/0x10c
process_one_work+0x288/0x3d8
worker_thread+0x1f0/0x260
kthread+0xf0/0x100
ret_from_fork+0x10/0x20

-> #1 (&df->lock){+.+.}-{3:3}:
__mutex_lock+0xc8/0x388
mutex_lock_nested+0x2c/0x38
msm_devfreq_get_dev_status+0x4c/0x104 [msm]
devfreq_simple_ondemand_func+0x5c/0x128
devfreq_update_target+0x68/0xf0
update_devfreq+0x1c/0x28
devfreq_monitor+0x3c/0x10c
process_one_work+0x288/0x3d8
worker_thread+0x1f0/0x260
kthread+0xf0/0x100
ret_from_fork+0x10/0x20

-> #0 (&devfreq->lock){+.+.}-{3:3}:
__lock_acquire+0xdf8/0x109c
lock_acquire+0x234/0x284
__mutex_lock+0xc8/0x388
mutex_lock_nested+0x2c/0x38
qos_notifier_call+0x30/0x74
qos_min_notifier_call+0x1c/0x28
notifier_call_chain+0xf4/0x114
blocking_notifier_call_chain+0x4c/0x78
pm_qos_update_target+0x184/0x190
freq_qos_apply+0x4c/0x64
apply_constraint+0xf8/0xfc
__dev_pm_qos_update_request+0x138/0x164
dev_pm_qos_update_request+0x44/0x68
msm_devfreq_boost+0x40/0x70 [msm]
msm_devfreq_active+0xc0/0xf0 [msm]
msm_gpu_submit+0xc8/0x12c [msm]
msm_job_run+0x88/0x128 [msm]
drm_sched_main+0x240/0x324 [gpu_sched]
kthread+0xf0/0x100
ret_from_fork+0x10/0x20

other info that might help us debug this:
Chain exists of:
&devfreq->lock --> dev_pm_qos_mtx --> &(c->notifiers)->rwsem
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  rlock(&(c->notifiers)->rwsem);
                               lock(dev_pm_qos_mtx);
                               lock(&(c->notifiers)->rwsem);
  lock(&devfreq->lock);

*** DEADLOCK ***
4 locks held by ring0/391:
#0: ffffff809c811170 (&gpu->lock){+.+.}-{3:3}, at: msm_job_run+0x7c/0x128 [msm]
#1: ffffff809c811208 (&gpu->active_lock){+.+.}-{3:3}, at: msm_gpu_submit+0xa8/0x12c [msm]
#2: ffffffecbbb46600 (dev_pm_qos_mtx){+.+.}-{3:3}, at: dev_pm_qos_update_request+0x38/0x68
#3: ffffff809b8c1ce8 (&(c->notifiers)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x34/0x78

stack backtrace:
CPU: 6 PID: 391 Comm: ring0 Not tainted 6.4.3-debug+ #3
Hardware name: Google Villager (rev1+) with LTE (DT)
Call trace:
dump_backtrace+0xb4/0xf0
show_stack+0x20/0x30
dump_stack_lvl+0x60/0x84
dump_stack+0x18/0x24
print_circular_bug+0x1cc/0x234
check_noncircular+0x78/0xac
__lock_acquire+0xdf8/0x109c
lock_acquire+0x234/0x284
__mutex_lock+0xc8/0x388
mutex_lock_nested+0x2c/0x38
qos_notifier_call+0x30/0x74
qos_min_notifier_call+0x1c/0x28
notifier_call_chain+0xf4/0x114
blocking_notifier_call_chain+0x4c/0x78
pm_qos_update_target+0x184/0x190
freq_qos_apply+0x4c/0x64
apply_constraint+0xf8/0xfc
__dev_pm_qos_update_request+0x138/0x164
dev_pm_qos_update_request+0x44/0x68
msm_devfreq_boost+0x40/0x70 [msm]
msm_devfreq_active+0xc0/0xf0 [msm]
msm_gpu_submit+0xc8/0x12c [msm]
msm_job_run+0x88/0x128 [msm]
drm_sched_main+0x240/0x324 [gpu_sched]
kthread+0xf0/0x100
ret_from_fork+0x10/0x20
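
To make the inversion above concrete: the devfreq worker path
establishes the ordering devfreq->lock -> gmu->lock, while the old
teardown path took gmu->lock and then, via dev_pm_domain_detach() and
the qos notifier chain, reached back into devfreq->lock. A minimal
userspace sketch of that ABBA pattern (illustrative only, not the
kernel code; the intermediate dev_pm_qos and notifier locks are
collapsed into a direct acquisition):

#include <pthread.h>

static pthread_mutex_t devfreq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t gmu_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models the devfreq worker: devfreq->lock, then gmu->lock. */
static void *devfreq_worker(void *unused)
{
	pthread_mutex_lock(&devfreq_lock);
	pthread_mutex_lock(&gmu_lock);		/* a6xx_gpu_set_freq() */
	pthread_mutex_unlock(&gmu_lock);
	pthread_mutex_unlock(&devfreq_lock);
	return NULL;
}

/* Models the old teardown: gmu->lock held across the detach, which
 * can end up back in devfreq's qos notifier. */
static void *teardown(void *unused)
{
	pthread_mutex_lock(&gmu_lock);
	pthread_mutex_lock(&devfreq_lock);	/* ABBA: may deadlock */
	pthread_mutex_unlock(&devfreq_lock);
	pthread_mutex_unlock(&gmu_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, devfreq_worker, NULL);
	pthread_create(&b, NULL, teardown, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

Lockdep reports the inconsistent ordering even on runs where the
timing happens to avoid the actual deadlock.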

Fix this by holding the GMU lock only while checking and clearing
gmu->initialized, rather than across the entire teardown path.

Fixes: 4cd15a3e8b36 ("drm/msm/a6xx: Make GPU destroy a bit safer")
Cc: Douglas Anderson <[email protected]>
Signed-off-by: Rob Clark <[email protected]>
---
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 11 ++++++++---
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 2 --
2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index bf7f855f4a34..3e0033666a2a 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -1441,8 +1441,15 @@ void a6xx_gmu_remove(struct a6xx_gpu *a6xx_gpu)
 	struct a6xx_gmu *gmu = &a6xx_gpu->gmu;
 	struct platform_device *pdev = to_platform_device(gmu->dev);
 
-	if (!gmu->initialized)
+	mutex_lock(&gmu->lock);
+	if (!gmu->initialized) {
+		mutex_unlock(&gmu->lock);
 		return;
+	}
+
+	gmu->initialized = false;
+
+	mutex_unlock(&gmu->lock);
 
 	pm_runtime_force_suspend(gmu->dev);
 
@@ -1472,8 +1479,6 @@ void a6xx_gmu_remove(struct a6xx_gpu *a6xx_gpu)
 
 	/* Drop reference taken in of_find_device_by_node */
 	put_device(gmu->dev);
-
-	gmu->initialized = false;
 }
 
 static int cxpd_notifier_cb(struct notifier_block *nb,
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 9be3260c8033..67dd2eeecf62 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -2091,9 +2091,7 @@ static void a6xx_destroy(struct msm_gpu *gpu)

 	a6xx_llc_slices_destroy(a6xx_gpu);
 
-	mutex_lock(&a6xx_gpu->gmu.lock);
 	a6xx_gmu_remove(a6xx_gpu);
-	mutex_unlock(&a6xx_gpu->gmu.lock);
 
 	adreno_gpu_cleanup(adreno_gpu);

--
2.41.0



2023-08-04 18:38:57

by Doug Anderson

Subject: Re: [PATCH] drm/msm/a6xx: Fix GMU lockdep splat

Hi,

On Thu, Aug 3, 2023 at 10:34 AM Rob Clark <[email protected]> wrote:
>
> From: Rob Clark <[email protected]>
>
> For normal GPU devfreq, we need to acquire the GMU lock while already
> holding devfreq locks. But in the teardown path, we were calling
> dev_pm_domain_detach() while already holding the GMU lock, resulting in
> this lockdep splat:
>
> [ lockdep splat snipped ]
>
> Fix this by holding the GMU lock only while checking and clearing
> gmu->initialized, rather than across the entire teardown path.
>
> Fixes: 4cd15a3e8b36 ("drm/msm/a6xx: Make GPU destroy a bit safer")
> Cc: Douglas Anderson <[email protected]>
> Signed-off-by: Rob Clark <[email protected]>
> ---
> drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 11 ++++++++---
> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 2 --
> 2 files changed, 8 insertions(+), 5 deletions(-)

While this is tangled enough that I can't necessarily think through
every aspect of it, this seems to me like it should do the trick.
Thanks for the fix!

Reviewed-by: Douglas Anderson <[email protected]>

-Doug