2018-08-13 19:18:38

by Lyude Paul

Subject: [PATCH v6 0/5] Fix connector probing deadlocks from RPM bugs

Latest version of https://patchwork.freedesktop.org/series/46815/ with
some significant improvements:

- I finally figured out a clean way to do this entirely with runtime PM
helpers; there's no longer any need to avoid grabbing RPM refs! (See the
condensed sketch below.)
- Since this new method removes the need for a lot of the other changes
I made (we probably still want those changes, just not for fixing these
deadlocks), I've dropped the now-irrelevant patches from the series and
will resend them on their own to make reviewing this go faster.
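
The core of the trick: pm_runtime_get() never blocks. It returns 1 if the
device is already runtime-active, 0 if it only queued an asynchronous
resume (i.e. a suspend/resume is in flight), and -EACCES if runtime PM is
disabled. Roughly, the pattern patches 3 and 5 build on looks like this
(a condensed sketch, not code from the series; handle_hotplug_event() and
defer_hotplug_event() are made-up stand-ins for the real handlers):

#include <linux/device.h>
#include <linux/pm_runtime.h>

/* Made-up stand-ins for the driver's real hotplug handling. */
void handle_hotplug_event(struct device *dev);
void defer_hotplug_event(struct device *dev);

static void example_hotplug_handler(struct device *dev)
{
	/* Bumps the usage count and, if needed, queues an async resume;
	 * never waits for the device to actually wake up.
	 */
	int ret = pm_runtime_get(dev);

	if (ret == 1 || ret == -EACCES) {
		/* Already awake (or runtime PM disabled): handle it now. */
		handle_hotplug_event(dev);
		pm_runtime_mark_last_busy(dev);
		pm_runtime_put_autosuspend(dev);
	} else if (ret == 0) {
		/* A suspend/resume is in flight; blocking here could
		 * deadlock against it, so note the event and replay it
		 * once the device has finished resuming.
		 */
		defer_hotplug_event(dev);
		pm_runtime_put_noidle(dev);
	} else {
		/* Real RPM failure: drop the reference and give up. */
		pm_runtime_put_noidle(dev);
	}
}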

Lyude Paul (5):
drm/nouveau: Fix bogus drm_kms_helper_poll_enable() placement
drm/nouveau: Remove duplicate poll_enable() in pmops_runtime_suspend()
drm/nouveau: Fix deadlock with fb_helper with async RPM requests
drm/nouveau: Use pm_runtime_get_noresume() in connector_detect()
drm/nouveau: Fix deadlocks in nouveau_connector_detect()

drivers/gpu/drm/nouveau/dispnv50/disp.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_connector.c | 42 +++++++++---
drivers/gpu/drm/nouveau/nouveau_display.c | 9 ++-
drivers/gpu/drm/nouveau/nouveau_drm.c | 1 -
drivers/gpu/drm/nouveau/nouveau_fbcon.c | 72 +++++++++++++++++++++
drivers/gpu/drm/nouveau/nouveau_fbcon.h | 7 ++
6 files changed, 119 insertions(+), 14 deletions(-)

--
2.17.1



2018-08-13 19:08:55

by Lyude Paul

Subject: [PATCH v6 3/5] drm/nouveau: Fix deadlock with fb_helper with async RPM requests

I'm sure I don't need to tell you that fb_helper's locking is a mess.
That being said, fb_helper's locking mess can seriously complicate the
runtime suspend/resume operations of drivers because it can invoke
atomic commits and connector probing from anywhere that calls
drm_fb_helper_hotplug_event(). Since most drivers use
drm_fb_helper_output_poll_changed() as their output_poll_changed
handler, this can happen in every single context that can fire off a
hotplug event. An example:

[ 246.669625] INFO: task kworker/4:0:37 blocked for more than 120 seconds.
[ 246.673398] Not tainted 4.18.0-rc5Lyude-Test+ #2
[ 246.675271] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.676527] kworker/4:0 D 0 37 2 0x80000000
[ 246.677580] Workqueue: events output_poll_execute [drm_kms_helper]
[ 246.678704] Call Trace:
[ 246.679753] __schedule+0x322/0xaf0
[ 246.680916] schedule+0x33/0x90
[ 246.681924] schedule_preempt_disabled+0x15/0x20
[ 246.683023] __mutex_lock+0x569/0x9a0
[ 246.684035] ? kobject_uevent_env+0x117/0x7b0
[ 246.685132] ? drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
[ 246.686179] mutex_lock_nested+0x1b/0x20
[ 246.687278] ? mutex_lock_nested+0x1b/0x20
[ 246.688307] drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
[ 246.689420] drm_fb_helper_output_poll_changed+0x23/0x30 [drm_kms_helper]
[ 246.690462] drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
[ 246.691570] output_poll_execute+0x198/0x1c0 [drm_kms_helper]
[ 246.692611] process_one_work+0x231/0x620
[ 246.693725] worker_thread+0x214/0x3a0
[ 246.694756] kthread+0x12b/0x150
[ 246.695856] ? wq_pool_ids_show+0x140/0x140
[ 246.696888] ? kthread_create_worker_on_cpu+0x70/0x70
[ 246.697998] ret_from_fork+0x3a/0x50
[ 246.699034] INFO: task kworker/0:1:60 blocked for more than 120 seconds.
[ 246.700153] Not tainted 4.18.0-rc5Lyude-Test+ #2
[ 246.701182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.702278] kworker/0:1 D 0 60 2 0x80000000
[ 246.703293] Workqueue: pm pm_runtime_work
[ 246.704393] Call Trace:
[ 246.705403] __schedule+0x322/0xaf0
[ 246.706439] ? wait_for_completion+0x104/0x190
[ 246.707393] schedule+0x33/0x90
[ 246.708375] schedule_timeout+0x3a5/0x590
[ 246.709289] ? mark_held_locks+0x58/0x80
[ 246.710208] ? _raw_spin_unlock_irq+0x2c/0x40
[ 246.711222] ? wait_for_completion+0x104/0x190
[ 246.712134] ? trace_hardirqs_on_caller+0xf4/0x190
[ 246.713094] ? wait_for_completion+0x104/0x190
[ 246.713964] wait_for_completion+0x12c/0x190
[ 246.714895] ? wake_up_q+0x80/0x80
[ 246.715727] ? get_work_pool+0x90/0x90
[ 246.716649] flush_work+0x1c9/0x280
[ 246.717483] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[ 246.718442] __cancel_work_timer+0x146/0x1d0
[ 246.719247] cancel_delayed_work_sync+0x13/0x20
[ 246.720043] drm_kms_helper_poll_disable+0x1f/0x30 [drm_kms_helper]
[ 246.721123] nouveau_pmops_runtime_suspend+0x3d/0xb0 [nouveau]
[ 246.721897] pci_pm_runtime_suspend+0x6b/0x190
[ 246.722825] ? pci_has_legacy_pm_support+0x70/0x70
[ 246.723737] __rpm_callback+0x7a/0x1d0
[ 246.724721] ? pci_has_legacy_pm_support+0x70/0x70
[ 246.725607] rpm_callback+0x24/0x80
[ 246.726553] ? pci_has_legacy_pm_support+0x70/0x70
[ 246.727376] rpm_suspend+0x142/0x6b0
[ 246.728185] pm_runtime_work+0x97/0xc0
[ 246.728938] process_one_work+0x231/0x620
[ 246.729796] worker_thread+0x44/0x3a0
[ 246.730614] kthread+0x12b/0x150
[ 246.731395] ? wq_pool_ids_show+0x140/0x140
[ 246.732202] ? kthread_create_worker_on_cpu+0x70/0x70
[ 246.732878] ret_from_fork+0x3a/0x50
[ 246.733768] INFO: task kworker/4:2:422 blocked for more than 120 seconds.
[ 246.734587] Not tainted 4.18.0-rc5Lyude-Test+ #2
[ 246.735393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.736113] kworker/4:2 D 0 422 2 0x80000080
[ 246.736789] Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
[ 246.737665] Call Trace:
[ 246.738490] __schedule+0x322/0xaf0
[ 246.739250] schedule+0x33/0x90
[ 246.739908] rpm_resume+0x19c/0x850
[ 246.740750] ? finish_wait+0x90/0x90
[ 246.741541] __pm_runtime_resume+0x4e/0x90
[ 246.742370] nv50_disp_atomic_commit+0x31/0x210 [nouveau]
[ 246.743124] drm_atomic_commit+0x4a/0x50 [drm]
[ 246.743775] restore_fbdev_mode_atomic+0x1c8/0x240 [drm_kms_helper]
[ 246.744603] restore_fbdev_mode+0x31/0x140 [drm_kms_helper]
[ 246.745373] drm_fb_helper_restore_fbdev_mode_unlocked+0x54/0xb0 [drm_kms_helper]
[ 246.746220] drm_fb_helper_set_par+0x2d/0x50 [drm_kms_helper]
[ 246.746884] drm_fb_helper_hotplug_event.part.28+0x96/0xb0 [drm_kms_helper]
[ 246.747675] drm_fb_helper_output_poll_changed+0x23/0x30 [drm_kms_helper]
[ 246.748544] drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
[ 246.749439] nv50_mstm_hotplug+0x15/0x20 [nouveau]
[ 246.750111] drm_dp_send_link_address+0x177/0x1c0 [drm_kms_helper]
[ 246.750764] drm_dp_check_and_send_link_address+0xa8/0xd0 [drm_kms_helper]
[ 246.751602] drm_dp_mst_link_probe_work+0x51/0x90 [drm_kms_helper]
[ 246.752314] process_one_work+0x231/0x620
[ 246.752979] worker_thread+0x44/0x3a0
[ 246.753838] kthread+0x12b/0x150
[ 246.754619] ? wq_pool_ids_show+0x140/0x140
[ 246.755386] ? kthread_create_worker_on_cpu+0x70/0x70
[ 246.756162] ret_from_fork+0x3a/0x50
[ 246.756847]
Showing all locks held in the system:
[ 246.758261] 3 locks held by kworker/4:0/37:
[ 246.759016] #0: 00000000f8df4d2d ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620
[ 246.759856] #1: 00000000e6065461 ((work_completion)(&(&dev->mode_config.output_poll_work)->work)){+.+.}, at: process_one_work+0x1b3/0x620
[ 246.760670] #2: 00000000cb66735f (&helper->lock){+.+.}, at: drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
[ 246.761516] 2 locks held by kworker/0:1/60:
[ 246.762274] #0: 00000000fff6be0f ((wq_completion)"pm"){+.+.}, at: process_one_work+0x1b3/0x620
[ 246.762982] #1: 000000005ab44fb4 ((work_completion)(&dev->power.work)){+.+.}, at: process_one_work+0x1b3/0x620
[ 246.763890] 1 lock held by khungtaskd/64:
[ 246.764664] #0: 000000008cb8b5c3 (rcu_read_lock){....}, at: debug_show_all_locks+0x23/0x185
[ 246.765588] 5 locks held by kworker/4:2/422:
[ 246.766440] #0: 00000000232f0959 ((wq_completion)"events_long"){+.+.}, at: process_one_work+0x1b3/0x620
[ 246.767390] #1: 00000000bb59b134 ((work_completion)(&mgr->work)){+.+.}, at: process_one_work+0x1b3/0x620
[ 246.768154] #2: 00000000cb66735f (&helper->lock){+.+.}, at: drm_fb_helper_restore_fbdev_mode_unlocked+0x4c/0xb0 [drm_kms_helper]
[ 246.768966] #3: 000000004c8f0b6b (crtc_ww_class_acquire){+.+.}, at: restore_fbdev_mode_atomic+0x4b/0x240 [drm_kms_helper]
[ 246.769921] #4: 000000004c34a296 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_backoff+0x8a/0x1b0 [drm]
[ 246.770839] 1 lock held by dmesg/1038:
[ 246.771739] 2 locks held by zsh/1172:
[ 246.772650] #0: 00000000836d0438 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
[ 246.773680] #1: 000000001f4f4d48 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0xc1/0x870

[ 246.775522] =============================================

After trying dozens of different solutions, I found one very simple one
that should also have the benefit of preventing us from having to fight
locking for the rest of our lives. So, we work around these deadlocks by
deferring all fbcon hotplug events that happen after the runtime suspend
process starts until after the device is resumed again.
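
Concretely, the scheme is just a flag protected by a mutex: the
poll-changed path sets it when the GPU can't be woken right away, and the
resume path replays the event. A minimal, generic sketch of that shape
(the names below are made up for illustration; the real code lives in
nouveau_fbcon_output_poll_changed() and nouveau_fbcon_hotplug_resume() in
the diff below):

#include <linux/mutex.h>
#include <linux/types.h>

struct deferred_hotplug {
	struct mutex lock;	/* protects @waiting */
	bool waiting;		/* an event arrived while suspending */
};

/* Hotplug path: the device can't be woken right now, so just remember
 * that an event happened.
 */
static void deferred_hotplug_record(struct deferred_hotplug *d)
{
	mutex_lock(&d->lock);
	d->waiting = true;
	mutex_unlock(&d->lock);
}

/* Resume path: replay any missed event exactly once. */
static void deferred_hotplug_replay(struct deferred_hotplug *d,
				    void (*handle)(void *ctx), void *ctx)
{
	mutex_lock(&d->lock);
	if (d->waiting) {
		d->waiting = false;
		handle(ctx);
	}
	mutex_unlock(&d->lock);
}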

Changes since v5:
- Come up with the (hopefully final) solution to this dumb
problem, one that is a lot less likely to cause issues with locking in
the future. This should work around all deadlock conditions with fbcon
brought up thus far.

Changes since v4:
- Add nouveau_fbcon_hotplugged_in_suspend() to work around the deadlock
condition that Lukas described
- Just move all of this out of drm_fb_helper. It seems that other DRM
drivers have already figured out other workarounds for this. If other
drivers do end up needing this in the future, we can just move this
back into drm_fb_helper again.

Changes since v3:
- Actually check if fb_helper is NULL in both new helpers
- Actually check drm_fbdev_emulation in both new helpers
- Don't fire off a fb_helper hotplug unconditionally; only do it if
the following conditions are true (as otherwise, calling this in the
wrong spot will cause Bad Things to happen):
- fb_helper hotplug handling was actually inhibited previously
- fb_helper actually has a delayed hotplug pending
- fb_helper is actually bound
- fb_helper is actually initialized
- Add __must_check to drm_fb_helper_suspend_hotplug(). There's no
situation where a driver would actually want to use this without
checking the return value, so enforce that
- Rewrite and clarify the documentation for both helpers.
- Make sure to return true in the drm_fb_helper_suspend_hotplug() stub
that's provided in drm_fb_helper.h when CONFIG_DRM_FBDEV_EMULATION
isn't enabled
- Actually grab the toplevel fb_helper lock in
drm_fb_helper_resume_hotplug(), since it's possible other activity
(such as a hotplug) could be going on at the same time the driver
calls drm_fb_helper_resume_hotplug(). We need this to check whether or
not drm_fb_helper_hotplug_event() needs to be called anyway

Signed-off-by: Lyude Paul <[email protected]>
Cc: [email protected]
Cc: Lukas Wunner <[email protected]>
Cc: Karol Herbst <[email protected]>
---
drivers/gpu/drm/nouveau/dispnv50/disp.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_display.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_fbcon.c | 72 +++++++++++++++++++++++
drivers/gpu/drm/nouveau/nouveau_fbcon.h | 7 +++
4 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c b/drivers/gpu/drm/nouveau/dispnv50/disp.c
index 8b522a9b12f6..a0772389ed90 100644
--- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
+++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
@@ -2049,7 +2049,7 @@ nv50_disp_atomic_state_alloc(struct drm_device *dev)
static const struct drm_mode_config_funcs
nv50_disp_func = {
.fb_create = nouveau_user_framebuffer_create,
- .output_poll_changed = drm_fb_helper_output_poll_changed,
+ .output_poll_changed = nouveau_fbcon_output_poll_changed,
.atomic_check = nv50_disp_atomic_check,
.atomic_commit = nv50_disp_atomic_commit,
.atomic_state_alloc = nv50_disp_atomic_state_alloc,
diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
index 1d36ab5d4796..4b873e668b26 100644
--- a/drivers/gpu/drm/nouveau/nouveau_display.c
+++ b/drivers/gpu/drm/nouveau/nouveau_display.c
@@ -293,7 +293,7 @@ nouveau_user_framebuffer_create(struct drm_device *dev,

static const struct drm_mode_config_funcs nouveau_mode_config_funcs = {
.fb_create = nouveau_user_framebuffer_create,
- .output_poll_changed = drm_fb_helper_output_poll_changed,
+ .output_poll_changed = nouveau_fbcon_output_poll_changed,
};


diff --git a/drivers/gpu/drm/nouveau/nouveau_fbcon.c b/drivers/gpu/drm/nouveau/nouveau_fbcon.c
index 85c1f10bc2b6..c7e86e9232a6 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fbcon.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fbcon.c
@@ -466,6 +466,7 @@ nouveau_fbcon_set_suspend_work(struct work_struct *work)
console_unlock();

if (state == FBINFO_STATE_RUNNING) {
+ nouveau_fbcon_hotplug_resume(drm->fbcon);
pm_runtime_mark_last_busy(drm->dev->dev);
pm_runtime_put_sync(drm->dev->dev);
}
@@ -487,6 +488,76 @@ nouveau_fbcon_set_suspend(struct drm_device *dev, int state)
schedule_work(&drm->fbcon_work);
}

+void
+nouveau_fbcon_output_poll_changed(struct drm_device *dev)
+{
+ struct nouveau_drm *drm = nouveau_drm(dev);
+ struct nouveau_fbdev *fbcon = drm->fbcon;
+ int ret;
+
+ if (!fbcon)
+ return;
+
+ mutex_lock(&fbcon->hotplug_lock);
+
+ ret = pm_runtime_get(dev->dev);
+ if (ret == 1 || ret == -EACCES) {
+ drm_fb_helper_hotplug_event(&fbcon->helper);
+
+ pm_runtime_mark_last_busy(dev->dev);
+ pm_runtime_put_autosuspend(dev->dev);
+ } else if (ret == 0) {
+ /* If the GPU was already in the process of suspending before
+ * this event happened, then we can't block here as we'll
+ * deadlock the runtime pmops since they wait for us to
+ * finish. So, just defer this event for when we runtime
+ * resume again. It will be handled by fbcon_work.
+ */
+ NV_DEBUG(drm, "fbcon HPD event deferred until runtime resume\n");
+ fbcon->hotplug_waiting = true;
+ pm_runtime_put_noidle(drm->dev->dev);
+ } else {
+ DRM_WARN("fbcon HPD event lost due to RPM failure: %d\n",
+ ret);
+ }
+
+ mutex_unlock(&fbcon->hotplug_lock);
+}
+
+void
+nouveau_fbcon_hotplug_resume(struct nouveau_fbdev *fbcon)
+{
+ struct nouveau_drm *drm;
+
+ if (!fbcon)
+ return;
+ drm = nouveau_drm(fbcon->helper.dev);
+
+ mutex_lock(&fbcon->hotplug_lock);
+ if (fbcon->hotplug_waiting) {
+ fbcon->hotplug_waiting = false;
+
+ NV_DEBUG(drm, "Handling deferred fbcon HPD events\n");
+ drm_fb_helper_hotplug_event(&fbcon->helper);
+ }
+ mutex_unlock(&fbcon->hotplug_lock);
+}
+
+bool
+nouveau_fbcon_hotplugged_in_suspend(struct nouveau_fbdev *fbcon)
+{
+ bool hotplug;
+
+ if (!fbcon)
+ return false;
+
+ mutex_lock(&fbcon->hotplug_lock);
+ hotplug = fbcon->hotplug_waiting;
+ mutex_unlock(&fbcon->hotplug_lock);
+
+ return hotplug;
+}
+
int
nouveau_fbcon_init(struct drm_device *dev)
{
@@ -505,6 +576,7 @@ nouveau_fbcon_init(struct drm_device *dev)

drm->fbcon = fbcon;
INIT_WORK(&drm->fbcon_work, nouveau_fbcon_set_suspend_work);
+ mutex_init(&fbcon->hotplug_lock);

drm_fb_helper_prepare(dev, &fbcon->helper, &nouveau_fbcon_helper_funcs);

diff --git a/drivers/gpu/drm/nouveau/nouveau_fbcon.h b/drivers/gpu/drm/nouveau/nouveau_fbcon.h
index a6f192ea3fa6..148c6b2cc6da 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fbcon.h
+++ b/drivers/gpu/drm/nouveau/nouveau_fbcon.h
@@ -41,6 +41,9 @@ struct nouveau_fbdev {
struct nvif_object gdi;
struct nvif_object blit;
struct nvif_object twod;
+
+ struct mutex hotplug_lock;
+ bool hotplug_waiting;
};

void nouveau_fbcon_restore(void);
@@ -68,6 +71,10 @@ void nouveau_fbcon_set_suspend(struct drm_device *dev, int state);
void nouveau_fbcon_accel_save_disable(struct drm_device *dev);
void nouveau_fbcon_accel_restore(struct drm_device *dev);

+void nouveau_fbcon_output_poll_changed(struct drm_device *dev);
+void nouveau_fbcon_hotplug_resume(struct nouveau_fbdev *fbcon);
+bool nouveau_fbcon_hotplugged_in_suspend(struct nouveau_fbdev *fbcon);
+
extern int nouveau_nofbaccel;

#endif /* __NV50_FBCON_H__ */
--
2.17.1


2018-08-13 19:09:05

by Lyude Paul

Subject: [PATCH v6 4/5] drm/nouveau: Use pm_runtime_get_noresume() in connector_detect()

It's true that we can't resume the device from poll workers in
nouveau_connector_detect(). We can, however, prevent the autosuspend
timer from elapsing immediately (if it hasn't already) without risking
any sort of deadlock with the runtime suspend/resume operations. So do
that instead of avoiding grabbing a power reference entirely.
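
For clarity: pm_runtime_get_noresume() only increments the device's usage
count; it never initiates or waits on a resume, so it can't deadlock
against a runtime suspend that's waiting for the poll worker to finish,
while the raised usage count still keeps the device from actually
suspending until the matching put. A minimal sketch of that shape
(probe_connector() is a made-up stand-in for the real detection code):

#include <linux/device.h>
#include <linux/pm_runtime.h>

/* Made-up stand-in for the actual connector probing. */
void probe_connector(struct device *dev);

static void poll_worker_detect(struct device *dev)
{
	/* Take a usage-count reference without resuming; outputs are only
	 * polled while the device is runtime active, so it's already awake.
	 */
	pm_runtime_get_noresume(dev);

	probe_connector(dev);

	/* Drop the reference and restart the autosuspend countdown. */
	pm_runtime_mark_last_busy(dev);
	pm_runtime_put_autosuspend(dev);
}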

Signed-off-by: Lyude Paul <[email protected]>
Cc: [email protected]
Cc: Lukas Wunner <[email protected]>
Cc: Karol Herbst <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_connector.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c b/drivers/gpu/drm/nouveau/nouveau_connector.c
index 2a45b4c2ceb0..010d6db14cba 100644
--- a/drivers/gpu/drm/nouveau/nouveau_connector.c
+++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
@@ -572,12 +572,16 @@ nouveau_connector_detect(struct drm_connector *connector, bool force)
nv_connector->edid = NULL;
}

- /* Outputs are only polled while runtime active, so acquiring a
- * runtime PM ref here is unnecessary (and would deadlock upon
- * runtime suspend because it waits for polling to finish).
+ /* Outputs are only polled while runtime active, so resuming the
+ * device here is unnecessary (and would deadlock upon runtime suspend
+ * because it waits for polling to finish). We do however, want to
+ * prevent the autosuspend timer from elapsing during this operation
+ * if possible.
*/
- if (!drm_kms_helper_is_poll_worker()) {
- ret = pm_runtime_get_sync(connector->dev->dev);
+ if (drm_kms_helper_is_poll_worker()) {
+ pm_runtime_get_noresume(dev->dev);
+ } else {
+ ret = pm_runtime_get_sync(dev->dev);
if (ret < 0 && ret != -EACCES)
return conn_status;
}
@@ -655,10 +659,8 @@ nouveau_connector_detect(struct drm_connector *connector, bool force)

out:

- if (!drm_kms_helper_is_poll_worker()) {
- pm_runtime_mark_last_busy(connector->dev->dev);
- pm_runtime_put_autosuspend(connector->dev->dev);
- }
+ pm_runtime_mark_last_busy(dev->dev);
+ pm_runtime_put_autosuspend(dev->dev);

return conn_status;
}
--
2.17.1


2018-08-13 19:09:45

by Lyude Paul

Subject: [PATCH v6 1/5] drm/nouveau: Fix bogus drm_kms_helper_poll_enable() placement

Turns out this part is my fault for not noticing when reviewing
9a2eba337cace ("drm/nouveau: Fix drm poll_helper handling"). Currently
we call drm_kms_helper_poll_enable() from nouveau_display_hpd_work().
This makes basically no sense, however, because that means we're calling
drm_kms_helper_poll_enable() every time we schedule the hotplug
detection work. This is also against the advice mentioned in
drm_kms_helper_poll_enable()'s documentation:

Note that calls to enable and disable polling must be strictly ordered,
which is automatically the case when they're only called from
suspend/resume callbacks.

Of course, hotplugs can't really be ordered. They could even happen
immediately after we called drm_kms_helper_poll_disable() in
nouveau_display_fini(), which can lead to all sorts of issues.

Additionally, enabling polling /after/ we call
drm_helper_hpd_irq_event() could also mean that we'd miss a hotplug
event anyway, since drm_helper_hpd_irq_event() wouldn't bother trying to
probe connectors so long as polling is disabled.

So, simply move this back into nouveau_display_init() again. The race
condition that both of these patches attempted to work around has
already been fixed properly in

d61a5c106351 ("drm/nouveau: Fix deadlock on runtime suspend")

Fixes: 9a2eba337cace ("drm/nouveau: Fix drm poll_helper handling")
Signed-off-by: Lyude Paul <[email protected]>
Cc: Lukas Wunner <[email protected]>
Cc: Peter Ujfalusi <[email protected]>
Cc: [email protected]
---
drivers/gpu/drm/nouveau/nouveau_display.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
index ec7861457b84..1d36ab5d4796 100644
--- a/drivers/gpu/drm/nouveau/nouveau_display.c
+++ b/drivers/gpu/drm/nouveau/nouveau_display.c
@@ -355,8 +355,6 @@ nouveau_display_hpd_work(struct work_struct *work)
pm_runtime_get_sync(drm->dev->dev);

drm_helper_hpd_irq_event(drm->dev);
- /* enable polling for external displays */
- drm_kms_helper_poll_enable(drm->dev);

pm_runtime_mark_last_busy(drm->dev->dev);
pm_runtime_put_sync(drm->dev->dev);
@@ -411,6 +409,11 @@ nouveau_display_init(struct drm_device *dev)
if (ret)
return ret;

+ /* enable connector detection and polling for connectors without HPD
+ * support
+ */
+ drm_kms_helper_poll_enable(dev);
+
/* enable hotplug interrupts */
drm_connector_list_iter_begin(dev, &conn_iter);
nouveau_for_each_non_mst_connector_iter(connector, &conn_iter) {
--
2.17.1


2018-08-13 19:10:48

by Lyude Paul

Subject: [PATCH v6 2/5] drm/nouveau: Remove duplicate poll_enable() in pmops_runtime_suspend()

Since actual hotplug notifications don't get disabled until
nouveau_display_fini() is called, all this will do is cause any hotplugs
that happen between this drm_kms_helper_poll_disable() call and the
actual hotplug disablement to potentially be dropped if ACPI isn't
around to help us.

Signed-off-by: Lyude Paul <[email protected]>
Cc: [email protected]
Cc: Lukas Wunner <[email protected]>
Cc: Karol Herbst <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_drm.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index c7ec86d6c3c9..5fdc1fbe2ee5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -835,7 +835,6 @@ nouveau_pmops_runtime_suspend(struct device *dev)
return -EBUSY;
}

- drm_kms_helper_poll_disable(drm_dev);
nouveau_switcheroo_optimus_dsm();
ret = nouveau_do_suspend(drm_dev, true);
pci_save_state(pdev);
--
2.17.1


2018-08-13 19:12:35

by Ilia Mirkin

Subject: Re: [PATCH v6 3/5] drm/nouveau: Fix deadlock with fb_helper with async RPM requests

On Mon, Aug 13, 2018 at 3:07 PM, Lyude Paul <[email protected]> wrote:
> +bool
> +nouveau_fbcon_hotplugged_in_suspend(struct nouveau_fbdev *fbcon)
> +{
> + bool hotplug;
> +
> + if (!fbcon)
> + return false;
> +
> + mutex_lock(&fbcon->hotplug_lock);
> + hotplug = fbcon->hotplug_waiting;
> + mutex_unlock(&fbcon->hotplug_lock);
> +
> + return hotplug;
> +}

I'm probably missing it, but where is this function used?

2018-08-13 19:13:35

by Lyude Paul

Subject: Re: [PATCH v6 3/5] drm/nouveau: Fix deadlock with fb_helper with async RPM requests

On Mon, 2018-08-13 at 15:11 -0400, Ilia Mirkin wrote:
> On Mon, Aug 13, 2018 at 3:07 PM, Lyude Paul <[email protected]> wrote:
> > +bool
> > +nouveau_fbcon_hotplugged_in_suspend(struct nouveau_fbdev *fbcon)
> > +{
> > + bool hotplug;
> > +
> > + if (!fbcon)
> > + return false;
> > +
> > + mutex_lock(&fbcon->hotplug_lock);
> > + hotplug = fbcon->hotplug_waiting;
> > + mutex_unlock(&fbcon->hotplug_lock);
> > +
> > + return hotplug;
> > +}
>
> I'm probably missing it, but where is this function used?
oh no you aren't, I thought I got rid of that function! will respin in just a
moment
--
Cheers,
Lyude Paul


2018-08-13 19:18:40

by Lyude Paul

Subject: [PATCH v6 5/5] drm/nouveau: Fix deadlocks in nouveau_connector_detect()

When we disable hotplugging on the GPU, we need to be able to
synchronize with each connector's hotplug interrupt handler before the
interrupt is finally disabled. This can be a problem, however, since
nouveau_connector_detect() currently grabs a runtime power reference
when handling connector probing. This will deadlock the runtime suspend
handler like so:

[ 861.480896] INFO: task kworker/0:2:61 blocked for more than 120 seconds.
[ 861.483290] Tainted: G O 4.18.0-rc6Lyude-Test+ #1
[ 861.485158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 861.486332] kworker/0:2 D 0 61 2 0x80000000
[ 861.487044] Workqueue: events nouveau_display_hpd_work [nouveau]
[ 861.487737] Call Trace:
[ 861.488394] __schedule+0x322/0xaf0
[ 861.489070] schedule+0x33/0x90
[ 861.489744] rpm_resume+0x19c/0x850
[ 861.490392] ? finish_wait+0x90/0x90
[ 861.491068] __pm_runtime_resume+0x4e/0x90
[ 861.491753] nouveau_display_hpd_work+0x22/0x60 [nouveau]
[ 861.492416] process_one_work+0x231/0x620
[ 861.493068] worker_thread+0x44/0x3a0
[ 861.493722] kthread+0x12b/0x150
[ 861.494342] ? wq_pool_ids_show+0x140/0x140
[ 861.494991] ? kthread_create_worker_on_cpu+0x70/0x70
[ 861.495648] ret_from_fork+0x3a/0x50
[ 861.496304] INFO: task kworker/6:2:320 blocked for more than 120 seconds.
[ 861.496968] Tainted: G O 4.18.0-rc6Lyude-Test+ #1
[ 861.497654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 861.498341] kworker/6:2 D 0 320 2 0x80000080
[ 861.499045] Workqueue: pm pm_runtime_work
[ 861.499739] Call Trace:
[ 861.500428] __schedule+0x322/0xaf0
[ 861.501134] ? wait_for_completion+0x104/0x190
[ 861.501851] schedule+0x33/0x90
[ 861.502564] schedule_timeout+0x3a5/0x590
[ 861.503284] ? mark_held_locks+0x58/0x80
[ 861.503988] ? _raw_spin_unlock_irq+0x2c/0x40
[ 861.504710] ? wait_for_completion+0x104/0x190
[ 861.505417] ? trace_hardirqs_on_caller+0xf4/0x190
[ 861.506136] ? wait_for_completion+0x104/0x190
[ 861.506845] wait_for_completion+0x12c/0x190
[ 861.507555] ? wake_up_q+0x80/0x80
[ 861.508268] flush_work+0x1c9/0x280
[ 861.508990] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[ 861.509735] nvif_notify_put+0xb1/0xc0 [nouveau]
[ 861.510482] nouveau_display_fini+0xbd/0x170 [nouveau]
[ 861.511241] nouveau_display_suspend+0x67/0x120 [nouveau]
[ 861.511969] nouveau_do_suspend+0x5e/0x2d0 [nouveau]
[ 861.512715] nouveau_pmops_runtime_suspend+0x47/0xb0 [nouveau]
[ 861.513435] pci_pm_runtime_suspend+0x6b/0x180
[ 861.514165] ? pci_has_legacy_pm_support+0x70/0x70
[ 861.514897] __rpm_callback+0x7a/0x1d0
[ 861.515618] ? pci_has_legacy_pm_support+0x70/0x70
[ 861.516313] rpm_callback+0x24/0x80
[ 861.517027] ? pci_has_legacy_pm_support+0x70/0x70
[ 861.517741] rpm_suspend+0x142/0x6b0
[ 861.518449] pm_runtime_work+0x97/0xc0
[ 861.519144] process_one_work+0x231/0x620
[ 861.519831] worker_thread+0x44/0x3a0
[ 861.520522] kthread+0x12b/0x150
[ 861.521220] ? wq_pool_ids_show+0x140/0x140
[ 861.521925] ? kthread_create_worker_on_cpu+0x70/0x70
[ 861.522622] ret_from_fork+0x3a/0x50
[ 861.523299] INFO: task kworker/6:0:1329 blocked for more than 120 seconds.
[ 861.523977] Tainted: G O 4.18.0-rc6Lyude-Test+ #1
[ 861.524644] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 861.525349] kworker/6:0 D 0 1329 2 0x80000000
[ 861.526073] Workqueue: events nvif_notify_work [nouveau]
[ 861.526751] Call Trace:
[ 861.527411] __schedule+0x322/0xaf0
[ 861.528089] schedule+0x33/0x90
[ 861.528758] rpm_resume+0x19c/0x850
[ 861.529399] ? finish_wait+0x90/0x90
[ 861.530073] __pm_runtime_resume+0x4e/0x90
[ 861.530798] nouveau_connector_detect+0x7e/0x510 [nouveau]
[ 861.531459] ? ww_mutex_lock+0x47/0x80
[ 861.532097] ? ww_mutex_lock+0x47/0x80
[ 861.532819] ? drm_modeset_lock+0x88/0x130 [drm]
[ 861.533481] drm_helper_probe_detect_ctx+0xa0/0x100 [drm_kms_helper]
[ 861.534127] drm_helper_hpd_irq_event+0xa4/0x120 [drm_kms_helper]
[ 861.534940] nouveau_connector_hotplug+0x98/0x120 [nouveau]
[ 861.535556] nvif_notify_work+0x2d/0xb0 [nouveau]
[ 861.536221] process_one_work+0x231/0x620
[ 861.536994] worker_thread+0x44/0x3a0
[ 861.537757] kthread+0x12b/0x150
[ 861.538463] ? wq_pool_ids_show+0x140/0x140
[ 861.539102] ? kthread_create_worker_on_cpu+0x70/0x70
[ 861.539815] ret_from_fork+0x3a/0x50
[ 861.540521]
Showing all locks held in the system:
[ 861.541696] 2 locks held by kworker/0:2/61:
[ 861.542406] #0: 000000002dbf8af5 ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620
[ 861.543071] #1: 0000000076868126 ((work_completion)(&drm->hpd_work)){+.+.}, at: process_one_work+0x1b3/0x620
[ 861.543814] 1 lock held by khungtaskd/64:
[ 861.544535] #0: 0000000059db4b53 (rcu_read_lock){....}, at: debug_show_all_locks+0x23/0x185
[ 861.545160] 3 locks held by kworker/6:2/320:
[ 861.545896] #0: 00000000d9e1bc59 ((wq_completion)"pm"){+.+.}, at: process_one_work+0x1b3/0x620
[ 861.546702] #1: 00000000c9f92d84 ((work_completion)(&dev->power.work)){+.+.}, at: process_one_work+0x1b3/0x620
[ 861.547443] #2: 000000004afc5de1 (drm_connector_list_iter){.+.+}, at: nouveau_display_fini+0x96/0x170 [nouveau]
[ 861.548146] 1 lock held by dmesg/983:
[ 861.548889] 2 locks held by zsh/1250:
[ 861.549605] #0: 00000000348e3cf6 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
[ 861.550393] #1: 000000007009a7a8 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0xc1/0x870
[ 861.551122] 6 locks held by kworker/6:0/1329:
[ 861.551957] #0: 000000002dbf8af5 ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620
[ 861.552765] #1: 00000000ddb499ad ((work_completion)(&notify->work)#2){+.+.}, at: process_one_work+0x1b3/0x620
[ 861.553582] #2: 000000006e013cbe (&dev->mode_config.mutex){+.+.}, at: drm_helper_hpd_irq_event+0x6c/0x120 [drm_kms_helper]
[ 861.554357] #3: 000000004afc5de1 (drm_connector_list_iter){.+.+}, at: drm_helper_hpd_irq_event+0x78/0x120 [drm_kms_helper]
[ 861.555227] #4: 0000000044f294d9 (crtc_ww_class_acquire){+.+.}, at: drm_helper_probe_detect_ctx+0x3d/0x100 [drm_kms_helper]
[ 861.556133] #5: 00000000db193642 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_lock+0x4b/0x130 [drm]

[ 861.557864] =============================================

[ 861.559507] NMI backtrace for cpu 2
[ 861.560363] CPU: 2 PID: 64 Comm: khungtaskd Tainted: G O 4.18.0-rc6Lyude-Test+ #1
[ 861.561197] Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET78W (1.51 ) 05/18/2018
[ 861.561948] Call Trace:
[ 861.562757] dump_stack+0x8e/0xd3
[ 861.563516] nmi_cpu_backtrace.cold.3+0x14/0x5a
[ 861.564269] ? lapic_can_unplug_cpu.cold.27+0x42/0x42
[ 861.565029] nmi_trigger_cpumask_backtrace+0xa1/0xae
[ 861.565789] arch_trigger_cpumask_backtrace+0x19/0x20
[ 861.566558] watchdog+0x316/0x580
[ 861.567355] kthread+0x12b/0x150
[ 861.568114] ? reset_hung_task_detector+0x20/0x20
[ 861.568863] ? kthread_create_worker_on_cpu+0x70/0x70
[ 861.569598] ret_from_fork+0x3a/0x50
[ 861.570370] Sending NMI from CPU 2 to CPUs 0-1,3-7:
[ 861.571426] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x7f/0x120
[ 861.571429] NMI backtrace for cpu 7 skipped: idling at intel_idle+0x7f/0x120
[ 861.571432] NMI backtrace for cpu 3 skipped: idling at intel_idle+0x7f/0x120
[ 861.571464] NMI backtrace for cpu 5 skipped: idling at intel_idle+0x7f/0x120
[ 861.571467] NMI backtrace for cpu 0 skipped: idling at intel_idle+0x7f/0x120
[ 861.571469] NMI backtrace for cpu 4 skipped: idling at intel_idle+0x7f/0x120
[ 861.571472] NMI backtrace for cpu 1 skipped: idling at intel_idle+0x7f/0x120
[ 861.572428] Kernel panic - not syncing: hung_task: blocked tasks

So: fix this by making it so that normal hotplug handling /only/ happens
so long as the GPU is currently awake without any pending runtime PM
requests. In the event that a hotplug occurs while the device is
suspending or resuming, we can simply defer our response until the GPU
is fully runtime resumed again.

Changes since v5:
- Use a new trick I came up with using pm_runtime_get() instead of the
hackish junk we had before

Signed-off-by: Lyude Paul <[email protected]>
Cc: [email protected]
Cc: Lukas Wunner <[email protected]>
Cc: Karol Herbst <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_connector.c | 22 +++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c b/drivers/gpu/drm/nouveau/nouveau_connector.c
index 010d6db14cba..109240c03357 100644
--- a/drivers/gpu/drm/nouveau/nouveau_connector.c
+++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
@@ -1124,6 +1124,26 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
const struct nvif_notify_conn_rep_v0 *rep = notify->data;
const char *name = connector->name;
struct nouveau_encoder *nv_encoder;
+ int ret;
+
+ ret = pm_runtime_get(drm->dev->dev);
+ if (ret == 0) {
+ /* We can't block here if there's a pending PM request
+ * running, as we'll deadlock nouveau_display_fini() when it
+ * calls nvif_put() on our nvif_notify struct. So, simply
+ * defer the hotplug event until the device finishes resuming
+ */
+ NV_DEBUG(drm, "Deferring HPD on %s until runtime resume\n",
+ name);
+ schedule_work(&drm->hpd_work);
+
+ pm_runtime_put_noidle(drm->dev->dev);
+ return NVIF_NOTIFY_KEEP;
+ } else if (ret != 1 && ret != -EACCES) {
+ NV_WARN(drm, "HPD on %s dropped due to RPM failure: %d\n",
+ name, ret);
+ return NVIF_NOTIFY_DROP;
+ }

if (rep->mask & NVIF_NOTIFY_CONN_V0_IRQ) {
NV_DEBUG(drm, "service %s\n", name);
@@ -1141,6 +1161,8 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
drm_helper_hpd_irq_event(connector->dev);
}

+ pm_runtime_mark_last_busy(drm->dev->dev);
+ pm_runtime_put_autosuspend(drm->dev->dev);
return NVIF_NOTIFY_KEEP;
}

--
2.17.1