Received: by 10.223.185.116 with SMTP id b49csp2451633wrg; Mon, 12 Feb 2018 09:44:26 -0800 (PST) X-Google-Smtp-Source: AH8x22653OM2IcniHAsxRHbsyH8YHzncItNVz4fwsCK7jLaei9NWvnUVqSCDW//33Zbly46OizYm X-Received: by 10.99.42.9 with SMTP id q9mr10228953pgq.325.1518457466806; Mon, 12 Feb 2018 09:44:26 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1518457466; cv=none; d=google.com; s=arc-20160816; b=Pc51DW5/YmYtmnHUVUzb8ocgA8nZ1dYMp3i0uR3aBDYacM9jT/YgAa4BF5qiuo9yJy iClNY06hoL8h9NWxmlyE9ITbQ59Y4zxXqKb0YFiyAlhhOTMa9xYIu7Xd88Jby9Pl3qhC coDrfALSnOvsr4gWTNQBGxDqS3bPjuDYelmdYUNA4xUNB2q4v7SxgA2nVxD0yLKWCZz9 pDvZeuqKqfMjCvoTNXJUcw5GDq9O/cEqjpxZkSjaLiwNbSV9rHgvK20IZnqU0mTH4y7P vFU/yu1s4G28ldxXT2XdoukvR6j8DZiXv0yyMcozpuOm9DoYpoeBlRxMo9Ig+Sc5t2Rw KM0A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :organization:references:in-reply-to:date:cc:to:from:subject :message-id:arc-authentication-results; bh=t/mmHA7oCCQxiH4ZBCi8oDKXmncRNChMFtCwm+SZYXI=; b=mjdmESmtCTqIri1traEWLwqWv6HsyBhA2Vur7MGkLvRbsznhUjLuqc8hJeewwB+xV4 GuzTMBSiwjbNLdILEpTd4mPdGCo5+3NDxbqMVTII/edRg/2x4i+KAVxUKj8sXPuFzKhZ d1n8Goptm0ZuR7Jcm/Gsh4KHnF4HtBJ8J7YrIeF8/DdSN1WXGLIRNIP0JVz8BV0jbQ28 itWc9xYxxjIS/JEtcg+fDXBPmPgqaCDoRGEThSr2OwU4IjFbvU9kthwHsXmKo60Mw5EY ovrZTQQRFt6fByO4V8iVvVxEckoc2W3bWV0kLBzSpCb4pHH28kFWHHNQ+yF825qPv5yr 2Bdw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l1si1852801pga.408.2018.02.12.09.44.11; Mon, 12 Feb 2018 09:44:26 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753487AbeBLRn0 (ORCPT + 99 others); Mon, 12 Feb 2018 12:43:26 -0500 Received: from mail-qt0-f193.google.com ([209.85.216.193]:37043 "EHLO mail-qt0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753355AbeBLRnX (ORCPT ); Mon, 12 Feb 2018 12:43:23 -0500 Received: by mail-qt0-f193.google.com with SMTP id w1so521885qti.4 for ; Mon, 12 Feb 2018 09:43:23 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:subject:from:to:cc:date:in-reply-to :references:organization:mime-version:content-transfer-encoding; bh=t/mmHA7oCCQxiH4ZBCi8oDKXmncRNChMFtCwm+SZYXI=; b=PlELywpumbtBBgmuGQtt2x2iKSWuHELLoz7X7oJEer3FUwp/I+sE8F6rqRNuMSmlb/ xExohQ4Eamzs+Ys+KzLnNHv17rPK0wOlYKwTsD2vzjJF8Av8mKwn9UF28aoNxvg7FixA 3lfKR9RiJBwcvggB7L55hhk8YWiIvR0rQtQPcB3gzbEJCE+Ekh1aV597t/O0nLmIZng+ 9+ippi5c8XD3brh+KGpQT6eQq+2xoFVdegQ5VKwy0VcTG7yHhwNb3JDYo10OhSdx+6rD yo7pa/p4lYdjfRH0RP2KdkRlqXdD/jO9wmU2FA7XApIB0ifSFINweZ8kpRtx7O4pjR8p /f/g== X-Gm-Message-State: APf1xPAGhgrOg/Qnw3ftoicZKVDtzmkKQKG5ms0GiP/LLtmwLD0UnIlP pwcsd/sCxrUnjIdlFaXhd/qCZw== X-Received: by 10.200.23.176 with SMTP id o45mr12066232qtj.255.1518457402542; Mon, 12 Feb 2018 09:43:22 -0800 (PST) Received: from dhcp-10-20-1-30.bss.redhat.com ([144.121.20.162]) by smtp.gmail.com with ESMTPSA id u57sm2415429qtc.41.2018.02.12.09.43.21 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Mon, 12 Feb 2018 09:43:22 -0800 (PST) Message-ID: <1518457400.5319.5.camel@redhat.com> Subject: Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers From: Lyude Paul To: Lukas Wunner , Tejun Heo , Lai Jiangshan , Alex Deucher , Dave Airlie , Ben Skeggs Cc: dri-devel@lists.freedesktop.org, Peter Wu , nouveau@lists.freedesktop.org, Hans de Goede , Pierre Moreau , linux-kernel@vger.kernel.org, Ismo Toijala , intel-gfx@lists.freedesktop.org, Liviu Dudau , Archit Taneja Date: Mon, 12 Feb 2018 12:43:20 -0500 In-Reply-To: References: Organization: Red Hat Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.26.4 (3.26.4-1.fc27) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patchset is a no-brainer for me. I'm ashamed to say I think I might have seen this issue before, but polling gets used so rarely nowadays it slipped under the rug by accident. Anyway, for the whole series (except patch #2, which I will respond to with a short comment in just a moment): Reviewed-by: Lyude Paul On Sun, 2018-02-11 at 10:38 +0100, Lukas Wunner wrote: > Fix a deadlock on hybrid graphics laptops that's been present since 2013: > > DRM drivers poll connectors in 10 sec intervals. The poll worker is > stopped on ->runtime_suspend with cancel_delayed_work_sync(). However > the poll worker invokes the DRM drivers' ->detect callbacks, which call > pm_runtime_get_sync(). If the poll worker starts after runtime suspend > has begun, pm_runtime_get_sync() will wait for runtime suspend to finish > with the intention of runtime resuming the device afterwards. The result > is a circular wait between poll worker and autosuspend worker. > > I'm seeing this deadlock so often it's not funny anymore. I've also found > 3 nouveau bugzillas about it and 1 radeon bugzilla. (See patch [3/5] and > [4/5].) > > One theoretical solution would be to add a flag to the ->detect callback > to tell it that it's running in the poll worker's context. In that case > it doesn't need to call pm_runtime_get_sync() because the poll worker is > only enabled while runtime active and we know that ->runtime_suspend > waits for it to finish. But this approach would require touching every > single DRM driver's ->detect hook. Moreover the ->detect hook is called > from numerous other places, both in the DRM library as well as in each > driver, making it necessary to trace every possible call chain and check > if it's coming from the poll worker or not. I've tried to do that for > nouveau (see the call sites listed in the commit message of patch [3/5]) > and concluded that this approach is a nightmare to implement. > > Another reason for the infeasibility of this approach is that ->detect > is called from within the DRM library without driver involvement, e.g. > via DRM's sysfs interface. In those cases, pm_runtime_get_sync() would > have to be called by the DRM library, but the DRM library is supposed to > stay generic, wheras runtime PM is driver-specific. > > So instead, I've come up with this much simpler solution which gleans > from the current task's flags if it's a workqueue worker, retrieves the > work_struct and checks if it's the poll worker. All that's needed is > one small helper in the workqueue code and another small helper in the > DRM library. This solution lends itself nicely to stable-backporting. > > The patches for radeon and amdgpu are compile-tested only, I only have a > MacBook Pro with an Nvidia GK107 to test. To test the patches, add an > "msleep(12*1000);" at the top of the driver's ->runtime_suspend hook. > This ensures that the poll worker runs after ->runtime_suspend has begun. > Wait 12 sec after the GPU has begun runtime suspend, then check > /sys/bus/pci/devices/0000:01:00.0/power/runtime_status. Without this > series, the status will be stuck at "suspending" and you'll get hung task > errors in dmesg after a few minutes. > > i915, malidp and msm "solved" this issue by not stopping the poll worker > on runtime suspend. But this results in the GPU bouncing back and forth > between D0 and D3 continuously. That's a total no-go for GPUs which > runtime suspend to D3cold since every suspend/resume cycle costs a > significant amount of time and energy. (i915 and malidp do not seem > to acquire a runtime PM ref in the ->detect callbacks, which seems > questionable. msm however does and would also deadlock if it disabled > the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx) > > Please review. Thanks, > > Lukas > > Lukas Wunner (5): > workqueue: Allow retrieval of current task's work struct > drm: Allow determining if current task is output poll worker > drm/nouveau: Fix deadlock on runtime suspend > drm/radeon: Fix deadlock on runtime suspend > drm/amdgpu: Fix deadlock on runtime suspend > > drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 58 +++++++++++++------- > drivers/gpu/drm/drm_probe_helper.c | 14 +++++ > drivers/gpu/drm/nouveau/nouveau_connector.c | 18 +++++-- > drivers/gpu/drm/radeon/radeon_connectors.c | 74 +++++++++++++++++---- > ----- > include/drm/drm_crtc_helper.h | 1 + > include/linux/workqueue.h | 1 + > kernel/workqueue.c | 16 ++++++ > 7 files changed, 132 insertions(+), 50 deletions(-) > -- Cheers, Lyude Paul