Date: Tue, 17 Jul 2018 20:20:41 +0200
From: Lukas Wunner
To: Lyude Paul
Cc: nouveau@lists.freedesktop.org, David Airlie, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, Ben Skeggs, linux-pm@vger.kernel.org
Subject: Re: [Nouveau] [PATCH 1/5] drm/nouveau: Prevent RPM callback recursion in suspend/resume paths
Message-ID: <20180717182041.GA18363@wunner.de>
References: <20180716235936.11268-1-lyude@redhat.com> <20180716235936.11268-2-lyude@redhat.com> <20180717071641.GA5411@wunner.de>

On Tue, Jul 17, 2018 at 12:53:11PM -0400, Lyude Paul wrote:
> On Tue, 2018-07-17 at 09:16 +0200, Lukas Wunner wrote:
> > On Mon, Jul 16, 2018 at 07:59:25PM -0400, Lyude Paul wrote:
> > > In order to fix all of the spots that need to have runtime PM get/puts()
> > > added, we need to ensure that it's possible for us to call
> > > pm_runtime_get/put() in any context, regardless of how deep, since
> > > almost all of the spots that are currently missing refs can potentially
> > > get called in the runtime suspend/resume path. Otherwise, we'll try to
> > > resume the GPU as we're trying to resume the GPU (and vice-versa) and
> > > cause the kernel to deadlock.
> > >
> > > With this, it should be safe to call the pm runtime functions in any
> > > context in nouveau with one condition: any point in the driver that
> > > calls pm_runtime_get*() cannot hold any locks owned by nouveau that
> > > would be acquired anywhere inside nouveau_pmops_runtime_resume().
> > > This includes modesetting locks, i2c bus locks, etc.
> >
> > [snip]
> > > --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
> > > @@ -835,6 +835,8 @@ nouveau_pmops_runtime_suspend(struct device *dev)
> > >  		return -EBUSY;
> > >  	}
> > >
> > > +	dev->power.disable_depth++;
> > > +
> >
> > Anyway, if I understand the commit message correctly, you're hitting a
> > pm_runtime_get_sync() in a code path that itself is called during a
> > pm_runtime_get_sync(). Could you include stack traces in the commit
> > message? My gut feeling is that this patch masks a deeper issue,
> > e.g. if the runtime_resume code path does in fact directly poll outputs,
> > that would seem wrong. Runtime resume should merely make the card
> > accessible, i.e. reinstate power if necessary, put into PCI_D0,
> > restore registers, etc. Output polling should be scheduled
> > asynchronously.
>
> So: the reason that patch was added was mainly for the patches later in the
> series that add guards around the i2c bus and aux bus, since both of those
> require that the device be awake for it to work. Currently, the spot where it
> would recurse is:

Okay, so the PCI device is runtime suspending, and nvkm_i2c_aux_acquire() wants it in the resumed state, so it waits for the runtime suspend to finish in order to resume the device again immediately afterwards. Since that wait happens inside the runtime suspend path itself, the suspend never finishes and the wait lasts forever.

The deadlock in the stack trace you've posted could be resolved using the technique I used in d61a5c106351, by adding the following to include/linux/pm_runtime.h:

	static inline bool pm_runtime_status_suspending(struct device *dev)
	{
		return dev->power.runtime_status == RPM_SUSPENDING;
	}

	/* Is current running dev's async runtime PM work item? */
	static inline bool is_pm_work(struct device *dev)
	{
		/* dev->power.work is an embedded struct work_struct: compare addresses */
		return current_work() == &dev->power.work;
	}

Then add this to nvkm_i2c_aux_acquire():

	struct device *dev = pad->i2c->subdev.device->dev;

	/* Skip the reference only when running in dev's PM work while dev is suspending */
	if (!(is_pm_work(dev) && pm_runtime_status_suspending(dev))) {
		ret = pm_runtime_get_sync(dev);
		/* -EACCES: runtime PM is disabled for dev, carry on regardless */
		if (ret < 0 && ret != -EACCES) {
			/* get_sync() raises the usage count even when it fails */
			pm_runtime_put_noidle(dev);
			return ret;
		}
	}

But here's the catch: this only works for an *async* runtime suspend. It doesn't work for pm_runtime_put_sync(), pm_runtime_suspend(), etc., because then the runtime suspend is executed in the context of the caller, not in the context of dev->power.work. So it's not a full solution, but hopefully something that gets you going. I'm afraid I'm not familiar enough with the code paths leading to nvkm_i2c_aux_acquire() to come up with a full solution off the top of my head.

Note that it's not sufficient to just check pm_runtime_status_suspending(dev): if the runtime suspend is being carried out concurrently by some other task, it will return true, but nothing guarantees that the device stays awake until the i2c transfer has completed.

HTH,

Lukas
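The interplay of the two checks can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: struct model_dev, model_current_work, aux_needs_rpm_ref(), aux_acquire() and aux_release() are hypothetical stand-ins for dev->power, current_work() and the nvkm_i2c_aux_acquire()/release pair, and the usage counter only marks where pm_runtime_get_sync()/pm_runtime_put() would be called. It also assumes (this is not stated in the mail) that the release side must skip its put whenever acquire skipped its get.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; mirrors the kernel's enum rpm_status values. */
enum rpm_status { RPM_ACTIVE, RPM_RESUMING, RPM_SUSPENDED, RPM_SUSPENDING };

struct model_work {
	int unused;                      /* C forbids empty structs */
};

/* Hypothetical stand-in for the relevant bits of struct dev_pm_info. */
struct model_dev {
	enum rpm_status runtime_status;  /* models dev->power.runtime_status */
	struct model_work pm_work;       /* models dev->power.work */
	int usage_count;                 /* models dev->power.usage_count */
};

/* Models current_work(): the work item the calling task is running,
 * or NULL if the caller is not a workqueue worker. */
static struct model_work *model_current_work;

static bool status_suspending(const struct model_dev *dev)
{
	return dev->runtime_status == RPM_SUSPENDING;
}

static bool is_pm_work(const struct model_dev *dev)
{
	return model_current_work == &dev->pm_work;
}

/* The proposed guard: take a runtime PM reference unless we are running
 * in this device's PM work while the device is suspending. */
static bool aux_needs_rpm_ref(const struct model_dev *dev)
{
	return !(is_pm_work(dev) && status_suspending(dev));
}

/* Hypothetical acquire/release pair: the acquire-time decision is handed
 * to release so that gets and puts stay balanced. */
static bool aux_acquire(struct model_dev *dev)
{
	bool took_ref = aux_needs_rpm_ref(dev);

	if (took_ref)
		dev->usage_count++;      /* pm_runtime_get_sync() would go here */
	return took_ref;
}

static void aux_release(struct model_dev *dev, bool took_ref)
{
	if (took_ref) {
		assert(dev->usage_count > 0);
		dev->usage_count--;      /* pm_runtime_put() would go here */
	}
}
```

In the async suspend case (model_current_work points at the device's own PM work and the status is RPM_SUSPENDING) the guard skips the reference. A synchronous suspend (model_current_work == NULL) and a suspend running in some other task both still take the reference: in the kernel the first would still block in pm_runtime_get_sync(), which is the deadlock that remains, while in the second the reference is exactly what keeps the device awake, which is why checking the status alone is not enough. Passing the acquire-time decision to aux_release() rather than re-evaluating it is one way to keep the count balanced should the status change in between.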