From: "Rafael J. Wysocki"
Date: Mon, 9 Dec 2019 12:38:53 +0100
Subject: Re: [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
To: Karol Herbst
Cc: Lyude Paul, Mika Westerberg, "Rafael J. Wysocki", Bjorn Helgaas, LKML, "Rafael J . Wysocki", Linux PCI, Linux PM, dri-devel, nouveau, Dave Airlie, Mario Limonciello
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Dec 9, 2019 at 12:17 PM Karol Herbst wrote:
>
> anybody any other ideas?

Not yet, but I'm trying to collect some more information.

> It seems that both patches don't really fix
> the issue and I have no idea left on my side to try out. The only
> thing left I could do to further investigate would be to reverse
> engineer the Nvidia driver as they support runpm on Turing+ GPUs now,
> but I've heard users having similar issues to the one Lyude told us
> about... and I couldn't verify that the patches help there either in a
> reliable way.

It looks like the newer (8+) versions of Windows expect the GPU driver
to prepare the GPU for power removal in some specific way, and the
power removal fails if the GPU has not been prepared as expected.

Because testing indicates that the Windows 7 path in the platform
firmware works, it may be worth trying to do what that path does to
the PCIe link before invoking the _OFF method for the power resource
controlling the GPU power.

If Mika's theory that the Win7 path simply turns the PCIe link off is
correct, then whatever the _OFF method tries to do to the link after
that should not matter.

> On Wed, Nov 27, 2019 at 8:55 PM Lyude Paul wrote:
> >
> > On Wed, 2019-11-27 at 12:51 +0100, Karol Herbst wrote:
> > > On Wed, Nov 27, 2019 at 12:49 PM Mika Westerberg wrote:
> > > > On Tue, Nov 26, 2019 at 06:10:36PM -0500, Lyude Paul wrote:
> > > > > Hey-this is almost certainly not the right place in this thread
> > > > > to respond, but this thread has gotten so deep evolution can't
> > > > > push the subject further to the right, heh. So I'll just respond
> > > > > here.
> > > >
> > > > :)
> > > >
> > > > > I've been following this and helping out Karol with testing here
> > > > > and there. They had me test Bjorn's PCI branch on the X1 Extreme
> > > > > 2nd generation, which has a turing GPU and 8086:1901 PCI bridge.
> > > > >
> > > > > I was about to say "the patch fixed things, hooray!" but it seems
> > > > > that after trying runtime suspend/resume a couple times things
> > > > > fall apart again:
> > > >
> > > > You mean $subject patch, no?
> > > >
> > >
> > > no, I told Lyude to test the pci/pm branch as the runpm errors we saw
> > > on that machine looked different. Some BAR error the GPU reported
> > > after it got resumed, so I was wondering if the delays were helping
> > > with that. But after some cycles it still caused the same issue, that
> > > the GPU disappeared. Later testing also showed that my patch also
> > > didn't seem to help with this error sadly :/
> > >
> > > > > [ 686.883247] nouveau 0000:01:00.0: DRM: suspending object tree...
> > > > > [ 752.866484] ACPI Error: Aborting method \_SB.PCI0.PEG0.PEGP.NVPO due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> > > > > [ 752.866508] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> > > > > [ 752.866521] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> > > >
> > > > This is probably the culprit. The same AML code fails to properly turn
> > > > on the device.
> > > >
> > > > Is acpidump from this system available somewhere?
> >
> > Attached it to this email
> >
> > --
> > Cheers,
> > Lyude Paul
>
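
When comparing logs like the one quoted above across machines, it can help to pull out just the aborted AML method names and the error code. Below is a minimal Python sketch of that, assuming the "Aborting method ... due to previous error (...)" message format shown in the quoted log. The names ABORT_RE, aborted_methods and SAMPLE are made up for illustration and are not part of any kernel tooling.

```python
import re

# Matches the "Aborting method ... due to previous error (...)" lines
# in the format shown in the quoted log (an assumption; other kernel
# versions may word this differently).
ABORT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ACPI Error: Aborting method\s+(?P<method>\S+)"
    r"\s+due to previous error \((?P<code>AE_[A-Z_]+)\)"
)


def aborted_methods(log_text):
    """Return (timestamp, method path, error code) for each aborted AML method."""
    return [
        (float(m.group("ts")), m.group("method"), m.group("code"))
        for m in ABORT_RE.finditer(log_text)
    ]


# Sample input copied from the log quoted in this thread.
SAMPLE = r"""
[  686.883247] nouveau 0000:01:00.0: DRM: suspending object tree...
[  752.866484] ACPI Error: Aborting method \_SB.PCI0.PEG0.PEGP.NVPO due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
[  752.866508] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
[  752.866521] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
"""

if __name__ == "__main__":
    for ts, method, code in aborted_methods(SAMPLE):
        print(f"{ts:.6f} {method} {code}")
```

In practice one would feed it saved `dmesg` output rather than SAMPLE; lines that do not match the pattern (such as the nouveau message) are simply skipped.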