References: <20191017121901.13699-1-kherbst@redhat.com>
In-Reply-To: <20191017121901.13699-1-kherbst@redhat.com>
From: Karol Herbst
Date: Thu, 14 Nov 2019 20:17:30 +0100
Subject: Re: [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
To: LKML
Cc: Bjorn Helgaas, Lyude Paul, "Rafael J . Wysocki", Mika Westerberg, Linux PCI, Linux PM, dri-devel, nouveau
X-Mailing-List: linux-kernel@vger.kernel.org

ping on the patch. I wasn't able to verify this issue on any other bridge
controller, so it really might be only this one.

On Thu, Oct 17, 2019 at 2:19 PM Karol Herbst wrote:
>
> Fixes state transitions of Nvidia Pascal GPUs from D3cold into higher device
> states.
>
> v2: convert to pci_dev quirk
>     put a proper technical explanation of the issue as an in-code comment
> v3: disable it only for certain combinations of intel and nvidia hardware
> v4: simplify quirk by setting flag on the GPU itself
>
> Signed-off-by: Karol Herbst
> Cc: Bjorn Helgaas
> Cc: Lyude Paul
> Cc: Rafael J. Wysocki
> Cc: Mika Westerberg
> Cc: linux-pci@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> ---
>  drivers/pci/pci.c    |  7 ++++++
>  drivers/pci/quirks.c | 53 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pci.h  |  1 +
>  3 files changed, 61 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b97d9e10c9cc..02e71e0bcdd7 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -850,6 +850,13 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
>  	   || (state == PCI_D2 && !dev->d2_support))
>  		return -EIO;
>
> +	/*
> +	 * check if we have a bad combination of bridge controller and nvidia
> +	 * GPU, see quirk_broken_nv_runpm for more info
> +	 */
> +	if (state != PCI_D0 && dev->broken_nv_runpm)
> +		return 0;
> +
>  	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>
>  	/*
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44c4ae1abd00..0006c9e37b6f 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5268,3 +5268,56 @@ static void quirk_reset_lenovo_thinkpad_p50_nvgpu(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
>  			      PCI_CLASS_DISPLAY_VGA, 8,
>  			      quirk_reset_lenovo_thinkpad_p50_nvgpu);
> +
> +/*
> + * Some Intel PCIe bridges cause devices to disappear from the PCIe bus after
> + * those were put into D3cold state if they were put into a non-D0 PCI PM
> + * device state before doing so.
> + *
> + * This leads to various issues which all manifest differently,
> + * but have the same root cause:
> + *  - AML code execution hits an infinite loop (as the code waits on device
> + *    memory to change).
> + *  - kernel crashes, as all pci reads return -1, which most code isn't able
> + *    to handle well enough.
> + *  - sudden shutdowns, as the kernel identifies an unrecoverable error after
> + *    userspace tries to access the GPU.
> + *
> + * In all cases dmesg will contain at least one line like this:
> + * 'nouveau 0000:01:00.0: Refused to change power state, currently in D3'
> + * followed by a lot of nouveau timeouts.
> + *
> + * ACPI code writes bit 0x80 to the undocumented PCI register 0x248 of the
> + * PCIe bridge controller in order to power down the GPU.
> + * Nonetheless, there are other code paths inside the ACPI firmware which use
> + * other registers, which seem to work fine:
> + *  - 0xbc bit 0x20 (publicly available documentation claims 'reserved')
> + *  - 0xb0 bit 0x10 (link disable)
> + * Changing the conditions inside the firmware by poking into the relevant
> + * addresses does resolve the issue, but it seemed to be ACPI private memory
> + * and not any device accessible memory at all, so there is no portable way of
> + * changing the conditions.
> + *
> + * The only systems where this behavior can be seen are hybrid graphics laptops
> + * with a secondary Nvidia Pascal GPU. It is unclear whether this issue occurs
> + * only in combination with the listed Intel PCIe bridge controllers and the
> + * mentioned GPUs, or whether it is purely a hw bug in the bridge controller.
> + *
> + * But because this issue was NOT seen on laptops with an Nvidia Pascal GPU
> + * and an Intel Coffee Lake SoC, there is a higher chance of there being a bug
> + * in the bridge controller rather than in the GPU.
> + *
> + * This issue could not be reproduced on non-laptop systems.
> + */
> +
> +static void quirk_broken_nv_runpm(struct pci_dev *dev)
> +{
> +	struct pci_dev *bridge = pci_upstream_bridge(dev);
> +
> +	if (bridge->vendor == PCI_VENDOR_ID_INTEL &&
> +	    bridge->device == 0x1901)
> +		dev->broken_nv_runpm = 1;
> +}
> +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> +			      PCI_BASE_CLASS_DISPLAY, 16,
> +			      quirk_broken_nv_runpm);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index ac8a6c4e1792..903a0b3a39ec 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -416,6 +416,7 @@ struct pci_dev {
>  	unsigned int	__aer_firmware_first_valid:1;
>  	unsigned int	__aer_firmware_first:1;
>  	unsigned int	broken_intx_masking:1;	/* INTx masking can't be used */
> +	unsigned int	broken_nv_runpm:1;	/* some combinations of intel bridge controller and nvidia GPUs break rtd3 */
>  	unsigned int	io_window_1k:1;		/* Intel bridge 1K I/O windows */
>  	unsigned int	irq_managed:1;
>  	unsigned int	has_secondary_link:1;
> --
> 2.21.0
>
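
For anyone following the thread without a kernel tree at hand, here is a rough userspace sketch of how the two halves of the quoted patch interact: the quirk marks the GPU when its upstream bridge is the Intel 0x1901 device, and the power-state setter then skips any transition away from D0. The mock_* types, functions and MOCK_* constants are hypothetical stand-ins invented for this illustration and are not kernel definitions. The NULL check on the upstream bridge is an assumption of the sketch and is not part of the patch as posted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
enum mock_power_state { MOCK_D0, MOCK_D1, MOCK_D2, MOCK_D3HOT };

struct mock_pci_dev {
	unsigned short vendor;
	unsigned short device;
	struct mock_pci_dev *upstream;	/* NULL when there is no upstream bridge */
	enum mock_power_state current_state;
	unsigned int broken_nv_runpm:1;	/* same flag name as in the patch */
};

#define MOCK_VENDOR_ID_INTEL	0x8086	/* Intel's PCI vendor ID */
#define MOCK_BRIDGE_DEVICE_ID	0x1901	/* bridge device ID named in the patch */

/*
 * Models quirk_broken_nv_runpm(): flag the device if it sits behind the
 * affected bridge. The NULL check is an addition made by this sketch.
 */
static void mock_quirk_broken_nv_runpm(struct mock_pci_dev *dev)
{
	struct mock_pci_dev *bridge;

	assert(dev != NULL);
	bridge = dev->upstream;

	if (bridge && bridge->vendor == MOCK_VENDOR_ID_INTEL &&
	    bridge->device == MOCK_BRIDGE_DEVICE_ID)
		dev->broken_nv_runpm = 1;
}

/*
 * Models the early return added to pci_raw_set_power_state(): a flagged
 * device reports success for any non-D0 request but stays where it is.
 */
static int mock_set_power_state(struct mock_pci_dev *dev,
				enum mock_power_state state)
{
	assert(dev != NULL);

	if (state != MOCK_D0 && dev->broken_nv_runpm)
		return 0;

	dev->current_state = state;
	return 0;
}
```

In this model a flagged GPU that is asked to enter MOCK_D3HOT returns 0 and remains in MOCK_D0, while a device behind any other bridge changes state as requested. Requests for MOCK_D0 are always honoured, which matches the state != PCI_D0 condition in the patch.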