Received: by 10.223.185.116 with SMTP id b49csp993147wrg; Fri, 16 Feb 2018 10:27:19 -0800 (PST) X-Google-Smtp-Source: AH8x226omlqlXeTt7CqwRuv7ua0AkRHPGZ0GwqzBPV7u3iQnAWCejTB3NNtxNIx4NIvtsOTrxqeY X-Received: by 10.101.93.15 with SMTP id e15mr703812pgr.175.1518805639182; Fri, 16 Feb 2018 10:27:19 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1518805639; cv=none; d=google.com; s=arc-20160816; b=Qu8H3lR169ixyGKST/ZcNgKxA1UFpfuaTqLJN82UC9DTa6fBfS0MlWftRWK3HgoM3R W4+Ez9dk/S/FdYcNXDn7948VOmox0zWBiN2yMJT0rBkLtKE4+bJsk0Cvd2lX8V4R5L7N 40QEzGwh+oUqIkVmxrxD/gXvsNMjGe1oOSluTsmEpIeGqEKuG1WYsMSLK1m/rtwl6942 G8hbpISScwZg98+rl0CfDUGpBMp30p540hs69Z+Wv1SxFxpzYGSO8lsSjIYxkUEJRR3H cpi1B9hI3EPt7oT4Pas4R07Ruy12mGMMD9f3Iev8ek4OGMskG4DCRjg5SVSu7c3v9th9 Tr+Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-transfer-encoding:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dmarc-filter :arc-authentication-results; bh=wwch1ocD8KBM9yN526en3/si596vbwvw1JuJ5gw8rOk=; b=oWZ2lrjWNK9jJ3ZQH19FZPkToGtb/JBmDep0XdbwWCf086RjFFDGQY2I7LpVB+5zXG RllgQb8wBLUFvl6gHmSEbeZkidq9NDnWW1y/8VN93XYulAWsJ58YciP/314lnlQBef09 iZawpwQ4+oBmsPlGXzelVNq3Q52rD96QbsEz4jK8TcOOHctYTKgNImtwoGU9XR9BmEJH Vjp7sxAbDBN0TaEzxU5ju/atj4ZX95lW9ol+VMoSd76ibOC5rqJb0aOgXeYX3vj2PkR8 iXVDSJUYcqJex+9fR28KHhjkaT7709KX+93/d8RoFzzyywcs8oqPXce0UtkZjrhxRCe7 hINg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id u66si1760034pgc.651.2018.02.16.10.27.04; Fri, 16 Feb 2018 10:27:19 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1167363AbeBOXjH (ORCPT + 99 others); Thu, 15 Feb 2018 18:39:07 -0500 Received: from mail.kernel.org ([198.145.29.99]:58950 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1164340AbeBOXjF (ORCPT ); Thu, 15 Feb 2018 18:39:05 -0500 Received: from localhost (unknown [69.71.4.159]) (using TLSv1.2 with cipher DHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 7068821777; Thu, 15 Feb 2018 23:39:02 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7068821777 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=helgaas@kernel.org Date: Thu, 15 Feb 2018 17:39:00 -0600 From: Bjorn Helgaas To: "Rafael J. Wysocki" Cc: Mika Westerberg , George Cherian , George Cherian , linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, bhelgaas@google.com, Jayachandran.Nair@cavium.com, Robert.Richter@cavium.com, Lorenzo Pieralisi Subject: Re: [PATCH] PCI: Add quirk for Cavium Thunder-X2 PCIe erratum #173 Message-ID: <20180215233900.GA195653@bhelgaas-glaptop.roam.corp.google.com> References: <1517554846-16703-1-git-send-email-george.cherian@cavium.com> <20180214201653.GA62268@bhelgaas-glaptop.roam.corp.google.com> <26660034.hqiP3ND89J@aspire.rjw.lan> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <26660034.hqiP3ND89J@aspire.rjw.lan> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 15, 2018 at 10:57:25PM +0100, Rafael J. Wysocki wrote: > On Wednesday, February 14, 2018 9:16:53 PM CET Bjorn Helgaas wrote: > > On Wed, Feb 14, 2018 at 04:58:08PM +0530, George Cherian wrote: > > > On 02/13/2018 08:39 PM, Bjorn Helgaas wrote: > > > >On Fri, Feb 02, 2018 at 07:00:46AM +0000, George Cherian wrote: > > > >>The PCIe Controller on Cavium ThunderX2 processors does not > > > >>respond to downstream CFG/ECFG cycles when root port is > > > >>in power management D3-hot state. > > > > > > > >I think you're talking about the CPU initiating a config cycle to > > > >a device below the root port, right? > > > Yes > > > > If a bridge, e.g., a Root Port in your case, is in D3hot, we should be > > able to access config space of the bridge itself, but the secondary > > bus will be in B2 or B3 and we won't be able to access config space > > for any devices below the bridge. This is true for *all* bridges, not > > just this Cavium Root Port. > > Right. > > But AFAICS config space reads from devices that aren't there (which > effectively is what happens if the bridge is in D3hot) are at least > expected to return all ones. Yes. AIUI, the PCIe spec doesn't actually *require* all ones in this case, but the sec 2.3.2 implementation note says hardware intended for use with software (like Linux) that assumes the all ones behavior should synthesize all ones. So I'm speculating that this Cavium device is supposed to synthesize all ones, but does not. But from the discussion below, it sounds like this may have helped uncover a more serious Linux bug, i.e., we don't resume a device before trying to use it. > > The PCIe r4.0, sec 5.3.1, implementation note seems relevant: > > > > When a Type 1 Function associated with a Switch/Root Port (a > > "virtual bridge") is in a non-D0 power state, it will emulate the > > behavior of a conventional PCI bridge in its handling of Memory, > > I/O, and Configuration Requests and Completions. ... All Type 1 > > Configuration Requests are terminated as Unsupported Requests, ... > > > > > >This erratum isn't published anywhere where we could include a URL, > > > >is it? > > > This erratum is available at support.cavium.com, You might need to > > > register to the website to get hold of a copy. > > > > That appears to require an NDA, so that doesn't work for me. Can you > > just include the erratum text in the changelog? > > > > > >>In our tests the above mentioned errata causes the following crash when > > > >>the downstream endpoint config space is accessed, while root port is in > > > >>D3 state. > > > >> > > > >>[ 12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000 > > > > > > > >This looks like another example of not being able to deal with error > > > >responses to PCIe events, for example, the whole mess with drivers > > > >checking whether the link is up before issuing a config access. > > > > > > > >In that sense, this looks like a band-aid that avoids the issue by > > > >preventing the use of D3, but doesn't fix the underlying problem. If > > > >we *could* deal nicely with this error, maybe we could use D3 on these > > > >root ports. > > > > > > > >So I'm not convinced yet that this is actually a PCIe erratum. Does > > > >the hardware actually violate the PCIe spec, or is this just a problem > > > >with the kernel not knowing how to deal nicely with a PCIe error? > > > > > > > This is not an issue with the way kernel handles the PCIe error. > > > > > > >If you could include the erratum text and reference to the relevant > > > >section of the PCIe spec, that would be useful. > > > > > > > The relevant section of the PCIe Spec is the following Section 5.3 > > > on wards (subsection 5.3.1.4.1) > > > > > > Configuration and Message requests are the only TLPs accepted by a > > > Function in the D3hot state. All other received Requests must be > > > handled as Unsupported Requests, and all received Completions may > > > optionally be handled as Unexpected Completions. > > > > This isn't exactly relevant because it says requests *other than* > > config and message requests must be handled as Unsupported Requests, > > and we're talking about a config request. I think sec 5.3.1 is more > > to the point: the Root Port sees a Type 1 Configuration Request that > > would be forwarded to its secondary interface if the port were in D0, > > and that request should be terminated as an Unsupported Request. > > > > I think the question is how the Root Complex turns this Unsupported > > Request into some signal to the CPU. The implementation note in 2.3.2 > > might be relevant: > > > > Some system configuration software depends on reading a data value > > of all 1’s when a Configuration Read Request is terminated as an > > Unsupported Request, particularly when probing to determine the > > existence of a device in the system. A Root Complex intended for use > > with software that depends on a read-data value of all 1’s must > > synthesize this value when UR Completion Status is returned for a > > Configuration Read Request. > > > > So maybe the erratum is that the RC was intended to synthesize all 1's > > but it doesn't? > > > > There are other cases that can result in Unsupported Request > > completions, so my fear is that this change will take care of one such > > case but leave others that will cause similar unhandled external > > aborts. > > > > > >>[ 12.775202] Unhandled fault: synchronous external abort (0x96000610) at 0x0000000000000000 > > > > > >>[ 12.813518] PC is at pci_generic_config_read+0x5c/0xf0 > > > >>[ 12.818643] LR is at pci_generic_config_read+0x48/0xf0 > > > >> ... > > > >>[ 13.269819] [] pci_generic_config_read+0x5c/0xf0 > > > >>[ 13.275987] [] pci_bus_read_config_dword+0xb4/0xd8 > > > >>[ 13.282328] [] pcie_capability_read_dword+0x64/0xb8 > > > >>[ 13.288757] [] __pci_dev_reset+0x90/0x328 > > > >>[ 13.294317] [] pci_probe_reset_function+0x24/0x30 > > > >>[ 13.300571] [] pci_create_sysfs_dev_files+0x18c/0x2a0 > > > >>[ 13.307173] [] pci_sysfs_init+0x38/0x60 > > > >>[ 13.312560] [] do_one_initcall+0x5c/0x170 > > > >>[ 13.318122] [] kernel_init_freeable+0x1c0/0x27c > > > >>[ 13.324205] [] kernel_init+0x18/0x110 > > > >>[ 13.329416] [] ret_from_fork+0x10/0x40 > > > > I should have asked this before: why are we even trying to do this > > config read to a device that's in D3? I assume this root port started > > out in D0 because we apparently enumerated it successfully, but it > > must have been put into D3 later? > > > > The pci_probe_reset_function() function comment says "the PCI device > > must be responsive to PCI config space in order to use this function." > > So apparently the caller should make sure the device is in at least > > D3hot and any bridges leading to it are in D0. > > > > If we're missing whatever it is that makes sure the target device is > > reachable and responsive to config space, we likely have issues on > > other systems as well. On Cavium this causes the external abort, but > > on other systems, we'd probably just get all 1's data back from the > > config read and silently do the wrong thing in > > pci_probe_reset_function(). > > > > I don't know how this runtime PM works, but maybe Rafael can help us > > out. > > I'm not sure what the question is to be honest. My questions are basically "What does the PCI core need to do to make sure a device is in D0 before it operates on it? And where do we need to do that?" > Unused ports are autosuspended after 100ms as per pcie_portdrv_probe(). So I guess this is only done for PCIe bridges, not for conventional PCI bridges. Is this because autosuspend depends on a PCIe-only feature? Here's the path we're in now: pcie_portdrv_init # device_initcall (level 6) pci_register_driver(&pcie_portdriver) ... pcie_portdrv_probe # pcie_portdriver.probe pm_runtime_set_autosuspend_delay(&dev->dev, 100) pci_sysfs_init # late_initcall (level 7) for_each_pci_dev(dev) pci_create_sysfs_dev_files(dev) pci_create_capabilities_sysfs(dev) pci_probe_reset_function(dev) ... pcie_capability_read_dword <-- config read causing abort I guess it's conceivable that more than 100ms elapsed between pcie_portdrv_probe() and pci_probe_reset_function(), and obviously we can't depend on any particular timing. > Now, it looks like they should be resumed in the code path in question, > so this case seems to have been overlooked. How do we identify which code paths need to resume the device, and what do we need to do to resume it? I'm kind of scared about this because most hardware just returns all ones in this case and we might never notice. Bjorn