Received: by 2002:ac0:a5a7:0:0:0:0:0 with SMTP id m36-v6csp2359853imm; Thu, 9 Aug 2018 11:30:13 -0700 (PDT) X-Google-Smtp-Source: AA+uWPwLkV76vPLzFvM6Wo5/3AhEufXHu6cAIVznKvC5asfi9yoRVX+dge0dRIs9GlGyv2P4NYpL X-Received: by 2002:a17:902:585:: with SMTP id f5-v6mr3079578plf.7.1533839413152; Thu, 09 Aug 2018 11:30:13 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1533839413; cv=none; d=google.com; s=arc-20160816; b=grLQLCvnaHHhDvj6HH2Ab5VpHLEAn1TUh564YQOsltxPmnLhdI2LnIf4ZarnTw9xc5 6SWxRKPaYBSoWdzsJpn6ClP3sJ3mnehaRv33e+Iik2zfZHPSgCgvRnYR/nXGYIiLOg0c XCwo2ivf5F+Y3eOiOU5VtK5Yz/CYgmBb5Zhi6fgAfYSrWn1J4HYzd47t2JfUHuZhFBjN NwKHgwIhkCp0wvu+YAtNdxYyb+tL1QRywi0PI9AkNLhjtyJK2udVcektG6ybgQjzTKR6 HEqJy1mrjsDbuIozByT7INBpvMmqw9ckg6dX7p3kUjt+UYx0MbWpy6L6KMbwlvm9gPaq q5yQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature:arc-authentication-results; bh=gjqT3Dd97t3ZUs0RuACkYVNCIN/F4/nvZKFp2hKgW+A=; b=wawivi3omlGetw82KAZEtVSyBjEIwDMcZfIwH9WFX5yX1A7wsRIxXbwLpgml3sak9O 8tRqhEWxoZeE+b71On8AemZtcXcdBwUwOBKKWvIo4ZmEJlHSPAtAb6fC1XkTJTIRiUHU xysSmMqFZzentoIX55vqh1VNICxz6msCnsY3plb1idRBRPy4x0AbYi/oxN4wNpRryUmA x7Fl73bNHgLXJrkZXrz2HTIvitU2hZGkCOKGUYyDS/c1J/DPvQvHPtuJmqQe6ULWCkLw r8k9KZUJwtzy5MT+fxmfj79yE7wCxEUvUxzQ4Fnz62htreoOFI59Wei5/aSm6bp7nr+Z lLzg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=Jb7LvkCp; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id w6-v6si7559814pgb.61.2018.08.09.11.29.57; Thu, 09 Aug 2018 11:30:13 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=Jb7LvkCp; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726975AbeHIUzO (ORCPT + 99 others); Thu, 9 Aug 2018 16:55:14 -0400 Received: from mail.kernel.org ([198.145.29.99]:45072 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726310AbeHIUzO (ORCPT ); Thu, 9 Aug 2018 16:55:14 -0400 Received: from localhost (unknown [69.71.4.100]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 7DC0E21EB7; Thu, 9 Aug 2018 18:29:08 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1533839348; bh=NsVhe85vpgN2sJ+m0U64EGpsyo0pEg1RQHdGo49WfOM=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=Jb7LvkCpR1g2nZI/lqw++/ErZ2/i/6t+pkWo5x7y4alnwy4XWFBmSYtSjKm7edJ/n bU+BF7ZCKUIvcyldtFRtgwkOfcQCLKZOck659YbRBljEr9OEGaz4Y+7Z3n5n9jx1j6 tkJIJPAhMKzEaDvAqQw7vkQjgbowKDUxnhZ3CUL8= Date: Thu, 9 Aug 2018 13:29:05 -0500 From: Bjorn Helgaas To: Alex_Gagniuc@Dellteam.com Cc: mr.nuke.me@gmail.com, bhelgaas@google.com, keith.busch@intel.com, Austin.Bolen@dell.com, Shyam.Iyer@dell.com, fred@fredlawl.com, poza@codeaurora.org, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER Message-ID: <20180809182905.GA113140@bhelgaas-glaptop.roam.corp.google.com> References: <20180717153135.25925-1-mr.nuke.me@gmail.com> <20180809141551.GH49411@bhelgaas-glaptop.roam.corp.google.com> <2cae6a5ac8324be18b8dcf3d7dfcc288@ausx13mps321.AMER.DELL.COM> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <2cae6a5ac8324be18b8dcf3d7dfcc288@ausx13mps321.AMER.DELL.COM> User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Aug 09, 2018 at 04:46:32PM +0000, Alex_Gagniuc@Dellteam.com wrote: > On 08/09/2018 09:16 AM, Bjorn Helgaas wrote: > > On Tue, Jul 17, 2018 at 10:31:23AM -0500, Alexandru Gagniuc wrote: > >> When we don't own AER, we shouldn't touch the AER error bits. This > >> happens unconditionally on device probe(). Clearing AER bits > >> willy-nilly might cause firmware to miss errors. Instead > >> these bits should get cleared by FFS, or via ACPI _HPX method. > >> > >> This race is mostly of theoretical significance, as it is not easy to > >> reasonably demonstrate it in testing. > >> > >> Signed-off-by: Alexandru Gagniuc > >> --- > >> drivers/pci/pcie/aer.c | 3 +++ > >> 1 file changed, 3 insertions(+) > >> > >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > >> index a2e88386af28..18037a2a8231 100644 > >> --- a/drivers/pci/pcie/aer.c > >> +++ b/drivers/pci/pcie/aer.c > >> @@ -383,6 +383,9 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev) > >> if (!pci_is_pcie(dev)) > >> return -ENODEV; > >> > >> + if (pcie_aer_get_firmware_first(dev)) > >> + return -EIO; > > > > I like this patch. > > > > Do we need the same thing in the following places that also clear AER > > status bits or write AER control bits? > > In theory, every exported function would guard for this. I think the > idea a long long time ago was that the check happens during > initialization, and the others are not hit. > > > enable_ecrc_checking() > > disable_ecrc_checking() > > I don't immediately see how this would affect FFS, but the bits are part > of the AER capability structure. According to the FFS model, those would > be owned by FW, and we'd have to avoid touching them. Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root Ports that contain the FIRMWARE_FIRST flag as well as values the OS is supposed to write to several AER capability registers. It looks like we currently ignore everything except the FIRMWARE_FIRST and GLOBAL flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux). That seems like a pretty major screwup and more than I want to fix right now. > > pci_cleanup_aer_uncorrect_error_status() > > This probably should be guarded. It's only called from a few specific > drivers, so the impact is not as high as being called from the core. > > > pci_aer_clear_fatal_status() > > This is only called when doing fatal_recovery, right? True. It takes a lot of analysis to convince oneself that this is not used in the firmware-first path, so I think we should add a guard there. > For practical considerations this is not an issue today. The ACPI error > handling code currently crashes when it encounters any fatal error, so > we wouldn't hit this in the FFS case. I wasn't aware the firmware-first path was *that* broken. Are there problem reports for this? Is this a regression? > The PCIe standards contact I usually talk to about these PCIe subtleties > is currently on vacation. The number one issue was a FFS corner case > with OS clearing bits on probe. The other functions you mention are a > corner case of a corner case. The big fish is > pci_cleanup_aer_error_status_regs() on probe(), and it would be nice to > have that resolved. > > I'll sync up with Austin when he gets back to see about the other > functions though I suspect we'll end up fixing them as well. I'd like to fix all the obvious cases at once (excluding the ECRC stuff). What do you think about the following patch? commit 15ed68dcc26864c849a12a36db4d4771bad7991f Author: Alexandru Gagniuc Date: Tue Jul 17 10:31:23 2018 -0500 PCI/AER: Don't clear AER bits if error handling is Firmware-First If the platform requests Firmware-First error handling, firmware is responsible for reading and clearing AER status bits. If OSPM also clears them, we may miss errors. See ACPI v6.2, sec 18.3.2.5 and 18.4. This race is mostly of theoretical significance, as it is not easy to reasonably demonstrate it in testing. Signed-off-by: Alexandru Gagniuc [bhelgaas: add similar guards to pci_cleanup_aer_uncorrect_error_status() and pci_aer_clear_fatal_status()] Signed-off-by: Bjorn Helgaas diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index c6cc855bfa22..4e823ae051a7 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -397,6 +397,9 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev) if (!pos) return -EIO; + if (pcie_aer_get_firmware_first(dev)) + return -EIO; + /* Clear status bits for ERR_NONFATAL errors only */ pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status); pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &sev); @@ -417,6 +420,9 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev) if (!pos) return; + if (pcie_aer_get_firmware_first(dev)) + return; + /* Clear status bits for ERR_FATAL errors only */ pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status); pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &sev); @@ -438,6 +444,9 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev) if (!pos) return -EIO; + if (pcie_aer_get_firmware_first(dev)) + return -EIO; + port_type = pci_pcie_type(dev); if (port_type == PCI_EXP_TYPE_ROOT_PORT) { pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);