Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752880AbdHOWnD (ORCPT ); Tue, 15 Aug 2017 18:43:03 -0400 Received: from mail.kernel.org ([198.145.29.99]:41816 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752841AbdHOWnA (ORCPT ); Tue, 15 Aug 2017 18:43:00 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8FE4422B4C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=helgaas@kernel.org Date: Tue, 15 Aug 2017 17:42:56 -0500 From: Bjorn Helgaas To: Keith Busch Cc: linux-pci@vger.kernel.org, Bjorn Helgaas , linux-kernel@vger.kernel.org, stable@vger.kernel.org, Mayurkumar Patel Subject: Re: [PATCH] pciehp: Fix infinite interupt handler loop Message-ID: <20170815224256.GR32525@bhelgaas-glaptop.roam.corp.google.com> References: <1501571512-8362-1-git-send-email-keith.busch@intel.com> <20170814205948.GF32525@bhelgaas-glaptop.roam.corp.google.com> <20170814221123.GI7233@localhost.localdomain> <20170815204825.GL32525@bhelgaas-glaptop.roam.corp.google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20170815204825.GL32525@bhelgaas-glaptop.roam.corp.google.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5502 Lines: 119 On Tue, Aug 15, 2017 at 03:48:25PM -0500, Bjorn Helgaas wrote: > On Mon, Aug 14, 2017 at 06:11:23PM -0400, Keith Busch wrote: > > On Mon, Aug 14, 2017 at 03:59:48PM -0500, Bjorn Helgaas wrote: > > > On Tue, Aug 01, 2017 at 03:11:52AM -0400, Keith Busch wrote: > > > > We've encountered a particular platform that under some circumstances > > > > always has the power fault detected status raised. The pciehp irq handler > > > > would loop forever because it thinks it is handling new events when in > > > > fact the power fault is not new. This patch fixes that by masking off > > > > the power fault status from new events if the driver hasn't seen the > > > > power fault clear from the previous handling attempt. > > > > > > Can you say which platform this is? If this is a hardware defect, > > > it'd be interesting to know where it happens. > > > > > > But I'm not sure we handle PCI_EXP_SLTSTA correctly. We basically > > > have this: > > > > > > pciehp_isr() > > > { > > > pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status); > > > events = status & (); > > > pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); > > > > > > } > > > > > > The write to PCI_EXP_SLTSTA clears PCI_EXP_SLTSTA_PFD because it's > > > RW1C. But we haven't done anything that would actually change the > > > situation that caused a power fault, so I don't think it would be > > > surprising if the hardware immediately reasserted it. > > > > > > So maybe this continual assertion of power fault is really a software > > > bug, not a hardware problem? > > > > I *think* it's a software bug for the exact reason you provided, but I'm > > sure it must be isolated to certain conditions with certain hardware. We'd > > have heard about this regression during 4.9 if it was more wide-spread. > > OK, I think this makes pretty good sense. > > > > > Fixes: fad214b0aa72 ("PCI: pciehp: Process all hotplug events before > > > > looking for new ones") > > > This is on a PEX8733 bridge, and it reports power fault detected status as > > long as the power fault exists. While we can write 1 to clear the event, > > that just rearms the port to retrigger power fault detected status for as > > long as the power controller detects its faulted. The status is cleared > > for good only when the power fault no longer exists rather than when > > it is acknowledged. The spec seems to support that view (Table (7-21: > > Slot Status Register): > > > > Power Fault Detected – If a Power Controller that supports power fault > > detection is implemented, this bit is Set when the Power Controller > > detects a power fault at this slot. > > Interesting: 5651c48cfafe ("PCI pciehp: fix power fault interrupt > storm problem") turned off PCI_EXP_SLTCTL_PFDE in 2009 for basically > the same problem. > > AFAICS, we *still* never turn PCI_EXP_SLTCTL_PFDE on, so in your case, > you're probably taking an interrupt for some other reason, and > coincidentally noticing that PCI_EXP_SLTSTA_PFD is set. > > Maybe it's time to turn PCI_EXP_SLTCTL_PFDE back on along with your > fix to prevent the loop? It seems like kind of a hole that we will > never notice power faults unless some other interrupt happens. Or am > I missing something? > > I'd like to move your PCI_EXP_SLTSTA_PFD out on its own so there's > room for a comment. What do you think of the following? I really > don't think this is specific to a particular platform (other than > maybe timing-wise), so I dropped that from the changelog too. > > > commit 7612b3b28c0b900dcbcdf5e9b9747cc20a1e2455 > Author: Keith Busch > Date: Tue Aug 1 03:11:52 2017 -0400 > > PCI: pciehp: Report power fault only once until we clear it > > When a power fault occurs, the power controller sets Power Fault Detected > in the Slot Status register, and pciehp_isr() queues an INT_POWER_FAULT > event to handle it. > > It also clears Power Fault Detected, but since nothing has yet changed to > correct the power fault, the power controller will likely set it again > immediately, which may cause an infinite loop when pcie_isr() rechecks > Slot Status. > > Fix that by masking off Power Fault Detected from new events if the driver > hasn't seen the power fault clear from the previous handling attempt. > > Fixes: fad214b0aa72 ("PCI: pciehp: Process all hotplug events before looking for new ones") > Signed-off-by: Keith Busch > [bhelgaas: changelog, pull test out and add comment] > Signed-off-by: Bjorn Helgaas > Cc: Mayurkumar Patel > Cc: stable@vger.kernel.org # 4.9+ > > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c > index 026830a138ae..e5d5ce9e3010 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -586,6 +586,14 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | > PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | > PCI_EXP_SLTSTA_DLLSC); > + > + /* > + * If we've already reported a power fault, don't report it again > + * until we've done something to handle it. > + */ > + if (ctrl->power_fault_detected) > + events &= ~PCI_EXP_SLTSTA_PFD; > + > if (!events) > return IRQ_NONE; > This is on pci/hotplug for v4.14.