2013-06-06 18:15:50

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH v3 0/6] PCI/ACPI: Fix firmware first error recovery with root port in reset

This patch set fixes a bug on platforms that use firmware first AER.
Firmware can leave the root port in Secondary Bus Reset (SBR) and
communicate this to the OS through the "reset" bit in the flags field
of the HEST table and associated CPER records. Firmware wants to do this
so that the error is contained and the hardware is in a known state.

Without these patches, the root port stays in SBR and the device drivers
cannot recover. These patches recognize when the firmware first root port
is in SBR and bring the root port out of SBR so the devices under the root
port can recover.

The changes have been tested on systems with firmware first that set the
"reset" bit by injecting various hardware errors. The errors successfully
recover.

Changes since v1:
Fixed a typo in the comment of patch 2.
Removed incorrect setting of reset bit in patch 3.

Changes since v2:
The v2 patch 1/3 was re-written by Bjorn Helgaas and is now patches 1/6
through 3/6.
The v2 patch 2/3 is now 5/6 and changed to directly use the AER_FATAL define
and introduced patch 4/6 to move the defines to a public header file.
The v2 patch 3/3 is now 6/6 and uses the same default reset link function for
both Downstream Ports and Root Ports.

Signed-off-by: Betty Dall <[email protected]>
---
Betty Dall (6):
PCI/AER: Don't parse HEST table for non-PCIe devices
PCI/AER: Factor out HEST device type matching
PCI/AER: Set dev->__aer_firmware_first only for matching devices
PCI/ACPI: Move AER severity defines to aer.h
ACPI/APEI: Force fatal AER severity when bus has been reset
PCI/AER: Provide reset_link for firmware first root port
---
drivers/acpi/apei/ghes.c | 10 +++++++
drivers/pci/pcie/aer/aerdrv.h | 4 ---
drivers/pci/pcie/aer/aerdrv_acpi.c | 47 ++++++++++++++++++-----------------
drivers/pci/pcie/aer/aerdrv_core.c | 17 +++++++------
include/linux/aer.h | 16 +++++++----
5 files changed, 53 insertions(+), 41 deletions(-)


2013-06-06 18:15:54

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH v3 2/6] PCI/AER: Factor out HEST device type matching

This factors out the matching of HEST structure type and PCIe device type
to improve readability. No functional change.

Signed-off-by: Bjorn Helgaas <[email protected]>
Tested-by: Betty Dall <[email protected]>
---

drivers/pci/pcie/aer/aerdrv_acpi.c | 35 +++++++++++++++++------------------
1 files changed, 17 insertions(+), 18 deletions(-)


diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
index 4f798ab..e7ea573 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -29,6 +29,22 @@ static inline int hest_match_pci(struct acpi_hest_aer_common *p,
p->function == PCI_FUNC(pci->devfn));
}

+static inline bool hest_match_type(struct acpi_hest_header *hest_hdr,
+ struct pci_dev *dev)
+{
+ u16 hest_type = hest_hdr->type;
+ u8 pcie_type = pci_pcie_type(dev);
+
+ if ((hest_type == ACPI_HEST_TYPE_AER_ROOT_PORT &&
+ pcie_type == PCI_EXP_TYPE_ROOT_PORT) ||
+ (hest_type == ACPI_HEST_TYPE_AER_ENDPOINT &&
+ pcie_type == PCI_EXP_TYPE_ENDPOINT) ||
+ (hest_type == ACPI_HEST_TYPE_AER_BRIDGE &&
+ (dev->class >> 16) == PCI_BASE_CLASS_BRIDGE))
+ return true;
+ return false;
+}
+
struct aer_hest_parse_info {
struct pci_dev *pci_dev;
int firmware_first;
@@ -38,28 +54,11 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
{
struct aer_hest_parse_info *info = data;
struct acpi_hest_aer_common *p;
- u8 pcie_type = 0;
- u8 bridge = 0;
int ff = 0;

- switch (hest_hdr->type) {
- case ACPI_HEST_TYPE_AER_ROOT_PORT:
- pcie_type = PCI_EXP_TYPE_ROOT_PORT;
- break;
- case ACPI_HEST_TYPE_AER_ENDPOINT:
- pcie_type = PCI_EXP_TYPE_ENDPOINT;
- break;
- case ACPI_HEST_TYPE_AER_BRIDGE:
- if ((info->pci_dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
- bridge = 1;
- break;
- default:
- return 0;
- }
-
p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
if (p->flags & ACPI_HEST_GLOBAL) {
- if ((pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
+ if (hest_match_type(hest_hdr, info->pci_dev))
ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
} else
if (hest_match_pci(p, info->pci_dev))

2013-06-06 18:15:58

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH v2 4/6] PCI/ACPI: Move AER severity defines to aer.h

The function aer_recover_queue() is a public interface and the
severity argument uses #defines that are in the private header
pci/pcie/aer/aerdrv.h.

This patch moves the #defines from pci/pcie/aer/aerdrv.h to
include/linux/aer.h.

Signed-off-by: Betty Dall <[email protected]>
---

drivers/pci/pcie/aer/aerdrv.h | 4 ----
include/linux/aer.h | 16 ++++++++++------
2 files changed, 10 insertions(+), 10 deletions(-)


diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index d12c77c..90ea3e8 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -13,10 +13,6 @@
#include <linux/aer.h>
#include <linux/interrupt.h>

-#define AER_NONFATAL 0
-#define AER_FATAL 1
-#define AER_CORRECTABLE 2
-
#define SYSTEM_ERROR_INTR_ON_MESG_MASK (PCI_EXP_RTCTL_SECEE| \
PCI_EXP_RTCTL_SENFEE| \
PCI_EXP_RTCTL_SEFEE)
diff --git a/include/linux/aer.h b/include/linux/aer.h
index ec10e1b..0f6a9e2 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -7,6 +7,10 @@
#ifndef _AER_H_
#define _AER_H_

+#define AER_NONFATAL 0
+#define AER_FATAL 1
+#define AER_CORRECTABLE 2
+
struct aer_header_log_regs {
unsigned int dw0;
unsigned int dw1;
@@ -31,9 +35,9 @@ struct aer_capability_regs {

#if defined(CONFIG_PCIEAER)
/* pci-e port driver needs this function to enable aer */
-extern int pci_enable_pcie_error_reporting(struct pci_dev *dev);
-extern int pci_disable_pcie_error_reporting(struct pci_dev *dev);
-extern int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
+int pci_enable_pcie_error_reporting(struct pci_dev *dev);
+int pci_disable_pcie_error_reporting(struct pci_dev *dev);
+int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
#else
static inline int pci_enable_pcie_error_reporting(struct pci_dev *dev)
{
@@ -49,10 +53,10 @@ static inline int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
}
#endif

-extern void cper_print_aer(const char *prefix, struct pci_dev *dev,
+void cper_print_aer(const char *prefix, struct pci_dev *dev,
int cper_severity, struct aer_capability_regs *aer);
-extern int cper_severity_to_aer(int cper_severity);
-extern void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
+int cper_severity_to_aer(int cper_severity);
+void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
int severity);
#endif //_AER_H_

2013-06-06 18:16:03

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH v3 6/6] PCI/AER: Provide reset_link for firmware first root port

The firmware first path does not install the aerdrv root port
service driver, so the firmware first root port does not have
a reset_link callback. When a firmware first root port does not have
a reset_link callback, use a new default reset_link similar to what
we already do for the default_downstream_reset_link(). The firmware
first default reset_link brings the root port out of SBR if firmware
left it in SBR.

Changes since v1:
Removed incorrect setting of p2p_ctrl after the read.

Changes since v2:
Treat the default case for a Root Port the same as the existing
default case for a Downstream Link.

Signed-off-by: Betty Dall <[email protected]>
---

drivers/pci/pcie/aer/aerdrv_core.c | 17 +++++++++--------
1 files changed, 9 insertions(+), 8 deletions(-)


diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 8ec8b4f..cb29c04 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -400,16 +400,16 @@ void aer_do_secondary_bus_reset(struct pci_dev *dev)
}

/**
- * default_downstream_reset_link - default reset function for Downstream Port
- * @dev: pointer to downstream port's pci_dev data structure
+ * default_reset_link - default reset function
+ * @dev: pointer to pci_dev data structure
*
- * Invoked when performing link reset at Downstream Port w/ no aer driver.
+ * Invoked when performing link reset on a Downstream Port or a
+ * Root Port with no aer driver.
*/
-static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
+static pci_ers_result_t default_reset_link(struct pci_dev *dev)
{
aer_do_secondary_bus_reset(dev);
- dev_printk(KERN_DEBUG, &dev->dev,
- "Downstream Port link has been reset\n");
+ dev_dbg(&dev->dev, "Link has been reset\n");
return PCI_ERS_RESULT_RECOVERED;
}

@@ -458,8 +458,9 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)

if (driver && driver->reset_link) {
status = driver->reset_link(udev);
- } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM) {
- status = default_downstream_reset_link(udev);
+ } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM ||
+ pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT) {
+ status = default_reset_link(udev);
} else {
dev_printk(KERN_DEBUG, &dev->dev,
"no link-reset support at upstream device %s\n",

2013-06-06 18:16:45

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH v3 5/6] ACPI/APEI: Force fatal AER severity when bus has been reset

The CPER error record has a reset bit that indicates that the platform
has reset the bus. The reset bit can be set for any severity error
including recoverable. From the AER code path's perspective,
any error is fatal if the bus has been reset. This patch upgrades the
severity of the AER recovery to AER_FATAL whenever the CPER error record
indicates that the bus has been reset.

Changes since v1:
Fixed a typo in comment.

Changes since v2:
Set the aer_severity to AER_FATAL rather than using cper_severity_to_aer().
Simplified the comment.

Signed-off-by: Betty Dall <[email protected]>
---

drivers/acpi/apei/ghes.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)


diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d668a8a..ab31551 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -449,9 +449,19 @@ static void ghes_do_proc(struct ghes *ghes,
pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
unsigned int devfn;
int aer_severity;
+
devfn = PCI_DEVFN(pcie_err->device_id.device,
pcie_err->device_id.function);
aer_severity = cper_severity_to_aer(sev);
+
+ /*
+ * If firmware reset the component to contain
+ * the error, we must reinitialize it before
+ * use, so treat it as a fatal AER error.
+ */
+ if (gdata->flags & CPER_SEC_RESET)
+ aer_severity = AER_FATAL;
+
aer_recover_queue(pcie_err->device_id.segment,
pcie_err->device_id.bus,
devfn, aer_severity);

2013-06-06 18:17:04

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH v3 3/6] PCI/AER: Set dev->__aer_firmware_first only for matching devices

Previously, we always updated info->firmware_first, even for HEST entries
that didn't match the device. Therefore, if the last HEST descriptor was
a PCIe structure that didn't match the device, we always cleared
dev->__aer_firmware_first.

Signed-off-by: Bjorn Helgaas <[email protected]>
Tested-by: Betty Dall <[email protected]>
---

drivers/pci/pcie/aer/aerdrv_acpi.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)


diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
index e7ea573..cf611ab 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -54,16 +54,16 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
{
struct aer_hest_parse_info *info = data;
struct acpi_hest_aer_common *p;
- int ff = 0;
+ int ff;

p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
+ ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
if (p->flags & ACPI_HEST_GLOBAL) {
if (hest_match_type(hest_hdr, info->pci_dev))
- ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
+ info->firmware_first = ff;
} else
if (hest_match_pci(p, info->pci_dev))
- ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
- info->firmware_first = ff;
+ info->firmware_first = ff;

return 0;
}

2013-06-06 18:15:52

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH v3 1/6] PCI/AER: Don't parse HEST table for non-PCIe devices

AER is a PCIe-only capability, so there's no point in trying to match
a HEST PCIe structure with a non-PCIe device.

Previously, a HEST global AER bridge entry (type 8) could incorrectly
match *any* bridge, even a legacy PCI-PCI bridge, and a non-global
HEST entry could match a legacy PCI device.

Signed-off-by: Bjorn Helgaas <[email protected]>
Tested-by: Betty Dall <[email protected]>
---

drivers/pci/pcie/aer/aerdrv_acpi.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)


diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
index 5194a7d..4f798ab 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -59,8 +59,7 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)

p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
if (p->flags & ACPI_HEST_GLOBAL) {
- if ((pci_is_pcie(info->pci_dev) &&
- pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
+ if ((pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
} else
if (hest_match_pci(p, info->pci_dev))
@@ -89,6 +88,9 @@ static void aer_set_firmware_first(struct pci_dev *pci_dev)

int pcie_aer_get_firmware_first(struct pci_dev *dev)
{
+ if (!pci_is_pcie(dev))
+ return 0;
+
if (!dev->__aer_firmware_first_valid)
aer_set_firmware_first(dev);
return dev->__aer_firmware_first;

2013-06-06 18:23:10

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] PCI/ACPI: Move AER severity defines to aer.h

On Thu, 2013-06-06 at 12:10 -0600, Betty Dall wrote:
> The function aer_recover_queue() is a public interface and the
> severity argument uses #defines that are in the private header
> pci/pcie/aer/aerdrv.h.
>
> This patch moves the #defines from pci/pcie/aer/aerdrv.h to
> include/linux/aer.h.
[]
> diff --git a/include/linux/aer.h b/include/linux/aer.h
[]
> -extern void cper_print_aer(const char *prefix, struct pci_dev *dev,
> +void cper_print_aer(const char *prefix, struct pci_dev *dev,
> int cper_severity, struct aer_capability_regs *aer);

Can you please also realign the arguments on subsequent
lines to the open parenthesis of the first line and reflow
then to 80 cols when appropriate?

> -extern void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
> +void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
> int severity);

2013-06-06 19:28:59

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] PCI/ACPI: Move AER severity defines to aer.h

On Thu, Jun 6, 2013 at 12:23 PM, Joe Perches <[email protected]> wrote:
> On Thu, 2013-06-06 at 12:10 -0600, Betty Dall wrote:
>> The function aer_recover_queue() is a public interface and the
>> severity argument uses #defines that are in the private header
>> pci/pcie/aer/aerdrv.h.
>>
>> This patch moves the #defines from pci/pcie/aer/aerdrv.h to
>> include/linux/aer.h.
> []
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
> []
>> -extern void cper_print_aer(const char *prefix, struct pci_dev *dev,
>> +void cper_print_aer(const char *prefix, struct pci_dev *dev,
>> int cper_severity, struct aer_capability_regs *aer);
>
> Can you please also realign the arguments on subsequent
> lines to the open parenthesis of the first line and reflow
> then to 80 cols when appropriate?

I can do this when I apply them, so don't bother reposting unless you
have more substantive changes to make.

>> -extern void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>> +void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>> int severity);
>
>

2013-06-06 20:10:22

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] PCI/ACPI: Move AER severity defines to aer.h

On Thursday, June 06, 2013 01:28:28 PM Bjorn Helgaas wrote:
> On Thu, Jun 6, 2013 at 12:23 PM, Joe Perches <[email protected]> wrote:
> > On Thu, 2013-06-06 at 12:10 -0600, Betty Dall wrote:
> >> The function aer_recover_queue() is a public interface and the
> >> severity argument uses #defines that are in the private header
> >> pci/pcie/aer/aerdrv.h.
> >>
> >> This patch moves the #defines from pci/pcie/aer/aerdrv.h to
> >> include/linux/aer.h.
> > []
> >> diff --git a/include/linux/aer.h b/include/linux/aer.h
> > []
> >> -extern void cper_print_aer(const char *prefix, struct pci_dev *dev,
> >> +void cper_print_aer(const char *prefix, struct pci_dev *dev,
> >> int cper_severity, struct aer_capability_regs *aer);
> >
> > Can you please also realign the arguments on subsequent
> > lines to the open parenthesis of the first line and reflow
> > then to 80 cols when appropriate?
>
> I can do this when I apply them, so don't bother reposting unless you
> have more substantive changes to make.

For the record, I have no objections against this patchset.

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2013-06-06 21:05:17

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] PCI/ACPI: Fix firmware first error recovery with root port in reset

On Thu, Jun 6, 2013 at 12:10 PM, Betty Dall <[email protected]> wrote:
> This patch set fixes a bug on platforms that use firmware first AER.
> Firmware can leave the root port in Secondary Bus Reset (SBR) and
> communicate this to the OS through the "reset" bit in the flags field
> of the HEST table and associated CPER records. Firmware wants to do this
> so that the error is contained and the hardware is in a known state.
>
> Without these patches, the root port stays in SBR and the device drivers
> cannot recover. These patches recognize when the firmware first root port
> is in SBR and bring the root port out of SBR so the devices under the root
> port can recover.
>
> The changes have been tested on systems with firmware first that set the
> "reset" bit by injecting various hardware errors. The errors successfully
> recover.
>
> Changes since v1:
> Fixed a typo in the comment of patch 2.
> Removed incorrect setting of reset bit in patch 3.
>
> Changes since v2:
> The v2 patch 1/3 was re-written by Bjorn Helgaas and is now patches 1/6
> through 3/6.
> The v2 patch 2/3 is now 5/6 and changed to directly use the AER_FATAL define
> and introduced patch 4/6 to move the defines to a public header file.
> The v2 patch 3/3 is now 6/6 and uses the same default reset link function for
> both Downstream Ports and Root Ports.
>
> Signed-off-by: Betty Dall <[email protected]>
> ---
> Betty Dall (6):
> PCI/AER: Don't parse HEST table for non-PCIe devices
> PCI/AER: Factor out HEST device type matching
> PCI/AER: Set dev->__aer_firmware_first only for matching devices
> PCI/ACPI: Move AER severity defines to aer.h
> ACPI/APEI: Force fatal AER severity when bus has been reset
> PCI/AER: Provide reset_link for firmware first root port
> ---
> drivers/acpi/apei/ghes.c | 10 +++++++
> drivers/pci/pcie/aer/aerdrv.h | 4 ---
> drivers/pci/pcie/aer/aerdrv_acpi.c | 47 ++++++++++++++++++-----------------
> drivers/pci/pcie/aer/aerdrv_core.c | 17 +++++++------
> include/linux/aer.h | 16 +++++++----
> 5 files changed, 53 insertions(+), 41 deletions(-)

I put these on http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=pci/betty-aer-v3

I'll merge them into -next for v3.11 soon. Thanks, Betty!