2013-05-28 18:53:37

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH 0/3] PCI/ACPI: Fix firmware first error recovery with root port in reset

This patch set fixes a bug on platforms that use firmware first AER.
Firmware can leave the root port in Secondary Bus Reset (SBR) and
communicate this to the OS through the "reset" bit in the flags field
of the HEST table and associated CPER records. Firmware wants to do this
so that the error is contained and the hardware is in a known state.

Without these patches, the root port stays in SBR and the device drivers
cannot recover. These patches recognize when the firmware first root port
is in SBR and bring the root port out of SBR so the devices under the root
port can recover.

The changes have been tested on systems with firmware first that set the
"reset" bit by injecting various hardware errors. The errors successfully
recover.

---
Betty Dall (3):
PCI/AER: Fix incorrect return from aer_hest_parse()
ACPI/APEI: Force fatal AER severity when bus has been reset
PCI/AER: Provide reset_link for firmware first root port
---
drivers/acpi/apei/ghes.c | 21 ++++++++++++++++++-
drivers/pci/pcie/aer/aerdrv_acpi.c | 3 ++
drivers/pci/pcie/aer/aerdrv_core.c | 38 ++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 1 deletions(-)


2013-05-28 18:53:42

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH 3/3] PCI/AER: Provide reset_link for firmware first root port

The firmware first path does not install the aerdrv root port
service driver, so the firmware first root port does not have
a reset_link callback. When a firmware first root port does not have
a reset_link callback, use a new default reset_link similar to what
we already do for the default_downstream_reset_link(). The firmware
first default reset_link brings the root port out of SBR if firmware
left it in SBR.

Signed-off-by: Betty Dall <[email protected]>
---

drivers/pci/pcie/aer/aerdrv_core.c | 37 ++++++++++++++++++++++++++++++++++++
1 files changed, 37 insertions(+), 0 deletions(-)


diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 8ec8b4f..6862fe3 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -413,6 +413,40 @@ static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
return PCI_ERS_RESULT_RECOVERED;
}

+/**
+ * default_ff_root_port_reset_link - default reset function for firmware
+ * first Root Port
+ * @dev: pointer to root port's pci_dev data structure
+ *
+ * Invoked when performing link reset at Root Port w/ no aer driver.
+ * This happens through the firmware first path. Firmware may leave
+ * the Root Port in SBR and it is the OS responsiblity to bring it out
+ * of SBR.
+ */
+static pci_ers_result_t default_ff_root_port_reset_link(struct pci_dev *dev)
+{
+ u16 p2p_ctrl;
+
+ /* Read Secondary Bus Reset */
+ pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
+ p2p_ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+
+ /* De-assert Secondary Bus Reset, if it is set */
+ if (p2p_ctrl & PCI_BRIDGE_CTL_BUS_RESET) {
+ p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+ pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
+
+ /*
+ * System software must wait for at least 100ms from the end
+ * of a reset of one or more device before it is permitted
+ * to issue Configuration Requests to those devices.
+ */
+ msleep(200);
+ dev_dbg(&dev->dev, "Root Port link has been reset\n");
+ }
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
static int find_aer_service_iter(struct device *device, void *data)
{
struct pcie_port_service_driver *service_driver, **drv;
@@ -460,6 +494,9 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
status = driver->reset_link(udev);
} else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM) {
status = default_downstream_reset_link(udev);
+ } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT &&
+ pcie_aer_get_firmware_first(udev)) {
+ status = default_ff_root_port_reset_link(udev);
} else {
dev_printk(KERN_DEBUG, &dev->dev,
"no link-reset support at upstream device %s\n",

2013-05-28 18:53:41

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset

The CPER error record has a reset bit that indicates that the platform
has reset the bus. The reset bit can be set for any severity error
including recoverable. From the AER code path's perspective,
any error is fatal if the bus has been reset. This patch upgrades the
severity of the AER recovery to AER_FATAL whenever the CPER error record
indicates that the bus has been reset.

Signed-off-by: Betty Dall <[email protected]>
---

drivers/acpi/apei/ghes.c | 21 ++++++++++++++++++++-
1 files changed, 20 insertions(+), 1 deletions(-)


diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d668a8a..4a42afc 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -451,7 +451,26 @@ static void ghes_do_proc(struct ghes *ghes,
int aer_severity;
devfn = PCI_DEVFN(pcie_err->device_id.device,
pcie_err->device_id.function);
- aer_severity = cper_severity_to_aer(sev);
+ /*
+ * Some Firmware First implementations
+ * put the device in SBR to contain
+ * the error. This is indicated by the
+ * CPER Section Decriptor Flags reset
+ * bit which means the component must
+ * be re-initialized or re-enabled
+ * prior to use. Promoting the AER
+ * serverity to FATAL will cause the
+ * AER code to link_reset and allow
+ * drivers to reprogram their cards.
+ */
+ if (gdata->flags & CPER_SEC_RESET)
+ aer_severity = cper_severity_to_aer(
+ CPER_SEV_FATAL);
+ else
+ aer_severity =
+ cper_severity_to_aer(sev);
+
+
aer_recover_queue(pcie_err->device_id.segment,
pcie_err->device_id.bus,
devfn, aer_severity);

2013-05-28 18:54:38

by Dall, Elizabeth J

[permalink] [raw]
Subject: [PATCH 1/3] PCI/AER: Fix incorrect return from aer_hest_parse()

The function aer_hest_parse() is called to determine if the given
PCI device is firmware first or not. The code loops through each
section of the HEST table to look for a match. The bug is that
the function always returns whether the last HEST section is firmware
first. The fix stops the iteration once the info.firmware_first
variable is set. This is similar to how the function aer_hest_parse_aff()
stops the iteration.

Signed-off-by: Betty Dall <[email protected]>
---

drivers/pci/pcie/aer/aerdrv_acpi.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)


diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
index 5194a7d..39b8671 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -42,6 +42,9 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
u8 bridge = 0;
int ff = 0;

+ if (info->firmware_first)
+ return 0;
+
switch (hest_hdr->type) {
case ACPI_HEST_TYPE_AER_ROOT_PORT:
pcie_type = PCI_EXP_TYPE_ROOT_PORT;

2013-05-29 15:43:40

by Pearson, Greg

[permalink] [raw]
Subject: Re: [PATCH 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset

On 05/28/2013 12:48 PM, Betty Dall wrote:
> The CPER error record has a reset bit that indicates that the platform
> has reset the bus. The reset bit can be set for any severity error
> including recoverable. From the AER code path's perspective,
> any error is fatal if the bus has been reset. This patch upgrades the
> severity of the AER recovery to AER_FATAL whenever the CPER error record
> indicates that the bus has been reset.
>
> Signed-off-by: Betty Dall <[email protected]>
> ---
>
> drivers/acpi/apei/ghes.c | 21 ++++++++++++++++++++-
> 1 files changed, 20 insertions(+), 1 deletions(-)
>
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index d668a8a..4a42afc 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -451,7 +451,26 @@ static void ghes_do_proc(struct ghes *ghes,
> int aer_severity;
> devfn = PCI_DEVFN(pcie_err->device_id.device,
> pcie_err->device_id.function);
> - aer_severity = cper_severity_to_aer(sev);
> + /*
> + * Some Firmware First implementations
> + * put the device in SBR to contain
> + * the error. This is indicated by the
> + * CPER Section Decriptor Flags reset

Nit Pick, change "Decriptor" to "Descriptor".

--
Greg

> + * bit which means the component must
> + * be re-initialized or re-enabled
> + * prior to use. Promoting the AER
> + * serverity to FATAL will cause the
> + * AER code to link_reset and allow
> + * drivers to reprogram their cards.
> + */
> + if (gdata->flags & CPER_SEC_RESET)
> + aer_severity = cper_severity_to_aer(
> + CPER_SEV_FATAL);
> + else
> + aer_severity =
> + cper_severity_to_aer(sev);
> +
> +
> aer_recover_queue(pcie_err->device_id.segment,
> pcie_err->device_id.bus,
> devfn, aer_severity);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

2013-05-29 15:43:43

by Pearson, Greg

[permalink] [raw]
Subject: Re: [PATCH 3/3] PCI/AER: Provide reset_link for firmware first root port

On 05/28/2013 12:48 PM, Betty Dall wrote:
> The firmware first path does not install the aerdrv root port
> service driver, so the firmware first root port does not have
> a reset_link callback. When a firmware first root port does not have
> a reset_link callback, use a new default reset_link similar to what
> we already do for the default_downstream_reset_link(). The firmware
> first default reset_link brings the root port out of SBR if firmware
> left it in SBR.
>
> Signed-off-by: Betty Dall <[email protected]>
> ---
>
> drivers/pci/pcie/aer/aerdrv_core.c | 37 ++++++++++++++++++++++++++++++++++++
> 1 files changed, 37 insertions(+), 0 deletions(-)
>
>
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 8ec8b4f..6862fe3 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -413,6 +413,40 @@ static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
> return PCI_ERS_RESULT_RECOVERED;
> }
>
> +/**
> + * default_ff_root_port_reset_link - default reset function for firmware
> + * first Root Port
> + * @dev: pointer to root port's pci_dev data structure
> + *
> + * Invoked when performing link reset at Root Port w/ no aer driver.
> + * This happens through the firmware first path. Firmware may leave
> + * the Root Port in SBR and it is the OS responsiblity to bring it out
> + * of SBR.
> + */
> +static pci_ers_result_t default_ff_root_port_reset_link(struct pci_dev *dev)
> +{
> + u16 p2p_ctrl;
> +
> + /* Read Secondary Bus Reset */
> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
> + p2p_ctrl |= PCI_BRIDGE_CTL_BUS_RESET;

Remove "p2p_ctrl |= PCI_BRIDGE_CTL_BUS_RESET;" otherwise the following
conditional is always taken.

--
Greg

> +
> + /* De-assert Secondary Bus Reset, if it is set */
> + if (p2p_ctrl & PCI_BRIDGE_CTL_BUS_RESET) {
> + p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
> +
> + /*
> + * System software must wait for at least 100ms from the end
> + * of a reset of one or more device before it is permitted
> + * to issue Configuration Requests to those devices.
> + */
> + msleep(200);
> + dev_dbg(&dev->dev, "Root Port link has been reset\n");
> + }
> + return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> static int find_aer_service_iter(struct device *device, void *data)
> {
> struct pcie_port_service_driver *service_driver, **drv;
> @@ -460,6 +494,9 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
> status = driver->reset_link(udev);
> } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM) {
> status = default_downstream_reset_link(udev);
> + } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT &&
> + pcie_aer_get_firmware_first(udev)) {
> + status = default_ff_root_port_reset_link(udev);
> } else {
> dev_printk(KERN_DEBUG, &dev->dev,
> "no link-reset support at upstream device %s\n",
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2013-05-29 16:17:01

by Dall, Elizabeth J

[permalink] [raw]
Subject: Re: [PATCH 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset

On Wed, 2013-05-29 at 09:42 -0600, Pearson, Greg wrote:
> On 05/28/2013 12:48 PM, Betty Dall wrote:
> > The CPER error record has a reset bit that indicates that the platform
> > has reset the bus. The reset bit can be set for any severity error
> > including recoverable. From the AER code path's perspective,
> > any error is fatal if the bus has been reset. This patch upgrades the
> > severity of the AER recovery to AER_FATAL whenever the CPER error record
> > indicates that the bus has been reset.
> >
> > Signed-off-by: Betty Dall <[email protected]>
> > ---
> >
> > drivers/acpi/apei/ghes.c | 21 ++++++++++++++++++++-
> > 1 files changed, 20 insertions(+), 1 deletions(-)
> >
> >
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index d668a8a..4a42afc 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -451,7 +451,26 @@ static void ghes_do_proc(struct ghes *ghes,
> > int aer_severity;
> > devfn = PCI_DEVFN(pcie_err->device_id.device,
> > pcie_err->device_id.function);
> > - aer_severity = cper_severity_to_aer(sev);
> > + /*
> > + * Some Firmware First implementations
> > + * put the device in SBR to contain
> > + * the error. This is indicated by the
> > + * CPER Section Decriptor Flags reset
>
> Nit Pick, change "Decriptor" to "Descriptor".
>
> --
> Greg

I will fix that. Thanks.

-Betty
> > + * bit which means the component must
> > + * be re-initialized or re-enabled
> > + * prior to use. Promoting the AER
> > + * serverity to FATAL will cause the
> > + * AER code to link_reset and allow
> > + * drivers to reprogram their cards.
> > + */
> > + if (gdata->flags & CPER_SEC_RESET)
> > + aer_severity = cper_severity_to_aer(
> > + CPER_SEV_FATAL);
> > + else
> > + aer_severity =
> > + cper_severity_to_aer(sev);
> > +
> > +
> > aer_recover_queue(pcie_err->device_id.segment,
> > pcie_err->device_id.bus,
> > devfn, aer_severity);
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> >

2013-05-29 16:17:14

by Dall, Elizabeth J

[permalink] [raw]
Subject: Re: [PATCH 3/3] PCI/AER: Provide reset_link for firmware first root port

On Wed, 2013-05-29 at 09:42 -0600, Pearson, Greg wrote:
> On 05/28/2013 12:48 PM, Betty Dall wrote:
> > The firmware first path does not install the aerdrv root port
> > service driver, so the firmware first root port does not have
> > a reset_link callback. When a firmware first root port does not have
> > a reset_link callback, use a new default reset_link similar to what
> > we already do for the default_downstream_reset_link(). The firmware
> > first default reset_link brings the root port out of SBR if firmware
> > left it in SBR.
> >
> > Signed-off-by: Betty Dall <[email protected]>
> > ---
> >
> > drivers/pci/pcie/aer/aerdrv_core.c | 37 ++++++++++++++++++++++++++++++++++++
> > 1 files changed, 37 insertions(+), 0 deletions(-)
> >
> >
> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> > index 8ec8b4f..6862fe3 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> > @@ -413,6 +413,40 @@ static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
> > return PCI_ERS_RESULT_RECOVERED;
> > }
> >
> > +/**
> > + * default_ff_root_port_reset_link - default reset function for firmware
> > + * first Root Port
> > + * @dev: pointer to root port's pci_dev data structure
> > + *
> > + * Invoked when performing link reset at Root Port w/ no aer driver.
> > + * This happens through the firmware first path. Firmware may leave
> > + * the Root Port in SBR and it is the OS responsiblity to bring it out
> > + * of SBR.
> > + */
> > +static pci_ers_result_t default_ff_root_port_reset_link(struct pci_dev *dev)
> > +{
> > + u16 p2p_ctrl;
> > +
> > + /* Read Secondary Bus Reset */
> > + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
> > + p2p_ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>
> Remove "p2p_ctrl |= PCI_BRIDGE_CTL_BUS_RESET;" otherwise the following
> conditional is always taken.
>
> --
> Greg

Agreed. That was a copy and paste mistake. I will fix and retest.

-Betty

> > +
> > + /* De-assert Secondary Bus Reset, if it is set */
> > + if (p2p_ctrl & PCI_BRIDGE_CTL_BUS_RESET) {
> > + p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> > + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
> > +
> > + /*
> > + * System software must wait for at least 100ms from the end
> > + * of a reset of one or more device before it is permitted
> > + * to issue Configuration Requests to those devices.
> > + */
> > + msleep(200);
> > + dev_dbg(&dev->dev, "Root Port link has been reset\n");
> > + }
> > + return PCI_ERS_RESULT_RECOVERED;
> > +}
> > +
> > static int find_aer_service_iter(struct device *device, void *data)
> > {
> > struct pcie_port_service_driver *service_driver, **drv;
> > @@ -460,6 +494,9 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
> > status = driver->reset_link(udev);
> > } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM) {
> > status = default_downstream_reset_link(udev);
> > + } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT &&
> > + pcie_aer_get_firmware_first(udev)) {
> > + status = default_ff_root_port_reset_link(udev);
> > } else {
> > dev_printk(KERN_DEBUG, &dev->dev,
> > "no link-reset support at upstream device %s\n",
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >