This patch set brings in error handling support for DPC
The current implementation of AER and error message broadcasting to the
EP driver is tightly coupled and limited to AER service driver.
It is important to factor out broadcasting and other link handling
callbacks. So that not only when AER gets triggered, but also when DPC get
triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
The goal of the patch-set is:
DPC should handle the error handling and recovery similar to AER, because
finally both are attempting recovery in some or the other way,
and for that error handling and recovery framework has to be loosely
coupled.
It achieves uniformity and transparency to the error handling agents such
as AER, DPC, with respect to recovery and error handling.
So, this patch-set tries to unify lot of things between error agents and
make them behave in a well defined way. (be it error (FATAL, NON_FATAL)
handling or recovery).
The FATAL error handling is handled with remove/reset_link/re-enumerate
sequence while the NON_FATAL follows the default path.
Documentation/PCI/pci-error-recovery.txt talks more on that.
Changes since v14:
Bjorn's comments addressed
> simplified the patch set, and moved AER_FATAL handling in the beginning.
> rebase the code to 4.17-rc1.
Changes since v13:
Bjorn's comments addressed
> handke FATAL errors with remove devices followed by re-enumeration.
> changes in AER and DPC along with required Documentation.
Changes since v12:
Bjorn's and Keith's Comments addressed.
> Made DPC and AER error handling identical <aligned err.c>
> hanldled cases for hotplug enabled system differently.
Changes since v11:
Bjorn's comments addressed.
> rename pcie-err.c to err.c
> removed EXPORT_SYMBOL
> made generic find_serivce function in port driver.
> removed mutex patch as no need to have mutex in pcie_do_recovery
> brough in DPC_FATAL in aer.h
> so now all the error codes (AER and DPC) are unified in aer.h
Changes since v10:
Christoph Hellwig's, David Laight's and Randy Dunlap's
comments addressed.
> renamed pci_do_recovery to pcie_do_recovery
> removed inner braces in conditional statements.
> restrctured the code in pci_wait_for_link
> EXPORT_SYMBOL_GPL
Changes since v9:
Sinan's comments addressed.
> bool active = true; unnecessary variable removed.
Changes since v8:
Fixed Kbuild errors.
Changes since v7:
Rebased the code on pci master
> https://kernel.googlesource.com/pub/scm/linux/kernel/git/helgaas/pci
Changes since v6:
Sinan's and Stefan's comments implemented.
> reordered patch 6 and 7
> cleaned up
Changes since v5:
Sinan's and Keith's comments incorporated.
> made separate patch for mutex
> unified error repotting codes into driver/pci/pci.h
> got rid of wait link active/inactive and
made generic function in driver/pci/pci.c
Changes since v4:
Bjorn's comments incorporated.
> Renamed only do_recovery.
> moved the things more locally to drivers/pci/pci.h
Changes since v3:
Bjorn's comments incorporated.
> Made separate patch renaming generic pci_err.c
> Introduce pci_err.h to contain all the error types and recovery
> removed all the dependencies on pci.h
Changes since v2:
Based on feedback from Keith:
"
When DPC is triggered due to receipt of an uncorrectable error Message,
the Requester ID from the Message is recorded in the DPC Error
Source ID register and that Message is discarded and not forwarded Upstream.
"
Removed the patch where AER checks if DPC service is active
Changes since v1:
Kbuild errors fixed:
> pci_find_dpc_dev made static
> ras_event.h updated
> pci_find_aer_service call with CONFIG check
> pci_find_dpc_service call with CONFIG check
Oza Pawandeep (9):
PCI: Unify wait for link active into generic PCI
pci-error-recovery: Add AER_FATAL handling
PCI/AER: Handle ERRR_FATAL with removal and re-enumeration of devices
PCI/AER: Rename error recovery to generic PCI naming
PCI/AER: Factor out error reporting from AER
PCI/PORTDRV: Implement generic find service
PCI/PORTDRV: Implement generic find device
PCI/DPC: Unify and plumb error handling into DPC
PCI/DPC: Disable ERR_NONFATAL and enable ERR_FATAL for DPC
Documentation/PCI/pci-error-recovery.txt | 35 ++-
drivers/pci/hotplug/pciehp_hpc.c | 20 +-
drivers/pci/pci.c | 29 +++
drivers/pci/pci.h | 4 +
drivers/pci/pcie/Makefile | 2 +-
drivers/pci/pcie/aer/aerdrv.c | 2 +
drivers/pci/pcie/aer/aerdrv.h | 30 ---
drivers/pci/pcie/aer/aerdrv_core.c | 317 +-------------------------
drivers/pci/pcie/dpc.c | 58 +++--
drivers/pci/pcie/err.c | 374 +++++++++++++++++++++++++++++++
drivers/pci/pcie/portdrv.h | 4 +
drivers/pci/pcie/portdrv_core.c | 67 ++++++
include/linux/aer.h | 1 +
include/uapi/linux/pci_regs.h | 1 +
14 files changed, 540 insertions(+), 404 deletions(-)
create mode 100644 drivers/pci/pcie/err.c
--
2.7.4
This patch renames error recovery to generic name with pcie prefix
Signed-off-by: Oza Pawandeep <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index cec9d8c..22a9589 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -353,6 +353,9 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
void pci_enable_acs(struct pci_dev *dev);
+/* PCI error reporting and recovery */
+void pcie_do_recovery(struct pci_dev *dev, int severity);
+
bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
#ifdef CONFIG_PCIEASPM
void pcie_aspm_init_link_state(struct pci_dev *pdev);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 655d4e8..be4ee3b 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -475,7 +475,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
return status;
}
-static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int severity)
+static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev, int severity)
{
struct pci_dev *udev;
struct pci_bus *parent;
@@ -514,7 +514,7 @@ static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int severity)
}
/**
- * do_recovery - handle nonfatal/fatal error recovery process
+ * pcie_do_recovery - handle nonfatal/fatal error recovery process
* @dev: pointer to a pci_dev data structure of agent detecting an error
* @severity: error severity type
*
@@ -522,13 +522,13 @@ static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int severity)
* error detected message to all downstream drivers within a hierarchy in
* question and return the returned code.
*/
-static void do_recovery(struct pci_dev *dev, int severity)
+void pcie_do_recovery(struct pci_dev *dev, int severity)
{
pci_ers_result_t status;
enum pci_channel_state state;
if (severity == AER_FATAL) {
- status = do_fatal_recovery(dev, severity);
+ status = pcie_do_fatal_recovery(dev, severity);
if (status != PCI_ERS_RESULT_RECOVERED)
goto failed;
return;
@@ -600,7 +600,7 @@ static void handle_error_source(struct pcie_device *aerdev,
pci_write_config_dword(dev, pos + PCI_ERR_COR_STATUS,
info->status);
} else
- do_recovery(dev, info->severity);
+ pcie_do_recovery(dev, info->severity);
}
#ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -665,7 +665,7 @@ static void aer_recover_work_func(struct work_struct *work)
}
cper_print_aer(pdev, entry.severity, entry.regs);
if (entry.severity != AER_CORRECTABLE)
- do_recovery(pdev, entry.severity);
+ pcie_do_recovery(pdev, entry.severity);
pci_dev_put(pdev);
}
}
--
2.7.4
This patch alters the behavior of handling of ERR_FATAL, where removal
of devices is initiated, followed by reset link, followed by
re-enumeration.
So the errors are handled in a different way as follows:
ERR_NONFATAL => call driver recovery entry points
ERR_FATAL => remove and re-enumerate
please refer to Documentation/PCI/pci-error-recovery.txt for more details.
Signed-off-by: Oza Pawandeep <[email protected]>
diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
index 779b387..206f590 100644
--- a/drivers/pci/pcie/aer/aerdrv.c
+++ b/drivers/pci/pcie/aer/aerdrv.c
@@ -330,6 +330,13 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
+ /*
+ * This function is called only on ERR_FATAL now, and since
+ * the pci_report_resume is called only in ERR_NONFATAL case,
+ * the clearing part has to be taken care here.
+ */
+ aer_error_resume(dev);
+
return PCI_ERS_RESULT_RECOVERED;
}
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 0ea5acc..655d4e8 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -20,6 +20,7 @@
#include <linux/slab.h>
#include <linux/kfifo.h>
#include "aerdrv.h"
+#include "../../pci.h"
#define PCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
@@ -474,6 +475,44 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
return status;
}
+static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int severity)
+{
+ struct pci_dev *udev;
+ struct pci_bus *parent;
+ struct pci_dev *pdev, *temp;
+ pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
+
+ if (severity == AER_FATAL)
+ pci_cleanup_aer_uncorrect_error_status(dev);
+
+ if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
+ udev = dev;
+ else
+ udev = dev->bus->self;
+
+ parent = udev->subordinate;
+ pci_lock_rescan_remove();
+ list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
+ bus_list) {
+ pci_dev_get(pdev);
+ pci_dev_set_disconnected(pdev, NULL);
+ if (pci_has_subordinate(pdev))
+ pci_walk_bus(pdev->subordinate,
+ pci_dev_set_disconnected, NULL);
+ pci_stop_and_remove_bus_device(pdev);
+ pci_dev_put(pdev);
+ }
+
+ result = reset_link(udev);
+ if (result == PCI_ERS_RESULT_RECOVERED)
+ if (pcie_wait_for_link(udev, true))
+ pci_rescan_bus(udev->bus);
+
+ pci_unlock_rescan_remove();
+
+ return result;
+}
+
/**
* do_recovery - handle nonfatal/fatal error recovery process
* @dev: pointer to a pci_dev data structure of agent detecting an error
@@ -485,11 +524,15 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
*/
static void do_recovery(struct pci_dev *dev, int severity)
{
- pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
+ pci_ers_result_t status;
enum pci_channel_state state;
- if (severity == AER_FATAL)
- state = pci_channel_io_frozen;
+ if (severity == AER_FATAL) {
+ status = do_fatal_recovery(dev, severity);
+ if (status != PCI_ERS_RESULT_RECOVERED)
+ goto failed;
+ return;
+ }
else
state = pci_channel_io_normal;
@@ -498,12 +541,6 @@ static void do_recovery(struct pci_dev *dev, int severity)
"error_detected",
report_error_detected);
- if (severity == AER_FATAL) {
- result = reset_link(dev);
- if (result != PCI_ERS_RESULT_RECOVERED)
- goto failed;
- }
-
if (status == PCI_ERS_RESULT_CAN_RECOVER)
status = broadcast_error_message(dev,
state,
--
2.7.4
Current DPC driver does not do recovery, e.g. calling end-point's driver's
callbacks, which sanitize the sw.
DPC driver implements link_reset callback, and calls pci_do_recovery().
Signed-off-by: Oza Pawandeep <[email protected]>
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 80ec384..aed7c9f 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -73,29 +73,21 @@ static void dpc_wait_link_inactive(struct dpc_dev *dpc)
pcie_wait_for_link(pdev, false);
}
-static void dpc_work(struct work_struct *work)
+static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
{
- struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
- struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
- struct pci_bus *parent = pdev->subordinate;
- u16 cap = dpc->cap_pos, ctl;
-
- pci_lock_rescan_remove();
- list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
- bus_list) {
- pci_dev_get(dev);
- pci_dev_set_disconnected(dev, NULL);
- if (pci_has_subordinate(dev))
- pci_walk_bus(dev->subordinate,
- pci_dev_set_disconnected, NULL);
- pci_stop_and_remove_bus_device(dev);
- pci_dev_put(dev);
- }
- pci_unlock_rescan_remove();
+ struct dpc_dev *dpc;
+ struct pcie_device *pciedev;
+ struct device *devdpc;
+ u16 cap, ctl;
+
+ devdpc = pcie_port_find_device(pdev, PCIE_PORT_SERVICE_DPC);
+ pciedev = to_pcie_device(devdpc);
+ dpc = get_service_data(pciedev);
+ cap = dpc->cap_pos;
dpc_wait_link_inactive(dpc);
if (dpc->rp_extensions && dpc_wait_rp_inactive(dpc))
- return;
+ return PCI_ERS_RESULT_DISCONNECT;
if (dpc->rp_extensions && dpc->rp_pio_status) {
pci_write_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS,
dpc->rp_pio_status);
@@ -108,6 +100,17 @@ static void dpc_work(struct work_struct *work)
pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
ctl | PCI_EXP_DPC_CTL_INT_EN);
+
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
+static void dpc_work(struct work_struct *work)
+{
+ struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
+ struct pci_dev *pdev = dpc->dev->port;
+
+ /* From DPC point of view error is always FATAL. */
+ pcie_do_recovery(pdev, DPC_FATAL);
}
static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
@@ -288,6 +291,7 @@ static struct pcie_port_service_driver dpcdriver = {
.service = PCIE_PORT_SERVICE_DPC,
.probe = dpc_probe,
.remove = dpc_remove,
+ .reset_link = dpc_reset_link,
};
static int __init dpc_service_init(void)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 877785d..526aba8 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -181,11 +181,12 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
return PCI_ERS_RESULT_RECOVERED;
}
-static pci_ers_result_t reset_link(struct pci_dev *dev)
+static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
{
struct pci_dev *udev;
pci_ers_result_t status;
struct pcie_port_service_driver *driver;
+ u32 service;
if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
/* Reset this port for all subordinates */
@@ -196,7 +197,12 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
}
/* Use the aer driver of the component firstly */
- driver = pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
+ if (severity == DPC_FATAL)
+ service = PCIE_PORT_SERVICE_DPC;
+ else
+ service = PCIE_PORT_SERVICE_AER;
+
+ driver = pcie_port_find_service(udev, service);
if (driver && driver->reset_link) {
status = driver->reset_link(udev);
@@ -302,7 +308,7 @@ static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int severity)
pci_dev_put(pdev);
}
- result = reset_link(udev);
+ result = reset_link(udev, severity);
if (result == PCI_ERS_RESULT_RECOVERED)
if (pcie_wait_for_link(udev, true))
pci_rescan_bus(udev->bus);
@@ -326,7 +332,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
pci_ers_result_t status;
enum pci_channel_state state;
- if (severity == AER_FATAL) {
+ if ((severity == AER_FATAL) ||
+ (severity == DPC_FATAL)) {
status = do_fatal_recovery(dev, severity);
if (status != PCI_ERS_RESULT_RECOVERED)
goto failed;
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 8f87bbe..0c506fe 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -14,6 +14,7 @@
#define AER_NONFATAL 0
#define AER_FATAL 1
#define AER_CORRECTABLE 2
+#define DPC_FATAL 4
struct pci_dev;
--
2.7.4
This patch implements generic pcie_port_find_device() routine.
Signed-off-by: Oza Pawandeep <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index ba6c963..896608a 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -114,4 +114,6 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
struct pcie_port_service_driver *pcie_port_find_service(struct pci_dev *dev,
u32 service);
+struct device *pcie_port_find_device(struct pci_dev *dev,
+ u32 service);
#endif /* _PORTDRV_H_ */
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index d843055..c6147c4 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -20,6 +20,7 @@
#include "portdrv.h"
struct portdrv_service_data {
struct pcie_port_service_driver *drv;
+ struct device *dev;
u32 service;
};
@@ -415,6 +416,7 @@ static int find_service_iter(struct device *device, void *data)
service_driver = to_service_driver(device->driver);
if (service_driver->service == service) {
pdrvs->drv = service_driver;
+ pdrvs->dev = device;
return 1;
}
}
@@ -443,6 +445,27 @@ struct pcie_port_service_driver *pcie_port_find_service(struct pci_dev *dev,
}
/**
+ * pcie_port_find_device - find the struct device
+ * @dev: PCI Express port the service devices associated with
+ * @service: For the service to find
+ *
+ * Find PCI Express port service driver associated with given service
+ */
+struct device *pcie_port_find_device(struct pci_dev *dev,
+ u32 service)
+{
+ struct device *device;
+ struct portdrv_service_data pdrvs;
+
+ pdrvs.dev = NULL;
+ pdrvs.service = service;
+ device_for_each_child(&dev->dev, &pdrvs, find_service_iter);
+
+ device = pdrvs.dev;
+ return device;
+}
+
+/**
* pcie_port_device_remove - unregister PCI Express port service devices
* @dev: PCI Express port the service devices to unregister are associated with
*
--
2.7.4
Clients such as HP, DPC are using pcie_wait_link_active(), which waits
till the link becomes active or inactive.
Made generic function and moved it to drivers/pci/pci.c
Signed-off-by: Oza Pawandeep <[email protected]>
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 18a42f8..e0c2b8e 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -231,25 +231,11 @@ bool pciehp_check_link_active(struct controller *ctrl)
return ret;
}
-static void __pcie_wait_link_active(struct controller *ctrl, bool active)
-{
- int timeout = 1000;
-
- if (pciehp_check_link_active(ctrl) == active)
- return;
- while (timeout > 0) {
- msleep(10);
- timeout -= 10;
- if (pciehp_check_link_active(ctrl) == active)
- return;
- }
- ctrl_dbg(ctrl, "Data Link Layer Link Active not %s in 1000 msec\n",
- active ? "set" : "cleared");
-}
-
static void pcie_wait_link_active(struct controller *ctrl)
{
- __pcie_wait_link_active(ctrl, true);
+ struct pci_dev *pdev = ctrl_dev(ctrl);
+
+ pcie_wait_for_link(pdev, true);
}
static bool pci_bus_check_dev(struct pci_bus *bus, int devfn)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e597655..2e4d1e4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4138,6 +4138,35 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
return pci_dev_wait(dev, "PM D3->D0", PCIE_RESET_READY_POLL_MS);
}
+/**
+ * pcie_wait_for_link - Wait for link till it's active/inactive
+ * @pdev: Bridge device
+ * @active: waiting for active or inactive ?
+ *
+ * Use this to wait till link becomes active or inactive.
+ */
+bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
+{
+ int timeout = 1000;
+ bool ret;
+ u16 lnk_status;
+
+ for (;;) {
+ pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
+ ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
+ if (ret == active)
+ return true;
+ if (timeout <= 0)
+ break;
+ msleep(10);
+ timeout -= 10;
+ }
+
+ pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
+ active ? "set" : "cleared");
+
+ return false;
+}
void pci_reset_secondary_bus(struct pci_dev *dev)
{
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 023f7cf..cec9d8c 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -353,6 +353,7 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
void pci_enable_acs(struct pci_dev *dev);
+bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
#ifdef CONFIG_PCIEASPM
void pcie_aspm_init_link_state(struct pci_dev *pdev);
void pcie_aspm_exit_link_state(struct pci_dev *pdev);
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 8c57d60..80ec384 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -68,19 +68,9 @@ static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
static void dpc_wait_link_inactive(struct dpc_dev *dpc)
{
- unsigned long timeout = jiffies + HZ;
struct pci_dev *pdev = dpc->dev->port;
- struct device *dev = &dpc->dev->device;
- u16 lnk_status;
- pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
- while (lnk_status & PCI_EXP_LNKSTA_DLLLA &&
- !time_after(jiffies, timeout)) {
- msleep(10);
- pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
- }
- if (lnk_status & PCI_EXP_LNKSTA_DLLLA)
- dev_warn(dev, "Link state not disabled for DPC event\n");
+ pcie_wait_for_link(pdev, false);
}
static void dpc_work(struct work_struct *work)
--
2.7.4
It adds description on AER_FATAL error handling.
Signed-off-by: Oza Pawandeep <[email protected]>
diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
index 0b6bb3e..688b691 100644
--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -110,7 +110,7 @@ The actual steps taken by a platform to recover from a PCI error
event will be platform-dependent, but will follow the general
sequence described below.
-STEP 0: Error Event
+STEP 0: Error Event: ERR_NONFATAL
-------------------
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
@@ -228,13 +228,7 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset)
-STEP 3: Link Reset
-------------------
-The platform resets the link. This is a PCI-Express specific step
-and is done whenever a fatal error has been detected that can be
-"solved" by resetting the link.
-
-STEP 4: Slot Reset
+STEP 3: Slot Reset
------------------
In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
@@ -320,7 +314,7 @@ Failure).
>>> However, it probably should.
-STEP 5: Resume Operations
+STEP 4: Resume Operations
-------------------------
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
@@ -332,7 +326,7 @@ a result code.
At this point, if a new error happens, the platform will restart
a new error recovery sequence.
-STEP 6: Permanent Failure
+STEP 5: Permanent Failure
-------------------------
A "permanent failure" has occurred, and the platform cannot recover
the device. The platform will call error_detected() with a
@@ -355,6 +349,27 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
for additional detail on real-life experience of the causes of
software errors.
+STEP 0: Error Event: ERR_FATAL
+-------------------
+PCI bus error is detected by the PCI hardware. On powerpc, the slot is
+isolated, in that all I/O is blocked: all reads return 0xffffffff, all
+writes are ignored.
+
+STEP 1: Remove devices
+--------------------
+Platform removes the devices depending on the error agent, it could be
+this port for all subordinates or upstream component (likely downstream
+port)
+
+STEP 2: Reset link
+--------------------
+The platform resets the link. This is a PCI-Express specific step and is
+done whenever a fatal error has been detected that can be "solved" by
+resetting the link.
+
+STEP 3: Re-enumerate the devices
+--------------------
+Initiates the re-enumeration.
Conclusion; General Remarks
---------------------------
--
2.7.4
This patch disables ERR_NONFATAL trigger for DPC, so now DPC
handles only ERR_FATAL.
Signed-off-by: Oza Pawandeep <[email protected]>
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index aed7c9f..6966e00 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -263,7 +263,7 @@ static int dpc_probe(struct pcie_device *dev)
}
}
- ctl = (ctl & 0xfff4) | PCI_EXP_DPC_CTL_EN_NONFATAL | PCI_EXP_DPC_CTL_INT_EN;
+ ctl = (ctl & 0xfff4) | PCI_EXP_DPC_CTL_EN_FATAL | PCI_EXP_DPC_CTL_INT_EN;
pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL, ctl);
dev_info(device, "DPC error containment capabilities: Int Msg #%d, RPExt%c PoisonedTLP%c SwTrigger%c RP PIO Log %d, DL_ActiveErr%c\n",
@@ -281,7 +281,7 @@ static void dpc_remove(struct pcie_device *dev)
u16 ctl;
pci_read_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL, &ctl);
- ctl &= ~(PCI_EXP_DPC_CTL_EN_NONFATAL | PCI_EXP_DPC_CTL_INT_EN);
+ ctl &= ~(PCI_EXP_DPC_CTL_EN_FATAL | PCI_EXP_DPC_CTL_INT_EN);
pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL, ctl);
}
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 103ba79..86f1cc2 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -981,6 +981,7 @@
#define PCI_EXP_DPC_CAP_DL_ACTIVE 0x1000 /* ERR_COR signal on DL_Active supported */
#define PCI_EXP_DPC_CTL 6 /* DPC control */
+#define PCI_EXP_DPC_CTL_EN_FATAL 0x0001 /* Enable trigger on ERR_FATAL message */
#define PCI_EXP_DPC_CTL_EN_NONFATAL 0x0002 /* Enable trigger on ERR_NONFATAL message */
#define PCI_EXP_DPC_CTL_INT_EN 0x0008 /* DPC Interrupt Enable */
--
2.7.4
This patch factors out error reporting callbacks, which are currently
tightly coupled with AER.
DPC should be able to register callbacks and attempt recovery when DPC
trigger event occurs.
Signed-off-by: Oza Pawandeep <[email protected]>
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 800e1d4..03f4e0b 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -2,7 +2,7 @@
#
# Makefile for PCI Express features and port driver
-pcieportdrv-y := portdrv_core.o portdrv_pci.o
+pcieportdrv-y := portdrv_core.o portdrv_pci.o err.o
obj-$(CONFIG_PCIEPORTBUS) += pcieportdrv.o
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index 08b4584..b4c9506 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -76,36 +76,6 @@ struct aer_rpc {
*/
};
-struct aer_broadcast_data {
- enum pci_channel_state state;
- enum pci_ers_result result;
-};
-
-static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
- enum pci_ers_result new)
-{
- if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
- return PCI_ERS_RESULT_NO_AER_DRIVER;
-
- if (new == PCI_ERS_RESULT_NONE)
- return orig;
-
- switch (orig) {
- case PCI_ERS_RESULT_CAN_RECOVER:
- case PCI_ERS_RESULT_RECOVERED:
- orig = new;
- break;
- case PCI_ERS_RESULT_DISCONNECT:
- if (new == PCI_ERS_RESULT_NEED_RESET)
- orig = PCI_ERS_RESULT_NEED_RESET;
- break;
- default:
- break;
- }
-
- return orig;
-}
-
extern struct bus_type pcie_port_bus_type;
void aer_isr(struct work_struct *work);
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index be4ee3b..51515d1 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -228,191 +228,6 @@ static bool find_source_device(struct pci_dev *parent,
return true;
}
-static int report_error_detected(struct pci_dev *dev, void *data)
-{
- pci_ers_result_t vote;
- const struct pci_error_handlers *err_handler;
- struct aer_broadcast_data *result_data;
- result_data = (struct aer_broadcast_data *) data;
-
- device_lock(&dev->dev);
- dev->error_state = result_data->state;
-
- if (!dev->driver ||
- !dev->driver->err_handler ||
- !dev->driver->err_handler->error_detected) {
- if (result_data->state == pci_channel_io_frozen &&
- dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
- /*
- * In case of fatal recovery, if one of down-
- * stream device has no driver. We might be
- * unable to recover because a later insmod
- * of a driver for this device is unaware of
- * its hw state.
- */
- pci_printk(KERN_DEBUG, dev, "device has %s\n",
- dev->driver ?
- "no AER-aware driver" : "no driver");
- }
-
- /*
- * If there's any device in the subtree that does not
- * have an error_detected callback, returning
- * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
- * the subsequent mmio_enabled/slot_reset/resume
- * callbacks of "any" device in the subtree. All the
- * devices in the subtree are left in the error state
- * without recovery.
- */
-
- if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
- vote = PCI_ERS_RESULT_NO_AER_DRIVER;
- else
- vote = PCI_ERS_RESULT_NONE;
- } else {
- err_handler = dev->driver->err_handler;
- vote = err_handler->error_detected(dev, result_data->state);
- pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
- }
-
- result_data->result = merge_result(result_data->result, vote);
- device_unlock(&dev->dev);
- return 0;
-}
-
-static int report_mmio_enabled(struct pci_dev *dev, void *data)
-{
- pci_ers_result_t vote;
- const struct pci_error_handlers *err_handler;
- struct aer_broadcast_data *result_data;
- result_data = (struct aer_broadcast_data *) data;
-
- device_lock(&dev->dev);
- if (!dev->driver ||
- !dev->driver->err_handler ||
- !dev->driver->err_handler->mmio_enabled)
- goto out;
-
- err_handler = dev->driver->err_handler;
- vote = err_handler->mmio_enabled(dev);
- result_data->result = merge_result(result_data->result, vote);
-out:
- device_unlock(&dev->dev);
- return 0;
-}
-
-static int report_slot_reset(struct pci_dev *dev, void *data)
-{
- pci_ers_result_t vote;
- const struct pci_error_handlers *err_handler;
- struct aer_broadcast_data *result_data;
- result_data = (struct aer_broadcast_data *) data;
-
- device_lock(&dev->dev);
- if (!dev->driver ||
- !dev->driver->err_handler ||
- !dev->driver->err_handler->slot_reset)
- goto out;
-
- err_handler = dev->driver->err_handler;
- vote = err_handler->slot_reset(dev);
- result_data->result = merge_result(result_data->result, vote);
-out:
- device_unlock(&dev->dev);
- return 0;
-}
-
-static int report_resume(struct pci_dev *dev, void *data)
-{
- const struct pci_error_handlers *err_handler;
-
- device_lock(&dev->dev);
- dev->error_state = pci_channel_io_normal;
-
- if (!dev->driver ||
- !dev->driver->err_handler ||
- !dev->driver->err_handler->resume)
- goto out;
-
- err_handler = dev->driver->err_handler;
- err_handler->resume(dev);
- pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
-out:
- device_unlock(&dev->dev);
- return 0;
-}
-
-/**
- * broadcast_error_message - handle message broadcast to downstream drivers
- * @dev: pointer to from where in a hierarchy message is broadcasted down
- * @state: error state
- * @error_mesg: message to print
- * @cb: callback to be broadcasted
- *
- * Invoked during error recovery process. Once being invoked, the content
- * of error severity will be broadcasted to all downstream drivers in a
- * hierarchy in question.
- */
-static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
- enum pci_channel_state state,
- char *error_mesg,
- int (*cb)(struct pci_dev *, void *))
-{
- struct aer_broadcast_data result_data;
-
- pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
- result_data.state = state;
- if (cb == report_error_detected)
- result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
- else
- result_data.result = PCI_ERS_RESULT_RECOVERED;
-
- if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
- /*
- * If the error is reported by a bridge, we think this error
- * is related to the downstream link of the bridge, so we
- * do error recovery on all subordinates of the bridge instead
- * of the bridge and clear the error status of the bridge.
- */
- if (cb == report_error_detected)
- dev->error_state = state;
- pci_walk_bus(dev->subordinate, cb, &result_data);
- if (cb == report_resume) {
- pci_cleanup_aer_uncorrect_error_status(dev);
- dev->error_state = pci_channel_io_normal;
- }
- } else {
- /*
- * If the error is reported by an end point, we think this
- * error is related to the upstream link of the end point.
- */
- if (state == pci_channel_io_normal)
- /*
- * the error is non fatal so the bus is ok, just invoke
- * the callback for the function that logged the error.
- */
- cb(dev, &result_data);
- else
- pci_walk_bus(dev->bus, cb, &result_data);
- }
-
- return result_data.result;
-}
-
-/**
- * default_reset_link - default reset function
- * @dev: pointer to pci_dev data structure
- *
- * Invoked when performing link reset on a Downstream Port or a
- * Root Port with no aer driver.
- */
-static pci_ers_result_t default_reset_link(struct pci_dev *dev)
-{
- pci_reset_bridge_secondary_bus(dev);
- pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
- return PCI_ERS_RESULT_RECOVERED;
-}
-
static int find_aer_service_iter(struct device *device, void *data)
{
struct pcie_port_service_driver *service_driver, **drv;
@@ -430,7 +245,7 @@ static int find_aer_service_iter(struct device *device, void *data)
return 0;
}
-static struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
+struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
{
struct pcie_port_service_driver *drv = NULL;
@@ -439,143 +254,6 @@ static struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
return drv;
}
-static pci_ers_result_t reset_link(struct pci_dev *dev)
-{
- struct pci_dev *udev;
- pci_ers_result_t status;
- struct pcie_port_service_driver *driver;
-
- if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
- /* Reset this port for all subordinates */
- udev = dev;
- } else {
- /* Reset the upstream component (likely downstream port) */
- udev = dev->bus->self;
- }
-
- /* Use the aer driver of the component firstly */
- driver = find_aer_service(udev);
-
- if (driver && driver->reset_link) {
- status = driver->reset_link(udev);
- } else if (udev->has_secondary_link) {
- status = default_reset_link(udev);
- } else {
- pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream device %s\n",
- pci_name(udev));
- return PCI_ERS_RESULT_DISCONNECT;
- }
-
- if (status != PCI_ERS_RESULT_RECOVERED) {
- pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
- pci_name(udev));
- return PCI_ERS_RESULT_DISCONNECT;
- }
-
- return status;
-}
-
-static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev, int severity)
-{
- struct pci_dev *udev;
- struct pci_bus *parent;
- struct pci_dev *pdev, *temp;
- pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
-
- if (severity == AER_FATAL)
- pci_cleanup_aer_uncorrect_error_status(dev);
-
- if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
- udev = dev;
- else
- udev = dev->bus->self;
-
- parent = udev->subordinate;
- pci_lock_rescan_remove();
- list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
- bus_list) {
- pci_dev_get(pdev);
- pci_dev_set_disconnected(pdev, NULL);
- if (pci_has_subordinate(pdev))
- pci_walk_bus(pdev->subordinate,
- pci_dev_set_disconnected, NULL);
- pci_stop_and_remove_bus_device(pdev);
- pci_dev_put(pdev);
- }
-
- result = reset_link(udev);
- if (result == PCI_ERS_RESULT_RECOVERED)
- if (pcie_wait_for_link(udev, true))
- pci_rescan_bus(udev->bus);
-
- pci_unlock_rescan_remove();
-
- return result;
-}
-
-/**
- * pcie_do_recovery - handle nonfatal/fatal error recovery process
- * @dev: pointer to a pci_dev data structure of agent detecting an error
- * @severity: error severity type
- *
- * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
- * error detected message to all downstream drivers within a hierarchy in
- * question and return the returned code.
- */
-void pcie_do_recovery(struct pci_dev *dev, int severity)
-{
- pci_ers_result_t status;
- enum pci_channel_state state;
-
- if (severity == AER_FATAL) {
- status = pcie_do_fatal_recovery(dev, severity);
- if (status != PCI_ERS_RESULT_RECOVERED)
- goto failed;
- return;
- }
- else
- state = pci_channel_io_normal;
-
- status = broadcast_error_message(dev,
- state,
- "error_detected",
- report_error_detected);
-
- if (status == PCI_ERS_RESULT_CAN_RECOVER)
- status = broadcast_error_message(dev,
- state,
- "mmio_enabled",
- report_mmio_enabled);
-
- if (status == PCI_ERS_RESULT_NEED_RESET) {
- /*
- * TODO: Should call platform-specific
- * functions to reset slot before calling
- * drivers' slot_reset callbacks?
- */
- status = broadcast_error_message(dev,
- state,
- "slot_reset",
- report_slot_reset);
- }
-
- if (status != PCI_ERS_RESULT_RECOVERED)
- goto failed;
-
- broadcast_error_message(dev,
- state,
- "resume",
- report_resume);
-
- pci_info(dev, "AER: Device recovery successful\n");
- return;
-
-failed:
- pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
- /* TODO: Should kernel panic here? */
- pci_info(dev, "AER: Device recovery failed\n");
-}
-
/**
* handle_error_source - handle logging error into an event log
* @aerdev: pointer to pcie_device data structure of the root port
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
new file mode 100644
index 0000000..55df974
--- /dev/null
+++ b/drivers/pci/pcie/err.c
@@ -0,0 +1,377 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This file implements the error recovery as a core part of PCIe error
+ * reporting. When a PCIe error is delivered, an error message will be
+ * collected and printed to console, then, an error recovery procedure
+ * will be executed by following the PCI error recovery rules.
+ *
+ * Copyright (C) 2006 Intel Corp.
+ * Tom Long Nguyen ([email protected])
+ * Zhang Yanmin ([email protected])
+ *
+ */
+
+#include <linux/pci.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/aer.h>
+#include "portdrv.h"
+#include "../pci.h"
+
+struct aer_broadcast_data {
+ enum pci_channel_state state;
+ enum pci_ers_result result;
+};
+
+static pci_ers_result_t merge_result(enum pci_ers_result orig,
+ enum pci_ers_result new)
+{
+ if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
+ return PCI_ERS_RESULT_NO_AER_DRIVER;
+
+ if (new == PCI_ERS_RESULT_NONE)
+ return orig;
+
+ switch (orig) {
+ case PCI_ERS_RESULT_CAN_RECOVER:
+ case PCI_ERS_RESULT_RECOVERED:
+ orig = new;
+ break;
+ case PCI_ERS_RESULT_DISCONNECT:
+ if (new == PCI_ERS_RESULT_NEED_RESET)
+ orig = PCI_ERS_RESULT_NEED_RESET;
+ break;
+ default:
+ break;
+ }
+
+ return orig;
+}
+
+static int report_error_detected(struct pci_dev *dev, void *data)
+{
+ pci_ers_result_t vote;
+ const struct pci_error_handlers *err_handler;
+ struct aer_broadcast_data *result_data;
+
+ result_data = (struct aer_broadcast_data *) data;
+
+ device_lock(&dev->dev);
+ dev->error_state = result_data->state;
+
+ if (!dev->driver ||
+ !dev->driver->err_handler ||
+ !dev->driver->err_handler->error_detected) {
+ if (result_data->state == pci_channel_io_frozen &&
+ dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
+ /*
+ * In case of fatal recovery, if one of down-
+ * stream device has no driver. We might be
+ * unable to recover because a later insmod
+ * of a driver for this device is unaware of
+ * its hw state.
+ */
+ pci_printk(KERN_DEBUG, dev, "device has %s\n",
+ dev->driver ?
+ "no AER-aware driver" : "no driver");
+ }
+
+ /*
+ * If there's any device in the subtree that does not
+ * have an error_detected callback, returning
+ * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
+ * the subsequent mmio_enabled/slot_reset/resume
+ * callbacks of "any" device in the subtree. All the
+ * devices in the subtree are left in the error state
+ * without recovery.
+ */
+
+ if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
+ vote = PCI_ERS_RESULT_NO_AER_DRIVER;
+ else
+ vote = PCI_ERS_RESULT_NONE;
+ } else {
+ err_handler = dev->driver->err_handler;
+ vote = err_handler->error_detected(dev, result_data->state);
+ pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
+ }
+
+ result_data->result = merge_result(result_data->result, vote);
+ device_unlock(&dev->dev);
+ return 0;
+}
+
+static int report_mmio_enabled(struct pci_dev *dev, void *data)
+{
+ pci_ers_result_t vote;
+ const struct pci_error_handlers *err_handler;
+ struct aer_broadcast_data *result_data;
+
+ result_data = (struct aer_broadcast_data *) data;
+
+ device_lock(&dev->dev);
+ if (!dev->driver ||
+ !dev->driver->err_handler ||
+ !dev->driver->err_handler->mmio_enabled)
+ goto out;
+
+ err_handler = dev->driver->err_handler;
+ vote = err_handler->mmio_enabled(dev);
+ result_data->result = merge_result(result_data->result, vote);
+out:
+ device_unlock(&dev->dev);
+ return 0;
+}
+
+static int report_slot_reset(struct pci_dev *dev, void *data)
+{
+ pci_ers_result_t vote;
+ const struct pci_error_handlers *err_handler;
+ struct aer_broadcast_data *result_data;
+
+ result_data = (struct aer_broadcast_data *) data;
+
+ device_lock(&dev->dev);
+ if (!dev->driver ||
+ !dev->driver->err_handler ||
+ !dev->driver->err_handler->slot_reset)
+ goto out;
+
+ err_handler = dev->driver->err_handler;
+ vote = err_handler->slot_reset(dev);
+ result_data->result = merge_result(result_data->result, vote);
+out:
+ device_unlock(&dev->dev);
+ return 0;
+}
+
+static int report_resume(struct pci_dev *dev, void *data)
+{
+ const struct pci_error_handlers *err_handler;
+
+ device_lock(&dev->dev);
+ dev->error_state = pci_channel_io_normal;
+
+ if (!dev->driver ||
+ !dev->driver->err_handler ||
+ !dev->driver->err_handler->resume)
+ goto out;
+
+ err_handler = dev->driver->err_handler;
+ err_handler->resume(dev);
+ pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
+out:
+ device_unlock(&dev->dev);
+ return 0;
+}
+
+/**
+ * default_reset_link - default reset function
+ * @dev: pointer to pci_dev data structure
+ *
+ * Invoked when performing link reset on a Downstream Port or a
+ * Root Port with no aer driver.
+ */
+static pci_ers_result_t default_reset_link(struct pci_dev *dev)
+{
+ pci_reset_bridge_secondary_bus(dev);
+ pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
+static pci_ers_result_t reset_link(struct pci_dev *dev)
+{
+ struct pci_dev *udev;
+ pci_ers_result_t status;
+ struct pcie_port_service_driver *driver;
+
+ if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+ /* Reset this port for all subordinates */
+ udev = dev;
+ } else {
+ /* Reset the upstream component (likely downstream port) */
+ udev = dev->bus->self;
+ }
+
+#if IS_ENABLED(CONFIG_PCIEAER)
+ /* Use the aer driver of the component firstly */
+ driver = find_aer_service(udev);
+#endif
+
+ if (driver && driver->reset_link) {
+ status = driver->reset_link(udev);
+ } else if (udev->has_secondary_link) {
+ status = default_reset_link(udev);
+ } else {
+ pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream device %s\n",
+ pci_name(udev));
+ return PCI_ERS_RESULT_DISCONNECT;
+ }
+
+ if (status != PCI_ERS_RESULT_RECOVERED) {
+ pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
+ pci_name(udev));
+ return PCI_ERS_RESULT_DISCONNECT;
+ }
+
+ return status;
+}
+
+/**
+ * broadcast_error_message - handle message broadcast to downstream drivers
+ * @dev: pointer to from where in a hierarchy message is broadcasted down
+ * @state: error state
+ * @error_mesg: message to print
+ * @cb: callback to be broadcasted
+ *
+ * Invoked during error recovery process. Once being invoked, the content
+ * of error severity will be broadcasted to all downstream drivers in a
+ * hierarchy in question.
+ */
+static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
+ enum pci_channel_state state,
+ char *error_mesg,
+ int (*cb)(struct pci_dev *, void *))
+{
+ struct aer_broadcast_data result_data;
+
+ pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
+ result_data.state = state;
+ if (cb == report_error_detected)
+ result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
+ else
+ result_data.result = PCI_ERS_RESULT_RECOVERED;
+
+ if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+ /*
+ * If the error is reported by a bridge, we think this error
+ * is related to the downstream link of the bridge, so we
+ * do error recovery on all subordinates of the bridge instead
+ * of the bridge and clear the error status of the bridge.
+ */
+ if (cb == report_error_detected)
+ dev->error_state = state;
+ pci_walk_bus(dev->subordinate, cb, &result_data);
+ if (cb == report_resume) {
+ pci_cleanup_aer_uncorrect_error_status(dev);
+ dev->error_state = pci_channel_io_normal;
+ }
+ } else {
+ /*
+ * If the error is reported by an end point, we think this
+ * error is related to the upstream link of the end point.
+ */
+ if (state == pci_channel_io_normal)
+ /*
+ * the error is non fatal so the bus is ok, just invoke
+ * the callback for the function that logged the error.
+ */
+ cb(dev, &result_data);
+ else
+ pci_walk_bus(dev->bus, cb, &result_data);
+ }
+
+ return result_data.result;
+}
+
+static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int severity)
+{
+ struct pci_dev *udev;
+ struct pci_bus *parent;
+ struct pci_dev *pdev, *temp;
+ pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
+
+ if (severity == AER_FATAL)
+ pci_cleanup_aer_uncorrect_error_status(dev);
+
+ if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
+ udev = dev;
+ else
+ udev = dev->bus->self;
+
+ parent = udev->subordinate;
+ pci_lock_rescan_remove();
+ list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
+ bus_list) {
+ pci_dev_get(pdev);
+ pci_dev_set_disconnected(pdev, NULL);
+ if (pci_has_subordinate(pdev))
+ pci_walk_bus(pdev->subordinate,
+ pci_dev_set_disconnected, NULL);
+ pci_stop_and_remove_bus_device(pdev);
+ pci_dev_put(pdev);
+ }
+
+ result = reset_link(udev);
+ if (result == PCI_ERS_RESULT_RECOVERED)
+ if (pcie_wait_for_link(udev, true))
+ pci_rescan_bus(udev->bus);
+
+ pci_unlock_rescan_remove();
+
+ return result;
+}
+
+/**
+ * pcie_do_recovery - handle nonfatal/fatal error recovery process
+ * @dev: pointer to a pci_dev data structure of agent detecting an error
+ * @severity: error severity type
+ *
+ * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
+ * error detected message to all downstream drivers within a hierarchy in
+ * question and return the returned code.
+ */
+void pcie_do_recovery(struct pci_dev *dev, int severity)
+{
+ pci_ers_result_t status;
+ enum pci_channel_state state;
+
+ if (severity == AER_FATAL) {
+ status = do_fatal_recovery(dev, severity);
+ if (status != PCI_ERS_RESULT_RECOVERED)
+ goto failed;
+ return;
+ } else
+ state = pci_channel_io_normal;
+
+ status = broadcast_error_message(dev,
+ state,
+ "error_detected",
+ report_error_detected);
+
+ if (status == PCI_ERS_RESULT_CAN_RECOVER)
+ status = broadcast_error_message(dev,
+ state,
+ "mmio_enabled",
+ report_mmio_enabled);
+
+ if (status == PCI_ERS_RESULT_NEED_RESET) {
+ /*
+ * TODO: Should call platform-specific
+ * functions to reset slot before calling
+ * drivers' slot_reset callbacks?
+ */
+ status = broadcast_error_message(dev,
+ state,
+ "slot_reset",
+ report_slot_reset);
+ }
+
+ if (status != PCI_ERS_RESULT_RECOVERED)
+ goto failed;
+
+ broadcast_error_message(dev,
+ state,
+ "resume",
+ report_resume);
+
+ pci_info(dev, "AER: Device recovery successful\n");
+ return;
+
+failed:
+ pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
+ /* TODO: Should kernel panic here? */
+ pci_info(dev, "AER: Device recovery failed\n");
+}
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index d0c6783..47c9824 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -112,4 +112,5 @@ static inline bool pcie_pme_no_msi(void) { return false; }
static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
#endif /* !CONFIG_PCIE_PME */
+struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev);
#endif /* _PORTDRV_H_ */
--
2.7.4
This patch implements generic pcie_port_find_service() routine.
Signed-off-by: Oza Pawandeep <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 51515d1..a525296 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -228,32 +228,6 @@ static bool find_source_device(struct pci_dev *parent,
return true;
}
-static int find_aer_service_iter(struct device *device, void *data)
-{
- struct pcie_port_service_driver *service_driver, **drv;
-
- drv = (struct pcie_port_service_driver **) data;
-
- if (device->bus == &pcie_port_bus_type && device->driver) {
- service_driver = to_service_driver(device->driver);
- if (service_driver->service == PCIE_PORT_SERVICE_AER) {
- *drv = service_driver;
- return 1;
- }
- }
-
- return 0;
-}
-
-struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
-{
- struct pcie_port_service_driver *drv = NULL;
-
- device_for_each_child(&dev->dev, &drv, find_aer_service_iter);
-
- return drv;
-}
-
/**
* handle_error_source - handle logging error into an event log
* @aerdev: pointer to pcie_device data structure of the root port
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 55df974..877785d 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -195,10 +195,8 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
udev = dev->bus->self;
}
-#if IS_ENABLED(CONFIG_PCIEAER)
/* Use the aer driver of the component firstly */
- driver = find_aer_service(udev);
-#endif
+ driver = pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
if (driver && driver->reset_link) {
status = driver->reset_link(udev);
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index 47c9824..ba6c963 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -112,5 +112,6 @@ static inline bool pcie_pme_no_msi(void) { return false; }
static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
#endif /* !CONFIG_PCIE_PME */
-struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev);
+struct pcie_port_service_driver *pcie_port_find_service(struct pci_dev *dev,
+ u32 service);
#endif /* _PORTDRV_H_ */
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index c9c0663..d843055 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -18,6 +18,10 @@
#include "../pci.h"
#include "portdrv.h"
+struct portdrv_service_data {
+ struct pcie_port_service_driver *drv;
+ u32 service;
+};
/**
* release_pcie_device - free PCI Express port service device structure
@@ -398,6 +402,46 @@ static int remove_iter(struct device *dev, void *data)
return 0;
}
+static int find_service_iter(struct device *device, void *data)
+{
+ struct pcie_port_service_driver *service_driver;
+ struct portdrv_service_data *pdrvs;
+ u32 service;
+
+ pdrvs = (struct portdrv_service_data *) data;
+ service = pdrvs->service;
+
+ if (device->bus == &pcie_port_bus_type && device->driver) {
+ service_driver = to_service_driver(device->driver);
+ if (service_driver->service == service) {
+ pdrvs->drv = service_driver;
+ return 1;
+ }
+ }
+
+ return 0;
+}
+/**
+ * pcie_port_find_service - find the service driver
+ * @dev: PCI Express port the service devices associated with
+ * @service: Service to find
+ *
+ * Find PCI Express port service driver associated with given service
+ */
+struct pcie_port_service_driver *pcie_port_find_service(struct pci_dev *dev,
+ u32 service)
+{
+ struct pcie_port_service_driver *drv;
+ struct portdrv_service_data pdrvs;
+
+ pdrvs.drv = NULL;
+ pdrvs.service = service;
+ device_for_each_child(&dev->dev, &pdrvs, find_service_iter);
+
+ drv = pdrvs.drv;
+ return drv;
+}
+
/**
* pcie_port_device_remove - unregister PCI Express port service devices
* @dev: PCI Express port the service devices to unregister are associated with
--
2.7.4
Hi Oza,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on pci/next]
[also build test ERROR on v4.17-rc3 next-20180503]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Oza-Pawandeep/Address-error-and-recovery-for-AER-and-DPC/20180504-050017
base: https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: x86_64-randconfig-x015-201817 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All errors (new ones prefixed by >>):
drivers/pci/pcie/err.c: In function 'report_error_detected':
>> drivers/pci/pcie/err.c:98:3: error: implicit declaration of function 'pci_uevent_ers'; did you mean 'pci_select_bars'? [-Werror=implicit-function-declaration]
pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
^~~~~~~~~~~~~~
pci_select_bars
drivers/pci/pcie/err.c: In function 'reset_link':
drivers/pci/pcie/err.c:203:5: warning: 'driver' is used uninitialized in this function [-Wuninitialized]
if (driver && driver->reset_link) {
^
cc1: some warnings being treated as errors
vim +98 drivers/pci/pcie/err.c
52
53 static int report_error_detected(struct pci_dev *dev, void *data)
54 {
55 pci_ers_result_t vote;
56 const struct pci_error_handlers *err_handler;
57 struct aer_broadcast_data *result_data;
58
59 result_data = (struct aer_broadcast_data *) data;
60
61 device_lock(&dev->dev);
62 dev->error_state = result_data->state;
63
64 if (!dev->driver ||
65 !dev->driver->err_handler ||
66 !dev->driver->err_handler->error_detected) {
67 if (result_data->state == pci_channel_io_frozen &&
68 dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
69 /*
70 * In case of fatal recovery, if one of down-
71 * stream device has no driver. We might be
72 * unable to recover because a later insmod
73 * of a driver for this device is unaware of
74 * its hw state.
75 */
76 pci_printk(KERN_DEBUG, dev, "device has %s\n",
77 dev->driver ?
78 "no AER-aware driver" : "no driver");
79 }
80
81 /*
82 * If there's any device in the subtree that does not
83 * have an error_detected callback, returning
84 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
85 * the subsequent mmio_enabled/slot_reset/resume
86 * callbacks of "any" device in the subtree. All the
87 * devices in the subtree are left in the error state
88 * without recovery.
89 */
90
91 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
92 vote = PCI_ERS_RESULT_NO_AER_DRIVER;
93 else
94 vote = PCI_ERS_RESULT_NONE;
95 } else {
96 err_handler = dev->driver->err_handler;
97 vote = err_handler->error_detected(dev, result_data->state);
> 98 pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
99 }
100
101 result_data->result = merge_result(result_data->result, vote);
102 device_unlock(&dev->dev);
103 return 0;
104 }
105
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Oza,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on pci/next]
[also build test ERROR on v4.17-rc3 next-20180503]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Oza-Pawandeep/Address-error-and-recovery-for-AER-and-DPC/20180504-050017
base: https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: i386-randconfig-a1-05030941 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
drivers/pci/pcie/err.c: In function 'report_error_detected':
>> drivers/pci/pcie/err.c:98:3: error: implicit declaration of function 'pci_uevent_ers' [-Werror=implicit-function-declaration]
pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
^
drivers/pci/pcie/err.c: In function 'reset_link':
drivers/pci/pcie/err.c:203:5: warning: 'driver' is used uninitialized in this function [-Wuninitialized]
if (driver && driver->reset_link) {
^
cc1: some warnings being treated as errors
vim +/pci_uevent_ers +98 drivers/pci/pcie/err.c
52
53 static int report_error_detected(struct pci_dev *dev, void *data)
54 {
55 pci_ers_result_t vote;
56 const struct pci_error_handlers *err_handler;
57 struct aer_broadcast_data *result_data;
58
59 result_data = (struct aer_broadcast_data *) data;
60
61 device_lock(&dev->dev);
62 dev->error_state = result_data->state;
63
64 if (!dev->driver ||
65 !dev->driver->err_handler ||
66 !dev->driver->err_handler->error_detected) {
67 if (result_data->state == pci_channel_io_frozen &&
68 dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
69 /*
70 * In case of fatal recovery, if one of down-
71 * stream device has no driver. We might be
72 * unable to recover because a later insmod
73 * of a driver for this device is unaware of
74 * its hw state.
75 */
76 pci_printk(KERN_DEBUG, dev, "device has %s\n",
77 dev->driver ?
78 "no AER-aware driver" : "no driver");
79 }
80
81 /*
82 * If there's any device in the subtree that does not
83 * have an error_detected callback, returning
84 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
85 * the subsequent mmio_enabled/slot_reset/resume
86 * callbacks of "any" device in the subtree. All the
87 * devices in the subtree are left in the error state
88 * without recovery.
89 */
90
91 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
92 vote = PCI_ERS_RESULT_NO_AER_DRIVER;
93 else
94 vote = PCI_ERS_RESULT_NONE;
95 } else {
96 err_handler = dev->driver->err_handler;
97 vote = err_handler->error_detected(dev, result_data->state);
> 98 pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
99 }
100
101 result_data->result = merge_result(result_data->result, vote);
102 device_unlock(&dev->dev);
103 return 0;
104 }
105
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 2018-05-03 10:33, Oza Pawandeep wrote:
> This patch factors out error reporting callbacks, which are currently
> tightly coupled with AER.
>
> DPC should be able to register callbacks and attempt recovery when DPC
> trigger event occurs.
>
> Signed-off-by: Oza Pawandeep <[email protected]>
>
> diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
> index 800e1d4..03f4e0b 100644
> --- a/drivers/pci/pcie/Makefile
> +++ b/drivers/pci/pcie/Makefile
> @@ -2,7 +2,7 @@
> #
> # Makefile for PCI Express features and port driver
>
> -pcieportdrv-y := portdrv_core.o portdrv_pci.o
> +pcieportdrv-y := portdrv_core.o portdrv_pci.o err.o
>
> obj-$(CONFIG_PCIEPORTBUS) += pcieportdrv.o
>
> diff --git a/drivers/pci/pcie/aer/aerdrv.h
> b/drivers/pci/pcie/aer/aerdrv.h
> index 08b4584..b4c9506 100644
> --- a/drivers/pci/pcie/aer/aerdrv.h
> +++ b/drivers/pci/pcie/aer/aerdrv.h
> @@ -76,36 +76,6 @@ struct aer_rpc {
> */
> };
>
> -struct aer_broadcast_data {
> - enum pci_channel_state state;
> - enum pci_ers_result result;
> -};
> -
> -static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
> - enum pci_ers_result new)
> -{
> - if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
> - return PCI_ERS_RESULT_NO_AER_DRIVER;
> -
> - if (new == PCI_ERS_RESULT_NONE)
> - return orig;
> -
> - switch (orig) {
> - case PCI_ERS_RESULT_CAN_RECOVER:
> - case PCI_ERS_RESULT_RECOVERED:
> - orig = new;
> - break;
> - case PCI_ERS_RESULT_DISCONNECT:
> - if (new == PCI_ERS_RESULT_NEED_RESET)
> - orig = PCI_ERS_RESULT_NEED_RESET;
> - break;
> - default:
> - break;
> - }
> -
> - return orig;
> -}
> -
> extern struct bus_type pcie_port_bus_type;
> void aer_isr(struct work_struct *work);
> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c
> b/drivers/pci/pcie/aer/aerdrv_core.c
> index be4ee3b..51515d1 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -228,191 +228,6 @@ static bool find_source_device(struct pci_dev
> *parent,
> return true;
> }
>
> -static int report_error_detected(struct pci_dev *dev, void *data)
> -{
> - pci_ers_result_t vote;
> - const struct pci_error_handlers *err_handler;
> - struct aer_broadcast_data *result_data;
> - result_data = (struct aer_broadcast_data *) data;
> -
> - device_lock(&dev->dev);
> - dev->error_state = result_data->state;
> -
> - if (!dev->driver ||
> - !dev->driver->err_handler ||
> - !dev->driver->err_handler->error_detected) {
> - if (result_data->state == pci_channel_io_frozen &&
> - dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> - /*
> - * In case of fatal recovery, if one of down-
> - * stream device has no driver. We might be
> - * unable to recover because a later insmod
> - * of a driver for this device is unaware of
> - * its hw state.
> - */
> - pci_printk(KERN_DEBUG, dev, "device has %s\n",
> - dev->driver ?
> - "no AER-aware driver" : "no driver");
> - }
> -
> - /*
> - * If there's any device in the subtree that does not
> - * have an error_detected callback, returning
> - * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
> - * the subsequent mmio_enabled/slot_reset/resume
> - * callbacks of "any" device in the subtree. All the
> - * devices in the subtree are left in the error state
> - * without recovery.
> - */
> -
> - if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
> - vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> - else
> - vote = PCI_ERS_RESULT_NONE;
> - } else {
> - err_handler = dev->driver->err_handler;
> - vote = err_handler->error_detected(dev, result_data->state);
> - pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
> - }
> -
> - result_data->result = merge_result(result_data->result, vote);
> - device_unlock(&dev->dev);
> - return 0;
> -}
> -
> -static int report_mmio_enabled(struct pci_dev *dev, void *data)
> -{
> - pci_ers_result_t vote;
> - const struct pci_error_handlers *err_handler;
> - struct aer_broadcast_data *result_data;
> - result_data = (struct aer_broadcast_data *) data;
> -
> - device_lock(&dev->dev);
> - if (!dev->driver ||
> - !dev->driver->err_handler ||
> - !dev->driver->err_handler->mmio_enabled)
> - goto out;
> -
> - err_handler = dev->driver->err_handler;
> - vote = err_handler->mmio_enabled(dev);
> - result_data->result = merge_result(result_data->result, vote);
> -out:
> - device_unlock(&dev->dev);
> - return 0;
> -}
> -
> -static int report_slot_reset(struct pci_dev *dev, void *data)
> -{
> - pci_ers_result_t vote;
> - const struct pci_error_handlers *err_handler;
> - struct aer_broadcast_data *result_data;
> - result_data = (struct aer_broadcast_data *) data;
> -
> - device_lock(&dev->dev);
> - if (!dev->driver ||
> - !dev->driver->err_handler ||
> - !dev->driver->err_handler->slot_reset)
> - goto out;
> -
> - err_handler = dev->driver->err_handler;
> - vote = err_handler->slot_reset(dev);
> - result_data->result = merge_result(result_data->result, vote);
> -out:
> - device_unlock(&dev->dev);
> - return 0;
> -}
> -
> -static int report_resume(struct pci_dev *dev, void *data)
> -{
> - const struct pci_error_handlers *err_handler;
> -
> - device_lock(&dev->dev);
> - dev->error_state = pci_channel_io_normal;
> -
> - if (!dev->driver ||
> - !dev->driver->err_handler ||
> - !dev->driver->err_handler->resume)
> - goto out;
> -
> - err_handler = dev->driver->err_handler;
> - err_handler->resume(dev);
> - pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
> -out:
> - device_unlock(&dev->dev);
> - return 0;
> -}
> -
> -/**
> - * broadcast_error_message - handle message broadcast to downstream
> drivers
> - * @dev: pointer to from where in a hierarchy message is broadcasted
> down
> - * @state: error state
> - * @error_mesg: message to print
> - * @cb: callback to be broadcasted
> - *
> - * Invoked during error recovery process. Once being invoked, the
> content
> - * of error severity will be broadcasted to all downstream drivers in
> a
> - * hierarchy in question.
> - */
> -static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
> - enum pci_channel_state state,
> - char *error_mesg,
> - int (*cb)(struct pci_dev *, void *))
> -{
> - struct aer_broadcast_data result_data;
> -
> - pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
> - result_data.state = state;
> - if (cb == report_error_detected)
> - result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
> - else
> - result_data.result = PCI_ERS_RESULT_RECOVERED;
> -
> - if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> - /*
> - * If the error is reported by a bridge, we think this error
> - * is related to the downstream link of the bridge, so we
> - * do error recovery on all subordinates of the bridge instead
> - * of the bridge and clear the error status of the bridge.
> - */
> - if (cb == report_error_detected)
> - dev->error_state = state;
> - pci_walk_bus(dev->subordinate, cb, &result_data);
> - if (cb == report_resume) {
> - pci_cleanup_aer_uncorrect_error_status(dev);
> - dev->error_state = pci_channel_io_normal;
> - }
> - } else {
> - /*
> - * If the error is reported by an end point, we think this
> - * error is related to the upstream link of the end point.
> - */
> - if (state == pci_channel_io_normal)
> - /*
> - * the error is non fatal so the bus is ok, just invoke
> - * the callback for the function that logged the error.
> - */
> - cb(dev, &result_data);
> - else
> - pci_walk_bus(dev->bus, cb, &result_data);
> - }
> -
> - return result_data.result;
> -}
> -
> -/**
> - * default_reset_link - default reset function
> - * @dev: pointer to pci_dev data structure
> - *
> - * Invoked when performing link reset on a Downstream Port or a
> - * Root Port with no aer driver.
> - */
> -static pci_ers_result_t default_reset_link(struct pci_dev *dev)
> -{
> - pci_reset_bridge_secondary_bus(dev);
> - pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
> - return PCI_ERS_RESULT_RECOVERED;
> -}
> -
> static int find_aer_service_iter(struct device *device, void *data)
> {
> struct pcie_port_service_driver *service_driver, **drv;
> @@ -430,7 +245,7 @@ static int find_aer_service_iter(struct device
> *device, void *data)
> return 0;
> }
>
> -static struct pcie_port_service_driver *find_aer_service(struct
> pci_dev *dev)
> +struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
> {
> struct pcie_port_service_driver *drv = NULL;
>
> @@ -439,143 +254,6 @@ static struct pcie_port_service_driver
> *find_aer_service(struct pci_dev *dev)
> return drv;
> }
>
> -static pci_ers_result_t reset_link(struct pci_dev *dev)
> -{
> - struct pci_dev *udev;
> - pci_ers_result_t status;
> - struct pcie_port_service_driver *driver;
> -
> - if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> - /* Reset this port for all subordinates */
> - udev = dev;
> - } else {
> - /* Reset the upstream component (likely downstream port) */
> - udev = dev->bus->self;
> - }
> -
> - /* Use the aer driver of the component firstly */
> - driver = find_aer_service(udev);
> -
> - if (driver && driver->reset_link) {
> - status = driver->reset_link(udev);
> - } else if (udev->has_secondary_link) {
> - status = default_reset_link(udev);
> - } else {
> - pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream
> device %s\n",
> - pci_name(udev));
> - return PCI_ERS_RESULT_DISCONNECT;
> - }
> -
> - if (status != PCI_ERS_RESULT_RECOVERED) {
> - pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s
> failed\n",
> - pci_name(udev));
> - return PCI_ERS_RESULT_DISCONNECT;
> - }
> -
> - return status;
> -}
> -
> -static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
> int severity)
> -{
> - struct pci_dev *udev;
> - struct pci_bus *parent;
> - struct pci_dev *pdev, *temp;
> - pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
> -
> - if (severity == AER_FATAL)
> - pci_cleanup_aer_uncorrect_error_status(dev);
> -
> - if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
> - udev = dev;
> - else
> - udev = dev->bus->self;
> -
> - parent = udev->subordinate;
> - pci_lock_rescan_remove();
> - list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
> - bus_list) {
> - pci_dev_get(pdev);
> - pci_dev_set_disconnected(pdev, NULL);
> - if (pci_has_subordinate(pdev))
> - pci_walk_bus(pdev->subordinate,
> - pci_dev_set_disconnected, NULL);
> - pci_stop_and_remove_bus_device(pdev);
> - pci_dev_put(pdev);
> - }
> -
> - result = reset_link(udev);
> - if (result == PCI_ERS_RESULT_RECOVERED)
> - if (pcie_wait_for_link(udev, true))
> - pci_rescan_bus(udev->bus);
> -
> - pci_unlock_rescan_remove();
> -
> - return result;
> -}
> -
> -/**
> - * pcie_do_recovery - handle nonfatal/fatal error recovery process
> - * @dev: pointer to a pci_dev data structure of agent detecting an
> error
> - * @severity: error severity type
> - *
> - * Invoked when an error is nonfatal/fatal. Once being invoked,
> broadcast
> - * error detected message to all downstream drivers within a hierarchy
> in
> - * question and return the returned code.
> - */
> -void pcie_do_recovery(struct pci_dev *dev, int severity)
> -{
> - pci_ers_result_t status;
> - enum pci_channel_state state;
> -
> - if (severity == AER_FATAL) {
> - status = pcie_do_fatal_recovery(dev, severity);
> - if (status != PCI_ERS_RESULT_RECOVERED)
> - goto failed;
> - return;
> - }
> - else
> - state = pci_channel_io_normal;
> -
> - status = broadcast_error_message(dev,
> - state,
> - "error_detected",
> - report_error_detected);
> -
> - if (status == PCI_ERS_RESULT_CAN_RECOVER)
> - status = broadcast_error_message(dev,
> - state,
> - "mmio_enabled",
> - report_mmio_enabled);
> -
> - if (status == PCI_ERS_RESULT_NEED_RESET) {
> - /*
> - * TODO: Should call platform-specific
> - * functions to reset slot before calling
> - * drivers' slot_reset callbacks?
> - */
> - status = broadcast_error_message(dev,
> - state,
> - "slot_reset",
> - report_slot_reset);
> - }
> -
> - if (status != PCI_ERS_RESULT_RECOVERED)
> - goto failed;
> -
> - broadcast_error_message(dev,
> - state,
> - "resume",
> - report_resume);
> -
> - pci_info(dev, "AER: Device recovery successful\n");
> - return;
> -
> -failed:
> - pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> - /* TODO: Should kernel panic here? */
> - pci_info(dev, "AER: Device recovery failed\n");
> -}
> -
> /**
> * handle_error_source - handle logging error into an event log
> * @aerdev: pointer to pcie_device data structure of the root port
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> new file mode 100644
> index 0000000..55df974
> --- /dev/null
> +++ b/drivers/pci/pcie/err.c
> @@ -0,0 +1,377 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * This file implements the error recovery as a core part of PCIe
> error
> + * reporting. When a PCIe error is delivered, an error message will be
> + * collected and printed to console, then, an error recovery procedure
> + * will be executed by following the PCI error recovery rules.
> + *
> + * Copyright (C) 2006 Intel Corp.
> + * Tom Long Nguyen ([email protected])
> + * Zhang Yanmin ([email protected])
> + *
> + */
> +
> +#include <linux/pci.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/aer.h>
> +#include "portdrv.h"
> +#include "../pci.h"
> +
> +struct aer_broadcast_data {
> + enum pci_channel_state state;
> + enum pci_ers_result result;
> +};
> +
> +static pci_ers_result_t merge_result(enum pci_ers_result orig,
> + enum pci_ers_result new)
> +{
> + if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
> + return PCI_ERS_RESULT_NO_AER_DRIVER;
> +
> + if (new == PCI_ERS_RESULT_NONE)
> + return orig;
> +
> + switch (orig) {
> + case PCI_ERS_RESULT_CAN_RECOVER:
> + case PCI_ERS_RESULT_RECOVERED:
> + orig = new;
> + break;
> + case PCI_ERS_RESULT_DISCONNECT:
> + if (new == PCI_ERS_RESULT_NEED_RESET)
> + orig = PCI_ERS_RESULT_NEED_RESET;
> + break;
> + default:
> + break;
> + }
> +
> + return orig;
> +}
> +
> +static int report_error_detected(struct pci_dev *dev, void *data)
> +{
> + pci_ers_result_t vote;
> + const struct pci_error_handlers *err_handler;
> + struct aer_broadcast_data *result_data;
> +
> + result_data = (struct aer_broadcast_data *) data;
> +
> + device_lock(&dev->dev);
> + dev->error_state = result_data->state;
> +
> + if (!dev->driver ||
> + !dev->driver->err_handler ||
> + !dev->driver->err_handler->error_detected) {
> + if (result_data->state == pci_channel_io_frozen &&
> + dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> + /*
> + * In case of fatal recovery, if one of down-
> + * stream device has no driver. We might be
> + * unable to recover because a later insmod
> + * of a driver for this device is unaware of
> + * its hw state.
> + */
> + pci_printk(KERN_DEBUG, dev, "device has %s\n",
> + dev->driver ?
> + "no AER-aware driver" : "no driver");
> + }
> +
> + /*
> + * If there's any device in the subtree that does not
> + * have an error_detected callback, returning
> + * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
> + * the subsequent mmio_enabled/slot_reset/resume
> + * callbacks of "any" device in the subtree. All the
> + * devices in the subtree are left in the error state
> + * without recovery.
> + */
> +
> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
> + vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> + else
> + vote = PCI_ERS_RESULT_NONE;
> + } else {
> + err_handler = dev->driver->err_handler;
> + vote = err_handler->error_detected(dev, result_data->state);
> + pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
> + }
> +
> + result_data->result = merge_result(result_data->result, vote);
> + device_unlock(&dev->dev);
> + return 0;
> +}
> +
> +static int report_mmio_enabled(struct pci_dev *dev, void *data)
> +{
> + pci_ers_result_t vote;
> + const struct pci_error_handlers *err_handler;
> + struct aer_broadcast_data *result_data;
> +
> + result_data = (struct aer_broadcast_data *) data;
> +
> + device_lock(&dev->dev);
> + if (!dev->driver ||
> + !dev->driver->err_handler ||
> + !dev->driver->err_handler->mmio_enabled)
> + goto out;
> +
> + err_handler = dev->driver->err_handler;
> + vote = err_handler->mmio_enabled(dev);
> + result_data->result = merge_result(result_data->result, vote);
> +out:
> + device_unlock(&dev->dev);
> + return 0;
> +}
> +
> +static int report_slot_reset(struct pci_dev *dev, void *data)
> +{
> + pci_ers_result_t vote;
> + const struct pci_error_handlers *err_handler;
> + struct aer_broadcast_data *result_data;
> +
> + result_data = (struct aer_broadcast_data *) data;
> +
> + device_lock(&dev->dev);
> + if (!dev->driver ||
> + !dev->driver->err_handler ||
> + !dev->driver->err_handler->slot_reset)
> + goto out;
> +
> + err_handler = dev->driver->err_handler;
> + vote = err_handler->slot_reset(dev);
> + result_data->result = merge_result(result_data->result, vote);
> +out:
> + device_unlock(&dev->dev);
> + return 0;
> +}
> +
> +static int report_resume(struct pci_dev *dev, void *data)
> +{
> + const struct pci_error_handlers *err_handler;
> +
> + device_lock(&dev->dev);
> + dev->error_state = pci_channel_io_normal;
> +
> + if (!dev->driver ||
> + !dev->driver->err_handler ||
> + !dev->driver->err_handler->resume)
> + goto out;
> +
> + err_handler = dev->driver->err_handler;
> + err_handler->resume(dev);
> + pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
> +out:
> + device_unlock(&dev->dev);
> + return 0;
> +}
> +
> +/**
> + * default_reset_link - default reset function
> + * @dev: pointer to pci_dev data structure
> + *
> + * Invoked when performing link reset on a Downstream Port or a
> + * Root Port with no aer driver.
> + */
> +static pci_ers_result_t default_reset_link(struct pci_dev *dev)
> +{
> + pci_reset_bridge_secondary_bus(dev);
> + pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
> + return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +static pci_ers_result_t reset_link(struct pci_dev *dev)
> +{
> + struct pci_dev *udev;
> + pci_ers_result_t status;
> + struct pcie_port_service_driver *driver;
> +
> + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> + /* Reset this port for all subordinates */
> + udev = dev;
> + } else {
> + /* Reset the upstream component (likely downstream port) */
> + udev = dev->bus->self;
> + }
> +
> +#if IS_ENABLED(CONFIG_PCIEAER)
> + /* Use the aer driver of the component firstly */
> + driver = find_aer_service(udev);
> +#endif
> +
> + if (driver && driver->reset_link) {
> + status = driver->reset_link(udev);
> + } else if (udev->has_secondary_link) {
> + status = default_reset_link(udev);
> + } else {
> + pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream
> device %s\n",
> + pci_name(udev));
> + return PCI_ERS_RESULT_DISCONNECT;
> + }
> +
> + if (status != PCI_ERS_RESULT_RECOVERED) {
> + pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s
> failed\n",
> + pci_name(udev));
> + return PCI_ERS_RESULT_DISCONNECT;
> + }
> +
> + return status;
> +}
> +
> +/**
> + * broadcast_error_message - handle message broadcast to downstream
> drivers
> + * @dev: pointer to from where in a hierarchy message is broadcasted
> down
> + * @state: error state
> + * @error_mesg: message to print
> + * @cb: callback to be broadcasted
> + *
> + * Invoked during error recovery process. Once being invoked, the
> content
> + * of error severity will be broadcasted to all downstream drivers in
> a
> + * hierarchy in question.
> + */
> +static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
> + enum pci_channel_state state,
> + char *error_mesg,
> + int (*cb)(struct pci_dev *, void *))
> +{
> + struct aer_broadcast_data result_data;
> +
> + pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
> + result_data.state = state;
> + if (cb == report_error_detected)
> + result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
> + else
> + result_data.result = PCI_ERS_RESULT_RECOVERED;
> +
> + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> + /*
> + * If the error is reported by a bridge, we think this error
> + * is related to the downstream link of the bridge, so we
> + * do error recovery on all subordinates of the bridge instead
> + * of the bridge and clear the error status of the bridge.
> + */
> + if (cb == report_error_detected)
> + dev->error_state = state;
> + pci_walk_bus(dev->subordinate, cb, &result_data);
> + if (cb == report_resume) {
> + pci_cleanup_aer_uncorrect_error_status(dev);
> + dev->error_state = pci_channel_io_normal;
> + }
> + } else {
> + /*
> + * If the error is reported by an end point, we think this
> + * error is related to the upstream link of the end point.
> + */
> + if (state == pci_channel_io_normal)
> + /*
> + * the error is non fatal so the bus is ok, just invoke
> + * the callback for the function that logged the error.
> + */
> + cb(dev, &result_data);
> + else
> + pci_walk_bus(dev->bus, cb, &result_data);
> + }
> +
> + return result_data.result;
> +}
> +
> +static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int
> severity)
> +{
> + struct pci_dev *udev;
> + struct pci_bus *parent;
> + struct pci_dev *pdev, *temp;
> + pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
> +
> + if (severity == AER_FATAL)
> + pci_cleanup_aer_uncorrect_error_status(dev);
> +
> + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
> + udev = dev;
> + else
> + udev = dev->bus->self;
> +
> + parent = udev->subordinate;
> + pci_lock_rescan_remove();
> + list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
> + bus_list) {
> + pci_dev_get(pdev);
> + pci_dev_set_disconnected(pdev, NULL);
> + if (pci_has_subordinate(pdev))
> + pci_walk_bus(pdev->subordinate,
> + pci_dev_set_disconnected, NULL);
> + pci_stop_and_remove_bus_device(pdev);
> + pci_dev_put(pdev);
> + }
> +
> + result = reset_link(udev);
> + if (result == PCI_ERS_RESULT_RECOVERED)
> + if (pcie_wait_for_link(udev, true))
> + pci_rescan_bus(udev->bus);
> +
> + pci_unlock_rescan_remove();
> +
> + return result;
> +}
> +
> +/**
> + * pcie_do_recovery - handle nonfatal/fatal error recovery process
> + * @dev: pointer to a pci_dev data structure of agent detecting an
> error
> + * @severity: error severity type
> + *
> + * Invoked when an error is nonfatal/fatal. Once being invoked,
> broadcast
> + * error detected message to all downstream drivers within a hierarchy
> in
> + * question and return the returned code.
> + */
> +void pcie_do_recovery(struct pci_dev *dev, int severity)
> +{
> + pci_ers_result_t status;
> + enum pci_channel_state state;
> +
> + if (severity == AER_FATAL) {
> + status = do_fatal_recovery(dev, severity);
> + if (status != PCI_ERS_RESULT_RECOVERED)
> + goto failed;
> + return;
> + } else
> + state = pci_channel_io_normal;
> +
> + status = broadcast_error_message(dev,
> + state,
> + "error_detected",
> + report_error_detected);
> +
> + if (status == PCI_ERS_RESULT_CAN_RECOVER)
> + status = broadcast_error_message(dev,
> + state,
> + "mmio_enabled",
> + report_mmio_enabled);
> +
> + if (status == PCI_ERS_RESULT_NEED_RESET) {
> + /*
> + * TODO: Should call platform-specific
> + * functions to reset slot before calling
> + * drivers' slot_reset callbacks?
> + */
> + status = broadcast_error_message(dev,
> + state,
> + "slot_reset",
> + report_slot_reset);
> + }
> +
> + if (status != PCI_ERS_RESULT_RECOVERED)
> + goto failed;
> +
> + broadcast_error_message(dev,
> + state,
> + "resume",
> + report_resume);
> +
> + pci_info(dev, "AER: Device recovery successful\n");
> + return;
> +
> +failed:
> + pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> + /* TODO: Should kernel panic here? */
> + pci_info(dev, "AER: Device recovery failed\n");
> +}
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index d0c6783..47c9824 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -112,4 +112,5 @@ static inline bool pcie_pme_no_msi(void) { return
> false; }
> static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool
> en) {}
> #endif /* !CONFIG_PCIE_PME */
>
> +struct pcie_port_service_driver *find_aer_service(struct pci_dev
> *dev);
> #endif /* _PORTDRV_H_ */
Hi Bjorn,
I will be fixing kbuild error (for x86) along with the comments you
might have.
Regards,
Oza.
On Thu, May 03, 2018 at 01:03:52AM -0400, Oza Pawandeep wrote:
> This patch alters the behavior of handling of ERR_FATAL, where removal
> of devices is initiated, followed by reset link, followed by
> re-enumeration.
>
> So the errors are handled in a different way as follows:
> ERR_NONFATAL => call driver recovery entry points
> ERR_FATAL => remove and re-enumerate
>
> please refer to Documentation/PCI/pci-error-recovery.txt for more details.
>
> Signed-off-by: Oza Pawandeep <[email protected]>
>
> diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> index 779b387..206f590 100644
> --- a/drivers/pci/pcie/aer/aerdrv.c
> +++ b/drivers/pci/pcie/aer/aerdrv.c
> @@ -330,6 +330,13 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
> reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
>
> + /*
> + * This function is called only on ERR_FATAL now, and since
> + * the pci_report_resume is called only in ERR_NONFATAL case,
> + * the clearing part has to be taken care here.
> + */
> + aer_error_resume(dev);
I don't understand this part. Previously the ERR_FATAL path looked like
this:
do_recovery
reset_link
driver->reset_link
aer_root_reset
pci_reset_bridge_secondary_bus # <-- reset
broadcast_error_message(..., report_resume)
pci_walk_bus(..., report_resume, ...)
report_resume
if (cb == report_resume)
pci_cleanup_aer_uncorrect_error_status
pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear status
After this patch, it will look like this:
do_recovery
do_fatal_recovery
pci_cleanup_aer_uncorrect_error_status
pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear status
reset_link
driver->reset_link
aer_root_reset
pci_reset_bridge_secondary_bus # <-- reset
aer_error_resume
pcie_capability_write_word(PCI_EXP_DEVSTA) # <-- clear more
pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear status
So if I'm understanding correctly, the new path clears the status too
early, then clears it again (plus clearing DEVSTA, which we didn't do
before) later.
I would think we would want to leave aer_root_reset() alone, and just move
the pci_cleanup_aer_uncorrect_error_status() in do_fatal_recovery() down so
it happens after we call reset_link(). That way the reset/clear sequence
would be the same as it was before.
> return PCI_ERS_RESULT_RECOVERED;
> }
>
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 0ea5acc..655d4e8 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -20,6 +20,7 @@
> #include <linux/slab.h>
> #include <linux/kfifo.h>
> #include "aerdrv.h"
> +#include "../../pci.h"
>
> #define PCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
> PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
> @@ -474,6 +475,44 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
> return status;
> }
>
> +static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int severity)
> +{
> + struct pci_dev *udev;
> + struct pci_bus *parent;
> + struct pci_dev *pdev, *temp;
> + pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
> +
> + if (severity == AER_FATAL)
> + pci_cleanup_aer_uncorrect_error_status(dev);
> +
> + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
> + udev = dev;
> + else
> + udev = dev->bus->self;
> +
> + parent = udev->subordinate;
> + pci_lock_rescan_remove();
> + list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
> + bus_list) {
> + pci_dev_get(pdev);
> + pci_dev_set_disconnected(pdev, NULL);
> + if (pci_has_subordinate(pdev))
> + pci_walk_bus(pdev->subordinate,
> + pci_dev_set_disconnected, NULL);
> + pci_stop_and_remove_bus_device(pdev);
> + pci_dev_put(pdev);
> + }
> +
> + result = reset_link(udev);
> + if (result == PCI_ERS_RESULT_RECOVERED)
> + if (pcie_wait_for_link(udev, true))
> + pci_rescan_bus(udev->bus);
> +
> + pci_unlock_rescan_remove();
> +
> + return result;
> +}
> +
> /**
> * do_recovery - handle nonfatal/fatal error recovery process
> * @dev: pointer to a pci_dev data structure of agent detecting an error
> @@ -485,11 +524,15 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
> */
> static void do_recovery(struct pci_dev *dev, int severity)
> {
> - pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
> + pci_ers_result_t status;
> enum pci_channel_state state;
>
> - if (severity == AER_FATAL)
> - state = pci_channel_io_frozen;
> + if (severity == AER_FATAL) {
> + status = do_fatal_recovery(dev, severity);
> + if (status != PCI_ERS_RESULT_RECOVERED)
> + goto failed;
> + return;
> + }
> else
> state = pci_channel_io_normal;
>
> @@ -498,12 +541,6 @@ static void do_recovery(struct pci_dev *dev, int severity)
> "error_detected",
> report_error_detected);
>
> - if (severity == AER_FATAL) {
> - result = reset_link(dev);
> - if (result != PCI_ERS_RESULT_RECOVERED)
> - goto failed;
> - }
> -
> if (status == PCI_ERS_RESULT_CAN_RECOVER)
> status = broadcast_error_message(dev,
> state,
> --
> 2.7.4
>
On Tue, May 08, 2018 at 06:53:30PM -0500, Bjorn Helgaas wrote:
> On Thu, May 03, 2018 at 01:03:52AM -0400, Oza Pawandeep wrote:
> > This patch alters the behavior of handling of ERR_FATAL, where removal
> > of devices is initiated, followed by reset link, followed by
> > re-enumeration.
> >
> > So the errors are handled in a different way as follows:
> > ERR_NONFATAL => call driver recovery entry points
> > ERR_FATAL => remove and re-enumerate
> >
> > please refer to Documentation/PCI/pci-error-recovery.txt for more details.
> >
> > Signed-off-by: Oza Pawandeep <[email protected]>
> >
> > diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> > index 779b387..206f590 100644
> > --- a/drivers/pci/pcie/aer/aerdrv.c
> > +++ b/drivers/pci/pcie/aer/aerdrv.c
> > @@ -330,6 +330,13 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
> > reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> > pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
> >
> > + /*
> > + * This function is called only on ERR_FATAL now, and since
> > + * the pci_report_resume is called only in ERR_NONFATAL case,
> > + * the clearing part has to be taken care here.
> > + */
> > + aer_error_resume(dev);
>
> I don't understand this part. Previously the ERR_FATAL path looked like
> this:
>
> do_recovery
> reset_link
> driver->reset_link
> aer_root_reset
> pci_reset_bridge_secondary_bus # <-- reset
> broadcast_error_message(..., report_resume)
> pci_walk_bus(..., report_resume, ...)
> report_resume
> if (cb == report_resume)
> pci_cleanup_aer_uncorrect_error_status
> pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear status
>
> After this patch, it will look like this:
>
> do_recovery
> do_fatal_recovery
> pci_cleanup_aer_uncorrect_error_status
> pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear status
> reset_link
> driver->reset_link
> aer_root_reset
> pci_reset_bridge_secondary_bus # <-- reset
> aer_error_resume
> pcie_capability_write_word(PCI_EXP_DEVSTA) # <-- clear more
> pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear status
>
> So if I'm understanding correctly, the new path clears the status too
> early, then clears it again (plus clearing DEVSTA, which we didn't do
> before) later.
>
> I would think we would want to leave aer_root_reset() alone, and just move
> the pci_cleanup_aer_uncorrect_error_status() in do_fatal_recovery() down so
> it happens after we call reset_link(). That way the reset/clear sequence
> would be the same as it was before.
I've been fiddling with this a bit myself and will post the results to see
what you think.
On 2018-05-09 18:37, Bjorn Helgaas wrote:
> On Tue, May 08, 2018 at 06:53:30PM -0500, Bjorn Helgaas wrote:
>> On Thu, May 03, 2018 at 01:03:52AM -0400, Oza Pawandeep wrote:
>> > This patch alters the behavior of handling of ERR_FATAL, where removal
>> > of devices is initiated, followed by reset link, followed by
>> > re-enumeration.
>> >
>> > So the errors are handled in a different way as follows:
>> > ERR_NONFATAL => call driver recovery entry points
>> > ERR_FATAL => remove and re-enumerate
>> >
>> > please refer to Documentation/PCI/pci-error-recovery.txt for more details.
>> >
>> > Signed-off-by: Oza Pawandeep <[email protected]>
>> >
>> > diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
>> > index 779b387..206f590 100644
>> > --- a/drivers/pci/pcie/aer/aerdrv.c
>> > +++ b/drivers/pci/pcie/aer/aerdrv.c
>> > @@ -330,6 +330,13 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>> > reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
>> > pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
>> >
>> > + /*
>> > + * This function is called only on ERR_FATAL now, and since
>> > + * the pci_report_resume is called only in ERR_NONFATAL case,
>> > + * the clearing part has to be taken care here.
>> > + */
>> > + aer_error_resume(dev);
>>
>> I don't understand this part. Previously the ERR_FATAL path looked
>> like
>> this:
>>
>> do_recovery
>> reset_link
>> driver->reset_link
>> aer_root_reset
>> pci_reset_bridge_secondary_bus # <-- reset
>> broadcast_error_message(..., report_resume)
>> pci_walk_bus(..., report_resume, ...)
>> report_resume
>> if (cb == report_resume)
>> pci_cleanup_aer_uncorrect_error_status
>> pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
>> status
>>
>> After this patch, it will look like this:
>>
>> do_recovery
>> do_fatal_recovery
>> pci_cleanup_aer_uncorrect_error_status
>> pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
>> status
>> reset_link
>> driver->reset_link
>> aer_root_reset
>> pci_reset_bridge_secondary_bus # <-- reset
>> aer_error_resume
>> pcie_capability_write_word(PCI_EXP_DEVSTA) # <--
>> clear more
>> pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <--
>> clear status
>>
>> So if I'm understanding correctly, the new path clears the status too
>> early, then clears it again (plus clearing DEVSTA, which we didn't do
>> before) later.
>>
>> I would think we would want to leave aer_root_reset() alone, and just
>> move
>> the pci_cleanup_aer_uncorrect_error_status() in do_fatal_recovery()
>> down so
>> it happens after we call reset_link(). That way the reset/clear
>> sequence
>> would be the same as it was before.
>
> I've been fiddling with this a bit myself and will post the results to
> see
> what you think.
ok so you are suggesting to move pci_cleanup_aer_uncorrect_error_status
down which I can do.
And not to call aer_error_resume, because you think its clearing the
status again.
following code: calls aer_error_resume.
pci_broadcast_error_message()
/*
* If the error is reported by an end point, we think
this
* error is related to the upstream link of the end
point.
*/
if (state == pci_channel_io_normal)
/*
* the error is non fatal so the bus is ok, just
invoke
* the callback for the function that logged the
error.
*/
cb(dev, &result_data);
else
pci_walk_bus(dev->bus, cb, &result_data);
besides aer_error_resume does following things in addition to clearing
PCI_ERR_UNCOR_STATUS
/* Clean up Root device status */
pcie_capability_read_word(dev, PCI_EXP_DEVSTA, ®16);
pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16);
if (dev->error_state == pci_channel_io_normal)
status &= ~mask; /* Clear corresponding nonfatal bits */
else
status &= mask; /* Clear corresponding fatal bits */
pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
so we have to have conditional call
such as
if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
error_resume
so the code might look like this..
do_recovery
do_fatal_recovery
reset_link
driver->reset_link
aer_root_reset
pci_reset_bridge_secondary_bus # <-- reset
if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
{
aer_error_resume
pcie_capability_write_word(PCI_EXP_DEVSTA) #
<-- clear more
pci_write_config_dword(PCI_ERR_UNCOR_STATUS) #
<--
}
pci_cleanup_aer_uncorrect_error_status(dev);
does it make sense ?
Regards,
Oza.
On Wed, May 09, 2018 at 06:44:53PM +0530, [email protected] wrote:
> On 2018-05-09 18:37, Bjorn Helgaas wrote:
> > On Tue, May 08, 2018 at 06:53:30PM -0500, Bjorn Helgaas wrote:
> > > On Thu, May 03, 2018 at 01:03:52AM -0400, Oza Pawandeep wrote:
> > > > This patch alters the behavior of handling of ERR_FATAL, where removal
> > > > of devices is initiated, followed by reset link, followed by
> > > > re-enumeration.
> > > >
> > > > So the errors are handled in a different way as follows:
> > > > ERR_NONFATAL => call driver recovery entry points
> > > > ERR_FATAL => remove and re-enumerate
> > > >
> > > > please refer to Documentation/PCI/pci-error-recovery.txt for more details.
> > > >
> > > > Signed-off-by: Oza Pawandeep <[email protected]>
> > > >
> > > > diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> > > > index 779b387..206f590 100644
> > > > --- a/drivers/pci/pcie/aer/aerdrv.c
> > > > +++ b/drivers/pci/pcie/aer/aerdrv.c
> > > > @@ -330,6 +330,13 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
> > > > reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> > > > pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
> > > >
> > > > + /*
> > > > + * This function is called only on ERR_FATAL now, and since
> > > > + * the pci_report_resume is called only in ERR_NONFATAL case,
> > > > + * the clearing part has to be taken care here.
> > > > + */
> > > > + aer_error_resume(dev);
> > >
> > > I don't understand this part. Previously the ERR_FATAL path looked
> > > like
> > > this:
> > >
> > > do_recovery
> > > reset_link
> > > driver->reset_link
> > > aer_root_reset
> > > pci_reset_bridge_secondary_bus # <-- reset
> > > broadcast_error_message(..., report_resume)
> > > pci_walk_bus(..., report_resume, ...)
> > > report_resume
> > > if (cb == report_resume)
> > > pci_cleanup_aer_uncorrect_error_status
> > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
> > > status
> > >
> > > After this patch, it will look like this:
> > >
> > > do_recovery
> > > do_fatal_recovery
> > > pci_cleanup_aer_uncorrect_error_status
> > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
> > > status
> > > reset_link
> > > driver->reset_link
> > > aer_root_reset
> > > pci_reset_bridge_secondary_bus # <-- reset
> > > aer_error_resume
> > > pcie_capability_write_word(PCI_EXP_DEVSTA) #
> > > <-- clear more
> > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) #
> > > <-- clear status
> > >
> > > So if I'm understanding correctly, the new path clears the status too
> > > early, then clears it again (plus clearing DEVSTA, which we didn't do
> > > before) later.
> > >
> > > I would think we would want to leave aer_root_reset() alone, and
> > > just move
> > > the pci_cleanup_aer_uncorrect_error_status() in do_fatal_recovery()
> > > down so
> > > it happens after we call reset_link(). That way the reset/clear
> > > sequence
> > > would be the same as it was before.
> >
> > I've been fiddling with this a bit myself and will post the results to
> > see
> > what you think.
>
>
> ok so you are suggesting to move pci_cleanup_aer_uncorrect_error_status down
> which I can do.
>
> And not to call aer_error_resume, because you think its clearing the status
> again.
>
> following code: calls aer_error_resume.
> pci_broadcast_error_message()
> /*
> * If the error is reported by an end point, we think this
> * error is related to the upstream link of the end point.
> */
> if (state == pci_channel_io_normal)
> /*
> * the error is non fatal so the bus is ok, just
> invoke
> * the callback for the function that logged the
> error.
> */
> cb(dev, &result_data);
> else
> pci_walk_bus(dev->bus, cb, &result_data);
Holy crap, I thought this could not possibly get any more complicated,
but you're right; we do actually call aer_error_resume() today via an
extremely convoluted path:
do_recovery(pci_dev)
broadcast_error_message(..., error_detected, ...)
if (AER_FATAL)
reset_link(pci_dev)
udev = BRIDGE ? pci_dev : pci_dev->bus->self
driver->reset_link(udev)
aer_root_reset(udev)
if (CAN_RECOVER)
broadcast_error_message(..., mmio_enabled, ...)
if (NEED_RESET)
broadcast_error_message(..., slot_reset, ...)
broadcast_error_message(dev, ..., report_resume, ...)
if (BRIDGE)
report_resume
driver->resume
pcie_portdrv_err_resume
device_for_each_child(..., resume_iter)
resume_iter
driver->error_resume
aer_error_resume
pci_cleanup_aer_uncorrect_error_status(pci_dev) # only if BRIDGE
pci_write_config_dword(PCI_ERR_UNCOR_STATUS)
aerdriver is the only port service driver that implements
.error_resume(), and aerdriver only binds to root ports. I can't
really believe all these device_for_each_child()/resume_iter()
gyrations are necessary when this is AER code calling AER code.
Bjorn
On 2018-05-10 04:51, Bjorn Helgaas wrote:
> On Wed, May 09, 2018 at 06:44:53PM +0530, [email protected] wrote:
>> On 2018-05-09 18:37, Bjorn Helgaas wrote:
>> > On Tue, May 08, 2018 at 06:53:30PM -0500, Bjorn Helgaas wrote:
>> > > On Thu, May 03, 2018 at 01:03:52AM -0400, Oza Pawandeep wrote:
>> > > > This patch alters the behavior of handling of ERR_FATAL, where removal
>> > > > of devices is initiated, followed by reset link, followed by
>> > > > re-enumeration.
>> > > >
>> > > > So the errors are handled in a different way as follows:
>> > > > ERR_NONFATAL => call driver recovery entry points
>> > > > ERR_FATAL => remove and re-enumerate
>> > > >
>> > > > please refer to Documentation/PCI/pci-error-recovery.txt for more details.
>> > > >
>> > > > Signed-off-by: Oza Pawandeep <[email protected]>
>> > > >
>> > > > diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
>> > > > index 779b387..206f590 100644
>> > > > --- a/drivers/pci/pcie/aer/aerdrv.c
>> > > > +++ b/drivers/pci/pcie/aer/aerdrv.c
>> > > > @@ -330,6 +330,13 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>> > > > reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
>> > > > pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
>> > > >
>> > > > + /*
>> > > > + * This function is called only on ERR_FATAL now, and since
>> > > > + * the pci_report_resume is called only in ERR_NONFATAL case,
>> > > > + * the clearing part has to be taken care here.
>> > > > + */
>> > > > + aer_error_resume(dev);
>> > >
>> > > I don't understand this part. Previously the ERR_FATAL path looked
>> > > like
>> > > this:
>> > >
>> > > do_recovery
>> > > reset_link
>> > > driver->reset_link
>> > > aer_root_reset
>> > > pci_reset_bridge_secondary_bus # <-- reset
>> > > broadcast_error_message(..., report_resume)
>> > > pci_walk_bus(..., report_resume, ...)
>> > > report_resume
>> > > if (cb == report_resume)
>> > > pci_cleanup_aer_uncorrect_error_status
>> > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
>> > > status
>> > >
>> > > After this patch, it will look like this:
>> > >
>> > > do_recovery
>> > > do_fatal_recovery
>> > > pci_cleanup_aer_uncorrect_error_status
>> > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
>> > > status
>> > > reset_link
>> > > driver->reset_link
>> > > aer_root_reset
>> > > pci_reset_bridge_secondary_bus # <-- reset
>> > > aer_error_resume
>> > > pcie_capability_write_word(PCI_EXP_DEVSTA) #
>> > > <-- clear more
>> > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) #
>> > > <-- clear status
>> > >
>> > > So if I'm understanding correctly, the new path clears the status too
>> > > early, then clears it again (plus clearing DEVSTA, which we didn't do
>> > > before) later.
>> > >
>> > > I would think we would want to leave aer_root_reset() alone, and
>> > > just move
>> > > the pci_cleanup_aer_uncorrect_error_status() in do_fatal_recovery()
>> > > down so
>> > > it happens after we call reset_link(). That way the reset/clear
>> > > sequence
>> > > would be the same as it was before.
>> >
>> > I've been fiddling with this a bit myself and will post the results to
>> > see
>> > what you think.
>>
>>
>> ok so you are suggesting to move
>> pci_cleanup_aer_uncorrect_error_status down
>> which I can do.
>>
>> And not to call aer_error_resume, because you think its clearing the
>> status
>> again.
>>
>> following code: calls aer_error_resume.
>> pci_broadcast_error_message()
>> /*
>> * If the error is reported by an end point, we think
>> this
>> * error is related to the upstream link of the end
>> point.
>> */
>> if (state == pci_channel_io_normal)
>> /*
>> * the error is non fatal so the bus is ok,
>> just
>> invoke
>> * the callback for the function that logged
>> the
>> error.
>> */
>> cb(dev, &result_data);
>> else
>> pci_walk_bus(dev->bus, cb, &result_data);
>
> Holy crap, I thought this could not possibly get any more complicated,
> but you're right; we do actually call aer_error_resume() today via an
> extremely convoluted path:
>
> do_recovery(pci_dev)
> broadcast_error_message(..., error_detected, ...)
> if (AER_FATAL)
> reset_link(pci_dev)
> udev = BRIDGE ? pci_dev : pci_dev->bus->self
> driver->reset_link(udev)
> aer_root_reset(udev)
> if (CAN_RECOVER)
> broadcast_error_message(..., mmio_enabled, ...)
> if (NEED_RESET)
> broadcast_error_message(..., slot_reset, ...)
> broadcast_error_message(dev, ..., report_resume, ...)
> if (BRIDGE)
> report_resume
> driver->resume
> pcie_portdrv_err_resume
> device_for_each_child(..., resume_iter)
> resume_iter
> driver->error_resume
> aer_error_resume
> pci_cleanup_aer_uncorrect_error_status(pci_dev) # only if
> BRIDGE
> pci_write_config_dword(PCI_ERR_UNCOR_STATUS)
>
> aerdriver is the only port service driver that implements
> .error_resume(), and aerdriver only binds to root ports. I can't
> really believe all these device_for_each_child()/resume_iter()
> gyrations are necessary when this is AER code calling AER code.
>
> Bjorn
here is the code of do_fatal_recovery, where I have moved the things
down and doing only if it is bridge.
let me know how this looks to you, so then I can post v16.
static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int
severity)
{
struct pci_dev *udev;
struct pci_bus *parent;
struct pci_dev *pdev, *temp;
struct aer_broadcast_data result_data;
pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
udev = dev;
else
udev = dev->bus->self;
parent = udev->subordinate;
pci_lock_rescan_remove();
list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
bus_list) {
pci_dev_get(pdev);
pci_dev_set_disconnected(pdev, NULL);
if (pci_has_subordinate(pdev))
pci_walk_bus(pdev->subordinate,
pci_dev_set_disconnected, NULL);
pci_stop_and_remove_bus_device(pdev);
pci_dev_put(pdev);
}
result = reset_link(udev, severity);
if (severity == AER_FATAL && dev->hdr_type ==
PCI_HEADER_TYPE_BRIDGE) {
pci_walk_bus(dev->subordinate, report_resume,
&result_data);
pci_cleanup_aer_uncorrect_error_status(dev);
dev->error_state = pci_channel_io_normal;
}
if (result == PCI_ERS_RESULT_RECOVERED)
if (pcie_wait_for_link(udev, true))
pci_rescan_bus(udev->bus);
pci_unlock_rescan_remove();
return result;
}
Regards,
Oza.
On Thu, May 10, 2018 at 12:31:16PM +0530, [email protected] wrote:
> On 2018-05-10 04:51, Bjorn Helgaas wrote:
> > On Wed, May 09, 2018 at 06:44:53PM +0530, [email protected] wrote:
> > > On 2018-05-09 18:37, Bjorn Helgaas wrote:
> > > > On Tue, May 08, 2018 at 06:53:30PM -0500, Bjorn Helgaas wrote:
> > > > > On Thu, May 03, 2018 at 01:03:52AM -0400, Oza Pawandeep wrote:
> > > > > > This patch alters the behavior of handling of ERR_FATAL, where removal
> > > > > > of devices is initiated, followed by reset link, followed by
> > > > > > re-enumeration.
> > > > > >
> > > > > > So the errors are handled in a different way as follows:
> > > > > > ERR_NONFATAL => call driver recovery entry points
> > > > > > ERR_FATAL => remove and re-enumerate
> > > > > >
> > > > > > please refer to Documentation/PCI/pci-error-recovery.txt for more details.
> > > > > >
> > > > > > Signed-off-by: Oza Pawandeep <[email protected]>
> > > > > >
> > > > > > diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> > > > > > index 779b387..206f590 100644
> > > > > > --- a/drivers/pci/pcie/aer/aerdrv.c
> > > > > > +++ b/drivers/pci/pcie/aer/aerdrv.c
> > > > > > @@ -330,6 +330,13 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
> > > > > > reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> > > > > > pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
> > > > > >
> > > > > > + /*
> > > > > > + * This function is called only on ERR_FATAL now, and since
> > > > > > + * the pci_report_resume is called only in ERR_NONFATAL case,
> > > > > > + * the clearing part has to be taken care here.
> > > > > > + */
> > > > > > + aer_error_resume(dev);
> > > > >
> > > > > I don't understand this part. Previously the ERR_FATAL path looked
> > > > > like
> > > > > this:
> > > > >
> > > > > do_recovery
> > > > > reset_link
> > > > > driver->reset_link
> > > > > aer_root_reset
> > > > > pci_reset_bridge_secondary_bus # <-- reset
> > > > > broadcast_error_message(..., report_resume)
> > > > > pci_walk_bus(..., report_resume, ...)
> > > > > report_resume
> > > > > if (cb == report_resume)
> > > > > pci_cleanup_aer_uncorrect_error_status
> > > > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
> > > > > status
> > > > >
> > > > > After this patch, it will look like this:
> > > > >
> > > > > do_recovery
> > > > > do_fatal_recovery
> > > > > pci_cleanup_aer_uncorrect_error_status
> > > > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
> > > > > status
> > > > > reset_link
> > > > > driver->reset_link
> > > > > aer_root_reset
> > > > > pci_reset_bridge_secondary_bus # <-- reset
> > > > > aer_error_resume
> > > > > pcie_capability_write_word(PCI_EXP_DEVSTA) #
> > > > > <-- clear more
> > > > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) #
> > > > > <-- clear status
> > > > >
> > > > > So if I'm understanding correctly, the new path clears the status too
> > > > > early, then clears it again (plus clearing DEVSTA, which we didn't do
> > > > > before) later.
> > > > >
> > > > > I would think we would want to leave aer_root_reset() alone, and
> > > > > just move
> > > > > the pci_cleanup_aer_uncorrect_error_status() in do_fatal_recovery()
> > > > > down so
> > > > > it happens after we call reset_link(). That way the reset/clear
> > > > > sequence
> > > > > would be the same as it was before.
> > > >
> > > > I've been fiddling with this a bit myself and will post the results to
> > > > see
> > > > what you think.
> > >
> > >
> > > ok so you are suggesting to move
> > > pci_cleanup_aer_uncorrect_error_status down
> > > which I can do.
> > >
> > > And not to call aer_error_resume, because you think its clearing the
> > > status
> > > again.
> > >
> > > following code: calls aer_error_resume.
> > > pci_broadcast_error_message()
> > > /*
> > > * If the error is reported by an end point, we
> > > think this
> > > * error is related to the upstream link of the end
> > > point.
> > > */
> > > if (state == pci_channel_io_normal)
> > > /*
> > > * the error is non fatal so the bus is ok,
> > > just
> > > invoke
> > > * the callback for the function that logged
> > > the
> > > error.
> > > */
> > > cb(dev, &result_data);
> > > else
> > > pci_walk_bus(dev->bus, cb, &result_data);
> >
> > Holy crap, I thought this could not possibly get any more complicated,
> > but you're right; we do actually call aer_error_resume() today via an
> > extremely convoluted path:
> >
> > do_recovery(pci_dev)
> > broadcast_error_message(..., error_detected, ...)
> > if (AER_FATAL)
> > reset_link(pci_dev)
> > udev = BRIDGE ? pci_dev : pci_dev->bus->self
> > driver->reset_link(udev)
> > aer_root_reset(udev)
> > if (CAN_RECOVER)
> > broadcast_error_message(..., mmio_enabled, ...)
> > if (NEED_RESET)
> > broadcast_error_message(..., slot_reset, ...)
> > broadcast_error_message(dev, ..., report_resume, ...)
> > if (BRIDGE)
> > report_resume
> > driver->resume
> > pcie_portdrv_err_resume
> > device_for_each_child(..., resume_iter)
> > resume_iter
> > driver->error_resume
> > aer_error_resume
> > pci_cleanup_aer_uncorrect_error_status(pci_dev) # only if
> > BRIDGE
> > pci_write_config_dword(PCI_ERR_UNCOR_STATUS)
> >
> > aerdriver is the only port service driver that implements
> > .error_resume(), and aerdriver only binds to root ports. I can't
> > really believe all these device_for_each_child()/resume_iter()
> > gyrations are necessary when this is AER code calling AER code.
> >
> > Bjorn
>
> here is the code of do_fatal_recovery, where I have moved the things down
> and doing only if it is bridge.
> let me know how this looks to you, so then I can post v16.
This looks superficially OK. It is very difficult for me to verify that
the behavior is equivalent to the current code, but that's not your fault;
it's just a consequence of the existing design.
I have a couple trivial comments elsewhere, and I'll respond to those
patches individually.
> static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int severity)
> {
> struct pci_dev *udev;
> struct pci_bus *parent;
> struct pci_dev *pdev, *temp;
> struct aer_broadcast_data result_data;
> pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
>
>
> if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
> udev = dev;
> else
> udev = dev->bus->self;
>
> parent = udev->subordinate;
> pci_lock_rescan_remove();
> list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
> bus_list) {
> pci_dev_get(pdev);
> pci_dev_set_disconnected(pdev, NULL);
> if (pci_has_subordinate(pdev))
> pci_walk_bus(pdev->subordinate,
> pci_dev_set_disconnected, NULL);
> pci_stop_and_remove_bus_device(pdev);
> pci_dev_put(pdev);
> }
>
> result = reset_link(udev, severity);
> if (severity == AER_FATAL && dev->hdr_type ==
> PCI_HEADER_TYPE_BRIDGE) {
> pci_walk_bus(dev->subordinate, report_resume, &result_data);
> pci_cleanup_aer_uncorrect_error_status(dev);
> dev->error_state = pci_channel_io_normal;
> }
> if (result == PCI_ERS_RESULT_RECOVERED)
> if (pcie_wait_for_link(udev, true))
> pci_rescan_bus(udev->bus);
>
> pci_unlock_rescan_remove();
>
> return result;
> }
>
> Regards,
> Oza.
>
>
On 2018-05-10 14:10, Bjorn Helgaas wrote:
> On Thu, May 10, 2018 at 12:31:16PM +0530, [email protected] wrote:
>> On 2018-05-10 04:51, Bjorn Helgaas wrote:
>> > On Wed, May 09, 2018 at 06:44:53PM +0530, [email protected] wrote:
>> > > On 2018-05-09 18:37, Bjorn Helgaas wrote:
>> > > > On Tue, May 08, 2018 at 06:53:30PM -0500, Bjorn Helgaas wrote:
>> > > > > On Thu, May 03, 2018 at 01:03:52AM -0400, Oza Pawandeep wrote:
>> > > > > > This patch alters the behavior of handling of ERR_FATAL, where removal
>> > > > > > of devices is initiated, followed by reset link, followed by
>> > > > > > re-enumeration.
>> > > > > >
>> > > > > > So the errors are handled in a different way as follows:
>> > > > > > ERR_NONFATAL => call driver recovery entry points
>> > > > > > ERR_FATAL => remove and re-enumerate
>> > > > > >
>> > > > > > please refer to Documentation/PCI/pci-error-recovery.txt for more details.
>> > > > > >
>> > > > > > Signed-off-by: Oza Pawandeep <[email protected]>
>> > > > > >
>> > > > > > diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
>> > > > > > index 779b387..206f590 100644
>> > > > > > --- a/drivers/pci/pcie/aer/aerdrv.c
>> > > > > > +++ b/drivers/pci/pcie/aer/aerdrv.c
>> > > > > > @@ -330,6 +330,13 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>> > > > > > reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
>> > > > > > pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
>> > > > > >
>> > > > > > + /*
>> > > > > > + * This function is called only on ERR_FATAL now, and since
>> > > > > > + * the pci_report_resume is called only in ERR_NONFATAL case,
>> > > > > > + * the clearing part has to be taken care here.
>> > > > > > + */
>> > > > > > + aer_error_resume(dev);
>> > > > >
>> > > > > I don't understand this part. Previously the ERR_FATAL path looked
>> > > > > like
>> > > > > this:
>> > > > >
>> > > > > do_recovery
>> > > > > reset_link
>> > > > > driver->reset_link
>> > > > > aer_root_reset
>> > > > > pci_reset_bridge_secondary_bus # <-- reset
>> > > > > broadcast_error_message(..., report_resume)
>> > > > > pci_walk_bus(..., report_resume, ...)
>> > > > > report_resume
>> > > > > if (cb == report_resume)
>> > > > > pci_cleanup_aer_uncorrect_error_status
>> > > > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
>> > > > > status
>> > > > >
>> > > > > After this patch, it will look like this:
>> > > > >
>> > > > > do_recovery
>> > > > > do_fatal_recovery
>> > > > > pci_cleanup_aer_uncorrect_error_status
>> > > > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
>> > > > > status
>> > > > > reset_link
>> > > > > driver->reset_link
>> > > > > aer_root_reset
>> > > > > pci_reset_bridge_secondary_bus # <-- reset
>> > > > > aer_error_resume
>> > > > > pcie_capability_write_word(PCI_EXP_DEVSTA) #
>> > > > > <-- clear more
>> > > > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) #
>> > > > > <-- clear status
>> > > > >
>> > > > > So if I'm understanding correctly, the new path clears the status too
>> > > > > early, then clears it again (plus clearing DEVSTA, which we didn't do
>> > > > > before) later.
>> > > > >
>> > > > > I would think we would want to leave aer_root_reset() alone, and
>> > > > > just move
>> > > > > the pci_cleanup_aer_uncorrect_error_status() in do_fatal_recovery()
>> > > > > down so
>> > > > > it happens after we call reset_link(). That way the reset/clear
>> > > > > sequence
>> > > > > would be the same as it was before.
>> > > >
>> > > > I've been fiddling with this a bit myself and will post the results to
>> > > > see
>> > > > what you think.
>> > >
>> > >
>> > > ok so you are suggesting to move
>> > > pci_cleanup_aer_uncorrect_error_status down
>> > > which I can do.
>> > >
>> > > And not to call aer_error_resume, because you think its clearing the
>> > > status
>> > > again.
>> > >
>> > > following code: calls aer_error_resume.
>> > > pci_broadcast_error_message()
>> > > /*
>> > > * If the error is reported by an end point, we
>> > > think this
>> > > * error is related to the upstream link of the end
>> > > point.
>> > > */
>> > > if (state == pci_channel_io_normal)
>> > > /*
>> > > * the error is non fatal so the bus is ok,
>> > > just
>> > > invoke
>> > > * the callback for the function that logged
>> > > the
>> > > error.
>> > > */
>> > > cb(dev, &result_data);
>> > > else
>> > > pci_walk_bus(dev->bus, cb, &result_data);
>> >
>> > Holy crap, I thought this could not possibly get any more complicated,
>> > but you're right; we do actually call aer_error_resume() today via an
>> > extremely convoluted path:
>> >
>> > do_recovery(pci_dev)
>> > broadcast_error_message(..., error_detected, ...)
>> > if (AER_FATAL)
>> > reset_link(pci_dev)
>> > udev = BRIDGE ? pci_dev : pci_dev->bus->self
>> > driver->reset_link(udev)
>> > aer_root_reset(udev)
>> > if (CAN_RECOVER)
>> > broadcast_error_message(..., mmio_enabled, ...)
>> > if (NEED_RESET)
>> > broadcast_error_message(..., slot_reset, ...)
>> > broadcast_error_message(dev, ..., report_resume, ...)
>> > if (BRIDGE)
>> > report_resume
>> > driver->resume
>> > pcie_portdrv_err_resume
>> > device_for_each_child(..., resume_iter)
>> > resume_iter
>> > driver->error_resume
>> > aer_error_resume
>> > pci_cleanup_aer_uncorrect_error_status(pci_dev) # only if
>> > BRIDGE
>> > pci_write_config_dword(PCI_ERR_UNCOR_STATUS)
>> >
>> > aerdriver is the only port service driver that implements
>> > .error_resume(), and aerdriver only binds to root ports. I can't
>> > really believe all these device_for_each_child()/resume_iter()
>> > gyrations are necessary when this is AER code calling AER code.
>> >
>> > Bjorn
>>
>> here is the code of do_fatal_recovery, where I have moved the things
>> down
>> and doing only if it is bridge.
>> let me know how this looks to you, so then I can post v16.
>
> This looks superficially OK. It is very difficult for me to verify
> that
> the behavior is equivalent to the current code, but that's not your
> fault;
> it's just a consequence of the existing design.
>
> I have a couple trivial comments elsewhere, and I'll respond to those
> patches individually.
>
>> static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int
>> severity)
>> {
>> struct pci_dev *udev;
>> struct pci_bus *parent;
>> struct pci_dev *pdev, *temp;
>> struct aer_broadcast_data result_data;
>> pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
>>
>>
>> if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
>> udev = dev;
>> else
>> udev = dev->bus->self;
>>
>> parent = udev->subordinate;
>> pci_lock_rescan_remove();
>> list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
>> bus_list) {
>> pci_dev_get(pdev);
>> pci_dev_set_disconnected(pdev, NULL);
>> if (pci_has_subordinate(pdev))
>> pci_walk_bus(pdev->subordinate,
>> pci_dev_set_disconnected, NULL);
>> pci_stop_and_remove_bus_device(pdev);
>> pci_dev_put(pdev);
>> }
>>
>> result = reset_link(udev, severity);
>> if (severity == AER_FATAL && dev->hdr_type ==
>> PCI_HEADER_TYPE_BRIDGE) {
>> pci_walk_bus(dev->subordinate, report_resume,
>> &result_data);
Why are we calling resume?
>> pci_cleanup_aer_uncorrect_error_status(dev);
>> dev->error_state = pci_channel_io_normal;
>> }
>> if (result == PCI_ERS_RESULT_RECOVERED)
>> if (pcie_wait_for_link(udev, true))
>> pci_rescan_bus(udev->bus);
>>
>> pci_unlock_rescan_remove();
>>
>> return result;
>> }
>>
>> Regards,
>> Oza.
>>
>>
On Thu, May 03, 2018 at 01:03:52AM -0400, Oza Pawandeep wrote:
> This patch alters the behavior of handling of ERR_FATAL, where removal
> of devices is initiated, followed by reset link, followed by
> re-enumeration.
>
> So the errors are handled in a different way as follows:
> ERR_NONFATAL => call driver recovery entry points
> ERR_FATAL => remove and re-enumerate
>
> please refer to Documentation/PCI/pci-error-recovery.txt for more details.
>
> Signed-off-by: Oza Pawandeep <[email protected]>
>
> diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> index 779b387..206f590 100644
> --- a/drivers/pci/pcie/aer/aerdrv.c
> +++ b/drivers/pci/pcie/aer/aerdrv.c
> @@ -330,6 +330,13 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
> reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
>
> + /*
> + * This function is called only on ERR_FATAL now, and since
> + * the pci_report_resume is called only in ERR_NONFATAL case,
> + * the clearing part has to be taken care here.
> + */
> + aer_error_resume(dev);
> +
> return PCI_ERS_RESULT_RECOVERED;
> }
>
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 0ea5acc..655d4e8 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -20,6 +20,7 @@
> #include <linux/slab.h>
> #include <linux/kfifo.h>
> #include "aerdrv.h"
> +#include "../../pci.h"
>
> #define PCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
> PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
> @@ -474,6 +475,44 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
> return status;
> }
>
> +static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int severity)
Here's a possiblity for your consideration. Expose these two interfaces:
void pcie_do_nonfatal_recovery(struct pci_dev *dev);
void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service);
(this would be the end result, after the rename and move to err.c) and move
the fatal/nonfatal testing into the callers, e..g,
handle_error_source(...)
{
...
if (info->severity == AER_NONFATAL)
pcie_do_nonfatal_recovery(dev);
else if (info->severity == AER_FATAL)
pcie_do_fatal_recovery(dev, PCIE_PORT_SERVICE_AER);
}
Then I don't think you would need this code in reset_link():
reset_link(...)
{
...
if (severity == DPC_FATAL)
service = PCIE_PORT_SERVICE_DPC;
...
because you would already have the service.
> +{
> + struct pci_dev *udev;
> + struct pci_bus *parent;
> + struct pci_dev *pdev, *temp;
> + pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
> +
> + if (severity == AER_FATAL)
> + pci_cleanup_aer_uncorrect_error_status(dev);
> +
> + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
> + udev = dev;
> + else
> + udev = dev->bus->self;
> +
> + parent = udev->subordinate;
> + pci_lock_rescan_remove();
> + list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
> + bus_list) {
> + pci_dev_get(pdev);
> + pci_dev_set_disconnected(pdev, NULL);
> + if (pci_has_subordinate(pdev))
> + pci_walk_bus(pdev->subordinate,
> + pci_dev_set_disconnected, NULL);
> + pci_stop_and_remove_bus_device(pdev);
> + pci_dev_put(pdev);
> + }
> +
> + result = reset_link(udev);
> + if (result == PCI_ERS_RESULT_RECOVERED)
> + if (pcie_wait_for_link(udev, true))
> + pci_rescan_bus(udev->bus);
> +
> + pci_unlock_rescan_remove();
> +
> + return result;
> +}
> +
> /**
> * do_recovery - handle nonfatal/fatal error recovery process
> * @dev: pointer to a pci_dev data structure of agent detecting an error
> @@ -485,11 +524,15 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
> */
> static void do_recovery(struct pci_dev *dev, int severity)
> {
> - pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
> + pci_ers_result_t status;
> enum pci_channel_state state;
>
> - if (severity == AER_FATAL)
> - state = pci_channel_io_frozen;
> + if (severity == AER_FATAL) {
> + status = do_fatal_recovery(dev, severity);
> + if (status != PCI_ERS_RESULT_RECOVERED)
> + goto failed;
> + return;
> + }
> else
> state = pci_channel_io_normal;
>
> @@ -498,12 +541,6 @@ static void do_recovery(struct pci_dev *dev, int severity)
> "error_detected",
> report_error_detected);
>
> - if (severity == AER_FATAL) {
> - result = reset_link(dev);
> - if (result != PCI_ERS_RESULT_RECOVERED)
> - goto failed;
> - }
> -
> if (status == PCI_ERS_RESULT_CAN_RECOVER)
> status = broadcast_error_message(dev,
> state,
> --
> 2.7.4
>
On Thu, May 03, 2018 at 01:03:50AM -0400, Oza Pawandeep wrote:
> + * pcie_wait_for_link - Wait for link till it's active/inactive
> + * @pdev: Bridge device
> + * @active: waiting for active or inactive ?
s/inactive ?/inactive?/
On Thu, May 03, 2018 at 01:03:57AM -0400, Oza Pawandeep wrote:
> Current DPC driver does not do recovery, e.g. calling end-point's driver's
> callbacks, which sanitize the sw.
>
> DPC driver implements link_reset callback, and calls pci_do_recovery().
>
> Signed-off-by: Oza Pawandeep <[email protected]>
>
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 80ec384..aed7c9f 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -73,29 +73,21 @@ static void dpc_wait_link_inactive(struct dpc_dev *dpc)
> pcie_wait_for_link(pdev, false);
> }
>
> -static void dpc_work(struct work_struct *work)
> +static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
> {
> - struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
> - struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
> - struct pci_bus *parent = pdev->subordinate;
> - u16 cap = dpc->cap_pos, ctl;
> -
> - pci_lock_rescan_remove();
> - list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
> - bus_list) {
> - pci_dev_get(dev);
> - pci_dev_set_disconnected(dev, NULL);
> - if (pci_has_subordinate(dev))
> - pci_walk_bus(dev->subordinate,
> - pci_dev_set_disconnected, NULL);
> - pci_stop_and_remove_bus_device(dev);
> - pci_dev_put(dev);
> - }
> - pci_unlock_rescan_remove();
I think it would be good to have a comment here about why this "reset_link"
function doesn't actually reset the link, e.g.,
/*
* DPC disables the Link automatically in hardware, so it has
* already been reset by the time we get here.
*/
> + struct dpc_dev *dpc;
> + struct pcie_device *pciedev;
> + struct device *devdpc;
> + u16 cap, ctl;
> +
> + devdpc = pcie_port_find_device(pdev, PCIE_PORT_SERVICE_DPC);
> + pciedev = to_pcie_device(devdpc);
> + dpc = get_service_data(pciedev);
> + cap = dpc->cap_pos;
And maybe one about waiting until the link is inactive, then clearing DPC
Trigger Status to allow the port to leave DPC.
> dpc_wait_link_inactive(dpc);
> if (dpc->rp_extensions && dpc_wait_rp_inactive(dpc))
> - return;
> + return PCI_ERS_RESULT_DISCONNECT;
> if (dpc->rp_extensions && dpc->rp_pio_status) {
> pci_write_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS,
> dpc->rp_pio_status);
> @@ -108,6 +100,17 @@ static void dpc_work(struct work_struct *work)
> pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
> pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
> ctl | PCI_EXP_DPC_CTL_INT_EN);
> +
> + return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +static void dpc_work(struct work_struct *work)
> +{
> + struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
> + struct pci_dev *pdev = dpc->dev->port;
> +
> + /* From DPC point of view error is always FATAL. */
> + pcie_do_recovery(pdev, DPC_FATAL);
> }
>
> static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
> @@ -288,6 +291,7 @@ static struct pcie_port_service_driver dpcdriver = {
> .service = PCIE_PORT_SERVICE_DPC,
> .probe = dpc_probe,
> .remove = dpc_remove,
> + .reset_link = dpc_reset_link,
> };
>
> static int __init dpc_service_init(void)
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 877785d..526aba8 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -181,11 +181,12 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
> return PCI_ERS_RESULT_RECOVERED;
> }
>
> -static pci_ers_result_t reset_link(struct pci_dev *dev)
> +static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
> {
> struct pci_dev *udev;
> pci_ers_result_t status;
> struct pcie_port_service_driver *driver;
> + u32 service;
>
> if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> /* Reset this port for all subordinates */
> @@ -196,7 +197,12 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
> }
>
> /* Use the aer driver of the component firstly */
> - driver = pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
> + if (severity == DPC_FATAL)
> + service = PCIE_PORT_SERVICE_DPC;
> + else
> + service = PCIE_PORT_SERVICE_AER;
> +
> + driver = pcie_port_find_service(udev, service);
This is where I was wondering about passing in "service" directly instead
of "severity".
> if (driver && driver->reset_link) {
> status = driver->reset_link(udev);
> @@ -302,7 +308,7 @@ static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int severity)
> pci_dev_put(pdev);
> }
>
> - result = reset_link(udev);
> + result = reset_link(udev, severity);
> if (result == PCI_ERS_RESULT_RECOVERED)
> if (pcie_wait_for_link(udev, true))
> pci_rescan_bus(udev->bus);
> @@ -326,7 +332,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
> pci_ers_result_t status;
> enum pci_channel_state state;
>
> - if (severity == AER_FATAL) {
> + if ((severity == AER_FATAL) ||
> + (severity == DPC_FATAL)) {
> status = do_fatal_recovery(dev, severity);
> if (status != PCI_ERS_RESULT_RECOVERED)
> goto failed;
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 8f87bbe..0c506fe 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -14,6 +14,7 @@
> #define AER_NONFATAL 0
> #define AER_FATAL 1
> #define AER_CORRECTABLE 2
> +#define DPC_FATAL 4
>
> struct pci_dev;
>
> --
> 2.7.4
>
On Thu, May 03, 2018 at 01:03:58AM -0400, Oza Pawandeep wrote:
> This patch disables ERR_NONFATAL trigger for DPC, so now DPC
> handles only ERR_FATAL.
> ...
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 103ba79..86f1cc2 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -981,6 +981,7 @@
> #define PCI_EXP_DPC_CAP_DL_ACTIVE 0x1000 /* ERR_COR signal on DL_Active supported */
>
> #define PCI_EXP_DPC_CTL 6 /* DPC control */
> +#define PCI_EXP_DPC_CTL_EN_FATAL 0x0001 /* Enable trigger on ERR_FATAL message */
There's a tab instead of a space after the #define. The other #defines use
a space. Looks the same in the end, but makes the patch look funny.
> #define PCI_EXP_DPC_CTL_EN_NONFATAL 0x0002 /* Enable trigger on ERR_NONFATAL message */
> #define PCI_EXP_DPC_CTL_INT_EN 0x0008 /* DPC Interrupt Enable */
>
> --
> 2.7.4
>
On Thu, May 03, 2018 at 01:03:56AM -0400, Oza Pawandeep wrote:
> This patch implements generic pcie_port_find_device() routine.
> ...
> + * pcie_port_find_device - find the struct device
> + * @dev: PCI Express port the service devices associated with
> + * @service: For the service to find
> + *
> + * Find PCI Express port service driver associated with given service
> + */
> +struct device *pcie_port_find_device(struct pci_dev *dev,
s/struct device/struct device/ (remove extra space)
> + u32 service)
> +{
> + struct device *device;
> + struct portdrv_service_data pdrvs;
> +
> + pdrvs.dev = NULL;
> + pdrvs.service = service;
> + device_for_each_child(&dev->dev, &pdrvs, find_service_iter);
> +
> + device = pdrvs.dev;
> + return device;
> +}
> +
> +/**
On 2018-05-10 18:45, [email protected] wrote:
> On 2018-05-10 14:10, Bjorn Helgaas wrote:
>> On Thu, May 10, 2018 at 12:31:16PM +0530, [email protected] wrote:
>>> On 2018-05-10 04:51, Bjorn Helgaas wrote:
>>> > On Wed, May 09, 2018 at 06:44:53PM +0530, [email protected] wrote:
>>> > > On 2018-05-09 18:37, Bjorn Helgaas wrote:
>>> > > > On Tue, May 08, 2018 at 06:53:30PM -0500, Bjorn Helgaas wrote:
>>> > > > > On Thu, May 03, 2018 at 01:03:52AM -0400, Oza Pawandeep wrote:
>>> > > > > > This patch alters the behavior of handling of ERR_FATAL, where removal
>>> > > > > > of devices is initiated, followed by reset link, followed by
>>> > > > > > re-enumeration.
>>> > > > > >
>>> > > > > > So the errors are handled in a different way as follows:
>>> > > > > > ERR_NONFATAL => call driver recovery entry points
>>> > > > > > ERR_FATAL => remove and re-enumerate
>>> > > > > >
>>> > > > > > please refer to Documentation/PCI/pci-error-recovery.txt for more details.
>>> > > > > >
>>> > > > > > Signed-off-by: Oza Pawandeep <[email protected]>
>>> > > > > >
>>> > > > > > diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
>>> > > > > > index 779b387..206f590 100644
>>> > > > > > --- a/drivers/pci/pcie/aer/aerdrv.c
>>> > > > > > +++ b/drivers/pci/pcie/aer/aerdrv.c
>>> > > > > > @@ -330,6 +330,13 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>>> > > > > > reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
>>> > > > > > pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
>>> > > > > >
>>> > > > > > + /*
>>> > > > > > + * This function is called only on ERR_FATAL now, and since
>>> > > > > > + * the pci_report_resume is called only in ERR_NONFATAL case,
>>> > > > > > + * the clearing part has to be taken care here.
>>> > > > > > + */
>>> > > > > > + aer_error_resume(dev);
>>> > > > >
>>> > > > > I don't understand this part. Previously the ERR_FATAL path looked
>>> > > > > like
>>> > > > > this:
>>> > > > >
>>> > > > > do_recovery
>>> > > > > reset_link
>>> > > > > driver->reset_link
>>> > > > > aer_root_reset
>>> > > > > pci_reset_bridge_secondary_bus # <-- reset
>>> > > > > broadcast_error_message(..., report_resume)
>>> > > > > pci_walk_bus(..., report_resume, ...)
>>> > > > > report_resume
>>> > > > > if (cb == report_resume)
>>> > > > > pci_cleanup_aer_uncorrect_error_status
>>> > > > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
>>> > > > > status
>>> > > > >
>>> > > > > After this patch, it will look like this:
>>> > > > >
>>> > > > > do_recovery
>>> > > > > do_fatal_recovery
>>> > > > > pci_cleanup_aer_uncorrect_error_status
>>> > > > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) # <-- clear
>>> > > > > status
>>> > > > > reset_link
>>> > > > > driver->reset_link
>>> > > > > aer_root_reset
>>> > > > > pci_reset_bridge_secondary_bus # <-- reset
>>> > > > > aer_error_resume
>>> > > > > pcie_capability_write_word(PCI_EXP_DEVSTA) #
>>> > > > > <-- clear more
>>> > > > > pci_write_config_dword(PCI_ERR_UNCOR_STATUS) #
>>> > > > > <-- clear status
>>> > > > >
>>> > > > > So if I'm understanding correctly, the new path clears the status too
>>> > > > > early, then clears it again (plus clearing DEVSTA, which we didn't do
>>> > > > > before) later.
>>> > > > >
>>> > > > > I would think we would want to leave aer_root_reset() alone, and
>>> > > > > just move
>>> > > > > the pci_cleanup_aer_uncorrect_error_status() in do_fatal_recovery()
>>> > > > > down so
>>> > > > > it happens after we call reset_link(). That way the reset/clear
>>> > > > > sequence
>>> > > > > would be the same as it was before.
>>> > > >
>>> > > > I've been fiddling with this a bit myself and will post the results to
>>> > > > see
>>> > > > what you think.
>>> > >
>>> > >
>>> > > ok so you are suggesting to move
>>> > > pci_cleanup_aer_uncorrect_error_status down
>>> > > which I can do.
>>> > >
>>> > > And not to call aer_error_resume, because you think its clearing the
>>> > > status
>>> > > again.
>>> > >
>>> > > following code: calls aer_error_resume.
>>> > > pci_broadcast_error_message()
>>> > > /*
>>> > > * If the error is reported by an end point, we
>>> > > think this
>>> > > * error is related to the upstream link of the end
>>> > > point.
>>> > > */
>>> > > if (state == pci_channel_io_normal)
>>> > > /*
>>> > > * the error is non fatal so the bus is ok,
>>> > > just
>>> > > invoke
>>> > > * the callback for the function that logged
>>> > > the
>>> > > error.
>>> > > */
>>> > > cb(dev, &result_data);
>>> > > else
>>> > > pci_walk_bus(dev->bus, cb, &result_data);
>>> >
>>> > Holy crap, I thought this could not possibly get any more complicated,
>>> > but you're right; we do actually call aer_error_resume() today via an
>>> > extremely convoluted path:
>>> >
>>> > do_recovery(pci_dev)
>>> > broadcast_error_message(..., error_detected, ...)
>>> > if (AER_FATAL)
>>> > reset_link(pci_dev)
>>> > udev = BRIDGE ? pci_dev : pci_dev->bus->self
>>> > driver->reset_link(udev)
>>> > aer_root_reset(udev)
>>> > if (CAN_RECOVER)
>>> > broadcast_error_message(..., mmio_enabled, ...)
>>> > if (NEED_RESET)
>>> > broadcast_error_message(..., slot_reset, ...)
>>> > broadcast_error_message(dev, ..., report_resume, ...)
>>> > if (BRIDGE)
>>> > report_resume
>>> > driver->resume
>>> > pcie_portdrv_err_resume
>>> > device_for_each_child(..., resume_iter)
>>> > resume_iter
>>> > driver->error_resume
>>> > aer_error_resume
>>> > pci_cleanup_aer_uncorrect_error_status(pci_dev) # only if
>>> > BRIDGE
>>> > pci_write_config_dword(PCI_ERR_UNCOR_STATUS)
>>> >
>>> > aerdriver is the only port service driver that implements
>>> > .error_resume(), and aerdriver only binds to root ports. I can't
>>> > really believe all these device_for_each_child()/resume_iter()
>>> > gyrations are necessary when this is AER code calling AER code.
>>> >
>>> > Bjorn
>>>
>>> here is the code of do_fatal_recovery, where I have moved the things
>>> down
>>> and doing only if it is bridge.
>>> let me know how this looks to you, so then I can post v16.
>>
>> This looks superficially OK. It is very difficult for me to verify
>> that
>> the behavior is equivalent to the current code, but that's not your
>> fault;
>> it's just a consequence of the existing design.
>>
>> I have a couple trivial comments elsewhere, and I'll respond to those
>> patches individually.
>>
>>> static pci_ers_result_t do_fatal_recovery(struct pci_dev *dev, int
>>> severity)
>>> {
>>> struct pci_dev *udev;
>>> struct pci_bus *parent;
>>> struct pci_dev *pdev, *temp;
>>> struct aer_broadcast_data result_data;
>>> pci_ers_result_t result = PCI_ERS_RESULT_RECOVERED;
>>>
>>>
>>> if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
>>> udev = dev;
>>> else
>>> udev = dev->bus->self;
>>>
>>> parent = udev->subordinate;
>>> pci_lock_rescan_remove();
>>> list_for_each_entry_safe_reverse(pdev, temp,
>>> &parent->devices,
>>> bus_list) {
>>> pci_dev_get(pdev);
>>> pci_dev_set_disconnected(pdev, NULL);
>>> if (pci_has_subordinate(pdev))
>>> pci_walk_bus(pdev->subordinate,
>>> pci_dev_set_disconnected, NULL);
>>> pci_stop_and_remove_bus_device(pdev);
>>> pci_dev_put(pdev);
>>> }
>>>
>>> result = reset_link(udev, severity);
>>> if (severity == AER_FATAL && dev->hdr_type ==
>>> PCI_HEADER_TYPE_BRIDGE) {
>>> pci_walk_bus(dev->subordinate, report_resume,
>>> &result_data);
>
> Why are we calling resume?
the reason we have to call resume here, because we are not calling
aer_resume() any more in root_reset.
and we have to call resume only in bridge case.
please have a look at couple of conversation back with Bjorn.
the objective is to align the sequence close to the current code.
>
>>> pci_cleanup_aer_uncorrect_error_status(dev);
>>> dev->error_state = pci_channel_io_normal;
>>> }
>>> if (result == PCI_ERS_RESULT_RECOVERED)
>>> if (pcie_wait_for_link(udev, true))
>>> pci_rescan_bus(udev->bus);
>>>
>>> pci_unlock_rescan_remove();
>>>
>>> return result;
>>> }
>>>
>>> Regards,
>>> Oza.
>>>
>>>
On 2018-05-10 18:52, Bjorn Helgaas wrote:
> On Thu, May 03, 2018 at 01:03:57AM -0400, Oza Pawandeep wrote:
>> Current DPC driver does not do recovery, e.g. calling end-point's
>> driver's
>> callbacks, which sanitize the sw.
>>
>> DPC driver implements link_reset callback, and calls
>> pci_do_recovery().
>>
>> Signed-off-by: Oza Pawandeep <[email protected]>
>>
>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>> index 80ec384..aed7c9f 100644
>> --- a/drivers/pci/pcie/dpc.c
>> +++ b/drivers/pci/pcie/dpc.c
>> @@ -73,29 +73,21 @@ static void dpc_wait_link_inactive(struct dpc_dev
>> *dpc)
>> pcie_wait_for_link(pdev, false);
>> }
>>
>> -static void dpc_work(struct work_struct *work)
>> +static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
>> {
>> - struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
>> - struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
>> - struct pci_bus *parent = pdev->subordinate;
>> - u16 cap = dpc->cap_pos, ctl;
>> -
>> - pci_lock_rescan_remove();
>> - list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
>> - bus_list) {
>> - pci_dev_get(dev);
>> - pci_dev_set_disconnected(dev, NULL);
>> - if (pci_has_subordinate(dev))
>> - pci_walk_bus(dev->subordinate,
>> - pci_dev_set_disconnected, NULL);
>> - pci_stop_and_remove_bus_device(dev);
>> - pci_dev_put(dev);
>> - }
>> - pci_unlock_rescan_remove();
>
> I think it would be good to have a comment here about why this
> "reset_link"
> function doesn't actually reset the link, e.g.,
>
> /*
> * DPC disables the Link automatically in hardware, so it has
> * already been reset by the time we get here.
> */
>
>> + struct dpc_dev *dpc;
>> + struct pcie_device *pciedev;
>> + struct device *devdpc;
>> + u16 cap, ctl;
>> +
>> + devdpc = pcie_port_find_device(pdev, PCIE_PORT_SERVICE_DPC);
>> + pciedev = to_pcie_device(devdpc);
>> + dpc = get_service_data(pciedev);
>> + cap = dpc->cap_pos;
>
> And maybe one about waiting until the link is inactive, then clearing
> DPC
> Trigger Status to allow the port to leave DPC.
>
>> dpc_wait_link_inactive(dpc);
>> if (dpc->rp_extensions && dpc_wait_rp_inactive(dpc))
>> - return;
>> + return PCI_ERS_RESULT_DISCONNECT;
>> if (dpc->rp_extensions && dpc->rp_pio_status) {
>> pci_write_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS,
>> dpc->rp_pio_status);
>> @@ -108,6 +100,17 @@ static void dpc_work(struct work_struct *work)
>> pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
>> pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
>> ctl | PCI_EXP_DPC_CTL_INT_EN);
>> +
>> + return PCI_ERS_RESULT_RECOVERED;
>> +}
>> +
>> +static void dpc_work(struct work_struct *work)
>> +{
>> + struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
>> + struct pci_dev *pdev = dpc->dev->port;
>> +
>> + /* From DPC point of view error is always FATAL. */
>> + pcie_do_recovery(pdev, DPC_FATAL);
>> }
>>
>> static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
>> @@ -288,6 +291,7 @@ static struct pcie_port_service_driver dpcdriver =
>> {
>> .service = PCIE_PORT_SERVICE_DPC,
>> .probe = dpc_probe,
>> .remove = dpc_remove,
>> + .reset_link = dpc_reset_link,
>> };
>>
>> static int __init dpc_service_init(void)
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 877785d..526aba8 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -181,11 +181,12 @@ static pci_ers_result_t
>> default_reset_link(struct pci_dev *dev)
>> return PCI_ERS_RESULT_RECOVERED;
>> }
>>
>> -static pci_ers_result_t reset_link(struct pci_dev *dev)
>> +static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
>> {
>> struct pci_dev *udev;
>> pci_ers_result_t status;
>> struct pcie_port_service_driver *driver;
>> + u32 service;
>>
>> if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>> /* Reset this port for all subordinates */
>> @@ -196,7 +197,12 @@ static pci_ers_result_t reset_link(struct pci_dev
>> *dev)
>> }
>>
>> /* Use the aer driver of the component firstly */
>> - driver = pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
>> + if (severity == DPC_FATAL)
>> + service = PCIE_PORT_SERVICE_DPC;
>> + else
>> + service = PCIE_PORT_SERVICE_AER;
>> +
>> + driver = pcie_port_find_service(udev, service);
>
> This is where I was wondering about passing in "service" directly
> instead
> of "severity".
I will take care of all your comments made on all the patches.
but this one I do not understand.
passing service directly instead of severity ? (I do not think you meant
to write severity there)
perhaps do you mean following ?
if (severity == DPC_FATAL)
pcie_port_find_service(udev, PCIE_PORT_SERVICE_DPC);
else
pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
>
>> if (driver && driver->reset_link) {
>> status = driver->reset_link(udev);
>> @@ -302,7 +308,7 @@ static pci_ers_result_t do_fatal_recovery(struct
>> pci_dev *dev, int severity)
>> pci_dev_put(pdev);
>> }
>>
>> - result = reset_link(udev);
>> + result = reset_link(udev, severity);
>> if (result == PCI_ERS_RESULT_RECOVERED)
>> if (pcie_wait_for_link(udev, true))
>> pci_rescan_bus(udev->bus);
>> @@ -326,7 +332,8 @@ void pcie_do_recovery(struct pci_dev *dev, int
>> severity)
>> pci_ers_result_t status;
>> enum pci_channel_state state;
>>
>> - if (severity == AER_FATAL) {
>> + if ((severity == AER_FATAL) ||
>> + (severity == DPC_FATAL)) {
>> status = do_fatal_recovery(dev, severity);
>> if (status != PCI_ERS_RESULT_RECOVERED)
>> goto failed;
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 8f87bbe..0c506fe 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -14,6 +14,7 @@
>> #define AER_NONFATAL 0
>> #define AER_FATAL 1
>> #define AER_CORRECTABLE 2
>> +#define DPC_FATAL 4
>>
>> struct pci_dev;
>>
>> --
>> 2.7.4
>>
On Thu, May 10, 2018 at 07:56:00PM +0530, [email protected] wrote:
> On 2018-05-10 18:52, Bjorn Helgaas wrote:
> > On Thu, May 03, 2018 at 01:03:57AM -0400, Oza Pawandeep wrote:
> ...
> > > -static pci_ers_result_t reset_link(struct pci_dev *dev)
> > > +static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
> > > {
> > > struct pci_dev *udev;
> > > pci_ers_result_t status;
> > > struct pcie_port_service_driver *driver;
> > > + u32 service;
> > >
> > > if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> > > /* Reset this port for all subordinates */
> > > @@ -196,7 +197,12 @@ static pci_ers_result_t reset_link(struct
> > > pci_dev *dev)
> > > }
> > >
> > > /* Use the aer driver of the component firstly */
> > > - driver = pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
> > > + if (severity == DPC_FATAL)
> > > + service = PCIE_PORT_SERVICE_DPC;
> > > + else
> > > + service = PCIE_PORT_SERVICE_AER;
> > > +
> > > + driver = pcie_port_find_service(udev, service);
> >
> > This is where I was wondering about passing in "service" directly
> > instead
> > of "severity".
>
> passing service directly instead of severity ? (I do not think you meant to
> write severity there)
I did mean "severity".
> perhaps do you mean following ?
>
> if (severity == DPC_FATAL)
> pcie_port_find_service(udev, PCIE_PORT_SERVICE_DPC);
> else
> pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
No, my thought was that most of the places that use "severity" only
use it to decide between PCIE_PORT_SERVICE_DPC and
PCIE_PORT_SERVICE_AER, so if you just passed in "service" directly,
you wouldn't need this "if" statement.
Bjorn