LinuxLists.cc - [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR

2018-06-22 10:01:21

Subject: [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL

These are follow up patches for the series which modifies ERR_FATAL handling.
although there were couple of problems existed before which, itis also fixing.

patch-1:
Fixes the problem where ERR_FATAL and ERR_NONFATAL status should be cleared
taking severity mask into account.

patch-2:
Takes care of clearing error fatal status

patch-3:
Follow up patch where no more need of handling ERR_FATAL
case.

patch-4:
Fixes clearing device status in case of uncorrectable errors.
(e.g. ERR_FATAL and ERR_NONFATAL)

patch-5:
Fixes clearing device status in case of correctable errors.

patch-6:
Follow up patch where no more need of handling pci_channel_io_frozen
in pcie_portdrv_slot_reset()

Oza Pawandeep (6):
PCI/AER: Take severity mask into account while clearing error bits
PCI/AER: Clear uncorrectable fatal error status bits
PCI/ERR: Cleanup ERR_FATAL of error broadcast
PCI/AER: Clear device error status bits during ERR_FATAL and
ERR_NONFATAL
PCI/AER: Fix correctable status bits clearing in device register
PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset()

drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------
drivers/pci/pcie/err.c | 15 +++++++--------
drivers/pci/pcie/portdrv_pci.c | 18 ------------------
include/linux/aer.h | 5 +++++
4 files changed, 35 insertions(+), 38 deletions(-)

--
2.7.4

2018-06-22 09:59:09

by Oza Pawandeep

[permalink] [raw]

Subject: [PATCH v2 3/6] PCI/ERR: Cleanup ERR_FATAL of error broadcast

ERR_FATAL is handled by resetting the Link in software, skipping the
driver pci_error_handlers callbacks, removing the devices from the PCI
subsystem, and re-enumerating, so now no more ERR_FATAL handling is
required inside pci_broadcast_error_message()

Signed-off-by: Oza Pawandeep <[email protected]>

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 00d2875..404bb69 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -259,15 +259,10 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
/*
* If the error is reported by an end point, we think this
* error is related to the upstream link of the end point.
+ * the error is non fatal so the bus is ok, just invoke
+ * the callback for the function that logged the error.
*/
- if (state == pci_channel_io_normal)
- /*
- * the error is non fatal so the bus is ok, just invoke
- * the callback for the function that logged the error.
- */
- cb(dev, &result_data);
- else
- pci_walk_bus(dev->bus, cb, &result_data);
+ cb(dev, &result_data);
}

return result_data.result;
--
2.7.4

2018-06-22 09:59:15

by Oza Pawandeep

[permalink] [raw]

Subject: [PATCH v2 4/6] PCI/AER: Clear device error status bits during ERR_FATAL and ERR_NONFATAL

In both ERR_FATAL and ERR_NONFATA cases the device error status
bits needs to be cleared.

Signed-off-by: Oza Pawandeep <[email protected]>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e9c115d..d2d6868 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -357,6 +357,17 @@ int pci_disable_pcie_error_reporting(struct pci_dev *dev)
}
EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting);

+int pci_cleanup_aer_error_device_status(struct pci_dev *dev)
+{
+ u16 reg16;
+
+ /* Clean up Root device status */
+ pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &reg16);
+ pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16);
+
+ return 0;
+}
+
int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
{
int pos;
@@ -1344,11 +1355,8 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
*/
static void aer_error_resume(struct pci_dev *dev)
{
- u16 reg16;
-
/* Clean up Root device status */
- pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &reg16);
- pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16);
+ pci_cleanup_aer_error_device_status(dev);

/* Clean AER Root Error Status */
pci_cleanup_aer_uncorrect_error_status(dev);
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 404bb69..410c35c 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -252,6 +252,7 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
dev->error_state = state;
pci_walk_bus(dev->subordinate, cb, &result_data);
if (cb == report_resume) {
+ pci_cleanup_aer_error_device_status(dev);
pci_cleanup_aer_uncorrect_error_status(dev);
dev->error_state = pci_channel_io_normal;
}
@@ -312,6 +313,7 @@ void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
* do error recovery on all subordinates of the bridge instead
* of the bridge and clear the error status of the bridge.
*/
+ pci_cleanup_aer_error_device_status(dev);
pci_cleanup_aer_uncorrect_error_status(dev);
}

diff --git a/include/linux/aer.h b/include/linux/aer.h
index 514bffa..165a147 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -44,6 +44,7 @@ struct aer_capability_regs {
/* PCIe port driver needs this function to enable AER */
int pci_enable_pcie_error_reporting(struct pci_dev *dev);
int pci_disable_pcie_error_reporting(struct pci_dev *dev);
+int pci_cleanup_aer_error_device_status(struct pci_dev *dev);
int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
int pci_cleanup_aer_error_status_regs(struct pci_dev *dev);
#else
@@ -55,6 +56,10 @@ static inline int pci_disable_pcie_error_reporting(struct pci_dev *dev)
{
return -EINVAL;
}
+static inline int pci_cleanup_aer_error_device_status(struct pci_dev *dev)
+{
+ return -EINVAL;
+}
static inline int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
{
return -EINVAL;
--
2.7.4

2018-06-22 09:59:39

by Oza Pawandeep

[permalink] [raw]

Subject: [PATCH v2 2/6] PCI/AER: Clear uncorrectable fatal error status bits

During ERR_FATAL handling, AER calls pci_cleanup_aer_uncorrect_error_status
which should handle pci_channel_io_frozen case in order to determine if it
has to clear fatal bits.

Signed-off-by: Oza Pawandeep <[email protected]>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d6cb1f0..e9c115d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -369,7 +369,12 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
/* Clean AER Root Error Status */
pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
- status &= ~mask; /* Clear corresponding nonfatal bits */
+
+ if (dev->error_state == pci_channel_io_normal)
+ status &= ~mask; /* Clear corresponding nonfatal bits */
+ else
+ status &= mask; /* Clear corresponding fatal bits */
+
if (status)
pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index f7ce0cb..00d2875 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -288,6 +288,7 @@ void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
struct pci_dev *pdev, *temp;
pci_ers_result_t result;

+ dev->error_state = pci_channel_io_frozen;
if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
udev = dev;
else
@@ -323,6 +324,7 @@ void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
if (pcie_wait_for_link(udev, true))
pci_rescan_bus(udev->bus);
pci_info(dev, "Device recovery from fatal error successful\n");
+ dev->error_state = pci_channel_io_normal;
} else {
pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
pci_info(dev, "Device recovery from fatal error failed\n");
--
2.7.4

2018-06-22 10:00:01

by Oza Pawandeep

[permalink] [raw]

Subject: [PATCH v2 5/6] PCI/AER: Fix correctable status bits clearing in device register

In case of correctable error Device Status Register sets
Correctable Error Detected, which should be cleared after handling
the error

Signed-off-by: Oza Pawandeep <[email protected]>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d2d6868..a42b071 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -818,6 +818,7 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
if (pos)
pci_write_config_dword(dev, pos + PCI_ERR_COR_STATUS,
info->status);
+ pci_cleanup_aer_error_device_status(dev);
} else if (info->severity == AER_NONFATAL)
pcie_do_nonfatal_recovery(dev);
else if (info->severity == AER_FATAL)
--
2.7.4

2018-06-22 10:00:55

by Oza Pawandeep

[permalink] [raw]

Subject: [PATCH v2 6/6] PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset()

We are handling ERR_FATAL by resetting the Link in software,skipping the
driver pci_error_handlers callbacks, removing the devices from the PCI
subsystem, and re-enumerating, because of, no need to handle
pci_channel_io_frozen case anymore.

Besides the walk on the bus is happening on subordinates, inside
broadcast_error_message(), which means that pcie_portdrv_slot_reset()
is never called for RP, and now since the all the devices are removed under
this downstream link, we can safely get rid of ERR_FATAL handling code
in pcie_portdrv_slot_reset().

Signed-off-by: Oza Pawandeep <[email protected]>

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 973f1b8..b970a6d 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -42,17 +42,6 @@ __setup("pcie_ports=", pcie_port_setup);

/* global data */

-static int pcie_portdrv_restore_config(struct pci_dev *dev)
-{
- int retval;
-
- retval = pci_enable_device(dev);
- if (retval)
- return retval;
- pci_set_master(dev);
- return 0;
-}
-
#ifdef CONFIG_PM
static int pcie_port_runtime_suspend(struct device *dev)
{
@@ -163,13 +152,6 @@ static pci_ers_result_t pcie_portdrv_mmio_enabled(struct pci_dev *dev)
static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
{
/* If fatal, restore cfg space for possible link reset at upstream */
- if (dev->error_state == pci_channel_io_frozen) {
- dev->state_saved = true;
- pci_restore_state(dev);
- pcie_portdrv_restore_config(dev);
- pci_enable_pcie_error_reporting(dev);
- }
-
return PCI_ERS_RESULT_RECOVERED;
}

--
2.7.4

2018-06-22 10:01:26

by Oza Pawandeep

[permalink] [raw]

Subject: [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits

pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset
callbacks in case of ERR_NONFATAL.

AER uncorrectable error status should take severity into account in order
to clear the bits, so that ERR_NONFATAL path does not clear the bit which
are marked with severity fatal.

Signed-off-by: Oza Pawandeep <[email protected]>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e8838..d6cb1f0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -360,13 +360,16 @@ EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting);
int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
{
int pos;
- u32 status;
+ u32 status, mask;

pos = dev->aer_cap;
if (!pos)
return -EIO;

+ /* Clean AER Root Error Status */
pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
+ pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
+ status &= ~mask; /* Clear corresponding nonfatal bits */
if (status)
pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);

@@ -1336,8 +1339,6 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
*/
static void aer_error_resume(struct pci_dev *dev)
{
- int pos;
- u32 status, mask;
u16 reg16;

/* Clean up Root device status */
@@ -1345,11 +1346,7 @@ static void aer_error_resume(struct pci_dev *dev)
pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16);

/* Clean AER Root Error Status */
- pos = dev->aer_cap;
- pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
- pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
- status &= ~mask; /* Clear corresponding nonfatal bits */
- pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
+ pci_cleanup_aer_uncorrect_error_status(dev);
}

static struct pcie_port_service_driver aerdriver = {
--
2.7.4

2018-07-05 15:13:58

by Oza Pawandeep

[permalink] [raw]

Subject: Re: [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL

Hi Bjorn,

Could you help to review this series ?

Regards,
Oza.

On 2018-06-22 15:28, Oza Pawandeep wrote:
> These are follow up patches for the series which modifies ERR_FATAL
> handling.
> although there were couple of problems existed before which, itis also
> fixing.
>
> patch-1:
> Fixes the problem where ERR_FATAL and ERR_NONFATAL status should be
> cleared
> taking severity mask into account.
>
> patch-2:
> Takes care of clearing error fatal status
>
> patch-3:
> Follow up patch where no more need of handling ERR_FATAL
> case.
>
> patch-4:
> Fixes clearing device status in case of uncorrectable errors.
> (e.g. ERR_FATAL and ERR_NONFATAL)
>
> patch-5:
> Fixes clearing device status in case of correctable errors.
>
> patch-6:
> Follow up patch where no more need of handling pci_channel_io_frozen
> in pcie_portdrv_slot_reset()
>
> Oza Pawandeep (6):
> PCI/AER: Take severity mask into account while clearing error bits
> PCI/AER: Clear uncorrectable fatal error status bits
> PCI/ERR: Cleanup ERR_FATAL of error broadcast
> PCI/AER: Clear device error status bits during ERR_FATAL and
> ERR_NONFATAL
> PCI/AER: Fix correctable status bits clearing in device register
> PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset()
>
> drivers/pci/pcie/aer.c | 35
> +++++++++++++++++++++++------------
> drivers/pci/pcie/err.c | 15 +++++++--------
> drivers/pci/pcie/portdrv_pci.c | 18 ------------------
> include/linux/aer.h | 5 +++++
> 4 files changed, 35 insertions(+), 38 deletions(-)

2018-07-17 19:04:30

by Bjorn Helgaas

[permalink] [raw]

Subject: Re: [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits

Hi Oza,

Thanks for doing this!

On Fri, Jun 22, 2018 at 05:58:09AM -0400, Oza Pawandeep wrote:
> pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset
> callbacks in case of ERR_NONFATAL.

IIRC, the current strategy is:

ERR_COR: log only
ERR_NONFATAL: call driver callbacks (pci_error_handlers)
ERR_FATAL: remove device and re-enumerate

So these slot_reset callbacks are only used for ERR_NONFATAL, which
are all uncorrectable errors, of course.

This patch makes it so that when the slot_reset callbacks call
pci_cleanup_aer_uncorrect_error_status(), we only clear the
bits set by ERR_NONFATAL events (this is determined by
PCI_ERR_UNCOR_SEVER).

That makes good sense to me. All these status bits are RW1CS, so they
will be preserved across a reset but will be cleared when we
re-enumerate, in this path:

pci_init_capabilities
pci_aer_init
pci_cleanup_aer_error_status_regs
# clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits

> AER uncorrectable error status should take severity into account in order
> to clear the bits, so that ERR_NONFATAL path does not clear the bit which
> are marked with severity fatal.

Two comments:

1) Can you split this into two patches, one that changes
pci_cleanup_aer_uncorrect_error_status() so it looks like the error
clearing code in aer_error_resume(), and a second that factors out the
duplicate code?

2) Maybe use "sev" or "sever" instead of "mask" for the local
variable, since there is also a separate PCI_ERR_UNCOR_MASK register,
which is not involved here.

3) The "pci_cleanup_aer_uncorrect_error_status()" name no longer
really describes what this does. Something like
"pci_aer_clear_nonfatal_status()" would be more descriptive. But I
see you have a subsequent patch (which I haven't looked at yet) that
is related to this.

4) I don't think the driver slot_reset callbacks should be responsible
for clearing these AER status bits. Can we clear them somewhere in
the pcie_do_nonfatal_recovery() path and remove these calls from the
drivers?

5) In principle, we should only read PCI_ERR_UNCOR_STATUS *once* per
device when handling an error. We currently read it three times:

aer_isr
aer_isr_one_error
find_source_device
find_device_iter
is_error_source
read PCI_ERR_UNCOR_STATUS # 1
aer_process_err_devices
get_device_error_info(e_info->dev[i])
read PCI_ERR_UNCOR_STATUS # 2
handle_error_source
pcie_do_nonfatal_recovery
...
report_slot_reset
driver->err_handler->slot_reset
pci_cleanup_aer_uncorrect_error_status
read PCI_ERR_UNCOR_STATUS # 3

OK, that was more than two comments :)

> Signed-off-by: Oza Pawandeep <[email protected]>
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a2e8838..d6cb1f0 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -360,13 +360,16 @@ EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting);
> int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
> {
> int pos;
> - u32 status;
> + u32 status, mask;
>
> pos = dev->aer_cap;
> if (!pos)
> return -EIO;
>
> + /* Clean AER Root Error Status */

s/Clean/Clear/ (I see you copied this from aer_error_resume(), and maybe
the second patch could remove or update that comment, too)

> pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
> + pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
> + status &= ~mask; /* Clear corresponding nonfatal bits */
> if (status)
> pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
>
> @@ -1336,8 +1339,6 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
> */
> static void aer_error_resume(struct pci_dev *dev)
> {
> - int pos;
> - u32 status, mask;
> u16 reg16;
>
> /* Clean up Root device status */
> @@ -1345,11 +1346,7 @@ static void aer_error_resume(struct pci_dev *dev)
> pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16);
>
> /* Clean AER Root Error Status */
> - pos = dev->aer_cap;
> - pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
> - pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
> - status &= ~mask; /* Clear corresponding nonfatal bits */
> - pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
> + pci_cleanup_aer_uncorrect_error_status(dev);
> }
>
> static struct pcie_port_service_driver aerdriver = {
> --
> 2.7.4
>

2018-07-17 21:37:24

by Bjorn Helgaas

[permalink] [raw]

Subject: Re: [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits

On Tue, Jul 17, 2018 at 02:03:29PM -0500, Bjorn Helgaas wrote:
> Hi Oza,
>
> Thanks for doing this!
>
> On Fri, Jun 22, 2018 at 05:58:09AM -0400, Oza Pawandeep wrote:
> > pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset
> > callbacks in case of ERR_NONFATAL.
>
> IIRC, the current strategy is:
>
> ERR_COR: log only
> ERR_NONFATAL: call driver callbacks (pci_error_handlers)
> ERR_FATAL: remove device and re-enumerate
>
> So these slot_reset callbacks are only used for ERR_NONFATAL, which
> are all uncorrectable errors, of course.
>
> This patch makes it so that when the slot_reset callbacks call
> pci_cleanup_aer_uncorrect_error_status(), we only clear the
> bits set by ERR_NONFATAL events (this is determined by
> PCI_ERR_UNCOR_SEVER).
>
> That makes good sense to me. All these status bits are RW1CS, so they
> will be preserved across a reset but will be cleared when we
> re-enumerate, in this path:
>
> pci_init_capabilities
> pci_aer_init
> pci_cleanup_aer_error_status_regs
> # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits
>
> > AER uncorrectable error status should take severity into account in order
> > to clear the bits, so that ERR_NONFATAL path does not clear the bit which
> > are marked with severity fatal.
>
> Two comments:
>
> 1) Can you split this into two patches, one that changes
> pci_cleanup_aer_uncorrect_error_status() so it looks like the error
> clearing code in aer_error_resume(), and a second that factors out the
> duplicate code?
>
> 2) Maybe use "sev" or "sever" instead of "mask" for the local
> variable, since there is also a separate PCI_ERR_UNCOR_MASK register,
> which is not involved here.

Let me back up a little here: I'm not asking you to do the things
below here. They're just possible future things, so we can think
about them after this series. And the things above are things I can
easily do myself. So no action required from you, unless you think
I'm on the wrong track :)

> 3) The "pci_cleanup_aer_uncorrect_error_status()" name no longer
> really describes what this does. Something like
> "pci_aer_clear_nonfatal_status()" would be more descriptive. But I
> see you have a subsequent patch (which I haven't looked at yet) that
> is related to this.
>
> 4) I don't think the driver slot_reset callbacks should be responsible
> for clearing these AER status bits. Can we clear them somewhere in
> the pcie_do_nonfatal_recovery() path and remove these calls from the
> drivers?
>
> 5) In principle, we should only read PCI_ERR_UNCOR_STATUS *once* per
> device when handling an error. We currently read it three times:
>
> aer_isr
> aer_isr_one_error
> find_source_device
> find_device_iter
> is_error_source
> read PCI_ERR_UNCOR_STATUS # 1
> aer_process_err_devices
> get_device_error_info(e_info->dev[i])
> read PCI_ERR_UNCOR_STATUS # 2
> handle_error_source
> pcie_do_nonfatal_recovery
> ...
> report_slot_reset
> driver->err_handler->slot_reset
> pci_cleanup_aer_uncorrect_error_status
> read PCI_ERR_UNCOR_STATUS # 3
>
> OK, that was more than two comments :)

2018-07-18 09:14:41

by Oza Pawandeep

[permalink] [raw]

Subject: Re: [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits

On 2018-07-18 03:06, Bjorn Helgaas wrote:
> On Tue, Jul 17, 2018 at 02:03:29PM -0500, Bjorn Helgaas wrote:
>> Hi Oza,
>>
>> Thanks for doing this!
>>
>> On Fri, Jun 22, 2018 at 05:58:09AM -0400, Oza Pawandeep wrote:
>> > pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset
>> > callbacks in case of ERR_NONFATAL.
>>
>> IIRC, the current strategy is:
>>
>> ERR_COR: log only
>> ERR_NONFATAL: call driver callbacks (pci_error_handlers)
>> ERR_FATAL: remove device and re-enumerate
>>
>> So these slot_reset callbacks are only used for ERR_NONFATAL, which
>> are all uncorrectable errors, of course.
>>
>> This patch makes it so that when the slot_reset callbacks call
>> pci_cleanup_aer_uncorrect_error_status(), we only clear the
>> bits set by ERR_NONFATAL events (this is determined by
>> PCI_ERR_UNCOR_SEVER).
>>
>> That makes good sense to me. All these status bits are RW1CS, so they
>> will be preserved across a reset but will be cleared when we
>> re-enumerate, in this path:
>>
>> pci_init_capabilities
>> pci_aer_init
>> pci_cleanup_aer_error_status_regs
>> # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits
>>
>> > AER uncorrectable error status should take severity into account in order
>> > to clear the bits, so that ERR_NONFATAL path does not clear the bit which
>> > are marked with severity fatal.
>>
>> Two comments:
>>
>> 1) Can you split this into two patches, one that changes
>> pci_cleanup_aer_uncorrect_error_status() so it looks like the error
>> clearing code in aer_error_resume(), and a second that factors out the
>> duplicate code?
>>
>> 2) Maybe use "sev" or "sever" instead of "mask" for the local
>> variable, since there is also a separate PCI_ERR_UNCOR_MASK register,
>> which is not involved here.
>
> Let me back up a little here: I'm not asking you to do the things
> below here. They're just possible future things, so we can think
> about them after this series. And the things above are things I can
> easily do myself. So no action required from you, unless you think
> I'm on the wrong track :)

I agree with your points, and have taken them into account for future
series reference as well.

what about PATCH-2 of this series ?
that clears ERR_FATAL bits, but as you said, during re-enumeration
pci_init_capabilities
pci_aer_init
pci_cleanup_aer_error_status_regs
# clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits

but that should clear the ERR_FATAL of the devices beneath.

PATCH2: we are doing it for BRIDGE where we think where ERR_FATAL was
reported by bridge and the problem is with downstream link.
if ((service == PCIE_PORT_SERVICE_AER) &&
(dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) {
/*
* If the error is reported by a bridge, we think this error
* is related to the downstream link of the bridge, so we
* do error recovery on all subordinates of the bridge instead
* of the bridge and clear the error status of the bridge.
*/
pci_cleanup_aer_uncorrect_error_status(dev);
}

so overall, I think all the patches are required, if you have comments
please let me know.
so far I see that, no action is required from me.

>
>> 3) The "pci_cleanup_aer_uncorrect_error_status()" name no longer
>> really describes what this does. Something like
>> "pci_aer_clear_nonfatal_status()" would be more descriptive. But I
>> see you have a subsequent patch (which I haven't looked at yet) that
>> is related to this.
>>
>> 4) I don't think the driver slot_reset callbacks should be responsible
>> for clearing these AER status bits. Can we clear them somewhere in
>> the pcie_do_nonfatal_recovery() path and remove these calls from the
>> drivers?
>>
>> 5) In principle, we should only read PCI_ERR_UNCOR_STATUS *once* per
>> device when handling an error. We currently read it three times:
>>
>> aer_isr
>> aer_isr_one_error
>> find_source_device
>> find_device_iter
>> is_error_source
>> read PCI_ERR_UNCOR_STATUS # 1
>> aer_process_err_devices
>> get_device_error_info(e_info->dev[i])
>> read PCI_ERR_UNCOR_STATUS # 2
>> handle_error_source
>> pcie_do_nonfatal_recovery
>> ...
>> report_slot_reset
>> driver->err_handler->slot_reset
>> pci_cleanup_aer_uncorrect_error_status
>> read PCI_ERR_UNCOR_STATUS # 3
>>
>> OK, that was more than two comments :)

2018-07-18 19:11:25

by Bjorn Helgaas

[permalink] [raw]

Subject: Re: [PATCH v2 6/6] PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset()

On Fri, Jun 22, 2018 at 05:58:14AM -0400, Oza Pawandeep wrote:
> We are handling ERR_FATAL by resetting the Link in software,skipping the
> driver pci_error_handlers callbacks, removing the devices from the PCI
> subsystem, and re-enumerating, because of, no need to handle
> pci_channel_io_frozen case anymore.
>
> Besides the walk on the bus is happening on subordinates, inside
> broadcast_error_message(), which means that pcie_portdrv_slot_reset()
> is never called for RP, and now since the all the devices are removed under
> this downstream link, we can safely get rid of ERR_FATAL handling code
> in pcie_portdrv_slot_reset().

portdrv only binds to bridges, and it looks like
broadcast_error_message() only calls the pci_error_handlers callbacks
for subordinate devices, never for the bridge itself:

broadcast_error_message
if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
pci_walk_bus(dev->subordinate, ...)

If that's the case, why do we need pcie_portdrv_err_handler at all?
We should remove it completely if we don't need it.

ISTR some arcane call path for pcie_portdrv_err_resume(), but I don't
remember the details.

> Signed-off-by: Oza Pawandeep <[email protected]>
>
> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> index 973f1b8..b970a6d 100644
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -42,17 +42,6 @@ __setup("pcie_ports=", pcie_port_setup);
>
> /* global data */
>
> -static int pcie_portdrv_restore_config(struct pci_dev *dev)
> -{
> - int retval;
> -
> - retval = pci_enable_device(dev);
> - if (retval)
> - return retval;
> - pci_set_master(dev);
> - return 0;
> -}
> -
> #ifdef CONFIG_PM
> static int pcie_port_runtime_suspend(struct device *dev)
> {
> @@ -163,13 +152,6 @@ static pci_ers_result_t pcie_portdrv_mmio_enabled(struct pci_dev *dev)
> static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
> {
> /* If fatal, restore cfg space for possible link reset at upstream */
> - if (dev->error_state == pci_channel_io_frozen) {
> - dev->state_saved = true;
> - pci_restore_state(dev);
> - pcie_portdrv_restore_config(dev);
> - pci_enable_pcie_error_reporting(dev);
> - }
> -
> return PCI_ERS_RESULT_RECOVERED;
> }
>
> --
> 2.7.4
>