Subject: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

From: Kuppuswamy Sathyanarayanan <[email protected]>

Current pcie_do_recovery() implementation has following two issues:

1. Fatal (DPC) error recovery is currently broken for non-hotplug
capable devices. Current fatal error recovery implementation relies
on PCIe hotplug (pciehp) handler for detaching and re-enumerating
the affected devices/drivers. pciehp handler listens for DLLSC state
changes and handles device/driver detachment on DLLSC_LINK_DOWN event
and re-enumeration on DLLSC_LINK_UP event. So when dealing with
non-hotplug capable devices, recovery code does not restore the state
of the affected devices correctly. Correct implementation should
restore the device state and call report_slot_reset() function after
resetting the link to restore the state of the device/driver.

You can find fatal non-hotplug related issues reported in following links:

https://lore.kernel.org/linux-pci/[email protected]/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/

2. For non-fatal errors if report_error_detected() or
report_mmio_enabled() functions requests PCI_ERS_RESULT_NEED_RESET then
current pcie_do_recovery() implementation does not do the requested
explicit device reset, instead just calls the report_slot_reset() on all
affected devices. Notifying about the reset via report_slot_reset()
without doing the actual device reset is incorrect.

To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after
successful reset_link() operation. This will ensure ->slot_reset() be
called after reset_link() operation for fatal errors. Also call
pci_bus_reset() to do slot/bus reset() before triggering device specific
->slot_reset() callback. Also, using pci_bus_reset() will restore the state
of the devices after performing the reset operation.

Even though using pci_bus_reset() will do redundant reset operation after
->reset_link() for fatal errors, it should should affect the functional
behavior.

[original patch is from [email protected]]
[original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
Signed-off-by: Jay Vosburgh <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---

Changes since v2:
* Changed the subject of patch to "PCI/ERR: Fix reset logic in
pcie_do_recovery() call". v2 patch link is,
https://lore.kernel.org/linux-pci/ce417fbf81a8a46a89535f44b9224ee9fbb55a29.1591307288.git.sathyanarayanan.kuppuswamy@linux.intel.com/
* Squashed "PCI/ERR: Add reset support for non fatal errors" patch.

drivers/pci/pcie/err.c | 41 +++++++++++++++++++++++++++++++++++++----
1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..b5eb6ba65be1 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,8 +165,29 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_dbg(dev, "broadcast error_detected message\n");
if (state == pci_channel_io_frozen) {
pci_walk_bus(bus, report_frozen_detected, &status);
+ /*
+ * After resetting the link using reset_link() call, the
+ * possible value of error status is either
+ * PCI_ERS_RESULT_DISCONNECT (failure case) or
+ * PCI_ERS_RESULT_NEED_RESET (success case).
+ * So ignore the return value of report_error_detected()
+ * call for fatal errors.
+ *
+ * In EDR mode, since AER and DPC Capabilities are owned by
+ * firmware, reported_error_detected() will return error
+ * status PCI_ERS_RESULT_NO_AER_DRIVER. Continuing
+ * pcie_do_recovery() with error status as
+ * PCI_ERS_RESULT_NO_AER_DRIVER will report recovery failure
+ * irrespective of recovery status. But successful reset_link()
+ * call usually recovers all fatal errors. So ignoring the
+ * status result of report_error_detected() also helps EDR based
+ * error recovery.
+ */
status = reset_link(dev);
- if (status != PCI_ERS_RESULT_RECOVERED) {
+ if (status == PCI_ERS_RESULT_RECOVERED) {
+ status = PCI_ERS_RESULT_NEED_RESET;
+ } else {
+ status = PCI_ERS_RESULT_DISCONNECT;
pci_warn(dev, "link reset failed\n");
goto failed;
}
@@ -182,10 +203,22 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,

if (status == PCI_ERS_RESULT_NEED_RESET) {
/*
- * TODO: Should call platform-specific
- * functions to reset slot before calling
- * drivers' slot_reset callbacks?
+ * TODO: Optimize the call to pci_reset_bus()
+ *
+ * There are two components to pci_reset_bus().
+ *
+ * 1. Do platform specific slot/bus reset.
+ * 2. Save/Restore all devices in the bus.
+ *
+ * For hotplug capable devices and fatal errors,
+ * device is already in reset state due to link
+ * reset. So repeating platform specific slot/bus
+ * reset via pci_reset_bus() call is redundant. So
+ * can optimize this logic and conditionally call
+ * pci_reset_bus().
*/
+ pci_reset_bus(dev);
+
status = PCI_ERS_RESULT_RECOVERED;
pci_dbg(dev, "broadcast slot_reset message\n");
pci_walk_bus(bus, report_slot_reset, &status);
--
2.17.1


Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

Hi Bjorn,

On 7/24/20 12:07 PM, [email protected] wrote:
> From: Kuppuswamy Sathyanarayanan <[email protected]>
>
> Current pcie_do_recovery() implementation has following two issues:
>
> 1. Fatal (DPC) error recovery is currently broken for non-hotplug
> capable devices. Current fatal error recovery implementation relies
> on PCIe hotplug (pciehp) handler for detaching and re-enumerating
> the affected devices/drivers. pciehp handler listens for DLLSC state
> changes and handles device/driver detachment on DLLSC_LINK_DOWN event
> and re-enumeration on DLLSC_LINK_UP event. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. Correct implementation should
> restore the device state and call report_slot_reset() function after
> resetting the link to restore the state of the device/driver.
>
> You can find fatal non-hotplug related issues reported in following links:
>
> https://lore.kernel.org/linux-pci/[email protected]/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/
>
> 2. For non-fatal errors if report_error_detected() or
> report_mmio_enabled() functions requests PCI_ERS_RESULT_NEED_RESET then
> current pcie_do_recovery() implementation does not do the requested
> explicit device reset, instead just calls the report_slot_reset() on all
> affected devices. Notifying about the reset via report_slot_reset()
> without doing the actual device reset is incorrect.
>
> To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after
> successful reset_link() operation. This will ensure ->slot_reset() be
> called after reset_link() operation for fatal errors. Also call
> pci_bus_reset() to do slot/bus reset() before triggering device specific
> ->slot_reset() callback. Also, using pci_bus_reset() will restore the state
> of the devices after performing the reset operation.
>
> Even though using pci_bus_reset() will do redundant reset operation after
> ->reset_link() for fatal errors, it should should affect the functional
> behavior.
Any comment on this patch?
>
> [original patch is from [email protected]]
> [original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
> Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
> Signed-off-by: Jay Vosburgh <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
>
> Changes since v2:
> * Changed the subject of patch to "PCI/ERR: Fix reset logic in
> pcie_do_recovery() call". v2 patch link is,
> https://lore.kernel.org/linux-pci/ce417fbf81a8a46a89535f44b9224ee9fbb55a29.1591307288.git.sathyanarayanan.kuppuswamy@linux.intel.com/
> * Squashed "PCI/ERR: Add reset support for non fatal errors" patch.
>
> drivers/pci/pcie/err.c | 41 +++++++++++++++++++++++++++++++++++++----
> 1 file changed, 37 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 14bb8f54723e..b5eb6ba65be1 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -165,8 +165,29 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> pci_dbg(dev, "broadcast error_detected message\n");
> if (state == pci_channel_io_frozen) {
> pci_walk_bus(bus, report_frozen_detected, &status);
> + /*
> + * After resetting the link using reset_link() call, the
> + * possible value of error status is either
> + * PCI_ERS_RESULT_DISCONNECT (failure case) or
> + * PCI_ERS_RESULT_NEED_RESET (success case).
> + * So ignore the return value of report_error_detected()
> + * call for fatal errors.
> + *
> + * In EDR mode, since AER and DPC Capabilities are owned by
> + * firmware, reported_error_detected() will return error
> + * status PCI_ERS_RESULT_NO_AER_DRIVER. Continuing
> + * pcie_do_recovery() with error status as
> + * PCI_ERS_RESULT_NO_AER_DRIVER will report recovery failure
> + * irrespective of recovery status. But successful reset_link()
> + * call usually recovers all fatal errors. So ignoring the
> + * status result of report_error_detected() also helps EDR based
> + * error recovery.
> + */
> status = reset_link(dev);
> - if (status != PCI_ERS_RESULT_RECOVERED) {
> + if (status == PCI_ERS_RESULT_RECOVERED) {
> + status = PCI_ERS_RESULT_NEED_RESET;
> + } else {
> + status = PCI_ERS_RESULT_DISCONNECT;
> pci_warn(dev, "link reset failed\n");
> goto failed;
> }
> @@ -182,10 +203,22 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>
> if (status == PCI_ERS_RESULT_NEED_RESET) {
> /*
> - * TODO: Should call platform-specific
> - * functions to reset slot before calling
> - * drivers' slot_reset callbacks?
> + * TODO: Optimize the call to pci_reset_bus()
> + *
> + * There are two components to pci_reset_bus().
> + *
> + * 1. Do platform specific slot/bus reset.
> + * 2. Save/Restore all devices in the bus.
> + *
> + * For hotplug capable devices and fatal errors,
> + * device is already in reset state due to link
> + * reset. So repeating platform specific slot/bus
> + * reset via pci_reset_bus() call is redundant. So
> + * can optimize this logic and conditionally call
> + * pci_reset_bus().
> */
> + pci_reset_bus(dev);
> +
> status = PCI_ERS_RESULT_RECOVERED;
> pci_dbg(dev, "broadcast slot_reset message\n");
> pci_walk_bus(bus, report_slot_reset, &status);
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-09-22 18:55:50

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On Fri, Jul 24, 2020 at 12:07:55PM -0700, [email protected] wrote:
> From: Kuppuswamy Sathyanarayanan <[email protected]>
>
> Current pcie_do_recovery() implementation has following two issues:
>

I'm having trouble parsing this out, probably just lack of my
understanding...

> 1. Fatal (DPC) error recovery is currently broken for non-hotplug
> capable devices. Current fatal error recovery implementation relies
> on PCIe hotplug (pciehp) handler for detaching and re-enumerating
> the affected devices/drivers. pciehp handler listens for DLLSC state
> changes and handles device/driver detachment on DLLSC_LINK_DOWN event
> and re-enumeration on DLLSC_LINK_UP event. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly.

Apparently in the hotplug case, something *does* restore the state of
affected devices?

> Correct implementation should
> restore the device state and call report_slot_reset() function after
> resetting the link to restore the state of the device/driver.
>
> You can find fatal non-hotplug related issues reported in following links:
>
> https://lore.kernel.org/linux-pci/[email protected]/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/
>
> 2. For non-fatal errors if report_error_detected() or
> report_mmio_enabled() functions requests PCI_ERS_RESULT_NEED_RESET then
> current pcie_do_recovery() implementation does not do the requested
> explicit device reset, instead just calls the report_slot_reset() on all
> affected devices. Notifying about the reset via report_slot_reset()
> without doing the actual device reset is incorrect.

Is it possible to fix these two issues separately?

> To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after
> successful reset_link() operation. This will ensure ->slot_reset() be
> called after reset_link() operation for fatal errors. Also call
> pci_bus_reset() to do slot/bus reset() before triggering device specific
> ->slot_reset() callback. Also, using pci_bus_reset() will restore the state
> of the devices after performing the reset operation.
>
> Even though using pci_bus_reset() will do redundant reset operation after
> ->reset_link() for fatal errors, it should should affect the functional
> behavior.
>
> [original patch is from [email protected]]
> [original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
> Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
> Signed-off-by: Jay Vosburgh <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
>
> Changes since v2:
> * Changed the subject of patch to "PCI/ERR: Fix reset logic in
> pcie_do_recovery() call". v2 patch link is,
> https://lore.kernel.org/linux-pci/ce417fbf81a8a46a89535f44b9224ee9fbb55a29.1591307288.git.sathyanarayanan.kuppuswamy@linux.intel.com/
> * Squashed "PCI/ERR: Add reset support for non fatal errors" patch.
>
> drivers/pci/pcie/err.c | 41 +++++++++++++++++++++++++++++++++++++----
> 1 file changed, 37 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 14bb8f54723e..b5eb6ba65be1 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -165,8 +165,29 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> pci_dbg(dev, "broadcast error_detected message\n");
> if (state == pci_channel_io_frozen) {
> pci_walk_bus(bus, report_frozen_detected, &status);
> + /*
> + * After resetting the link using reset_link() call, the
> + * possible value of error status is either
> + * PCI_ERS_RESULT_DISCONNECT (failure case) or
> + * PCI_ERS_RESULT_NEED_RESET (success case).
> + * So ignore the return value of report_error_detected()
> + * call for fatal errors.
> + *
> + * In EDR mode, since AER and DPC Capabilities are owned by
> + * firmware, reported_error_detected() will return error
> + * status PCI_ERS_RESULT_NO_AER_DRIVER. Continuing
> + * pcie_do_recovery() with error status as
> + * PCI_ERS_RESULT_NO_AER_DRIVER will report recovery failure
> + * irrespective of recovery status. But successful reset_link()
> + * call usually recovers all fatal errors. So ignoring the
> + * status result of report_error_detected() also helps EDR based
> + * error recovery.

This chain of connections is too long and complicated to be
maintainable: EDR, AER/DPC ownership, NO_AER_DRIVER, etc. It's always
a bad sign when code needs this much explanation.

I don't know how to simplify this, but it does need to be simplified
somehow. I think it might have been my idea to feed all these paths
(AER, DPC, EDR) through the same recovery function, but I'm starting
to think it was a bad idea.

Or maybe it just isn't factored correctly. IIUC for the DPC and EDR
paths, but not for AER, the device (actually the whole subtree) has
been reset before we even get here. So it might help to separate out
the reset part.

> + */
> status = reset_link(dev);
> - if (status != PCI_ERS_RESULT_RECOVERED) {
> + if (status == PCI_ERS_RESULT_RECOVERED) {
> + status = PCI_ERS_RESULT_NEED_RESET;
> + } else {
> + status = PCI_ERS_RESULT_DISCONNECT;
> pci_warn(dev, "link reset failed\n");
> goto failed;
> }
> @@ -182,10 +203,22 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>
> if (status == PCI_ERS_RESULT_NEED_RESET) {
> /*
> - * TODO: Should call platform-specific
> - * functions to reset slot before calling
> - * drivers' slot_reset callbacks?
> + * TODO: Optimize the call to pci_reset_bus()
> + *
> + * There are two components to pci_reset_bus().
> + *
> + * 1. Do platform specific slot/bus reset.
> + * 2. Save/Restore all devices in the bus.
> + *
> + * For hotplug capable devices and fatal errors,
> + * device is already in reset state due to link
> + * reset. So repeating platform specific slot/bus
> + * reset via pci_reset_bus() call is redundant. So
> + * can optimize this logic and conditionally call
> + * pci_reset_bus().
> */
> + pci_reset_bus(dev);
> +
> status = PCI_ERS_RESULT_RECOVERED;
> pci_dbg(dev, "broadcast slot_reset message\n");
> pci_walk_bus(bus, report_slot_reset, &status);
> --
> 2.17.1
>

2020-09-23 00:04:44

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On Tue, Sep 22, 2020 at 02:44:51PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 9/22/20 11:52 AM, Bjorn Helgaas wrote:
> > On Fri, Jul 24, 2020 at 12:07:55PM -0700, [email protected] wrote:
> > > From: Kuppuswamy Sathyanarayanan <[email protected]>
> > >
> > > Current pcie_do_recovery() implementation has following two issues:
> > >
> >
> > I'm having trouble parsing this out, probably just lack of my
> > understanding...
> >
> > > 1. Fatal (DPC) error recovery is currently broken for non-hotplug
> > > capable devices. Current fatal error recovery implementation relies
> > > on PCIe hotplug (pciehp) handler for detaching and re-enumerating
> > > the affected devices/drivers. pciehp handler listens for DLLSC state
> > > changes and handles device/driver detachment on DLLSC_LINK_DOWN event
> > > and re-enumeration on DLLSC_LINK_UP event. So when dealing with
> > > non-hotplug capable devices, recovery code does not restore the state
> > > of the affected devices correctly.
> >
> > Apparently in the hotplug case, something *does* restore the state of
> > affected devices?
>
> Yes, in hotplug case, DLLSC state change handler takes over detachment
> /cleanup and re-attachment of affected devices/drivers.

Where does the restore happen here? I.e., what function does this?

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call



On 9/22/20 4:33 PM, Bjorn Helgaas wrote:
> On Tue, Sep 22, 2020 at 02:44:51PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>>
>>
>> On 9/22/20 11:52 AM, Bjorn Helgaas wrote:
>>> On Fri, Jul 24, 2020 at 12:07:55PM -0700, [email protected] wrote:
>>>> From: Kuppuswamy Sathyanarayanan <[email protected]>
>>>>
>>>> Current pcie_do_recovery() implementation has following two issues:
>>>>
>>>
>>> I'm having trouble parsing this out, probably just lack of my
>>> understanding...
>>>
>>>> 1. Fatal (DPC) error recovery is currently broken for non-hotplug
>>>> capable devices. Current fatal error recovery implementation relies
>>>> on PCIe hotplug (pciehp) handler for detaching and re-enumerating
>>>> the affected devices/drivers. pciehp handler listens for DLLSC state
>>>> changes and handles device/driver detachment on DLLSC_LINK_DOWN event
>>>> and re-enumeration on DLLSC_LINK_UP event. So when dealing with
>>>> non-hotplug capable devices, recovery code does not restore the state
>>>> of the affected devices correctly.
>>>
>>> Apparently in the hotplug case, something *does* restore the state of
>>> affected devices?
>>
>> Yes, in hotplug case, DLLSC state change handler takes over detachment
>> /cleanup and re-attachment of affected devices/drivers.
>
> Where does the restore happen here? I.e., what function does this?

DLLSC link down event will remove affected devices/drivers. And link up event
will re-create all devices.

on DLLSC link down event
->pciehp_ist()
->pciehp_handle_presence_or_link_change()
->pciehp_disable_slot()
->__pciehp_disable_slot()
->remove_board()
->pciehp_unconfigure_device()

on DLLSC link up event
->pciehp_ist()
->pciehp_handle_presence_or_link_change()
->pciehp_enable_slot()
->__pciehp_enable_slot()
->board_added()
->pciehp_configure_device()

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-09-24 01:18:47

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/22/2020 7:44 PM, Kuppuswamy, Sathyanarayanan wrote:
>> here does the restore happen here?  I.e., what function does this?
>
> DLLSC link down event will remove affected devices/drivers. And link up
> event
> will re-create all devices.
>
> on DLLSC link down event
> ->pciehp_ist()
>   ->pciehp_handle_presence_or_link_change()
>     ->pciehp_disable_slot()
>       ->__pciehp_disable_slot()
>         ->remove_board()
>           ->pciehp_unconfigure_device()
>
> on DLLSC link up event
> ->pciehp_ist()
>   ->pciehp_handle_presence_or_link_change()
>     ->pciehp_enable_slot()
>       ->__pciehp_enable_slot()
>         ->board_added()
>           ->pciehp_configure_device()

AFAIK, DLLSC is a requirement not optional. Why is this not supported by
non-hotplug ports?

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call



On 9/23/20 6:15 PM, Sinan Kaya wrote:
> On 9/22/2020 7:44 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> here does the restore happen here?  I.e., what function does this?
>>
>> DLLSC link down event will remove affected devices/drivers. And link up
>> event
>> will re-create all devices.
>>
>> on DLLSC link down event
>> ->pciehp_ist()
>>   ->pciehp_handle_presence_or_link_change()
>>     ->pciehp_disable_slot()
>>       ->__pciehp_disable_slot()
>>         ->remove_board()
>>           ->pciehp_unconfigure_device()
>>
>> on DLLSC link up event
>> ->pciehp_ist()
>>   ->pciehp_handle_presence_or_link_change()
>>     ->pciehp_enable_slot()
>>       ->__pciehp_enable_slot()
>>         ->board_added()
>>           ->pciehp_configure_device()
>
> AFAIK, DLLSC is a requirement not optional. Why is this not supported by
> non-hotplug ports?
Its required for hotplug capable ports. Please check PCIe spec v5.0 sec 6.7.3.3.

The Data Link Layer State Changed event provides an indication that the state of
the Data Link Layer Link Active bit in the Link Status Register has changed.
Support for Data Link Layer State Changed events and software notification of these
events are required for hot-plug capable Downstream Ports.

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-09-24 02:17:57

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/23/2020 10:04 PM, Kuppuswamy, Sathyanarayanan wrote:
>> AFAIK, DLLSC is a requirement not optional. Why is this not supported by
>> non-hotplug ports?
> Its required for hotplug capable ports. Please check PCIe spec v5.0 sec
> 6.7.3.3.
>
> The Data Link Layer State Changed event provides an indication that the
> state of
> the Data Link Layer Link Active bit in the Link Status Register has
> changed.
> Support for Data Link Layer State Changed events and software
> notification of these
> events are required for hot-plug capable Downstream Ports.

I see. Can I assume that your system supports DPC?
DPC is supposed to recover the link via dpc_reset_link().

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call



On 9/23/20 7:16 PM, Sinan Kaya wrote:
> On 9/23/2020 10:04 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> AFAIK, DLLSC is a requirement not optional. Why is this not supported by
>>> non-hotplug ports?
>> Its required for hotplug capable ports. Please check PCIe spec v5.0 sec
>> 6.7.3.3.
>>
>> The Data Link Layer State Changed event provides an indication that the
>> state of
>> the Data Link Layer Link Active bit in the Link Status Register has
>> changed.
>> Support for Data Link Layer State Changed events and software
>> notification of these
>> events are required for hot-plug capable Downstream Ports.
>
> I see. Can I assume that your system supports DPC?
> DPC is supposed to recover the link via dpc_reset_link().
Yes. But the affected device/drivers cleanup during error recovery
is handled by hotplug handler. So we are facing issue when dealing
with non hotplug capable ports.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-09-24 03:18:33

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/23/2020 10:51 PM, Kuppuswamy, Sathyanarayanan wrote:
>>
>> I see. Can I assume that your system supports DPC?
>> DPC is supposed to recover the link via dpc_reset_link().
> Yes. But the affected device/drivers cleanup during error recovery
> is handled by hotplug handler. So we are facing issue when dealing
> with non hotplug capable ports.

This is confusing.

Why would hotplug driver be involved unless port supports hotplug and
the link goes down? You said that DLLSC is only supported on hotplug
capable ports.

Need a better description of symptoms and what triggers hotplug driver
to activate.

Can you expand this a little bit?

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call



On 9/23/20 8:13 PM, Sinan Kaya wrote:
> On 9/23/2020 10:51 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>
>>> I see. Can I assume that your system supports DPC?
>>> DPC is supposed to recover the link via dpc_reset_link().
>> Yes. But the affected device/drivers cleanup during error recovery
>> is handled by hotplug handler. So we are facing issue when dealing
>> with non hotplug capable ports.
>
> This is confusing.
>
> Why would hotplug driver be involved unless port supports hotplug and
> the link goes down? You said that DLLSC is only supported on hotplug
> capable ports.
hotplug driver is *only* involved when dealing with recovery of hotplug
capable ports. For hotplug capable ports, when DPC is triggered and link
goes down, DLLSC handler in pciehp driver will remove the affected
devices/drivers. Once the link comes back it will re-attach them.

>
> Need a better description of symptoms and what triggers hotplug driver
> to activate.
For problem description, please check the following details

Current pcie_do_recovery() implementation has following two issues:

1. Fatal (DPC) error recovery is currently broken for non-hotplug
capable devices. Current fatal error recovery implementation relies
on PCIe hotplug (pciehp) handler for detaching and re-enumerating
the affected devices/drivers. pciehp handler listens for DLLSC state
changes and handles device/driver detachment on DLLSC_LINK_DOWN event
and re-enumeration on DLLSC_LINK_UP event. So when dealing with
non-hotplug capable devices, recovery code does not restore the state
of the affected devices correctly. Correct implementation should
restore the device state and call report_slot_reset() function after
resetting the link to restore the state of the device/driver.

You can find fatal non-hotplug related issues reported in following links:

https://lore.kernel.org/linux-pci/[email protected]/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/

2. For non-fatal errors if report_error_detected() or
report_mmio_enabled() functions requests PCI_ERS_RESULT_NEED_RESET then
current pcie_do_recovery() implementation does not do the requested
explicit device reset, instead just calls the report_slot_reset() on all
affected devices. Notifying about the reset via report_slot_reset()
without doing the actual device reset is incorrect.

To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after
successful reset_link() operation. This will ensure ->slot_reset() be
called after reset_link() operation for fatal errors. Also call
pci_bus_reset() to do slot/bus reset() before triggering device specific
->slot_reset() callback. Also, using pci_bus_reset() will restore the state
of the devices after performing the reset operation.

Even though using pci_bus_reset() will do redundant reset operation after
->reset_link() for fatal errors, it should should affect the functional
behavior.

>
> Can you expand this a little bit?
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-09-24 20:53:42

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/24/2020 12:06 AM, Kuppuswamy, Sathyanarayanan wrote:
> For problem description, please check the following details
>
> Current pcie_do_recovery() implementation has following two issues:
>
> 1. Fatal (DPC) error recovery is currently broken for non-hotplug
> capable devices. Current fatal error recovery implementation relies
> on PCIe hotplug (pciehp) handler for detaching and re-enumerating
> the affected devices/drivers. pciehp handler listens for DLLSC state
> changes and handles device/driver detachment on DLLSC_LINK_DOWN event
> and re-enumeration on DLLSC_LINK_UP event. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. Correct implementation should
> restore the device state and call report_slot_reset() function after
> resetting the link to restore the state of the device/driver.
>

So, this is a matter of moving the save/restore logic from the hotplug
driver into common code so that DPC slot reset takes advantage of it?

If that's direction we are going, this is fine change IMO.

> You can find fatal non-hotplug related issues reported in following links:
>
> https://lore.kernel.org/linux-pci/[email protected]/
>
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/
>
>
> 2. For non-fatal errors if report_error_detected() or
> report_mmio_enabled() functions requests PCI_ERS_RESULT_NEED_RESET then
> current pcie_do_recovery() implementation does not do the requested
> explicit device reset, instead just calls the report_slot_reset() on all
> affected devices. Notifying about the reset via report_slot_reset()
> without doing the actual device reset is incorrect.
>

This makes sens too. There seems to be an ordering issue per your
description.

> To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after
> successful reset_link() operation. This will ensure ->slot_reset() be
> called after reset_link() operation for fatal errors.

You lost me here. Why do we want to do secondary bus reset on top of
DPC reset?

> Also call
> pci_bus_reset() to do slot/bus reset() before triggering device specific
> ->slot_reset() callback. Also, using pci_bus_reset() will restore the state
> of the devices after performing the reset operation.
>
> Even though using pci_bus_reset() will do redundant reset operation after
> ->reset_link() for fatal errors, it should should affect the functional
> behavior.

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call



On 9/24/20 1:52 PM, Sinan Kaya wrote:
> On 9/24/2020 12:06 AM, Kuppuswamy, Sathyanarayanan wrote:
>> For problem description, please check the following details
>>
>> Current pcie_do_recovery() implementation has following two issues:
>>
>> 1. Fatal (DPC) error recovery is currently broken for non-hotplug
>> capable devices. Current fatal error recovery implementation relies
>> on PCIe hotplug (pciehp) handler for detaching and re-enumerating
>> the affected devices/drivers. pciehp handler listens for DLLSC state
>> changes and handles device/driver detachment on DLLSC_LINK_DOWN event
>> and re-enumeration on DLLSC_LINK_UP event. So when dealing with
>> non-hotplug capable devices, recovery code does not restore the state
>> of the affected devices correctly. Correct implementation should
>> restore the device state and call report_slot_reset() function after
>> resetting the link to restore the state of the device/driver.
>>
>
> So, this is a matter of moving the save/restore logic from the hotplug
> driver into common code so that DPC slot reset takes advantage of it?
We are not moving it out of hotplug path. But fixing it in this code path.
With this fix, we will not depend on hotplug driver to restore the state.
>
> If that's direction we are going, this is fine change IMO.
>
>> You can find fatal non-hotplug related issues reported in following links:
>>
>> https://lore.kernel.org/linux-pci/[email protected]/
>>
>> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
>> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/
>>
>>
>> 2. For non-fatal errors if report_error_detected() or
>> report_mmio_enabled() functions requests PCI_ERS_RESULT_NEED_RESET then
>> current pcie_do_recovery() implementation does not do the requested
>> explicit device reset, instead just calls the report_slot_reset() on all
>> affected devices. Notifying about the reset via report_slot_reset()
>> without doing the actual device reset is incorrect.
>>
>
> This makes sens too. There seems to be an ordering issue per your
> description.
Yes
>
>> To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after
>> successful reset_link() operation. This will ensure ->slot_reset() be
>> called after reset_link() operation for fatal errors.
>
> You lost me here. Why do we want to do secondary bus reset on top of
> DPC reset?
For non-hotplug capable slots, when reset (PCI_ERS_RESULT_NEED_RESET) is
requested, we want to reset it before calling ->slot_reset() callback.
>
>> Also call
>> pci_bus_reset() to do slot/bus reset() before triggering device specific
>> ->slot_reset() callback. Also, using pci_bus_reset() will restore the state
>> of the devices after performing the reset operation.
>>
>> Even though using pci_bus_reset() will do redundant reset operation after
>> ->reset_link() for fatal errors, it should should affect the functional
>> behavior.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-09-25 16:59:11

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/25/2020 1:11 AM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 9/24/20 1:52 PM, Sinan Kaya wrote:
>> On 9/24/2020 12:06 AM, Kuppuswamy, Sathyanarayanan wrote:

>>
>> So, this is a matter of moving the save/restore logic from the hotplug
>> driver into common code so that DPC slot reset takes advantage of it?
> We are not moving it out of hotplug path. But fixing it in this code path.
> With this fix, we will not depend on hotplug driver to restore the state.

Any possibility of unification?


[snip]
>>
>>> To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after
>>> successful reset_link() operation. This will ensure ->slot_reset() be
>>> called after reset_link() operation for fatal errors.
>>
>> You lost me here. Why do we want to do secondary bus reset on top of
>> DPC reset?
> For non-hotplug capable slots, when reset (PCI_ERS_RESULT_NEED_RESET) is
> requested, we want to reset it before calling ->slot_reset() callback.

Why? Isn't DPC slot reset enough?
What will bus reset do that DPC slot reset won't do?

I can understand calling bus reset if DPC is not supported.
I don't understand the requirement to do double reset.

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call



On 9/25/20 9:55 AM, Sinan Kaya wrote:
> On 9/25/2020 1:11 AM, Kuppuswamy, Sathyanarayanan wrote:
>>
>>
>> On 9/24/20 1:52 PM, Sinan Kaya wrote:
>>> On 9/24/2020 12:06 AM, Kuppuswamy, Sathyanarayanan wrote:
>
>>>
>>> So, this is a matter of moving the save/restore logic from the hotplug
>>> driver into common code so that DPC slot reset takes advantage of it?
>> We are not moving it out of hotplug path. But fixing it in this code path.
>> With this fix, we will not depend on hotplug driver to restore the state.
>
> Any possibility of unification?
If we do that, it might need rework of hotplug driver. It will be a big
change. IMO, its better not to touch that bee hive.
>
>
> [snip]
>>>
>>>> To fix above issues, use PCI_ERS_RESULT_NEED_RESET as error state after
>>>> successful reset_link() operation. This will ensure ->slot_reset() be
>>>> called after reset_link() operation for fatal errors.
>>>
>>> You lost me here. Why do we want to do secondary bus reset on top of
>>> DPC reset?
>> For non-hotplug capable slots, when reset (PCI_ERS_RESULT_NEED_RESET) is
>> requested, we want to reset it before calling ->slot_reset() callback.
>
> Why? Isn't DPC slot reset enough?
It will do the reset at hardware level. But driver state is not
cleaned up. So doing bus reset will restore both driver and
hardware states.
Also for non-fatal errors, if reset is requested then we still need
some kind of bus reset call here.
> What will bus reset do that DPC slot reset won't do?
>
> I can understand calling bus reset if DPC is not supported.
> I don't understand the requirement to do double reset.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-09-25 17:49:24

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/25/2020 1:11 PM, Kuppuswamy, Sathyanarayanan wrote:
>> Why? Isn't DPC slot reset enough?
> It will do the reset at hardware level. But driver state is not
> cleaned up. So doing bus reset will restore both driver and
> hardware states.

I really don't like this. If hotplug driver is restoring the state
and DPC driver is not; let's fix the DPC driver rather than causing
two resets and hope for the best.

One approach is to share the restore code between hotplug driver and
DPC driver.

If this is a too involved change, DPC driver should restore state
when hotplug is not supported.

DPC driver should be self-sufficient by itself.

> Also for non-fatal errors, if reset is requested then we still need
> some kind of bus reset call here.

DPC should handle both fatal and non-fatal cases and cause a bus reset
in hardware already before triggering an interrupt.

I disagree that you need an additional reset on top of DPC reset.
Isn't one reset enough?

What will the second reset provide that first reset won't provide?

I see that you are trying to do the second reset only because second
reset restores state.

That looks like a short-term fix only to explode on the next iteration.

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call



On 9/25/20 10:47 AM, Sinan Kaya wrote:
> On 9/25/2020 1:11 PM, Kuppuswamy, Sathyanarayanan wrote:
>>> Why? Isn't DPC slot reset enough?
>> It will do the reset at hardware level. But driver state is not
>> cleaned up. So doing bus reset will restore both driver and
>> hardware states.
>
> I really don't like this. If hotplug driver is restoring the state
> and DPC driver is not; let's fix the DPC driver rather than causing
> two resets and hope for the best.
>
> One approach is to share the restore code between hotplug driver and
> DPC driver.
>
> If this is a too involved change, DPC driver should restore state
> when hotplug is not supported.
Yes. we can add a condition for hotplug capability check.
>
> DPC driver should be self-sufficient by itself.
>
>> Also for non-fatal errors, if reset is requested then we still need
>> some kind of bus reset call here
>
> DPC should handle both fatal and non-fatal cases
Currently DPC is only triggered for FATAL errors.
and cause a bus reset
> in hardware already before triggering an interrupt.
Error recovery is not triggered only DPC driver. AER also uses the
same error recovery code. If DPC is not supported, then we still need
reset logic.
>
> I disagree that you need an additional reset on top of DPC reset.
> Isn't one reset enough?
>
> What will the second reset provide that first reset won't provide?
>
> I see that you are trying to do the second reset only because second
> reset restores state.
>
> That looks like a short-term fix only to explode on the next iteration.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-09-25 18:32:41

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/25/2020 2:16 PM, Kuppuswamy, Sathyanarayanan wrote:
>>
>> If this is a too involved change, DPC driver should restore state
>> when hotplug is not supported.
> Yes. we can add a condition for hotplug capability check.
>>
>> DPC driver should be self-sufficient by itself.
>>

Sounds good.

>>> Also for non-fatal errors, if reset is requested then we still need
>>> some kind of bus reset call here
>>
>> DPC should handle both fatal and non-fatal cases
> Currently DPC is only triggered for FATAL errors.
>  and cause a bus reset

Thanks for the heads up.
This seems to have changed since I looked at the DPC code.

>> in hardware already before triggering an interrupt.
> Error recovery is not triggered only DPC driver. AER also uses the
> same error recovery code. If DPC is not supported, then we still need
> reset logic.

It sounds like we are cross-talking two issues.

1. no state restore on DPC after FATAL error.
Let's fix this.

2. no bus reset on NON_FATAL error through AER driver path.
This already tells me that you need to split your change into
multiple patches.

Let's talk about this too. bus reset should be triggered via
AER driver before informing the recovery.

if (status == PCI_ERS_RESULT_NEED_RESET) {
/*
* TODO: Should call platform-specific
* functions to reset slot before calling
* drivers' slot_reset callbacks?
*/
status = PCI_ERS_RESULT_RECOVERED;
pci_dbg(dev, "broadcast slot_reset message\n");
pci_walk_bus(bus, report_slot_reset, &status);
}

2020-09-25 23:52:17

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/25/2020 2:16 PM, Kuppuswamy, Sathyanarayanan wrote:
>> One approach is to share the restore code between hotplug driver and
>> DPC driver.
>>
>> If this is a too involved change, DPC driver should restore state
>> when hotplug is not supported.
> Yes. we can add a condition for hotplug capability check.

Now that I think about this more...

This won't work. Link is brought down automatically by the DPC hardware.
Therefore, all software state is lost. Restore won't help here.

The only solution I can see is to force driver disconnect and rescan.

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

Hi,

On 9/25/20 11:30 AM, Sinan Kaya wrote:
> On 9/25/2020 2:16 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>
>>> If this is a too involved change, DPC driver should restore state
>>> when hotplug is not supported.
>> Yes. we can add a condition for hotplug capability check.
>>>
>>> DPC driver should be self-sufficient by itself.
>>>
>
> Sounds good.
>
>>>> Also for non-fatal errors, if reset is requested then we still need
>>>> some kind of bus reset call here
>>>
>>> DPC should handle both fatal and non-fatal cases
>> Currently DPC is only triggered for FATAL errors.
>>  and cause a bus reset
>
> Thanks for the heads up.
> This seems to have changed since I looked at the DPC code.
>
>>> in hardware already before triggering an interrupt.
>> Error recovery is not triggered only DPC driver. AER also uses the
>> same error recovery code. If DPC is not supported, then we still need
>> reset logic.
>
> It sounds like we are cross-talking two issues.
>
> 1. no state restore on DPC after FATAL error.
> Let's fix this.
Agree. Few more detail about the above issue is,

There are two cases under FATAL error.

FATAL + hotplug - In this case, link will be reseted. And hotplug handler
will remove the driver state. This case works well with current code.

FATAL + no-hotplug - In this case, link will still be reseted. But
currently driver state is not properly restored. So I attempted
to restore it using pci_reset_bus().

status = reset_link(dev);
- if (status != PCI_ERS_RESULT_RECOVERED) {
+ if (status == PCI_ERS_RESULT_RECOVERED) {
+ status = PCI_ERS_RESULT_NEED_RESET;

...

if (status == PCI_ERS_RESULT_NEED_RESET) {
/*
- * TODO: Should call platform-specific
- * functions to reset slot before calling
- * drivers' slot_reset callbacks?
+ * TODO: Optimize the call to pci_reset_bus()
+ *
+ * There are two components to pci_reset_bus().
+ *
+ * 1. Do platform specific slot/bus reset.
+ * 2. Save/Restore all devices in the bus.
+ *
+ * For hotplug capable devices and fatal errors,
+ * device is already in reset state due to link
+ * reset. So repeating platform specific slot/bus
+ * reset via pci_reset_bus() call is redundant. So
+ * can optimize this logic and conditionally call
+ * pci_reset_bus().
*/
+ pci_reset_bus(dev);

>
> 2. no bus reset on NON_FATAL error through AER driver path.
> This already tells me that you need to split your change into
> multiple patches.
>
> Let's talk about this too. bus reset should be triggered via
> AER driver before informing the recovery.
But as per error recovery documentation, any call to
->error_detected() or ->mmio_enabled() can request
PCI_ERS_RESULT_NEED_RESET. So we need to add code
to do the actual reset before calling ->slot_reset()
callback. So call to pci_reset_bus() fixes this
issue.

if (status == PCI_ERS_RESULT_NEED_RESET) {
+ pci_reset_bus(dev);


>
> if (status == PCI_ERS_RESULT_NEED_RESET) {
> /*
> * TODO: Should call platform-specific
> * functions to reset slot before calling
> * drivers' slot_reset callbacks?
> */
> status = PCI_ERS_RESULT_RECOVERED;
> pci_dbg(dev, "broadcast slot_reset message\n");
> pci_walk_bus(bus, report_slot_reset, &status);
> }
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-09-28 08:44:34

by Ethan Zhao

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

Sathyanarayanan,

On Mon, Sep 28, 2020 at 10:44 AM Kuppuswamy, Sathyanarayanan
<[email protected]> wrote:
>
> Hi,
>
> On 9/25/20 11:30 AM, Sinan Kaya wrote:
> > On 9/25/2020 2:16 PM, Kuppuswamy, Sathyanarayanan wrote:
> >>>
> >>> If this is a too involved change, DPC driver should restore state
> >>> when hotplug is not supported.
> >> Yes. we can add a condition for hotplug capability check.
> >>>
> >>> DPC driver should be self-sufficient by itself.
> >>>
> >
> > Sounds good.
> >
> >>>> Also for non-fatal errors, if reset is requested then we still need
> >>>> some kind of bus reset call here
> >>>
> >>> DPC should handle both fatal and non-fatal cases
> >> Currently DPC is only triggered for FATAL errors.
> >> and cause a bus reset
> >
> > Thanks for the heads up.
> > This seems to have changed since I looked at the DPC code.
> >
> >>> in hardware already before triggering an interrupt.
> >> Error recovery is not triggered only DPC driver. AER also uses the
> >> same error recovery code. If DPC is not supported, then we still need
> >> reset logic.
> >
> > It sounds like we are cross-talking two issues.
> >
> > 1. no state restore on DPC after FATAL error.
> > Let's fix this.
> Agree. Few more detail about the above issue is,
>
> There are two cases under FATAL error.
>
> FATAL + hotplug - In this case, link will be reseted. And hotplug handler
> will remove the driver state. This case works well with current code.
>
> FATAL + no-hotplug - In this case, link will still be reseted. But
> currently driver state is not properly restored. So I attempted
> to restore it using pci_reset_bus().

Seems you should fix something at device driver side, not do double-reset in
DPC driver, one reset is done by hardware, and you want to do another by
DPC driver ?

Why hardware initiated reset is not enough for you ?

Thanks,
Ethan

> status = reset_link(dev);
> - if (status != PCI_ERS_RESULT_RECOVERED) {
> + if (status == PCI_ERS_RESULT_RECOVERED) {
> + status = PCI_ERS_RESULT_NEED_RESET;
>
> ...
>
> if (status == PCI_ERS_RESULT_NEED_RESET) {
> /*
> - * TODO: Should call platform-specific
> - * functions to reset slot before calling
> - * drivers' slot_reset callbacks?
> + * TODO: Optimize the call to pci_reset_bus()
> + *
> + * There are two components to pci_reset_bus().
> + *
> + * 1. Do platform specific slot/bus reset.
> + * 2. Save/Restore all devices in the bus.
> + *
> + * For hotplug capable devices and fatal errors,
> + * device is already in reset state due to link
> + * reset. So repeating platform specific slot/bus
> + * reset via pci_reset_bus() call is redundant. So
> + * can optimize this logic and conditionally call
> + * pci_reset_bus().
> */
> + pci_reset_bus(dev);
>
> >
> > 2. no bus reset on NON_FATAL error through AER driver path.
> > This already tells me that you need to split your change into
> > multiple patches.
> >
> > Let's talk about this too. bus reset should be triggered via
> > AER driver before informing the recovery.
> But as per error recovery documentation, any call to
> ->error_detected() or ->mmio_enabled() can request
> PCI_ERS_RESULT_NEED_RESET. So we need to add code
> to do the actual reset before calling ->slot_reset()
> callback. So call to pci_reset_bus() fixes this
> issue.
>
> if (status == PCI_ERS_RESULT_NEED_RESET) {
> + pci_reset_bus(dev);
>
>
> >
> > if (status == PCI_ERS_RESULT_NEED_RESET) {
> > /*
> > * TODO: Should call platform-specific
> > * functions to reset slot before calling
> > * drivers' slot_reset callbacks?
> > */
> > status = PCI_ERS_RESULT_RECOVERED;
> > pci_dbg(dev, "broadcast slot_reset message\n");
> > pci_walk_bus(bus, report_slot_reset, &status);
> > }
> >
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer

2020-09-28 11:19:18

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/27/2020 10:43 PM, Kuppuswamy, Sathyanarayanan wrote:
> FATAL + no-hotplug - In this case, link will still be reseted. But
> currently driver state is not properly restored. So I attempted
> to restore it using pci_reset_bus().
>
>          status = reset_link(dev);
> -        if (status != PCI_ERS_RESULT_RECOVERED) {
> +        if (status == PCI_ERS_RESULT_RECOVERED) {
> +            status = PCI_ERS_RESULT_NEED_RESET;
>
> ...
>
>      if (status == PCI_ERS_RESULT_NEED_RESET) {
>          /*
> -         * TODO: Should call platform-specific
> -         * functions to reset slot before calling
> -         * drivers' slot_reset callbacks?
> +         * TODO: Optimize the call to pci_reset_bus()
> +         *
> +         * There are two components to pci_reset_bus().
> +         *
> +         * 1. Do platform specific slot/bus reset.
> +         * 2. Save/Restore all devices in the bus.
> +         *
> +         * For hotplug capable devices and fatal errors,
> +         * device is already in reset state due to link
> +         * reset. So repeating platform specific slot/bus
> +         * reset via pci_reset_bus() call is redundant. So
> +         * can optimize this logic and conditionally call
> +         * pci_reset_bus().
>           */
> +        pci_reset_bus(dev);

I think we have to go to remove/rescan for this case as you also
mentioned above. There is no state to save. All BAR assignments
are gone. Entire device programming is also lost.

I don't think pci_reset_bus() can recover from this situation safely.
It will make things worse by saving/restoring the hardware default
state.

This should remove/rescan logic should be inside DPC's slot_reset()
function BTW. Not here.

2020-09-28 12:13:55

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/27/2020 10:43 PM, Kuppuswamy, Sathyanarayanan wrote:
>> 2. no bus reset on NON_FATAL error through AER driver path.
>> This already tells me that you need to split your change into
>> multiple patches.
>>
>> Let's talk about this too. bus reset should be triggered via
>> AER driver before informing the recovery.
> But as per error recovery documentation, any call to
> ->error_detected() or ->mmio_enabled() can request
> PCI_ERS_RESULT_NEED_RESET. So we need to add code
> to do the actual reset before calling ->slot_reset()
> callback. So call to pci_reset_bus() fixes this
> issue.
>
>      if (status == PCI_ERS_RESULT_NEED_RESET) {
> +        pci_reset_bus(dev);

This part seems to make sense as you already highlighted there is
a TO-DO in the code.

This is an independent change that deserves its own patch.

2020-09-28 12:20:19

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/28/2020 7:17 AM, Sinan Kaya wrote:
> This should remove/rescan logic should be inside DPC's slot_reset()
> function BTW. Not here.

Correct function name is dpc_handler().

I hope I did not create confusion with slot_reset() that gets called for
each driver post recovery.

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call



On 9/28/20 4:17 AM, Sinan Kaya wrote:
> On 9/27/2020 10:43 PM, Kuppuswamy, Sathyanarayanan wrote:
>> FATAL + no-hotplug - In this case, link will still be reseted. But
>> currently driver state is not properly restored. So I attempted
>> to restore it using pci_reset_bus().
>>
>>          status = reset_link(dev);
>> -        if (status != PCI_ERS_RESULT_RECOVERED) {
>> +        if (status == PCI_ERS_RESULT_RECOVERED) {
>> +            status = PCI_ERS_RESULT_NEED_RESET;
>>
>> ...
>>
>>      if (status == PCI_ERS_RESULT_NEED_RESET) {
>>          /*
>> -         * TODO: Should call platform-specific
>> -         * functions to reset slot before calling
>> -         * drivers' slot_reset callbacks?
>> +         * TODO: Optimize the call to pci_reset_bus()
>> +         *
>> +         * There are two components to pci_reset_bus().
>> +         *
>> +         * 1. Do platform specific slot/bus reset.
>> +         * 2. Save/Restore all devices in the bus.
>> +         *
>> +         * For hotplug capable devices and fatal errors,
>> +         * device is already in reset state due to link
>> +         * reset. So repeating platform specific slot/bus
>> +         * reset via pci_reset_bus() call is redundant. So
>> +         * can optimize this logic and conditionally call
>> +         * pci_reset_bus().
>>           */
>> +        pci_reset_bus(dev);
>
> I think we have to go to remove/rescan for this case as you also
> mentioned above. There is no state to save. All BAR assignments
> are gone. Entire device programming is also lost.
>
> I don't think pci_reset_bus() can recover from this situation safely.
> It will make things worse by saving/restoring the hardware default
> state.
>
> This should remove/rescan logic should be inside DPC's slot_reset()
> function BTW. Not here.
Since there is no state restoration for FATAL errors, I am wondering whether
calls to ->error_detected(), ->mmio_enabled() and ->slot_reset() are
required?

Let me know your comments about following pseudo code.

if (fatal error & hotplug_supported)
do nothing // if fatal triggered by DPC, clear DPC state.

if (fatal error & no-hotplug)
perform slot_reset and renumerate affected devices.

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-09-28 18:07:14

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/28/2020 1:15 PM, Kuppuswamy, Sathyanarayanan wrote:
> Since there is no state restoration for FATAL errors, I am wondering
> whether
> calls to ->error_detected(), ->mmio_enabled() and ->slot_reset() are
> required?

Good question,

Initially when we started, we were trying to handle both NON_FATAL and
FATAL errors in DPC.

We have seen value in unifying AER's callback mechanism with DPC.
It looks like this no longer applies for DPC.

Some drivers want these indication to stop outgoing DMA/timers so that
system can recover quickly.

There is value in calling them with existing AER based design.

I agree it doesn't apply here anymore if we are going to remove the
device driver. Maybe, you should stop calling pcie_do_recovery() in DPC
as well.

>
> Let me know your comments about following pseudo code.
>
> if (fatal error & hotplug_supported)
>    do nothing // if fatal triggered by DPC, clear DPC state.
>
> if (fatal error & no-hotplug)
>   perform slot_reset and renumerate affected devices.

LGTM,

I apologize for calling this slot_reset but slot_reset in err.c code is
for post recovery callback to endpoint drivers. Let's not use this term
here anymore to not confuse ourselves.

remove device + rescan similar to how hotplug remove + hotplug insertion
notifications does eventually.

All of this to be done in DPC driver without any err.c involvement.

Bjorn,

What do you think? Is this a good direction?

Sinan

2020-09-28 18:29:24

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 9/28/2020 2:02 PM, Sinan Kaya wrote:
> Since there is no state restoration for FATAL errors, I am wondering
> whether
> calls to ->error_detected(), ->mmio_enabled() and ->slot_reset() are
> required?

I also would like to ask someone closer to the spec language double
check this.

When we recover the link at the end of the DPC handler, what is the
expected state of the endpoint?

Is it a some kind of a reset like secondary bus reset? (I assumed this
one)

Undefined?

or just plain link recovery with everything else as intact as it used
to be?

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call



On 9/28/20 11:25 AM, Sinan Kaya wrote:
> On 9/28/2020 2:02 PM, Sinan Kaya wrote:
>> Since there is no state restoration for FATAL errors, I am wondering
>> whether
>> calls to ->error_detected(), ->mmio_enabled() and ->slot_reset() are
>> required?
>
> I also would like to ask someone closer to the spec language double
> check this.
>
> When we recover the link at the end of the DPC handler, what is the
> expected state of the endpoint?
>
> Is it a some kind of a reset like secondary bus reset? (I assumed this
> one)
I think it will be in reset state.
>
> Undefined?
>
> or just plain link recovery with everything else as intact as it used
> to be?
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

Hi Sinan,

On 9/28/20 11:32 AM, Kuppuswamy, Sathyanarayanan wrote:
>
>
> On 9/28/20 11:25 AM, Sinan Kaya wrote:
>> On 9/28/2020 2:02 PM, Sinan Kaya wrote:
>>> Since there is no state restoration for FATAL errors, I am wondering
>>> whether
>>> calls to ->error_detected(), ->mmio_enabled() and ->slot_reset() are
>>> required?
>>
>> I also would like to ask someone closer to the spec language double
>> check this.
>>
>> When we recover the link at the end of the DPC handler, what is the
>> expected state of the endpoint?
>>
>> Is it a some kind of a reset like secondary bus reset? (I assumed this
>>   one)
> I think it will be in reset state.
>>
>> Undefined?
>>
>> or just plain link recovery with everything else as intact as it used
>> to be?
>>
>

Please check the following version. It should fix most of the reset issues
properly.

https://lore.kernel.org/linux-pci/5c5bca0bdb958e456176fe6ede10ba8f838fbafc.1602263264.git.sathyanarayanan.kuppuswamy@linux.intel.com/T/#t

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2020-10-12 14:54:06

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

On 10/12/2020 1:13 AM, Kuppuswamy, Sathyanarayanan wrote:
> Hi Sinan,
>
> On 9/28/20 11:32 AM, Kuppuswamy, Sathyanarayanan wrote:
>>
>>
>> On 9/28/20 11:25 AM, Sinan Kaya wrote:
>>> On 9/28/2020 2:02 PM, Sinan Kaya wrote:
>>>> Since there is no state restoration for FATAL errors, I am wondering
>>>> whether
>>>> calls to ->error_detected(), ->mmio_enabled() and ->slot_reset() are
>>>> required?
>>>
>>> I also would like to ask someone closer to the spec language double
>>> check this.
>>>
>>> When we recover the link at the end of the DPC handler, what is the
>>> expected state of the endpoint?
>>>
>>> Is it a some kind of a reset like secondary bus reset? (I assumed this
>>>   one)
>> I think it will be in reset state.
>>>
>>> Undefined?
>>>
>>> or just plain link recovery with everything else as intact as it used
>>> to be?
>>>
>>
>
> Please check the following version. It should fix most of the reset issues
> properly.
>
> https://lore.kernel.org/linux-pci/5c5bca0bdb958e456176fe6ede10ba8f838fbafc.1602263264.git.sathyanarayanan.kuppuswamy@linux.intel.com/T/#t
>
>

Thanks, good stuff.