2020-11-02 15:12:10

by Hedi Berriche

Subject: [PATCH v4 0/1] PCI/ERR: fix regression introduced by 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")

This is essentially a resend of v3 as it failed to get enough traction;
no code change, I only added Sinan Kaya's Reviewed-by.

- Changes since v3:
* added Sinan Kaya <[email protected]> Reviewed-by

- Changes since v2:
* set status to PCI_ERS_RESULT_RECOVERED, in case of successful link
reset, if and only if the initial value of error status is
PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER.

- Changes since v1:
* changed the commit message to clarify what broke post commit 6d2c89441571
* dropped the misnamed post_reset_status variable in favour of a more natural
approach that relies on a boolean to keep track of the outcome of reset_link()

After commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
pcie_do_recovery() no longer calls ->slot_reset() in the case of a successful
reset which breaks error recovery by breaking driver (re)initialisation.

Cc: Russ Anderson <[email protected]>
Cc: Kuppuswamy Sathyanarayanan <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Ashok Raj <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Sinan Kaya <[email protected]>

Cc: [email protected] # v5.7+

---
Hedi Berriche (1):
PCI/ERR: don't clobber status after reset_link()

drivers/pci/pcie/err.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

--
2.28.0


2020-11-02 15:12:23

by Hedi Berriche

Subject: [PATCH v4 1/1] PCI/ERR: don't clobber status after reset_link()

Commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
broke pcie_do_recovery(): updating status after reset_link() has the ill
side effect of causing recovery to fail if the error status is
PCI_ERS_RESULT_CAN_RECOVER or PCI_ERS_RESULT_NEED_RESET, as the following
code will *never* run in the case of a successful reset_link():

177 if (status == PCI_ERS_RESULT_CAN_RECOVER) {
...
181 }

183 if (status == PCI_ERS_RESULT_NEED_RESET) {
...
192 }

For instance, in the case of PCI_ERS_RESULT_NEED_RESET we end up not
calling ->slot_reset() (because we skip report_slot_reset()), thus
breaking driver (re)initialisation.

Don't clobber status with the return value of reset_link(); set status
to PCI_ERS_RESULT_RECOVERED, in case of successful link reset, if and
only if the initial value of error status is PCI_ERS_RESULT_DISCONNECT
or PCI_ERS_RESULT_NO_AER_DRIVER.

Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
Signed-off-by: Hedi Berriche <[email protected]>

Reviewed-by: Sinan Kaya <[email protected]>
Cc: Russ Anderson <[email protected]>
Cc: Kuppuswamy Sathyanarayanan <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Ashok Raj <[email protected]>
Cc: Joerg Roedel <[email protected]>

Cc: [email protected] # v5.7+
---
drivers/pci/pcie/err.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..2730826cfd8a 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,10 +165,13 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_dbg(dev, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen) {
 		pci_walk_bus(bus, report_frozen_detected, &status);
-		status = reset_link(dev);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
+		if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
 			pci_warn(dev, "link reset failed\n");
 			goto failed;
+		} else {
+			if (status == PCI_ERS_RESULT_DISCONNECT ||
+			    status == PCI_ERS_RESULT_NO_AER_DRIVER)
+				status = PCI_ERS_RESULT_RECOVERED;
 		}
 	} else {
 		pci_walk_bus(bus, report_normal_detected, &status);
--
2.28.0
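
For readers following the thread, the regression described in the commit message can be modelled outside the kernel. The sketch below is only an illustration under stated assumptions: the enum and the three function names are hypothetical stand-ins (they are not kernel symbols), and only the case of a successful reset_link() on a frozen channel is modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's PCI_ERS_RESULT_* values. */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
	ERS_NO_AER_DRIVER,
};

/*
 * Status after a successful link reset, before the patch: the value
 * gathered from the drivers is overwritten by reset_link()'s result.
 */
enum ers_result status_before_patch(enum ers_result detected)
{
	(void)detected;		/* the drivers' answer is lost */
	return ERS_RECOVERED;
}

/*
 * Status after a successful link reset, with the patch: only DISCONNECT
 * and NO_AER_DRIVER are promoted to RECOVERED; anything else is kept.
 */
enum ers_result status_after_patch(enum ers_result detected)
{
	if (detected == ERS_DISCONNECT || detected == ERS_NO_AER_DRIVER)
		return ERS_RECOVERED;
	return detected;
}

/* The slot_reset broadcast only happens when status is NEED_RESET. */
bool slot_reset_runs(enum ers_result status)
{
	return status == ERS_NEED_RESET;
}
```

In this model, slot_reset_runs(status_before_patch(ERS_NEED_RESET)) is false while slot_reset_runs(status_after_patch(ERS_NEED_RESET)) is true, which is the difference the patch is after.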

2020-11-12 16:02:26

by Hedi Berriche

Subject: Re: [PATCH v4 1/1] PCI/ERR: don't clobber status after reset_link()

On Mon, Nov 02, 2020 at 15:10 Hedi Berriche wrote:
>Commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>broke pcie_do_recovery(): updating status after reset_link() has the ill
>side effect of causing recovery to fail if the error status is
>PCI_ERS_RESULT_CAN_RECOVER or PCI_ERS_RESULT_NEED_RESET as the following
>code will *never* run in the case of a successful reset_link()
>
> 177 if (status == PCI_ERS_RESULT_CAN_RECOVER) {
> ...
> 181 }
>
> 183 if (status == PCI_ERS_RESULT_NEED_RESET) {
> ...
> 192 }
>
>For instance in the case of PCI_ERS_RESULT_NEED_RESET we end up not
>calling ->slot_reset() (because we skip report_slot_reset()) thus
>breaking driver (re)initialisation.
>
>Don't clobber status with the return value of reset_link(); set status
>to PCI_ERS_RESULT_RECOVERED, in case of successful link reset, if and
>only if the initial value of error status is PCI_ERS_RESULT_DISCONNECT
>or PCI_ERS_RESULT_NO_AER_DRIVER.
>
>Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>Signed-off-by: Hedi Berriche <[email protected]>
>
>Reviewed-by: Sinan Kaya <[email protected]>
>Cc: Russ Anderson <[email protected]>
>Cc: Kuppuswamy Sathyanarayanan <[email protected]>
>Cc: Bjorn Helgaas <[email protected]>
>Cc: Ashok Raj <[email protected]>
>Cc: Joerg Roedel <[email protected]>
>
>Cc: [email protected] # v5.7+
>---
> drivers/pci/pcie/err.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>index c543f419d8f9..2730826cfd8a 100644
>--- a/drivers/pci/pcie/err.c
>+++ b/drivers/pci/pcie/err.c
>@@ -165,10 +165,13 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> pci_dbg(dev, "broadcast error_detected message\n");
> if (state == pci_channel_io_frozen) {
> pci_walk_bus(bus, report_frozen_detected, &status);
>- status = reset_link(dev);
>- if (status != PCI_ERS_RESULT_RECOVERED) {
>+ if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
> pci_warn(dev, "link reset failed\n");
> goto failed;
>+ } else {
>+ if (status == PCI_ERS_RESULT_DISCONNECT ||
>+ status == PCI_ERS_RESULT_NO_AER_DRIVER)
>+ status = PCI_ERS_RESULT_RECOVERED;
> }
> } else {
> pci_walk_bus(bus, report_normal_detected, &status);
>--
>2.28.0

Bjorn,

Sorry to bug you, but could you please cast your eyes on this patch and let me know whether you have
any concerns that might be barring it from inclusion.

Cheers,
Hedi.
--
Be careful of reading health books, you might die of a misprint.
-- Mark Twain

2020-12-11 10:13:16

by Bjorn Helgaas

Subject: Re: [PATCH v4 1/1] PCI/ERR: don't clobber status after reset_link()

On Mon, Nov 02, 2020 at 03:09:51PM +0000, Hedi Berriche wrote:
> Commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
> broke pcie_do_recovery(): updating status after reset_link() has the ill
> side effect of causing recovery to fail if the error status is
> PCI_ERS_RESULT_CAN_RECOVER or PCI_ERS_RESULT_NEED_RESET as the following
> code will *never* run in the case of a successful reset_link()
>
> 177 if (status == PCI_ERS_RESULT_CAN_RECOVER) {
> ...
> 181 }
>
> 183 if (status == PCI_ERS_RESULT_NEED_RESET) {
> ...
> 192 }

The line numbers are basically useless because they depend on some
particular version of the file.

> For instance in the case of PCI_ERS_RESULT_NEED_RESET we end up not
> calling ->slot_reset() (because we skip report_slot_reset()) thus
> breaking driver (re)initialisation.
>
> Don't clobber status with the return value of reset_link(); set status
> to PCI_ERS_RESULT_RECOVERED, in case of successful link reset, if and
> only if the initial value of error status is PCI_ERS_RESULT_DISCONNECT
> or PCI_ERS_RESULT_NO_AER_DRIVER.
>
> Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
> Signed-off-by: Hedi Berriche <[email protected]>
>
> Reviewed-by: Sinan Kaya <[email protected]>
> Cc: Russ Anderson <[email protected]>
> Cc: Kuppuswamy Sathyanarayanan <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: Ashok Raj <[email protected]>
> Cc: Joerg Roedel <[email protected]>
>
> Cc: [email protected] # v5.7+
> ---
> drivers/pci/pcie/err.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index c543f419d8f9..2730826cfd8a 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -165,10 +165,13 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> pci_dbg(dev, "broadcast error_detected message\n");
> if (state == pci_channel_io_frozen) {
> pci_walk_bus(bus, report_frozen_detected, &status);
> - status = reset_link(dev);
> - if (status != PCI_ERS_RESULT_RECOVERED) {
> + if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
> pci_warn(dev, "link reset failed\n");
> goto failed;
> + } else {
> + if (status == PCI_ERS_RESULT_DISCONNECT ||
> + status == PCI_ERS_RESULT_NO_AER_DRIVER)
> + status = PCI_ERS_RESULT_RECOVERED;

This code (even before your patch) doesn't match
Documentation/PCI/pci-error-recovery.rst very well. The code handles
pci_channel_io_frozen specially, but I don't think this is mentioned
in the doc.

The doc says we call ->error_detected() for all affected drivers.
Then we're supposed to do a slot reset if any driver returned
NEED_RESET. But in fact, we always do a reset for the
pci_channel_io_frozen case and never do one otherwise, regardless of
what ->error_detected() returned.

The doc says DISCONNECT means "Driver ... doesn't want to recover at
all." Many drivers can return either NEED_RESET or DISCONNECT, and I
assume they expect them to be handled differently. But I'm not sure
what DISCONNECT really means. Do we reset the device? Do we not
attempt recovery at all?

After your patch, if the reset_link() succeeded, we convert DISCONNECT
and NO_AER_DRIVER to RECOVERED. IIUC, that means we do exactly the
same thing if the consensus of the ->error_detected() functions was
RECOVERED, DISCONNECT, or NO_AER_DRIVER: we call reset_link() and
continue with "status = PCI_ERS_RESULT_RECOVERED".

(I'd reverse the sense of the "if (reset_link())" to make this easier
to read)

> }
> } else {
> pci_walk_bus(bus, report_normal_detected, &status);
> --
> 2.28.0
>
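
The review comment about reversing the sense of the "if (reset_link())" test could look roughly like the following userspace sketch. It is not the code that was merged; the enum, the function name frozen_path() and its bool parameter are assumptions made for illustration, with the bool standing in for "reset_link() returned PCI_ERS_RESULT_RECOVERED".

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's PCI_ERS_RESULT_* values. */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
	ERS_NO_AER_DRIVER,
};

/*
 * Frozen-channel step with the success case tested first.
 * Returns false where the kernel code would warn and "goto failed".
 */
bool frozen_path(bool link_reset_ok, enum ers_result *status)
{
	if (link_reset_ok) {
		/* Keep the drivers' answer unless it was one of these two. */
		if (*status == ERS_DISCONNECT ||
		    *status == ERS_NO_AER_DRIVER)
			*status = ERS_RECOVERED;
		return true;
	}

	/* Link reset failed: status is left untouched for the caller. */
	return false;
}
```

Written this way, it is also easier to see the point raised in the review: after a successful reset, an initial status of RECOVERED, DISCONNECT or NO_AER_DRIVER all leave the function with status set to RECOVERED.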

2021-01-08 22:32:16

by Bjorn Helgaas

Subject: Re: [PATCH v4 1/1] PCI/ERR: don't clobber status after reset_link()

[+cc Keith]

On Thu, Dec 10, 2020 at 04:41:42PM -0600, Bjorn Helgaas wrote:
> On Mon, Nov 02, 2020 at 03:09:51PM +0000, Hedi Berriche wrote:
> > Commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
> > broke pcie_do_recovery(): updating status after reset_link() has the ill
> > side effect of causing recovery to fail if the error status is
> > PCI_ERS_RESULT_CAN_RECOVER or PCI_ERS_RESULT_NEED_RESET as the following
> > code will *never* run in the case of a successful reset_link()
> >
> > 177 if (status == PCI_ERS_RESULT_CAN_RECOVER) {
> > ...
> > 181 }
> >
> > 183 if (status == PCI_ERS_RESULT_NEED_RESET) {
> > ...
> > 192 }
>
> The line numbers are basically useless because they depend on some
> particular version of the file.
>
> > For instance in the case of PCI_ERS_RESULT_NEED_RESET we end up not
> > calling ->slot_reset() (because we skip report_slot_reset()) thus
> > breaking driver (re)initialisation.
> >
> > Don't clobber status with the return value of reset_link(); set status
> > to PCI_ERS_RESULT_RECOVERED, in case of successful link reset, if and
> > only if the initial value of error status is PCI_ERS_RESULT_DISCONNECT
> > or PCI_ERS_RESULT_NO_AER_DRIVER.
> >
> > Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
> > Signed-off-by: Hedi Berriche <[email protected]>
> >
> > Reviewed-by: Sinan Kaya <[email protected]>
> > Cc: Russ Anderson <[email protected]>
> > Cc: Kuppuswamy Sathyanarayanan <[email protected]>
> > Cc: Bjorn Helgaas <[email protected]>
> > Cc: Ashok Raj <[email protected]>
> > Cc: Joerg Roedel <[email protected]>
> >
> > Cc: [email protected] # v5.7+
> > ---
> > drivers/pci/pcie/err.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > index c543f419d8f9..2730826cfd8a 100644
> > --- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> > @@ -165,10 +165,13 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> > pci_dbg(dev, "broadcast error_detected message\n");
> > if (state == pci_channel_io_frozen) {
> > pci_walk_bus(bus, report_frozen_detected, &status);
> > - status = reset_link(dev);
> > - if (status != PCI_ERS_RESULT_RECOVERED) {
> > + if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
> > pci_warn(dev, "link reset failed\n");
> > goto failed;
> > + } else {
> > + if (status == PCI_ERS_RESULT_DISCONNECT ||
> > + status == PCI_ERS_RESULT_NO_AER_DRIVER)
> > + status = PCI_ERS_RESULT_RECOVERED;
>
> This code (even before your patch) doesn't match
> Documentation/PCI/pci-error-recovery.rst very well. The code handles
> pci_channel_io_frozen specially, but I don't think this is mentioned
> in the doc.
>
> The doc says we call ->error_detected() for all affected drivers.
> Then we're supposed to do a slot reset if any driver returned
> NEED_RESET. But in fact, we always do a reset for the
> pci_channel_io_frozen case and never do one otherwise, regardless of
> what ->error_detected() returned.
>
> The doc says DISCONNECT means "Driver ... doesn't want to recover at
> all." Many drivers can return either NEED_RESET or DISCONNECT, and I
> assume they expect them to be handled differently. But I'm not sure
> what DISCONNECT really means. Do we reset the device? Do we not
> attempt recovery at all?
>
> After your patch, if the reset_link() succeeded, we convert DISCONNECT
> and NO_AER_DRIVER to RECOVERED. IIUC, that means we do exactly the
> same thing if the consensus of the ->error_detected() functions was
> RECOVERED, DISCONNECT, or NO_AER_DRIVER: we call reset_link() and
> continue with "status = PCI_ERS_RESULT_RECOVERED".
>
> (I'd reverse the sense of the "if (reset_link())" to make this easier
> to read)

Can we push this forward now? There are several pending patches in
this area from Keith and Sathyanarayanan; I haven't gotten to them
yet, so not sure whether they help address any of this.

> > }
> > } else {
> > pci_walk_bus(bus, report_normal_detected, &status);
> > --
> > 2.28.0
> >

by Sathyanarayanan Kuppuswamy

Subject: Re: [PATCH v4 1/1] PCI/ERR: don't clobber status after reset_link()

On 1/8/21 2:30 PM, Bjorn Helgaas wrote:
> Can we push this forward now? There are several pending patches in
> this area from Keith and Sathyanarayanan; I haven't gotten to them
> yet, so not sure whether they help address any of this.

Following two patches should also address the same issue.

My patch:

https://patchwork.kernel.org/project/linux-pci/patch/6f63321637fef86b6cf0beebf98b987062f9e811.1610153755.git.sathyanarayanan.kuppuswamy@linux.intel.com/

Keith's patch:

https://patchwork.kernel.org/project/linux-pci/patch/[email protected]/



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2021-02-08 12:42:29

by Hedi Berriche

Subject: Re: [PATCH v4 1/1] PCI/ERR: don't clobber status after reset_link()

On Mon, Jan 25, 2021 at 09:34 Kuppuswamy, Sathyanarayanan wrote:
>
>
>On 1/8/21 2:30 PM, Bjorn Helgaas wrote:
>>Can we push this forward now? There are several pending patches in
>>this area from Keith and Sathyanarayanan; I haven't gotten to them
>>yet, so not sure whether they help address any of this.
>
>Following two patches should also address the same issue.
>
>My patch:
>
>https://patchwork.kernel.org/project/linux-pci/patch/6f63321637fef86b6cf0beebf98b987062f9e811.1610153755.git.sathyanarayanan.kuppuswamy@linux.intel.com/

This series does *not* fix the problem for me.
>
>Keith's patch:
>
>https://patchwork.kernel.org/project/linux-pci/patch/[email protected]/

Keith's series *does* fix the problem for me:

Acked-by: Hedi Berriche <[email protected]>
Tested-by: Hedi Berriche <[email protected]>

Cheers,
Hedi.
>
>
>
>--
>Sathyanarayanan Kuppuswamy
>Linux Kernel Developer

--
Be careful of reading health books, you might die of a misprint.
-- Mark Twain