2022-04-27 10:48:35

by Mahesh Salgaonkar

Subject: [PATCH v6] PCI hotplug: rpaphp: Error out on busy status from get-sensor-state

When certain PHB HW failures cause phyp to recover the PHB, it marks the PE
state as temporarily unavailable until recovery is complete. This also
triggers an EEH handler in Linux, which needs to notify drivers and perform
recovery. But before notifying the driver about the PCI error, it uses the
get_adapter_status()->get-sensor-state() operation of the hotplug_slot to
determine whether the slot contains a device. If the slot is empty, the
recovery is skipped entirely.

However, on certain PHB failures, the rtas call get-sensor-state() returns
an extended busy error (9902) until the PHB is recovered by phyp. Once the
PHB is recovered, get-sensor-state() returns success with the correct
presence status. The RTAS call interface rtas_get_sensor() loops over the
rtas call on the extended delay return code (9902) until the return value is
either success (0) or error (-1). This causes the EEH handler to get stuck
for ~6 seconds before it can notify that the pci error has been detected and
stop any active operations. Hence, with I/O traffic running, the network
driver continues its operation during these 6 seconds and hits a timeout
(netdev watchdog). On timeout, the network driver goes into ffdc capture
mode and its reset path, assuming the PCI device is in a fatal condition.
This sometimes causes EEH recovery to fail, impacts the ssh connection, and
leads to the system being inaccessible.

------------
[52732.244731] DEBUG: ibm_read_slot_reset_state2()
[52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=>
[52732.244798] DEBUG: in eeh_slot_presence_check
[52732.244804] DEBUG: error state check
[52732.244807] DEBUG: Is slot hotpluggable
[52732.244810] DEBUG: hotpluggable ops ?
[52732.244953] DEBUG: Calling ops->get_adapter_status
[52732.244958] DEBUG: calling rpaphp_get_sensor_state
[52736.564262] ------------[ cut here ]------------
[52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed o>
[52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev>
[...]
[52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
[52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
------------

To avoid this issue, fix the pci hotplug driver (rpaphp) to return an error
if the slot presence state cannot be detected immediately while the PE is in
EEH recovery state. The current implementation uses the rtas_get_sensor()
API, which blocks the slot state check until the rtas call returns success.
Change rpaphp_get_sensor_state() to invoke rtas_call(get-sensor-state)
directly only if the respective pe is in EEH recovery state, and take action
based on the rtas return status.

In normal cases (the non-EEH case), rpaphp_get_sensor_state() continues to
invoke rtas_get_sensor() as before, with no change in existing behavior.

Signed-off-by: Mahesh Salgaonkar <[email protected]>
Reviewed-by: Nathan Lynch <[email protected]>
---
Change in v6:
- Fixed typos in the patch description as per review comments.

Change in v5:
- Fixup #define macros with parentheses around the values.

Change in V4:
- Error out on sensor busy only if the pe is going through EEH recovery,
  instead of always erroring out.

Change in V3:
- Invoke rtas_call(get-sensor-state) directly from rpaphp_get_sensor_state()
  and do special handling.
- See v2 at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2021-November/237336.html

Change in V2:
- Alternate approach to fix the EEH issue instead of delaying slot presence
check proposed at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2021-November/236956.html

Also refer:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2021-November/237027.html
---
drivers/pci/hotplug/rpaphp_pci.c | 100 +++++++++++++++++++++++++++++++++++++-
1 file changed, 97 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/hotplug/rpaphp_pci.c b/drivers/pci/hotplug/rpaphp_pci.c
index c380bdacd1466..e463e915a052a 100644
--- a/drivers/pci/hotplug/rpaphp_pci.c
+++ b/drivers/pci/hotplug/rpaphp_pci.c
@@ -18,12 +18,107 @@
 #include "../pci.h"	/* for pci_add_new_bus */
 #include "rpaphp.h"
 
+/*
+ * RTAS call get-sensor-state(DR_ENTITY_SENSE) return values as per PAPR:
+ *    -1: Hardware Error
+ *    -2: RTAS_BUSY
+ *    -3: Invalid sensor. RTAS Parameter Error.
+ * -9000: Need DR entity to be powered up and unisolated before RTAS call
+ * -9001: Need DR entity to be powered up, but not unisolated, before RTAS call
+ * -9002: DR entity unusable
+ *  990x: Extended delay - where x is a number in the range of 0-5
+ */
+#define RTAS_HARDWARE_ERROR	(-1)
+#define RTAS_INVALID_SENSOR	(-3)
+#define SLOT_UNISOLATED		(-9000)
+#define SLOT_NOT_UNISOLATED	(-9001)
+#define SLOT_NOT_USABLE		(-9002)
+
+static int rtas_to_errno(int rtas_rc)
+{
+	int rc;
+
+	switch (rtas_rc) {
+	case RTAS_HARDWARE_ERROR:
+		rc = -EIO;
+		break;
+	case RTAS_INVALID_SENSOR:
+		rc = -EINVAL;
+		break;
+	case SLOT_UNISOLATED:
+	case SLOT_NOT_UNISOLATED:
+		rc = -EFAULT;
+		break;
+	case SLOT_NOT_USABLE:
+		rc = -ENODEV;
+		break;
+	case RTAS_BUSY:
+	case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
+		rc = -EBUSY;
+		break;
+	default:
+		err("%s: unexpected RTAS error %d\n", __func__, rtas_rc);
+		rc = -ERANGE;
+		break;
+	}
+	return rc;
+}
+
+/*
+ * get_adapter_status() can be called by the EEH handler during EEH recovery.
+ * On certain PHB failures, the rtas call get-sensor-state() returns extended
+ * busy error (9902) until PHB is recovered by phyp. The rtas call interface
+ * rtas_get_sensor() loops over the rtas call on extended delay return code
+ * (9902) until the return value is either success (0) or error (-1). This
+ * causes the EEH handler to get stuck for ~6 seconds before it could notify
+ * that the pci error has been detected and stop any active operations. This
+ * sometimes causes EEH recovery to fail. To avoid this issue, invoke
+ * rtas_call(get-sensor-state) directly if the respective pe is in EEH recovery
+ * state and return -EBUSY error based on rtas return status. This will help
+ * the EEH handler to notify the driver about the pci error immediately and
+ * successfully proceed with EEH recovery steps.
+ */
+static int __rpaphp_get_sensor_state(struct slot *slot, int *state)
+{
+	int rc;
+#ifdef CONFIG_EEH
+	int token = rtas_token("get-sensor-state");
+	struct pci_dn *pdn;
+	struct eeh_pe *pe;
+	struct pci_controller *phb = PCI_DN(slot->dn)->phb;
+
+	if (token == RTAS_UNKNOWN_SERVICE)
+		return -ENOENT;
+
+	/*
+	 * Fallback to existing method for empty slot or pe isn't in EEH
+	 * recovery.
+	 */
+	if (list_empty(&PCI_DN(phb->dn)->child_list))
+		goto fallback;
+
+	pdn = list_first_entry(&PCI_DN(phb->dn)->child_list,
+			       struct pci_dn, list);
+	pe = eeh_dev_to_pe(pdn->edev);
+	if (pe && (pe->state & EEH_PE_RECOVERING)) {
+		rc = rtas_call(token, 2, 2, state, DR_ENTITY_SENSE,
+			       slot->index);
+		if (rc)
+			rc = rtas_to_errno(rc);
+		return rc;
+	}
+fallback:
+#endif
+	rc = rtas_get_sensor(DR_ENTITY_SENSE, slot->index, state);
+	return rc;
+}
+
 int rpaphp_get_sensor_state(struct slot *slot, int *state)
 {
 	int rc;
 	int setlevel;
 
-	rc = rtas_get_sensor(DR_ENTITY_SENSE, slot->index, state);
+	rc = __rpaphp_get_sensor_state(slot, state);
 
 	if (rc < 0) {
 		if (rc == -EFAULT || rc == -EEXIST) {
@@ -39,8 +134,7 @@ int rpaphp_get_sensor_state(struct slot *slot, int *state)
 			dbg("%s: power on slot[%s] failed rc=%d.\n",
 				__func__, slot->name, rc);
 		} else {
-			rc = rtas_get_sensor(DR_ENTITY_SENSE,
-					slot->index, state);
+			rc = __rpaphp_get_sensor_state(slot, state);
 		}
 	} else if (rc == -ENODEV)
 		info("%s: slot is unusable\n", __func__);



2022-04-29 07:01:48

by Nathan Lynch

Subject: Re: [PATCH v6] PCI hotplug: rpaphp: Error out on busy status from get-sensor-state

Bjorn Helgaas <[email protected]> writes:
> On Tue, Apr 26, 2022 at 11:07:39PM +0530, Mahesh Salgaonkar wrote:
>> +/*
>> + * RTAS call get-sensor-state(DR_ENTITY_SENSE) return values as per PAPR:
>> + * -1: Hardware Error
>> + * -2: RTAS_BUSY
>> + * -3: Invalid sensor. RTAS Parameter Error.
>> + * -9000: Need DR entity to be powered up and unisolated before RTAS call
>> + * -9001: Need DR entity to be powered up, but not unisolated, before RTAS call
>> + * -9002: DR entity unusable
>> + * 990x: Extended delay - where x is a number in the range of 0-5
>> + */
>> +#define RTAS_HARDWARE_ERROR (-1)
>> +#define RTAS_INVALID_SENSOR (-3)
>> +#define SLOT_UNISOLATED (-9000)
>> +#define SLOT_NOT_UNISOLATED (-9001)
>
> I would say "isolated" instead of "not unisolated", but I suppose this
> follows language in the spec. If so, you should follow the spec.

"not unisolated" is the spec language.


>> +#define SLOT_NOT_USABLE (-9002)
>> +
>> +static int rtas_to_errno(int rtas_rc)
>> +{
>> + int rc;
>> +
>> + switch (rtas_rc) {
>> + case RTAS_HARDWARE_ERROR:
>> + rc = -EIO;
>> + break;
>> + case RTAS_INVALID_SENSOR:
>> + rc = -EINVAL;
>> + break;
>> + case SLOT_UNISOLATED:
>> + case SLOT_NOT_UNISOLATED:
>> + rc = -EFAULT;
>> + break;
>> + case SLOT_NOT_USABLE:
>> + rc = -ENODEV;
>> + break;
>> + case RTAS_BUSY:
>> + case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
>> + rc = -EBUSY;
>> + break;
>> + default:
>> + err("%s: unexpected RTAS error %d\n", __func__, rtas_rc);
>> + rc = -ERANGE;
>> + break;
>> + }
>> + return rc;
>
> This basically duplicates rtas_error_rc(). Why do we need two copies?

It treats RTAS_BUSY, RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX
differently, which is part of the point of this change.

Aside: rtas_error_rc() (from powerpc's rtas.c) is badly named. Its
conversions make sense for only a handful of RTAS calls. RTAS error
codes have function-specific interpretations.

2022-04-30 17:28:45

by Bjorn Helgaas

Subject: Re: [PATCH v6] PCI hotplug: rpaphp: Error out on busy status from get-sensor-state

On Thu, Apr 28, 2022 at 05:31:38PM -0500, Nathan Lynch wrote:
> Bjorn Helgaas <[email protected]> writes:
> > On Tue, Apr 26, 2022 at 11:07:39PM +0530, Mahesh Salgaonkar wrote:
> >> +/*
> >> + * RTAS call get-sensor-state(DR_ENTITY_SENSE) return values as per PAPR:
> >> + * -1: Hardware Error
> >> + * -2: RTAS_BUSY
> >> + * -3: Invalid sensor. RTAS Parameter Error.
> >> + * -9000: Need DR entity to be powered up and unisolated before RTAS call
> >> + * -9001: Need DR entity to be powered up, but not unisolated, before RTAS call
> >> + * -9002: DR entity unusable
> >> + * 990x: Extended delay - where x is a number in the range of 0-5
> >> + */
> >> +#define RTAS_HARDWARE_ERROR (-1)
> >> +#define RTAS_INVALID_SENSOR (-3)
> >> +#define SLOT_UNISOLATED (-9000)
> >> +#define SLOT_NOT_UNISOLATED (-9001)

> >> +static int rtas_to_errno(int rtas_rc)
> >> +{
> >> + int rc;
> >> +
> >> + switch (rtas_rc) {
> >> + case RTAS_HARDWARE_ERROR:
> >> + rc = -EIO;
> >> + break;
> >> + case RTAS_INVALID_SENSOR:
> >> + rc = -EINVAL;
> >> + break;
> >> + case SLOT_UNISOLATED:
> >> + case SLOT_NOT_UNISOLATED:
> >> + rc = -EFAULT;
> >> + break;
> >> + case SLOT_NOT_USABLE:
> >> + rc = -ENODEV;
> >> + break;
> >> + case RTAS_BUSY:
> >> + case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
> >> + rc = -EBUSY;
> >> + break;
> >> + default:
> >> + err("%s: unexpected RTAS error %d\n", __func__, rtas_rc);
> >> + rc = -ERANGE;
> >> + break;
> >> + }
> >> + return rc;
> >
> > This basically duplicates rtas_error_rc(). Why do we need two copies?
>
> It treats RTAS_BUSY, RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX
> differently, which is part of the point of this change.

I think it would reduce confusion overall to:

- add RTAS_HARDWARE_ERROR, RTAS_INVALID_SENSOR to rtas.h

- rename and move SLOT_UNISOLATED, etc to rtas.h; they look
analogous to function-specific things like RTAS_SUSPEND_ABORTED

- change rtas_error_rc() to use the RTAS_HARDWARE_ERROR, etc
constants

- use the constants (SLOT_NOT_USABLE) instead of "9902" in the
commit log and code comments

> Aside: rtas_error_rc() (from powerpc's rtas.c) is badly named. Its
> conversions make sense for only a handful of RTAS calls. RTAS error
> codes have function-specific interpretations.

Maybe there's a case for factoring out the generic error codes and
having rtas_to_errno() (which sounds like maybe it should be renamed as
well) handle the function-specific part, falling back to the generic
one otherwise:

int rtas_to_errno(int rtas_rc)
{
	switch (rtas_rc) {
	case SLOT_UNISOLATED:
	case SLOT_NOT_UNISOLATED:
		return -EINVAL;
	case SLOT_NOT_USABLE:
		return -ENODEV;
	...
	default:
		return rtas_error_rc(rtas_rc);
	}
}
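
For the rtas.h side, a purely illustrative sketch (the names and comments
here are guesses, not from any posted patch) of how the constants could look
once moved and renamed as suggested above:

/* arch/powerpc/include/asm/rtas.h -- hypothetical additions */
#define RTAS_HARDWARE_ERROR		(-1)	/* generic: Hardware Error */
#define RTAS_INVALID_SENSOR		(-3)	/* get-sensor-state: Parameter Error */

/* get-sensor-state(DR_ENTITY_SENSE) specific, like RTAS_SUSPEND_ABORTED */
#define RTAS_SLOT_UNISOLATED		(-9000)	/* power up and unisolate first */
#define RTAS_SLOT_NOT_UNISOLATED	(-9001)	/* power up, but do not unisolate */
#define RTAS_SLOT_NOT_USABLE		(-9002)	/* DR entity unusable */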

2022-05-01 08:45:29

by Bjorn Helgaas

Subject: Re: [PATCH v6] PCI hotplug: rpaphp: Error out on busy status from get-sensor-state

On Tue, Apr 26, 2022 at 11:07:39PM +0530, Mahesh Salgaonkar wrote:
> When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> state as temporarily unavailable until recovery is complete. This also
> triggers an EEH handler in Linux which needs to notify drivers, and perform
> recovery. But before notifying the driver about the PCI error it uses
> get_adapter_state()->get-sensor-state() operation of the hotplug_slot to
> determine if the slot contains a device or not. if the slot is empty, the
If
> recovery is skipped entirely.
>
> However on certain PHB failures, the rtas call get-sensor-state() returns
> extended busy error (9902) until PHB is recovered by phyp. Once PHB is
> recovered, the get-sensor-state() returns success with correct presence
> status. The RTAS call interface rtas_get_sensor() loops over the rtas call
> on extended delay return code (9902) until the return value is either
> success (0) or error (-1). This causes the EEH handler to get stuck for ~6
> seconds before it could notify that the pci error has been detected and
> stop any active operations. Hence with running I/O traffic, during this 6
> seconds, the network driver continues its operation and hits a timeout
> (netdev watchdog). On timeouts, network driver go into ffdc capture mode

I assume ffdc == First Failure Data Capture (please expand and remove
the redundant "capture"). Is this a powerpc thing? "ffdc" doesn't
occur in drivers/net, so I don't know what network driver this refers
to.

> and reset path assuming the PCI device is in fatal condition. This
> sometimes causes EEH recovery to fail. This impacts the ssh connection and
> leads to the system being inaccessible.
>
> ------------
> [52732.244731] DEBUG: ibm_read_slot_reset_state2()
> [52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=>
> [52732.244798] DEBUG: in eeh_slot_presence_check
> [52732.244804] DEBUG: error state check
> [52732.244807] DEBUG: Is slot hotpluggable
> [52732.244810] DEBUG: hotpluggable ops ?
> [52732.244953] DEBUG: Calling ops->get_adapter_status
> [52732.244958] DEBUG: calling rpaphp_get_sensor_state
> [52736.564262] ------------[ cut here ]------------
> [52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed o>
> [52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev>
> [...]
> [52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
> [52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
> ------------
>
> To avoid this issue, fix the pci hotplug driver (rpaphp) to return an error
> if the slot presence state can not be detected immediately while PE is in
> EEH recovery state. Current implementation uses rtas_get_sensor() API which
> blocks the slot check state until rtas call returns success. Change
> rpaphp_get_sensor_state() to invoke rtas_call(get-sensor-state) directly
> only if the respective pe is in EEH recovery state, and take actions based
> on rtas return status.

I'm not too clear on what the problem is. I guess you don't want the
netdev watchdog timeout. Is the NIC still operating? It's just the
PHB leading to the NIC that has an issue?

Apparently the remedy is to return -ENODEV (from SLOT_NOT_USABLE ==
-9002) from rpaphp_get_sensor_state() instead of doing the retries.
It would be good to explain why *that* is safe.

> In normal cases (non-EEH case) rpaphp_get_sensor_state() will continue to
> invoke rtas_get_sensor() as it was earlier with no change in existing
> behavior.

Nits:
Follow historical convention in subject line.
s/phyp/pHyp/ (or whatever the normal styling is)
s/pe/PE/ (used inconsistently above and in comment)
s/rtas/RTAS/ (Michael mentioned this already, but I guess you missed some)
s/pci/PCI/
s/ffdc/First Failure Data Capture/ (or the correct expansion)
Make similar changes in the comment below.

> Signed-off-by: Mahesh Salgaonkar <[email protected]>
> Reviewed-by: Nathan Lynch <[email protected]>
> ---
> Change in v6:
> - Fixed typo's in the patch description as per review comments.
>
> Change in v5:
> - Fixup #define macros with parentheses around the values.
>
> Change in V4:
> - Error out on sensor busy only if pe is going through EEH recovery instead
> of always error out.
>
> Change in V3:
> - Invoke rtas_call(get-sensor-state) directly from
> rpaphp_get_sensor_state() directly and do special handling.
> - See v2 at
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2021-November/237336.html
>
> Change in V2:
> - Alternate approach to fix the EEH issue instead of delaying slot presence
> check proposed at
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2021-November/236956.html
>
> Also refer:
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2021-November/237027.html
> ---
> drivers/pci/hotplug/rpaphp_pci.c | 100 +++++++++++++++++++++++++++++++++++++-
> 1 file changed, 97 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/hotplug/rpaphp_pci.c b/drivers/pci/hotplug/rpaphp_pci.c
> index c380bdacd1466..e463e915a052a 100644
> --- a/drivers/pci/hotplug/rpaphp_pci.c
> +++ b/drivers/pci/hotplug/rpaphp_pci.c
> @@ -18,12 +18,107 @@
> #include "../pci.h" /* for pci_add_new_bus */
> #include "rpaphp.h"
>
> +/*
> + * RTAS call get-sensor-state(DR_ENTITY_SENSE) return values as per PAPR:
> + * -1: Hardware Error
> + * -2: RTAS_BUSY
> + * -3: Invalid sensor. RTAS Parameter Error.
> + * -9000: Need DR entity to be powered up and unisolated before RTAS call
> + * -9001: Need DR entity to be powered up, but not unisolated, before RTAS call
> + * -9002: DR entity unusable
> + * 990x: Extended delay - where x is a number in the range of 0-5
> + */
> +#define RTAS_HARDWARE_ERROR (-1)
> +#define RTAS_INVALID_SENSOR (-3)
> +#define SLOT_UNISOLATED (-9000)
> +#define SLOT_NOT_UNISOLATED (-9001)

I would say "isolated" instead of "not unisolated", but I suppose this
follows language in the spec. If so, you should follow the spec.

> +#define SLOT_NOT_USABLE (-9002)
> +
> +static int rtas_to_errno(int rtas_rc)
> +{
> + int rc;
> +
> + switch (rtas_rc) {
> + case RTAS_HARDWARE_ERROR:
> + rc = -EIO;
> + break;
> + case RTAS_INVALID_SENSOR:
> + rc = -EINVAL;
> + break;
> + case SLOT_UNISOLATED:
> + case SLOT_NOT_UNISOLATED:
> + rc = -EFAULT;
> + break;
> + case SLOT_NOT_USABLE:
> + rc = -ENODEV;
> + break;
> + case RTAS_BUSY:
> + case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
> + rc = -EBUSY;
> + break;
> + default:
> + err("%s: unexpected RTAS error %d\n", __func__, rtas_rc);
> + rc = -ERANGE;
> + break;
> + }
> + return rc;

This basically duplicates rtas_error_rc(). Why do we need two copies?

> +}
> +
> +/*
> + * get_adapter_status() can be called by the EEH handler during EEH recovery.
> + * On certain PHB failures, the rtas call get-sensor-state() returns extended
> + * busy error (9902) until PHB is recovered by phyp. The rtas call interface
> + * rtas_get_sensor() loops over the rtas call on extended delay return code
> + * (9902) until the return value is either success (0) or error (-1). This
> + * causes the EEH handler to get stuck for ~6 seconds before it could notify
> + * that the pci error has been detected and stop any active operations. This
> + * sometimes causes EEH recovery to fail. To avoid this issue, invoke
> + * rtas_call(get-sensor-state) directly if the respective pe is in EEH recovery
> + * state and return -EBUSY error based on rtas return status. This will help
> + * the EEH handler to notify the driver about the pci error immediately and
> + * successfully proceed with EEH recovery steps.
> + */
> +static int __rpaphp_get_sensor_state(struct slot *slot, int *state)
> +{
> + int rc;
> +#ifdef CONFIG_EEH
> + int token = rtas_token("get-sensor-state");
> + struct pci_dn *pdn;
> + struct eeh_pe *pe;
> + struct pci_controller *phb = PCI_DN(slot->dn)->phb;
> +
> + if (token == RTAS_UNKNOWN_SERVICE)
> + return -ENOENT;
> +
> + /*
> + * Fallback to existing method for empty slot or pe isn't in EEH
> + * recovery.
> + */
> + if (list_empty(&PCI_DN(phb->dn)->child_list))
> + goto fallback;
> +
> + pdn = list_first_entry(&PCI_DN(phb->dn)->child_list,
> + struct pci_dn, list);

I guess you don't need locking to ensure that child_list doesn't
become empty between the list_empty() and the list_first_entry()?
I didn't see locking at other places that traverse it, but it's not
obvious to me what protects it.

> + pe = eeh_dev_to_pe(pdn->edev);
> + if (pe && (pe->state & EEH_PE_RECOVERING)) {
> + rc = rtas_call(token, 2, 2, state, DR_ENTITY_SENSE,
> + slot->index);
> + if (rc)
> + rc = rtas_to_errno(rc);
> + return rc;

I'd probably make rtas_to_errno(0) return 0, then do:

return rtas_to_errno(rc);

> + }
> +fallback:
> +#endif
> + rc = rtas_get_sensor(DR_ENTITY_SENSE, slot->index, state);
> + return rc;

return rtas_get_sensor(DR_ENTITY_SENSE, ...);

> +}
> +
> int rpaphp_get_sensor_state(struct slot *slot, int *state)
> {
> int rc;
> int setlevel;
>
> - rc = rtas_get_sensor(DR_ENTITY_SENSE, slot->index, state);
> + rc = __rpaphp_get_sensor_state(slot, state);
>
> if (rc < 0) {
> if (rc == -EFAULT || rc == -EEXIST) {
> @@ -39,8 +134,7 @@ int rpaphp_get_sensor_state(struct slot *slot, int *state)
> dbg("%s: power on slot[%s] failed rc=%d.\n",
> __func__, slot->name, rc);
> } else {
> - rc = rtas_get_sensor(DR_ENTITY_SENSE,
> - slot->index, state);
> + rc = __rpaphp_get_sensor_state(slot, state);
> }
> } else if (rc == -ENODEV)
> info("%s: slot is unusable\n", __func__);
>
>

2022-06-12 17:31:58

by Mahesh Salgaonkar

Subject: Re: [PATCH v6] PCI hotplug: rpaphp: Error out on busy status from get-sensor-state

On 2022-04-28 15:47:40 Thu, Bjorn Helgaas wrote:
> On Tue, Apr 26, 2022 at 11:07:39PM +0530, Mahesh Salgaonkar wrote:
> > When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> > state as temporarily unavailable until recovery is complete. This also
> > triggers an EEH handler in Linux which needs to notify drivers, and perform
> > recovery. But before notifying the driver about the PCI error it uses
> > get_adapter_state()->get-sensor-state() operation of the hotplug_slot to
> > determine if the slot contains a device or not. if the slot is empty, the
> If
> > recovery is skipped entirely.
> >
> > However on certain PHB failures, the rtas call get-sensor-state() returns
> > extended busy error (9902) until PHB is recovered by phyp. Once PHB is
> > recovered, the get-sensor-state() returns success with correct presence
> > status. The RTAS call interface rtas_get_sensor() loops over the rtas call
> > on extended delay return code (9902) until the return value is either
> > success (0) or error (-1). This causes the EEH handler to get stuck for ~6
> > seconds before it could notify that the pci error has been detected and
> > stop any active operations. Hence with running I/O traffic, during this 6
> > seconds, the network driver continues its operation and hits a timeout
> > (netdev watchdog). On timeouts, network driver go into ffdc capture mode
>
> I assume ffdc == First Failure Data Capture (please expand and remove
> the redundant "capture") Is this a powerpc thing? "ffdc" doesn't
> occur in drivers/net, so I don't know what network driver this refers
> to.

Sorry for the delay in response.

What I meant by ffdc here is that the bnx2x driver calls bnx2x_panic_dump()
soon after the netdev watchdog timeout and starts dumping additional debug
information to the console.

======
[ 9416.991596] bnx2x: [bnx2x_panic_dump:930(enP19p1s0f1)]begin crash dump -----------------
[ 9416.991599] bnx2x: [bnx2x_panic_dump:940(enP19p1s0f1)]def_idx(0x438) def_att_idx(0x4) attn_state(0x0) spq_prod_idx(0x42) next_stats_cnt(0x413)
[ 9416.991604] bnx2x: [bnx2x_panic_dump:945(enP19p1s0f1)]DSB: attn bits(0x0) ack(0x100) id(0x0) idx(0x4)
[ 9416.991608] bnx2x: [bnx2x_panic_dump:946(enP19p1s0f1)] def (0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x43c 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0) igu_sb_id(0xff) igu_seg_id(0xff) pf_id(0xff) vnic_id(0xff) vf_id(0xff) vf_valid (0xff) state(0xff)
[...]
[...]
[ 9417.778071] bnx2x: [bnx2x_mc_assert:751(enP19p1s0f1)]USTORM_ASSERT_INDEX 0x2f = 0xffffffff 0xffffffff 0xffffffff 0xffffffff
[ 9417.778077] bnx2x: [bnx2x_mc_assert:751(enP19p1s0f1)]USTORM_ASSERT_INDEX 0x30 = 0xffffffff 0xffffffff 0xffffffff 0xffffffff
[ 9417.778083] bnx2x: [bnx2x_mc_assert:751(enP19p1s0f1)]USTORM_ASSERT_INDEX 0x31 = 0xffffffff 0xffffffff 0xffffffff 0xffffffff
[ 9417.778086] bnx2x: [bnx2x_mc_assert:763(enP19p1s0f1)]Chip Revision: everest3, FW Version: 7_13_15
[ 9417.778091] bnx2x: [bnx2x_panic_dump:1202(enP19p1s0f1)]end crash dump -----------------
======

>
> > and reset path assuming the PCI device is in fatal condition. This
> > sometimes causes EEH recovery to fail. This impacts the ssh connection and
> > leads to the system being inaccessible.
> >
> > ------------
> > [52732.244731] DEBUG: ibm_read_slot_reset_state2()
> > [52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=>
> > [52732.244798] DEBUG: in eeh_slot_presence_check
> > [52732.244804] DEBUG: error state check
> > [52732.244807] DEBUG: Is slot hotpluggable
> > [52732.244810] DEBUG: hotpluggable ops ?
> > [52732.244953] DEBUG: Calling ops->get_adapter_status
> > [52732.244958] DEBUG: calling rpaphp_get_sensor_state
> > [52736.564262] ------------[ cut here ]------------
> > [52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed o>
> > [52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev>
> > [...]
> > [52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
> > [52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
> > ------------
> >
> > To avoid this issue, fix the pci hotplug driver (rpaphp) to return an error
> > if the slot presence state can not be detected immediately while PE is in
> > EEH recovery state. Current implementation uses rtas_get_sensor() API which
> > blocks the slot check state until rtas call returns success. Change
> > rpaphp_get_sensor_state() to invoke rtas_call(get-sensor-state) directly
> > only if the respective pe is in EEH recovery state, and take actions based
> > on rtas return status.
>
> I'm not too clear on what the problem is. I guess you don't want the
> netdev watchdog timeout. Is the NIC still operating? It's just the
> PHB leading to the NIC that has an issue?

Yes, the NIC stops functioning. Since the EEH handler delays reporting the
PCI error to the driver, the driver starts its own recovery while pHyp is
still recovering the PHB. As part of that recovery, the driver tries to
reset the device, which keeps failing since every PCI read/write returns
ff's. By the time EEH recovery kicks in, the driver is unable to recover
the device. Getting the NIC working again then requires a reboot or
re-assigning the I/O adapter from the HMC. Hence, it becomes important to
inform the driver about the PCI error detection as early as possible. This
way the driver is aware of the PCI error and waits for the EEH handler's
next action for a successful recovery, instead of attempting its own
recovery and messing things up while pHyp is still recovering the PHB.

[ 9531.168587] EEH: Beginning: 'slot_reset'
[ 9531.168601] PCI 0013:01:00.0#10000: EEH: Invoking bnx2x->slot_reset()
[...]
[ 9614.110094] bnx2x: [bnx2x_func_stop:9129(enP19p1s0f0)]FUNC_STOP ramrod failed. Running a dry transaction
[ 9614.110300] bnx2x: [bnx2x_igu_int_disable:902(enP19p1s0f0)]BUG! Proper val not read from IGU!
[ 9629.178067] bnx2x: [bnx2x_fw_command:3055(enP19p1s0f0)]FW failed to respond!
[ 9629.178085] bnx2x 0013:01:00.0 enP19p1s0f0: bc 7.10.4
[ 9629.178091] bnx2x: [bnx2x_fw_dump_lvl:789(enP19p1s0f0)]Cannot dump MCP info while in PCI error
[ 9644.241813] bnx2x: [bnx2x_io_slot_reset:14245(enP19p1s0f0)]IO slot reset --> driver unload
[...]
[ 9644.241819] PCI 0013:01:00.0#10000: EEH: bnx2x driver reports: 'disconnect'
[ 9644.241823] PCI 0013:01:00.1#10000: EEH: Invoking bnx2x->slot_reset()
[ 9644.241827] bnx2x: [bnx2x_io_slot_reset:14229(enP19p1s0f1)]IO slot reset initializing...
[ 9644.241916] bnx2x 0013:01:00.1: enabling device (0140 -> 0142)
[ 9644.258604] bnx2x: [bnx2x_io_slot_reset:14245(enP19p1s0f1)]IO slot reset --> driver unload
[ 9644.258612] PCI 0013:01:00.1#10000: EEH: bnx2x driver reports: 'disconnect'
[ 9644.258615] EEH: Finished:'slot_reset' with aggregate recovery state:'disconnect'
[ 9644.258620] EEH: Unable to recover from failure from PHB#13-PE#10000.
[ 9644.261811] EEH: Beginning: 'error_detected(permanent failure)'
[...]
[ 9644.261823] EEH: Finished:'error_detected(permanent failure)'

>
> Apparently the remedy is to return -ENODEV (from SLOT_NOT_USABLE ==
> -9002) from rpaphp_get_sensor_state() instead of doing the retries.
> It would be good to explain why *that* is safe.

This remedy applies only to a PE which is in EEH recovery mode. In all
other cases there is no functional change. This way the EEH handler will
not be blocked on rpaphp_get_sensor_state() and can immediately notify the
driver about the PCI error and stop any active operations.

>
> > In normal cases (non-EEH case) rpaphp_get_sensor_state() will continue to
> > invoke rtas_get_sensor() as it was earlier with no change in existing
> > behavior.
>
> Nits:
> Follow historical convention in subject line.
> s/phyp/pHyp/ (or whatever the normal styling is)
> s/pe/PE/ (used inconsistently above and in comment)
> s/rtas/RTAS/ (Michael mentioned this already, but I guess you missed some)
> s/pci/PCI/
> s/ffdc/First Failure Data Capture/ (or the correct expansion)
> Make similar changes in the comment below.
>

Sure, will do.

[...]
> > +}
> > +
> > +/*
> > + * get_adapter_status() can be called by the EEH handler during EEH recovery.
> > + * On certain PHB failures, the rtas call get-sensor-state() returns extended
> > + * busy error (9902) until PHB is recovered by phyp. The rtas call interface
> > + * rtas_get_sensor() loops over the rtas call on extended delay return code
> > + * (9902) until the return value is either success (0) or error (-1). This
> > + * causes the EEH handler to get stuck for ~6 seconds before it could notify
> > + * that the pci error has been detected and stop any active operations. This
> > + * sometimes causes EEH recovery to fail. To avoid this issue, invoke
> > + * rtas_call(get-sensor-state) directly if the respective pe is in EEH recovery
> > + * state and return -EBUSY error based on rtas return status. This will help
> > + * the EEH handler to notify the driver about the pci error immediately and
> > + * successfully proceed with EEH recovery steps.
> > + */
> > +static int __rpaphp_get_sensor_state(struct slot *slot, int *state)
> > +{
> > + int rc;
> > +#ifdef CONFIG_EEH
> > + int token = rtas_token("get-sensor-state");
> > + struct pci_dn *pdn;
> > + struct eeh_pe *pe;
> > + struct pci_controller *phb = PCI_DN(slot->dn)->phb;
> > +
> > + if (token == RTAS_UNKNOWN_SERVICE)
> > + return -ENOENT;
> > +
> > + /*
> > + * Fallback to existing method for empty slot or pe isn't in EEH
> > + * recovery.
> > + */
> > + if (list_empty(&PCI_DN(phb->dn)->child_list))
> > + goto fallback;
> > +
> > + pdn = list_first_entry(&PCI_DN(phb->dn)->child_list,
> > + struct pci_dn, list);
>
> I guess you don't need locking to ensure that child_list doesn't
> become empty between the list_empty() and the list_first_entry()?

Maybe I can switch to using list_first_entry_or_null()?
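
Something like this (sketch only, on top of the v6 hunk quoted above), which
would also drop the separate list_empty() check:

-	if (list_empty(&PCI_DN(phb->dn)->child_list))
-		goto fallback;
-
-	pdn = list_first_entry(&PCI_DN(phb->dn)->child_list,
-			       struct pci_dn, list);
+	pdn = list_first_entry_or_null(&PCI_DN(phb->dn)->child_list,
+				       struct pci_dn, list);
+	if (!pdn)
+		goto fallback;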

> I didn't see locking at other places that traverse it, but it's not
> obvious to me what protects it.
>
> > + pe = eeh_dev_to_pe(pdn->edev);
> > + if (pe && (pe->state & EEH_PE_RECOVERING)) {
> > + rc = rtas_call(token, 2, 2, state, DR_ENTITY_SENSE,
> > + slot->index);
> > + if (rc)
> > + rc = rtas_to_errno(rc);
> > + return rc;
>
> I'd probably make rtas_to_errno(0) return 0, then do:
>
> return rtas_to_errno(rc);
>

Agree.
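
Something along these lines, then (untested sketch, not the posted v6 code).
In rtas_to_errno(), let success map straight through:

 	switch (rtas_rc) {
+	case 0:
+		rc = 0;
+		break;
 	case RTAS_HARDWARE_ERROR:
 		rc = -EIO;
 		break;

and in __rpaphp_get_sensor_state(), collapse the call site to a single
return:

 	pe = eeh_dev_to_pe(pdn->edev);
-	if (pe && (pe->state & EEH_PE_RECOVERING)) {
-		rc = rtas_call(token, 2, 2, state, DR_ENTITY_SENSE,
-			       slot->index);
-		if (rc)
-			rc = rtas_to_errno(rc);
-		return rc;
-	}
+	if (pe && (pe->state & EEH_PE_RECOVERING))
+		return rtas_to_errno(rtas_call(token, 2, 2, state,
+					       DR_ENTITY_SENSE, slot->index));
 fallback: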

Thanks,
-Mahesh.