2022-09-17 12:25:23

by Maciej W. Rozycki

[permalink] [raw]
Subject: [PATCH v5 0/5] pci: Work around ASMedia ASM2824 PCIe link training failures

Hi,

This is v5 of the change to work around a PCIe link training phenomenon
where a pair of devices both capable of operating at a link speed above
2.5GT/s seems unable to negotiate the link speed and continues training
indefinitely with the Link Training bit switching on and off repeatedly
and the data link layer never reaching the active state.

This was originally observed in a configuration featuring a downstream
port of the ASMedia ASM2824 Gen 3 switch wired to the upstream port of the
Pericom PI7C9X2G304 Gen 2 switch. However in the course of review I have
come to the conclusion that similarly to the earlier similar change to
U-Boot it is indeed expected to be safe to apply this workaround to any
downstream port that has failed link negotiation provided that:

1. the port is capable of reporting the data link layer link active
status (because unlike U-Boot we cannot busy-loop continuously polling
the link training bit),

and:

2. we don't attempt to lift the 2.5GT/s speed restriction, imposed as the
basis of the workaround, for devices not explicitly known to continue
working in that case.

It is expected to be safe because the workaround is applied to a failed
link, that is one that does not (at the time this code is executed) work
anyway, so trying to bring it up cannot make the situation worse. So this
version of the workaround is attempted for all PCIe devices discovered,
and only the lifting of the 2.5GT/s speed restriction is qualified by the
vendor:device ID, currently one of the ASMedia ASM2824 device only.

Broadening the scope of the quirk has in turn made it necessary to make
some adjustments to code elsewhere and consequently what was originally a
single patch has now become a small series instead.

This has been verified with a SiFive HiFive unmatched board, booting with
or without the workaround activated in U-Boot, which covered both the link
retraining part of the quirk and the lifting of speed restriction already
imposed by U-Boot.

Please see individual change descriptions for further details.

Questions or comments? Otherwise please apply.

Maciej


2022-09-17 12:25:56

by Maciej W. Rozycki

[permalink] [raw]
Subject: [PATCH v5 3/5] PCI: Export PCI link retrain timeout

Rename LINK_RETRAIN_TIMEOUT to PCIE_LINK_RETRAIN_TIMEOUT and make it
available via "pci.h" for PCI drivers to use.

Signed-off-by: Maciej W. Rozycki <[email protected]>
---
New change in v5.
---
drivers/pci/pci.h | 2 ++
drivers/pci/pcie/aspm.c | 4 +---
2 files changed, 3 insertions(+), 3 deletions(-)

linux-pcie-link-retrain-timeout.diff
Index: linux-macro/drivers/pci/pci.h
===================================================================
--- linux-macro.orig/drivers/pci/pci.h
+++ linux-macro/drivers/pci/pci.h
@@ -11,6 +11,8 @@

#define PCI_VSEC_ID_INTEL_TBT 0x1234 /* Thunderbolt */

+#define PCIE_LINK_RETRAIN_TIMEOUT HZ
+
extern const unsigned char pcie_link_speed[];
extern bool pci_early_dump;

Index: linux-macro/drivers/pci/pcie/aspm.c
===================================================================
--- linux-macro.orig/drivers/pci/pcie/aspm.c
+++ linux-macro/drivers/pci/pcie/aspm.c
@@ -89,8 +89,6 @@ static const char *policy_str[] = {
[POLICY_POWER_SUPERSAVE] = "powersupersave"
};

-#define LINK_RETRAIN_TIMEOUT HZ
-
/*
* The L1 PM substate capability is only implemented in function 0 in a
* multi function device.
@@ -212,7 +210,7 @@ static bool pcie_retrain_link(struct pci
}

/* Wait for link training end. Break out after waiting for timeout */
- end_jiffies = jiffies + LINK_RETRAIN_TIMEOUT;
+ end_jiffies = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
do {
pcie_capability_read_word(parent, PCI_EXP_LNKSTA, &reg16);
if (!(reg16 & PCI_EXP_LNKSTA_LT))

2022-09-17 12:31:05

by Maciej W. Rozycki

[permalink] [raw]
Subject: [PATCH v5 5/5] PCI: Work around PCIe link training failures

Attempt to handle cases such as with a downstream port of the ASMedia
ASM2824 PCIe switch where link training never completes and the link
continues switching between speeds indefinitely with the data link layer
never reaching the active state.

It has been observed with a downstream port of the ASMedia ASM2824 Gen 3
switch wired to the upstream port of the Pericom PI7C9X2G304 Gen 2
switch, using a Delock Riser Card PCI Express x1 > 2 x PCIe x1 device,
P/N 41433, wired to a SiFive HiFive Unmatched board. In this setup the
switches are supposed to negotiate the link speed of preferably 5.0GT/s,
falling back to 2.5GT/s.

Instead the link continues oscillating between the two speeds, at the
rate of 34-35 times per second, with link training reported repeatedly
active ~84% of the time. Forcibly limiting the target link speed to
2.5GT/s with the upstream ASM2824 device however makes the two switches
communicate correctly. Removing the speed restriction afterwards makes
the two devices switch to 5.0GT/s then.

Make use of these observations then and detect the inability to train
the link, by checking for the Data Link Layer Link Active status bit
being off while the Link Bandwidth Management Status indicating that
hardware has changed the link speed or width in an attempt to correct
unreliable link operation.

Restrict the speed to 2.5GT/s then with the Target Link Speed field,
request a retrain and wait 200ms for the data link to go up. If this
turns out successful, then lift the restriction, letting the devices
negotiate a higher speed.

Also check for a 2.5GT/s speed restriction the firmware may have already
arranged and lift it too with ports of devices known to continue working
afterwards, currently the ASM2824 only, that already report their data
link being up.

Signed-off-by: Maciej W. Rozycki <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://source.denx.de/u-boot/u-boot/-/commit/a398a51ccc68
---
Changes from v4:

- Remove <linux/bug.h> inclusion no longer needed.

- Make the quirk generic based on probing device features rather than
specific to the ASM2824 part only; take the Retrain Link bit erratum
into account.

- Still lift the 2.5GT/s speed restriction with the ASM2824 only.

- Increase retrain timeout from 200ms to 1s (PCIE_LINK_RETRAIN_TIMEOUT).

- Remove retrain success notification.

- Use PCIe helpers rather than generic PCI functions throughout.

- Trim down and update the wording of the change description for the
switch from an ASM2824-specific to a generic fixup.

Changes from v3:

- Remove the <linux/pci_ids.h> entry for the ASM2824.

Changes from v2:

- Regenerate for 5.17-rc2 for a merge conflict.

- Replace BUG_ON for a missing PCI Express capability with WARN_ON and an
early return.

Changes from v1:

- Regenerate for a merge conflict.
---
drivers/pci/quirks.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 118 insertions(+)

linux-pcie-asm2824-manual-retrain.diff
Index: linux-macro/drivers/pci/quirks.c
===================================================================
--- linux-macro.orig/drivers/pci/quirks.c
+++ linux-macro/drivers/pci/quirks.c
@@ -5956,3 +5956,121 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_I
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c0, aspm_l1_acceptable_latency);
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c1, aspm_l1_acceptable_latency);
#endif
+
+/*
+ * Retrain the link of a downstream PCIe port by hand if necessary.
+ *
+ * This is needed at least where a downstream port of the ASMedia ASM2824
+ * Gen 3 switch is wired to the upstream port of the Pericom PI7C9X2G304
+ * Gen 2 switch, and observed with the Delock Riser Card PCI Express x1 >
+ * 2 x PCIe x1 device, P/N 41433, plugged into the SiFive HiFive Unmatched
+ * board.
+ *
+ * In such a configuration the switches are supposed to negotiate the link
+ * speed of preferably 5.0GT/s, falling back to 2.5GT/s. However the link
+ * continues switching between the two speeds indefinitely and the data
+ * link layer never reaches the active state, with link training reported
+ * repeatedly active ~84% of the time. Forcing the target link speed to
+ * 2.5GT/s with the upstream ASM2824 device makes the two switches talk to
+ * each other correctly however. And more interestingly retraining with a
+ * higher target link speed afterwards lets the two successfully negotiate
+ * 5.0GT/s.
+ *
+ * With the ASM2824 we can rely on the otherwise optional Data Link Layer
+ * Link Active status bit and in the failed link training scenario it will
+ * be off along with the Link Bandwidth Management Status indicating that
+ * hardware has changed the link speed or width in an attempt to correct
+ * unreliable link operation. For a port that has been left unconnected
+ * both bits will be clear. So use this information to detect the problem
+ * rather than polling the Link Training bit and watching out for flips or
+ * at least the active status.
+ *
+ * Since the exact nature of the problem isn't known and in principle this
+ * could trigger where an ASM2824 device is downstream rather upstream,
+ * apply this erratum workaround to any downstream ports as long as they
+ * support Link Active reporting and have the Link Control 2 register.
+ * Restrict the speed to 2.5GT/s then with the Target Link Speed field,
+ * request a retrain and wait 200ms for the data link to go up.
+ *
+ * If this turns out successful and we know by the Vendor:Device ID it is
+ * safe to do so, then lift the restriction, letting the devices negotiate
+ * a higher speed. Also check for a similar 2.5GT/s speed restriction the
+ * firmware may have already arranged and lift it with ports that already
+ * report their data link being up.
+ */
+static void pcie_downstream_link_retrain(struct pci_dev *dev)
+{
+ static const struct pci_device_id ids[] = {
+ { PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
+ {}
+ };
+ u16 lnksta, lnkctl2;
+ u32 lnkcap;
+
+ if (!pci_is_pcie(dev) || !pcie_downstream_port(dev)
+ || !pcie_cap_has_lnkctl2(dev))
+ return;
+
+ /* It's too early yet to use `dev->link_active_reporting'. */
+ pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
+ if (!(lnkcap & PCI_EXP_LNKCAP_DLLLARC))
+ return;
+
+ pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
+ pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+ if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
+ PCI_EXP_LNKSTA_LBMS) {
+ unsigned long timeout;
+ u16 lnkctl;
+
+ pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
+
+ pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
+ lnkctl |= PCI_EXP_LNKCTL_RL;
+ lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+ lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
+ /*
+ * Due to an erratum in some devices the Retrain Link bit
+ * needs to be cleared again manually to allow the link
+ * training to succeed.
+ */
+ lnkctl &= ~PCI_EXP_LNKCTL_RL;
+ if (dev->clear_retrain_link)
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL,
+ lnkctl);
+
+ timeout = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
+ do {
+ pcie_capability_read_word(dev, PCI_EXP_LNKSTA,
+ &lnksta);
+ if (lnksta & PCI_EXP_LNKSTA_DLLLA)
+ break;
+ usleep_range(10000, 20000);
+ } while (time_before(jiffies, timeout));
+
+ if (!(lnksta & PCI_EXP_LNKSTA_DLLLA)) {
+ pci_info(dev, "retraining failed\n");
+ return;
+ }
+ }
+
+ if ((lnksta & PCI_EXP_LNKSTA_DLLLA) &&
+ (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
+ pci_match_id(ids, dev)) {
+ u16 lnkctl;
+
+ pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
+ pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
+ lnkctl |= PCI_EXP_LNKCTL_RL;
+ lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+ lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
+ }
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_ANY_ID, PCI_ANY_ID,
+ pcie_downstream_link_retrain);
+DECLARE_PCI_FIXUP_RESUME_EARLY(PCI_ANY_ID, PCI_ANY_ID,
+ pcie_downstream_link_retrain);

2022-10-09 14:40:54

by Pali Rohár

[permalink] [raw]
Subject: Re: [PATCH v5 0/5] pci: Work around ASMedia ASM2824 PCIe link training failures

Bjorn, Krzysztof: could you please look at this patch series and say
what do you think about it? It is quite strange issue for which is
defined PCI_ANY_ID quirk... And is needs to be somehow workarounded.

On Saturday 17 September 2022 13:03:05 Maciej W. Rozycki wrote:
> Hi,
>
> This is v5 of the change to work around a PCIe link training phenomenon
> where a pair of devices both capable of operating at a link speed above
> 2.5GT/s seems unable to negotiate the link speed and continues training
> indefinitely with the Link Training bit switching on and off repeatedly
> and the data link layer never reaching the active state.
>
> This was originally observed in a configuration featuring a downstream
> port of the ASMedia ASM2824 Gen 3 switch wired to the upstream port of the
> Pericom PI7C9X2G304 Gen 2 switch. However in the course of review I have
> come to the conclusion that similarly to the earlier similar change to
> U-Boot it is indeed expected to be safe to apply this workaround to any
> downstream port that has failed link negotiation provided that:
>
> 1. the port is capable of reporting the data link layer link active
> status (because unlike U-Boot we cannot busy-loop continuously polling
> the link training bit),
>
> and:
>
> 2. we don't attempt to lift the 2.5GT/s speed restriction, imposed as the
> basis of the workaround, for devices not explicitly known to continue
> working in that case.
>
> It is expected to be safe because the workaround is applied to a failed
> link, that is one that does not (at the time this code is executed) work
> anyway, so trying to bring it up cannot make the situation worse. So this
> version of the workaround is attempted for all PCIe devices discovered,
> and only the lifting of the 2.5GT/s speed restriction is qualified by the
> vendor:device ID, currently one of the ASMedia ASM2824 device only.
>
> Broadening the scope of the quirk has in turn made it necessary to make
> some adjustments to code elsewhere and consequently what was originally a
> single patch has now become a small series instead.
>
> This has been verified with a SiFive HiFive unmatched board, booting with
> or without the workaround activated in U-Boot, which covered both the link
> retraining part of the quirk and the lifting of speed restriction already
> imposed by U-Boot.
>
> Please see individual change descriptions for further details.
>
> Questions or comments? Otherwise please apply.
>
> Maciej

2022-11-02 00:08:02

by Pali Rohár

[permalink] [raw]
Subject: Re: [PATCH v5 0/5] pci: Work around ASMedia ASM2824 PCIe link training failures

Bjorn, Krzysztof: Gentle reminder.

On Sunday 09 October 2022 16:14:34 Pali Rohár wrote:
> Bjorn, Krzysztof: could you please look at this patch series and say
> what do you think about it? It is quite strange issue for which is
> defined PCI_ANY_ID quirk... And is needs to be somehow workarounded.
>
> On Saturday 17 September 2022 13:03:05 Maciej W. Rozycki wrote:
> > Hi,
> >
> > This is v5 of the change to work around a PCIe link training phenomenon
> > where a pair of devices both capable of operating at a link speed above
> > 2.5GT/s seems unable to negotiate the link speed and continues training
> > indefinitely with the Link Training bit switching on and off repeatedly
> > and the data link layer never reaching the active state.
> >
> > This was originally observed in a configuration featuring a downstream
> > port of the ASMedia ASM2824 Gen 3 switch wired to the upstream port of the
> > Pericom PI7C9X2G304 Gen 2 switch. However in the course of review I have
> > come to the conclusion that similarly to the earlier similar change to
> > U-Boot it is indeed expected to be safe to apply this workaround to any
> > downstream port that has failed link negotiation provided that:
> >
> > 1. the port is capable of reporting the data link layer link active
> > status (because unlike U-Boot we cannot busy-loop continuously polling
> > the link training bit),
> >
> > and:
> >
> > 2. we don't attempt to lift the 2.5GT/s speed restriction, imposed as the
> > basis of the workaround, for devices not explicitly known to continue
> > working in that case.
> >
> > It is expected to be safe because the workaround is applied to a failed
> > link, that is one that does not (at the time this code is executed) work
> > anyway, so trying to bring it up cannot make the situation worse. So this
> > version of the workaround is attempted for all PCIe devices discovered,
> > and only the lifting of the 2.5GT/s speed restriction is qualified by the
> > vendor:device ID, currently one of the ASMedia ASM2824 device only.
> >
> > Broadening the scope of the quirk has in turn made it necessary to make
> > some adjustments to code elsewhere and consequently what was originally a
> > single patch has now become a small series instead.
> >
> > This has been verified with a SiFive HiFive unmatched board, booting with
> > or without the workaround activated in U-Boot, which covered both the link
> > retraining part of the quirk and the lifting of speed restriction already
> > imposed by U-Boot.
> >
> > Please see individual change descriptions for further details.
> >
> > Questions or comments? Otherwise please apply.
> >
> > Maciej

2022-11-03 23:48:05

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] PCI: Work around PCIe link training failures

[+cc Pali]

On Sat, Sep 17, 2022 at 01:03:38PM +0100, Maciej W. Rozycki wrote:
> Attempt to handle cases such as with a downstream port of the ASMedia
> ASM2824 PCIe switch where link training never completes and the link
> continues switching between speeds indefinitely with the data link layer
> never reaching the active state.
>
> It has been observed with a downstream port of the ASMedia ASM2824 Gen 3
> switch wired to the upstream port of the Pericom PI7C9X2G304 Gen 2
> switch, using a Delock Riser Card PCI Express x1 > 2 x PCIe x1 device,
> P/N 41433, wired to a SiFive HiFive Unmatched board. In this setup the
> switches are supposed to negotiate the link speed of preferably 5.0GT/s,
> falling back to 2.5GT/s.
>
> Instead the link continues oscillating between the two speeds, at the
> rate of 34-35 times per second, with link training reported repeatedly
> active ~84% of the time. Forcibly limiting the target link speed to
> 2.5GT/s with the upstream ASM2824 device however makes the two switches
> communicate correctly. Removing the speed restriction afterwards makes
> the two devices switch to 5.0GT/s then.
>
> Make use of these observations then and detect the inability to train
> the link, by checking for the Data Link Layer Link Active status bit
> being off while the Link Bandwidth Management Status indicating that
> hardware has changed the link speed or width in an attempt to correct
> unreliable link operation.
>
> Restrict the speed to 2.5GT/s then with the Target Link Speed field,
> request a retrain and wait 200ms for the data link to go up. If this
> turns out successful, then lift the restriction, letting the devices
> negotiate a higher speed.
>
> Also check for a 2.5GT/s speed restriction the firmware may have already
> arranged and lift it too with ports of devices known to continue working
> afterwards, currently the ASM2824 only, that already report their data
> link being up.

This quirk is run at boot-time and resume-time. What happens after a
Secondary Bus Reset, as is done by pci_reset_secondary_bus()?

PCIe r6.0, sec 7.5.1.3.13, says "setting Secondary Bus Reset triggers
a hot reset on the corresponding PCI Express Port". Sec 4.2.7 says
LinkUp is 0 in the LTSSM Hot Reset state, and the Hot Reset state
leads to Detect, so it looks like this reset would cause the link to
go down and come back up.

Can you tell if that's what happens? Does the link negotiation fail
then, too?

If it does fail then, I don't know how hard we need to work to fix it.
Maybe we just accept it? Or maybe we need a "quirk-after-reset" phase
or something?

Bjorn

> Signed-off-by: Maciej W. Rozycki <[email protected]>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://source.denx.de/u-boot/u-boot/-/commit/a398a51ccc68
> ---
> Changes from v4:
>
> - Remove <linux/bug.h> inclusion no longer needed.
>
> - Make the quirk generic based on probing device features rather than
> specific to the ASM2824 part only; take the Retrain Link bit erratum
> into account.
>
> - Still lift the 2.5GT/s speed restriction with the ASM2824 only.
>
> - Increase retrain timeout from 200ms to 1s (PCIE_LINK_RETRAIN_TIMEOUT).
>
> - Remove retrain success notification.
>
> - Use PCIe helpers rather than generic PCI functions throughout.
>
> - Trim down and update the wording of the change description for the
> switch from an ASM2824-specific to a generic fixup.
>
> Changes from v3:
>
> - Remove the <linux/pci_ids.h> entry for the ASM2824.
>
> Changes from v2:
>
> - Regenerate for 5.17-rc2 for a merge conflict.
>
> - Replace BUG_ON for a missing PCI Express capability with WARN_ON and an
> early return.
>
> Changes from v1:
>
> - Regenerate for a merge conflict.
> ---
> drivers/pci/quirks.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 118 insertions(+)
>
> linux-pcie-asm2824-manual-retrain.diff
> Index: linux-macro/drivers/pci/quirks.c
> ===================================================================
> --- linux-macro.orig/drivers/pci/quirks.c
> +++ linux-macro/drivers/pci/quirks.c
> @@ -5956,3 +5956,121 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_I
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c0, aspm_l1_acceptable_latency);
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c1, aspm_l1_acceptable_latency);
> #endif
> +
> +/*
> + * Retrain the link of a downstream PCIe port by hand if necessary.
> + *
> + * This is needed at least where a downstream port of the ASMedia ASM2824
> + * Gen 3 switch is wired to the upstream port of the Pericom PI7C9X2G304
> + * Gen 2 switch, and observed with the Delock Riser Card PCI Express x1 >
> + * 2 x PCIe x1 device, P/N 41433, plugged into the SiFive HiFive Unmatched
> + * board.
> + *
> + * In such a configuration the switches are supposed to negotiate the link
> + * speed of preferably 5.0GT/s, falling back to 2.5GT/s. However the link
> + * continues switching between the two speeds indefinitely and the data
> + * link layer never reaches the active state, with link training reported
> + * repeatedly active ~84% of the time. Forcing the target link speed to
> + * 2.5GT/s with the upstream ASM2824 device makes the two switches talk to
> + * each other correctly however. And more interestingly retraining with a
> + * higher target link speed afterwards lets the two successfully negotiate
> + * 5.0GT/s.
> + *
> + * With the ASM2824 we can rely on the otherwise optional Data Link Layer
> + * Link Active status bit and in the failed link training scenario it will
> + * be off along with the Link Bandwidth Management Status indicating that
> + * hardware has changed the link speed or width in an attempt to correct
> + * unreliable link operation. For a port that has been left unconnected
> + * both bits will be clear. So use this information to detect the problem
> + * rather than polling the Link Training bit and watching out for flips or
> + * at least the active status.
> + *
> + * Since the exact nature of the problem isn't known and in principle this
> + * could trigger where an ASM2824 device is downstream rather upstream,
> + * apply this erratum workaround to any downstream ports as long as they
> + * support Link Active reporting and have the Link Control 2 register.
> + * Restrict the speed to 2.5GT/s then with the Target Link Speed field,
> + * request a retrain and wait 200ms for the data link to go up.
> + *
> + * If this turns out successful and we know by the Vendor:Device ID it is
> + * safe to do so, then lift the restriction, letting the devices negotiate
> + * a higher speed. Also check for a similar 2.5GT/s speed restriction the
> + * firmware may have already arranged and lift it with ports that already
> + * report their data link being up.
> + */
> +static void pcie_downstream_link_retrain(struct pci_dev *dev)
> +{
> + static const struct pci_device_id ids[] = {
> + { PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
> + {}
> + };
> + u16 lnksta, lnkctl2;
> + u32 lnkcap;
> +
> + if (!pci_is_pcie(dev) || !pcie_downstream_port(dev)
> + || !pcie_cap_has_lnkctl2(dev))
> + return;
> +
> + /* It's too early yet to use `dev->link_active_reporting'. */
> + pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> + if (!(lnkcap & PCI_EXP_LNKCAP_DLLLARC))
> + return;
> +
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> + pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> + if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
> + PCI_EXP_LNKSTA_LBMS) {
> + unsigned long timeout;
> + u16 lnkctl;
> +
> + pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
> +
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> + lnkctl |= PCI_EXP_LNKCTL_RL;
> + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> + lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
> + /*
> + * Due to an erratum in some devices the Retrain Link bit
> + * needs to be cleared again manually to allow the link
> + * training to succeed.
> + */
> + lnkctl &= ~PCI_EXP_LNKCTL_RL;
> + if (dev->clear_retrain_link)
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL,
> + lnkctl);
> +
> + timeout = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
> + do {
> + pcie_capability_read_word(dev, PCI_EXP_LNKSTA,
> + &lnksta);
> + if (lnksta & PCI_EXP_LNKSTA_DLLLA)
> + break;
> + usleep_range(10000, 20000);
> + } while (time_before(jiffies, timeout));
> +
> + if (!(lnksta & PCI_EXP_LNKSTA_DLLLA)) {
> + pci_info(dev, "retraining failed\n");
> + return;
> + }
> + }
> +
> + if ((lnksta & PCI_EXP_LNKSTA_DLLLA) &&
> + (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
> + pci_match_id(ids, dev)) {
> + u16 lnkctl;
> +
> + pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> + lnkctl |= PCI_EXP_LNKCTL_RL;
> + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> + lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
> + }
> +}
> +DECLARE_PCI_FIXUP_HEADER(PCI_ANY_ID, PCI_ANY_ID,
> + pcie_downstream_link_retrain);
> +DECLARE_PCI_FIXUP_RESUME_EARLY(PCI_ANY_ID, PCI_ANY_ID,
> + pcie_downstream_link_retrain);

2022-11-04 00:17:23

by Pali Rohár

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] PCI: Work around PCIe link training failures

On Thursday 03 November 2022 18:13:35 Bjorn Helgaas wrote:
> [+cc Pali]
>
> On Sat, Sep 17, 2022 at 01:03:38PM +0100, Maciej W. Rozycki wrote:
> > Attempt to handle cases such as with a downstream port of the ASMedia
> > ASM2824 PCIe switch where link training never completes and the link
> > continues switching between speeds indefinitely with the data link layer
> > never reaching the active state.
> >
> > It has been observed with a downstream port of the ASMedia ASM2824 Gen 3
> > switch wired to the upstream port of the Pericom PI7C9X2G304 Gen 2
> > switch, using a Delock Riser Card PCI Express x1 > 2 x PCIe x1 device,
> > P/N 41433, wired to a SiFive HiFive Unmatched board. In this setup the
> > switches are supposed to negotiate the link speed of preferably 5.0GT/s,
> > falling back to 2.5GT/s.
> >
> > Instead the link continues oscillating between the two speeds, at the
> > rate of 34-35 times per second, with link training reported repeatedly
> > active ~84% of the time. Forcibly limiting the target link speed to
> > 2.5GT/s with the upstream ASM2824 device however makes the two switches
> > communicate correctly. Removing the speed restriction afterwards makes
> > the two devices switch to 5.0GT/s then.
> >
> > Make use of these observations then and detect the inability to train
> > the link, by checking for the Data Link Layer Link Active status bit
> > being off while the Link Bandwidth Management Status indicating that
> > hardware has changed the link speed or width in an attempt to correct
> > unreliable link operation.
> >
> > Restrict the speed to 2.5GT/s then with the Target Link Speed field,
> > request a retrain and wait 200ms for the data link to go up. If this
> > turns out successful, then lift the restriction, letting the devices
> > negotiate a higher speed.
> >
> > Also check for a 2.5GT/s speed restriction the firmware may have already
> > arranged and lift it too with ports of devices known to continue working
> > afterwards, currently the ASM2824 only, that already report their data
> > link being up.
>
> This quirk is run at boot-time and resume-time. What happens after a
> Secondary Bus Reset, as is done by pci_reset_secondary_bus()?

Flipping SBR bit can be done on any PCI-to-PCI bridge device and in this
topology there are following: PCIe Root Port, ASMedia PCIe Switch
Upstream Port, ASMedia PCIe Switch Downstream Port, Pericom PCIe Switch
Upstream Port, Pericom PCIe Switch Downstream Port.
(Maciej, I hope that this is whole topology and there is not some other
device of PCI-to-PCI bridge type in your setup; please correct me)

Bjorn, to make it clear, on which device you mean to issue secondary bus
reset?

Because I would not be surprised if different things happen when issuing
bus reset on different parts of that topology.

> PCIe r6.0, sec 7.5.1.3.13, says "setting Secondary Bus Reset triggers
> a hot reset on the corresponding PCI Express Port". Sec 4.2.7 says
> LinkUp is 0 in the LTSSM Hot Reset state, and the Hot Reset state
> leads to Detect, so it looks like this reset would cause the link to
> go down and come back up.
>
> Can you tell if that's what happens? Does the link negotiation fail
> then, too?
>
> If it does fail then, I don't know how hard we need to work to fix it.
> Maybe we just accept it? Or maybe we need a "quirk-after-reset" phase
> or something?
>
> Bjorn
>
> > Signed-off-by: Maciej W. Rozycki <[email protected]>
> > Link: https://lore.kernel.org/lkml/[email protected]/
> > Link: https://source.denx.de/u-boot/u-boot/-/commit/a398a51ccc68
> > ---
> > Changes from v4:
> >
> > - Remove <linux/bug.h> inclusion no longer needed.
> >
> > - Make the quirk generic based on probing device features rather than
> > specific to the ASM2824 part only; take the Retrain Link bit erratum
> > into account.
> >
> > - Still lift the 2.5GT/s speed restriction with the ASM2824 only.
> >
> > - Increase retrain timeout from 200ms to 1s (PCIE_LINK_RETRAIN_TIMEOUT).
> >
> > - Remove retrain success notification.
> >
> > - Use PCIe helpers rather than generic PCI functions throughout.
> >
> > - Trim down and update the wording of the change description for the
> > switch from an ASM2824-specific to a generic fixup.
> >
> > Changes from v3:
> >
> > - Remove the <linux/pci_ids.h> entry for the ASM2824.
> >
> > Changes from v2:
> >
> > - Regenerate for 5.17-rc2 for a merge conflict.
> >
> > - Replace BUG_ON for a missing PCI Express capability with WARN_ON and an
> > early return.
> >
> > Changes from v1:
> >
> > - Regenerate for a merge conflict.
> > ---
> > drivers/pci/quirks.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 118 insertions(+)
> >
> > linux-pcie-asm2824-manual-retrain.diff
> > Index: linux-macro/drivers/pci/quirks.c
> > ===================================================================
> > --- linux-macro.orig/drivers/pci/quirks.c
> > +++ linux-macro/drivers/pci/quirks.c
> > @@ -5956,3 +5956,121 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_I
> > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c0, aspm_l1_acceptable_latency);
> > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c1, aspm_l1_acceptable_latency);
> > #endif
> > +
> > +/*
> > + * Retrain the link of a downstream PCIe port by hand if necessary.
> > + *
> > + * This is needed at least where a downstream port of the ASMedia ASM2824
> > + * Gen 3 switch is wired to the upstream port of the Pericom PI7C9X2G304
> > + * Gen 2 switch, and observed with the Delock Riser Card PCI Express x1 >
> > + * 2 x PCIe x1 device, P/N 41433, plugged into the SiFive HiFive Unmatched
> > + * board.
> > + *
> > + * In such a configuration the switches are supposed to negotiate the link
> > + * speed of preferably 5.0GT/s, falling back to 2.5GT/s. However the link
> > + * continues switching between the two speeds indefinitely and the data
> > + * link layer never reaches the active state, with link training reported
> > + * repeatedly active ~84% of the time. Forcing the target link speed to
> > + * 2.5GT/s with the upstream ASM2824 device makes the two switches talk to
> > + * each other correctly however. And more interestingly retraining with a
> > + * higher target link speed afterwards lets the two successfully negotiate
> > + * 5.0GT/s.
> > + *
> > + * With the ASM2824 we can rely on the otherwise optional Data Link Layer
> > + * Link Active status bit and in the failed link training scenario it will
> > + * be off along with the Link Bandwidth Management Status indicating that
> > + * hardware has changed the link speed or width in an attempt to correct
> > + * unreliable link operation. For a port that has been left unconnected
> > + * both bits will be clear. So use this information to detect the problem
> > + * rather than polling the Link Training bit and watching out for flips or
> > + * at least the active status.
> > + *
> > + * Since the exact nature of the problem isn't known and in principle this
> > + * could trigger where an ASM2824 device is downstream rather upstream,
> > + * apply this erratum workaround to any downstream ports as long as they
> > + * support Link Active reporting and have the Link Control 2 register.
> > + * Restrict the speed to 2.5GT/s then with the Target Link Speed field,
> > + * request a retrain and wait 200ms for the data link to go up.
> > + *
> > + * If this turns out successful and we know by the Vendor:Device ID it is
> > + * safe to do so, then lift the restriction, letting the devices negotiate
> > + * a higher speed. Also check for a similar 2.5GT/s speed restriction the
> > + * firmware may have already arranged and lift it with ports that already
> > + * report their data link being up.
> > + */
> > +static void pcie_downstream_link_retrain(struct pci_dev *dev)
> > +{
> > + static const struct pci_device_id ids[] = {
> > + { PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
> > + {}
> > + };
> > + u16 lnksta, lnkctl2;
> > + u32 lnkcap;
> > +
> > + if (!pci_is_pcie(dev) || !pcie_downstream_port(dev)
> > + || !pcie_cap_has_lnkctl2(dev))
> > + return;
> > +
> > + /* It's too early yet to use `dev->link_active_reporting'. */
> > + pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> > + if (!(lnkcap & PCI_EXP_LNKCAP_DLLLARC))
> > + return;
> > +
> > + pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> > + pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> > + if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
> > + PCI_EXP_LNKSTA_LBMS) {
> > + unsigned long timeout;
> > + u16 lnkctl;
> > +
> > + pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
> > +
> > + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> > + lnkctl |= PCI_EXP_LNKCTL_RL;
> > + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> > + lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
> > + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> > + pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
> > + /*
> > + * Due to an erratum in some devices the Retrain Link bit
> > + * needs to be cleared again manually to allow the link
> > + * training to succeed.
> > + */
> > + lnkctl &= ~PCI_EXP_LNKCTL_RL;
> > + if (dev->clear_retrain_link)
> > + pcie_capability_write_word(dev, PCI_EXP_LNKCTL,
> > + lnkctl);
> > +
> > + timeout = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
> > + do {
> > + pcie_capability_read_word(dev, PCI_EXP_LNKSTA,
> > + &lnksta);
> > + if (lnksta & PCI_EXP_LNKSTA_DLLLA)
> > + break;
> > + usleep_range(10000, 20000);
> > + } while (time_before(jiffies, timeout));
> > +
> > + if (!(lnksta & PCI_EXP_LNKSTA_DLLLA)) {
> > + pci_info(dev, "retraining failed\n");
> > + return;
> > + }
> > + }
> > +
> > + if ((lnksta & PCI_EXP_LNKSTA_DLLLA) &&
> > + (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
> > + pci_match_id(ids, dev)) {
> > + u16 lnkctl;
> > +
> > + pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
> > + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> > + lnkctl |= PCI_EXP_LNKCTL_RL;
> > + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> > + lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
> > + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> > + pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
> > + }
> > +}
> > +DECLARE_PCI_FIXUP_HEADER(PCI_ANY_ID, PCI_ANY_ID,
> > + pcie_downstream_link_retrain);
> > +DECLARE_PCI_FIXUP_RESUME_EARLY(PCI_ANY_ID, PCI_ANY_ID,
> > + pcie_downstream_link_retrain);

2022-11-04 00:25:35

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] PCI: Work around PCIe link training failures

On Fri, Nov 04, 2022 at 12:41:11AM +0100, Pali Roh?r wrote:
> On Thursday 03 November 2022 18:13:35 Bjorn Helgaas wrote:
> > [+cc Pali]
> >
> > On Sat, Sep 17, 2022 at 01:03:38PM +0100, Maciej W. Rozycki wrote:
> > > Attempt to handle cases such as with a downstream port of the ASMedia
> > > ASM2824 PCIe switch where link training never completes and the link
> > > continues switching between speeds indefinitely with the data link layer
> > > never reaching the active state.
> > >
> > > It has been observed with a downstream port of the ASMedia ASM2824 Gen 3
> > > switch wired to the upstream port of the Pericom PI7C9X2G304 Gen 2
> > > switch, using a Delock Riser Card PCI Express x1 > 2 x PCIe x1 device,
> > > P/N 41433, wired to a SiFive HiFive Unmatched board. In this setup the
> > > switches are supposed to negotiate the link speed of preferably 5.0GT/s,
> > > falling back to 2.5GT/s.
> > >
> > > Instead the link continues oscillating between the two speeds, at the
> > > rate of 34-35 times per second, with link training reported repeatedly
> > > active ~84% of the time. Forcibly limiting the target link speed to
> > > 2.5GT/s with the upstream ASM2824 device however makes the two switches
> > > communicate correctly. Removing the speed restriction afterwards makes
> > > the two devices switch to 5.0GT/s then.
> > >
> > > Make use of these observations then and detect the inability to train
> > > the link, by checking for the Data Link Layer Link Active status bit
> > > being off while the Link Bandwidth Management Status indicating that
> > > hardware has changed the link speed or width in an attempt to correct
> > > unreliable link operation.
> > >
> > > Restrict the speed to 2.5GT/s then with the Target Link Speed field,
> > > request a retrain and wait 200ms for the data link to go up. If this
> > > turns out successful, then lift the restriction, letting the devices
> > > negotiate a higher speed.
> > >
> > > Also check for a 2.5GT/s speed restriction the firmware may have already
> > > arranged and lift it too with ports of devices known to continue working
> > > afterwards, currently the ASM2824 only, that already report their data
> > > link being up.
> >
> > This quirk is run at boot-time and resume-time. What happens after a
> > Secondary Bus Reset, as is done by pci_reset_secondary_bus()?
>
> Flipping SBR bit can be done on any PCI-to-PCI bridge device and in this
> topology there are following: PCIe Root Port, ASMedia PCIe Switch
> Upstream Port, ASMedia PCIe Switch Downstream Port, Pericom PCIe Switch
> Upstream Port, Pericom PCIe Switch Downstream Port.
> (Maciej, I hope that this is whole topology and there is not some other
> device of PCI-to-PCI bridge type in your setup; please correct me)
>
> Bjorn, to make it clear, on which device you mean to issue secondary bus
> reset?

IIUC, the problem is observed on the link between the ASM2824
downstream port and the PI7C9X2G304 upstream port, so my question is
about asserting SBR on the ASM2824 downstream port. I think that
should cause the link between ASM2824 and PI7C9X2G304 to go down and
back up.

Thanks for the question; I didn't notice before that this quirk
applies to *all* devices. I'm a little queasy about trying to fix
problems we have not observed. In this case, I think the hardware is
*supposed* to establish a link at the highest supported speed
automatically.

If we need to work around a hardware bug, that's fine, but I'm not
sure I want to blindly try to help things along.

> Because I would not be surprised if different things happen when issuing
> bus reset on different parts of that topology.
>
> > PCIe r6.0, sec 7.5.1.3.13, says "setting Secondary Bus Reset triggers
> > a hot reset on the corresponding PCI Express Port". Sec 4.2.7 says
> > LinkUp is 0 in the LTSSM Hot Reset state, and the Hot Reset state
> > leads to Detect, so it looks like this reset would cause the link to
> > go down and come back up.
> >
> > Can you tell if that's what happens? Does the link negotiation fail
> > then, too?
> >
> > If it does fail then, I don't know how hard we need to work to fix it.
> > Maybe we just accept it? Or maybe we need a "quirk-after-reset" phase
> > or something?
> >
> > Bjorn

2022-11-09 03:06:31

by Maciej W. Rozycki

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] PCI: Work around PCIe link training failures

Hi Bjorn,

Thank you for coming back to this patch series. I'll try to address your
concerns, but it may take a little. The reason is I'm currently on site
at my lab until the end of the week and barring my day job, etc. I want to
focus on items to do (and I do have a bunch) that require local hardware
access. The issue concerned with this patch series does not, so I'll get
to looking into it in more depth hopefully from next week. For the time
being however please see below.

On Thu, 3 Nov 2022, Bjorn Helgaas wrote:

> > > > Also check for a 2.5GT/s speed restriction the firmware may have already
> > > > arranged and lift it too with ports of devices known to continue working
> > > > afterwards, currently the ASM2824 only, that already report their data
> > > > link being up.
> > >
> > > This quirk is run at boot-time and resume-time. What happens after a
> > > Secondary Bus Reset, as is done by pci_reset_secondary_bus()?
> >
> > Flipping SBR bit can be done on any PCI-to-PCI bridge device and in this
> > topology there are following: PCIe Root Port, ASMedia PCIe Switch
> > Upstream Port, ASMedia PCIe Switch Downstream Port, Pericom PCIe Switch
> > Upstream Port, Pericom PCIe Switch Downstream Port.
> > (Maciej, I hope that this is whole topology and there is not some other
> > device of PCI-to-PCI bridge type in your setup; please correct me)

There is actually a PCIe-to-PCI bridge device further downstream (device
0000:08:00.0 in the listings below; bus 09 is conventional PCI), but it
doesn't matter for the issue concerned; the issue triggers whether the
bridge module has been plugged or not.

> > Bjorn, to make it clear, on which device you mean to issue secondary bus
> > reset?
>
> IIUC, the problem is observed on the link between the ASM2824
> downstream port and the PI7C9X2G304 upstream port, so my question is
> about asserting SBR on the ASM2824 downstream port. I think that
> should cause the link between ASM2824 and PI7C9X2G304 to go down and
> back up.

That would be my expectation as well. Is there a reliable way to request
that however without actually writing a piece of code to do so from inside
the kernel? Sadly our documentation is vague on the matter, in particular
Documentation/ABI/testing/sysfs-bus-pci, but here's what I have obtained:

# lspci -t
-[0000:00]---00.0-[01-0b]----00.0-[02-0b]--+-00.0-[03]--
+-02.0-[04]----00.0
+-03.0-[05-09]----00.0-[06-09]--+-01.0-[07]--+-00.0
| | \-00.3
| \-02.0-[08-09]----00.0-[09]--+-01.0
| \-02.0
+-04.0-[0a]----00.0
\-08.0-[0b]--+-00.0
\-00.1
# for name in /sys/bus/pci/devices/0000\:??\:??.?/reset_method; do echo "$(basename $(dirname $name)): $(cat $name)"; done
0000:01:00.0: pm bus
0000:02:00.0: pm bus
0000:02:02.0: pm
0000:02:03.0: pm
0000:02:04.0: pm
0000:02:08.0: pm
0000:04:00.0: bus
0000:05:00.0: bus
0000:06:01.0: bus
0000:07:00.0: bus
0000:08:00.0: bus
0000:09:01.0: pm bus
0000:0a:00.0: flr bus
0000:0b:00.0: pm bus
0000:0b:00.1: pm
#

(mind that the problematic link is between 0000:02:03.0 and 0000:05:00.0),
and then:

# echo 1 >/sys/bus/pci/devices/0000\:05\:00.0/reset
-sh: echo: write error: Inappropriate ioctl for device
#

(which I gather is supposed to poke at 0000:02:03.0's SBR) so it doesn't
seem to be effective. Needless to say:

# echo 1 >/sys/bus/pci/devices/0000\:02\:03.0/reset
#

while apparently succeeds doesn't do anything spectacular. Any
suggestions?

In any case I would expect resetting a bus tree to cause all the quirks
required to be reapplied, preferably after bus re-enumeration even as the
topology can change afterwards if not for the hardware erratum handled
here, although indeed I would find it reasonable to only perform the
actions on the bus subtree reset and not it's parent. In that case the
quirk would have to be applied by hand somehow.

> Thanks for the question; I didn't notice before that this quirk
> applies to *all* devices. I'm a little queasy about trying to fix
> problems we have not observed. In this case, I think the hardware is
> *supposed* to establish a link at the highest supported speed
> automatically.

I believe hardware is supposed to establish a link at all in the first
place (i.e. initially at 2.5GT/s, and then possibly upgrade afterwards as
capabilities have been exchanged between the endpoints), however as this
case indicates this is not necessarily universally guaranteed.

> If we need to work around a hardware bug, that's fine, but I'm not
> sure I want to blindly try to help things along.

I have chewed it over before choosing this approach and did this for what
I think is a very good reason, first for U-Boot and then, when I figured
out a feasible approach, also for Linux. Bear in mind that we have a pair
of devices that completely fail to negotiate link ever by default and we
do not know whether it's the upstream device (which we can see) or the
downstream device (which we can't, because the link is down) that is the
culprit. With that we have as follows.

1. Both devices in question are PCIe switches, so they can be swapped with
each other. In that case what is now the upstream device will become
the downstream device and vice versa. Now the Pericom/Diodes switch
only has 2.5GT/s downstream ports, so the issue by definition cannot
trigger, but companies/designers tend to reuse hardware IP blocks, and
I am quite sure Pericom/Diodes has switches in its product line that
support rates beyond 2.5GT/s in their downstream ports. So a swapped
scenario is in my opinion quite real. Matching on the vendor:device ID
won't work then as the suspect ASMedia part will not be seen due to a
failed link. If it's the Pericom/Diodes switch that is the culprit,
then it's even worse, as it is the downstream device already.

2. We don't know really what the ultimate cause is and therefore whether
the issue can trigger with other vendor devices (and presumably one of
the two involved here). And the nature of the failure is such, that I
think it is very unlikely a user will reach out to our (Linux kernel)
community for a workaround. It does not look like a software problem
at all and neither like one that can be solved by software.

Therefore they will most likely conclude hardware they have is broken
and throw it out causing frustration even though the hardware might be
(im)perfectly capable of working, just requiring a little poke to get
going.

Just as I almost threw it out even though I dare say I am quite an
experienced engineer, reasonably familiar with odd hardware and the
Linux kernel and ultimately I was able to come up with a solution. It
is only because I noticed by chance that 0000:02:03.0 device's Link
Training bit is flipping, combined with my imagination as to how to
possibly deal with such a situation that I succeeded in bringing the
link up. I genuinely thought all was lost and I was exceedingly happy
when I overcame it as the PI7C9X2G304 switch module was key to my
project with no suitable alternative on the horizon.

3. We apply this quirk to failed links only. As a non-working link is a
fatal condition, things cannot get worse by trying to retrain it by
hand at 2.5GT/s, as otherwise it doesn't work anyway. So it's just a
last ditch attempt to make it run really.

Can you think of a scenario where retraining a non-working link would
cause more damage than leaving it alone in the non-working state?

This quirk only uses generic PCIe features, common to all hardware, and
therefore I think it can be applied in a generic manner. Now there
could be broken hardware that does not correctly handle the generic
features used here, but if it has a failed link too, then it will just
continue not to work. So it won't be a regression.

Now if I made any mistake in getting the conditions wrong that are
supposed to determine that a link is non-working, than I'll be glad to
have the code corrected. Such a case shouldn't determine the overall
validity of the approach I propose here though.

NB I have included code to lift the 2.5GT/s restriction, matched by the
downstream port's vendor:device ID, but now that I've been thinking about
this issue again I suspect it might be too liberal, because we don't
really know if it is safe to do so if the downstream device is not a
PI7C9X2G304. So perhaps we need to match on the vendor:device IDs of both
endpoints of such a broken link. What do you think?

I'll let you chew my reasoning over now too, as I have an issue with a
non-working PCIe parallel port device to sort out (device 0000:07:00.0
above) and that actually requires flipping rocker switches in my lab,
which I cannot do remotely. Thank you for your review.

Maciej

2022-11-09 05:57:12

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] PCI: Work around PCIe link training failures

[+cc Alex, in case he has any reset-related comments]

On Wed, Nov 09, 2022 at 02:57:57AM +0000, Maciej W. Rozycki wrote:
> Hi Bjorn,
>
> Thank you for coming back to this patch series. I'll try to address your
> concerns, but it may take a little. The reason is I'm currently on site
> at my lab until the end of the week and barring my day job, etc. I want to
> focus on items to do (and I do have a bunch) that require local hardware
> access. The issue concerned with this patch series does not, so I'll get
> to looking into it in more depth hopefully from next week. For the time
> being however please see below.
>
> On Thu, 3 Nov 2022, Bjorn Helgaas wrote:
>
> > > > > Also check for a 2.5GT/s speed restriction the firmware may have already
> > > > > arranged and lift it too with ports of devices known to continue working
> > > > > afterwards, currently the ASM2824 only, that already report their data
> > > > > link being up.
> > > >
> > > > This quirk is run at boot-time and resume-time. What happens after a
> > > > Secondary Bus Reset, as is done by pci_reset_secondary_bus()?
> > >
> > > Flipping SBR bit can be done on any PCI-to-PCI bridge device and in this
> > > topology there are following: PCIe Root Port, ASMedia PCIe Switch
> > > Upstream Port, ASMedia PCIe Switch Downstream Port, Pericom PCIe Switch
> > > Upstream Port, Pericom PCIe Switch Downstream Port.
> > > (Maciej, I hope that this is whole topology and there is not some other
> > > device of PCI-to-PCI bridge type in your setup; please correct me)
>
> There is actually a PCIe-to-PCI bridge device further downstream (device
> 0000:08:00.0 in the listings below; bus 09 is conventional PCI), but it
> doesn't matter for the issue concerned; the issue triggers whether the
> bridge module has been plugged or not.
>
> > > Bjorn, to make it clear, on which device you mean to issue secondary bus
> > > reset?
> >
> > IIUC, the problem is observed on the link between the ASM2824
> > downstream port and the PI7C9X2G304 upstream port, so my question is
> > about asserting SBR on the ASM2824 downstream port. I think that
> > should cause the link between ASM2824 and PI7C9X2G304 to go down and
> > back up.
>
> That would be my expectation as well. Is there a reliable way to request
> that however without actually writing a piece of code to do so from inside
> the kernel? Sadly our documentation is vague on the matter, in particular
> Documentation/ABI/testing/sysfs-bus-pci, but here's what I have obtained:
>
> # lspci -t
> -[0000:00]---00.0-[01-0b]----00.0-[02-0b]--+-00.0-[03]--
> +-02.0-[04]----00.0
> +-03.0-[05-09]----00.0-[06-09]--+-01.0-[07]--+-00.0
> | | \-00.3
> | \-02.0-[08-09]----00.0-[09]--+-01.0
> | \-02.0
> +-04.0-[0a]----00.0
> \-08.0-[0b]--+-00.0
> \-00.1
> # for name in /sys/bus/pci/devices/0000\:??\:??.?/reset_method; do echo "$(basename $(dirname $name)): $(cat $name)"; done
> 0000:01:00.0: pm bus
> 0000:02:00.0: pm bus
> 0000:02:02.0: pm
> 0000:02:03.0: pm
> 0000:02:04.0: pm
> 0000:02:08.0: pm
> 0000:04:00.0: bus
> 0000:05:00.0: bus
> 0000:06:01.0: bus
> 0000:07:00.0: bus
> 0000:08:00.0: bus
> 0000:09:01.0: pm bus
> 0000:0a:00.0: flr bus
> 0000:0b:00.0: pm bus
> 0000:0b:00.1: pm
> #
>
> (mind that the problematic link is between 0000:02:03.0 and 0000:05:00.0),
> and then:
>
> # echo 1 >/sys/bus/pci/devices/0000\:05\:00.0/reset
> -sh: echo: write error: Inappropriate ioctl for device
> #
>
> (which I gather is supposed to poke at 0000:02:03.0's SBR) so it doesn't
> seem to be effective.

05:00.0 supports the "bus" method, i.e., pci_reset_bus_function(),
which tries pci_dev_reset_slot_function() followed by
pci_parent_bus_reset(). Both of them return -ENOTTY if the device
(05:00.0) has a secondary bus ("dev->subordinate"), so I think nothing
happens here.

This was prompted by my question about whether the link comes up
correctly after an SBR. It looks like it's not a problem when 02:03.0
and 05:00.0 are both switches because we don't use SBR in that case.

But if 05:00.0 were an endpoint instead of a switch, we *would* use
SBR, and then (I think) the link would go down, and this quirk would
not be applied.

> Needless to say:
>
> # echo 1 >/sys/bus/pci/devices/0000\:02\:03.0/reset
> #
>
> while apparently succeeds doesn't do anything spectacular. Any
> suggestions?
>
> In any case I would expect resetting a bus tree to cause all the quirks
> required to be reapplied, preferably after bus re-enumeration even as the
> topology can change afterwards if not for the hardware erratum handled
> here, although indeed I would find it reasonable to only perform the
> actions on the bus subtree reset and not it's parent. In that case the
> quirk would have to be applied by hand somehow.

A lot of things can change after a reset, especially if it makes a
device reload updated firmware, but we don't actually re-enumerate or
reapply quirks. I'm looking at pci_reset_function(), where we
basically save the parts of config space we know about, do the reset,
and restore config space.

> > Thanks for the question; I didn't notice before that this quirk
> > applies to *all* devices. I'm a little queasy about trying to fix
> > problems we have not observed. In this case, I think the hardware is
> > *supposed* to establish a link at the highest supported speed
> > automatically.
>
> I believe hardware is supposed to establish a link at all in the first
> place (i.e. initially at 2.5GT/s, and then possibly upgrade afterwards as
> capabilities have been exchanged between the endpoints), however as this
> case indicates this is not necessarily universally guaranteed.
>
> > If we need to work around a hardware bug, that's fine, but I'm not
> > sure I want to blindly try to help things along.
>
> I have chewed it over before choosing this approach and did this for what
> I think is a very good reason, first for U-Boot and then, when I figured
> out a feasible approach, also for Linux. Bear in mind that we have a pair
> of devices that completely fail to negotiate link ever by default and we
> do not know whether it's the upstream device (which we can see) or the
> downstream device (which we can't, because the link is down) that is the
> culprit. With that we have as follows.
>
> 1. Both devices in question are PCIe switches, so they can be swapped with
> each other. In that case what is now the upstream device will become
> the downstream device and vice versa. Now the Pericom/Diodes switch
> only has 2.5GT/s downstream ports, so the issue by definition cannot
> trigger, but companies/designers tend to reuse hardware IP blocks, and
> I am quite sure Pericom/Diodes has switches in its product line that
> support rates beyond 2.5GT/s in their downstream ports. So a swapped
> scenario is in my opinion quite real. Matching on the vendor:device ID
> won't work then as the suspect ASMedia part will not be seen due to a
> failed link. If it's the Pericom/Diodes switch that is the culprit,
> then it's even worse, as it is the downstream device already.
>
> 2. We don't know really what the ultimate cause is and therefore whether
> the issue can trigger with other vendor devices (and presumably one of
> the two involved here). And the nature of the failure is such, that I
> think it is very unlikely a user will reach out to our (Linux kernel)
> community for a workaround. It does not look like a software problem
> at all and neither like one that can be solved by software.
>
> Therefore they will most likely conclude hardware they have is broken
> and throw it out causing frustration even though the hardware might be
> (im)perfectly capable of working, just requiring a little poke to get
> going.
>
> Just as I almost threw it out even though I dare say I am quite an
> experienced engineer, reasonably familiar with odd hardware and the
> Linux kernel and ultimately I was able to come up with a solution. It
> is only because I noticed by chance that 0000:02:03.0 device's Link
> Training bit is flipping, combined with my imagination as to how to
> possibly deal with such a situation that I succeeded in bringing the
> link up. I genuinely thought all was lost and I was exceedingly happy
> when I overcame it as the PI7C9X2G304 switch module was key to my
> project with no suitable alternative on the horizon.
>
> 3. We apply this quirk to failed links only. As a non-working link is a
> fatal condition, things cannot get worse by trying to retrain it by
> hand at 2.5GT/s, as otherwise it doesn't work anyway. So it's just a
> last ditch attempt to make it run really.
>
> Can you think of a scenario where retraining a non-working link would
> cause more damage than leaving it alone in the non-working state?
>
> This quirk only uses generic PCIe features, common to all hardware, and
> therefore I think it can be applied in a generic manner. Now there
> could be broken hardware that does not correctly handle the generic
> features used here, but if it has a failed link too, then it will just
> continue not to work. So it won't be a regression.

The first big chunk of the quirk (that retrains the link) only runs on
Downstream Ports, and only when PCI_EXP_LNKSTA_DLLLA is clear, i.e.,
when the link is down. What happens if this Downstream Port doesn't
have anything below it, e.g., it leads to an empty slot? Is there
something that prevents us from attempting a retrain and waiting for
the timeout?

I was about to say the second chunk is always run even when the link
was up, but now I see that it's only for the ASMedia ASM2824. So this
part is more like a traditional quirk that could specify that device.

If we think the first part (attempting the retrain) is safe to do
whenever the link is down, maybe we should do that more directly as
part of the PCI core instead of as a quirk?

For anybody joining late, the patch is here:
https://lore.kernel.org/r/[email protected]

> Now if I made any mistake in getting the conditions wrong that are
> supposed to determine that a link is non-working, than I'll be glad to
> have the code corrected. Such a case shouldn't determine the overall
> validity of the approach I propose here though.
>
> NB I have included code to lift the 2.5GT/s restriction, matched by the
> downstream port's vendor:device ID, but now that I've been thinking about
> this issue again I suspect it might be too liberal, because we don't
> really know if it is safe to do so if the downstream device is not a
> PI7C9X2G304. So perhaps we need to match on the vendor:device IDs of both
> endpoints of such a broken link. What do you think?
>
> I'll let you chew my reasoning over now too, as I have an issue with a
> non-working PCIe parallel port device to sort out (device 0000:07:00.0
> above) and that actually requires flipping rocker switches in my lab,
> which I cannot do remotely. Thank you for your review.
>
> Maciej

2022-11-09 20:46:24

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] PCI: Work around PCIe link training failures

On Tue, 8 Nov 2022 23:04:18 -0600
Bjorn Helgaas <[email protected]> wrote:

> [+cc Alex, in case he has any reset-related comments]
>
> On Wed, Nov 09, 2022 at 02:57:57AM +0000, Maciej W. Rozycki wrote:
> > Hi Bjorn,
> >
> > Thank you for coming back to this patch series. I'll try to address your
> > concerns, but it may take a little. The reason is I'm currently on site
> > at my lab until the end of the week and barring my day job, etc. I want to
> > focus on items to do (and I do have a bunch) that require local hardware
> > access. The issue concerned with this patch series does not, so I'll get
> > to looking into it in more depth hopefully from next week. For the time
> > being however please see below.
> >
> > On Thu, 3 Nov 2022, Bjorn Helgaas wrote:
> >
> > > > > > Also check for a 2.5GT/s speed restriction the firmware may have already
> > > > > > arranged and lift it too with ports of devices known to continue working
> > > > > > afterwards, currently the ASM2824 only, that already report their data
> > > > > > link being up.
> > > > >
> > > > > This quirk is run at boot-time and resume-time. What happens after a
> > > > > Secondary Bus Reset, as is done by pci_reset_secondary_bus()?
> > > >
> > > > Flipping SBR bit can be done on any PCI-to-PCI bridge device and in this
> > > > topology there are following: PCIe Root Port, ASMedia PCIe Switch
> > > > Upstream Port, ASMedia PCIe Switch Downstream Port, Pericom PCIe Switch
> > > > Upstream Port, Pericom PCIe Switch Downstream Port.
> > > > (Maciej, I hope that this is whole topology and there is not some other
> > > > device of PCI-to-PCI bridge type in your setup; please correct me)
> >
> > There is actually a PCIe-to-PCI bridge device further downstream (device
> > 0000:08:00.0 in the listings below; bus 09 is conventional PCI), but it
> > doesn't matter for the issue concerned; the issue triggers whether the
> > bridge module has been plugged or not.
> >
> > > > Bjorn, to make it clear, on which device you mean to issue secondary bus
> > > > reset?
> > >
> > > IIUC, the problem is observed on the link between the ASM2824
> > > downstream port and the PI7C9X2G304 upstream port, so my question is
> > > about asserting SBR on the ASM2824 downstream port. I think that
> > > should cause the link between ASM2824 and PI7C9X2G304 to go down and
> > > back up.
> >
> > That would be my expectation as well. Is there a reliable way to request
> > that however without actually writing a piece of code to do so from inside
> > the kernel? Sadly our documentation is vague on the matter, in particular
> > Documentation/ABI/testing/sysfs-bus-pci, but here's what I have obtained:
> >
> > # lspci -t
> > -[0000:00]---00.0-[01-0b]----00.0-[02-0b]--+-00.0-[03]--
> > +-02.0-[04]----00.0
> > +-03.0-[05-09]----00.0-[06-09]--+-01.0-[07]--+-00.0
> > | | \-00.3
> > | \-02.0-[08-09]----00.0-[09]--+-01.0
> > | \-02.0
> > +-04.0-[0a]----00.0
> > \-08.0-[0b]--+-00.0
> > \-00.1
> > # for name in /sys/bus/pci/devices/0000\:??\:??.?/reset_method; do echo "$(basename $(dirname $name)): $(cat $name)"; done
> > 0000:01:00.0: pm bus
> > 0000:02:00.0: pm bus
> > 0000:02:02.0: pm
> > 0000:02:03.0: pm
> > 0000:02:04.0: pm
> > 0000:02:08.0: pm
> > 0000:04:00.0: bus
> > 0000:05:00.0: bus
> > 0000:06:01.0: bus
> > 0000:07:00.0: bus
> > 0000:08:00.0: bus
> > 0000:09:01.0: pm bus
> > 0000:0a:00.0: flr bus
> > 0000:0b:00.0: pm bus
> > 0000:0b:00.1: pm
> > #
> >
> > (mind that the problematic link is between 0000:02:03.0 and 0000:05:00.0),
> > and then:
> >
> > # echo 1 >/sys/bus/pci/devices/0000\:05\:00.0/reset
> > -sh: echo: write error: Inappropriate ioctl for device
> > #
> >
> > (which I gather is supposed to poke at 0000:02:03.0's SBR) so it doesn't
> > seem to be effective.
>
> 05:00.0 supports the "bus" method, i.e., pci_reset_bus_function(),
> which tries pci_dev_reset_slot_function() followed by
> pci_parent_bus_reset(). Both of them return -ENOTTY if the device
> (05:00.0) has a secondary bus ("dev->subordinate"), so I think nothing
> happens here.

Right, the pci-sysfs reset attribute is only meant for a reset scope
limited to the device, we'd need something to call pci_reset_bus() to
commit to the whole hierarchy, which is not something we typically do.
vfio-pci will only bind to endpoint devices, so it shouldn't provide an
interface to inject a bus reset here either.

Based on the fact that there's a pericom switch in play here, I'll just
note that I think this is the same device with other link speed issues
as well:

https://lore.kernel.org/all/[email protected]/

This fell off my plate some time ago, but as noted there, enabling ACS
when the upstream and downstream ports run at different link rates
exposes errata where packets are queued and not delivered within the
switch.

Could enabling ACS on this device be contributing to the issue here,
for example triggering the Asmedia downstream port to get into this
link reseting issue? A test with
pci=disable_acs_redir=0000:06:01.0;0000:06:02.0 could be interesting
assuming this occurs on an platform that has an IOMMU, ie. calls
pci_request_acs(). Thanks,

Alex


2022-11-29 10:10:01

by Maciej W. Rozycki

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] PCI: Work around PCIe link training failures

On Wed, 9 Nov 2022, Alex Williamson wrote:

> > 05:00.0 supports the "bus" method, i.e., pci_reset_bus_function(),
> > which tries pci_dev_reset_slot_function() followed by
> > pci_parent_bus_reset(). Both of them return -ENOTTY if the device
> > (05:00.0) has a secondary bus ("dev->subordinate"), so I think nothing
> > happens here.
>
> Right, the pci-sysfs reset attribute is only meant for a reset scope
> limited to the device, we'd need something to call pci_reset_bus() to
> commit to the whole hierarchy, which is not something we typically do.
> vfio-pci will only bind to endpoint devices, so it shouldn't provide an
> interface to inject a bus reset here either.
>
> Based on the fact that there's a pericom switch in play here, I'll just
> note that I think this is the same device with other link speed issues
> as well:
>
> https://lore.kernel.org/all/[email protected]/

Thanks for the pointer. This has been superseded by commit acd61ffb2f16
("PCI: Add ACS quirk for Pericom PI7C9X2G switches"), right? In which
case it is a match ([12d8:2304]), though the quirk does not trigger here,
i.e. no message is printed about store-forward mode activation:

pcieport 0000:05:00.0: calling pci_fixup_pericom_acs_store_forward+0x0/0xba @ 1
pcieport 0000:05:00.0: pci_fixup_pericom_acs_store_forward+0x0/0xba took 0 usecs
[...]
pci 0000:05:00.0: calling pci_fixup_pericom_acs_store_forward+0x0/0xba @ 1
pci 0000:05:00.0: pci_fixup_pericom_acs_store_forward+0x0/0xba took 0 usecs
[...]
pcieport 0000:06:01.0: calling pci_fixup_pericom_acs_store_forward+0x0/0xba @ 1
pcieport 0000:06:01.0: pci_fixup_pericom_acs_store_forward+0x0/0xba took 3 usecs
[...]
pcieport 0000:06:02.0: calling pci_fixup_pericom_acs_store_forward+0x0/0xba @ 1
pcieport 0000:06:02.0: pci_fixup_pericom_acs_store_forward+0x0/0xba took 2 usecs

NB I don't know why the quirk for the upstream port (05:00.0) is called
twice, both via pcieport and via pci.

> This fell off my plate some time ago, but as noted there, enabling ACS
> when the upstream and downstream ports run at different link rates
> exposes errata where packets are queued and not delivered within the
> switch.
>
> Could enabling ACS on this device be contributing to the issue here,
> for example triggering the Asmedia downstream port to get into this
> link reseting issue? A test with
> pci=disable_acs_redir=0000:06:01.0;0000:06:02.0 could be interesting
> assuming this occurs on an platform that has an IOMMU, ie. calls
> pci_request_acs(). Thanks,

We have no IOMMU support for any RISC-V machine at the moment:

config ARCH_RV64I
[...]
select SWIOTLB if MMU

and:

software IO TLB: area num 4.
software IO TLB: mapped [mem 0x00000000fb732000-0x00000000ff732000] (64MB)

so IIUC this issue does not apply. Thank you for your input.

Maciej

2022-11-29 10:23:40

by Maciej W. Rozycki

[permalink] [raw]
Subject: Re: [PATCH v5 5/5] PCI: Work around PCIe link training failures

On Tue, 8 Nov 2022, Bjorn Helgaas wrote:

> > That would be my expectation as well. Is there a reliable way to request
> > that however without actually writing a piece of code to do so from inside
> > the kernel? Sadly our documentation is vague on the matter, in particular
> > Documentation/ABI/testing/sysfs-bus-pci, but here's what I have obtained:
> >
> > # lspci -t
> > -[0000:00]---00.0-[01-0b]----00.0-[02-0b]--+-00.0-[03]--
> > +-02.0-[04]----00.0
> > +-03.0-[05-09]----00.0-[06-09]--+-01.0-[07]--+-00.0
> > | | \-00.3
> > | \-02.0-[08-09]----00.0-[09]--+-01.0
> > | \-02.0
> > +-04.0-[0a]----00.0
> > \-08.0-[0b]--+-00.0
> > \-00.1
> > # for name in /sys/bus/pci/devices/0000\:??\:??.?/reset_method; do echo "$(basename $(dirname $name)): $(cat $name)"; done
> > 0000:01:00.0: pm bus
> > 0000:02:00.0: pm bus
> > 0000:02:02.0: pm
> > 0000:02:03.0: pm
> > 0000:02:04.0: pm
> > 0000:02:08.0: pm
> > 0000:04:00.0: bus
> > 0000:05:00.0: bus
> > 0000:06:01.0: bus
> > 0000:07:00.0: bus
> > 0000:08:00.0: bus
> > 0000:09:01.0: pm bus
> > 0000:0a:00.0: flr bus
> > 0000:0b:00.0: pm bus
> > 0000:0b:00.1: pm
> > #
> >
> > (mind that the problematic link is between 0000:02:03.0 and 0000:05:00.0),
> > and then:
> >
> > # echo 1 >/sys/bus/pci/devices/0000\:05\:00.0/reset
> > -sh: echo: write error: Inappropriate ioctl for device
> > #
> >
> > (which I gather is supposed to poke at 0000:02:03.0's SBR) so it doesn't
> > seem to be effective.
>
> 05:00.0 supports the "bus" method, i.e., pci_reset_bus_function(),
> which tries pci_dev_reset_slot_function() followed by
> pci_parent_bus_reset(). Both of them return -ENOTTY if the device
> (05:00.0) has a secondary bus ("dev->subordinate"), so I think nothing
> happens here.
>
> This was prompted by my question about whether the link comes up
> correctly after an SBR. It looks like it's not a problem when 02:03.0
> and 05:00.0 are both switches because we don't use SBR in that case.
>
> But if 05:00.0 were an endpoint instead of a switch, we *would* use
> SBR, and then (I think) the link would go down, and this quirk would
> not be applied.

After an SBR the link goes down, but the situation is actually somewhat
more complex than you describe.

Mind that the Target Link Speed field in the Link Control 2 register is
sticky. Consequently if the link goes down with TLS set to 2.5GT/s then
it comes back up just fine, because it's still clamped at 2.5GT/s after
reset. The U-Boot workaround takes advantage of this observation and
leaves TLS at 2.5GT/s. This way if an OS is booted that is unaware of
this erratum and resets the PCI/e hierarchy then U-Boot's workaround
remains in action and the link resumes operation.

Now when TLS is reset back to 5GT/s or higher, then the link goes down
and starts its infinite training dance again. This happens when TLS has
been reset by hand in U-boot and Linux is booted and resets the PCI/e
hierarchy in the course (but then our quirk chimes in), or under running
Linux with this change applied when an SBR is issued via `setpci' or by a
modified kernel that has the `dev->subordinate' condition removed.

> > Needless to say:
> >
> > # echo 1 >/sys/bus/pci/devices/0000\:02\:03.0/reset
> > #
> >
> > while apparently succeeds doesn't do anything spectacular. Any
> > suggestions?
> >
> > In any case I would expect resetting a bus tree to cause all the quirks
> > required to be reapplied, preferably after bus re-enumeration even as the
> > topology can change afterwards if not for the hardware erratum handled
> > here, although indeed I would find it reasonable to only perform the
> > actions on the bus subtree reset and not it's parent. In that case the
> > quirk would have to be applied by hand somehow.
>
> A lot of things can change after a reset, especially if it makes a
> device reload updated firmware, but we don't actually re-enumerate or
> reapply quirks. I'm looking at pci_reset_function(), where we
> basically save the parts of config space we know about, do the reset,
> and restore config space.

Right. That's what the debug messages I have observed indicated as well.

For the record, I made an experiment and with the `dev->subordinate'
condition removed I reset device 05:00.0 in my system via:

# echo 1 >/sys/bus/pci/devices/0000\:05\:00.0/reset

With the TLS reset back to its 8GT/s default here the link did not come
back up as expected and consequently no device from this bus subhierarchy
was accessible anymore:

# lspci -s 05: -vnn
05:00.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G304 EL/SL PCIe2 3-Port/4-Lane Packet Switch [12d8:2304] (rev ff) (prog-if ff)
!!! Unknown header type 7f

Retraining by hand and then rescanning brought the subhierarchy back:

# setpci -s 02\:03.0 CAP_EXP+0x30.W=0061; setpci -s 02\:03.0 CAP_EXP+0x10.W=0020; sleep 1; setpci -s 02\:03.0 CAP_EXP+0x30.W=0063; setpci -s 02\:03.0 CAP_EXP+0x10.W=0020
# echo 1 >/sys/bus/pci/devices/0000\:02\:03.0/rescan
pci_bus 0000:03: busn_res: [bus 03] end is updated to 03
pci_bus 0000:04: busn_res: [bus 04] end is updated to 04
pci 0000:05:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
pcieport 0000:06:01.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
pcieport 0000:06:02.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
pci_bus 0000:07: busn_res: [bus 07] end is updated to 07
pci 0000:08:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
pci_bus 0000:09: busn_res: [bus 09] end is updated to 09
pci_bus 0000:08: busn_res: [bus 08-09] end is updated to 09
pci_bus 0000:06: busn_res: [bus 06-09] end is updated to 09
pci_bus 0000:05: busn_res: [bus 05-09] end is updated to 09
pci_bus 0000:0a: busn_res: [bus 0a] end is updated to 0a
pci_bus 0000:0b: busn_res: [bus 0b] end is updated to 0b
# lspci -s 05: -vnn
05:00.0 PCI bridge [0604]: Pericom Semiconductor PI7C9X2G304 EL/SL PCIe2 3-Port/4-Lane Packet Switch [12d8:2304] (rev 05) (prog-if 00 [Normal decode])
Flags: fast devsel
Bus: primary=00, secondary=00, subordinate=00, sec-latency=0
I/O behind bridge: 00000000-00000fff [size=4K]
Memory behind bridge: 00000000-000fffff [size=1M]
Prefetchable memory behind bridge: 0000000000000000-00000000000fffff [size=1M]
Capabilities: [40] Power Management version 3
Capabilities: [5c] Vital Product Data
Capabilities: [64] Vendor Specific Information: Len=34 <?>
Capabilities: [b0] Subsystem: Device [0000:0000]
Capabilities: [c0] Express Upstream Port, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [20c] Power Budgeting <?>
Capabilities: [230] Latency Tolerance Reporting

(though bridge BARs aren't exactly correct anymore).

With the TLS clamped at 2.5GT/s the link did come back up and the bus
subhierarchy remained accessible except for a downstream conventional PCI
bus:

# lspci -s 09: -vnn
09:01.0 FDDI network controller [0202]: Digital Equipment Corporation PCI-to-PDQ Interface Chip [PFI] FDDI (DEFPA) [1011:000f] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: defxx

09:02.0 ATM network controller [0203]: Microsemi / PMC / IDT IDT77201/77211 155Mbps ATM SAR Controller [NICStAR] [111d:0001] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: nicstar

which had to be restored by hand:

# echo 1 >/sys/bus/pci/devices/0000\:08\:00.0/rescan
pci_bus 0000:08: scanning bus
pci 0000:08:00.0: scanning [bus 00-00] behind bridge, pass 0
pci 0000:08:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:08:00.0: scanning [bus 00-00] behind bridge, pass 1
pci_bus 0000:09: scanning bus
pci_bus 0000:09: bus scan returning with max=09
pci_bus 0000:09: busn_res: [bus 09] end is updated to 09
pci_bus 0000:08: bus scan returning with max=09
# lspci -s 09: -vnn
09:01.0 FDDI network controller [0202]: Digital Equipment Corporation PCI-to-PDQ Interface Chip [PFI] FDDI (DEFPA) [1011:000f] (rev 02)
Subsystem: Digital Equipment Corporation FDDIcontroller/PCI (DEFPA) [1011:def1]
Flags: bus master, medium devsel, latency 64, IRQ 24
Memory at 60831000 (32-bit, non-prefetchable) [size=128]
I/O ports at 2100 [size=128]
Memory at 60820000 (32-bit, non-prefetchable) [size=64K]
Capabilities: [50] Power Management version 2
Kernel driver in use: defxx

09:02.0 ATM network controller [0203]: Microsemi / PMC / IDT IDT77201/77211 155Mbps ATM SAR Controller [NICStAR] [111d:0001] (rev 03)
Flags: bus master, medium devsel, latency 64, IRQ 26
I/O ports at 2000 [size=256]
Memory at 60830000 (32-bit, non-prefetchable) [size=4K]
Expansion ROM at 60800000 [virtual] [disabled] [size=128K]
Kernel driver in use: nicstar

> > I have chewed it over before choosing this approach and did this for what
> > I think is a very good reason, first for U-Boot and then, when I figured
> > out a feasible approach, also for Linux. Bear in mind that we have a pair
> > of devices that completely fail to negotiate link ever by default and we
> > do not know whether it's the upstream device (which we can see) or the
> > downstream device (which we can't, because the link is down) that is the
> > culprit. With that we have as follows.
> >
> > 1. Both devices in question are PCIe switches, so they can be swapped with
> > each other. In that case what is now the upstream device will become
> > the downstream device and vice versa. Now the Pericom/Diodes switch
> > only has 2.5GT/s downstream ports, so the issue by definition cannot
> > trigger, but companies/designers tend to reuse hardware IP blocks, and
> > I am quite sure Pericom/Diodes has switches in its product line that
> > support rates beyond 2.5GT/s in their downstream ports. So a swapped
> > scenario is in my opinion quite real. Matching on the vendor:device ID
> > won't work then as the suspect ASMedia part will not be seen due to a
> > failed link. If it's the Pericom/Diodes switch that is the culprit,
> > then it's even worse, as it is the downstream device already.
> >
> > 2. We don't know really what the ultimate cause is and therefore whether
> > the issue can trigger with other vendor devices (and presumably one of
> > the two involved here). And the nature of the failure is such, that I
> > think it is very unlikely a user will reach out to our (Linux kernel)
> > community for a workaround. It does not look like a software problem
> > at all and neither like one that can be solved by software.
> >
> > Therefore they will most likely conclude hardware they have is broken
> > and throw it out causing frustration even though the hardware might be
> > (im)perfectly capable of working, just requiring a little poke to get
> > going.
> >
> > Just as I almost threw it out even though I dare say I am quite an
> > experienced engineer, reasonably familiar with odd hardware and the
> > Linux kernel and ultimately I was able to come up with a solution. It
> > is only because I noticed by chance that 0000:02:03.0 device's Link
> > Training bit is flipping, combined with my imagination as to how to
> > possibly deal with such a situation that I succeeded in bringing the
> > link up. I genuinely thought all was lost and I was exceedingly happy
> > when I overcame it as the PI7C9X2G304 switch module was key to my
> > project with no suitable alternative on the horizon.
> >
> > 3. We apply this quirk to failed links only. As a non-working link is a
> > fatal condition, things cannot get worse by trying to retrain it by
> > hand at 2.5GT/s, as otherwise it doesn't work anyway. So it's just a
> > last ditch attempt to make it run really.
> >
> > Can you think of a scenario where retraining a non-working link would
> > cause more damage than leaving it alone in the non-working state?
> >
> > This quirk only uses generic PCIe features, common to all hardware, and
> > therefore I think it can be applied in a generic manner. Now there
> > could be broken hardware that does not correctly handle the generic
> > features used here, but if it has a failed link too, then it will just
> > continue not to work. So it won't be a regression.
>
> The first big chunk of the quirk (that retrains the link) only runs on
> Downstream Ports, and only when PCI_EXP_LNKSTA_DLLLA is clear, i.e.,
> when the link is down. What happens if this Downstream Port doesn't
> have anything below it, e.g., it leads to an empty slot? Is there
> something that prevents us from attempting a retrain and waiting for
> the timeout?

The conditions I came up with for the detection of a failed link
negotiation are supposed to tell this situation apart from an absent
endpoint on the other end of a link.

Indeed in the system affected there is the 02:00.0 downstream port device
within the same ASM2824 switch that is not connected to anything (no slot
there or an onboard device; essentially just a wasted x2 downstream port
the board designers could not come up with a use for) and the quirk just
does not trigger for it, because the conditions it checks are not met for
an unconnected link. Specifically the Link Bandwidth Management Status
bit in the Link Status register cannot be set to 1 for an unconnected
link, because setting the bit requires a link to hold the DL_Up status and
that does not happen when there is no component on the other side of the
link. That's at least my understanding.

I have no immediate way to verify other ports, because I am away from my
lab now and the system has its PCI/e hierarchy otherwise fully occupied,
but the behaviour was consistent as I made extensive verification earlier
on.

The same condition is used by the U-Boot workaround BTW.

> I was about to say the second chunk is always run even when the link
> was up, but now I see that it's only for the ASMedia ASM2824. So this
> part is more like a traditional quirk that could specify that device.
>
> If we think the first part (attempting the retrain) is safe to do
> whenever the link is down, maybe we should do that more directly as
> part of the PCI core instead of as a quirk?

At this point in my understanding of the situation I tend to agree.
This will guarantee that device resets issued under our control will do
the right thing (we can't do anything I suppose about requests issued by
`setpci' behind our back) and is also what the U-Boot workaround I have
come up with does. I'll make v6 along these lines.

Thank you for your inquisitiveness and the review.

Maciej