LinuxLists.cc - [PATCH v8 0/7] pci: Work around ASMedia ASM2824 PCIe link training failures

2023-04-06 00:22:10

Subject: [PATCH v8 0/7] pci: Work around ASMedia ASM2824 PCIe link training failures

Hi,

This is v8 of the change to work around a PCIe link training phenomenon
where a pair of devices both capable of operating at a link speed above
2.5GT/s seems unable to negotiate the link speed and continues training
indefinitely with the Link Training bit switching on and off repeatedly
and the data link layer never reaching the active state.

This version adds a Reviewed-by: tag by Lukas Wunner accidentally missed
from 6/7 in v7 and reorders said change to the front of the series.

Last two iterations:
<https://lore.kernel.org/r/[email protected]/>
<https://lore.kernel.org/r/[email protected]/>.

Maciej

2023-04-06 00:22:32

by Maciej W. Rozycki

[permalink] [raw]

Subject: [PATCH v8 3/7] PCI: Execute `quirk_enable_clear_retrain_link' earlier

Make `quirk_enable_clear_retrain_link' `pci_fixup_early' so that any later
fixups can rely on `clear_retrain_link' to have been already initialised.

Signed-off-by: Maciej W. Rozycki <[email protected]>
---
Changes from v7:

- Reorder from 2/7.

No change from v6.

No change from v5.

New change in v5.
---
drivers/pci/quirks.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

linux-pcie-clear-retrain-link-early.diff
Index: linux-macro/drivers/pci/quirks.c
===================================================================
--- linux-macro.orig/drivers/pci/quirks.c
+++ linux-macro/drivers/pci/quirks.c
@@ -2407,9 +2407,9 @@ static void quirk_enable_clear_retrain_l
dev->clear_retrain_link = 1;
pci_info(dev, "Enable PCIe Retrain Link quirk\n");
}
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PERICOM, 0xe110, quirk_enable_clear_retrain_link);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PERICOM, 0xe111, quirk_enable_clear_retrain_link);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_PERICOM, 0xe130, quirk_enable_clear_retrain_link);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_PERICOM, 0xe110, quirk_enable_clear_retrain_link);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_PERICOM, 0xe111, quirk_enable_clear_retrain_link);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_PERICOM, 0xe130, quirk_enable_clear_retrain_link);

static void fixup_rev1_53c810(struct pci_dev *dev)
{

2023-04-06 00:22:45

by Maciej W. Rozycki

[permalink] [raw]

Subject: [PATCH v8 4/7] PCI: Initialize `link_active_reporting' earlier

Determine whether Data Link Layer Link Active Reporting is available
ahead of calling any fixups so that the cached value can be used there
and later on.

Signed-off-by: Maciej W. Rozycki <[email protected]>
---
Changes from v7:

- Reorder from 3/7.

Changes from v6:

- Regenerate against 6.3-rc5.

New change in v6.
---
drivers/pci/probe.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

linux-pcie-link-active-reporting-early.diff
Index: linux-macro/drivers/pci/probe.c
===================================================================
--- linux-macro.orig/drivers/pci/probe.c
+++ linux-macro/drivers/pci/probe.c
@@ -820,7 +820,6 @@ static void pci_set_bus_speed(struct pci

pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap);
bus->max_bus_speed = pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS];
- bridge->link_active_reporting = !!(linkcap & PCI_EXP_LNKCAP_DLLLARC);

pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &linksta);
pcie_update_link_speed(bus, linksta);
@@ -1829,6 +1828,7 @@ int pci_setup_device(struct pci_dev *dev
int pos = 0;
struct pci_bus_region region;
struct resource *res;
+ u32 linkcap;

hdr_type = pci_hdr_type(dev);

@@ -1876,6 +1876,10 @@ int pci_setup_device(struct pci_dev *dev
/* "Unknown power state" */
dev->current_state = PCI_UNKNOWN;

+ /* Set it early to make it available to fixups, etc. */
+ pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &linkcap);
+ dev->link_active_reporting = !!(linkcap & PCI_EXP_LNKCAP_DLLLARC);
+
/* Early fixups, before probing the BARs */
pci_fixup_device(pci_fixup_early, dev);

2023-04-06 00:22:48

by Maciej W. Rozycki

[permalink] [raw]

Subject: [PATCH v8 2/7] PCI: Export PCI link retrain timeout

Rename LINK_RETRAIN_TIMEOUT to PCIE_LINK_RETRAIN_TIMEOUT and make it
available via "pci.h" for PCI drivers to use.

Signed-off-by: Maciej W. Rozycki <[email protected]>
---
Changes from v7:

- Reorder from 1/7.

No change from v6.

No change from v5.

New change in v5.
---
drivers/pci/pci.h | 2 ++
drivers/pci/pcie/aspm.c | 4 +---
2 files changed, 3 insertions(+), 3 deletions(-)

linux-pcie-link-retrain-timeout.diff
Index: linux-macro/drivers/pci/pci.h
===================================================================
--- linux-macro.orig/drivers/pci/pci.h
+++ linux-macro/drivers/pci/pci.h
@@ -11,6 +11,8 @@

#define PCI_VSEC_ID_INTEL_TBT 0x1234 /* Thunderbolt */

+#define PCIE_LINK_RETRAIN_TIMEOUT HZ
+
extern const unsigned char pcie_link_speed[];
extern bool pci_early_dump;

Index: linux-macro/drivers/pci/pcie/aspm.c
===================================================================
--- linux-macro.orig/drivers/pci/pcie/aspm.c
+++ linux-macro/drivers/pci/pcie/aspm.c
@@ -90,8 +90,6 @@ static const char *policy_str[] = {
[POLICY_POWER_SUPERSAVE] = "powersupersave"
};

-#define LINK_RETRAIN_TIMEOUT HZ
-
/*
* The L1 PM substate capability is only implemented in function 0 in a
* multi function device.
@@ -213,7 +211,7 @@ static bool pcie_retrain_link(struct pci
}

/* Wait for link training end. Break out after waiting for timeout */
- end_jiffies = jiffies + LINK_RETRAIN_TIMEOUT;
+ end_jiffies = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
do {
pcie_capability_read_word(parent, PCI_EXP_LNKSTA, &reg16);
if (!(reg16 & PCI_EXP_LNKSTA_LT))

2023-04-06 00:23:24

by Maciej W. Rozycki

[permalink] [raw]

Subject: [PATCH v8 5/7] powerpc/eeh: Rely on `link_active_reporting'

Use `link_active_reporting' to determine whether Data Link Layer Link
Active Reporting is available rather than re-retrieving the capability.

Signed-off-by: Maciej W. Rozycki <[email protected]>
---
NB this has been compile-tested only with a PPC64LE configuration.

Changes from v7:

- Reorder from 4/7.

No change from v6.

New change in v6.
---
arch/powerpc/kernel/eeh_pe.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

linux-pcie-link-active-reporting-eeh.diff
Index: linux-macro/arch/powerpc/kernel/eeh_pe.c
===================================================================
--- linux-macro.orig/arch/powerpc/kernel/eeh_pe.c
+++ linux-macro/arch/powerpc/kernel/eeh_pe.c
@@ -671,9 +671,8 @@ static void eeh_bridge_check_link(struct
eeh_ops->write_config(edev, cap + PCI_EXP_LNKCTL, 2, val);

/* Check link */
- eeh_ops->read_config(edev, cap + PCI_EXP_LNKCAP, 4, &val);
- if (!(val & PCI_EXP_LNKCAP_DLLLARC)) {
- eeh_edev_dbg(edev, "No link reporting capability (0x%08x) \n", val);
+ if (!edev->pdev->link_active_reporting) {
+ eeh_edev_dbg(edev, "No link reporting capability\n");
msleep(1000);
return;
}

2023-04-06 00:24:06

by Maciej W. Rozycki

[permalink] [raw]

Subject: [PATCH v8 6/7] net/mlx5: Rely on `link_active_reporting'

Use `link_active_reporting' to determine whether Data Link Layer Link
Active Reporting is available rather than re-retrieving the capability.

Signed-off-by: Maciej W. Rozycki <[email protected]>
---
NB this has been compile-tested only with PPC64LE and x86-64
configurations.

Changes from v7:

- Reorder from 5/7.

Changes from v6:

- Regenerate against 6.3-rc5.

New change in v6.
---
drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

linux-pcie-link-active-reporting-mlx5.diff
Index: linux-macro/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
===================================================================
--- linux-macro.orig/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
+++ linux-macro/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
@@ -307,7 +307,6 @@ static int mlx5_pci_link_toggle(struct m
unsigned long timeout;
struct pci_dev *sdev;
int cap, err;
- u32 reg32;

/* Check that all functions under the pci bridge are PFs of
* this device otherwise fail this function.
@@ -346,11 +345,8 @@ static int mlx5_pci_link_toggle(struct m
return err;

/* Check link */
- err = pci_read_config_dword(bridge, cap + PCI_EXP_LNKCAP, &reg32);
- if (err)
- return err;
- if (!(reg32 & PCI_EXP_LNKCAP_DLLLARC)) {
- mlx5_core_warn(dev, "No PCI link reporting capability (0x%08x)\n", reg32);
+ if (!bridge->link_active_reporting) {
+ mlx5_core_warn(dev, "No PCI link reporting capability\n");
msleep(1000);
goto restore;
}

2023-04-06 00:26:16

by Maciej W. Rozycki

[permalink] [raw]

Subject: [PATCH v8 7/7] PCI: Work around PCIe link training failures

Attempt to handle cases such as with a downstream port of the ASMedia
ASM2824 PCIe switch where link training never completes and the link
continues switching between speeds indefinitely with the data link layer
never reaching the active state.

It has been observed with a downstream port of the ASMedia ASM2824 Gen 3
switch wired to the upstream port of the Pericom PI7C9X2G304 Gen 2
switch, using a Delock Riser Card PCI Express x1 > 2 x PCIe x1 device,
P/N 41433, wired to a SiFive HiFive Unmatched board. In this setup the
switches are supposed to negotiate the link speed of preferably 5.0GT/s,
falling back to 2.5GT/s.

Instead the link continues oscillating between the two speeds, at the
rate of 34-35 times per second, with link training reported repeatedly
active ~84% of the time. Forcibly limiting the target link speed to
2.5GT/s with the upstream ASM2824 device however makes the two switches
communicate correctly. Removing the speed restriction afterwards makes
the two devices switch to 5.0GT/s then.

Make use of these observations then and detect the inability to train
the link, by checking for the Data Link Layer Link Active status bit
being off while the Link Bandwidth Management Status indicating that
hardware has changed the link speed or width in an attempt to correct
unreliable link operation.

Restrict the speed to 2.5GT/s then with the Target Link Speed field,
request a retrain and wait 200ms for the data link to go up. If this
turns out successful, then lift the restriction, letting the devices
negotiate a higher speed.

Also check for a 2.5GT/s speed restriction the firmware may have already
arranged and lift it too with ports of devices known to continue working
afterwards, currently the ASM2824 only, that already report their data
link being up.

Signed-off-by: Maciej W. Rozycki <[email protected]>
Link: https://lore.kernel.org/r/[email protected]/
Link: https://source.denx.de/u-boot/u-boot/-/commit/a398a51ccc68
---
No changes from v7.

Changes from v6:

- Regenerate against 6.3-rc5.

- Shorten the lore.kernel.org archive link in the change description.

Changes from v5:

- Move from a quirk into PCI core and call at device probing, hot-plug,
reset and resume. Keep the ASMedia part under CONFIG_PCI_QUIRKS.

- Rely on `dev->link_active_reporting' rather than re-retrieving the
capability.

Changes from v4:

- Remove <linux/bug.h> inclusion no longer needed.

- Make the quirk generic based on probing device features rather than
specific to the ASM2824 part only; take the Retrain Link bit erratum
into account.

- Still lift the 2.5GT/s speed restriction with the ASM2824 only.

- Increase retrain timeout from 200ms to 1s (PCIE_LINK_RETRAIN_TIMEOUT).

- Remove retrain success notification.

- Use PCIe helpers rather than generic PCI functions throughout.

- Trim down and update the wording of the change description for the
switch from an ASM2824-specific to a generic fixup.

Changes from v3:

- Remove the <linux/pci_ids.h> entry for the ASM2824.

Changes from v2:

- Regenerate for 5.17-rc2 for a merge conflict.

- Replace BUG_ON for a missing PCI Express capability with WARN_ON and an
early return.

Changes from v1:

- Regenerate for a merge conflict.
---
drivers/pci/pci.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++++++--
drivers/pci/pci.h | 1
drivers/pci/probe.c | 2
3 files changed, 152 insertions(+), 5 deletions(-)

linux-pcie-asm2824-manual-retrain.diff
Index: linux-macro/drivers/pci/pci.c
===================================================================
--- linux-macro.orig/drivers/pci/pci.c
+++ linux-macro/drivers/pci/pci.c
@@ -859,6 +859,132 @@ int pci_wait_for_pending(struct pci_dev
return 0;
}

+/*
+ * Retrain the link of a downstream PCIe port by hand if necessary.
+ *
+ * This is needed at least where a downstream port of the ASMedia ASM2824
+ * Gen 3 switch is wired to the upstream port of the Pericom PI7C9X2G304
+ * Gen 2 switch, and observed with the Delock Riser Card PCI Express x1 >
+ * 2 x PCIe x1 device, P/N 41433, plugged into the SiFive HiFive Unmatched
+ * board.
+ *
+ * In such a configuration the switches are supposed to negotiate the link
+ * speed of preferably 5.0GT/s, falling back to 2.5GT/s. However the link
+ * continues switching between the two speeds indefinitely and the data
+ * link layer never reaches the active state, with link training reported
+ * repeatedly active ~84% of the time. Forcing the target link speed to
+ * 2.5GT/s with the upstream ASM2824 device makes the two switches talk to
+ * each other correctly however. And more interestingly retraining with a
+ * higher target link speed afterwards lets the two successfully negotiate
+ * 5.0GT/s.
+ *
+ * With the ASM2824 we can rely on the otherwise optional Data Link Layer
+ * Link Active status bit and in the failed link training scenario it will
+ * be off along with the Link Bandwidth Management Status indicating that
+ * hardware has changed the link speed or width in an attempt to correct
+ * unreliable link operation. For a port that has been left unconnected
+ * both bits will be clear. So use this information to detect the problem
+ * rather than polling the Link Training bit and watching out for flips or
+ * at least the active status.
+ *
+ * Since the exact nature of the problem isn't known and in principle this
+ * could trigger where an ASM2824 device is downstream rather upstream,
+ * apply this erratum workaround to any downstream ports as long as they
+ * support Link Active reporting and have the Link Control 2 register.
+ * Restrict the speed to 2.5GT/s then with the Target Link Speed field,
+ * request a retrain and wait 200ms for the data link to go up.
+ *
+ * If this turns out successful and we know by the Vendor:Device ID it is
+ * safe to do so, then lift the restriction, letting the devices negotiate
+ * a higher speed. Also check for a similar 2.5GT/s speed restriction the
+ * firmware may have already arranged and lift it with ports that already
+ * report their data link being up.
+ *
+ * Return 0 if the link has been successfully retrained, otherwise -1.
+ */
+int pcie_downstream_link_retrain(struct pci_dev *dev)
+{
+ static const struct pci_device_id ids[] = {
+ { PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
+ {}
+ };
+ u16 lnksta, lnkctl2;
+
+ if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
+ !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
+ return -1;
+
+ pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
+ pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+ if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
+ PCI_EXP_LNKSTA_LBMS) {
+ unsigned long timeout;
+ u16 lnkctl;
+
+ pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
+
+ pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
+ lnkctl |= PCI_EXP_LNKCTL_RL;
+ lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+ lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
+ /*
+ * Due to an erratum in some devices the Retrain Link bit
+ * needs to be cleared again manually to allow the link
+ * training to succeed.
+ */
+ lnkctl &= ~PCI_EXP_LNKCTL_RL;
+ if (dev->clear_retrain_link)
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL,
+ lnkctl);
+
+ timeout = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
+ do {
+ pcie_capability_read_word(dev, PCI_EXP_LNKSTA,
+ &lnksta);
+ if (lnksta & PCI_EXP_LNKSTA_DLLLA)
+ break;
+ usleep_range(10000, 20000);
+ } while (time_before(jiffies, timeout));
+
+ if (!(lnksta & PCI_EXP_LNKSTA_DLLLA)) {
+ pci_info(dev, "retraining failed\n");
+ return -1;
+ }
+ }
+
+ if (IS_ENABLED(CONFIG_PCI_QUIRKS) && (lnksta & PCI_EXP_LNKSTA_DLLLA) &&
+ (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
+ pci_match_id(ids, dev)) {
+ u32 lnkcap;
+ u16 lnkctl;
+
+ pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
+ pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
+ pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
+ lnkctl |= PCI_EXP_LNKCTL_RL;
+ lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+ lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
+ }
+
+ return 0;
+}
+
+/* Same as above, but called for a downstream device. */
+static int pcie_upstream_link_retrain(struct pci_dev *dev)
+{
+ struct pci_dev *bridge;
+
+ bridge = pci_upstream_bridge(dev);
+ if (bridge)
+ return pcie_downstream_link_retrain(bridge);
+ else
+ return -1;
+}
+
static int pci_acs_enable;

/**
@@ -1148,8 +1274,8 @@ void pci_resume_bus(struct pci_bus *bus)

static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
{
+ int retrain = 0;
int delay = 1;
- u32 id;

/*
* After reset, the device should not silently discard config
@@ -1163,21 +1289,37 @@ static int pci_dev_wait(struct pci_dev *
* Command register instead of Vendor ID so we don't have to
* contend with the CRS SV value.
*/
- pci_read_config_dword(dev, PCI_COMMAND, &id);
- while (PCI_POSSIBLE_ERROR(id)) {
+ for (;;) {
+ u32 id;
+
+ pci_read_config_dword(dev, PCI_COMMAND, &id);
+ if (!PCI_POSSIBLE_ERROR(id)) {
+ if (delay > PCI_RESET_WAIT)
+ pci_info(dev, "ready %dms after %s\n",
+ delay - 1, reset_type);
+ break;
+ }
+
if (delay > timeout) {
pci_warn(dev, "not ready %dms after %s; giving up\n",
delay - 1, reset_type);
return -ENOTTY;
}

- if (delay > PCI_RESET_WAIT)
+ if (delay > PCI_RESET_WAIT) {
+ if (!retrain) {
+ retrain = 1;
+ if (pcie_upstream_link_retrain(dev) == 0) {
+ delay = 1;
+ continue;
+ }
+ }
pci_info(dev, "not ready %dms after %s; waiting\n",
delay - 1, reset_type);
+ }

msleep(delay);
delay *= 2;
- pci_read_config_dword(dev, PCI_COMMAND, &id);
}

if (delay > PCI_RESET_WAIT)
@@ -4894,6 +5036,8 @@ static bool pcie_wait_for_link_delay(str
msleep(10);
timeout -= 10;
}
+ if (active && !ret)
+ ret = pcie_downstream_link_retrain(pdev) == 0;
if (active && ret)
msleep(delay);

Index: linux-macro/drivers/pci/pci.h
===================================================================
--- linux-macro.orig/drivers/pci/pci.h
+++ linux-macro/drivers/pci/pci.h
@@ -37,6 +37,7 @@ int pci_mmap_fits(struct pci_dev *pdev,
enum pci_mmap_api mmap_api);

bool pci_reset_supported(struct pci_dev *dev);
+int pcie_downstream_link_retrain(struct pci_dev *dev);
void pci_init_reset_methods(struct pci_dev *dev);
int pci_bridge_secondary_bus_reset(struct pci_dev *dev);
int pci_bus_error_reset(struct pci_dev *dev);
Index: linux-macro/drivers/pci/probe.c
===================================================================
--- linux-macro.orig/drivers/pci/probe.c
+++ linux-macro/drivers/pci/probe.c
@@ -2549,6 +2549,8 @@ void pci_device_add(struct pci_dev *dev,
dma_set_max_seg_size(&dev->dev, 65536);
dma_set_seg_boundary(&dev->dev, 0xffffffff);

+ pcie_downstream_link_retrain(dev);
+
/* Fix up broken headers */
pci_fixup_device(pci_fixup_header, dev);

2023-05-04 22:43:43

by Bjorn Helgaas

[permalink] [raw]

Subject: Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

On Thu, Apr 06, 2023 at 01:21:31AM +0100, Maciej W. Rozycki wrote:
> Attempt to handle cases such as with a downstream port of the ASMedia
> ASM2824 PCIe switch where link training never completes and the link
> continues switching between speeds indefinitely with the data link layer
> never reaching the active state.

We're going to land this series this cycle, come hell or high water.

We talked about reusing pcie_retrain_link() earlier. IIRC that didn't
work: ASPM needs to use PCI_EXP_LNKSTA_LT because not all devices
support PCI_EXP_LNKSTA_DLLLA, and you need PCI_EXP_LNKSTA_DLLLA
because the erratum makes PCI_EXP_LNKSTA_LT flap.

What if we made pcie_retrain_link() reusable by making it:

bool pcie_retrain_link(struct pci_dev *pdev, u16 link_status_bit)

so ASPM could use pcie_retrain_link(link->pdev, PCI_EXP_LNKSTA_LT) and
you could use pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA)?

Maybe do it two steps?

1) Move pcie_retrain_link() just after pcie_wait_for_link() and make
it take link->pdev instead of link.

2) Add the bit parameter.

I'm OK with having pcie_retrain_link() in pci.c, but the surrounding
logic about restricting to 2.5GT/s, retraining, removing the
restriction, retraining again is stuff I'd rather have in quirks.c so
it doesn't clutter pci.c.

I think it'd be good if the pci_device_add() path made clear that this
is a workaround for a problem, e.g.,

void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
{
...
if (pcie_link_failed(dev))
pcie_fix_link_train(dev);

where pcie_fix_link_train() could live in quirks.c (with a stub when
CONFIG_PCI_QUIRKS isn't enabled). It *might* even be worth adding it
and the stub first because that's a trivial patch and wouldn't clutter
the probe.c git history with all the grotty details about ASM2824 and
this topology.

> +int pcie_downstream_link_retrain(struct pci_dev *dev)
> +{
> + static const struct pci_device_id ids[] = {
> + { PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
> + {}
> + };
> + u16 lnksta, lnkctl2;
> +
> + if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> + !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> + return -1;
> +
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> + pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> + if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
> + PCI_EXP_LNKSTA_LBMS) {

You go to some trouble to make sure PCI_EXP_LNKSTA_LBMS is set, and I
can't remember what the reason is. If you make a preparatory patch
like this, it would give a place for that background, e.g.,

+bool pcie_link_failed(struct pci_dev *dev)
+{
+ u16 lnksta;
+
+ if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
+ !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
+ return false;
+
+ pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+ if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
+ PCI_EXP_LNKSTA_LBMS)
+ return true;
+
+ return false;
+}

If this is a generic thing and checking PCI_EXP_LNKSTA_LBMS makes
sense for everybody, it could go in pci.c; otherwise it could go in
quirks.c as well. I guess it's not *truly* generic anyway because it
only detects link training failures for devices that have LNKCTL2 and
link_active_reporting.

> + unsigned long timeout;
> + u16 lnkctl;
> +
> + pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
> +
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> + lnkctl |= PCI_EXP_LNKCTL_RL;
> + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> + lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
> + /*
> + * Due to an erratum in some devices the Retrain Link bit
> + * needs to be cleared again manually to allow the link
> + * training to succeed.
> + */
> + lnkctl &= ~PCI_EXP_LNKCTL_RL;
> + if (dev->clear_retrain_link)
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL,
> + lnkctl);
> +
> + timeout = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
> + do {
> + pcie_capability_read_word(dev, PCI_EXP_LNKSTA,
> + &lnksta);
> + if (lnksta & PCI_EXP_LNKSTA_DLLLA)
> + break;
> + usleep_range(10000, 20000);
> + } while (time_before(jiffies, timeout));
> +
> + if (!(lnksta & PCI_EXP_LNKSTA_DLLLA)) {
> + pci_info(dev, "retraining failed\n");
> + return -1;
> + }
> + }

> + if (IS_ENABLED(CONFIG_PCI_QUIRKS) && (lnksta & PCI_EXP_LNKSTA_DLLLA) &&
> + (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
> + pci_match_id(ids, dev)) {
> + u32 lnkcap;
> + u16 lnkctl;
> +
> + pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
> + pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> + lnkctl |= PCI_EXP_LNKCTL_RL;
> + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> + lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);

This starts a retrain; should we wait for training to complete?

> + }

If we put most of this into a pcie_fix_link_train() (separated from
detecting the *need* to fix something), could it be made to look
sort of like this? (I suppose you'd want to return bool and rename
it that reads naturally, e.g., "pcie_link_forcibly_retrained()",
"pcie_link_retrained()", etc)

+void pcie_fix_link_train(struct pci_dev *dev)
+{
+ u16 lnkctl2;
+ u32 lnkcap;
+ bool linkup;
+
+ pci_info(dev, "attempting link retrain at 2.5GT/s\n");
+ pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
+ lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+ lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+
+ linkup = pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA);
+ if (!linkup) {
+ pci_info(dev, "retraining failed\n");
+ return;
+ }
+
+ if (LNKCAP supports only 2.5GT/s)
+ return;
+
+ if (!pci_match_id(ids, dev))
+ return;

Your comment said "if we know this is *safe*"; I can't remember if
pci_match_id() is there to avoid a known problem?

+
+ pci_info(dev, "attempting link retrain at max supported rate\n");
+ pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
+ lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+ lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
+ pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+
+ linkup = pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA);
+ if (!linkup)
+ pci_info(dev, "retraining failed\n");
+}

> +
> + return 0;
> +}
> +
> +/* Same as above, but called for a downstream device. */
> +static int pcie_upstream_link_retrain(struct pci_dev *dev)
> +{
> + struct pci_dev *bridge;
> +
> + bridge = pci_upstream_bridge(dev);
> + if (bridge)
> + return pcie_downstream_link_retrain(bridge);
> + else
> + return -1;
> +}
> +
> static int pci_acs_enable;
>
> /**
> @@ -1148,8 +1274,8 @@ void pci_resume_bus(struct pci_bus *bus)
>
> static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> {
> + int retrain = 0;
> int delay = 1;
> - u32 id;
>
> /*
> * After reset, the device should not silently discard config
> @@ -1163,21 +1289,37 @@ static int pci_dev_wait(struct pci_dev *
> * Command register instead of Vendor ID so we don't have to
> * contend with the CRS SV value.
> */
> - pci_read_config_dword(dev, PCI_COMMAND, &id);
> - while (PCI_POSSIBLE_ERROR(id)) {
> + for (;;) {
> + u32 id;
> +
> + pci_read_config_dword(dev, PCI_COMMAND, &id);
> + if (!PCI_POSSIBLE_ERROR(id)) {
> + if (delay > PCI_RESET_WAIT)
> + pci_info(dev, "ready %dms after %s\n",
> + delay - 1, reset_type);
> + break;
> + }
> +
> if (delay > timeout) {
> pci_warn(dev, "not ready %dms after %s; giving up\n",
> delay - 1, reset_type);
> return -ENOTTY;
> }
>
> - if (delay > PCI_RESET_WAIT)
> + if (delay > PCI_RESET_WAIT) {
> + if (!retrain) {
> + retrain = 1;
> + if (pcie_upstream_link_retrain(dev) == 0) {
> + delay = 1;
> + continue;
> + }
> + }
> pci_info(dev, "not ready %dms after %s; waiting\n",
> delay - 1, reset_type);
> + }

Thanks for fixing this in the reset path, too. Can we move this part
to a separate patch? It's related to the rest of the patch, but it
looks so much different that I think it would be easier to understand
by itself.

I think I might try to fold the pcie_upstream_link_retrain() directly
in here because the "upstream link retrain" in the function name
doesn't really make sense in PCIe terms.

Bjorn

2023-05-04 22:43:49

by Bjorn Helgaas

[permalink] [raw]

Subject: Re: [PATCH v8 2/7] PCI: Export PCI link retrain timeout

On Thu, Apr 06, 2023 at 01:21:09AM +0100, Maciej W. Rozycki wrote:
> Rename LINK_RETRAIN_TIMEOUT to PCIE_LINK_RETRAIN_TIMEOUT and make it
> available via "pci.h" for PCI drivers to use.

> +#define PCIE_LINK_RETRAIN_TIMEOUT HZ

This is basically just a rename and move, but since we're touching it
anyway, can we make it "PCIE_LINK_RETRAIN_TIMEOUT_MS 1000" here and
use msecs_to_jiffies() below?

I know jiffies and HZ are probably idiomatic elsewhere in the kernel,
and this particular timeout is arbitrary and not based on anything in
the spec, but many of the delays in PCI *are* straight from a spec, so
I'd like to make the units more explicit.

> extern const unsigned char pcie_link_speed[];
> extern bool pci_early_dump;
>
> Index: linux-macro/drivers/pci/pcie/aspm.c
> ===================================================================
> --- linux-macro.orig/drivers/pci/pcie/aspm.c
> +++ linux-macro/drivers/pci/pcie/aspm.c
> @@ -90,8 +90,6 @@ static const char *policy_str[] = {
> [POLICY_POWER_SUPERSAVE] = "powersupersave"
> };
>
> -#define LINK_RETRAIN_TIMEOUT HZ
> -
> /*
> * The L1 PM substate capability is only implemented in function 0 in a
> * multi function device.
> @@ -213,7 +211,7 @@ static bool pcie_retrain_link(struct pci
> }
>
> /* Wait for link training end. Break out after waiting for timeout */
> - end_jiffies = jiffies + LINK_RETRAIN_TIMEOUT;
> + end_jiffies = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
> do {
> pcie_capability_read_word(parent, PCI_EXP_LNKSTA, &reg16);
> if (!(reg16 & PCI_EXP_LNKSTA_LT))

2023-05-07 18:53:04

by Maciej W. Rozycki

[permalink] [raw]

Subject: Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

On Thu, 4 May 2023, Bjorn Helgaas wrote:

> On Thu, Apr 06, 2023 at 01:21:31AM +0100, Maciej W. Rozycki wrote:
> > Attempt to handle cases such as with a downstream port of the ASMedia
> > ASM2824 PCIe switch where link training never completes and the link
> > continues switching between speeds indefinitely with the data link layer
> > never reaching the active state.
>
> We're going to land this series this cycle, come hell or high water.

Thank you for coming back to me and for your promise. I'll strive to
address your concerns next weekend.

Unfortunately a PDU in my remote lab has botched up and I've lost control
over it (thankfully not one for the RISC-V machine affected by the patch
series, so I can still manage it for reboots, etc., but the botched PDU is
actually upstream), so depending on how situation develops I may have to
book air travel instead and spend the whole weekend getting things back to
normal operation at my lab. That unit was not supposed to fail, not in
such a silly way anyway, sigh...

Maciej

2023-05-14 21:29:16

by Maciej W. Rozycki

[permalink] [raw]

Subject: Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

On Sun, 7 May 2023, Maciej W. Rozycki wrote:

> > We're going to land this series this cycle, come hell or high water.
>
> Thank you for coming back to me and for your promise. I'll strive to
> address your concerns next weekend.
>
> Unfortunately a PDU in my remote lab has botched up and I've lost control
> over it (thankfully not one for the RISC-V machine affected by the patch
> series, so I can still manage it for reboots, etc., but the botched PDU is
> actually upstream), so depending on how situation develops I may have to
> book air travel instead and spend the whole weekend getting things back to
> normal operation at my lab. That unit was not supposed to fail, not in
> such a silly way anyway, sigh...

Last Thu the situation with the PDU became critical, so I spent a better
part of yesterday and today travelling and then all night long getting
things sorted. So it'll have to be next weekend when I get back to these
patches. I hope we can still make it regardless.

Maciej

2023-06-11 17:48:18

by Maciej W. Rozycki

[permalink] [raw]

Subject: Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

On Thu, 4 May 2023, Bjorn Helgaas wrote:

> We talked about reusing pcie_retrain_link() earlier. IIRC that didn't
> work: ASPM needs to use PCI_EXP_LNKSTA_LT because not all devices
> support PCI_EXP_LNKSTA_DLLLA, and you need PCI_EXP_LNKSTA_DLLLA
> because the erratum makes PCI_EXP_LNKSTA_LT flap.
>
> What if we made pcie_retrain_link() reusable by making it:
>
> bool pcie_retrain_link(struct pci_dev *pdev, u16 link_status_bit)
>
> so ASPM could use pcie_retrain_link(link->pdev, PCI_EXP_LNKSTA_LT) and
> you could use pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA)?

This is somewhat more complicated, because of the inverted logic between
the two status bits. Therefore I think a boolean flag is more adequate
with preparatory logic within the function itself. This will tighten the
call interface as well.

> Maybe do it two steps?
>
> 1) Move pcie_retrain_link() just after pcie_wait_for_link() and make
> it take link->pdev instead of link.
>
> 2) Add the bit parameter.

Having compared the two pieces side by side now I think it makes sense.
While there are minor differences, most prominently the original code is
more aggressive than mine in polling the status bit, I think these details
are not significant enough to argue over them here. And we can consider
switching to more modern `usleep_range' interface separately.

A minor pessimisation resulting is that LNKSTA has to be reread in the
caller after return from `pcie_retrain_link'; previously the last value
read in the poll loop could have been reused.

> I'm OK with having pcie_retrain_link() in pci.c, but the surrounding
> logic about restricting to 2.5GT/s, retraining, removing the
> restriction, retraining again is stuff I'd rather have in quirks.c so
> it doesn't clutter pci.c.

Well, it was there in quirks.c originally and I only moved this piece
following your earlier suggestion:

> If we think the first part (attempting the retrain) is safe to do
> whenever the link is down, maybe we should do that more directly as
> part of the PCI core instead of as a quirk?

as in here: <https://lore.kernel.org/r/20221109050418.GA529724@bhelgaas/>,
though if you did change your mind after all, I can move it back, sure.
It's not always that the first thought is the best, or sometimes good at
all.

> I think it'd be good if the pci_device_add() path made clear that this
> is a workaround for a problem, e.g.,
>
> void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
> {
> ...
> if (pcie_link_failed(dev))
> pcie_fix_link_train(dev);
>
> where pcie_fix_link_train() could live in quirks.c (with a stub when
> CONFIG_PCI_QUIRKS isn't enabled). It *might* even be worth adding it
> and the stub first because that's a trivial patch and wouldn't clutter
> the probe.c git history with all the grotty details about ASM2824 and
> this topology.

I have added a stub now, both as an intermediate step and ultimately for
!PCI_QUIRKS, but I disagree about having the check in pci.c and the fix in
quirks.c, because from the code structure's point of view it makes no
sense IMHO to have the check enabled and the fix disabled both at a time
for !PCI_QUIRKS, even if the compiler would actually optimise the check
away in that case.

Please let me know if you maintain your suggestion and if so, then why
you find it so important. I think with the use of `pcie_retrain_link'
this code has become straightforward enough not to need to be split or
factored out any further (and while factoring out the conditionals only
would make some sense to me, it would require duplicating configuration
register accesses even further).

> > +int pcie_downstream_link_retrain(struct pci_dev *dev)
> > +{
> > + static const struct pci_device_id ids[] = {
> > + { PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
> > + {}
> > + };
> > + u16 lnksta, lnkctl2;
> > +
> > + if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> > + !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> > + return -1;
> > +
> > + pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> > + pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> > + if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
> > + PCI_EXP_LNKSTA_LBMS) {
>
> You go to some trouble to make sure PCI_EXP_LNKSTA_LBMS is set, and I
> can't remember what the reason is. If you make a preparatory patch
> like this, it would give a place for that background, e.g.,

It has been already documented along with the code in question:

* With the ASM2824 we can rely on the otherwise optional Data Link Layer
* Link Active status bit and in the failed link training scenario it will
* be off along with the Link Bandwidth Management Status indicating that
* hardware has changed the link speed or width in an attempt to correct
* unreliable link operation. For a port that has been left unconnected
* both bits will be clear. [...]

> +bool pcie_link_failed(struct pci_dev *dev)
> +{
> + u16 lnksta;
> +
> + if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> + !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> + return false;
> +
> + pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> + if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
> + PCI_EXP_LNKSTA_LBMS)
> + return true;
> +
> + return false;
> +}
>
> If this is a generic thing and checking PCI_EXP_LNKSTA_LBMS makes
> sense for everybody, it could go in pci.c; otherwise it could go in
> quirks.c as well. I guess it's not *truly* generic anyway because it
> only detects link training failures for devices that have LNKCTL2 and
> link_active_reporting.

I do not have enough information to tell whether this is generic or not.

Checking for PCI_EXP_LNKSTA_LBMS is important, because otherwise this
code would attempt to retrain every empty slot or otherwise unconnected
PCIe link, which we do not want to do and which would surely take a lot of
time with some of the larger systems, to say nothing of the log clutter
showing that there is something wrong with the system while indeed there
is nothing.

Out of all the ports whose data link layer is not in the DL_Active state
the LBMS bit is only set for the failed link in my system and I suspect it
is related to the link speed negotiation erratum causing unsuccessful link
training to repeat indefinitely.

By definition LBMS cannot be set for an unconnected link, because the bit
is only allowed to be set for an event observed that has happened with a
port reporting no DL_Down status at any time throughout the event, which
can only happen with the physical layer up, which of course cannot happen
for an unconnected link (of course I can imagine another erratum affecting
the LBMS bit, but that has not been reported yet).

While making sure I got all the details in the previous paragraph right I
have gone through a reference to the DL_Feature data link layer state (and
a potential need to disable it for interacting with a non-compliant legacy
downstream device), but neither device involved supports it, so it can't
possibly be the cause for the phenomenon observed.

IOW the LBMS bit serves the purpose of indicating that there is actually
a device down an inactive link (the state of the physical layer's LinkUp
bit is not directly accessible via software). And one might argue that
the state where LBMS is set but DLLLA is clear (where actually supported)
after a device reset is indeed a generic sign of an odd link training
issue.

If you think it would make sense to include any piece of the text above
with the existing documentation, then I'll be happy to improve it.

> > + if (IS_ENABLED(CONFIG_PCI_QUIRKS) && (lnksta & PCI_EXP_LNKSTA_DLLLA) &&
> > + (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
> > + pci_match_id(ids, dev)) {
> > + u32 lnkcap;
> > + u16 lnkctl;
> > +
> > + pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
> > + pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> > + pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> > + lnkctl |= PCI_EXP_LNKCTL_RL;
> > + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> > + lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
> > + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> > + pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
>
> This starts a retrain; should we wait for training to complete?

Yep, why not, with `pcie_retrain_link' now updated it's trivial, and we
can then also verify the result (and do nothing about it except for
reporting, as it's supposed to never happen, so let's just wait and see).

> If we put most of this into a pcie_fix_link_train() (separated from
> detecting the *need* to fix something), could it be made to look
> sort of like this? (I suppose you'd want to return bool and rename
> it that reads naturally, e.g., "pcie_link_forcibly_retrained()",
> "pcie_link_retrained()", etc)

Ah, I concluded to make it return `bool' independently, having not seen
this suggestion of yours yet, so it seems like we're getting in sync, and
likewise I renamed the function to `pcie_failed_link_retrain' already.

> +void pcie_fix_link_train(struct pci_dev *dev)
> +{
> + u16 lnkctl2;
> + u32 lnkcap;
> + bool linkup;
> +
> + pci_info(dev, "attempting link retrain at 2.5GT/s\n");
> + pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> + lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> + lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
> + pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> +
> + linkup = pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA);
> + if (!linkup) {
> + pci_info(dev, "retraining failed\n");
> + return;
> + }
> +
> + if (LNKCAP supports only 2.5GT/s)
> + return;
> +
> + if (!pci_match_id(ids, dev))
> + return;
>
> Your comment said "if we know this is *safe*"; I can't remember if
> pci_match_id() is there to avoid a known problem?

It's the other way round, the intent is to lift the speed restriction and
retrain for devices known to succeed and survive only.

It cannot be universally guaranteed that such a retrain will succeed even
if 2.5GT/s works, and moreover this piece is independent from the attempt
to recover made immediately above and will also run where the firmware has
clamped the speed of the link somehow, whether for this erratum or for
another reason (remember that the speed clamp is sticky, so it will have
survived our bus/interconnect hierarchy reset).

In particular Pali has reported (in an earlier discussion concerning this
erratum on the U-Boot mailing list) the existence of downstream devices
that lock up when a link retrain is attempted, so we don't want to request
one for an otherwise known-working link (retraining a dead link won't hurt
of course regardless, because at worst it'll just stay in its non-working
state, and we don't have a way to figure out what might be there anyway).
Cf. <https://lists.denx.de/pipermail/u-boot/2021-November/467201.html>.

> > +/* Same as above, but called for a downstream device. */
> > +static int pcie_upstream_link_retrain(struct pci_dev *dev)
> > +{
> > + struct pci_dev *bridge;
> > +
> > + bridge = pci_upstream_bridge(dev);
> > + if (bridge)
> > + return pcie_downstream_link_retrain(bridge);
> > + else
> > + return -1;
> > +}
> > +
> > static int pci_acs_enable;
> >
> > /**
> > @@ -1148,8 +1274,8 @@ void pci_resume_bus(struct pci_bus *bus)
> >
> > static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> > {
> > + int retrain = 0;
> > int delay = 1;
> > - u32 id;
> >
> > /*
> > * After reset, the device should not silently discard config
> > @@ -1163,21 +1289,37 @@ static int pci_dev_wait(struct pci_dev *
> > * Command register instead of Vendor ID so we don't have to
> > * contend with the CRS SV value.
> > */
> > - pci_read_config_dword(dev, PCI_COMMAND, &id);
> > - while (PCI_POSSIBLE_ERROR(id)) {
> > + for (;;) {
> > + u32 id;
> > +
> > + pci_read_config_dword(dev, PCI_COMMAND, &id);
> > + if (!PCI_POSSIBLE_ERROR(id)) {
> > + if (delay > PCI_RESET_WAIT)
> > + pci_info(dev, "ready %dms after %s\n",
> > + delay - 1, reset_type);
> > + break;
> > + }
> > +
> > if (delay > timeout) {
> > pci_warn(dev, "not ready %dms after %s; giving up\n",
> > delay - 1, reset_type);
> > return -ENOTTY;
> > }
> >
> > - if (delay > PCI_RESET_WAIT)
> > + if (delay > PCI_RESET_WAIT) {
> > + if (!retrain) {
> > + retrain = 1;
> > + if (pcie_upstream_link_retrain(dev) == 0) {
> > + delay = 1;
> > + continue;
> > + }
> > + }
> > pci_info(dev, "not ready %dms after %s; waiting\n",
> > delay - 1, reset_type);
> > + }
>
> Thanks for fixing this in the reset path, too. Can we move this part
> to a separate patch? It's related to the rest of the patch, but it
> looks so much different that I think it would be easier to understand
> by itself.

I think making a system halfway-fixed would make little sense, but with
the actual fix actually made last as you suggested I think this can be
split off, because it'll make no functional change by itself.

While at it I have verified that the initial value of `retrain' does not
have to be changed for the compiler to optimise away any code related to
it in the !PCI_QUIRKS case where `pcie_parent_link_retrain' gets optimised
away too, so we're good here.

I changed `retrain' to `bool' though and inverted the logic as I find it
more natural this way.

> I think I might try to fold the pcie_upstream_link_retrain() directly
> in here because the "upstream link retrain" in the function name
> doesn't really make sense in PCIe terms.

Well, it does, as you can indeed request a retrain for an upstream port
device. This is not however what this function does, so I agree it's
confusing. I have replaced "upstream" with "parent" in the function name
then to avoid this ambiguity.

Otherwise I think factoring this piece out makes code more readable, as
it's already quite deeply nested in blocks, and the compiler will inline
it anyway, so I'd rather keep it as a separate function.

With the observations made I'll be posting a rewritten patch series now.
I realise there might still be issues outstanding, but this rewrite was
already humongous enough and I think it deserves a second pair of eyeballs
before massaging it any further.

And last but not least, thank you for waiting, it was quite a stretch for
me to fit this effort in among all the stuff currently on my table and all
the unforeseen events.

Maciej