2024-02-02 16:46:42

by Ilpo Järvinen

Subject: [PATCH 1/1] PCI: Cleanup link activation wait logic

The combination of logic in pcie_failed_link_retrain() and
pcie_wait_for_link_delay() is hard to track and does not really make
sense in some cases.

To clean up the logic mess:

1. Change pcie_failed_link_retrain() to return true only if link was
retrained successfully due to the Target Speed quirk. If there is no
LBMS set, return false instead of true because no retraining was
even attempted. This seems correct considering expectations of both
callers of pcie_failed_link_retrain().

2. Handle link-was-not-retrained-successfully return (false) from
pcie_failed_link_retrain() properly in pcie_wait_for_link_delay() by
directly returning false.

Signed-off-by: Ilpo Järvinen <[email protected]>
---
drivers/pci/pci.c | 4 +---
drivers/pci/quirks.c | 25 +++++++++++++------------
2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d8f11a078924..ca4159472a72 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5068,9 +5068,7 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
 		msleep(20);
 	rc = pcie_wait_for_link_status(pdev, false, active);
 	if (active) {
-		if (rc)
-			rc = pcie_failed_link_retrain(pdev);
-		if (rc)
+		if (rc < 0 && !pcie_failed_link_retrain(pdev))
 			return false;

 		msleep(delay);
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index efb2ddbff115..e729157be95d 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -88,24 +88,25 @@ bool pcie_failed_link_retrain(struct pci_dev *dev)
 	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
 		return false;

-	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
 	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
-	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
-	    PCI_EXP_LNKSTA_LBMS) {
-		pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
+	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) !=
+	    PCI_EXP_LNKSTA_LBMS)
+		return false;

-		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
-		lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
-		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+	pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");

-		if (pcie_retrain_link(dev, false)) {
-			pci_info(dev, "retraining failed\n");
-			return false;
-		}
+	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
+	lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+	lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
+	pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);

-		pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+	if (pcie_retrain_link(dev, false)) {
+		pci_info(dev, "retraining failed\n");
+		return false;
 	}

+	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+
 	if ((lnksta & PCI_EXP_LNKSTA_DLLLA) &&
 	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
 	    pci_match_id(ids, dev)) {
--
2.39.2



2024-02-16 13:28:58

by Ilpo Järvinen

Subject: Re: [PATCH 1/1] PCI: Cleanup link activation wait logic

On Fri, 2 Feb 2024, Maciej W. Rozycki wrote:

> On Fri, 2 Feb 2024, Ilpo Järvinen wrote:
>
> > 1. Change pcie_failed_link_retrain() to return true only if link was
> > retrained successfully due to the Target Speed quirk. If there is no
> > LBMS set, return false instead of true because no retraining was
> > even attempted. This seems correct considering expectations of both
> > callers of pcie_failed_link_retrain().
>
> You change the logic here in that the second conditional isn't run if the
> first has not. This is wrong, unclamping is not supposed to rely on LBMS.
> It is supposed to be always run and any failure has to be reported too, as
> a retraining error.
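
The objection above can be illustrated with a small user-space model. This
is only a sketch under stated assumptions, not the kernel code: the struct
and every sketch_* name are hypothetical, retraining is assumed to always
succeed, and speeds are plain codes where 1 stands for 2.5GT/s.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the port state the quirk inspects. */
struct sketch_port {
	bool lbms;		/* Link Bandwidth Management Status is set */
	bool dllla;		/* Data Link Layer Link Active is set */
	unsigned int tls;	/* Target Link Speed code, 1 means 2.5GT/s */
	unsigned int max_tls;	/* highest speed code the port supports */
	bool id_match;		/* device is on the quirk's ID list */
};

/* First conditional: clamp to 2.5GT/s when LBMS is set and the link is down. */
static bool sketch_clamp(struct sketch_port *p)
{
	if (p->lbms && !p->dllla) {
		p->tls = 1;
		p->dllla = true;	/* assumption: retraining at 2.5GT/s works */
		return true;
	}
	return false;
}

/* Second conditional: lift a 2.5GT/s clamp on a working link. */
static bool sketch_unclamp(struct sketch_port *p)
{
	if (p->dllla && p->tls == 1 && p->id_match) {
		p->tls = p->max_tls;
		return true;
	}
	return false;
}

/* Flow before the patch: unclamping is attempted regardless of LBMS. */
static void sketch_flow_original(struct sketch_port *p)
{
	sketch_clamp(p);
	sketch_unclamp(p);
}

/* Flow with the patch: the early return skips the unclamping attempt. */
static void sketch_flow_patched(struct sketch_port *p)
{
	if (!sketch_clamp(p))
		return;
	sketch_unclamp(p);
}
```

For a port that is already up at 2.5GT/s with no LBMS set, the original
flow in this model raises the target speed to max_tls, while the patched
flow leaves it clamped.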

Now that (I think) I fully understand the intent of the second
condition/block one additional question occurred to me.

How is the 2nd condition even supposed to work in the current place when
firmware has pre-arranged the 2.5GT/s restriction? Wouldn't the link come
up fine in that case, so that the quirk code is not called at all because
the link came up successfully?


Yet another thing in this quirk code I don't like is how it can leave the
target speed at 2.5GT/s when the quirk fails to get the link working
(which actually does happen in the disconnection cases because DLLLA won't
be set so the target speed will not be restored).
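
One way to avoid leaving the restriction behind can be sketched as a
save-and-restore around the retraining attempt. This is a hypothetical
user-space model, not a proposed kernel change: the struct and the
sketch_* helpers are made up for illustration, and retraining is assumed
to succeed only when a device is present.

```c
#include <stdbool.h>

/* Hypothetical model of a downstream port's link state. */
struct sketch_link {
	unsigned int tls;	/* Target Link Speed code, 1 means 2.5GT/s */
	bool device_present;	/* something is attached to the port */
};

/* Assumption: retraining only succeeds when a device is attached. */
static bool sketch_retrain(const struct sketch_link *l)
{
	return l->device_present;
}

/* Clamp to 2.5GT/s, retrain, and restore the old speed on failure. */
static bool sketch_retrain_at_2_5gt(struct sketch_link *l)
{
	unsigned int saved = l->tls;

	l->tls = 1;
	if (sketch_retrain(l))
		return true;

	l->tls = saved;		/* do not leave the port clamped */
	return false;
}
```

In this model an empty port keeps its previous target speed after a failed
attempt, so a device plugged in later is not held at 2.5GT/s.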


--
i.

2024-02-16 14:23:46

by Ilpo Järvinen

Subject: Re: [PATCH 1/1] PCI: Cleanup link activation wait logic

On Fri, 16 Feb 2024, Maciej W. Rozycki wrote:
> On Fri, 16 Feb 2024, Ilpo Järvinen wrote:
>
> > > You change the logic here in that the second conditional isn't run if the
> > > first has not. This is wrong, unclamping is not supposed to rely on LBMS.
> > > It is supposed to be always run and any failure has to be reported too, as
> > > a retraining error.
> >
> > Now that (I think) I fully understand the intent of the second
> > condition/block one additional question occurred to me.
> >
> > How is the 2nd condition even supposed to work in the current place when
> > firmware has pre-arranged the 2.5GT/s restriction? Wouldn't the link come
> > up fine in that case, so that the quirk code is not called at all because
> > the link came up successfully?
>
> The quirk is called unconditionally from `pci_device_add', so an attempt
> to unclamp will always happen with a working link for qualifying devices.

Ah, thanks. I'd stared at the other two calls so many times that I'd
forgotten the 3rd one even existed.

> > Yet another thing in this quirk code I don't like is how it can leave the
> > target speed at 2.5GT/s when the quirk fails to get the link working
> > (which actually does happen in the disconnection cases because DLLLA won't
> > be set so the target speed will not be restored).
>
> I chose to leave the target speed at the most recent setting, because the
> link doesn't work in that case anyway, so I concluded it doesn't matter,
> but reduces messing with the device; technically you should retrain again
> afterwards. I'm not opposed to changing this if you have a use case.

It remains suboptimally set when something is plugged into that port
again. For Thunderbolt it doesn't matter, as the PCIe speed picked is
quite bogus anyway, but disconnecting and then plugging something in
again is not limited to Thunderbolt.

I've no immediate plans to change it now, but it may become relevant when
attempting to make the bandwidth controller trigger the quirk. To me
there are two quirks, not just one, so I might have to split them to make
them better suited for triggering from bwctrl.

--
i.

2024-02-26 12:43:49

by Maciej W. Rozycki

Subject: Re: [PATCH 1/1] PCI: Cleanup link activation wait logic

On Fri, 16 Feb 2024, Ilpo Järvinen wrote:

> > > Yet another thing in this quirk code I don't like is how it can leave the
> > > target speed at 2.5GT/s when the quirk fails to get the link working
> > > (which actually does happen in the disconnection cases because DLLLA won't
> > > be set so the target speed will not be restored).
> >
> > I chose to leave the target speed at the most recent setting, because the
> > link doesn't work in that case anyway, so I concluded it doesn't matter,
> > but reduces messing with the device; technically you should retrain again
> > afterwards. I'm not opposed to changing this if you have a use case.
>
> It remains suboptimally set when something is plugged into that port
> again. For Thunderbolt it doesn't matter, as the PCIe speed picked is
> quite bogus anyway, but disconnecting and then plugging something in
> again is not limited to Thunderbolt.

Sure, my understanding has been that all PCIe option devices are supposed
to be hot-pluggable, at least those in the regular form factor (which is
why PCIe edge connector contacts have varying lengths, unlike conventional
PCI).

Maciej

2024-02-26 12:53:54

by Maciej W. Rozycki

Subject: Re: [PATCH 1/1] PCI: Cleanup link activation wait logic

On Sat, 10 Feb 2024, Maciej W. Rozycki wrote:

> > > You change the logic here in that the second conditional isn't run if the
> > > first has not. This is wrong, unclamping is not supposed to rely on LBMS.
> > > It is supposed to be always run and any failure has to be reported too, as
> > > a retraining error. I'll send an update according to what I have outlined
> > > before then, with some testing done first.
> >
> > Oh I see now, I'm sorry, I didn't read all the way to the last paragraph
> > of the commit message because the earlier one in the commit message hinted
> > the restriction is removed afterwards so I thought it was only linked to
> > the first part of the quirk.
>
> No worries. I have submitted the changes now:
> <https://lore.kernel.org/r/[email protected]/>.
>
> Unfortunately due to a sudden hardware failure I wasn't able to do any
> non-trivial verification, as explained in the cover letter. I'll try to
> get back to it as soon as reasonably possible, hopefully next month.

I have the failure sorted now; simply reseating one of the connections
made all the option modules work again as before. I'm not impressed
by this lack of reliability, but maybe it was just bad luck, as things
used to work just fine for over 2 years before I replaced the mainboard.

I'll see if I can do some verification this week.

Maciej