2024-04-24 03:34:13

by Smita Koralahalli

Subject: [PATCH] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction

Clear Link Bandwidth Management Status (LBMS) if set, on a hot-remove event.

The hot-remove event could result in a target link speed reduction if
LBMS is set, due to a delay in the Presence Detect State Change (PDSC)
happening after the Data Link Layer State Change (DLLSC) event.

In reality, PDSC and DLLSC events rarely arrive simultaneously. The
PDSC can sometimes be delayed so long that the slot has already been
powered down just by the DLLSC event. The delayed PDSC could then be
falsely interpreted as an interrupt raised to turn the slot on. This
spurious power-on, without a link, forces the kernel to retrain the
link (if LBMS is set) to a lower speed to re-establish the link,
thereby bringing down the link speed [2].

From PCIe r6.2 sec 7.5.3.8 [1], it can be derived that LBMS cannot
be set for an unconnected link; if set, it serves the purpose of
indicating that there is actually a device down an inactive link.
However, hardware could have already set LBMS when the device was
connected to the port, i.e. when the state was DL_Up or DL_Active. Some
hardware may even have attempted a retrain, entering recovery mode,
just before transitioning to DL_Down.

Thus, once set, LBMS is never cleared and may force software to drop
the link speed when there is no link [2].

Dmesg before:
pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
pcieport 0000:20:01.1: pciehp: Slot(59): Card present
pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
pcieport 0000:20:01.1: retraining failed
pcieport 0000:20:01.1: pciehp: Slot(59): No link

Dmesg after:
pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
pcieport 0000:20:01.1: pciehp: Slot(59): Card present
pcieport 0000:20:01.1: pciehp: Slot(59): No link

[1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
https://members.pcisig.com/wg/PCI-SIG/document/20590
[2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")

Signed-off-by: Smita Koralahalli <[email protected]>
---
1. Should be based on top of fixes for link retrain status in
pcie_wait_for_link_delay()
https://patchwork.kernel.org/project/linux-pci/list/?series=824858
https://lore.kernel.org/linux-pci/[email protected]/

Without the fixes, the patch output would be:
pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
pcieport 0000:20:01.1: pciehp: Slot(59): Card present
pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
pcieport 0000:20:01.1: retraining failed
pcieport 0000:20:01.1: pciehp: Slot(59): No device found.

2. I initially attempted to wait for both the PDSC and DLLSC events to
happen and only then turn on the slot.
Similar to: https://lore.kernel.org/lkml/[email protected]/
but applied before turning on the slot.

Something like:
-		ctrl->state = POWERON_STATE;
-		mutex_unlock(&ctrl->state_lock);
-		if (present)
+		if (present && link_active) {
+			ctrl->state = POWERON_STATE;
+			mutex_unlock(&ctrl->state_lock);
 			ctrl_info(ctrl, "Slot(%s): Card present\n",
 				  slot_name(ctrl));
-		if (link_active)
 			ctrl_info(ctrl, "Slot(%s): Link Up\n",
 				  slot_name(ctrl));
-		ctrl->request_result = pciehp_enable_slot(ctrl);
-		break;
+			ctrl->request_result = pciehp_enable_slot(ctrl);
+			break;
+		} else {
+			mutex_unlock(&ctrl->state_lock);
+			break;
+		}

This would also avoid printing the lines below on a remove event.
pcieport 0000:20:01.1: pciehp: Slot(59): Card present
pcieport 0000:20:01.1: pciehp: Slot(59): No link

I understand this would likely not be applicable where broken devices
hardwire PDS to zero and PDSC would never happen. But I'm open to making
changes if this approach is more suitable, because SW cannot directly
track HW's interference in attempting link retrain and setting LBMS.

3. I tried introducing a delay similar to pcie_wait_for_presence() but I
was not successful in picking the right numbers, and hence hit the same
link speed drop.

4. For some reason I was unable to clear LBMS with:
	pcie_capability_clear_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
				   PCI_EXP_LNKSTA_LBMS);
---
drivers/pci/hotplug/pciehp_pci.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index ad12515a4a12..9155fdfd1d37 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -92,7 +92,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
 {
 	struct pci_dev *dev, *temp;
 	struct pci_bus *parent = ctrl->pcie->port->subordinate;
-	u16 command;
+	u16 command, lnksta;
 
 	ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
 		 __func__, pci_domain_nr(parent), parent->number);
@@ -134,4 +134,10 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
 	}
 
 	pci_unlock_rescan_remove();
+
+	/* Clear LBMS on removal */
+	pcie_capability_read_word(ctrl->pcie->port, PCI_EXP_LNKSTA, &lnksta);
+	if (lnksta & PCI_EXP_LNKSTA_LBMS)
+		pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
+					   PCI_EXP_LNKSTA_LBMS);
 }
--
2.17.1



2024-04-24 09:32:58

by Ilpo Järvinen

Subject: Re: [PATCH] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction

On Wed, 24 Apr 2024, Smita Koralahalli wrote:

> Clear Link Bandwidth Management Status (LBMS) if set, on a hot-remove event.
>
> The hot-remove event could result in a target link speed reduction if
> LBMS is set, due to a delay in the Presence Detect State Change (PDSC)
> happening after the Data Link Layer State Change (DLLSC) event.
>
> In reality, PDSC and DLLSC events rarely arrive simultaneously. The
> PDSC can sometimes be delayed so long that the slot has already been
> powered down just by the DLLSC event. The delayed PDSC could then be
> falsely interpreted as an interrupt raised to turn the slot on. This
> spurious power-on, without a link, forces the kernel to retrain the
> link (if LBMS is set) to a lower speed to re-establish the link,
> thereby bringing down the link speed [2].
>
> From PCIe r6.2 sec 7.5.3.8 [1], it can be derived that LBMS cannot
> be set for an unconnected link; if set, it serves the purpose of
> indicating that there is actually a device down an inactive link.
> However, hardware could have already set LBMS when the device was
> connected to the port, i.e. when the state was DL_Up or DL_Active. Some
> hardware may even have attempted a retrain, entering recovery mode,
> just before transitioning to DL_Down.
>
> Thus, once set, LBMS is never cleared and may force software to drop
> the link speed when there is no link [2].
>
> Dmesg before:
> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
> pcieport 0000:20:01.1: retraining failed
> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>
> Dmesg after:
> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>
> [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
> https://members.pcisig.com/wg/PCI-SIG/document/20590
> [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
>
> Signed-off-by: Smita Koralahalli <[email protected]>
> ---
> 1. Should be based on top of fixes for link retrain status in
> pcie_wait_for_link_delay()
> https://patchwork.kernel.org/project/linux-pci/list/?series=824858
> https://lore.kernel.org/linux-pci/[email protected]/
>
> Without the fixes patch output would be:
> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
> pcieport 0000:20:01.1: retraining failed
> pcieport 0000:20:01.1: pciehp: Slot(59): No device found.

Did you hit the 60 sec delay issue without series 824858? If you've tested
them and the fixes helped your case, could you perhaps give Tested-by for
that series too (in the relevant thread)?

> 2. I initially attempted to wait for both events PDSC and DLLSC to happen
> and then turn on the slot.
> Similar to: https://lore.kernel.org/lkml/[email protected]/
> but before turning on the slot.
>
> Something like:
> - ctrl->state = POWERON_STATE;
> - mutex_unlock(&ctrl->state_lock);
> - if (present)
> + if (present && link_active) {
> + ctrl->state = POWERON_STATE;
> + mutex_unlock(&ctrl->state_lock);
> ctrl_info(ctrl, "Slot(%s): Card present\n",
> slot_name(ctrl));
> - if (link_active)
> ctrl_info(ctrl, "Slot(%s): Link Up\n",
> slot_name(ctrl));
> - ctrl->request_result = pciehp_enable_slot(ctrl);
> - break;
> + ctrl->request_result = pciehp_enable_slot(ctrl);
> + break;
> + }
> + else {
> + mutex_unlock(&ctrl->state_lock);
> + break;
> + }
>
> This would also avoid printing the lines below on a remove event.
> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>
> I understand this would likely be not applicable in places where broken
> devices hardwire PDS to zero and PDSC would never happen. But I'm open to
> making changes if this is more applicable. Because, SW cannot directly
> track the interference of HW in attempting link retrain and setting LBMS.
>
> 3. I tried introducing delay similar to pcie_wait_for_presence() but I
> was not successful in picking the right numbers. Hence hit with the same
> link speed drop.
>
> 4. For some reason I was unable to clear LBMS with:
> pcie_capability_clear_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> PCI_EXP_LNKSTA_LBMS);

LBMS is write-1-to-clear; pcie_capability_clear_word() tries to write 0
there (the accessor doesn't do what you seem to expect: it clears normal
read-write bits, not write-1-to-clear bits).
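To illustrate (a sketch only, not the actual accessor implementation):

	u16 lnksta;

	/*
	 * pcie_capability_clear_word() does a read-modify-write which
	 * ends up writing the LBMS bit back as 0 -- a no-op for a
	 * write-1-to-clear bit:
	 */
	pcie_capability_read_word(ctrl->pcie->port, PCI_EXP_LNKSTA, &lnksta);
	lnksta &= ~PCI_EXP_LNKSTA_LBMS;		/* bit written back as 0 */
	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA, lnksta);

	/* A W1C bit is cleared by writing 1 to that bit position: */
	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
				   PCI_EXP_LNKSTA_LBMS);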

> ---
> drivers/pci/hotplug/pciehp_pci.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> index ad12515a4a12..9155fdfd1d37 100644
> --- a/drivers/pci/hotplug/pciehp_pci.c
> +++ b/drivers/pci/hotplug/pciehp_pci.c
> @@ -92,7 +92,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
> {
> struct pci_dev *dev, *temp;
> struct pci_bus *parent = ctrl->pcie->port->subordinate;
> - u16 command;
> + u16 command, lnksta;
>
> ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
> __func__, pci_domain_nr(parent), parent->number);
> @@ -134,4 +134,10 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
> }
>
> pci_unlock_rescan_remove();
> +
> + /* Clear LBMS on removal */
> + pcie_capability_read_word(ctrl->pcie->port, PCI_EXP_LNKSTA, &lnksta);
> + if (lnksta & PCI_EXP_LNKSTA_LBMS)
> + pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> + PCI_EXP_LNKSTA_LBMS);

It's enough to unconditionally write PCI_EXP_LNKSTA_LBMS; there's no need
to check first. The comment just spells out what can already be read from
the code, so I'd drop the comment.
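That is, the tail of pciehp_unconfigure_device() could become simply
(a sketch of a possible v2; writing 1 to a W1C bit that is already 0 is
harmless):

	pci_unlock_rescan_remove();

	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
				   PCI_EXP_LNKSTA_LBMS);
}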

I agree it makes sense to clear the LBMS when the device is removed.

--
i.


Subject: Re: [PATCH] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction


On 4/23/24 8:33 PM, Smita Koralahalli wrote:
> Clear Link Bandwidth Management Status (LBMS) if set, on a hot-remove event.
>
> The hot-remove event could result in a target link speed reduction if
> LBMS is set, due to a delay in the Presence Detect State Change (PDSC)
> happening after the Data Link Layer State Change (DLLSC) event.

What is the actual impact? Since this happens during hot-remove,
and there is no device, the link retrain will fail, right?

>
> In reality, PDSC and DLLSC events rarely arrive simultaneously. The
> PDSC can sometimes be delayed so long that the slot has already been
> powered down just by the DLLSC event. The delayed PDSC could then be
> falsely interpreted as an interrupt raised to turn the slot on. This
> spurious power-on, without a link, forces the kernel to retrain the
> link (if LBMS is set) to a lower speed to re-establish the link,
> thereby bringing down the link speed [2].
>
> From PCIe r6.2 sec 7.5.3.8 [1], it can be derived that LBMS cannot
> be set for an unconnected link; if set, it serves the purpose of

I did not find this detail in the spec. Can you copy-paste the lines
in the spec that mention the case where it cannot be set on an
unconnected link? All I see are the cases where it can be set.

> indicating that there is actually a device down an inactive link.
> However, hardware could have already set LBMS when the device was
> connected to the port, i.e. when the state was DL_Up or DL_Active. Some

Isn't LBMS only set during the DL_Down transition? Why would it be
set during DL_Up?

> hardware may even have attempted a retrain, entering recovery mode,
> just before transitioning to DL_Down.
>
> Thus, once set, LBMS is never cleared and may force software to drop
> the link speed when there is no link [2].
>
> Dmesg before:
> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
> pcieport 0000:20:01.1: retraining failed
> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>
> Dmesg after:
> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>
> [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
> https://members.pcisig.com/wg/PCI-SIG/document/20590
> [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
>
> Signed-off-by: Smita Koralahalli <[email protected]>
> ---
> 1. Should be based on top of fixes for link retrain status in
> pcie_wait_for_link_delay()
> https://patchwork.kernel.org/project/linux-pci/list/?series=824858
> https://lore.kernel.org/linux-pci/[email protected]/
>
> Without the fixes patch output would be:
> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
> pcieport 0000:20:01.1: retraining failed
> pcieport 0000:20:01.1: pciehp: Slot(59): No device found.
>
> 2. I initially attempted to wait for both events PDSC and DLLSC to happen
> and then turn on the slot.
> Similar to: https://lore.kernel.org/lkml/[email protected]/
> but before turning on the slot.
>
> Something like:
> - ctrl->state = POWERON_STATE;
> - mutex_unlock(&ctrl->state_lock);
> - if (present)
> + if (present && link_active) {
> + ctrl->state = POWERON_STATE;
> + mutex_unlock(&ctrl->state_lock);
> ctrl_info(ctrl, "Slot(%s): Card present\n",
> slot_name(ctrl));
> - if (link_active)
> ctrl_info(ctrl, "Slot(%s): Link Up\n",
> slot_name(ctrl));
> - ctrl->request_result = pciehp_enable_slot(ctrl);
> - break;
> + ctrl->request_result = pciehp_enable_slot(ctrl);
> + break;
> + }
> + else {
> + mutex_unlock(&ctrl->state_lock);
> + break;
> + }
>
> This would also avoid printing the lines below on a remove event.
> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>
> I understand this would likely be not applicable in places where broken
> devices hardwire PDS to zero and PDSC would never happen. But I'm open to
> making changes if this is more applicable. Because, SW cannot directly
> track the interference of HW in attempting link retrain and setting LBMS.
>
> 3. I tried introducing delay similar to pcie_wait_for_presence() but I
> was not successful in picking the right numbers. Hence hit with the same
> link speed drop.
>
> 4. For some reason I was unable to clear LBMS with:
> pcie_capability_clear_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> PCI_EXP_LNKSTA_LBMS);
> ---
> drivers/pci/hotplug/pciehp_pci.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> index ad12515a4a12..9155fdfd1d37 100644
> --- a/drivers/pci/hotplug/pciehp_pci.c
> +++ b/drivers/pci/hotplug/pciehp_pci.c
> @@ -92,7 +92,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
> {
> struct pci_dev *dev, *temp;
> struct pci_bus *parent = ctrl->pcie->port->subordinate;
> - u16 command;
> + u16 command, lnksta;
>
> ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
> __func__, pci_domain_nr(parent), parent->number);
> @@ -134,4 +134,10 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
> }
>
> pci_unlock_rescan_remove();
> +
> + /* Clear LBMS on removal */
> + pcie_capability_read_word(ctrl->pcie->port, PCI_EXP_LNKSTA, &lnksta);
> + if (lnksta & PCI_EXP_LNKSTA_LBMS)
> + pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> + PCI_EXP_LNKSTA_LBMS);
> }

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


2024-04-25 20:24:36

by Smita Koralahalli

Subject: Re: [PATCH] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction

Hi Ilpo,

On 4/24/2024 2:32 AM, Ilpo Järvinen wrote:
> On Wed, 24 Apr 2024, Smita Koralahalli wrote:
>
>> Clear Link Bandwidth Management Status (LBMS) if set, on a hot-remove event.
>>
>> The hot-remove event could result in a target link speed reduction if
>> LBMS is set, due to a delay in the Presence Detect State Change (PDSC)
>> happening after the Data Link Layer State Change (DLLSC) event.
>>
>> In reality, PDSC and DLLSC events rarely arrive simultaneously. The
>> PDSC can sometimes be delayed so long that the slot has already been
>> powered down just by the DLLSC event. The delayed PDSC could then be
>> falsely interpreted as an interrupt raised to turn the slot on. This
>> spurious power-on, without a link, forces the kernel to retrain the
>> link (if LBMS is set) to a lower speed to re-establish the link,
>> thereby bringing down the link speed [2].
>>
>> From PCIe r6.2 sec 7.5.3.8 [1], it can be derived that LBMS cannot
>> be set for an unconnected link; if set, it serves the purpose of
>> indicating that there is actually a device down an inactive link.
>> However, hardware could have already set LBMS when the device was
>> connected to the port, i.e. when the state was DL_Up or DL_Active. Some
>> hardware may even have attempted a retrain, entering recovery mode,
>> just before transitioning to DL_Down.
>>
>> Thus, once set, LBMS is never cleared and may force software to drop
>> the link speed when there is no link [2].
>>
>> Dmesg before:
>> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
>> pcieport 0000:20:01.1: retraining failed
>> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>
>> Dmesg after:
>> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>
>> [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
>> https://members.pcisig.com/wg/PCI-SIG/document/20590
>> [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
>>
>> Signed-off-by: Smita Koralahalli <[email protected]>
>> ---
>> 1. Should be based on top of fixes for link retrain status in
>> pcie_wait_for_link_delay()
>> https://patchwork.kernel.org/project/linux-pci/list/?series=824858
>> https://lore.kernel.org/linux-pci/[email protected]/
>>
>> Without the fixes patch output would be:
>> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
>> pcieport 0000:20:01.1: retraining failed
>> pcieport 0000:20:01.1: pciehp: Slot(59): No device found.
>
> Did you hit the 60 sec delay issue without series 824858? If you've tested
> them and the fixes helped your case, could you perhaps give Tested-by for
> that series too (in the relevant thread)?

I'm assuming the 60s delay issue is from pci_dev_wait()?

Correct me if I'm wrong.
I think series 824858 potentially fixes the bug in two different places.
What you are seeing is in the suspend/resume path, called from the calls
below.

pci_pm_runtime_resume()
  pci_pm_bridge_power_up_actions()
    pci_bridge_wait_for_secondary_bus()
      pcie_wait_for_link_delay()
        pcie_failed_link_retrain()
          pci_dev_wait()

But series 824858 helped me by properly returning an error code from
pcie_wait_for_link_delay(), and also by avoiding the 100ms delay inside
pcie_wait_for_link_delay() and probably the timeout in
pcie_wait_for_presence().

The sequence of operations I'm looking at happens after a PDSC event,
as below:
pciehp_handle_presence_or_link_change()
  pciehp_enable_slot()
    __pciehp_enable_slot()
      board_added()
        pciehp_check_link_status()
          pcie_wait_for_link()
            pcie_wait_for_link_delay()
              pcie_failed_link_retrain()

pcie_failed_link_retrain() would initially return false on a "failed
link retrain" attempt, which would cause pcie_wait_for_link_delay() and
pcie_wait_for_link() to erroneously succeed, thereby unnecessarily
proceeding with the other checks in pciehp_check_link_status().

Series 824858 fixes the bug by properly returning an error code.

However, I had missed looking at your patchset when I initially wrote
this. From the patch below I see you have addressed clearing LBMS as
well, but at a different place. I didn't understand why it was dropped.

https://lore.kernel.org/all/[email protected]/

From what I understood while experimenting, tracking the set/unset
behavior of LBMS is hard, as HW has the right to attempt a retrain at
any point except when the status is DL_Down, per the statement from the
spec: "This bit is Set by hardware to indicate that either of the
following has occurred without the Port transitioning through DL_Down
status".

LBMS set while the port was not in DL_Down is never unset after the
port transitions to DL_Down. So I think clearing it once the port status
is DL_Down (which I assume happens after the DLLSC interrupt fires to
bring the slot down) keeps it cleared only while the port remains in the
DL_Down state. As soon as the port transitions to other states (I don't
know how SW could track the states), there is no guarantee that the bit
is still clear, as HW might have attempted a retrain.

The only way would be to track the DL_Down and Active/Up states. At the
moment I don't know how to do that, or if it is possible to do in SW in
the first place. Hence, I'm more inclined to push [2] below as a fix for
the link speed drop. However, that has some complexities as well :(

After talking to HW folks, one place where our HW sets LBMS is:

Device is removed.
DL_Up doesn't immediately change.
HW will just see errors on the receiver and doesn't automatically
know that it was because of a hot remove, and tries to recover the
link.
LBMS gets set here as a rate change is attempted.
DL_Down occurs after this.

Once DL_Down occurs, nobody is supposed to set the bit. But the set bit
is never cleared, and that is creating all the issues.

HW mostly considers other parameters and LTSSM behaviors in
transitioning between the Active/Up and Down states, and I'm not sure
at the moment how much of that is transparent to the OS. :/

>
>> 2. I initially attempted to wait for both events PDSC and DLLSC to happen
>> and then turn on the slot.
>> Similar to: https://lore.kernel.org/lkml/[email protected]/
>> but before turning on the slot.
>>
>> Something like:
>> - ctrl->state = POWERON_STATE;
>> - mutex_unlock(&ctrl->state_lock);
>> - if (present)
>> + if (present && link_active) {
>> + ctrl->state = POWERON_STATE;
>> + mutex_unlock(&ctrl->state_lock);
>> ctrl_info(ctrl, "Slot(%s): Card present\n",
>> slot_name(ctrl));
>> - if (link_active)
>> ctrl_info(ctrl, "Slot(%s): Link Up\n",
>> slot_name(ctrl));
>> - ctrl->request_result = pciehp_enable_slot(ctrl);
>> - break;
>> + ctrl->request_result = pciehp_enable_slot(ctrl);
>> + break;
>> + }
>> + else {
>> + mutex_unlock(&ctrl->state_lock);
>> + break;
>> + }
>>
>> This would also avoid printing the lines below on a remove event.
>> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>
>> I understand this would likely be not applicable in places where broken
>> devices hardwire PDS to zero and PDSC would never happen. But I'm open to
>> making changes if this is more applicable. Because, SW cannot directly
>> track the interference of HW in attempting link retrain and setting LBMS.
>>
>> 3. I tried introducing delay similar to pcie_wait_for_presence() but I
>> was not successful in picking the right numbers. Hence hit with the same
>> link speed drop.
>>
>> 4. For some reason I was unable to clear LBMS with:
>> pcie_capability_clear_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>> PCI_EXP_LNKSTA_LBMS);
>
> LBMS is write-1-to-clear; pcie_capability_clear_word() tries to write 0
> there (the accessor doesn't do what you seem to expect: it clears normal
> read-write bits, not write-1-to-clear bits).

Got it thanks!

>
>> ---
>> drivers/pci/hotplug/pciehp_pci.c | 8 +++++++-
>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
>> index ad12515a4a12..9155fdfd1d37 100644
>> --- a/drivers/pci/hotplug/pciehp_pci.c
>> +++ b/drivers/pci/hotplug/pciehp_pci.c
>> @@ -92,7 +92,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
>> {
>> struct pci_dev *dev, *temp;
>> struct pci_bus *parent = ctrl->pcie->port->subordinate;
>> - u16 command;
>> + u16 command, lnksta;
>>
>> ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
>> __func__, pci_domain_nr(parent), parent->number);
>> @@ -134,4 +134,10 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
>> }
>>
>> pci_unlock_rescan_remove();
>> +
>> + /* Clear LBMS on removal */
>> + pcie_capability_read_word(ctrl->pcie->port, PCI_EXP_LNKSTA, &lnksta);
>> + if (lnksta & PCI_EXP_LNKSTA_LBMS)
>> + pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>> + PCI_EXP_LNKSTA_LBMS);
>
> It's enough to unconditionally write PCI_EXP_LNKSTA_LBMS; there's no
> need to check first. The comment just spells out what can already be
> read from the code, so I'd drop the comment.

Sure, I will make the changes when I send v2, if we decide to address
it this way. :)

>
> I agree it makes sense to clear the LBMS when the device is removed.

Thank you!
>

Thanks,
Smita

2024-04-25 20:49:47

by Smita Koralahalli

Subject: Re: [PATCH] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction

Hi,

On 4/24/2024 8:10 PM, Kuppuswamy Sathyanarayanan wrote:
>
> On 4/23/24 8:33 PM, Smita Koralahalli wrote:
>> Clear Link Bandwidth Management Status (LBMS) if set, on a hot-remove event.
>>
>> The hot-remove event could result in a target link speed reduction if
>> LBMS is set, due to a delay in the Presence Detect State Change (PDSC)
>> happening after the Data Link Layer State Change (DLLSC) event.
>
> What is the actual impact? Since this happens during hot-remove,
> and there is no device, the link retrain will fail, right?

That's right. The link retrain fails, resulting in reduced link speed.
It shouldn't attempt a link retrain in the first place if there is no
device.

>
>>
>> In reality, PDSC and DLLSC events rarely arrive simultaneously. The
>> PDSC can sometimes be delayed so long that the slot has already been
>> powered down just by the DLLSC event. The delayed PDSC could then be
>> falsely interpreted as an interrupt raised to turn the slot on. This
>> spurious power-on, without a link, forces the kernel to retrain the
>> link (if LBMS is set) to a lower speed to re-establish the link,
>> thereby bringing down the link speed [2].
>>
>> From PCIe r6.2 sec 7.5.3.8 [1], it can be derived that LBMS cannot
>> be set for an unconnected link; if set, it serves the purpose of
>
> I did not find this detail in the spec. Can you copy-paste the lines
> in the spec that mention the case where it cannot be set on an
> unconnected link? All I see are the cases where it can be set.

This is the only line in the Spec: "Link Bandwidth Management Status -
This bit is Set by hardware to indicate that either of the following has
occurred without the Port transitioning through DL_Down status".

I have mostly written down inferences from that statement. And part of
the inferences (probably all of them :)) I picked up from Bjorn's
statements in the link below:

https://lore.kernel.org/all/[email protected]/

Repasting:

"...LBMS cannot be set for an unconnected link, because the bit is only
allowed to be set for an event observed that has happened with a port
reporting no DL_Down status at any time throughout the event, which can
only happen with the physical layer up, which of course cannot happen
for an unconnected link....
...
.IOW the LBMS bit serves the purpose of indicating that there is
actually a device down an inactive link ...."

>
>> indicating that there is actually a device down an inactive link.
>> However, hardware could have already set LBMS when the device was
>> connected to the port, i.e. when the state was DL_Up or DL_Active. Some
>
> Isn't LBMS only set during the DL_Down transition? Why would it be
> set during DL_Up?

The statement in the spec is very confusing :/ LBMS can only be set
while the state is not DL_Down, but it could already have been set
before the port enters DL_Down.

I tried to collect some points here:
https://lore.kernel.org/linux-pci/[email protected]/

Hoping I'm on the right track!

Thanks
Smita

>
>> hardware may even have attempted a retrain, entering recovery mode,
>> just before transitioning to DL_Down.
>>
>> Thus, once set, LBMS is never cleared and may force software to drop
>> the link speed when there is no link [2].
>>
>> Dmesg before:
>> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
>> pcieport 0000:20:01.1: retraining failed
>> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>
>> Dmesg after:
>> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>
>> [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
>> https://members.pcisig.com/wg/PCI-SIG/document/20590
>> [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
>>
>> Signed-off-by: Smita Koralahalli <[email protected]>
>> ---
>> 1. Should be based on top of fixes for link retrain status in
>> pcie_wait_for_link_delay()
>> https://patchwork.kernel.org/project/linux-pci/list/?series=824858
>> https://lore.kernel.org/linux-pci/[email protected]/
>>
>> Without the fixes patch output would be:
>> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
>> pcieport 0000:20:01.1: retraining failed
>> pcieport 0000:20:01.1: pciehp: Slot(59): No device found.
>>
>> 2. I initially attempted to wait for both events PDSC and DLLSC to happen
>> and then turn on the slot.
>> Similar to: https://lore.kernel.org/lkml/[email protected]/
>> but before turning on the slot.
>>
>> Something like:
>> - ctrl->state = POWERON_STATE;
>> - mutex_unlock(&ctrl->state_lock);
>> - if (present)
>> + if (present && link_active) {
>> + ctrl->state = POWERON_STATE;
>> + mutex_unlock(&ctrl->state_lock);
>> ctrl_info(ctrl, "Slot(%s): Card present\n",
>> slot_name(ctrl));
>> - if (link_active)
>> ctrl_info(ctrl, "Slot(%s): Link Up\n",
>> slot_name(ctrl));
>> - ctrl->request_result = pciehp_enable_slot(ctrl);
>> - break;
>> + ctrl->request_result = pciehp_enable_slot(ctrl);
>> + break;
>> + }
>> + else {
>> + mutex_unlock(&ctrl->state_lock);
>> + break;
>> + }
>>
>> This would also avoid printing the lines below on a remove event.
>> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>
>> I understand this would likely be not applicable in places where broken
>> devices hardwire PDS to zero and PDSC would never happen. But I'm open to
>> making changes if this is more applicable. Because, SW cannot directly
>> track the interference of HW in attempting link retrain and setting LBMS.
>>
>> 3. I tried introducing delay similar to pcie_wait_for_presence() but I
>> was not successful in picking the right numbers. Hence hit with the same
>> link speed drop.
>>
>> 4. For some reason I was unable to clear LBMS with:
>> pcie_capability_clear_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>> PCI_EXP_LNKSTA_LBMS);
>> ---
>> drivers/pci/hotplug/pciehp_pci.c | 8 +++++++-
>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
>> index ad12515a4a12..9155fdfd1d37 100644
>> --- a/drivers/pci/hotplug/pciehp_pci.c
>> +++ b/drivers/pci/hotplug/pciehp_pci.c
>> @@ -92,7 +92,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
>> {
>> struct pci_dev *dev, *temp;
>> struct pci_bus *parent = ctrl->pcie->port->subordinate;
>> - u16 command;
>> + u16 command, lnksta;
>>
>> ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
>> __func__, pci_domain_nr(parent), parent->number);
>> @@ -134,4 +134,10 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
>> }
>>
>> pci_unlock_rescan_remove();
>> +
>> + /* Clear LBMS on removal */
>> + pcie_capability_read_word(ctrl->pcie->port, PCI_EXP_LNKSTA, &lnksta);
>> + if (lnksta & PCI_EXP_LNKSTA_LBMS)
>> + pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>> + PCI_EXP_LNKSTA_LBMS);
>> }
>

2024-04-26 12:52:53

by Ilpo Järvinen

Subject: Re: [PATCH] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction

On Thu, 25 Apr 2024, Smita Koralahalli wrote:
> On 4/24/2024 2:32 AM, Ilpo Järvinen wrote:
> > On Wed, 24 Apr 2024, Smita Koralahalli wrote:
> >
> > > Clear Link Bandwidth Management Status (LBMS) if set, on a hot-remove
> > > event.
> > >
> > > The hot-remove event could result in a target link speed reduction if
> > > LBMS is set, due to a delay in the Presence Detect State Change (PDSC)
> > > happening after the Data Link Layer State Change (DLLSC) event.
> > >
> > > In reality, PDSC and DLLSC events rarely arrive simultaneously. The
> > > PDSC can sometimes be delayed so long that the slot has already been
> > > powered down just by the DLLSC event. The delayed PDSC could then be
> > > falsely interpreted as an interrupt raised to turn the slot on. This
> > > spurious power-on, without a link, forces the kernel to retrain the
> > > link (if LBMS is set) to a lower speed to re-establish the link,
> > > thereby bringing down the link speed [2].
> > >
> > > From PCIe r6.2 sec 7.5.3.8 [1], it can be derived that LBMS cannot
> > > be set for an unconnected link; if set, it serves the purpose of
> > > indicating that there is actually a device down an inactive link.
> > > However, hardware could have already set LBMS when the device was
> > > connected to the port, i.e. when the state was DL_Up or DL_Active. Some
> > > hardware may even have attempted a retrain, entering recovery mode,
> > > just before transitioning to DL_Down.
> > >
> > > Thus, once set, LBMS is never cleared and may force software to drop
> > > the link speed when there is no link [2].
> > >
> > > Dmesg before:
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> > > pcieport 0000:20:01.1: broken device, retraining non-functional
> > > downstream link at 2.5GT/s
> > > pcieport 0000:20:01.1: retraining failed
> > > pcieport 0000:20:01.1: pciehp: Slot(59): No link
> > >
> > > Dmesg after:
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> > > pcieport 0000:20:01.1: pciehp: Slot(59): No link
> > >
> > > [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
> > > https://members.pcisig.com/wg/PCI-SIG/document/20590
> > > [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
> > >
> > > Signed-off-by: Smita Koralahalli <[email protected]>
> > > ---
> > > 1. Should be based on top of fixes for link retrain status in
> > > pcie_wait_for_link_delay()
> > > https://patchwork.kernel.org/project/linux-pci/list/?series=824858
> > > https://lore.kernel.org/linux-pci/[email protected]/
> > >
> > > Without the fixes patch output would be:
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> > > pcieport 0000:20:01.1: broken device, retraining non-functional
> > > downstream link at 2.5GT/s
> > > pcieport 0000:20:01.1: retraining failed
> > > pcieport 0000:20:01.1: pciehp: Slot(59): No device found.
> >
> > Did you hit the 60 sec delay issue without series 824858? If you've tested
> > them and the fixes helped your case, could you perhaps give Tested-by for
> > that series too (in the relevant thread)?
>
> I'm assuming the 60s delay issue is from pci_dev_wait()?

Yes.

> Correct me if I'm wrong.
> I think series 824858 potentially fixes the bug at two different places. What
> you are seeing is at suspend/resume operation called from the calls below.
>
> pci_pm_runtime_resume()
> pci_pm_bridge_power_up_actions()
> pci_bridge_wait_for_secondary_bus()
> pcie_wait_for_link_delay()
> pcie_failed_link_retrain()
> pci_dev_wait()
>
> But series 824858 helped me in properly returning an error code from
> pcie_wait_for_link_delay() and also avoiding the 100ms delay inside
> pcie_wait_for_link_delay() and probably the timeout in
> pcie_wait_for_presence()..
>
> The sequence of operations which I'm looking at is after an PDSC event as
> below:
> pciehp_handle_presence_or_link_change()
> pciehp_enable_slot()
> __pciehp_enable_slot()
> board_added()
> pciehp_check_link_status()
> pcie_wait_for_link()
> pcie_wait_for_link_delay()
> pcie_failed_link_retrain()
>
> pcie_failed_link_retrain() would initially return false on a "failed link
> retrain" attempt which would make pcie_wait_for_link_delay() and
> pcie_wait_for_link() to erroneously succeed thereby unnecessarily proceeding
> with other checks in pciehp_check_link_status().
>
> Series 824858 fixes the bug by properly returning an error code.

Yes, that series fixes other bugs too. I wasn't aware that the
incorrect return value impacted hotplug as well.

> However, I had missed looking at your patchset when I initially wrote
> this. From the patch below I see you have addressed clearing LBMS as well but
> at a different place. But I didn't understand why was it dropped.
>
> https://lore.kernel.org/all/[email protected]/

That change wasn't a perfect solution given how the target speed quirk
works. Unconditional clearing might have broken the quirk, so localizing
the LBMS clearing to the cases where the device is actually gone would
be the best approach.

For normal hotplug, it's relatively easy to clear LBMS, as demonstrated
by your patch (I have considered putting it somewhere in remove.c, for
the case where the device is in a disconnected state while removing
function 0, but that is roughly the same idea as in your patch).
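Roughly along these lines (only a sketch of that remove.c idea; the
exact placement and the function 0 check are assumptions, not a tested
patch):

	/*
	 * Sketch: when function 0 of a disconnected device is removed,
	 * clear the stale LBMS in the upstream port's Link Status.
	 */
	if (pci_dev_is_disconnected(dev) && PCI_FUNC(dev->devfn) == 0) {
		struct pci_dev *bridge = pci_upstream_bridge(dev);

		if (bridge)
			pcie_capability_write_word(bridge, PCI_EXP_LNKSTA,
						   PCI_EXP_LNKSTA_LBMS);
	}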

The additional problem with the suspend/resume cycle is that the device
can disappear while suspended and we only notice the disconnection after
hotplug has resumed. This occurs only after full portdrv resume because
it requires an interrupt, which is too late to avoid the 60 sec delay.
Series 824858 avoids the 60 sec delay even for suspend/resume, but if
LBMS could be cleared in time, it would help things further because the
quirk wouldn't even trigger; or at minimum, the wait delays could be
short-circuited if the device disconnection is noticed while the wait is
in progress.

> From what I understood while experimenting is, tracking the set/unset behavior
> of LBMS is hard, as HW has the right to attempt retrain at any point except
> when the status is DL_Down as per the statement from the the SPEC: "This bit
> is Set by hardware to indicate that either of the following has occurred
> without the Port transitioning through DL_Down status".

It's actually not that hard. In the medium term, I'm trying to get the
PCIe BW controller into the kernel (about to send a new version of it,
probably next week), which will keep clearing LBMS by enabling BW
notifications, which are the SW/OS way to track LBMS assertions.
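The core idea is something like this (a sketch only, where port is the
downstream port's pci_dev; the actual bwctrl series does more than this):

	/*
	 * Enable bandwidth notifications so that every LBMS assertion
	 * raises an interrupt...
	 */
	pcie_capability_set_word(port, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LBMIE);

	/* ...and have the notification handler clear the W1C bit: */
	pcie_capability_write_word(port, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);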

> The set LBMS when the port was not in DL_Down, is never unset after the Port
> is transitioned to DL_Down. So, I think clearing it after the port status is
> DL_Down (which I assume happens after DLLSC interrupt fires to bring the slot
> down) makes it remain cleared only until the port remains at DL_Down state. As
> soon as the port transitions to other states (I don't know how SW could track
> the states) there is no guarantee that the bit is still clear as HW might have
> attempted retrain.
>
> The only way would be to track the DL_Down and Active/Up states. I at the
> moment don't know how to do it or if it is possible to do it in SW in the
> first place. Hence, I'm more inclined pushing [2] below as a fix for link
> speed drop. However, that has some complexities as well :(
>
> After talking to HW folks one place where our HW sets LBMS is:
>
> Device is removed.
> DL_Up doesn't immediately change.
> HW will just see errors on the receiver and doesn't automatically
> know that it was because of a hot remove, and tries to recover the
> link.
> LBMS gets set here as rate change is attempted.

That's interesting. I had earlier thought it's only set when the rate
change was actually successful, but now that I reread the spec, it seems
that isn't even meant to be the case here.

> DL_Down occurs after this.
>
> Once DL_Down occurs nobody is supposed to set the bit. But the set bit is
> never cleared and is creating all issues.
>
> HW mostly considers other parameters and LTSSM behaviors in transitioning
> between Active/Up to Down states which I'm not sure at the moment how much of
> it is transparent to OS. :/
>
> > > @@ -134,4 +134,10 @@ void pciehp_unconfigure_device(struct controller
> > > *ctrl, bool presence)
> > > }
> > > pci_unlock_rescan_remove();
> > > +
> > > + /* Clear LBMS on removal */
> > > + pcie_capability_read_word(ctrl->pcie->port, PCI_EXP_LNKSTA, &lnksta);
> > > + if (lnksta & PCI_EXP_LNKSTA_LBMS)
> > > + pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> > > + PCI_EXP_LNKSTA_LBMS);
> >
> > It's enough to unconditionally write PCI_EXP_LNKSTA_LBMS; there's no
> > need to check first. The comment just spells out what can already be
> > read from the code, so I'd drop the comment.
>
> Sure, I will make changes once I send v2 and if we consider to address it this
> way.. :)
>
> >
> > I agree it makes sense to clear the LBMS when the device is removed.
>
> Thank you!

As I said, I think it's the correct thing to do when we know the device
is gone, and it shouldn't interfere with the target speed quirk, unlike
my clear-LBMS-on-resume attempt.

This won't help the disconnected-while-suspended case for now, but it's
still a step in the right direction even for that.

--
i.

2024-05-06 07:38:15

by Lukas Wunner

Subject: Re: [PATCH] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction

On Wed, Apr 24, 2024 at 03:33:39AM +0000, Smita Koralahalli wrote:
> Clear Link Bandwidth Management Status (LBMS) if set, on a hot-remove event.
>
> The hot-remove event could result in a target link speed reduction if
> LBMS is set, due to a delay in the Presence Detect State Change (PDSC)
> happening after the Data Link Layer State Change (DLLSC) event.
>
> In reality, PDSC and DLLSC events rarely arrive simultaneously. The
> PDSC can sometimes be delayed so long that the slot has already been
> powered down just by the DLLSC event. The delayed PDSC could then be
> falsely interpreted as an interrupt raised to turn the slot on. This
> spurious power-on, without a link, forces the kernel to retrain the
> link (if LBMS is set) to a lower speed to re-establish the link,
> thereby bringing down the link speed [2].
>
> From PCIe r6.2 sec 7.5.3.8 [1], it can be derived that LBMS cannot
> be set for an unconnected link; if set, it serves the purpose of
> indicating that there is actually a device down an inactive link.
> However, hardware could have already set LBMS when the device was
> connected to the port, i.e. when the state was DL_Up or DL_Active. Some
> hardware may even have attempted a retrain, entering recovery mode,
> just before transitioning to DL_Down.
>
> Thus, once set, LBMS is never cleared and may force software to drop
> the link speed when there is no link [2].

Is this an issue introduced by commit a89c82249c37 ("PCI: Work around
PCIe link training failures")?

If so, could you add a Fixes tag?

I might be mistaken but my impression is that we're seeing a lot of
fallout as a result of that commit.

Thanks,

Lukas