2022-02-02 16:03:42

by Vidya Sagar

Subject: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

Previously, the ASPM L1 Substates control registers (CTL1 and CTL2) were not
saved and restored during suspend/resume, so the L1 Substates configuration
was lost post-resume.

Save the L1 Substates control registers so that the configuration is
retained post-resume.

Signed-off-by: Vidya Sagar <[email protected]>
---
Hi,
A similar patch was merged in the past as commit 4257f7e008ea
("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") but was later
reverted in commit 40fb68c7725a after Kenneth R. Crudup
<[email protected]> reported disk I/O errors caused by it. I couldn't spend much
time debugging the issue back then, but taking a fresh look at it now, it
seems more like an issue with the platform in question than with this change itself.
The reason is that there are other devices that support ASPM L1 Sub-States
on that platform (as observed in the lspci output mentioned at
https://lore.kernel.org/linux-pci/[email protected]/ )
and, assuming that L1 Sub-States are indeed enabled for those devices, there
are no issues reported from any of those devices except the NVMe drive.
When it comes to the NVMe driver, the code at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c?h=v5.17-rc2#n3008
has some quirks for some of the models from Dell Inc., and I'm wondering if
the model on which the issue was observed might need a quirk of its own.

So,
Kenneth R. Crudup <[email protected]>,
could you please try this patch on top of linux-next and collect more info?
- 'sudo lspci -vvvv' output before and after hibernation
- Could you please confirm that this patch works for non-NVMe devices that
support L1 Sub-States?
- Could you please try the "NVME_QUIRK_NO_DEEPEST_PS" and "NVME_QUIRK_SIMPLE_SUSPEND"
quirks (one at a time) in the check_vendor_combination_bug() function and see if it
makes any difference? (A rough sketch of such a quirk follows below.)
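
For illustration only (not part of this patch), a minimal sketch of how such a
quirk could be hooked into check_vendor_combination_bug() in
drivers/nvme/host/pci.c; the helper name and the DMI strings are placeholders
that would need to match the actual machine:

static unsigned long dell_l1ss_quirk(struct pci_dev *pdev)
{
	/* Placeholder DMI strings; check /sys/class/dmi/id/ on the laptop. */
	if (dmi_match(DMI_SYS_VENDOR, "Dell Inc.") &&
	    dmi_match(DMI_PRODUCT_NAME, "XPS 13 2-in-1 (placeholder)"))
		return NVME_QUIRK_NO_DEEPEST_PS; /* or NVME_QUIRK_SIMPLE_SUSPEND */

	return 0;
}

The returned flag would then be OR-ed into the quirks that
check_vendor_combination_bug() already reports for the matching device.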

Thanks & Regards,
Vidya Sagar

drivers/pci/pci.c | 7 +++++++
drivers/pci/pci.h | 4 ++++
drivers/pci/pcie/aspm.c | 44 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 55 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 9ecce435fb3f..75a8b264ddac 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1617,6 +1617,7 @@ int pci_save_state(struct pci_dev *dev)
return i;

pci_save_ltr_state(dev);
+ pci_save_aspm_l1ss_state(dev);
pci_save_dpc_state(dev);
pci_save_aer_state(dev);
pci_save_ptm_state(dev);
@@ -1723,6 +1724,7 @@ void pci_restore_state(struct pci_dev *dev)
* LTR itself (in the PCIe capability).
*/
pci_restore_ltr_state(dev);
+ pci_restore_aspm_l1ss_state(dev);

pci_restore_pcie_state(dev);
pci_restore_pasid_state(dev);
@@ -3430,6 +3432,11 @@ void pci_allocate_cap_save_buffers(struct pci_dev *dev)
if (error)
pci_err(dev, "unable to allocate suspend buffer for LTR\n");

+ error = pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_L1SS,
+ 2 * sizeof(u32));
+ if (error)
+ pci_err(dev, "unable to allocate suspend buffer for ASPM-L1SS\n");
+
pci_allocate_vc_save_buffers(dev);
}

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 3d60cabde1a1..5de1cfe07749 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -562,11 +562,15 @@ void pcie_aspm_init_link_state(struct pci_dev *pdev);
void pcie_aspm_exit_link_state(struct pci_dev *pdev);
void pcie_aspm_pm_state_change(struct pci_dev *pdev);
void pcie_aspm_powersave_config_link(struct pci_dev *pdev);
+void pci_save_aspm_l1ss_state(struct pci_dev *dev);
+void pci_restore_aspm_l1ss_state(struct pci_dev *dev);
#else
static inline void pcie_aspm_init_link_state(struct pci_dev *pdev) { }
static inline void pcie_aspm_exit_link_state(struct pci_dev *pdev) { }
static inline void pcie_aspm_pm_state_change(struct pci_dev *pdev) { }
static inline void pcie_aspm_powersave_config_link(struct pci_dev *pdev) { }
+static inline void pci_save_aspm_l1ss_state(struct pci_dev *dev) { }
+static inline void pci_restore_aspm_l1ss_state(struct pci_dev *dev) { }
#endif

#ifdef CONFIG_PCIE_ECRC
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index a96b7424c9bc..2c29fdd20059 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -726,6 +726,50 @@ static void pcie_config_aspm_l1ss(struct pcie_link_state *link, u32 state)
PCI_L1SS_CTL1_L1SS_MASK, val);
}

+void pci_save_aspm_l1ss_state(struct pci_dev *dev)
+{
+ int aspm_l1ss;
+ struct pci_cap_saved_state *save_state;
+ u32 *cap;
+
+ if (!pci_is_pcie(dev))
+ return;
+
+ aspm_l1ss = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_L1SS);
+ if (!aspm_l1ss)
+ return;
+
+ save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_L1SS);
+ if (!save_state)
+ return;
+
+ cap = (u32 *)&save_state->cap.data[0];
+ pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, cap++);
+ pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, cap++);
+}
+
+void pci_restore_aspm_l1ss_state(struct pci_dev *dev)
+{
+ int aspm_l1ss;
+ struct pci_cap_saved_state *save_state;
+ u32 *cap;
+
+ if (!pci_is_pcie(dev))
+ return;
+
+ aspm_l1ss = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_L1SS);
+ if (!aspm_l1ss)
+ return;
+
+ save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_L1SS);
+ if (!save_state)
+ return;
+
+ cap = (u32 *)&save_state->cap.data[0];
+ pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, *cap++);
+ pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, *cap++);
+}
+
static void pcie_config_aspm_dev(struct pci_dev *pdev, u32 val)
{
pcie_capability_clear_and_set_word(pdev, PCI_EXP_LNKCTL,
--
2.17.1


2022-02-05 06:53:45

by Kenneth Crudup

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume


On Fri, 4 Feb 2022, Bjorn Helgaas wrote:

> Do we have a theory on what might have caused the regression before?
> Or at least something that is different this time around?

If you'd like, I could try re-applying the previous problem commit or your
attempted fix on top of Linus' master, to see if something was fixed somewhere
else in the PCIe subsystem; but if you think it's not worthwhile, I'm satisfied
with the current fix (or, probably more exactly for my particular machine, the
lack of regression).

-Kenny

--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA

2022-02-05 18:57:08

by Kenneth Crudup

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume


> > If you'd like, I could try re-applying the previous problem commit or your
> > attempted fix on top of Linus' master if you'd like to see if something was
> > fixed somewhere else in the PCIe subsystem, but if you think it's not worth-
> > while I'm satisfied with the current fix (or probably more-exactly for my
> > particular machine, lack of regression).

On Sat, 5 Feb 2022, Vidya Sagar wrote:

> That would be a good starting point to understand it better. In fact if the
> previous problematic patch works fine on master, then, we are sure that
> something in the sub-system would have fixed the issue.

So this is my report of the regression I'd found with Bjorn's original commit:
----
Date: Fri, 25 Dec 2020 16:38:56
From: Kenneth R. Crudup <[email protected]>
To: [email protected]
Cc: [email protected]
Subject: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume
failures

I've been running Linus' master branch on my laptop (Dell XPS 13 2-in-1). With
this commit in place, after resuming from hibernate my machine is essentially
useless, with a torrent of disk I/O errors on my NVMe device (at least, and
possibly other devices affected) until a reboot.

I do use tlp to set the PCIe ASPM to "performance" on AC and "powersupersave"
on battery.

Let me know if you need more information.
----

I just reapplied it on top of Linus' master and not only did it go in cleanly(!),
NOW I'm not getting any issues after a suspend/resume. I've attached the output
of "lspci -vvvvnn" before a hibernation (but not the very *first* one; if you
need that output, let me know) and will submit the same post-hibernation (which
is the same as the pre-hibernation case) and the post-suspend case (which is
slightly different) in subsequent E-mails (due to attachment size).

-Kenny

--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA


Attachments:
lspci-bjorn-4257f7e008-after-suspend (41.61 kB)
lspci-bjorn-4257f7e008-before-hibernate (41.58 kB)

2022-02-07 11:28:45

by Kenneth Crudup

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume


After-suspend lspci

-Kenny

--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA


Attachments:
lspci-bjorn-4257f7e008-after-suspend (41.61 kB)

2022-02-08 22:29:44

by Kenneth Crudup

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume


On Sat, 5 Feb 2022, Kenneth R. Crudup wrote:

> > If you'd like, I could try re-applying the previous problem commit or your
> > attempted fix on top of Linus' master if you'd like to see if something was
> > fixed somewhere else in the PCIe subsystem, but if you think it's not worth-
> > while I'm satisfied with the current fix (or probably more-exactly for my
> > particular machine, lack of regression).

On Sat, 5 Feb 2022, Vidya Sagar wrote:

> That would be a good starting point to understand it better. In fact if the
> previous problematic patch works fine on master, then, we are sure that
> something in the sub-system would have fixed the issue.

So this is my report of the regression I'd found with Bjorn's original commit:
----
Date: Fri, 25 Dec 2020 16:38:56
From: Kenneth R. Crudup <[email protected]>
To: [email protected]
Cc: [email protected]
Subject: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume
failures

I've been running Linus' master branch on my laptop (Dell XPS 13 2-in-1). With
this commit in place, after resuming from hibernate my machine is essentially
useless, with a torrent of disk I/O errors on my NVMe device (at least, and
possibly other devices affected) until a reboot.

I do use tlp to set the PCIe ASPM to "performance" on AC and "powersupersave"
on battery.

Let me know if you need more information.
----

I just reapplied it on top of Linus' master and not only did it go in cleanly(!),
NOW I'm not getting any issues after a suspend/resume. I've attached the output
of "lspci -vvvvnn" before a hibernation (but not the very *first* one; if you
need that output, let me know) and will submit the same post-hibernation (which
is the same as the pre-hibernation case) and the post-suspend case (which is
slightly different) in subsequent E-mails (due to attachment size).

-Kenny

--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA


Attachments:
lspci-bjorn-4257f7e008-before-hibernate (41.58 kB)

2022-02-08 22:35:50

by Abhishek Sahu

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On 2/1/2022 6:05 PM, Vidya Sagar wrote:
> Previously ASPM L1 Substates control registers (CTL1 and CTL2) weren't
> saved and restored during suspend/resume leading to L1 Substates
> configuration being lost post-resume.
>
> Save the L1 Substates control registers so that the configuration is
> retained post-resume.
>

Thanks Vidya for making this change.

I have applied your patch on v5.17-rc2 and did 100 cycles of
suspend/resume on an Alder Lake based notebook which has an
NVIDIA discrete GPU, and it is working fine.

After Boot:

# lspci -d "0x10de:" -vvv|grep "L1SubCtl1"
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+

After Suspend/resume without this patch:

L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-

After Suspend/resume with this patch:
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+


Tested-by: Abhishek Sahu <[email protected]>

Thanks,
Abhishek

> Signed-off-by: Vidya Sagar <[email protected]>
> ---
> Hi,
> Similar patch was merged in the past through the commit 4257f7e008ea
> ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") but it later
> got reverted through the commit 40fb68c7725a as Kenneth R. Crudup
> <[email protected]> reported disk IO errors because of it. I couldn't spend much
> time debugging the issue back then, but taking a fresh look at the issue, it
> seems more like an issue with the platform in question than this change itself.
> Reason being that there are other devices that support ASPM L1 Sub-States
> on that platform (as observed in the lspci output mentioned at
> https://lore.kernel.org/linux-pci/[email protected]/ )
> and assuming that L1 Sub-States are indeed enabled for those devices, there
> are no issues reported from those devices except from the NVMe drive.
> When it comes to the NVMe driver, the code at
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c?h=v5.17-rc2#n3008
> has some quirks for some of the models from Dell Inc and I'm wondering if
> the model on which the issue was observed might need a quirk of its own??
>
> So,
> Kenneth R. Crudup <[email protected]>
> Could you please try this patch on top of linux-next and collect more info?
> - 'sudo lspci -vvvv' output before and after hibernation
> - could you please confirm the working of this patch for non NVMe devices that
> support L1 Sub-States?
> - Could you please try "NVME_QUIRK_NO_DEEPEST_PS" and "NVME_QUIRK_SIMPLE_SUSPEND"
> quirks (one at a time) in check_vendor_combination_bug() API and see if it
> makes any difference?
>
> Thanks & Regards,
> Vidya Sagar
>
> drivers/pci/pci.c | 7 +++++++
> drivers/pci/pci.h | 4 ++++
> drivers/pci/pcie/aspm.c | 44 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 55 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 9ecce435fb3f..75a8b264ddac 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1617,6 +1617,7 @@ int pci_save_state(struct pci_dev *dev)
> return i;
>
> pci_save_ltr_state(dev);
> + pci_save_aspm_l1ss_state(dev);
> pci_save_dpc_state(dev);
> pci_save_aer_state(dev);
> pci_save_ptm_state(dev);
> @@ -1723,6 +1724,7 @@ void pci_restore_state(struct pci_dev *dev)
> * LTR itself (in the PCIe capability).
> */
> pci_restore_ltr_state(dev);
> + pci_restore_aspm_l1ss_state(dev);
>
> pci_restore_pcie_state(dev);
> pci_restore_pasid_state(dev);
> @@ -3430,6 +3432,11 @@ void pci_allocate_cap_save_buffers(struct pci_dev *dev)
> if (error)
> pci_err(dev, "unable to allocate suspend buffer for LTR\n");
>
> + error = pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_L1SS,
> + 2 * sizeof(u32));
> + if (error)
> + pci_err(dev, "unable to allocate suspend buffer for ASPM-L1SS\n");
> +
> pci_allocate_vc_save_buffers(dev);
> }
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 3d60cabde1a1..5de1cfe07749 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -562,11 +562,15 @@ void pcie_aspm_init_link_state(struct pci_dev *pdev);
> void pcie_aspm_exit_link_state(struct pci_dev *pdev);
> void pcie_aspm_pm_state_change(struct pci_dev *pdev);
> void pcie_aspm_powersave_config_link(struct pci_dev *pdev);
> +void pci_save_aspm_l1ss_state(struct pci_dev *dev);
> +void pci_restore_aspm_l1ss_state(struct pci_dev *dev);
> #else
> static inline void pcie_aspm_init_link_state(struct pci_dev *pdev) { }
> static inline void pcie_aspm_exit_link_state(struct pci_dev *pdev) { }
> static inline void pcie_aspm_pm_state_change(struct pci_dev *pdev) { }
> static inline void pcie_aspm_powersave_config_link(struct pci_dev *pdev) { }
> +static inline void pci_save_aspm_l1ss_state(struct pci_dev *dev) { }
> +static inline void pci_restore_aspm_l1ss_state(struct pci_dev *dev) { }
> #endif
>
> #ifdef CONFIG_PCIE_ECRC
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index a96b7424c9bc..2c29fdd20059 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -726,6 +726,50 @@ static void pcie_config_aspm_l1ss(struct pcie_link_state *link, u32 state)
> PCI_L1SS_CTL1_L1SS_MASK, val);
> }
>
> +void pci_save_aspm_l1ss_state(struct pci_dev *dev)
> +{
> + int aspm_l1ss;
> + struct pci_cap_saved_state *save_state;
> + u32 *cap;
> +
> + if (!pci_is_pcie(dev))
> + return;
> +
> + aspm_l1ss = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_L1SS);
> + if (!aspm_l1ss)
> + return;
> +
> + save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_L1SS);
> + if (!save_state)
> + return;
> +
> + cap = (u32 *)&save_state->cap.data[0];
> + pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, cap++);
> + pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, cap++);
> +}
> +
> +void pci_restore_aspm_l1ss_state(struct pci_dev *dev)
> +{
> + int aspm_l1ss;
> + struct pci_cap_saved_state *save_state;
> + u32 *cap;
> +
> + if (!pci_is_pcie(dev))
> + return;
> +
> + aspm_l1ss = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_L1SS);
> + if (!aspm_l1ss)
> + return;
> +
> + save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_L1SS);
> + if (!save_state)
> + return;
> +
> + cap = (u32 *)&save_state->cap.data[0];
> + pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, *cap++);
> + pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, *cap++);
> +}
> +
> static void pcie_config_aspm_dev(struct pci_dev *pdev, u32 val)
> {
> pcie_capability_clear_and_set_word(pdev, PCI_EXP_LNKCTL,


2022-02-09 06:05:25

by Bjorn Helgaas

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On Sat, Feb 05, 2022 at 09:30:07AM -0800, Kenneth R. Crudup wrote:
> > > If you'd like, I could try re-applying the previous problem
> > > commit or your attempted fix on top of Linus' master if you'd
> > > like to see if something was fixed somewhere else in the PCIe
> > > subsystem, but if you think it's not worth- while I'm satisfied
> > > with the current fix (or probably more-exactly for my particular
> > > machine, lack of regression).
>
> On Sat, 5 Feb 2022, Vidya Sagar wrote:
>
> > That would be a good starting point to understand it better. In fact if the
> > previous problematic patch works fine on master, then, we are sure that
> > something in the sub-system would have fixed the issue.
>
> So this is my report of the regression I'd found with Bjorn's original commit:
> ----
> Date: Fri, 25 Dec 2020 16:38:56
> From: Kenneth R. Crudup <[email protected]>
> To: [email protected]
> Cc: [email protected]
> Subject: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume
> failures
>
> I've been running Linus' master branch on my laptop (Dell XPS 13 2-in-1). With
> this commit in place, after resuming from hibernate my machine is essentially
> useless, with a torrent of disk I/O errors on my NVMe device (at least, and
> possibly other devices affected) until a reboot.
>
> I do use tlp to set the PCIe ASPM to "performance" on AC and "powersupersave"
> on battery.
>
> Let me know if you need more information.
> ----
>
> I just reapplied it on top of Linus' master and not only did it go
> in cleanly(!), NOW I'm not getting any issues after a
> suspend/resume.

So on 12/25/2020 (just before v5.11-rc1), you saw I/O errors after
resume from hibernate, and you apparently went to the trouble to
bisect it to 4257f7e008ea ("PCI/ASPM: Save/restore L1SS Capability for
suspend/resume").

We reverted 4257f7e008ea, and the revert appeared in v5.11-rc7.

I assume you re-applied 4257f7e008ea ("PCI/ASPM: Save/restore L1SS
Capability for suspend/resume") on top of something between v5.17-rc2
and v5.17-rc3, and you don't see those I/O errors.

It's possible something was fixed elsewhere between v5.11-rc1 and
v5.17-rc2, but I'm not really convinced by that theory.

I think it's more likely that something changed in BIOS settings or
other configuration, which means other people may trip over it even if
you don't see it. Do you remember any BIOS updates, BIOS setup
tweaks, hardware changes, kernel parameter changes, etc?

If the problem really was fixed by some change elsewhere, it *should*
still happen on v5.11-rc1. I think we should verify that and try to
figure out what the other change was. People who want to backport the
L1SS save/restore will need to know that anyway.

Bjorn

2022-02-09 10:01:20

by Kenneth Crudup

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume


On Mon, 7 Feb 2022, Bjorn Helgaas wrote:

> I think it's more likely that something changed in BIOS settings or
> other configuration, which means other people may trip over it even if
> you don't see it. Do you remember any BIOS updates, BIOS setup
> tweaks, hardware changes, kernel parameter changes, etc?

I thought about that Saturday morning, and yeah, it's been over a year, and
Dell has released at least two BIOS updates since then.

> If the problem really was fixed by some change elsewhere, it *should*
> still happen on v5.11-rc1. I think we should verify that and try to
> figure out what the other change was.

Yeah, not a bad idea. I'm a little busy now, but I'll cut a kernel at
4257f7e0, remake, and see what happens. Gimme a day or so.

-Kenny

--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA

2022-02-15 14:11:43

by Kenneth Crudup

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume


-Kenny

--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA


Attachments:
kernel-4257f7e00-post-hibernate (41.58 kB)

2022-02-15 14:30:52

by Kenneth Crudup

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume


> On Mon, 7 Feb 2022, Bjorn Helgaas wrote:

> > If the problem really was fixed by some change elsewhere, it *should*
> > still happen on v5.11-rc1. I think we should verify that and try to
> > figure out what the other change was.

> Yeah, not a bad idea. I'm a little busy now, but I'll cut a kernel at
> 4257f7e0, remake, and see what happens. Gimme a day or so.

So I did this: checked out a branch hard-reset to commit 4257f7e0, used my
current _defconfig on it, and built and tested it. NOW it doesn't exhibit the
NVMe-device-killing behavior I'd seen a year earlier; I'm assuming Dell must
have fixed something in the BIOS?

Anyway, I'd done an "lspci -vvvnn" before and after hibernation, if you're
interested, and a diff of same; I'll send those in subsequent E-mails.

-Kenny

--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA


Attachments:
kernel-4257f7e00-pre-hibernate (41.58 kB)
kernel-4257f7e00-pre-post-diff (3.07 kB)

2022-02-16 06:16:14

by Kenneth Crudup

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume


On Wed, 16 Feb 2022, Vidya Sagar wrote:

> I see that the ASPM-L1 state of Realtek NIC which was in disabled state before
> hibernate got enabled after hibernate.

That's actually my SD-Card reader; there's a good chance the BIOS does "something"
to it at boot time, as it's possible to boot from SD-Card on my laptop.

> This patch doesn't do anything to LnkCtl register which has control for ASPM L1
> state.

> Could you please check why ASPM L1 got enabled post hibernation?

I wouldn't know how to do that; if you're still interested in that let me know
what to do to determine that.

-Kenny

--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA

2022-02-16 07:32:44

by Vidya Sagar

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume



On 2/15/2022 6:40 PM, Kenneth R. Crudup wrote:
>
>
>> On Mon, 7 Feb 2022, Bjorn Helgaas wrote:
>
>>> If the problem really was fixed by some change elsewhere, it *should*
>>> still happen on v5.11-rc1. I think we should verify that and try to
>>> figure out what the other change was.
>
>> Yeah, not a bad idea. I'm a little busy now, but I'll cut a kernel at
>> 4257f7e0, remake, and see what happens. Gimme a day or so.
>
> So I did this- checked out a branch hard-reset to commit 4257f7e0, used my
> current _defconfig on it, and built and tested it- NOW it doesn't exhibit the
> NVMe-device-killing behavior I'd seen a year earlier; I'm assuming Dell must
> have fixed something in the BIOS?
>
> Anyway, I'd done an "lspci -vvvnn" before and after hibernation, if you're
> interested, and a diff of same; I'll send those in subsequent E-mails.
Thanks Kenny for the dump.
I see that the ASPM L1 state of the Realtek NIC, which was disabled
before hibernation, got enabled after hibernation. This is weird. This patch
doesn't do anything to the LnkCtl register, which has the control for the
ASPM L1 state. Could you please check why ASPM L1 got enabled post hibernation?

>
> -Kenny
>
> --
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA
>

2022-02-16 13:26:08

by Vidya Sagar

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume



On 2/16/2022 11:30 AM, Kenneth R. Crudup wrote:
>
>
> On Wed, 16 Feb 2022, Vidya Sagar wrote:
>
>> I see that the ASPM-L1 state of Realtek NIC which was in disabled state before
>> hibernate got enabled after hibernate.
>
> That's actually my SD-Card reader; there's a good chance the BIOS does "something"
> to it at boot time, as it's possible to boot from SD-Card on my laptop.
>
>> This patch doesn't do anything to LnkCtl register which has control for ASPM L1
>> state.
>
>> Could you please check why ASPM L1 got enabled post hibernation?
>
> I wouldn't know how to do that; if you're still interested in that let me know
> what to do to determine that.
I would like Bjorn to take a call on it.
At this point, there are contradictions in the observations.
Just to summarize:
- The root ports in your laptop don't have support for L1SS.
- With the same old code base with which the errors were observed, plus
my patch on top of it, I see the ASPM L1 state getting enabled for one
of the endpoints (the Realtek SD-Card reader) after the system comes out
of hibernation, even though ASPM L1 was disabled before the system
entered hibernation. No errors are reported now.
- With the top of the linux-next tree plus my patch, there is no change in
the ASPM states and no errors are reported either.

This points to the BIOS being buggy (both the old and the new one, with the
new one being less problematic).

Bjorn, what are your comments on it?

>
> -Kenny
>
> --
> Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA
>

2022-04-12 23:42:39

by Bjorn Helgaas

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

[+cc Ricky for rtsx_pci ASPM behavior, Rajat, Prasad for L1 SS stuff,
Victor for interest in disabling ASPM during save/restore]

On Wed, Feb 16, 2022 at 06:41:39PM +0530, Vidya Sagar wrote:
> On 2/16/2022 11:30 AM, Kenneth R. Crudup wrote:
> > On Wed, 16 Feb 2022, Vidya Sagar wrote:
> >
> > > I see that the ASPM-L1 state of Realtek NIC which was in
> > > disabled state before hibernate got enabled after hibernate.
> >
> > That's actually my SD-Card reader; there's a good chance the BIOS
> > does "something" to it at boot time, as it's possible to boot from
> > SD-Card on my laptop.
> >
> > > This patch doesn't do anything to LnkCtl register which has
> > > control for ASPM L1 state.
> >
> > > Could you please check why ASPM L1 got enabled post hibernation?
> >
> > I wouldn't know how to do that; if you're still interested in that
> > let me know what to do to determine that.

> I would like Bjorn to take a call on it.
> At this point, there are contradictions in observations.

Remind me what contradictions you see? I know Kenny saw NVMe errors
on a kernel that included 4257f7e008ea ("PCI/ASPM: Save/restore L1SS
Capability for suspend/resume") in December 2020 [1], and that he did
*not* see those errors on 4257f7e008ea in February 2022 [2]. Is that
what you mean?

> Just to summarize,
> - The root ports in your laptop don't have support for L1SS
> - With the same old code base with which the errors were observed plus my
> patch on top of it, I see that ASPM-L1 state getting enabled for one of the
> endpoints (Realtek SD-Card reader) after system comes out of hibernation
> even though ASPM-L1 was disabled before the system enter into hibernation.
> No errors are reported now.

I assume you refer to [2], where on 4257f7e008ea ("PCI/ASPM:
Save/restore L1SS Capability for suspend/resume"), Kenny saw ASPM L1
disabled before hibernate and enabled afterwards:

--- pre-hibernate
+++ post-hibernate
00:1d.7 PCI bridge [0604]: Intel [8086:34b7]
Bus: primary=00, secondary=58, subordinate=58
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
58:00.0 RTS525A PCI Express Card Reader [10ec:525a]
- LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
+ LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk-
+ ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-

Per PCIe r6.0, sec 7.5.3.7, "ASPM L1 must be enabled by software in
the Upstream component on a Link prior to enabling ASPM L1 in the
Downstream component on that Link," so this definitely seems broken,
but wouldn't explain the NVMe issue.

The PCI core (pcie_config_aspm_link()) always enables L1 in the
upstream component before the downstream one, but 58:00.0 uses the
rtsx_pci driver, which does a lot of its own ASPM fiddling, so my
guess is that it's doing something wrong here.

> - With the linux-next top of the tree plus my patch, no change in the ASPM
> states and no errors also reported.

I don't know which report this refers to.

> This points to BIOS being buggy (both old and new with new one being less
> problematic)

I agree that a BIOS change between [1] and [2] seems plausible, but I
don't think we can prove that yet. I'm slightly queasy because while
Kenny may have updated his BIOS, most people will not have.

I think we should try this patch again with some changes and maybe
some debug logging:

- I wonder if we should integrate the LTR, L1 SS, and Link Control
ASPM restore instead of having them spread around through
pci_restore_ltr_state(), pci_restore_aspm_l1ss_state(), and
pci_restore_pcie_state(). Maybe a new pci_restore_aspm() that
would be called from pci_restore_pcie_state()?

- For L1 PM Substates configuration, sec 5.5.4 says that both ports
must be configured while ASPM L1 is disabled, but I don't think we
currently guarantee this: we restore all the upstream component
state first, and we don't know the ASPM state of the downstream
one. Maybe we need to:

* When restoring upstream component,
+ disable its ASPM

* When restoring downstream component,
+ disable its ASPM
+ restore upstream component's LTR, L1SS
+ restore downstream component's LTR, L1SS
+ restore upstream component's ASPM
+ restore downstream component's ASPM

This seems pretty messy, but seems like what the spec requires (a rough
sketch of this ordering follows below).

- Add some pci_dbg() logging of all these save/restore values to
help debug any issues.
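
For illustration, a rough sketch of that ordering (not an actual patch;
pci_restore_dev_ltr_l1ss() and pci_restore_dev_aspm() are hypothetical
helpers standing in for the existing per-capability restore routines):

static void pci_restore_link_aspm(struct pci_dev *dev)
{
	struct pci_dev *bridge = pci_upstream_bridge(dev);

	if (!bridge)
		return;

	/* Disable ASPM on both ends of the link while L1SS is being
	 * restored (sec 5.5.4 requires L1 disabled during setup). */
	pcie_capability_clear_word(bridge, PCI_EXP_LNKCTL,
				   PCI_EXP_LNKCTL_ASPMC);
	pcie_capability_clear_word(dev, PCI_EXP_LNKCTL,
				   PCI_EXP_LNKCTL_ASPMC);

	/* Restore LTR and L1SS with L1 disabled, upstream component first. */
	pci_restore_dev_ltr_l1ss(bridge);	/* hypothetical helper */
	pci_restore_dev_ltr_l1ss(dev);		/* hypothetical helper */

	/* Re-enable ASPM, upstream before downstream (sec 7.5.3.7), and
	 * log the restored values for debugging. */
	pci_restore_dev_aspm(bridge);		/* hypothetical helper */
	pci_restore_dev_aspm(dev);		/* hypothetical helper */
	pci_dbg(dev, "restored LTR/L1SS/ASPM state for link\n");
}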

Bjorn

[1] https://lore.kernel.org/r/20201228040513.GA611645@bjorn-Precision-5520
[2] https://lore.kernel.org/r/[email protected]

2022-04-13 03:12:18

by Kai-Heng Feng

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On Wed, Apr 13, 2022 at 6:50 AM Bjorn Helgaas <[email protected]> wrote:
>
> [+cc Ricky for rtsx_pci ASPM behavior, Rajat, Prasad for L1 SS stuff,
> Victor for interest in disabling ASPM during save/restore]
>
> On Wed, Feb 16, 2022 at 06:41:39PM +0530, Vidya Sagar wrote:
> > On 2/16/2022 11:30 AM, Kenneth R. Crudup wrote:
> > > On Wed, 16 Feb 2022, Vidya Sagar wrote:
> > >
> > > > I see that the ASPM-L1 state of Realtek NIC which was in
> > > > disabled state before hibernate got enabled after hibernate.
> > >
> > > That's actually my SD-Card reader; there's a good chance the BIOS
> > > does "something" to it at boot time, as it's possible to boot from
> > > SD-Card on my laptop.
> > >
> > > > This patch doesn't do anything to LnkCtl register which has
> > > > control for ASPM L1 state.
> > >
> > > > Could you please check why ASPM L1 got enabled post hibernation?
> > >
> > > I wouldn't know how to do that; if you're still interested in that
> > > let me know what to do to determine that.
>
> > I would like Bjorn to take a call on it.
> > At this point, there are contradictions in observations.
>
> Remind me what contradictions you see? I know Kenny saw NVMe errors
> on a kernel that included 4257f7e008ea ("PCI/ASPM: Save/restore L1SS
> Capability for suspend/resume") in December 2020 [1], and that he did
> *not* see those errors on 4257f7e008ea in February 2022 [2]. Is that
> what you mean?
>
> > Just to summarize,
> > - The root ports in your laptop don't have support for L1SS
> > - With the same old code base with which the errors were observed plus my
> > patch on top of it, I see that ASPM-L1 state getting enabled for one of the
> > endpoints (Realtek SD-Card reader) after system comes out of hibernation
> > even though ASPM-L1 was disabled before the system enter into hibernation.
> > No errors are reported now.
>
> I assume you refer to [2], where on 4257f7e008ea ("PCI/ASPM:
> Save/restore L1SS Capability for suspend/resume"), Kenny saw ASPM L1
> disabled before hibernate and enabled afterwards:
>
> --- pre-hibernate
> +++ post-hibernate
> 00:1d.7 PCI bridge [0604]: Intel [8086:34b7]
> Bus: primary=00, secondary=58, subordinate=58
> LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> 58:00.0 RTS525A PCI Express Card Reader [10ec:525a]
> - LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
> - ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> + LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk-
> + ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>
> Per PCIe r6.0, sec 7.5.3.7, "ASPM L1 must be enabled by software in
> the Upstream component on a Link prior to enabling ASPM L1 in the
> Downstream component on that Link," so this definitely seems broken,
> but wouldn't explain the NVMe issue.
>
> The PCI core (pcie_config_aspm_link()) always enables L1 in the
> upstream component before the downstream one, but 58:00.0 uses the
> rtsx_pci driver, which does a lot of its own ASPM fiddling, so my
> guess is that it's doing something wrong here.
>
> > - With the linux-next top of the tree plus my patch, no change in the ASPM
> > states and no errors also reported.
>
> I don't know which report this refers to.
>
> > This points to BIOS being buggy (both old and new with new one being less
> > problematic)
>
> I agree that a BIOS change between [1] and [2] seems plausible, but I
> don't think we can prove that yet. I'm slightly queasy because while
> Kenny may have updated his BIOS, most people will not have.
>
> I think we should try this patch again with some changes and maybe
> some debug logging:
>
> - I wonder if we should integrate the LTR, L1 SS, and Link Control
> ASPM restore instead of having them spread around through
> pci_restore_ltr_state(), pci_restore_aspm_l1ss_state(), and
> pci_restore_pcie_state(). Maybe a new pci_restore_aspm() that
> would be called from pci_restore_pcie_state()?
>
> - For L1 PM Substates configuration, sec 5.5.4 says that both ports
> must be configured while ASPM L1 is disabled, but I don't think we
> currently guarantee this: we restore all the upstream component
> state first, and we don't know the ASPM state of the downstream
> one. Maybe we need to:
>
> * When restoring upstream component,
> + disable its ASPM
>
> * When restoring downstream component,
> + disable its ASPM
> + restore upstream component's LTR, L1SS
> + restore downstream component's LTR, L1SS
> + restore upstream component's ASPM
> + restore downstream component's ASPM

Right now L1SS isn't reprogrammed after S3, and that causes the WD NVMe
to start spewing lots of AER errors.
So yes, please restore L1SS upon resume. Meanwhile I am asking the vendor
_why_ restoring L1SS is crucial for it to work.

I also wonder what's the purpose of pcie_aspm_pm_state_change()? Can't
we just restore ASPM bits like other configs?

Kai-Heng

>
> This seems pretty messy, but seems like what the spec requires.
>
> - Add some pci_dbg() logging of all these save/restore values to
> help debug any issues.
>
> Bjorn
>
> [1] https://lore.kernel.org/r/20201228040513.GA611645@bjorn-Precision-5520
> [2] https://lore.kernel.org/r/[email protected]

2022-04-13 16:25:01

by Bjorn Helgaas

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On Tue, Apr 12, 2022 at 05:50:47PM -0500, Bjorn Helgaas wrote:
> ...
> I think we should try this patch again with some changes and maybe
> some debug logging:
>
> - I wonder if we should integrate the LTR, L1 SS, and Link Control
> ASPM restore instead of having them spread around through
> pci_restore_ltr_state(), pci_restore_aspm_l1ss_state(), and
> pci_restore_pcie_state(). Maybe a new pci_restore_aspm() that
> would be called from pci_restore_pcie_state()?

Sorry, this was a dumb idea, please ignore it. Maybe it would be
useful in the future sometime, but right now when we're trying to
understand the issue, the code churn would just confuse things. I
think we need minimal changes right now (like 4257f7e008ea).

> - For L1 PM Substates configuration, sec 5.5.4 says that both ports
> must be configured while ASPM L1 is disabled, but I don't think we
> currently guarantee this: we restore all the upstream component
> state first, and we don't know the ASPM state of the downstream
> one. Maybe we need to:
>
> * When restoring upstream component,
> + disable its ASPM
>
> * When restoring downstream component,
> + disable its ASPM
> + restore upstream component's LTR, L1SS
> + restore downstream component's LTR, L1SS
> + restore upstream component's ASPM
> + restore downstream component's ASPM
>
> This seems pretty messy, but seems like what the spec requires.

Actually, I think it's unlikely that a downstream device would have
ASPM enabled while we're restoring state of the upstream device, since
we're probably restoring state after a reset or coming back from
D3cold. But we may still need to wait to enable the upstream device
ASPM until after configuring it in the downstream device.

Logging of the previous & new state of these registers might help
untangle things.

> - Add some pci_dbg() logging of all these save/restore values to
> help debug any issues.

2022-04-15 10:26:16

by Bjorn Helgaas

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On Wed, Apr 13, 2022 at 08:19:26AM +0800, Kai-Heng Feng wrote:
> On Wed, Apr 13, 2022 at 6:50 AM Bjorn Helgaas <[email protected]> wrote:
> > ...

> > - For L1 PM Substates configuration, sec 5.5.4 says that both ports
> > must be configured while ASPM L1 is disabled, but I don't think we
> > currently guarantee this: we restore all the upstream component
> > state first, and we don't know the ASPM state of the downstream
> > one. Maybe we need to:
> >
> > * When restoring upstream component,
> > + disable its ASPM
> >
> > * When restoring downstream component,
> > + disable its ASPM
> > + restore upstream component's LTR, L1SS
> > + restore downstream component's LTR, L1SS
> > + restore upstream component's ASPM
> > + restore downstream component's ASPM
>
> Right now L1SS isn't reprogrammed after S3, and that causes WD NVMe
> starts to spew lots of AER errors.

Right now we don't fully restore L1SS-related state after S3, so maybe
there's some inconsistency that leads to the AER errors.

Could you collect the "lspci -vv" state before and after S3 so we can
compare them?

> So yes please restore L1SS upon resume. Meanwhile I am asking vendor
> _why_ restoring L1SS is crucial for it to work.
>
> I also wonder what's the purpose of pcie_aspm_pm_state_change()? Can't
> we just restore ASPM bits like other configs?

Good question. What's the context? This is in the
pci_raw_set_power_state() path, not the pci_restore_state() path, so
seems like a separate discussion.

Bjorn

2022-04-16 01:27:54

by Bjorn Helgaas

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On Fri, Apr 15, 2022 at 10:26:19PM +0800, Kai-Heng Feng wrote:
> On Fri, Apr 15, 2022 at 12:41 AM Bjorn Helgaas <[email protected]> wrote:
> > On Wed, Apr 13, 2022 at 08:19:26AM +0800, Kai-Heng Feng wrote:
> > > On Wed, Apr 13, 2022 at 6:50 AM Bjorn Helgaas <[email protected]> wrote:
> > > > ...
> >
> > > > - For L1 PM Substates configuration, sec 5.5.4 says that both ports
> > > > must be configured while ASPM L1 is disabled, but I don't think we
> > > > currently guarantee this: we restore all the upstream component
> > > > state first, and we don't know the ASPM state of the downstream
> > > > one. Maybe we need to:
> > > >
> > > > * When restoring upstream component,
> > > > + disable its ASPM
> > > >
> > > > * When restoring downstream component,
> > > > + disable its ASPM
> > > > + restore upstream component's LTR, L1SS
> > > > + restore downstream component's LTR, L1SS
> > > > + restore upstream component's ASPM
> > > > + restore downstream component's ASPM
> > >
> > > Right now L1SS isn't reprogrammed after S3, and that causes WD NVMe
> > > starts to spew lots of AER errors.
> >
> > Right now we don't fully restore L1SS-related state after S3, so maybe
> > there's some inconsistency that leads to the AER errors.

> > Could you collect the "lspci -vv" state before and after S3 so we can
> > compare them?
> >
> > > So yes please restore L1SS upon resume. Meanwhile I am asking vendor
> > > _why_ restoring L1SS is crucial for it to work.
> > >
> > > I also wonder what's the purpose of pcie_aspm_pm_state_change()? Can't
> > > we just restore ASPM bits like other configs?
> >
> > Good question. What's the context? This is in the
> > pci_raw_set_power_state() path, not the pci_restore_state() path, so
> > seems like a separate discussion.
>
> Because this patch alone doesn't restore T_PwrOn and LTR1.2_Threshold.

I assume the post-S3 path is basically this:

pci_pm_resume_noirq
  pci_pm_default_resume_early
    pci_power_up
      pci_raw_set_power_state(D0)
        pcie_aspm_pm_state_change
          pcie_config_aspm_path
            pcie_config_aspm_link
              pcie_config_aspm_l1ss
                clear PCI_EXP_LNKCTL_ASPM_L1 etc
                set PCI_L1SS_CTL1_ASPM_L1_1 etc
              pcie_config_aspm_dev
                set PCI_EXP_LNKCTL_ASPM_L1 etc
    pci_restore_state
      pci_restore_ltr_state
      pci_restore_aspm_l1ss_state        # after this patch
      pci_restore_pcie_state

Hmm... I think I see what you're saying. pcie_aspm_pm_state_change()
fiddles with ASPM and L1SS enable bits. It likely disables L1,
enables L1SS, enables L1, but never restores the LTR capability or the
T_PwrOn and LTR1.2_Threshold bits you mention.

Then we turn around and overwrite all that stuff (and the LTR cap) in
pci_restore_state(). That all seems fairly broken, and I agree, I
don't know why pcie_aspm_pm_state_change() exists.

> So I forced the pcie_aspm_pm_state_change() calling path to eventually
> call aspm_calc_l1ss_info() which solved the problem for me.

This would update T_PwrOn and LTR1.2_Threshold, so I guess it makes
sense that this would help. But I think we need to figure out the
reason why pcie_aspm_pm_state_change() exists and get rid of it or at
least better integrate it with pci_restore_state().

If we call pcie_aspm_pm_state_change() after D3cold or reset, things
still aren't going to work because the LTR cap isn't restored or
programmed.

2022-04-16 02:25:42

by Kai-Heng Feng

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On Fri, Apr 15, 2022 at 12:41 AM Bjorn Helgaas <[email protected]> wrote:
>
> On Wed, Apr 13, 2022 at 08:19:26AM +0800, Kai-Heng Feng wrote:
> > On Wed, Apr 13, 2022 at 6:50 AM Bjorn Helgaas <[email protected]> wrote:
> > > ...
>
> > > - For L1 PM Substates configuration, sec 5.5.4 says that both ports
> > > must be configured while ASPM L1 is disabled, but I don't think we
> > > currently guarantee this: we restore all the upstream component
> > > state first, and we don't know the ASPM state of the downstream
> > > one. Maybe we need to:
> > >
> > > * When restoring upstream component,
> > > + disable its ASPM
> > >
> > > * When restoring downstream component,
> > > + disable its ASPM
> > > + restore upstream component's LTR, L1SS
> > > + restore downstream component's LTR, L1SS
> > > + restore upstream component's ASPM
> > > + restore downstream component's ASPM
> >
> > Right now L1SS isn't reprogrammed after S3, and that causes WD NVMe
> > starts to spew lots of AER errors.
>
> Right now we don't fully restore L1SS-related state after S3, so maybe
> there's some inconsistency that leads to the AER errors.
>
> Could you collect the "lspci -vv" state before and after S3 so we can
> compare them?
>
> > So yes please restore L1SS upon resume. Meanwhile I am asking vendor
> > _why_ restoring L1SS is crucial for it to work.
> >
> > I also wonder what's the purpose of pcie_aspm_pm_state_change()? Can't
> > we just restore ASPM bits like other configs?
>
> Good question. What's the context? This is in the
> pci_raw_set_power_state() path, not the pci_restore_state() path, so
> seems like a separate discussion.

Because this patch alone doesn't restore T_PwrOn and LTR1.2_Threshold.

So I forced the pcie_aspm_pm_state_change() calling path to eventually
call aspm_calc_l1ss_info() which solved the problem for me.

Let me investigate a bit further.

Kai-Heng

>
> Bjorn

2022-04-21 18:06:02

by Kai-Heng Feng

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On Sat, Apr 16, 2022 at 5:25 AM Bjorn Helgaas <[email protected]> wrote:
>
> On Fri, Apr 15, 2022 at 10:26:19PM +0800, Kai-Heng Feng wrote:
> > On Fri, Apr 15, 2022 at 12:41 AM Bjorn Helgaas <[email protected]> wrote:
> > > On Wed, Apr 13, 2022 at 08:19:26AM +0800, Kai-Heng Feng wrote:
> > > > On Wed, Apr 13, 2022 at 6:50 AM Bjorn Helgaas <[email protected]> wrote:
> > > > > ...
> > >
> > > > > - For L1 PM Substates configuration, sec 5.5.4 says that both ports
> > > > > must be configured while ASPM L1 is disabled, but I don't think we
> > > > > currently guarantee this: we restore all the upstream component
> > > > > state first, and we don't know the ASPM state of the downstream
> > > > > one. Maybe we need to:
> > > > >
> > > > > * When restoring upstream component,
> > > > > + disable its ASPM
> > > > >
> > > > > * When restoring downstream component,
> > > > > + disable its ASPM
> > > > > + restore upstream component's LTR, L1SS
> > > > > + restore downstream component's LTR, L1SS
> > > > > + restore upstream component's ASPM
> > > > > + restore downstream component's ASPM
> > > >
> > > > Right now L1SS isn't reprogrammed after S3, and that causes WD NVMe
> > > > starts to spew lots of AER errors.
> > >
> > > Right now we don't fully restore L1SS-related state after S3, so maybe
> > > there's some inconsistency that leads to the AER errors.
>
> > > Could you collect the "lspci -vv" state before and after S3 so we can
> > > compare them?
> > >
> > > > So yes please restore L1SS upon resume. Meanwhile I am asking vendor
> > > > _why_ restoring L1SS is crucial for it to work.
> > > >
> > > > I also wonder what's the purpose of pcie_aspm_pm_state_change()? Can't
> > > > we just restore ASPM bits like other configs?
> > >
> > > Good question. What's the context? This is in the
> > > pci_raw_set_power_state() path, not the pci_restore_state() path, so
> > > seems like a separate discussion.
> >
> > Because this patch alone doesn't restore T_PwrOn and LTR1.2_Threshold.
>
> I assume the post-S3 path is basically this:
>
> pci_pm_resume_noirq
> pci_pm_default_resume_early
> pci_power_up
> pci_raw_set_power_state(D0)
> pcie_aspm_pm_state_change
> pcie_config_aspm_path
> pcie_config_aspm_link
> pcie_config_aspm_l1ss
> clear PCI_EXP_LNKCTL_ASPM_L1 etc
> set PCI_L1SS_CTL1_ASPM_L1_1 etc
> pcie_config_aspm_dev
> set PCI_EXP_LNKCTL_ASPM_L1 etc
> pci_restore_state
> pci_restore_ltr_state
> pci_restore_aspm_l1ss_state # after this patch
> pci_restore_pcie_state
>
> Hmm... I think I see what you're saying. pcie_aspm_pm_state_change()
> fiddles with ASPM and L1SS enable bits. It likely disables L1,
> enables L1SS, enables L1, but never restores the LTR capability or the
> T_PwrOn and LTR1.2_Threshold bits you mention.
>
> Then we turn around and overwrite all that stuff (and the LTR cap) in
> pci_restore_state(). That all seems fairly broken, and I agree, I
> don't know why pcie_aspm_pm_state_change() exists.

I went through the whole discussion again; maybe Kenneth's case is
also the result of pcie_aspm_pm_state_change()?
Since Kenneth is using TLP to switch ASPM between performance and
powersave/powersupersave, that means 'aspm_disabled' is false.
Hence the KIOXIA NVMe stops working post suspend and the Realtek card
reader toggles L1SS post hibernation.

Kenneth, can you please see if removing pcie_aspm_pm_state_change()
from pci_raw_set_power_state() helps?

Anyway, this can be easier to spot if dmesg was attached.
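
For reference, the call meant here is the one in pci_raw_set_power_state()
that looks roughly like this (exact context may differ by kernel version):

	if (dev->bus->self)
		pcie_aspm_pm_state_change(dev->bus->self);

i.e. the experiment is to remove (or comment out) that call and retest
suspend and hibernate.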

>
> > So I forced the pcie_aspm_pm_state_change() calling path to eventually
> > call aspm_calc_l1ss_info() which solved the problem for me.
>
> This would update T_PwrOn and LTR1.2_Threshold, so I guess it makes
> sense that this would help. But I think we need to figure out the
> reason why pcie_aspm_pm_state_change() exists and get rid of it or at
> least better integrate it with pci_restore_state().
>
> If we call pcie_aspm_pm_state_change() after D3cold or reset, things
> still aren't going to work because the LTR cap isn't restored or
> programmed.

More than that, the ASPM sysfs won't be restored correctly after
resume [1] because of it.
So I'd like to post a patch to drop pcie_aspm_pm_state_change() if
there's no objection.

[1] https://lore.kernel.org/linux-pci/[email protected]/

Kai-Heng

2022-04-22 18:43:28

by Bjorn Helgaas

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On Thu, Apr 21, 2022 at 02:16:29PM +0800, Kai-Heng Feng wrote:
> On Sat, Apr 16, 2022 at 5:25 AM Bjorn Helgaas <[email protected]> wrote:
> > On Fri, Apr 15, 2022 at 10:26:19PM +0800, Kai-Heng Feng wrote:
> > ...

> > > So I forced the pcie_aspm_pm_state_change() calling path to eventually
> > > call aspm_calc_l1ss_info() which solved the problem for me.
> >
> > This would update T_PwrOn and LTR1.2_Threshold, so I guess it makes
> > sense that this would help. But I think we need to figure out the
> > reason why pcie_aspm_pm_state_change() exists and get rid of it or at
> > least better integrate it with pci_restore_state().
> >
> > If we call pcie_aspm_pm_state_change() after D3cold or reset, things
> > still aren't going to work because the LTR cap isn't restored or
> > programmed.
>
> More than that, the ASPM sysfs won't be restored correctly after
> resume [1] because of it.
> So I'd like to post a patch to drop pcie_aspm_pm_state_change() if
> there's no objection.

Please do :) Obviously somebody thought we needed
pcie_aspm_pm_state_change() for some reason or it wouldn't have been
added. But I couldn't figure out the reason, since it was included in the very
first ASPM support: 6c723d5bd89f ("PCI: PCIE ASPM support").

Maybe the comment that "devices changed PM state, we should recheck if
latency meets all functions' requirement" is a clue but I haven't
followed up.

> [1] https://lore.kernel.org/linux-pci/[email protected]/
>
> Kai-Heng

2022-04-22 19:09:38

by Kenneth Crudup

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume


On Thu, 21 Apr 2022, Kai-Heng Feng wrote:

> I went through the whole discussion again, maybe Kenneth's case is
> also the result of pcie_aspm_pm_state_change()?

> Since Kenneth is using TLP to switch ASPM between performance and
> powersaving/powersupersaving, that means 'aspm_disabled' is false.
> Hence the KOXIA NVMe stops working post suspend and Realtek card
> reader toggles L1ss post hibernation.

> Kenneth, can you please see if removing pcie_aspm_pm_state_change()
> from pci_raw_set_power_state() helps?
> Anyway, this can be easier to spot if dmesg was attached.

Well, I haven't had an issue with resume/return from hibernate for quite some
time, and the patch I'd reported a long time ago is now in the Linus' master
I've been running. I believe a BIOS change from Dell fixed it up for me.

But if you're looking for information in any case, I could still do this and
get back to the topic re: the results.

-Kenny

--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA

2022-04-22 22:50:07

by Bjorn Helgaas

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On Thu, Apr 21, 2022 at 01:40:02PM -0700, Kenneth R. Crudup wrote:
> On Thu, 21 Apr 2022, Kai-Heng Feng wrote:
>
> > I went through the whole discussion again, maybe Kenneth's case is
> > also the result of pcie_aspm_pm_state_change()?
>
> > Since Kenneth is using TLP to switch ASPM between performance and
> > powersaving/powersupersaving, that means 'aspm_disabled' is false.
> > Hence the KOXIA NVMe stops working post suspend and Realtek card
> > reader toggles L1ss post hibernation.
>
> > Kenneth, can you please see if removing pcie_aspm_pm_state_change()
> > from pci_raw_set_power_state() helps?
> > Anyway, this can be easier to spot if dmesg was attached.
>
> Well, I haven't had an issue with resume/return from hibernate for quite some
> time, and the patch I'd reported a long time ago is now in the Linus' master
> I've been running. I believe a BIOS change from Dell fixed it up for me.

Which patch are you referring to?

Vidya's original patch [1] is not upstream, at least AFAIK. Well, it
*was* merged as 4257f7e008ea [2] in v5.11-rc1, but then reverted by
40fb68c7725a [3] in v5.11-rc7.

[1] https://lore.kernel.org/r/[email protected]
[2] https://git.kernel.org/linus/4257f7e008ea
[3] https://git.kernel.org/linus/40fb68c7725a

2022-11-23 22:01:49

by Matthias Kaehlcke

Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

Hi,

not sure this is the best thread to reply to, but we are also
observing suspend issues with the same Kioxia NVMe on a platform
with a Qualcomm sc7280 SoC. The system runs a v5.15 downstream
kernel which includes most (post v5.15) ASPM patches from
upstream.

There are two issues with the Kioxia NVMe:

1. Power consumption is high unless a LTR_L1.2_THRESHOLD of 0ns
is configured (related dubious patch: [1])

2. The system often hangs on resume unless a longer delay is
added in the suspend path. QC engineers say that the NVMe
takes that much time to settle into L1SS.

Other NVMe models don't exhibit power or suspend issues on this
platform, except for one model which also needs a shorter
delay during suspend, otherwise the system will hang
occasionally upon resume.

The second issue could possibly be 'fixed' with a quirk for
the Kioxia NVMe model, though it seems the issue is not seen on
all platforms, apparently the delay is not needed on Kenny's
system.

I'm currently a bit at a loss with the first issue. The patch
mentioned above claims that the (no-)snoop latencies are
involved, which may or may not be true. In this thread I saw
Kenny posting 'lspci' output from his (now) working system.
I noticed max (no-)snoop values of 3145728ns, which seems to
be some sort of default (programmed) max. On my system these
values are 0ns, which is the default value of the registers.
I tried to set these to 3146us from the kernel to see if that
makes a difference, but I could only successfully update the max
snoop latency, not the no-snoop one (maybe this can only be done
at early initialization time?). With just the max snoop latency
set to 3146us, power consumption of the NVMe remains high.
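
For reference, the kind of write I attempted looks roughly like this (a
minimal sketch, not my exact code; 0x1003 encodes LatencyValue=3 with a
1048576 ns scale, i.e. the 3145728 ns seen in Kenny's lspci output):

	int ltr = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_LTR);

	if (ltr) {
		/* Max Snoop / No-Snoop Latency registers in the LTR capability */
		pci_write_config_word(pdev, ltr + PCI_LTR_MAX_SNOOP_LAT, 0x1003);
		pci_write_config_word(pdev, ltr + PCI_LTR_MAX_NOSNOOP_LAT, 0x1003);
	}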

The output of lspci from my system is attached.

In this thread it was mentioned that possibly a BIOS update
fixed the issue Kenny was seeing. What kind of values is
the BIOS supposed to adjust (I'm a PCI n00b)?

Any suggestions about what else to try?

Thanks

m.

[1] https://patchwork.kernel.org/project/linux-arm-msm/patch/[email protected]/

---

0001:00:00.0 PCI bridge: Qualcomm Device 010b (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 183
IOMMU group: 0
Region 0: Memory at 40700000 (32-bit, non-prefetchable) [size=4K]
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: 00001000-00001fff [size=4K]
Memory behind bridge: 40300000-404fffff [size=2M]
Prefetchable memory behind bridge: 0000000040500000-00000000406fffff [size=2M]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=1/32 Maskable+ 64bit+
Address: 00000000fffff000 Data: 0000
Masking: fffffffe Pending: 00000000
Capabilities: [70] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <1us, L1 <64us
ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
LnkCtl: ASPM L0s L1 Enabled; RCB 128 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x2 (ok)
TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
SltCap: AttnBtn+ PwrCtrl+ MRL+ AttnInd+ PwrInd+ HotPlug+ Surprise+
Slot #0, PowerLimit 0.000W; Interlock+ NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Off, PwrInd Off, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
Changed: MRL- PresDet- LinkState-
RootCap: CRSVisible-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP+ LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- LN System CLS Not Supported, TPHComp+ ExtTPHComp- ARIFwd-
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled, ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn- NFERptEn- FERptEn-
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [148 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [168 v1] Transaction Processing Hints
No steering table available
Capabilities: [1fc v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=70us PortTPowerOnTime=0us
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
T_CommonMode=70us LTR1.2_Threshold=86016ns
L1SubCtl2: T_PwrOn=10us
Kernel driver in use: pcieport

0001:01:00.0 Non-Volatile memory controller: KIOXIA Corporation Device 0001 (prog-if 02 [NVM Express])
Subsystem: KIOXIA Corporation Device 0001
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 182
IOMMU group: 0
Region 0: Memory at 40300000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <2us, L1 <32us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x2 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [80] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [90] MSI: Enable- Count=1/32 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [b0] MSI-X: Enable+ Count=32 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [150 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [260 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [300 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [400 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
PortCommonModeRestoreTime=60us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=86016ns
L1SubCtl2: T_PwrOn=10us
Kernel driver in use: nvme


On Tue, Apr 12, 2022 at 05:50:47PM -0500, Bjorn Helgaas wrote:
> [+cc Ricky for rtsx_pci ASPM behavior, Rajat, Prasad for L1 SS stuff,
> Victor for interest in disabling ASPM during save/restore]
>
> On Wed, Feb 16, 2022 at 06:41:39PM +0530, Vidya Sagar wrote:
> > On 2/16/2022 11:30 AM, Kenneth R. Crudup wrote:
> > > On Wed, 16 Feb 2022, Vidya Sagar wrote:
> > >
> > > > I see that the ASPM-L1 state of the Realtek NIC, which was
> > > > disabled before hibernate, got enabled after hibernate.
> > >
> > > That's actually my SD-Card reader; there's a good chance the BIOS
> > > does "something" to it at boot time, as it's possible to boot from
> > > SD-Card on my laptop.
> > >
> > > > This patch doesn't do anything to LnkCtl register which has
> > > > control for ASPM L1 state.
> > >
> > > > Could you please check why ASPM L1 got enabled post hibernation?
> > >
> > > I wouldn't know how to do that; if you're still interested in that
> > > let me know what to do to determine that.
>
> > I would like Bjorn to take a call on it.
> > At this point, there are contradictions in observations.
>
> Remind me what contradictions you see? I know Kenny saw NVMe errors
> on a kernel that included 4257f7e008ea ("PCI/ASPM: Save/restore L1SS
> Capability for suspend/resume") in December 2020 [1], and that he did
> *not* see those errors on 4257f7e008ea in February 2022 [2]. Is that
> what you mean?
>
> > Just to summarize,
> > - The root ports in your laptop don't have support for L1SS
> > - With the same old code base with which the errors were observed, plus my
> > patch on top of it, I see the ASPM-L1 state getting enabled for one of the
> > endpoints (Realtek SD-Card reader) after the system comes out of hibernation,
> > even though ASPM-L1 was disabled before the system entered hibernation.
> > No errors are reported now.
>
> I assume you refer to [2], where on 4257f7e008ea ("PCI/ASPM:
> Save/restore L1SS Capability for suspend/resume"), Kenny saw ASPM L1
> disabled before hibernate and enabled afterwards:
>
> --- pre-hibernate
> +++ post-hibernate
> 00:1d.7 PCI bridge [0604]: Intel [8086:34b7]
> Bus: primary=00, secondary=58, subordinate=58
> LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> 58:00.0 RTS525A PCI Express Card Reader [10ec:525a]
> - LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
> - ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> + LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk-
> + ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>
> Per PCIe r6.0, sec 7.5.3.7, "ASPM L1 must be enabled by software in
> the Upstream component on a Link prior to enabling ASPM L1 in the
> Downstream component on that Link," so this definitely seems broken,
> but wouldn't explain the NVMe issue.
>
> The PCI core (pcie_config_aspm_link()) always enables L1 in the
> upstream component before the downstream one, but 58:00.0 uses the
> rtsx_pci driver, which does a lot of its own ASPM fiddling, so my
> guess is that it's doing something wrong here.
>
> > - With the linux-next top of tree plus my patch, there is no change in the
> > ASPM states and no errors are reported either.
>
> I don't know which report this refers to.
>
> > This points to the BIOS being buggy (both old and new, with the new one
> > being less problematic).
>
> I agree that a BIOS change between [1] and [2] seems plausible, but I
> don't think we can prove that yet. I'm slightly queasy because while
> Kenny may have updated his BIOS, most people will not have.
>
> I think we should try this patch again with some changes and maybe
> some debug logging:
>
> - I wonder if we should integrate the LTR, L1 SS, and Link Control
> ASPM restore instead of having them spread around through
> pci_restore_ltr_state(), pci_restore_aspm_l1ss_state(), and
> pci_restore_pcie_state(). Maybe a new pci_restore_aspm() that
> would be called from pci_restore_pcie_state()?
>
> - For L1 PM Substates configuration, sec 5.5.4 says that both ports
> must be configured while ASPM L1 is disabled, but I don't think we
> currently guarantee this: we restore all the upstream component
> state first, and we don't know the ASPM state of the downstream
> one. Maybe we need to:
>
> * When restoring upstream component,
> + disable its ASPM
>
> * When restoring downstream component,
> + disable its ASPM
> + restore upstream component's LTR, L1SS
> + restore downstream component's LTR, L1SS
> + restore upstream component's ASPM
> + restore downstream component's ASPM
>
> This seems pretty messy, but seems like what the spec requires.
>
> - Add some pci_dbg() logging of all these save/restore values to
> help debug any issues.
>
> Bjorn
>
> [1] https://lore.kernel.org/r/20201228040513.GA611645@bjorn-Precision-5520
> [2] https://lore.kernel.org/r/[email protected]
>
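
For illustration only, the restore ordering Bjorn describes above might look
roughly like the sketch below. pci_restore_ltr_state() and
pci_restore_aspm_l1ss_state() are the helpers from the patch under discussion;
the saved ASPM control bits are passed in as parameters just to keep the sketch
short, so this is not how an actual pci_restore_aspm() would need to be wired up:

static void pci_restore_aspm(struct pci_dev *dev, u16 bridge_aspmc, u16 dev_aspmc)
{
	struct pci_dev *bridge = pci_upstream_bridge(dev);

	if (!bridge)
		return;

	/* 1. Disable ASPM on both ends of the link (sec 5.5.4 precondition) */
	pcie_capability_clear_word(bridge, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_ASPMC);
	pcie_capability_clear_word(dev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_ASPMC);

	/* 2. Restore LTR and L1SS config while L1 is disabled, upstream first */
	pci_restore_ltr_state(bridge);
	pci_restore_aspm_l1ss_state(bridge);
	pci_restore_ltr_state(dev);
	pci_restore_aspm_l1ss_state(dev);

	/* 3. Re-enable the saved ASPM control bits, upstream before downstream */
	pcie_capability_set_word(bridge, PCI_EXP_LNKCTL, bridge_aspmc);
	pcie_capability_set_word(dev, PCI_EXP_LNKCTL, dev_aspmc);
}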

2022-11-23 22:55:43

by Kenneth Crudup

[permalink] [raw]
Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume


For #2, you mean "S3 suspend"? My normally solid machine is
hit-or-miss on coming out of S3 suspend lately, but that could be
related to me running Linus' master.

Should I try out "Dubious Patch #1"? I lose a %age/hr in S3.

-Kenny

--
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA

2022-11-24 00:32:53

by Matthias Kaehlcke

[permalink] [raw]
Subject: Re: [PATCH V1] PCI/ASPM: Save/restore L1SS Capability for suspend/resume

On Wed, Nov 23, 2022 at 02:01:59PM -0800, Kenneth R. Crudup wrote:
>
> For #2, you mean "S3 suspend"?

Yes

> My normally solid machine is hit-or-miss on coming out of S3
> suspend lately but that could be related to me running Linus'
> master.

Interesting. Did this suspend/resume flakiness only (re)appear
recently?

> Should I try out "Dubious Patch #1"? I lose a %age/hr in S3.

If you are seeing high power consumption of the NVMe, the patch
might help. On my system it goes from values in the 400mW range
to <10mW in S3. S0 power consumption is also reduced
significantly.

The implementation of the patch is incorrect even for what it
claims to do, but it still has the desired impact on the power
consumption of the Kioxia. Alternatively, you could just
set 'scale' and 'value' to 0 (possibly just 'value' would be
enough) instead of the values set by encode_l12_threshold(),
which is essentially what the patch does.
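
A minimal sketch of what setting 'value' and 'scale' to 0 could look like
if done directly on the L1SS Control 1 register (clear_l12_threshold() is
a made-up helper; per the spec this should be done while ASPM L1.2 is
disabled):

static void clear_l12_threshold(struct pci_dev *dev)
{
	u16 l1ss = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_L1SS);
	u32 ctl1;

	if (!l1ss)
		return;

	/* Report an LTR_L1.2_THRESHOLD of 0ns: clear both value and scale */
	pci_read_config_dword(dev, l1ss + PCI_L1SS_CTL1, &ctl1);
	ctl1 &= ~(PCI_L1SS_CTL1_LTR_L12_TH_VALUE | PCI_L1SS_CTL1_LTR_L12_TH_SCALE);
	pci_write_config_dword(dev, l1ss + PCI_L1SS_CTL1, ctl1);
}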

If messing with LTR_L1.2_THRESHOLD isn't enough to stabilize
suspend/resume, add an msleep of 200ms to the suspend() handler
of your PCIe driver and see if that helps. Not a proper
solution obviously, but it could be a good data point, especially
if it helps.
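
Something along these lines would be enough as a data point (the function
name is a placeholder for whatever suspend hook the PCIe driver actually
implements):

#include <linux/delay.h>

/* Placeholder suspend hook: give the endpoint extra time to settle in L1.2 */
static int my_pcie_suspend_noirq(struct device *dev)
{
	msleep(200);	/* debugging aid only, not a proper fix */

	return 0;
}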

I'm interested to learn about your findings!

m.
