2020-12-28 04:07:43

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

> From: Kenneth R. Crudup <[email protected]>
>
> I've been running Linus' master branch on my laptop (Dell XPS 13
> 2-in-1). With this commit in place, after resuming from hibernate
> my machine is essentially useless, with a torrent of disk I/O errors
> on my NVMe device (at least, and possibly other devices affected)
> until a reboot.
>
> I do use tlp to set the PCIe ASPM to "performance" on AC and
> "powersupersave" on battery.

Thanks a lot for the report, and sorry for the breakage.

4257f7e008ea restores PCI_L1SS_CTL1, then PCI_L1SS_CTL2. I think it
should do those in the reverse order, since the Enable bits are in
PCI_L1SS_CTL1. It also restores L1SS state (potentially enabling
L1.x) before we restore the PCIe Capability (potentially enabling ASPM
as a whole). Those probably should also be in the other order.

If it's convenient, can you try the patch below? If the debug patch
doesn't help:

- Are you seeing the hibernate/resume problem when on AC or on
battery?

- What if you don't use tlp? Does hibernate/resume work fine then?
If tlp makes a difference, can you collect "sudo lspci -vv" output
with and without using tlp (before hibernate)?

- If you revert 4257f7e008ea, does hibernate/resume work fine? Both
with the tlp tweak and without?

- Collect "sudo lspci -vv" output before hibernate and (if possible)
after resume when you see the problem.

I guess tlp only uses /sys/module/pcie_aspm/parameters/policy, so it
sets the same ASPM policy system-wide.

Bjorn

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b9fecc25d213..6598b5cd3154 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1665,9 +1665,8 @@ void pci_restore_state(struct pci_dev *dev)
* LTR itself (in the PCIe capability).
*/
pci_restore_ltr_state(dev);
- pci_restore_aspm_l1ss_state(dev);
-
pci_restore_pcie_state(dev);
+ pci_restore_aspm_l1ss_state(dev);
pci_restore_pasid_state(dev);
pci_restore_pri_state(dev);
pci_restore_ats_state(dev);
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index a08e7d6dc248..c4a99274b4bb 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -752,8 +752,8 @@ void pci_save_aspm_l1ss_state(struct pci_dev *dev)
return;

cap = (u32 *)&save_state->cap.data[0];
- pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, cap++);
pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, cap++);
+ pci_read_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, cap++);
}

void pci_restore_aspm_l1ss_state(struct pci_dev *dev)
@@ -774,8 +774,8 @@ void pci_restore_aspm_l1ss_state(struct pci_dev *dev)
return;

cap = (u32 *)&save_state->cap.data[0];
- pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, *cap++);
pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL2, *cap++);
+ pci_write_config_dword(dev, aspm_l1ss + PCI_L1SS_CTL1, *cap++);
}

static void pcie_config_aspm_dev(struct pci_dev *pdev, u32 val)


2020-12-28 05:00:33

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures


On Sun, 27 Dec 2020, Bjorn Helgaas wrote:

> If it's convenient, can you try the patch below?

Will do!

Also:

> - Are you seeing the hibernate/resume problem when on AC or on
> battery?

Um, I forget :) but want to say "both". I'll try both ways and let you know.

> - If you revert 4257f7e008ea, does hibernate/resume work fine? Both
> with the tlp tweak and without?

Yeah, but TBH there were two other PM regressions in this -rc cycle, so
you guys are in good company :)

> - Collect "sudo lspci -vv" output before hibernate and (if possible)
> after resume when you see the problem.

See attached.

> I guess tlp only uses /sys/module/pcie_aspm/parameters/policy, so it
> sets the same ASPM policy system-wide.

Yeah.

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA


Attachments:
lspci (13.71 kB)

2020-12-28 05:45:54

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures


OK, got more info:

On Sun, 27 Dec 2020, Bjorn Helgaas wrote:

> If it's convenient, can you try the patch below? If the debug patch
> doesn't help:

> - Are you seeing the hibernate/resume problem when on AC or on
> battery?

OK, so:

- on TLP, before your patch, it panic()s on AC, but not on battery
- on TLP, with your patch, it panic()s on battery, but NOT on AC
- if TLP is masked, it still panic()s, but I forget if AC or battery
- even if I mask TLP, with your commit in place it panic()s

> - If you revert 4257f7e008ea, does hibernate/resume work fine? Both
> with the tlp tweak and without?

Definitely with the revert everything works. I'll try your patch after the
revert and see if anything changes.

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA

2020-12-28 06:33:23

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures


On Sun, 27 Dec 2020, Kenneth R. Crudup wrote:

> I'll try your patch after the revert and see if anything changes.

I just realized today's patch makes no sense if it's reverted, so nevermind.

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA

2020-12-30 06:56:49

by Vidya Sagar

[permalink] [raw]
Subject: Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

Ideally Bjorn's patch should have worked.
Could you please collect 'sudo lspci -vv' (please don't forget to give
sudo) with Bjorn's patch before and after hibernate?
Also, is it right to say that with policy set to "performance" there is
no issue during hibernate/resume?

- Vidya Sagar

On 12/28/2020 12:00 PM, Kenneth R. Crudup wrote:
> External email: Use caution opening links or attachments
>
>
> On Sun, 27 Dec 2020, Kenneth R. Crudup wrote:
>
>> I'll try your patch after the revert and see if anything changes.
>
> I just realized today's patch makes no sense if it's reverted, so nevermind.
>
> -Kenny
>
> --
> Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA
>

2021-01-22 21:32:13

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures


> > From: Kenneth R. Crudup <[email protected]>
> > I've been running Linus' master branch on my laptop (Dell XPS 13
> > 2-in-1). With this commit in place, after resuming from hibernate
> > my machine is essentially useless, with a torrent of disk I/O errors
> > on my NVMe device (at least, and possibly other devices affected)
> > until a reboot.
> >
> > I do use tlp to set the PCIe ASPM to "performance" on AC and
> > "powersupersave" on battery.

On Sun, 27 Dec 2020, Bjorn Helgaas wrote:

> Thanks a lot for the report, and sorry for the breakage.
> 4257f7e008ea restores PCI_L1SS_CTL1, then PCI_L1SS_CTL2. I think it
> should do those in the reverse order, since the Enable bits are in
> PCI_L1SS_CTL1. It also restores L1SS state (potentially enabling
> L1.x) before we restore the PCIe Capability (potentially enabling ASPM
> as a whole). Those probably should also be in the other order.

Any new news on this? Disabling "tlp" (which just shifts the problem around
on my machine) shouldn't be a solution for this issue.

I'd thought it may have been tied to some of the PM regressions of the last
week of December, but all of those have been fixed but this still remains.

Thanks,

-Kenny

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA

2021-01-28 00:06:02

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures

On Fri, Jan 22, 2021 at 12:11:08PM -0800, Kenneth R. Crudup wrote:
> > > From: Kenneth R. Crudup <[email protected]>
> > > I've been running Linus' master branch on my laptop (Dell XPS 13
> > > 2-in-1). With this commit in place, after resuming from hibernate
> > > my machine is essentially useless, with a torrent of disk I/O errors
> > > on my NVMe device (at least, and possibly other devices affected)
> > > until a reboot.
> > >
> > > I do use tlp to set the PCIe ASPM to "performance" on AC and
> > > "powersupersave" on battery.
>
> On Sun, 27 Dec 2020, Bjorn Helgaas wrote:
>
> > Thanks a lot for the report, and sorry for the breakage.
> > 4257f7e008ea restores PCI_L1SS_CTL1, then PCI_L1SS_CTL2. I think it
> > should do those in the reverse order, since the Enable bits are in
> > PCI_L1SS_CTL1. It also restores L1SS state (potentially enabling
> > L1.x) before we restore the PCIe Capability (potentially enabling ASPM
> > as a whole). Those probably should also be in the other order.
>
> Any new news on this? Disabling "tlp" (which just shifts the problem around
> on my machine) shouldn't be a solution for this issue.

Agreed; disabling "tlp" is a workaround but not a solution.

> I'd thought it may have been tied to some of the PM regressions of the last
> week of December, but all of those have been fixed but this still remains.

I haven't seen anything yet and haven't had a chance to look into it
more myself.

We're at v5.11-rc5 already, so I guess we'll have to think about
reverting 4257f7e008ea ("PCI/ASPM: Save/restore L1SS Capability for
suspend/resume") before v5.11-final unless we can make some progress.

That would mean ASPM L1 substate configuration would be lost by a
suspend/resume, so we'd give up some power saving. But that's better
than the regression you're seeing.

I'll tentatively queue up a revert on for-linus pending progress on a
better fix. For some reason I can't find your initial report of the
regression. The first thing I can find is this:

https://lore.kernel.org/linux-pci/20201228040513.GA611645@bjorn-Precision-5520/

Do you have a URL for your initial report that I could include in the
revert commit log?

Bjorn

2021-01-28 00:06:34

by Kenneth Crudup

[permalink] [raw]
Subject: Re: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume failures


On Wed, 27 Jan 2021, Bjorn Helgaas wrote:

> Do you have a URL for your initial report that I could include in the
> revert commit log?

I don't, as I'd emailed the committers first and that was then CCed to the
mailing list, but here's what I'd sent:

----
Date: Fri, 25 Dec 2020 16:38:56
From: Kenneth R. Crudup <[email protected]>
To: [email protected]
Cc: [email protected]
Subject: Commit 4257f7e0 ("PCI/ASPM: Save/restore L1SS Capability for suspend/resume") causing hibernate resume
failures

I've been running Linus' master branch on my laptop (Dell XPS 13 2-in-1). With
this commit in place, after resuming from hibernate my machine is essentially
useless, with a torrent of disk I/O errors on my NVMe device (at least, and
possibly other devices affected) until a reboot.

I do use tlp to set the PCIe ASPM to "performance" on AC and "powersupersave"
on battery.

Let me know if you need more information.

-Kenny

----

--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Orange County CA