Issue:
On resume from suspend to RAM there is no output for about 12 seconds,
then shortly a blinking cursor is visible in the upper left corner on an
otherwise black screen which is followed by a reboot.
Setup:
* Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
* Firmware: 0508 (latest; also tested previous 0505)
* OS: Ubuntu 23.10 (except kernel)
* Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
Debugging summary:
* Kernel 5.10.205 isn’t affected.
* Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
cause.
* PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long
as ASPM is enabled (default).
* The commit message indicates that a quirk could be written to mitigate
the issue but I don’t know how to write such a quirk.
Confirmed workarounds:
* Connect a USB flash drive (no clue why; maybe this causes a delay that
lets the resume succeed)
* Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
intentional; a quirk seems to be the preferred solution)
* pcie_aspm=off
* pcie_aspm.policy=performance
* echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
Debugging details:
* The resume trigger (power button, keyboard, mouse) doesn’t seem to
make any difference.
* Double checked that the kernel is configured to *not* reboot on panic.
* Double checked that there still isn't any kernel output without quiet
and splash.
* The issue doesn’t happen if a USB flash drive is connected. The
content of the flash drive doesn’t appear to matter. The USB port
doesn’t appear to matter.
* No information in any logs after the reboot. I suspect the resume from
suspend to RAM isn’t getting far enough as that logs could be written.
* Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
affected.
* A kernel bisect has revealed the following commit as cause:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
* The commit was part of kernel 5.20 and has been backported to 5.15.
* The commit mentions that a device-specific quirk could be added in
case of new issues.
* According to sysfs and lspci only device 0000:03:00.0 (Intel 8265
Wifi) has ASPM enabled by default.
* Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to
RAM succeed.
* Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
suspend to RAM succeed.
* This would indicate that a quirk is missing for the device
0000:03:00.0 (Intel 8265 Wifi) but I have no clue how to write such a
quirk or how to get the specifics for such a quirk.
* I still have no clue how a USB flash drive plays into all this. Maybe
some kind of a timing issue where the connected USB flash drive delays
something long enough so that the resume succeeds. Maybe the code
removed by commit 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a
similar delay. ¯\_(ツ)_/¯
[+cc David (more details at
https://lore.kernel.org/r/[email protected])]
Hi Michael, thank you very much for debugging and reporting this!
Sorry for the major inconvenience.
On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
> Issue:
> On resume from suspend to RAM there is no output for about 12 seconds, then
> shortly a blinking cursor is visible in the upper left corner on an
> otherwise black screen which is followed by a reboot.
>
> Setup:
> * Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
> * Firmware: 0508 (latest; also tested previous 0505)
> * OS: Ubuntu 23.10 (except kernel)
> * Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
>
> Debugging summary:
> * Kernel 5.10.205 isn’t affected.
> * Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
> cause.
#regzbot introduced: 08d0cc5f3426^
> * PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long as
> ASPM is enabled (default).
> * The commit message indicates that a quirk could be written to mitigate the
> issue but I don’t know how to write such a quirk.
>
> Confirmed workarounds:
> * Connect a USB flash drive (no clue why; maybe this causes a delay that
> lets the resume succeed)
> * Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
> intentional; a quirk seems to be the preferred solution)
> * pcie_aspm=off
> * pcie_aspm.policy=performance
> * echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
>
> Debugging details:
> * The resume trigger (power button, keyboard, mouse) doesn’t seem to make
> any difference.
> * Double checked that the kernel is configured to *not* reboot on panic.
> * Double checked that there still isn't any kernel output without quiet and
> splash.
> * The issue doesn’t happen if a USB flash drive is connected. The content of
> the flash drive doesn’t appear to matter. The USB port doesn’t appear to
> matter.
> * No information in any logs after the reboot. I suspect the resume from
> suspend to RAM isn’t getting far enough as that logs could be written.
> * Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
> affected.
> * A kernel bisect has revealed the following commit as cause:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
> * The commit was part of kernel 5.20 and has been backported to 5.15.
> * The commit mentions that a device-specific quirk could be added in case of
> new issues.
> * According to sysfs and lspci only device 0000:03:00.0 (Intel 8265 Wifi)
> has ASPM enabled by default.
> * Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to RAM
> succeed.
> * Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
> suspend to RAM succeed.
> * This would indicate that a quirk is missing for the device 0000:03:00.0
> (Intel 8265 Wifi) but I have no clue how to write such a quirk or how to get
> the specifics for such a quirk.
> * I still have no clue how a USB flash drive plays into all this. Maybe some
> kind of a timing issue where the connected USB flash drive delays something
> long enough so that the resume succeeds. Maybe the code removed by commit
> 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a similar delay. ¯\_(ツ)_/¯
We have some known issues with saving and restoring ASPM state on
suspend/resume, in particular with ASPM L1 Substates, which are
enabled on this device.
David Box has a patch in the works that should fix one of those
issues:
https://lore.kernel.org/r/[email protected]
It's not merged yet, but it's possible it might fix or at least be
related to this. If you try it out, please let us know what happens.
> 03:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)
> Capabilities: [40] Express (v2) Endpoint, MSI 00
> LnkCap: Port #6, Speed 2.5GT/s, Width x1, ASPM L1, Exit Latency L1 <8us
> ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
> Capabilities: [14c v1] Latency Tolerance Reporting
> Max snoop latency: 1048576ns
> Max no snoop latency: 1048576ns
> Capabilities: [154 v1] L1 PM Substates
> L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
> PortCommonModeRestoreTime=30us PortTPowerOnTime=18us
> L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> T_CommonMode=0us LTR1.2_Threshold=186368ns
> L1SubCtl2: T_PwrOn=150us
> $ grep -F '' /sys/bus/pci/devices/*/link/*pm
> /sys/bus/pci/devices/0000:03:00.0/link/clkpm:1
> /sys/bus/pci/devices/0000:03:00.0/link/l1_1_aspm:1
> /sys/bus/pci/devices/0000:03:00.0/link/l1_1_pcipm:1
> /sys/bus/pci/devices/0000:03:00.0/link/l1_2_aspm:1
> /sys/bus/pci/devices/0000:03:00.0/link/l1_2_pcipm:1
> /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm:1
> /sys/bus/pci/devices/0000:04:00.0/link/clkpm:0
> ...
> Hi Michael, thank you very much for debugging and reporting this!
> Sorry for the major inconvenience.
>
You're welcome and the only inconvenience was doing the bisect. ;-)
> We have some known issues with saving and restoring ASPM state on
> suspend/resume, in particular with ASPM L1 Substates, which are
> enabled on this device.
>
> David Box has a patch in the works that should fix one of those
> issues:
> https://lore.kernel.org/r/[email protected]
>
> It's not merged yet, but it's possible it might fix or at least be
> related to this. If you try it out, please let us know what happens.
>
I gave the proposed patch a quick test on top of kernel 6.6.8 and it
didn't affect this issue (the reboot still happens on resume attempt).
On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
> Issue:
> On resume from suspend to RAM there is no output for about 12 seconds, then
> shortly a blinking cursor is visible in the upper left corner on an
> otherwise black screen which is followed by a reboot.
>
> Setup:
> * Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
> * Firmware: 0508 (latest; also tested previous 0505)
> * OS: Ubuntu 23.10 (except kernel)
> * Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
>
> Debugging summary:
> * Kernel 5.10.205 isn’t affected.
> * Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
> cause.
> * PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long as
> ASPM is enabled (default).
> * The commit message indicates that a quirk could be written to mitigate the
> issue but I don’t know how to write such a quirk.
>
> Confirmed workarounds:
> * Connect a USB flash drive (no clue why; maybe this causes a delay that
> lets the resume succeed)
> * Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
> intentional; a quirk seems to be the preferred solution)
> * pcie_aspm=off
> * pcie_aspm.policy=performance
> * echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
>
> Debugging details:
> * The resume trigger (power button, keyboard, mouse) doesn’t seem to make
> any difference.
> * Double checked that the kernel is configured to *not* reboot on panic.
> * Double checked that there still isn't any kernel output without quiet and
> splash.
> * The issue doesn’t happen if a USB flash drive is connected. The content of
> the flash drive doesn’t appear to matter. The USB port doesn’t appear to
> matter.
> * No information in any logs after the reboot. I suspect the resume from
> suspend to RAM isn’t getting far enough as that logs could be written.
> * Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
> affected.
> * A kernel bisect has revealed the following commit as cause:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
> * The commit was part of kernel 5.20 and has been backported to 5.15.
> * The commit mentions that a device-specific quirk could be added in case of
> new issues.
> * According to sysfs and lspci only device 0000:03:00.0 (Intel 8265 Wifi)
> has ASPM enabled by default.
> * Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to RAM
> succeed.
> * Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
> suspend to RAM succeed.
> * This would indicate that a quirk is missing for the device 0000:03:00.0
> (Intel 8265 Wifi) but I have no clue how to write such a quirk or how to get
> the specifics for such a quirk.
> * I still have no clue how a USB flash drive plays into all this. Maybe some
> kind of a timing issue where the connected USB flash drive delays something
> long enough so that the resume succeeds. Maybe the code removed by commit
> 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a similar delay. ¯\_(ツ)_/¯
Hmmm. 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
appeared in v6.0, released Oct 2, 2022, so it's been there a while.
But I think the best option is to revert it until this issue is
resolved. Per the commit log, 08d0cc5f3426 solved two problems:
1) ASPM config changes done via sysfs are lost if the device power
state is changed, e.g., typically set to D3hot in .suspend() and
D0 in .resume().
2) If L1SS is restored during system resume, that restored state
would be overwritten.
Problem 2) relates to a patch that is currently reverted (a7152be79b62
("Revert "PCI/ASPM: Save L1 PM Substates Capability for
suspend/resume""), so I don't think reverting 08d0cc5f3426 will make
this problem worse.
Reverting 08d0cc5f3426 will make 1) a problem again. But my guess is
ASPM changes via sysfs are fairly unusual and the device probably
remains functional even though it may use more power because the ASPM
configuration was lost.
So unless somebody has a counter-argument, I plan to queue a revert of
08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
v6.7.
Bjorn
On 01.01.24 19:13, Bjorn Helgaas wrote:
> On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
>> Issue:
>> On resume from suspend to RAM there is no output for about 12 seconds, then
>> shortly a blinking cursor is visible in the upper left corner on an
>> otherwise black screen which is followed by a reboot.
>>
>> Setup:
>> * Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
>> * Firmware: 0508 (latest; also tested previous 0505)
>> * OS: Ubuntu 23.10 (except kernel)
>> * Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
>>
>> Debugging summary:
>> * Kernel 5.10.205 isn’t affected.
>> * Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
>> cause.
>> * PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long as
>> ASPM is enabled (default).
>> * The commit message indicates that a quirk could be written to mitigate the
>> issue but I don’t know how to write such a quirk.
>>
>> Confirmed workarounds:
>> * Connect a USB flash drive (no clue why; maybe this causes a delay that
>> lets the resume succeed)
>> * Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
>> intentional; a quirk seems to be the preferred solution)
>> * pcie_aspm=off
>> * pcie_aspm.policy=performance
>> * echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
>>
>> Debugging details:
>> * The resume trigger (power button, keyboard, mouse) doesn’t seem to make
>> any difference.
>> * Double checked that the kernel is configured to *not* reboot on panic.
>> * Double checked that there still isn't any kernel output without quiet and
>> splash.
>> * The issue doesn’t happen if a USB flash drive is connected. The content of
>> the flash drive doesn’t appear to matter. The USB port doesn’t appear to
>> matter.
>> * No information in any logs after the reboot. I suspect the resume from
>> suspend to RAM isn’t getting far enough as that logs could be written.
>> * Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
>> affected.
>> * A kernel bisect has revealed the following commit as cause:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
>> * The commit was part of kernel 5.20 and has been backported to 5.15.
>> * The commit mentions that a device-specific quirk could be added in case of
>> new issues.
>> * According to sysfs and lspci only device 0000:03:00.0 (Intel 8265 Wifi)
>> has ASPM enabled by default.
>> * Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to RAM
>> succeed.
>> * Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
>> suspend to RAM succeed.
>> * This would indicate that a quirk is missing for the device 0000:03:00.0
>> (Intel 8265 Wifi) but I have no clue how to write such a quirk or how to get
>> the specifics for such a quirk.
>> * I still have no clue how a USB flash drive plays into all this. Maybe some
>> kind of a timing issue where the connected USB flash drive delays something
>> long enough so that the resume succeeds. Maybe the code removed by commit
>> 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a similar delay. ¯\_(ツ)_/¯
>
> Hmmm. 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> appeared in v6.0, released Oct 2, 2022, so it's been there a while.
>
> But I think the best option is to revert it until this issue is
> resolved. Per the commit log, 08d0cc5f3426 solved two problems:
>
> 1) ASPM config changes done via sysfs are lost if the device power
> state is changed, e.g., typically set to D3hot in .suspend() and
> D0 in .resume().
>
> 2) If L1SS is restored during system resume, that restored state
> would be overwritten.
>
> Problem 2) relates to a patch that is currently reverted (a7152be79b62
> ("Revert "PCI/ASPM: Save L1 PM Substates Capability for
> suspend/resume""), so I don't think reverting 08d0cc5f3426 will make
> this problem worse.
>
> Reverting 08d0cc5f3426 will make 1) a problem again. But my guess is
> ASPM changes via sysfs are fairly unusual and the device probably
> remains functional even though it may use more power because the ASPM
> configuration was lost.
>
> So unless somebody has a counter-argument, I plan to queue a revert of
> 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
> v6.7.
>
> Bjorn
If it helps I could also try if a partial revert of 08d0cc5f3426 would
be sufficient. This might also narrow down the issue and give more
insight where the issue originates from.
Let me know what you think.
Michael
On Mon, Jan 01, 2024 at 07:57:40PM +0100, Michael Schaller wrote:
> On 01.01.24 19:13, Bjorn Helgaas wrote:
> > On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
> > ...
> > So unless somebody has a counter-argument, I plan to queue a revert of
> > 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
> > v6.7.
>
> If it helps I could also try if a partial revert of 08d0cc5f3426 would be
> sufficient. This might also narrow down the issue and give more insight
> where the issue originates from.
We're so close to the v6.7 final release that I doubt we can figure
out the problem and test a fix before v6.7. I'm sure Kai-Heng would
appreciate any additional data, but I don't think it's urgent at this
point.
Bjorn
On 01.01.24 23:15, Bjorn Helgaas wrote:
> On Mon, Jan 01, 2024 at 07:57:40PM +0100, Michael Schaller wrote:
>> On 01.01.24 19:13, Bjorn Helgaas wrote:
>>> On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
>>> ...
>
>>> So unless somebody has a counter-argument, I plan to queue a revert of
>>> 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
>>> v6.7.
>>
>> If it helps I could also try if a partial revert of 08d0cc5f3426 would be
>> sufficient. This might also narrow down the issue and give more insight
>> where the issue originates from.
>
> We're so close to the v6.7 final release that I doubt we can figure
> out the problem and test a fix before v6.7. I'm sure Kai-Heng would
> appreciate any additional data, but I don't think it's urgent at this
> point.
>
> Bjorn
We're indeed close to the final v6.7 release, which in turn means that a
last minute revert of a 16 month old commit might cause even more
regressions as there have been quite a few ASPM changes afterwards and
there won't be much testing anymore before the final release.
Furthermore, given the age of the commit and that it has been backported
to kernel 5.15, the question is also if the revert would be backported
to the affected LTS kernels?
If this regression risk is acceptable then I'm all for reverting the
commit now and then working on a fix.
Michael
From: Bjorn Helgaas <[email protected]>
This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
Michael reported that when attempting to resume from suspend to RAM on ASUS
mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
with no output, followed by a reboot.
Workarounds include:
- Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
- Booting with "pcie_aspm=off"
- Booting with "pcie_aspm.policy=performance"
- "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
before suspending
- Connecting a USB flash drive
Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
Reported-by: Michael Schaller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
Cc: <[email protected]>
---
drivers/pci/pci.c | 6 ++++++
drivers/pci/pci.h | 2 ++
drivers/pci/pcie/aspm.c | 19 +++++++++++++++++++
3 files changed, 27 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 55bc3576a985..bdbf8a94b4d0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1335,6 +1335,9 @@ static int pci_set_full_power_state(struct pci_dev *dev)
pci_restore_bars(dev);
}
+ if (dev->bus->self)
+ pcie_aspm_pm_state_change(dev->bus->self);
+
return 0;
}
@@ -1429,6 +1432,9 @@ static int pci_set_low_power_state(struct pci_dev *dev, pci_power_t state)
pci_power_name(dev->current_state),
pci_power_name(state));
+ if (dev->bus->self)
+ pcie_aspm_pm_state_change(dev->bus->self);
+
return 0;
}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5ecbcf041179..f43873049d52 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -569,10 +569,12 @@ int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
#ifdef CONFIG_PCIEASPM
void pcie_aspm_init_link_state(struct pci_dev *pdev);
void pcie_aspm_exit_link_state(struct pci_dev *pdev);
+void pcie_aspm_pm_state_change(struct pci_dev *pdev);
void pcie_aspm_powersave_config_link(struct pci_dev *pdev);
#else
static inline void pcie_aspm_init_link_state(struct pci_dev *pdev) { }
static inline void pcie_aspm_exit_link_state(struct pci_dev *pdev) { }
+static inline void pcie_aspm_pm_state_change(struct pci_dev *pdev) { }
static inline void pcie_aspm_powersave_config_link(struct pci_dev *pdev) { }
#endif
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 50b04ae5c394..8715e951c491 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -1008,6 +1008,25 @@ void pcie_aspm_exit_link_state(struct pci_dev *pdev)
up_read(&pci_bus_sem);
}
+/* @pdev: the root port or switch downstream port */
+void pcie_aspm_pm_state_change(struct pci_dev *pdev)
+{
+ struct pcie_link_state *link = pdev->link_state;
+
+ if (aspm_disabled || !link)
+ return;
+ /*
+ * Devices changed PM state, we should recheck if latency
+ * meets all functions' requirement
+ */
+ down_read(&pci_bus_sem);
+ mutex_lock(&aspm_lock);
+ pcie_update_aspm_capable(link->root);
+ pcie_config_aspm_path(link);
+ mutex_unlock(&aspm_lock);
+ up_read(&pci_bus_sem);
+}
+
void pcie_aspm_powersave_config_link(struct pci_dev *pdev)
{
struct pcie_link_state *link = pdev->link_state;
--
2.34.1
On 1/2/2024 3:25 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <[email protected]>
>
> This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
>
> Michael reported that when attempting to resume from suspend to RAM on ASUS
> mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
> ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
> with no output, followed by a reboot.
>
> Workarounds include:
>
> - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> - Booting with "pcie_aspm=off"
> - Booting with "pcie_aspm.policy=performance"
> - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
> before suspending
> - Connecting a USB flash drive
>
Did you find the root cause? Is this issue specific to that particular
device? If yes, can we do a quirk?
> Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> Reported-by: Michael Schaller <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Bjorn Helgaas <[email protected]>
> Cc: <[email protected]>
> ---
> drivers/pci/pci.c | 6 ++++++
> drivers/pci/pci.h | 2 ++
> drivers/pci/pcie/aspm.c | 19 +++++++++++++++++++
> 3 files changed, 27 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 55bc3576a985..bdbf8a94b4d0 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1335,6 +1335,9 @@ static int pci_set_full_power_state(struct pci_dev *dev)
> pci_restore_bars(dev);
> }
>
> + if (dev->bus->self)
> + pcie_aspm_pm_state_change(dev->bus->self);
> +
> return 0;
> }
>
> @@ -1429,6 +1432,9 @@ static int pci_set_low_power_state(struct pci_dev *dev, pci_power_t state)
> pci_power_name(dev->current_state),
> pci_power_name(state));
>
> + if (dev->bus->self)
> + pcie_aspm_pm_state_change(dev->bus->self);
> +
> return 0;
> }
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 5ecbcf041179..f43873049d52 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -569,10 +569,12 @@ int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
> #ifdef CONFIG_PCIEASPM
> void pcie_aspm_init_link_state(struct pci_dev *pdev);
> void pcie_aspm_exit_link_state(struct pci_dev *pdev);
> +void pcie_aspm_pm_state_change(struct pci_dev *pdev);
> void pcie_aspm_powersave_config_link(struct pci_dev *pdev);
> #else
> static inline void pcie_aspm_init_link_state(struct pci_dev *pdev) { }
> static inline void pcie_aspm_exit_link_state(struct pci_dev *pdev) { }
> +static inline void pcie_aspm_pm_state_change(struct pci_dev *pdev) { }
> static inline void pcie_aspm_powersave_config_link(struct pci_dev *pdev) { }
> #endif
>
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index 50b04ae5c394..8715e951c491 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -1008,6 +1008,25 @@ void pcie_aspm_exit_link_state(struct pci_dev *pdev)
> up_read(&pci_bus_sem);
> }
>
> +/* @pdev: the root port or switch downstream port */
> +void pcie_aspm_pm_state_change(struct pci_dev *pdev)
> +{
> + struct pcie_link_state *link = pdev->link_state;
> +
> + if (aspm_disabled || !link)
> + return;
> + /*
> + * Devices changed PM state, we should recheck if latency
> + * meets all functions' requirement
> + */
> + down_read(&pci_bus_sem);
> + mutex_lock(&aspm_lock);
> + pcie_update_aspm_capable(link->root);
> + pcie_config_aspm_path(link);
> + mutex_unlock(&aspm_lock);
> + up_read(&pci_bus_sem);
> +}
> +
> void pcie_aspm_powersave_config_link(struct pci_dev *pdev)
> {
> struct pcie_link_state *link = pdev->link_state;
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
On Tue, Jan 02, 2024 at 03:33:51PM -0800, Kuppuswamy Sathyanarayanan wrote:
> On 1/2/2024 3:25 PM, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas <[email protected]>
> >
> > This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
> >
> > Michael reported that when attempting to resume from suspend to RAM on ASUS
> > mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
> > ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
> > with no output, followed by a reboot.
> >
> > Workarounds include:
> >
> > - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > - Booting with "pcie_aspm=off"
> > - Booting with "pcie_aspm.policy=performance"
> > - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
> > before suspending
> > - Connecting a USB flash drive
>
> Did you find the root cause? Is this issue specific to that particular
> device? If yes, can we do a quirk?
Unfortunately we don't know the root cause yet. Without knowing the
root cause, I don't think we can make a good quirk.
Bjorn
On 02.01.24 14:50, Michael Schaller wrote:
> On 01.01.24 23:15, Bjorn Helgaas wrote:
>> On Mon, Jan 01, 2024 at 07:57:40PM +0100, Michael Schaller wrote:
>>> On 01.01.24 19:13, Bjorn Helgaas wrote:
>>>> On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
>>>> ...
>>
>>>> So unless somebody has a counter-argument, I plan to queue a revert of
>>>> 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
>>>> v6.7.
>>>
>>> If it helps I could also try if a partial revert of 08d0cc5f3426
>>> would be
>>> sufficient. This might also narrow down the issue and give more insight
>>> where the issue originates from.
>>
>> We're so close to the v6.7 final release that I doubt we can figure
>> out the problem and test a fix before v6.7. I'm sure Kai-Heng would
>> appreciate any additional data, but I don't think it's urgent at this
>> point.
>
> We're indeed close to the final v6.7 release, which in turn means that a
> last minute revert of a 16 month old commit might cause even more
> regressions as there have been quite a few ASPM changes afterwards and
> there won't be much testing anymore before the final release.
>
> Furthermore, given the age of the commit and that it has been backported
> to kernel 5.15, the question is also if the revert would be backported
> to the affected LTS kernels?
>
> If this regression risk is acceptable then I'm all for reverting the
> commit now and then working on a fix.
FWIW (just in case some of you might not be aware of this): Linus not
that long ago said this about regressions that are somewhat older:
"""
There's obviously a time limit: if that "regression in an earlier
release" was a year or more ago, and just took forever for people to
notice, and it had semantic changes that now mean that fixing the
regression could cause a _new_ regression, then that can cause me to
go "Oh, now the new semantics are what we have to live with".
"""
For full context see:
https://lore.kernel.org/all/CAHk-=wis_qQy4oDNynNKi5b7Qhosmxtoj1jxo5wmB6SRUwQUBQ@mail.gmail.com/
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
On Mon, 1 Jan 2024, Bjorn Helgaas wrote:
> On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
> > Issue:
> > On resume from suspend to RAM there is no output for about 12 seconds, then
> > shortly a blinking cursor is visible in the upper left corner on an
> > otherwise black screen which is followed by a reboot.
> >
> > Setup:
> > * Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
> > * Firmware: 0508 (latest; also tested previous 0505)
> > * OS: Ubuntu 23.10 (except kernel)
> > * Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
> >
> > Debugging summary:
> > * Kernel 5.10.205 isn’t affected.
> > * Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
> > cause.
> > * PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long as
> > ASPM is enabled (default).
> > * The commit message indicates that a quirk could be written to mitigate the
> > issue but I don’t know how to write such a quirk.
> >
> > Confirmed workarounds:
> > * Connect a USB flash drive (no clue why; maybe this causes a delay that
> > lets the resume succeed)
> > * Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
> > intentional; a quirk seems to be the preferred solution)
> > * pcie_aspm=off
> > * pcie_aspm.policy=performance
> > * echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
> >
> > Debugging details:
> > * The resume trigger (power button, keyboard, mouse) doesn’t seem to make
> > any difference.
> > * Double checked that the kernel is configured to *not* reboot on panic.
> > * Double checked that there still isn't any kernel output without quiet and
> > splash.
> > * The issue doesn’t happen if a USB flash drive is connected. The content of
> > the flash drive doesn’t appear to matter. The USB port doesn’t appear to
> > matter.
> > * No information in any logs after the reboot. I suspect the resume from
> > suspend to RAM isn’t getting far enough as that logs could be written.
> > * Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
> > affected.
> > * A kernel bisect has revealed the following commit as cause:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
> > * The commit was part of kernel 5.20 and has been backported to 5.15.
> > * The commit mentions that a device-specific quirk could be added in case of
> > new issues.
> > * According to sysfs and lspci only device 0000:03:00.0 (Intel 8265 Wifi)
> > has ASPM enabled by default.
> > * Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to RAM
> > succeed.
> > * Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
> > suspend to RAM succeed.
> > * This would indicate that a quirk is missing for the device 0000:03:00.0
> > (Intel 8265 Wifi) but I have no clue how to write such a quirk or how to get
> > the specifics for such a quirk.
> > * I still have no clue how a USB flash drive plays into all this. Maybe some
> > kind of a timing issue where the connected USB flash drive delays something
> > long enough so that the resume succeeds. Maybe the code removed by commit
> > 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a similar delay. ¯\_(ツ)_/¯
>
> Hmmm. 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> appeared in v6.0, released Oct 2, 2022, so it's been there a while.
>
> But I think the best option is to revert it until this issue is
> resolved. Per the commit log, 08d0cc5f3426 solved two problems:
>
> 1) ASPM config changes done via sysfs are lost if the device power
> state is changed, e.g., typically set to D3hot in .suspend() and
> D0 in .resume().
>
> 2) If L1SS is restored during system resume, that restored state
> would be overwritten.
>
> Problem 2) relates to a patch that is currently reverted (a7152be79b62
> ("Revert "PCI/ASPM: Save L1 PM Substates Capability for
> suspend/resume""), so I don't think reverting 08d0cc5f3426 will make
> this problem worse.
>
> Reverting 08d0cc5f3426 will make 1) a problem again. But my guess is
> ASPM changes via sysfs are fairly unusual and the device probably
> remains functional even though it may use more power because the ASPM
> configuration was lost.
>
> So unless somebody has a counter-argument, I plan to queue a revert of
> 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
> v6.7.
Hi,
I cannot understand how 1) even occurs. AFAICT, nothing
pcie_aspm_pm_state_change() calls into overwrites link->aspm_disable that
is the variable storing user inputs via sysfs. So how the changes via
sysfs are lost?
--
i.
On Wed, Jan 3, 2024 at 11:41 PM Ilpo Järvinen
<[email protected]> wrote:
>
> On Mon, 1 Jan 2024, Bjorn Helgaas wrote:
>
> > On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
> > > Issue:
> > > On resume from suspend to RAM there is no output for about 12 seconds, then
> > > shortly a blinking cursor is visible in the upper left corner on an
> > > otherwise black screen which is followed by a reboot.
> > >
> > > Setup:
> > > * Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
> > > * Firmware: 0508 (latest; also tested previous 0505)
> > > * OS: Ubuntu 23.10 (except kernel)
> > > * Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
> > >
> > > Debugging summary:
> > > * Kernel 5.10.205 isn’t affected.
> > > * Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
> > > cause.
> > > * PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long as
> > > ASPM is enabled (default).
> > > * The commit message indicates that a quirk could be written to mitigate the
> > > issue but I don’t know how to write such a quirk.
> > >
> > > Confirmed workarounds:
> > > * Connect a USB flash drive (no clue why; maybe this causes a delay that
> > > lets the resume succeed)
> > > * Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
> > > intentional; a quirk seems to be the preferred solution)
> > > * pcie_aspm=off
> > > * pcie_aspm.policy=performance
> > > * echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
> > >
> > > Debugging details:
> > > * The resume trigger (power button, keyboard, mouse) doesn’t seem to make
> > > any difference.
> > > * Double checked that the kernel is configured to *not* reboot on panic.
> > > * Double checked that there still isn't any kernel output without quiet and
> > > splash.
> > > * The issue doesn’t happen if a USB flash drive is connected. The content of
> > > the flash drive doesn’t appear to matter. The USB port doesn’t appear to
> > > matter.
> > > * No information in any logs after the reboot. I suspect the resume from
> > > suspend to RAM isn’t getting far enough as that logs could be written.
> > > * Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
> > > affected.
> > > * A kernel bisect has revealed the following commit as cause:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
> > > * The commit was part of kernel 5.20 and has been backported to 5.15.
> > > * The commit mentions that a device-specific quirk could be added in case of
> > > new issues.
> > > * According to sysfs and lspci only device 0000:03:00.0 (Intel 8265 Wifi)
> > > has ASPM enabled by default.
> > > * Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to RAM
> > > succeed.
> > > * Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
> > > suspend to RAM succeed.
> > > * This would indicate that a quirk is missing for the device 0000:03:00.0
> > > (Intel 8265 Wifi) but I have no clue how to write such a quirk or how to get
> > > the specifics for such a quirk.
> > > * I still have no clue how a USB flash drive plays into all this. Maybe some
> > > kind of a timing issue where the connected USB flash drive delays something
> > > long enough so that the resume succeeds. Maybe the code removed by commit
> > > 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a similar delay. ¯\_(ツ)_/¯
> >
> > Hmmm. 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > appeared in v6.0, released Oct 2, 2022, so it's been there a while.
> >
> > But I think the best option is to revert it until this issue is
> > resolved. Per the commit log, 08d0cc5f3426 solved two problems:
> >
> > 1) ASPM config changes done via sysfs are lost if the device power
> > state is changed, e.g., typically set to D3hot in .suspend() and
> > D0 in .resume().
> >
> > 2) If L1SS is restored during system resume, that restored state
> > would be overwritten.
> >
> > Problem 2) relates to a patch that is currently reverted (a7152be79b62
> > ("Revert "PCI/ASPM: Save L1 PM Substates Capability for
> > suspend/resume""), so I don't think reverting 08d0cc5f3426 will make
> > this problem worse.
> >
> > Reverting 08d0cc5f3426 will make 1) a problem again. But my guess is
> > ASPM changes via sysfs are fairly unusual and the device probably
> > remains functional even though it may use more power because the ASPM
> > configuration was lost.
> >
> > So unless somebody has a counter-argument, I plan to queue a revert of
> > 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
> > v6.7.
>
> Hi,
>
> I cannot understand how 1) even occurs. AFAICT, nothing
> pcie_aspm_pm_state_change() calls into overwrites link->aspm_disable that
> is the variable storing user inputs via sysfs. So how the changes via
> sysfs are lost?
Because it's states being enabled via sysfs get overwritten, not the
disabled ones.
Kai-Heng
>
> --
> i.
Hi Michael,
On Tue, Jan 2, 2024 at 2:57 AM Michael Schaller <[email protected]> wrote:
>
> On 01.01.24 19:13, Bjorn Helgaas wrote:
> > On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
> >> Issue:
> >> On resume from suspend to RAM there is no output for about 12 seconds, then
> >> shortly a blinking cursor is visible in the upper left corner on an
> >> otherwise black screen which is followed by a reboot.
> >>
> >> Setup:
> >> * Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
> >> * Firmware: 0508 (latest; also tested previous 0505)
> >> * OS: Ubuntu 23.10 (except kernel)
> >> * Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
> >>
> >> Debugging summary:
> >> * Kernel 5.10.205 isn’t affected.
> >> * Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
> >> cause.
> >> * PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long as
> >> ASPM is enabled (default).
> >> * The commit message indicates that a quirk could be written to mitigate the
> >> issue but I don’t know how to write such a quirk.
> >>
> >> Confirmed workarounds:
> >> * Connect a USB flash drive (no clue why; maybe this causes a delay that
> >> lets the resume succeed)
> >> * Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
> >> intentional; a quirk seems to be the preferred solution)
> >> * pcie_aspm=off
> >> * pcie_aspm.policy=performance
> >> * echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
> >>
> >> Debugging details:
> >> * The resume trigger (power button, keyboard, mouse) doesn’t seem to make
> >> any difference.
> >> * Double checked that the kernel is configured to *not* reboot on panic.
> >> * Double checked that there still isn't any kernel output without quiet and
> >> splash.
> >> * The issue doesn’t happen if a USB flash drive is connected. The content of
> >> the flash drive doesn’t appear to matter. The USB port doesn’t appear to
> >> matter.
> >> * No information in any logs after the reboot. I suspect the resume from
> >> suspend to RAM isn’t getting far enough as that logs could be written.
> >> * Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
> >> affected.
> >> * A kernel bisect has revealed the following commit as cause:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
> >> * The commit was part of kernel 5.20 and has been backported to 5.15.
> >> * The commit mentions that a device-specific quirk could be added in case of
> >> new issues.
> >> * According to sysfs and lspci only device 0000:03:00.0 (Intel 8265 Wifi)
> >> has ASPM enabled by default.
> >> * Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to RAM
> >> succeed.
> >> * Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
> >> suspend to RAM succeed.
> >> * This would indicate that a quirk is missing for the device 0000:03:00.0
> >> (Intel 8265 Wifi) but I have no clue how to write such a quirk or how to get
> >> the specifics for such a quirk.
> >> * I still have no clue how a USB flash drive plays into all this. Maybe some
> >> kind of a timing issue where the connected USB flash drive delays something
> >> long enough so that the resume succeeds. Maybe the code removed by commit
> >> 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a similar delay. ¯\_(ツ)_/¯
> >
> > Hmmm. 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > appeared in v6.0, released Oct 2, 2022, so it's been there a while.
> >
> > But I think the best option is to revert it until this issue is
> > resolved. Per the commit log, 08d0cc5f3426 solved two problems:
> >
> > 1) ASPM config changes done via sysfs are lost if the device power
> > state is changed, e.g., typically set to D3hot in .suspend() and
> > D0 in .resume().
> >
> > 2) If L1SS is restored during system resume, that restored state
> > would be overwritten.
> >
> > Problem 2) relates to a patch that is currently reverted (a7152be79b62
> > ("Revert "PCI/ASPM: Save L1 PM Substates Capability for
> > suspend/resume""), so I don't think reverting 08d0cc5f3426 will make
> > this problem worse.
> >
> > Reverting 08d0cc5f3426 will make 1) a problem again. But my guess is
> > ASPM changes via sysfs are fairly unusual and the device probably
> > remains functional even though it may use more power because the ASPM
> > configuration was lost.
> >
> > So unless somebody has a counter-argument, I plan to queue a revert of
> > 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
> > v6.7.
> >
> > Bjorn
>
> If it helps I could also try if a partial revert of 08d0cc5f3426 would
> be sufficient. This might also narrow down the issue and give more
> insight where the issue originates from.
>
> Let me know what you think.
Just wondering, does `echo 0 > /sys/power/pm_asysnc` help?
Kai-Heng
>
> Michael
On Fri, 5 Jan 2024, Kai-Heng Feng wrote:
> On Wed, Jan 3, 2024 at 11:41 PM Ilpo Järvinen
> <[email protected]> wrote:
> >
> > On Mon, 1 Jan 2024, Bjorn Helgaas wrote:
> >
> > > On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
> > > > Issue:
> > > > On resume from suspend to RAM there is no output for about 12 seconds, then
> > > > shortly a blinking cursor is visible in the upper left corner on an
> > > > otherwise black screen which is followed by a reboot.
> > > >
> > > > Setup:
> > > > * Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
> > > > * Firmware: 0508 (latest; also tested previous 0505)
> > > > * OS: Ubuntu 23.10 (except kernel)
> > > > * Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
> > > >
> > > > Debugging summary:
> > > > * Kernel 5.10.205 isn’t affected.
> > > > * Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
> > > > cause.
> > > > * PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long as
> > > > ASPM is enabled (default).
> > > > * The commit message indicates that a quirk could be written to mitigate the
> > > > issue but I don’t know how to write such a quirk.
> > > >
> > > > Confirmed workarounds:
> > > > * Connect a USB flash drive (no clue why; maybe this causes a delay that
> > > > lets the resume succeed)
> > > > * Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
> > > > intentional; a quirk seems to be the preferred solution)
> > > > * pcie_aspm=off
> > > > * pcie_aspm.policy=performance
> > > > * echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
> > > >
> > > > Debugging details:
> > > > * The resume trigger (power button, keyboard, mouse) doesn’t seem to make
> > > > any difference.
> > > > * Double checked that the kernel is configured to *not* reboot on panic.
> > > > * Double checked that there still isn't any kernel output without quiet and
> > > > splash.
> > > > * The issue doesn’t happen if a USB flash drive is connected. The content of
> > > > the flash drive doesn’t appear to matter. The USB port doesn’t appear to
> > > > matter.
> > > > * No information in any logs after the reboot. I suspect the resume from
> > > > suspend to RAM isn’t getting far enough as that logs could be written.
> > > > * Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
> > > > affected.
> > > > * A kernel bisect has revealed the following commit as cause:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
> > > > * The commit was part of kernel 5.20 and has been backported to 5.15.
> > > > * The commit mentions that a device-specific quirk could be added in case of
> > > > new issues.
> > > > * According to sysfs and lspci only device 0000:03:00.0 (Intel 8265 Wifi)
> > > > has ASPM enabled by default.
> > > > * Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to RAM
> > > > succeed.
> > > > * Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
> > > > suspend to RAM succeed.
> > > > * This would indicate that a quirk is missing for the device 0000:03:00.0
> > > > (Intel 8265 Wifi) but I have no clue how to write such a quirk or how to get
> > > > the specifics for such a quirk.
> > > > * I still have no clue how a USB flash drive plays into all this. Maybe some
> > > > kind of a timing issue where the connected USB flash drive delays something
> > > > long enough so that the resume succeeds. Maybe the code removed by commit
> > > > 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a similar delay. ¯\_(ツ)_/¯
> > >
> > > Hmmm. 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > > appeared in v6.0, released Oct 2, 2022, so it's been there a while.
> > >
> > > But I think the best option is to revert it until this issue is
> > > resolved. Per the commit log, 08d0cc5f3426 solved two problems:
> > >
> > > 1) ASPM config changes done via sysfs are lost if the device power
> > > state is changed, e.g., typically set to D3hot in .suspend() and
> > > D0 in .resume().
> > >
> > > 2) If L1SS is restored during system resume, that restored state
> > > would be overwritten.
> > >
> > > Problem 2) relates to a patch that is currently reverted (a7152be79b62
> > > ("Revert "PCI/ASPM: Save L1 PM Substates Capability for
> > > suspend/resume""), so I don't think reverting 08d0cc5f3426 will make
> > > this problem worse.
> > >
> > > Reverting 08d0cc5f3426 will make 1) a problem again. But my guess is
> > > ASPM changes via sysfs are fairly unusual and the device probably
> > > remains functional even though it may use more power because the ASPM
> > > configuration was lost.
> > >
> > > So unless somebody has a counter-argument, I plan to queue a revert of
> > > 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
> > > v6.7.
> >
> > Hi,
> >
> > I cannot understand how 1) even occurs. AFAICT, nothing
> > pcie_aspm_pm_state_change() calls into overwrites link->aspm_disable that
> > is the variable storing user inputs via sysfs. So how the changes via
> > sysfs are lost?
>
> Because it's states being enabled via sysfs get overwritten, not the
> disabled ones.
This leaves me even less sure what you're even talking about here. Are we
talking about aspm_attr_store_common() which "enables" states by changing
link->aspm_disable? (But aspm_attr_store_common() just as much "disables"
states by altering link->aspm_disable so I don't see how there's
difference between enabled/disabled ones).
During pcie_aspm_pm_state_change(), pcie_config_aspm_link() then uses
link->aspm_capable and link->aspm_disable as input but it won't change
link->aspm_disable (= it won't overwrite the user's input).
pcie_update_aspm_capable() done before calling pcie_config_aspm_path() can
alter link->aspm_capable (looks very much intentional) which can lead into
some state not being available any more for pcie_config_aspm_link().
Is this what you are trying to say, that some state gets removed from
link->aspm_capable and the effective result is that a state user enabled
via sysfs can no longer be enabled?
I even looked into aspm.c from 08d0cc5f3426 but cannot see significant
differences on how link->aspm_disable is being handled vs current code.
--
i.
On 05.01.24 04:25, Kai-Heng Feng wrote:
> Just wondering, does `echo 0 > /sys/power/pm_asysnc` help?
Yes, `echo 0 | sudo tee /sys/power/pm_async` does indeed also result in
a working resume. I've tested this on kernel 6.6.9 (which still has
commit 08d0cc5f3426). I've also attached the relevant dmesg output of
the suspend/resume cycle in case this helps.
Furthermore does this mean that commit 08d0cc5f3426 isn't at fault but
rather that we are dealing with a timing issue?
Michael
On Fri, Jan 05, 2024 at 12:18:32PM +0100, Michael Schaller wrote:
> On 05.01.24 04:25, Kai-Heng Feng wrote:
> > Just wondering, does `echo 0 > /sys/power/pm_asysnc` help?
>
> Yes, `echo 0 | sudo tee /sys/power/pm_async` does indeed also result in a
> working resume. I've tested this on kernel 6.6.9 (which still has commit
> 08d0cc5f3426). I've also attached the relevant dmesg output of the
> suspend/resume cycle in case this helps.
Thanks for testing that!
> Furthermore does this mean that commit 08d0cc5f3426 isn't at fault but
> rather that we are dealing with a timing issue?
PCI does have a few software timing requirements, mostly related to
reset and power state (D0/D3cold). ASPM has some timing parameters,
too, but I think they're all requirements on the hardware, not on
software.
Adding an arbitrary delay anywhere shouldn't break anything, and other
than those few required situations, it shouldn't fix anything either.
Bjorn
Hi Bjorn,
On Tue, Jan 02, 2024 at 05:25:50PM -0600, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <[email protected]>
>
> This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
>
> Michael reported that when attempting to resume from suspend to RAM on ASUS
> mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
> ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
> with no output, followed by a reboot.
>
> Workarounds include:
>
> - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> - Booting with "pcie_aspm=off"
> - Booting with "pcie_aspm.policy=performance"
> - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
> before suspending
> - Connecting a USB flash drive
>
> Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> Reported-by: Michael Schaller <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Bjorn Helgaas <[email protected]>
> Cc: <[email protected]>
> ---
> +/* @pdev: the root port or switch downstream port */
> +void pcie_aspm_pm_state_change(struct pci_dev *pdev)
> +{
> + struct pcie_link_state *link = pdev->link_state;
> +
> + if (aspm_disabled || !link)
> + return;
> + /*
> + * Devices changed PM state, we should recheck if latency
> + * meets all functions' requirement
> + */
> + down_read(&pci_bus_sem);
> + mutex_lock(&aspm_lock);
> + pcie_update_aspm_capable(link->root);
> + pcie_config_aspm_path(link);
> + mutex_unlock(&aspm_lock);
> + up_read(&pci_bus_sem);
> +}
This function is now restored in 6.7 final and is called in paths which
already hold the pci_bus_sem as reported by lockdep (see splat below).
This can potentially lead to a deadlock and specifically prevents using
lockdep on Qualcomm platforms.
Not sure if you want to propagate whether the bus semaphore is held to
pcie_aspm_pm_state_change() or if there was some alternative to
restoring this function which should be explored instead.
Johan
============================================
WARNING: possible recursive locking detected
6.7.0 #40 Not tainted
--------------------------------------------
kworker/u16:5/90 is trying to acquire lock:
ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pcie_aspm_pm_state_change+0x58/0xdc
pcieport 0002:00:00.0: PME: Signaling with IRQ 197
but task is already holding lock:
ffffacfa78ced000
pcieport 0002:00:00.0: AER: enabled with IRQ 197
(pci_bus_sem
nvme nvme0: pci function 0002:01:00.0
){++++}-{3:3}
nvme 0002:01:00.0: enabling device (0000 -> 0002)
, at: pci_walk_bus+0x34/0xbc
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(pci_bus_sem);
lock(pci_bus_sem);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by kworker/u16:5/90:
#0: ffff06c5c0008d38 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
#1: ffff800081c0bdd0 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
#2: ffff06c5c0b7d0f8 (&dev->mutex){....}-{3:3}, at: __driver_attach_async_helper+0x3c/0xf4
#3: ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc
stack backtrace:
CPU: 1 PID: 90 Comm: kworker/u16:5 Not tainted 6.7.0 #40
Hardware name: LENOVO 21BYZ9SRUS/21BYZ9SRUS, BIOS N3HET53W (1.25 ) 10/12/2022
Workqueue: events_unbound async_run_entry_fn
Call trace:
dump_backtrace+0x9c/0x11c
show_stack+0x18/0x24
dump_stack_lvl+0x60/0xac
dump_stack+0x18/0x24
print_deadlock_bug+0x25c/0x348
__lock_acquire+0x10a4/0x2064
lock_acquire+0x1e8/0x318
down_read+0x60/0x184
pcie_aspm_pm_state_change+0x58/0xdc
pci_set_full_power_state+0xa8/0x114
pci_set_power_state+0xc4/0x120
qcom_pcie_enable_aspm+0x1c/0x3c [pcie_qcom]
pci_walk_bus+0x64/0xbc
qcom_pcie_host_post_init_2_7_0+0x28/0x34 [pcie_qcom]
On Fri, Jan 5, 2024 at 11:51 PM Bjorn Helgaas <[email protected]> wrote:
>
> On Fri, Jan 05, 2024 at 12:18:32PM +0100, Michael Schaller wrote:
> > On 05.01.24 04:25, Kai-Heng Feng wrote:
> > > Just wondering, does `echo 0 > /sys/power/pm_asysnc` help?
> >
> > Yes, `echo 0 | sudo tee /sys/power/pm_async` does indeed also result in a
> > working resume. I've tested this on kernel 6.6.9 (which still has commit
> > 08d0cc5f3426). I've also attached the relevant dmesg output of the
> > suspend/resume cycle in case this helps.
>
> Thanks for testing that!
>
> > Furthermore does this mean that commit 08d0cc5f3426 isn't at fault but
> > rather that we are dealing with a timing issue?
>
> PCI does have a few software timing requirements, mostly related to
> reset and power state (D0/D3cold). ASPM has some timing parameters,
> too, but I think they're all requirements on the hardware, not on
> software.
>
> Adding an arbitrary delay anywhere shouldn't break anything, and other
> than those few required situations, it shouldn't fix anything either.
At least it means 8d0cc5f3426 isn't the culprit?
Michael, does the issue happen when iwlwifi module is not loaded? It
can be related to iwlwifi firmware.
Kai-Heng
>
> Bjorn
On 10.01.24 04:43, Kai-Heng Feng wrote:
> On Fri, Jan 5, 2024 at 11:51 PM Bjorn Helgaas <[email protected]> wrote:
>>
>> On Fri, Jan 05, 2024 at 12:18:32PM +0100, Michael Schaller wrote:
>>> On 05.01.24 04:25, Kai-Heng Feng wrote:
>>>> Just wondering, does `echo 0 > /sys/power/pm_asysnc` help?
>>>
>>> Yes, `echo 0 | sudo tee /sys/power/pm_async` does indeed also result in a
>>> working resume. I've tested this on kernel 6.6.9 (which still has commit
>>> 08d0cc5f3426). I've also attached the relevant dmesg output of the
>>> suspend/resume cycle in case this helps.
>>
>> Thanks for testing that!
>>
>>> Furthermore does this mean that commit 08d0cc5f3426 isn't at fault but
>>> rather that we are dealing with a timing issue?
>>
>> PCI does have a few software timing requirements, mostly related to
>> reset and power state (D0/D3cold). ASPM has some timing parameters,
>> too, but I think they're all requirements on the hardware, not on
>> software.
>>
>> Adding an arbitrary delay anywhere shouldn't break anything, and other
>> than those few required situations, it shouldn't fix anything either.
>
> At least it means 8d0cc5f3426 isn't the culprit?
>
> Michael, does the issue happen when iwlwifi module is not loaded? It
> can be related to iwlwifi firmware.
>
> Kai-Heng
>
The issue still happens if the iwlwifi module has been blacklisted and
after a reboot. This was again with vanilla kernel 6.6.9 and I've
confirmed via dmesg that iwlwifi wasn't loaded.
I've also checked if there is a newer firmware but Ubuntu 23.10 is
already using the newest firmware available from
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/log/iwlwifi-8265-36.ucode
(version 36.ca7b901d.0 according to dmesg).
Michael
>>
>> Bjorn
Hi Bjorn,
I never got a reply to this one so resending with updated Subject in
case it got buried in your inbox.
On Mon, Jan 08, 2024 at 09:39:07AM +0100, Johan Hovold wrote:
> On Tue, Jan 02, 2024 at 05:25:50PM -0600, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas <[email protected]>
> >
> > This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
> >
> > Michael reported that when attempting to resume from suspend to RAM on ASUS
> > mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
> > ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
> > with no output, followed by a reboot.
> >
> > Workarounds include:
> >
> > - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > - Booting with "pcie_aspm=off"
> > - Booting with "pcie_aspm.policy=performance"
> > - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
> > before suspending
> > - Connecting a USB flash drive
> >
> > Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > Reported-by: Michael Schaller <[email protected]>
> > Link: https://lore.kernel.org/r/[email protected]
> > Signed-off-by: Bjorn Helgaas <[email protected]>
> > Cc: <[email protected]>
> > ---
>
> > +/* @pdev: the root port or switch downstream port */
> > +void pcie_aspm_pm_state_change(struct pci_dev *pdev)
> > +{
> > + struct pcie_link_state *link = pdev->link_state;
> > +
> > + if (aspm_disabled || !link)
> > + return;
> > + /*
> > + * Devices changed PM state, we should recheck if latency
> > + * meets all functions' requirement
> > + */
> > + down_read(&pci_bus_sem);
> > + mutex_lock(&aspm_lock);
> > + pcie_update_aspm_capable(link->root);
> > + pcie_config_aspm_path(link);
> > + mutex_unlock(&aspm_lock);
> > + up_read(&pci_bus_sem);
> > +}
>
> This function is now restored in 6.7 final and is called in paths which
> already hold the pci_bus_sem as reported by lockdep (see splat below).
>
> This can potentially lead to a deadlock and specifically prevents using
> lockdep on Qualcomm platforms.
>
> Not sure if you want to propagate whether the bus semaphore is held to
> pcie_aspm_pm_state_change() or if there was some alternative to
> restoring this function which should be explored instead.
So to summarise, this patch, which is now commit
f93e71aea6c6 ("Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"")
introduced a regression in 6.7-final for Qualcomm platforms (and some
Intel platforms) similar to the one recently fixed by commit
f352ce999260 ("PCI: qcom: Fix potential deadlock when enabling ASPM").
Johan
#regzbot introduced: f93e71aea6c6
> ============================================
> WARNING: possible recursive locking detected
> 6.7.0 #40 Not tainted
> --------------------------------------------
> kworker/u16:5/90 is trying to acquire lock:
> ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pcie_aspm_pm_state_change+0x58/0xdc
> pcieport 0002:00:00.0: PME: Signaling with IRQ 197
>
> but task is already holding lock:
> ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(pci_bus_sem);
> lock(pci_bus_sem);
>
> *** DEADLOCK ***
>
> May be due to missing lock nesting notation
>
> 4 locks held by kworker/u16:5/90:
> #0: ffff06c5c0008d38 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
> #1: ffff800081c0bdd0 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
> #2: ffff06c5c0b7d0f8 (&dev->mutex){....}-{3:3}, at: __driver_attach_async_helper+0x3c/0xf4
> #3: ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc
>
> stack backtrace:
> CPU: 1 PID: 90 Comm: kworker/u16:5 Not tainted 6.7.0 #40
> Hardware name: LENOVO 21BYZ9SRUS/21BYZ9SRUS, BIOS N3HET53W (1.25 ) 10/12/2022
> Workqueue: events_unbound async_run_entry_fn
> Call trace:
> dump_backtrace+0x9c/0x11c
> show_stack+0x18/0x24
> dump_stack_lvl+0x60/0xac
> dump_stack+0x18/0x24
> print_deadlock_bug+0x25c/0x348
> __lock_acquire+0x10a4/0x2064
> lock_acquire+0x1e8/0x318
> down_read+0x60/0x184
> pcie_aspm_pm_state_change+0x58/0xdc
> pci_set_full_power_state+0xa8/0x114
> pci_set_power_state+0xc4/0x120
> qcom_pcie_enable_aspm+0x1c/0x3c [pcie_qcom]
> pci_walk_bus+0x64/0xbc
> qcom_pcie_host_post_init_2_7_0+0x28/0x34 [pcie_qcom]
On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:
> Hi Bjorn,
>
> I never got a reply to this one so resending with updated Subject in
> case it got buried in your inbox.
I did see it but decided it was better to fix the problem with resume
causing an unintended reboot, even though fixing that meant breaking
lockdep again, since I don't think we have user reports of the
potential deadlock lockdep finds.
08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a
start at fixing other problems and also improving the ASPM style, so I
hope somebody steps up to fix both it and the lockdep issue. I
haven't looked at it enough to have a preference for *how* to fix it.
Bjorn
> On Mon, Jan 08, 2024 at 09:39:07AM +0100, Johan Hovold wrote:
>
> > On Tue, Jan 02, 2024 at 05:25:50PM -0600, Bjorn Helgaas wrote:
> > > From: Bjorn Helgaas <[email protected]>
> > >
> > > This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
> > >
> > > Michael reported that when attempting to resume from suspend to RAM on ASUS
> > > mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
> > > ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
> > > with no output, followed by a reboot.
> > >
> > > Workarounds include:
> > >
> > > - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > > - Booting with "pcie_aspm=off"
> > > - Booting with "pcie_aspm.policy=performance"
> > > - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
> > > before suspending
> > > - Connecting a USB flash drive
> > >
> > > Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > > Reported-by: Michael Schaller <[email protected]>
> > > Link: https://lore.kernel.org/r/[email protected]
> > > Signed-off-by: Bjorn Helgaas <[email protected]>
> > > Cc: <[email protected]>
> > > ---
> >
> > > +/* @pdev: the root port or switch downstream port */
> > > +void pcie_aspm_pm_state_change(struct pci_dev *pdev)
> > > +{
> > > + struct pcie_link_state *link = pdev->link_state;
> > > +
> > > + if (aspm_disabled || !link)
> > > + return;
> > > + /*
> > > + * Devices changed PM state, we should recheck if latency
> > > + * meets all functions' requirement
> > > + */
> > > + down_read(&pci_bus_sem);
> > > + mutex_lock(&aspm_lock);
> > > + pcie_update_aspm_capable(link->root);
> > > + pcie_config_aspm_path(link);
> > > + mutex_unlock(&aspm_lock);
> > > + up_read(&pci_bus_sem);
> > > +}
> >
> > This function is now restored in 6.7 final and is called in paths which
> > already hold the pci_bus_sem as reported by lockdep (see splat below).
> >
> > This can potentially lead to a deadlock and specifically prevents using
> > lockdep on Qualcomm platforms.
> >
> > Not sure if you want to propagate whether the bus semaphore is held to
> > pcie_aspm_pm_state_change() or if there was some alternative to
> > restoring this function which should be explored instead.
>
> So to summarise, this patch, which is now commit
>
> f93e71aea6c6 ("Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"")
>
> introduced a regression in 6.7-final for Qualcomm platforms (and some
> Intel platforms) similar to the one recently fixed by commit
>
> f352ce999260 ("PCI: qcom: Fix potential deadlock when enabling ASPM").
>
> Johan
>
>
> #regzbot introduced: f93e71aea6c6
>
> > ============================================
> > WARNING: possible recursive locking detected
> > 6.7.0 #40 Not tainted
> > --------------------------------------------
> > kworker/u16:5/90 is trying to acquire lock:
> > ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pcie_aspm_pm_state_change+0x58/0xdc
> > pcieport 0002:00:00.0: PME: Signaling with IRQ 197
> >
> > but task is already holding lock:
> > ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc
> >
> > other info that might help us debug this:
> > Possible unsafe locking scenario:
> >
> > CPU0
> > ----
> > lock(pci_bus_sem);
> > lock(pci_bus_sem);
> >
> > *** DEADLOCK ***
> >
> > May be due to missing lock nesting notation
> >
> > 4 locks held by kworker/u16:5/90:
> > #0: ffff06c5c0008d38 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
> > #1: ffff800081c0bdd0 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
> > #2: ffff06c5c0b7d0f8 (&dev->mutex){....}-{3:3}, at: __driver_attach_async_helper+0x3c/0xf4
> > #3: ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc
> >
> > stack backtrace:
> > CPU: 1 PID: 90 Comm: kworker/u16:5 Not tainted 6.7.0 #40
> > Hardware name: LENOVO 21BYZ9SRUS/21BYZ9SRUS, BIOS N3HET53W (1.25 ) 10/12/2022
> > Workqueue: events_unbound async_run_entry_fn
> > Call trace:
> > dump_backtrace+0x9c/0x11c
> > show_stack+0x18/0x24
> > dump_stack_lvl+0x60/0xac
> > dump_stack+0x18/0x24
> > print_deadlock_bug+0x25c/0x348
> > __lock_acquire+0x10a4/0x2064
> > lock_acquire+0x1e8/0x318
> > down_read+0x60/0x184
> > pcie_aspm_pm_state_change+0x58/0xdc
> > pci_set_full_power_state+0xa8/0x114
> > pci_set_power_state+0xc4/0x120
> > qcom_pcie_enable_aspm+0x1c/0x3c [pcie_qcom]
> > pci_walk_bus+0x64/0xbc
> > qcom_pcie_host_post_init_2_7_0+0x28/0x34 [pcie_qcom]
On Mon, Jan 22, 2024 at 12:26:15PM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:
> > I never got a reply to this one so resending with updated Subject in
> > case it got buried in your inbox.
>
> I did see it but decided it was better to fix the problem with resume
> causing an unintended reboot, even though fixing that meant breaking
> lockdep again, since I don't think we have user reports of the
> potential deadlock lockdep finds.
That may be because I fixed the previous regression in 6.7-rc1 before
any users had a chance to hit the deadlock on Qualcomm platforms.
I can easily trigger a deadlock on the X13s by instrumenting 6.7-final
with a delay to increase the race window.
And any user hitting this occasionally is likely not going to be able to
track it down to this lock inversion (unless they have lockdep enabled).
> 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a
> start at fixing other problems and also improving the ASPM style, so I
> hope somebody steps up to fix both it and the lockdep issue. I
> haven't looked at it enough to have a preference for *how* to fix it.
Ok, but since you were the one introducing the locking regression in
6.7-final shouldn't you look into fixing it?
Especially if there were alternatives to restoring the offending commit
which would solve the underlying issue for the resume failure without
breaking other platforms.
I don't want to spend more time on this if the offending commit could
simply be reverted.
Johan
On Tue, Jan 23, 2024 at 06:25:52PM +0100, Johan Hovold wrote:
> On Mon, Jan 22, 2024 at 12:26:15PM -0600, Bjorn Helgaas wrote:
> > On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:
> > > I never got a reply to this one so resending with updated Subject in
> > > case it got buried in your inbox.
> >
> > I did see it but decided it was better to fix the problem with resume
> > causing an unintended reboot, even though fixing that meant breaking
> > lockdep again, since I don't think we have user reports of the
> > potential deadlock lockdep finds.
>
> That may be because I fixed the previous regression in 6.7-rc1 before
> any users had a chance to hit the deadlock on Qualcomm platforms.
>
> I can easily trigger a deadlock on the X13s by instrumenting 6.7-final
> with a delay to increase the race window.
>
> And any user hitting this occasionally is likely not going to be able to
> track it down to this lock inversion (unless they have lockdep enabled).
I agree, it's a problem we need to fix.
> > 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a
> > start at fixing other problems and also improving the ASPM style, so I
> > hope somebody steps up to fix both it and the lockdep issue. I
> > haven't looked at it enough to have a preference for *how* to fix it.
>
> Ok, but since you were the one introducing the locking regression in
> 6.7-final shouldn't you look into fixing it?
>
> Especially if there were alternatives to restoring the offending commit
> which would solve the underlying issue for the resume failure without
> breaking other platforms.
Did somebody propose an alternate patch? If so, I missed it, but we
could look at it now.
> I don't want to spend more time on this if the offending commit could
> simply be reverted.
I don't quite follow. By simply reverting, do you mean to revert
f93e71aea6c6 ("Revert "PCI/ASPM: Remove
pcie_aspm_pm_state_change()"")? IIUC that would break Michael's
machine again.
Bjorn
On Tue, Jan 23, 2024 at 04:36:48PM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 23, 2024 at 06:25:52PM +0100, Johan Hovold wrote:
> > On Mon, Jan 22, 2024 at 12:26:15PM -0600, Bjorn Helgaas wrote:
> > > On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:
> > > 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a
> > > start at fixing other problems and also improving the ASPM style, so I
> > > hope somebody steps up to fix both it and the lockdep issue. I
> > > haven't looked at it enough to have a preference for *how* to fix it.
> >
> > Ok, but since you were the one introducing the locking regression in
> > 6.7-final shouldn't you look into fixing it?
> >
> > Especially if there were alternatives to restoring the offending commit
> > which would solve the underlying issue for the resume failure without
> > breaking other platforms.
>
> Did somebody propose an alternate patch? If so, I missed it, but we
> could look at it now.
I've only skimmed the discussion leading up to the revert, but I got the
impression that other alternatives were looked at as it was still not
clear what the underlying issue actually was.
As Michael and Thorsten pointed out before the revert, it may have been
better not to do a last minute revert of a 16 month old commit which
risks introducing regressions (and brought back another sysfs issue
IIUC) before fully understanding what is really going on here.
> > I don't want to spend more time on this if the offending commit could
> > simply be reverted.
>
> I don't quite follow. By simply reverting, do you mean to revert
> f93e71aea6c6 ("Revert "PCI/ASPM: Remove
> pcie_aspm_pm_state_change()"")? IIUC that would break Michael's
> machine again.
Right, at least until that issue is fully understood and alternative
fixes have been considered.
If that's not an option, we need to rework core to pass a flag through
more than one layer to indicate whether pcie_aspm_pm_state_change()
should take the bus semaphore or not. I'd rather not do that if it can
be avoided.
Johan
On Wed, Jan 24, 2024 at 09:16:38AM +0100, Johan Hovold wrote:
> On Tue, Jan 23, 2024 at 04:36:48PM -0600, Bjorn Helgaas wrote:
> > I don't quite follow. By simply reverting, do you mean to revert
> > f93e71aea6c6 ("Revert "PCI/ASPM: Remove
> > pcie_aspm_pm_state_change()"")? IIUC that would break Michael's
> > machine again.
>
> Right, at least until that issue is fully understood and alternative
> fixes have been considered.
>
> If that's not an option, we need to rework core to pass a flag through
> more than one layer to indicate whether pcie_aspm_pm_state_change()
> should take the bus semaphore or not. I'd rather not do that if it can
> be avoided.
As a revert appears unlikely to happen, let's fix the regression by
adding a new helper pci_set_power_state_locked() that can be called
with the bus lock held:
https://lore.kernel.org/lkml/[email protected]/
Johan
On 22.01.24 11:53, Johan Hovold wrote:
>
> So to summarise, this patch, which is now commit
>
> f93e71aea6c6 ("Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"")
>
> introduced a regression in 6.7-final for Qualcomm platforms (and some
> Intel platforms) similar to the one recently fixed by commit
>
> f352ce999260 ("PCI: qcom: Fix potential deadlock when enabling ASPM").
>
> #regzbot introduced: f93e71aea6c6
#regzbot fix: 1e560864159d00
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
Hi Michael,
Sorry for the belated response.
On Wed, Jan 10, 2024 at 8:40 PM Michael Schaller <[email protected]> wrote:
>
>
> On 10.01.24 04:43, Kai-Heng Feng wrote:
> > On Fri, Jan 5, 2024 at 11:51 PM Bjorn Helgaas <[email protected]> wrote:
> >>
> >> On Fri, Jan 05, 2024 at 12:18:32PM +0100, Michael Schaller wrote:
> >>> On 05.01.24 04:25, Kai-Heng Feng wrote:
> >>>> Just wondering, does `echo 0 > /sys/power/pm_asysnc` help?
> >>>
> >>> Yes, `echo 0 | sudo tee /sys/power/pm_async` does indeed also result in a
> >>> working resume. I've tested this on kernel 6.6.9 (which still has commit
> >>> 08d0cc5f3426). I've also attached the relevant dmesg output of the
> >>> suspend/resume cycle in case this helps.
> >>
> >> Thanks for testing that!
> >>
> >>> Furthermore does this mean that commit 08d0cc5f3426 isn't at fault but
> >>> rather that we are dealing with a timing issue?
> >>
> >> PCI does have a few software timing requirements, mostly related to
> >> reset and power state (D0/D3cold). ASPM has some timing parameters,
> >> too, but I think they're all requirements on the hardware, not on
> >> software.
> >>
> >> Adding an arbitrary delay anywhere shouldn't break anything, and other
> >> than those few required situations, it shouldn't fix anything either.
> >
> > At least it means 8d0cc5f3426 isn't the culprit?
> >
> > Michael, does the issue happen when iwlwifi module is not loaded? It
> > can be related to iwlwifi firmware.
> >
> > Kai-Heng
> >
> The issue still happens if the iwlwifi module has been blacklisted and
> after a reboot. This was again with vanilla kernel 6.6.9 and I've
> confirmed via dmesg that iwlwifi wasn't loaded.
Can you please give latest mainline kernel a try? With commit
f93e71aea6c60ebff8adbd8941e678302d377869 (Revert "PCI/ASPM: Remove
pcie_aspm_pm_state_change()") reverted.
Also do you have efi-pstore enabled? Is there anything logged in
/var/lib/systemd/pstore (assuming systemd is used)?
Kai-Heng
>
> I've also checked if there is a newer firmware but Ubuntu 23.10 is
> already using the newest firmware available from
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/log/iwlwifi-8265-36.ucode
> (version 36.ca7b901d.0 according to dmesg).
>
> Michael
>
> >>
> >> Bjorn
Hi Kai-Heng,
---- On Thu, 07 Mar 2024 07:51:05 +0100 Kai-Heng Feng wrote ---
> Hi Michael,
>
> Sorry for the belated response.
>
No worries.
> On Wed, Jan 10, 2024 at 8:40 PM Michael Schaller [email protected]> wrote:
> >
> >
> > On 10.01.24 04:43, Kai-Heng Feng wrote:
> > > On Fri, Jan 5, 2024 at 11:51 PM Bjorn Helgaas helgaas@kernelorg> wrote:
> > >>
> > >> On Fri, Jan 05, 2024 at 12:18:32PM +0100, Michael Schaller wrote:
> > >>> On 05.01.24 04:25, Kai-Heng Feng wrote:
> > >>>> Just wondering, does `echo 0 > /sys/power/pm_asysnc` help?
> > >>>
> > >>> Yes, `echo 0 | sudo tee /sys/power/pm_async` does indeed also result in a
> > >>> working resume. I've tested this on kernel 6.6.9 (which still has commit
> > >>> 08d0cc5f3426). I've also attached the relevant dmesg output of the
> > >>> suspend/resume cycle in case this helps.
> > >>
> > >> Thanks for testing that!
> > >>
> > >>> Furthermore does this mean that commit 08d0cc5f3426 isn't at fault but
> > >>> rather that we are dealing with a timing issue?
> > >>
> > >> PCI does have a few software timing requirements, mostly related to
> > >> reset and power state (D0/D3cold). ASPM has some timing parameters,
> > >> too, but I think they're all requirements on the hardware, not on
> > >> software.
> > >>
> > >> Adding an arbitrary delay anywhere shouldn't break anything, and other
> > >> than those few required situations, it shouldn't fix anything either.
> > >
> > > At least it means 8d0cc5f3426 isn't the culprit?
> > >
> > > Michael, does the issue happen when iwlwifi module is not loaded? It
> > > can be related to iwlwifi firmware.
> > >
> > > Kai-Heng
> > >
> > The issue still happens if the iwlwifi module has been blacklisted and
> > after a reboot. This was again with vanilla kernel 6.6.9 and I've
> > confirmed via dmesg that iwlwifi wasn't loaded.
>
> Can you please give latest mainline kernel a try? With commit
> f93e71aea6c60ebff8adbd8941e678302d377869 (Revert "PCI/ASPM: Remove
> pcie_aspm_pm_state_change()") reverted.
>
> Also do you have efi-pstore enabled? Is there anything logged in
> /var/lib/systemd/pstore (assuming systemd is used)?
>
I'm happy to test once I'm out of the hospital (long COVID). I should be back home in 5 weeks.
Michael
> Kai-Heng
>
> >
> > I've also checked if there is a newer firmware but Ubuntu 23.10 is
> > already using the newest firmware available from
> > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/log/iwlwifi-8265-36.ucode
> > (version 36.ca7b901d.0 according to dmesg).
> >
> > Michael
> >
> > >>
> > >> Bjorn
>
On Thu, Mar 07, 2024 at 02:51:05PM +0800, Kai-Heng Feng wrote:
> On Wed, Jan 10, 2024 at 8:40 PM Michael Schaller <[email protected]> wrote:
> > On 10.01.24 04:43, Kai-Heng Feng wrote:
> > > On Fri, Jan 5, 2024 at 11:51 PM Bjorn Helgaas <[email protected]> wrote:
> > >> On Fri, Jan 05, 2024 at 12:18:32PM +0100, Michael Schaller wrote:
> > >>> On 05.01.24 04:25, Kai-Heng Feng wrote:
> > >>>> Just wondering, does `echo 0 > /sys/power/pm_asysnc` help?
> > >>>
> > >>> Yes, `echo 0 | sudo tee /sys/power/pm_async` does indeed also result in a
> > >>> working resume. I've tested this on kernel 6.6.9 (which still has commit
> > >>> 08d0cc5f3426). I've also attached the relevant dmesg output of the
> > >>> suspend/resume cycle in case this helps.
> > >>
> > >> Thanks for testing that!
> > >>
> > >>> Furthermore does this mean that commit 08d0cc5f3426 isn't at fault but
> > >>> rather that we are dealing with a timing issue?
> > >>
> > >> PCI does have a few software timing requirements, mostly related to
> > >> reset and power state (D0/D3cold). ASPM has some timing parameters,
> > >> too, but I think they're all requirements on the hardware, not on
> > >> software.
> > >>
> > >> Adding an arbitrary delay anywhere shouldn't break anything, and other
> > >> than those few required situations, it shouldn't fix anything either.
> > >
> > > At least it means 8d0cc5f3426 isn't the culprit?
> > >
> > > Michael, does the issue happen when iwlwifi module is not loaded? It
> > > can be related to iwlwifi firmware.
> > >
> > The issue still happens if the iwlwifi module has been blacklisted and
> > after a reboot. This was again with vanilla kernel 6.6.9 and I've
> > confirmed via dmesg that iwlwifi wasn't loaded.
>
> Can you please give latest mainline kernel a try? With commit
> f93e71aea6c60ebff8adbd8941e678302d377869 (Revert "PCI/ASPM: Remove
> pcie_aspm_pm_state_change()") reverted.
>
> Also do you have efi-pstore enabled? Is there anything logged in
> /var/lib/systemd/pstore (assuming systemd is used)?
It seems possible that some recent ASPM fixes could help this issue.
These fixes are not upstream yet, but should appear in v6.9-rc1.
Your (Michael's) bisection identified 08d0cc5f3426 ("PCI/ASPM: Remove
pcie_aspm_pm_state_change()"), which appeared in v6.0. This was
intended to solve the problem of ASPM config changes made via sysfs
getting lost.
We removed 08d0cc5f3426 in v6.7 with f93e71aea6c6 ("Revert "PCI/ASPM:
Remove pcie_aspm_pm_state_change()"") to address the reboot after
resume problem that you reported.
e4dbf699467e ("PCI/ASPM: Update save_state when configuration
changes") is planned for v6.9-rc1 and should solve the same problem
08d0cc5f3426 tried to solve, but in a different way.
390fd84739c5 ("PCI/ASPM: Save L1 PM Substates Capability for
suspend/resume") is also planned for v6.9-rc1 and fixes some problems
with restoring L1 Substates config during resume. These substates are
enabled for your 03:00.0 device, so this commit may also be related.
That's all a long way to say that I think testing v6.9-rc1 or later
(or linux-next as of Mar 7 or later) would be very interesting.
> > I've also checked if there is a newer firmware but Ubuntu 23.10 is
> > already using the newest firmware available from
> > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/log/iwlwifi-8265-36.ucode
> > (version 36.ca7b901d.0 according to dmesg).
> >
> > Michael
> >
> > >>
> > >> Bjorn