2023-10-31 08:21:55

by Bagas Sanjaya

[permalink] [raw]
Subject: Fwd: Regression: Kernel 6.4 rc1 and higher causes Steam Deck to fail to wake from suspend (bisected)

Hi,

I notice a regression report on Bugzilla [1]. Quoting from it:

> On Kernel 6.4 rc1 and higher if you put the Steam Deck into suspend then press the power button again it will not wake up.
>
> I don't have a clue as to -why- this commit breaks wake from suspend on steam deck, but it does. Bisected to:
>
> ```
> 1ad11eafc63ac16e667853bee4273879226d2d1b is the first bad commit
> commit 1ad11eafc63ac16e667853bee4273879226d2d1b
> Author: Bjorn Helgaas <[email protected]>
> Date: Tue Mar 7 14:32:43 2023 -0600
>
> nvme-pci: drop redundant pci_enable_pcie_error_reporting()
>
> pci_enable_pcie_error_reporting() enables the device to send ERR_*
> Messages. Since f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is
> native"), the PCI core does this for all devices during enumeration, so the
> driver doesn't need to do it itself.
>
> Remove the redundant pci_enable_pcie_error_reporting() call from the
> driver. Also remove the corresponding pci_disable_pcie_error_reporting()
> from the driver .remove() path.
>
> Note that this only controls ERR_* Messages from the device. An ERR_*
> Message may cause the Root Port to generate an interrupt, depending on the
> AER Root Error Command register managed by the AER service driver.
>
> Signed-off-by: Bjorn Helgaas <[email protected]>
> Reviewed-by: Chaitanya Kulkarni <[email protected]>
> Signed-off-by: Christoph Hellwig <[email protected]>
>
> drivers/nvme/host/pci.c | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)
> ```
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.4.y&id=1ad11eafc63ac16e667853bee4273879226d2d1b
>
> Reverting that commit by itself on top of 6.5.9 (stable) allows it to wake from suspend properly.

See Bugzilla for the full thread.

Anyway, I'm adding this regression to regzbot:

#regression introduced: 1ad11eafc63ac1 https://bugzilla.kernel.org/show_bug.cgi?id=218090
#regression title: Steam Deck fails to wake from suspend due to pci_enable_pcie_error_reporting() removal

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=218090

--
An old man doll... just what I always wanted! - Clara


2023-11-01 11:46:14

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Fwd: Regression: Kernel 6.4 rc1 and higher causes Steam Deck to fail to wake from suspend (bisected)

[+cc linux-pci]

On Tue, Oct 31, 2023 at 03:21:20PM +0700, Bagas Sanjaya wrote:
> Hi,
>
> I notice a regression report on Bugzilla [1]. Quoting from it:
>
> > On Kernel 6.4 rc1 and higher if you put the Steam Deck into suspend then press the power button again it will not wake up.
> >
> > I don't have a clue as to -why- this commit breaks wake from suspend on steam deck, but it does. Bisected to:
> >
> > ```
> > 1ad11eafc63ac16e667853bee4273879226d2d1b is the first bad commit
> > commit 1ad11eafc63ac16e667853bee4273879226d2d1b
> > Author: Bjorn Helgaas <[email protected]>
> > Date: Tue Mar 7 14:32:43 2023 -0600
> >
> > nvme-pci: drop redundant pci_enable_pcie_error_reporting()
> >
> > pci_enable_pcie_error_reporting() enables the device to send ERR_*
> > Messages. Since f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is
> > native"), the PCI core does this for all devices during enumeration, so the
> > driver doesn't need to do it itself.
> >
> > Remove the redundant pci_enable_pcie_error_reporting() call from the
> > driver. Also remove the corresponding pci_disable_pcie_error_reporting()
> > from the driver .remove() path.
> >
> > Note that this only controls ERR_* Messages from the device. An ERR_*
> > Message may cause the Root Port to generate an interrupt, depending on the
> > AER Root Error Command register managed by the AER service driver.
> >
> > Signed-off-by: Bjorn Helgaas <[email protected]>
> > Reviewed-by: Chaitanya Kulkarni <[email protected]>
> > Signed-off-by: Christoph Hellwig <[email protected]>
> >
> > drivers/nvme/host/pci.c | 6 +-----
> > 1 file changed, 1 insertion(+), 5 deletions(-)
> > ```
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.4.y&id=1ad11eafc63ac16e667853bee4273879226d2d1b
> >
> > Reverting that commit by itself on top of 6.5.9 (stable) allows it to wake from suspend properly.
>
> See Bugzilla for the full thread.
>
> Anyway, I'm adding this regression to regzbot:
>
> #regression introduced: 1ad11eafc63ac1 https://bugzilla.kernel.org/show_bug.cgi?id=218090
> #regression title: Steam Deck fails to wake from suspend due to pci_enable_pcie_error_reporting() removal
>
> Thanks.
>
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=218090

Thanks, I requested some dmesg logs and lspci output to help debug
this.

2024-03-30 13:47:11

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Fwd: Regression: Kernel 6.4 rc1 and higher causes Steam Deck to fail to wake from suspend (bisected)

[+cc Keith, Sagi, Hannes, Kai-Heng, +bcc silverspring from bugzilla]

On Wed, Nov 01, 2023 at 06:45:41AM -0500, Bjorn Helgaas wrote:
> On Tue, Oct 31, 2023 at 03:21:20PM +0700, Bagas Sanjaya wrote:
> > I notice a regression report on Bugzilla [1]. Quoting from it:
> >
> > > On Kernel 6.4 rc1 and higher if you put the Steam Deck into
> > > suspend then press the power button again it will not wake up.
> > >
> > > I don't have a clue as to -why- this commit breaks wake from
> > > suspend on steam deck, but it does. Bisected to:
> > >
> > > ```
> > > 1ad11eafc63ac16e667853bee4273879226d2d1b is the first bad commit
> > > commit 1ad11eafc63ac16e667853bee4273879226d2d1b
> > > Author: Bjorn Helgaas <[email protected]>
> > > Date: Tue Mar 7 14:32:43 2023 -0600
> > >
> > > nvme-pci: drop redundant pci_enable_pcie_error_reporting()
> > >
> > > pci_enable_pcie_error_reporting() enables the device to send ERR_*
> > > Messages. Since f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is
> > > native"), the PCI core does this for all devices during enumeration, so the
> > > driver doesn't need to do it itself.
> > >
> > > Remove the redundant pci_enable_pcie_error_reporting() call from the
> > > driver. Also remove the corresponding pci_disable_pcie_error_reporting()
> > > from the driver .remove() path.
> > >
> > > Note that this only controls ERR_* Messages from the device. An ERR_*
> > > Message may cause the Root Port to generate an interrupt, depending on the
> > > AER Root Error Command register managed by the AER service driver.
> > >
> > > Signed-off-by: Bjorn Helgaas <[email protected]>
> > > Reviewed-by: Chaitanya Kulkarni <[email protected]>
> > > Signed-off-by: Christoph Hellwig <[email protected]>
> > >
> > > drivers/nvme/host/pci.c | 6 +-----
> > > 1 file changed, 1 insertion(+), 5 deletions(-)
> > > ```

> > > Reverting that commit by itself on top of 6.5.9 (stable) allows
> > > it to wake from suspend properly.

> > [1]: https://bugzilla.kernel.org/show_bug.cgi?id=218090
>
> Thanks, I requested some dmesg logs and lspci output to help debug
> this.

silverspring attached lspci output and a dmesg log from v6.8 to the
bugzilla and also noted that "pci=noaer" works around the problem.

The problem commit is 1ad11eafc63a ("nvme-pci: drop redundant
pci_enable_pcie_error_reporting()")
(https://git.kernel.org/linus/1ad11eafc63a)

1ad11eafc63a removed pci_disable_pcie_error_reporting() from the
nvme_suspend() path, so we now leave the PCIe Device Control error
enables set when we didn't before. My theory is that the PCIe link
goes down during suspend, which causes an error interrupt, and the
interrupt causes a problem on Steam Deck. Maybe there's some BIOS
connection.

"pci=noaer" would work around this because those error enables would
never be set in the first place.

I asked reporters to test the debug patches below to disable those
error interrupts during suspend.

I don't think this would be the *right* fix; if we need to do this, I
think it should be done by the PCI core, not by individual drivers.
Kai-Heng has been suggesting this for a while for a different
scenario.

Bjorn


commit 60c07557d0cc ("Revert "PCI/AER: Drop unused pci_disable_pcie_error_reporting()"")
Author: Bjorn Helgaas <[email protected]>
Date: Fri Mar 29 17:54:30 2024 -0500

Revert "PCI/AER: Drop unused pci_disable_pcie_error_reporting()"

This reverts commit 69b264df8a412820e98867dbab871c6526c5e5aa.


diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ac6293c24976..273f9c6f93dd 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -243,6 +243,18 @@ static int pci_enable_pcie_error_reporting(struct pci_dev *dev)
return pcibios_err_to_errno(rc);
}

+int pci_disable_pcie_error_reporting(struct pci_dev *dev)
+{
+ int rc;
+
+ if (!pcie_aer_is_native(dev))
+ return -EIO;
+
+ rc = pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
+ return pcibios_err_to_errno(rc);
+}
+EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting);
+
int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
{
int aer = dev->aer_cap;
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 4b97f38f3fcf..425e5e430e65 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -40,9 +40,14 @@ struct aer_capability_regs {
int pcie_read_tlp_log(struct pci_dev *dev, int where, struct pcie_tlp_log *log);

#if defined(CONFIG_PCIEAER)
+int pci_disable_pcie_error_reporting(struct pci_dev *dev);
int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
int pcie_aer_is_native(struct pci_dev *dev);
#else
+static inline int pci_disable_pcie_error_reporting(struct pci_dev *dev)
+{
+ return -EINVAL;
+}
static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
{
return -EINVAL;
commit 8efb88cf23d4 ("nvme-pci: disable error reporting in nvme_dev_disable()")
Author: Bjorn Helgaas <[email protected]>
Date: Fri Mar 29 17:52:39 2024 -0500

nvme-pci: disable error reporting in nvme_dev_disable()

Debug patch.

The PCI core enables error reporting in pci_aer_init() for all devices that
advertise AER support.

During suspend, nvme_suspend() calls nvme_dev_disable() in several cases.
Prior to 1ad11eafc63a, nvme_dev_disable() disabled error reporting.

After 1ad11eafc63a, error reporting will remain enabled during suspend.
Maybe having error reporting enabled during suspend causes a problem on
Steam Deck.

"pci=noaer" prevents pci_aer_init() from enabling error reporting, so as
long as the BIOS doesn't enable it, error reporting should always be
disabled.

nvme_suspend
nvme_disable_prepare_reset
nvme_dev_disable


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685..2be838b5d1f6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -5,6 +5,7 @@
*/

#include <linux/acpi.h>
+#include <linux/aer.h>
#include <linux/async.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
@@ -2603,8 +2604,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
nvme_suspend_io_queues(dev);
nvme_suspend_queue(dev, 0);
pci_free_irq_vectors(pdev);
- if (pci_is_enabled(pdev))
+ if (pci_is_enabled(pdev)) {
+ pci_disable_pcie_error_reporting(pdev);
pci_disable_device(pdev);
+ }
nvme_reap_pending_cqes(dev);

nvme_cancel_tagset(&dev->ctrl);

2024-04-10 06:21:24

by Kai-Heng Feng

[permalink] [raw]
Subject: Re: Fwd: Regression: Kernel 6.4 rc1 and higher causes Steam Deck to fail to wake from suspend (bisected)

Hi Bjorn,

On Sat, Mar 30, 2024 at 9:47 PM Bjorn Helgaas <[email protected]> wrote:
>
> [+cc Keith, Sagi, Hannes, Kai-Heng, +bcc silverspring from bugzilla]
>
> On Wed, Nov 01, 2023 at 06:45:41AM -0500, Bjorn Helgaas wrote:
> > On Tue, Oct 31, 2023 at 03:21:20PM +0700, Bagas Sanjaya wrote:
> > > I notice a regression report on Bugzilla [1]. Quoting from it:
> > >
> > > > On Kernel 6.4 rc1 and higher if you put the Steam Deck into
> > > > suspend then press the power button again it will not wake up.
> > > >
> > > > I don't have a clue as to -why- this commit breaks wake from
> > > > suspend on steam deck, but it does. Bisected to:
> > > >
> > > > ```
> > > > 1ad11eafc63ac16e667853bee4273879226d2d1b is the first bad commit
> > > > commit 1ad11eafc63ac16e667853bee4273879226d2d1b
> > > > Author: Bjorn Helgaas <[email protected]>
> > > > Date: Tue Mar 7 14:32:43 2023 -0600
> > > >
> > > > nvme-pci: drop redundant pci_enable_pcie_error_reporting()
> > > >
> > > > pci_enable_pcie_error_reporting() enables the device to send ERR_*
> > > > Messages. Since f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is
> > > > native"), the PCI core does this for all devices during enumeration, so the
> > > > driver doesn't need to do it itself.
> > > >
> > > > Remove the redundant pci_enable_pcie_error_reporting() call from the
> > > > driver. Also remove the corresponding pci_disable_pcie_error_reporting()
> > > > from the driver .remove() path.
> > > >
> > > > Note that this only controls ERR_* Messages from the device. An ERR_*
> > > > Message may cause the Root Port to generate an interrupt, depending on the
> > > > AER Root Error Command register managed by the AER service driver.
> > > >
> > > > Signed-off-by: Bjorn Helgaas <[email protected]>
> > > > Reviewed-by: Chaitanya Kulkarni <[email protected]>
> > > > Signed-off-by: Christoph Hellwig <[email protected]>
> > > >
> > > > drivers/nvme/host/pci.c | 6 +-----
> > > > 1 file changed, 1 insertion(+), 5 deletions(-)
> > > > ```
>
> > > > Reverting that commit by itself on top of 6.5.9 (stable) allows
> > > > it to wake from suspend properly.
>
> > > [1]: https://bugzilla.kernel.org/show_bug.cgi?id=218090
> >
> > Thanks, I requested some dmesg logs and lspci output to help debug
> > this.
>
> silverspring attached lspci output and a dmesg log from v6.8 to the
> bugzilla and also noted that "pci=noaer" works around the problem.
>
> The problem commit is 1ad11eafc63a ("nvme-pci: drop redundant
> pci_enable_pcie_error_reporting()")
> (https://git.kernel.org/linus/1ad11eafc63a)
>
> 1ad11eafc63a removed pci_disable_pcie_error_reporting() from the
> nvme_suspend() path, so we now leave the PCIe Device Control error
> enables set when we didn't before. My theory is that the PCIe link
> goes down during suspend, which causes an error interrupt, and the
> interrupt causes a problem on Steam Deck. Maybe there's some BIOS
> connection.
>
> "pci=noaer" would work around this because those error enables would
> never be set in the first place.
>
> I asked reporters to test the debug patches below to disable those
> error interrupts during suspend.
>
> I don't think this would be the *right* fix; if we need to do this, I
> think it should be done by the PCI core, not by individual drivers.
> Kai-Heng has been suggesting this for a while for a different
> scenario.

Should I send the patch to mailing list again to stir more discussion?

Kai-Heng

>
> Bjorn
>
>
> commit 60c07557d0cc ("Revert "PCI/AER: Drop unused pci_disable_pcie_error_reporting()"")
> Author: Bjorn Helgaas <[email protected]>
> Date: Fri Mar 29 17:54:30 2024 -0500
>
> Revert "PCI/AER: Drop unused pci_disable_pcie_error_reporting()"
>
> This reverts commit 69b264df8a412820e98867dbab871c6526c5e5aa.
>
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ac6293c24976..273f9c6f93dd 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -243,6 +243,18 @@ static int pci_enable_pcie_error_reporting(struct pci_dev *dev)
> return pcibios_err_to_errno(rc);
> }
>
> +int pci_disable_pcie_error_reporting(struct pci_dev *dev)
> +{
> + int rc;
> +
> + if (!pcie_aer_is_native(dev))
> + return -EIO;
> +
> + rc = pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
> + return pcibios_err_to_errno(rc);
> +}
> +EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting);
> +
> int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> {
> int aer = dev->aer_cap;
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 4b97f38f3fcf..425e5e430e65 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -40,9 +40,14 @@ struct aer_capability_regs {
> int pcie_read_tlp_log(struct pci_dev *dev, int where, struct pcie_tlp_log *log);
>
> #if defined(CONFIG_PCIEAER)
> +int pci_disable_pcie_error_reporting(struct pci_dev *dev);
> int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
> int pcie_aer_is_native(struct pci_dev *dev);
> #else
> +static inline int pci_disable_pcie_error_reporting(struct pci_dev *dev)
> +{
> + return -EINVAL;
> +}
> static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> {
> return -EINVAL;
> commit 8efb88cf23d4 ("nvme-pci: disable error reporting in nvme_dev_disable()")
> Author: Bjorn Helgaas <[email protected]>
> Date: Fri Mar 29 17:52:39 2024 -0500
>
> nvme-pci: disable error reporting in nvme_dev_disable()
>
> Debug patch.
>
> The PCI core enables error reporting in pci_aer_init() for all devices that
> advertise AER support.
>
> During suspend, nvme_suspend() calls nvme_dev_disable() in several cases.
> Prior to 1ad11eafc63a, nvme_dev_disable() disabled error reporting.
>
> After 1ad11eafc63a, error reporting will remain enabled during suspend.
> Maybe having error reporting enabled during suspend causes a problem on
> Steam Deck.
>
> "pci=noaer" prevents pci_aer_init() from enabling error reporting, so as
> long as the BIOS doesn't enable it, error reporting should always be
> disabled.
>
> nvme_suspend
> nvme_disable_prepare_reset
> nvme_dev_disable
>
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8e0bb9692685..2be838b5d1f6 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -5,6 +5,7 @@
> */
>
> #include <linux/acpi.h>
> +#include <linux/aer.h>
> #include <linux/async.h>
> #include <linux/blkdev.h>
> #include <linux/blk-mq.h>
> @@ -2603,8 +2604,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> nvme_suspend_io_queues(dev);
> nvme_suspend_queue(dev, 0);
> pci_free_irq_vectors(pdev);
> - if (pci_is_enabled(pdev))
> + if (pci_is_enabled(pdev)) {
> + pci_disable_pcie_error_reporting(pdev);
> pci_disable_device(pdev);
> + }
> nvme_reap_pending_cqes(dev);
>
> nvme_cancel_tagset(&dev->ctrl);

2024-04-10 21:10:48

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Fwd: Regression: Kernel 6.4 rc1 and higher causes Steam Deck to fail to wake from suspend (bisected)

On Wed, Apr 10, 2024 at 02:20:31PM +0800, Kai-Heng Feng wrote:
> On Sat, Mar 30, 2024 at 9:47 PM Bjorn Helgaas <[email protected]> wrote:
> > On Wed, Nov 01, 2023 at 06:45:41AM -0500, Bjorn Helgaas wrote:
> > > On Tue, Oct 31, 2023 at 03:21:20PM +0700, Bagas Sanjaya wrote:
> > > > I notice a regression report on Bugzilla [1]. Quoting from it:
> > > >
> > > > > On Kernel 6.4 rc1 and higher if you put the Steam Deck into
> > > > > suspend then press the power button again it will not wake up.
> > > > >
> > > > > I don't have a clue as to -why- this commit breaks wake from
> > > > > suspend on steam deck, but it does. Bisected to:
> > > > >
> > > > > ```
> > > > > 1ad11eafc63ac16e667853bee4273879226d2d1b is the first bad commit
> > > > > commit 1ad11eafc63ac16e667853bee4273879226d2d1b
> > > > > Author: Bjorn Helgaas <[email protected]>
> ...

> > silverspring attached lspci output and a dmesg log from v6.8 to the
> > bugzilla and also noted that "pci=noaer" works around the problem.
> >
> > The problem commit is 1ad11eafc63a ("nvme-pci: drop redundant
> > pci_enable_pcie_error_reporting()")
> > (https://git.kernel.org/linus/1ad11eafc63a)
> >
> > 1ad11eafc63a removed pci_disable_pcie_error_reporting() from the
> > nvme_suspend() path, so we now leave the PCIe Device Control error
> > enables set when we didn't before. My theory is that the PCIe link
> > goes down during suspend, which causes an error interrupt, and the
> > interrupt causes a problem on Steam Deck. Maybe there's some BIOS
> > connection.
> >
> > "pci=noaer" would work around this because those error enables would
> > never be set in the first place.
> >
> > I asked reporters to test the debug patches below to disable those
> > error interrupts during suspend.
> >
> > I don't think this would be the *right* fix; if we need to do this, I
> > think it should be done by the PCI core, not by individual drivers.
> > Kai-Heng has been suggesting this for a while for a different
> > scenario.
>
> Should I send the patch to mailing list again to stir more discussion?

Yes, please. Include the folks from this thread, too, and the Steam
Deck bugzilla link since we have more more problem reports now.

Bjorn