2021-05-20 03:36:18

by Koba Ko

Subject: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

During suspend, if the device can't use NPSS, the driver simply shuts the
controller down and lets the host control power management.
After resume, the host can't change the power state of the shut-down
controller from D3cold to D0.
For these devices, avoid going deeper than D3hot.

Tested-by: Henrik Juul Hansen <[email protected]>
Signed-off-by: Koba Ko <[email protected]>
---
drivers/nvme/host/pci.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a29b170701fc..caaf52051689 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -26,6 +26,7 @@
#include <linux/io-64-nonatomic-hi-lo.h>
#include <linux/sed-opal.h>
#include <linux/pci-p2pdma.h>
+#include <linux/pm_runtime.h>

#include "trace.h"
#include "nvme.h"
@@ -2958,6 +2959,15 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)

dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));

+ if (pm_suspend_via_firmware() || !dev->ctrl.npss ||
+ !pcie_aspm_enabled(pdev) ||
+ dev->nr_host_mem_descs ||
+ (dev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND)) {
+ pdev->d3cold_allowed = false;
+ pci_d3cold_disable(pdev);
+ pm_runtime_resume(&pdev->dev);
+ }
+
nvme_reset_ctrl(&dev->ctrl);
async_schedule(nvme_async_probe, dev);

--
2.25.1


2021-05-25 08:26:24

by Christoph Hellwig

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Thu, May 20, 2021 at 11:33:15AM +0800, Koba Ko wrote:
> After resume, host can't change power state of the closed controller
> from D3cold to D0.

Why?

> For these devices, just avoid to go deeper than d3hot.

What are "these devices"?

> @@ -2958,6 +2959,15 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>
> dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
>
> + if (pm_suspend_via_firmware() || !dev->ctrl.npss ||
> + !pcie_aspm_enabled(pdev) ||
> + dev->nr_host_mem_descs ||
> + (dev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND)) {

Before we start open coding this in even more places we really want a
little helper function for these checks, which should be accompanied by
the comment near the existing copy of the checks.
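
For illustration, a minimal sketch of what such a helper could look like,
simply reusing the condition from the patch above (the function name and
its placement are assumptions, not something the thread specifies):

static bool nvme_simple_suspend_needed(struct nvme_dev *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev->dev);

	/*
	 * Deep NVMe power states can't be relied on if the platform
	 * suspends via firmware, NPSS isn't reported, ASPM is disabled,
	 * an HMB is in use, or a quirk forces the simple suspend path.
	 */
	return pm_suspend_via_firmware() || !dev->ctrl.npss ||
	       !pcie_aspm_enabled(pdev) ||
	       dev->nr_host_mem_descs ||
	       (dev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND);
}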

> + pdev->d3cold_allowed = false;
> + pci_d3cold_disable(pdev);
> + pm_runtime_resume(&pdev->dev);

Why do we need to both set d3cold_allowed and call pci_d3cold_disable?

What is the pm_runtime_resume doing here?

2021-05-25 21:08:24

by Kai-Heng Feng

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Tue, May 25, 2021 at 3:44 PM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, May 20, 2021 at 11:33:15AM +0800, Koba Ko wrote:
> > After resume, host can't change power state of the closed controller
> > from D3cold to D0.
>
> Why?

IIUC it's a regression introduced by commit b97120b15ebd ("nvme-pci:
use simple suspend when a HMB is enabled"). The affected NVMe device is
using HMB.

That commit intentionally puts the device into D3hot instead of D0 on
suspend, and because the root port above the NVMe device has _PR3, the
NVMe ends up in D3cold as a result. I believe the other OS doesn't put
the NVMe into D3cold, so turning off the power resource is untested by
the vendor.

I think the proper fix would be to revert that commit and tear down/set
up DMA on suspend/resume for HMB NVMes.
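
A rough sketch of what that could mean concretely in pci.c, assuming the
existing nvme_set_host_mem()/nvme_setup_host_mem() helpers; the hook
points and naming here are assumptions, not a tested implementation:

static int nvme_hmb_suspend(struct nvme_dev *dev)
{
	/*
	 * Ask the controller to stop using host memory before the
	 * platform may cut power to the host.
	 */
	if (!dev->nr_host_mem_descs)
		return 0;
	return nvme_set_host_mem(dev, 0);
}

static void nvme_hmb_resume(struct nvme_dev *dev)
{
	/* Hand the host memory buffer back once the controller is in D0. */
	if (dev->nr_host_mem_descs)
		nvme_setup_host_mem(dev);
}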

Kai-Heng

>
> > For these devices, just avoid to go deeper than d3hot.
>
> What are "these devices"?
>
> > @@ -2958,6 +2959,15 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >
> > dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
> >
> > + if (pm_suspend_via_firmware() || !dev->ctrl.npss ||
> > + !pcie_aspm_enabled(pdev) ||
> > + dev->nr_host_mem_descs ||
> > + (dev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND)) {
>
> Before we start open coding this in even more places we really want a
> little helper function for these checks, which should be accomodated with
> the comment near the existing copy of the checks.
>
> > + pdev->d3cold_allowed = false;
> > + pci_d3cold_disable(pdev);
> > + pm_runtime_resume(&pdev->dev);
>
> Why do we need to both set d3cold_allowed and call pci_d3cold_disable?
>
> What is the pm_runtime_resume doing here?

2021-05-26 00:05:20

by Bjorn Helgaas

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Tue, May 25, 2021 at 09:44:26AM +0200, Christoph Hellwig wrote:
> On Thu, May 20, 2021 at 11:33:15AM +0800, Koba Ko wrote:

> > @@ -2958,6 +2959,15 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >
> > dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
> >
> > + if (pm_suspend_via_firmware() || !dev->ctrl.npss ||
> > + !pcie_aspm_enabled(pdev) ||
> > + dev->nr_host_mem_descs ||
> > + (dev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND)) {
>
> Before we start open coding this in even more places we really want a
> little helper function for these checks, which should be accomodated with
> the comment near the existing copy of the checks.
>
> > + pdev->d3cold_allowed = false;
> > + pci_d3cold_disable(pdev);
> > + pm_runtime_resume(&pdev->dev);
>
> Why do we need to both set d3cold_allowed and call pci_d3cold_disable?

Ugh, this looks pretty hard to maintain.

I don't see why setting d3cold_allowed=false is useful.

pci_d3cold_disable() already sets dev->no_d3cold=true, and the only place
we look at d3cold_allowed is pci_dev_check_d3cold():

if (dev->no_d3cold || !dev->d3cold_allowed || ...)

so we won't even look at d3cold_allowed when no_d3cold is set.

I don't know why we need both no_d3cold and d3cold_allowed in the
first place. 448bd857d48e ("PCI/PM: add PCIe runtime D3cold support")
added them, but without an explanation.
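
A simplified sketch of the interaction described above (the real check
is pci_dev_check_d3cold() in drivers/pci/pci.c; this only restates the
short-circuit, with a hypothetical function name):

static bool d3cold_permitted(struct pci_dev *dev)
{
	/*
	 * pci_d3cold_disable() sets no_d3cold, and the short-circuit
	 * below means d3cold_allowed is never consulted once no_d3cold
	 * is set, so clearing d3cold_allowed as well buys nothing.
	 */
	return !dev->no_d3cold && dev->d3cold_allowed;
}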

2021-05-26 02:56:38

by Koba Ko

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Tue, May 25, 2021 at 3:44 PM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, May 20, 2021 at 11:33:15AM +0800, Koba Ko wrote:
> > After resume, host can't change power state of the closed controller
> > from D3cold to D0.
>
> Why?
As Kai-Heng said, it's a regression introduced by commit b97120b15ebd
("nvme-pci: use simple suspend when a HMB is enabled"). The affected
NVMe device is using HMB, and the target machine puts it into D3cold.
During suspend, the nvme driver shuts the controller down because of
commit b97120b15ebd ("nvme-pci: use simple suspend when a HMB is enabled").
During resume, the controller can't change its power state from D3cold
to D0:
# nvme 0000:58:00.0: can't change power state from D3cold to D0
(config space inaccessible)
I tried some other machines; they only put the NVMe into D3hot, so even
if the controller is forced to shut down, it resumes correctly.

As per commit b97120b15ebd, the TP would allow the NVMe device to access
host memory in any power state during S3, but the host would fail to
manage that. I agree with Kai-Heng's suggestion, but this TP would be
broken.

>
> > For these devices, just avoid to go deeper than d3hot.
>
> What are "these devices"?

It's a Samsung SSD using HMB.

> > @@ -2958,6 +2959,15 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >
> > dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
> >
> > + if (pm_suspend_via_firmware() || !dev->ctrl.npss ||
> > + !pcie_aspm_enabled(pdev) ||
> > + dev->nr_host_mem_descs ||
> > + (dev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND)) {
>
> Before we start open coding this in even more places we really want a
> little helper function for these checks, which should be accomodated with
> the comment near the existing copy of the checks.

Thanks, I will refine this.

>
> > + pdev->d3cold_allowed = false;
> > + pci_d3cold_disable(pdev);
> > + pm_runtime_resume(&pdev->dev);
>
> Why do we need to both set d3cold_allowed and call pci_d3cold_disable?
>
> What is the pm_runtime_resume doing here?
I referenced the code of the d3cold_allowed_store() function.
As Bjorn said, and from looking at several other drivers, calling
pci_d3cold_disable() alone is enough.

2021-05-26 03:03:15

by Keith Busch

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Wed, May 26, 2021 at 10:02:27AM +0800, Koba Ko wrote:
> On Tue, May 25, 2021 at 3:44 PM Christoph Hellwig <[email protected]> wrote:
> >
> > On Thu, May 20, 2021 at 11:33:15AM +0800, Koba Ko wrote:
> > > After resume, host can't change power state of the closed controller
> > > from D3cold to D0.
> >
> > Why?
> As per Kai-Heng said, it's a regression introduced by commit
> b97120b15ebd ("nvme-pci:
> use simple suspend when a HMB is enabled"). The affected NVMe is using HMB.

That really doesn't add up. The mentioned commit restores the driver
behavior for HMB drives that existed prior to d916b1be94b6d from kernel
5.3. Is that NVMe device broken in pre-5.3 kernels, too?

2021-05-26 13:08:58

by Kai-Heng Feng

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Wed, May 26, 2021 at 10:49 AM Keith Busch <[email protected]> wrote:
>
> On Wed, May 26, 2021 at 10:02:27AM +0800, Koba Ko wrote:
> > On Tue, May 25, 2021 at 3:44 PM Christoph Hellwig <[email protected]> wrote:
> > >
> > > On Thu, May 20, 2021 at 11:33:15AM +0800, Koba Ko wrote:
> > > > After resume, host can't change power state of the closed controller
> > > > from D3cold to D0.
> > >
> > > Why?
> > As per Kai-Heng said, it's a regression introduced by commit
> > b97120b15ebd ("nvme-pci:
> > use simple suspend when a HMB is enabled"). The affected NVMe is using HMB.
>
> That really doesn't add up. The mentioned commit restores the driver
> behavior for HMB drives that existed prior to d916b1be94b6d from kernel
> 5.3. Is that NVMe device broken in pre-5.3 kernels, too?

Quite likely. The system in question is a late 2020 Ice Lake laptop,
so it was released after the 5.3 kernel.

Kai-Heng

2021-05-26 14:24:01

by Kai-Heng Feng

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Wed, May 26, 2021 at 8:59 PM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, May 26, 2021 at 08:11:41PM +0800, Kai-Heng Feng wrote:
> > On Wed, May 26, 2021 at 10:49 AM Keith Busch <[email protected]> wrote:
> > >
> > > On Wed, May 26, 2021 at 10:02:27AM +0800, Koba Ko wrote:
> > > > On Tue, May 25, 2021 at 3:44 PM Christoph Hellwig <[email protected]> wrote:
> > > > >
> > > > > On Thu, May 20, 2021 at 11:33:15AM +0800, Koba Ko wrote:
> > > > > > After resume, host can't change power state of the closed controller
> > > > > > from D3cold to D0.
> > > > >
> > > > > Why?
> > > > As per Kai-Heng said, it's a regression introduced by commit
> > > > b97120b15ebd ("nvme-pci:
> > > > use simple suspend when a HMB is enabled"). The affected NVMe is using HMB.
> > >
> > > That really doesn't add up. The mentioned commit restores the driver
> > > behavior for HMB drives that existed prior to d916b1be94b6d from kernel
> > > 5.3. Is that NVMe device broken in pre-5.3 kernels, too?
> >
> > Quite likely. The system in question is a late 2020 Ice Lake laptop,
> > so it was released after 5.3 kernel.
>
> This is just a mess. We had to disable the sensible power state based
> suspend on these systems because Intel broke it by just cutting the power
> off. And now the shutdown based one doesn't work either because it can't
> handle d3cold. Someone we need to stop Intel and the integrators from
> doing stupid things, and I'm not sure how.

To be fair, resuming the NVMe device from D3hot is much slower than
keeping it at D0, which gives us a faster s2idle resume time. And now
AMD also requires s2idle on their latest laptops.

And it's more like NVMe controllers don't respect PCI D3hot.

>
> But degrading all systems even more is just a bad idea, so I fear we'll
> need a quirk again. Can you figure out by switching the cards if this
> is the fault of the platform or the nvme device?

Here's the original bug report:
https://bugs.launchpad.net/bugs/1912057

Because the NVMe device continues to work after s2idle and the symptom
is rather subtle, I suspect this is not platform or vendor specific.
Is it possible to disable DMA for HMB NVMes on suspend?

Kai-Heng

>
> >
> > Kai-Heng
> ---end quoted text---

2021-05-26 14:30:19

by Christoph Hellwig

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Wed, May 26, 2021 at 10:21:59PM +0800, Kai-Heng Feng wrote:
> To be fair, resuming the NVMe from D3hot is much slower than keep it
> at D0, which gives us a faster s2idle resume time. And now AMD also
> requires s2idle on their latest laptops.

We'd much prefer to use it, but due to the broken platforms we can't
unfortunately.

> And it's more like NVMe controllers don't respect PCI D3hot.

What do you mean with that?

> Because the NVMe continues to work after s2idle and the symbol is
> rather subtle, so I suspect this is not platform or vendor specific.
> Is it possible to disable DMA for HMB NVMe on suspend?

Not in shipping products. The NVMe technical working group is working
on a way to do that, but it will take a while until that shows up in
products.

2021-05-26 18:43:54

by Christoph Hellwig

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Wed, May 26, 2021 at 08:11:41PM +0800, Kai-Heng Feng wrote:
> On Wed, May 26, 2021 at 10:49 AM Keith Busch <[email protected]> wrote:
> >
> > On Wed, May 26, 2021 at 10:02:27AM +0800, Koba Ko wrote:
> > > On Tue, May 25, 2021 at 3:44 PM Christoph Hellwig <[email protected]> wrote:
> > > >
> > > > On Thu, May 20, 2021 at 11:33:15AM +0800, Koba Ko wrote:
> > > > > After resume, host can't change power state of the closed controller
> > > > > from D3cold to D0.
> > > >
> > > > Why?
> > > As per Kai-Heng said, it's a regression introduced by commit
> > > b97120b15ebd ("nvme-pci:
> > > use simple suspend when a HMB is enabled"). The affected NVMe is using HMB.
> >
> > That really doesn't add up. The mentioned commit restores the driver
> > behavior for HMB drives that existed prior to d916b1be94b6d from kernel
> > 5.3. Is that NVMe device broken in pre-5.3 kernels, too?
>
> Quite likely. The system in question is a late 2020 Ice Lake laptop,
> so it was released after 5.3 kernel.

This is just a mess. We had to disable the sensible power-state-based
suspend on these systems because Intel broke it by just cutting the power
off. And now the shutdown-based one doesn't work either because it can't
handle D3cold. Somehow we need to stop Intel and the integrators from
doing stupid things, and I'm not sure how.

But degrading all systems even more is just a bad idea, so I fear we'll
need a quirk again. Can you figure out by switching the cards if this
is the fault of the platform or the nvme device?

>
> Kai-Heng
---end quoted text---

2021-05-26 19:06:32

by Kai-Heng Feng

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Wed, May 26, 2021 at 11:06 PM Bjorn Helgaas <[email protected]> wrote:
>
> On Wed, May 26, 2021 at 10:47:13PM +0800, Kai-Heng Feng wrote:
> > On Wed, May 26, 2021 at 10:28 PM Christoph Hellwig <[email protected]> wrote:
> > > On Wed, May 26, 2021 at 10:21:59PM +0800, Kai-Heng Feng wrote:
> > > > To be fair, resuming the NVMe from D3hot is much slower than keep it
> > > > at D0, which gives us a faster s2idle resume time. And now AMD also
> > > > requires s2idle on their latest laptops.
> > >
> > > We'd much prefer to use it, but due to the broken platforms we can't
> > > unfortunately.
> > >
> > > > And it's more like NVMe controllers don't respect PCI D3hot.
> > >
> > > What do you mean with that?
> >
> > Originally, we found that under s2idle, most NVMe controllers caused
> > substantially more power if D3hot was used.
> > We were told by all the major NVMe vendors that D3hot is not
> > supported.
>
> What is this supposed to mean? PCIe r5.0, sec 5.3.1, says
>
> All Functions must support the D0 and D3 states (both D3Hot and D3Cold).
>
> Since D3hot is required for all functions, I don't think there is a
> standard way to discover whether D3hot is supported. The PM
> Capability (sec 7.5.2.1) has D1_Support and D2_Support bits, but no
> D3_Support bit.
>
> Are the vendors just saying "sorry, our devices don't conform to the
> spec"?

Yes, that's exactly what they said. Because Windows Modern Standby
always keeps the NVMe device at D0, D3hot is untested by the vendors.

NVMe vendors explicitly asked us to keep the NVMe controllers at D0 for
s2idle.

>
> If what you really mean is "D3hot is supported and it works, but it
> consumes more power than expected,

If D3hot consumes more power than D0, is it really supported?

> or the D3hot->D0 transition takes
> longer than expected," that's a totally different thing, and you should
> say *that*.

The D3hot->D0 transition doesn't take longer than expected. It's just
relatively slower than keeping the NVMe at D0.

Kai-Heng

>
> Bjorn

2021-05-26 22:08:05

by Kai-Heng Feng

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Wed, May 26, 2021 at 10:28 PM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, May 26, 2021 at 10:21:59PM +0800, Kai-Heng Feng wrote:
> > To be fair, resuming the NVMe from D3hot is much slower than keep it
> > at D0, which gives us a faster s2idle resume time. And now AMD also
> > requires s2idle on their latest laptops.
>
> We'd much prefer to use it, but due to the broken platforms we can't
> unfortunately.
>
> > And it's more like NVMe controllers don't respect PCI D3hot.
>
> What do you mean with that?

Originally, we found that under s2idle most NVMe controllers drew
substantially more power if D3hot was used.
We were told by all the major NVMe vendors that D3hot is not supported,
and it may also disable APST.
That's the reason why we have the host-managed power control for s2idle.

IIRC only Samsung NVMes respect D3hot and keep the power consumption low.

>
> > Because the NVMe continues to work after s2idle and the symbol is
> > rather subtle, so I suspect this is not platform or vendor specific.
> > Is it possible to disable DMA for HMB NVMe on suspend?
>
> Not in shipping products. The NVMe technical working group is working
> on a way to do that, but it will take a while until that shows up in
> products.

Hmm, then what else can we do? Since D3hot isn't supported by the
vendors, does it really stop HMB?

Kai-Heng

2021-05-26 22:24:54

by Bjorn Helgaas

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Wed, May 26, 2021 at 10:47:13PM +0800, Kai-Heng Feng wrote:
> On Wed, May 26, 2021 at 10:28 PM Christoph Hellwig <[email protected]> wrote:
> > On Wed, May 26, 2021 at 10:21:59PM +0800, Kai-Heng Feng wrote:
> > > To be fair, resuming the NVMe from D3hot is much slower than keep it
> > > at D0, which gives us a faster s2idle resume time. And now AMD also
> > > requires s2idle on their latest laptops.
> >
> > We'd much prefer to use it, but due to the broken platforms we can't
> > unfortunately.
> >
> > > And it's more like NVMe controllers don't respect PCI D3hot.
> >
> > What do you mean with that?
>
> Originally, we found that under s2idle, most NVMe controllers caused
> substantially more power if D3hot was used.
> We were told by all the major NVMe vendors that D3hot is not
> supported.

What is this supposed to mean? PCIe r5.0, sec 5.3.1, says

All Functions must support the D0 and D3 states (both D3Hot and D3Cold).

Since D3hot is required for all functions, I don't think there is a
standard way to discover whether D3hot is supported. The PM
Capability (sec 7.5.2.1) has D1_Support and D2_Support bits, but no
D3_Support bit.

Are the vendors just saying "sorry, our devices don't conform to the
spec"?

If what you really mean is "D3hot is supported and it works, but it
consumes more power than expected, or the D3hot->D0 transition takes
longer than expected," that's a totally different thing, and you should
say *that*.
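
For reference, a sketch of how the optional-state bits mentioned above can
be read from the PM capability with the standard config accessors; note
there is indeed no corresponding bit for D3hot (illustrative helper, not
existing kernel code):

static void report_d1_d2_support(struct pci_dev *pdev)
{
	int pm = pci_find_capability(pdev, PCI_CAP_ID_PM);
	u16 pmc;

	if (!pm)
		return;
	pci_read_config_word(pdev, pm + PCI_PM_PMC, &pmc);
	/* D1/D2 are optional and advertised; D3hot has no bit, it is mandatory. */
	dev_info(&pdev->dev, "D1 %ssupported, D2 %ssupported\n",
		 (pmc & PCI_PM_CAP_D1) ? "" : "not ",
		 (pmc & PCI_PM_CAP_D2) ? "" : "not ");
}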

Bjorn

2021-05-27 15:49:53

by Christoph Hellwig

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Thu, May 27, 2021 at 12:24:06AM +0800, Kai-Heng Feng wrote:
> Yes, that's exactly what they said. Because Windows Modern Standby
> always keep the NVMe at D0, so D3hot is untested by the vendors.

So all the problems we've dealt with lately are that platforms cut the
power off. So the device is not always kept at D0, which would be the
right thing, but is moved into D3cold.

Which makes this patch disabling D3cold rather counterintuitive.

2021-05-27 16:00:42

by Kai-Heng Feng

Subject: Re: [PATCH] nvme-pci: Avoid to go into d3cold if device can't use npss.

On Thu, May 27, 2021 at 7:40 PM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, May 27, 2021 at 12:24:06AM +0800, Kai-Heng Feng wrote:
> > Yes, that's exactly what they said. Because Windows Modern Standby
> > always keep the NVMe at D0, so D3hot is untested by the vendors.
>
> So all the problems we've deal with latetly are that platforms cut the
> power off. So it is not kept always at D0 which would be the right thing,
> but moved into D3cold.

It's still okay to use D3cold under s2idle if the nvme_acpi_storage_d3()
check passes.

Otherwise D0 should be used. I am unsure how to proceed with HMB NVMes
if we can't set up DMA after resume...

>
> Which makes this patch to disable D3cold rather counterintuitive.

The intention of this patch is indeed very unclear.

Kai-Heng